All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13  4:49 ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev


Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone3() that lets a process
specify the pids of the child process.

Patches 1 through 7 are helper patches, needed for choosing a pid for the
child process.

PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v8]:
	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
	  The name 'clone2()' is in use - renamed new syscall to clone3().
	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
	  (Added [PATCH 7/10] to the patchset).

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Group the arguments to clone2() into a 'struct clone_arg' to
	  workaround the issue of exceeding 6 arguments to the system call.
	  Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT TODO:

clone() system call has another limitation - all but one, available bits in
clone-flags are in use and if more new clone-flags are needed, we will need
a variant of the clone() system call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
	  bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset).  Based on input from
Arnd Bergmann and others, the new interface of the system call is: 

	struct clone_struct {
		u64 flags;
		u64 child_stack;
		u32 nr_pids;
		u32 reserved1;
		u64 parent_tid;
		u64 child_tid;
		u64 reserved2;
	};

	sys_clone3(struct clone_struct __user *cs, pid_t __user *pids)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13  4:49 ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:49 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone3() that lets a process
specify the pids of the child process.

Patches 1 through 7 are helper patches, needed for choosing a pid for the
child process.

PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v8]:
	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
	  The name 'clone2()' is in use - renamed new syscall to clone3().
	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
	  (Added [PATCH 7/10] to the patchset).

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Group the arguments to clone2() into a 'struct clone_arg' to
	  workaround the issue of exceeding 6 arguments to the system call.
	  Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT TODO:

clone() system call has another limitation - all but one, available bits in
clone-flags are in use and if more new clone-flags are needed, we will need
a variant of the clone() system call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
	  bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset).  Based on input from
Arnd Bergmann and others, the new interface of the system call is: 

	struct clone_struct {
		u64 flags;
		u64 child_stack;
		u32 nr_pids;
		u32 reserved1;
		u64 parent_tid;
		u64 child_tid;
		u64 reserved2;
	};

	sys_clone3(struct clone_struct __user *cs, pid_t __user *pids)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:49     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:49 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page
@ 2009-10-13  4:49     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:50     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:50 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:06:46.000000000 -0700
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidm
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_names
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespa
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code
@ 2009-10-13  4:50     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:06:46.000000000 -0700
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidm
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_names
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespa
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:50     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:50 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



From: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Remove the pid_max global, and make it a property of the
pid_namespace.  When a pid_ns is created, it inherits
the parent's pid_ns.

Fixing up sysctl (trivial akin to ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.

Changelog[v2]:
	- Port to newer kernel
	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/pid_namespace.h |    1 +
 kernel/pid.c                  |    4 ++--
 kernel/pid_namespace.c        |    1 +
 kernel/sysctl.c               |    4 ++--
 4 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/pid_namespace.h	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/include/linux/pid_namespace.h	2009-09-09 19:07:20.000000000 -0700
@@ -30,6 +30,7 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	int pid_max;
 };
 
 extern struct pid_namespace init_pid_ns;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
 static int pidhash_shift;
 struct pid init_struct_pid = INIT_STRUCT_PID;
 
-int pid_max = PID_MAX_DEFAULT;
-
 #define RESERVED_PIDS		300
 
 int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
 	.last_pid = 0,
 	.level = 0,
 	.child_reaper = &init_task,
+	.pid_max = PID_MAX_DEFAULT,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
@@ -151,6 +150,7 @@ static int alloc_pidmap(struct pid_names
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
+	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
 	pid = last + 1;
Index: linux-2.6/kernel/pid_namespace.c
===================================================================
--- linux-2.6.orig/kernel/pid_namespace.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid_namespace.c	2009-09-09 19:07:20.000000000 -0700
@@ -87,6 +87,7 @@ static struct pid_namespace *create_pid_
 
 	kref_init(&ns->kref);
 	ns->level = level;
+	ns->pid_max = parent_pid_ns->pid_max;
 	ns->parent = get_pid_ns(parent_pid_ns);
 
 	set_bit(0, ns->pidmap[0].page);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2009-09-09 19:07:20.000000000 -0700
@@ -55,6 +55,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
+#include <linux/pid_namespace.h>
 
 #ifdef CONFIG_X86
 #include <asm/nmi.h>
@@ -78,7 +79,6 @@ extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
 extern char core_pattern[];
-extern int pid_max;
 extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
@@ -670,7 +670,7 @@ static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PIDMAX,
 		.procname	= "pid_max",
-		.data		= &pid_max,
+		.data		= &init_pid_ns.pid_max,
 		.maxlen		= sizeof (int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec_minmax,

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
@ 2009-10-13  4:50     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



From: Serge Hallyn <serue@us.ibm.com>
Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Remove the pid_max global, and make it a property of the
pid_namespace.  When a pid_ns is created, it inherits
the parent's pid_ns.

Fixing up sysctl (trivial akin to ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.

Changelog[v2]:
	- Port to newer kernel
	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
---
 include/linux/pid_namespace.h |    1 +
 kernel/pid.c                  |    4 ++--
 kernel/pid_namespace.c        |    1 +
 kernel/sysctl.c               |    4 ++--
 4 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/pid_namespace.h	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/include/linux/pid_namespace.h	2009-09-09 19:07:20.000000000 -0700
@@ -30,6 +30,7 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	int pid_max;
 };
 
 extern struct pid_namespace init_pid_ns;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
 static int pidhash_shift;
 struct pid init_struct_pid = INIT_STRUCT_PID;
 
-int pid_max = PID_MAX_DEFAULT;
-
 #define RESERVED_PIDS		300
 
 int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
 	.last_pid = 0,
 	.level = 0,
 	.child_reaper = &init_task,
+	.pid_max = PID_MAX_DEFAULT,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
@@ -151,6 +150,7 @@ static int alloc_pidmap(struct pid_names
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
+	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
 	pid = last + 1;
Index: linux-2.6/kernel/pid_namespace.c
===================================================================
--- linux-2.6.orig/kernel/pid_namespace.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid_namespace.c	2009-09-09 19:07:20.000000000 -0700
@@ -87,6 +87,7 @@ static struct pid_namespace *create_pid_
 
 	kref_init(&ns->kref);
 	ns->level = level;
+	ns->pid_max = parent_pid_ns->pid_max;
 	ns->parent = get_pid_ns(parent_pid_ns);
 
 	set_bit(0, ns->pidmap[0].page);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2009-09-09 19:07:20.000000000 -0700
@@ -55,6 +55,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
+#include <linux/pid_namespace.h>
 
 #ifdef CONFIG_X86
 #include <asm/nmi.h>
@@ -78,7 +79,6 @@ extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
 extern char core_pattern[];
-extern int pid_max;
 extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
@@ -670,7 +670,7 @@ static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PIDMAX,
 		.procname	= "pid_max",
-		.data		= &pid_max,
+		.data		= &init_pid_ns.pid_max,
 		.maxlen		= sizeof (int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec_minmax,

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:51     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:51 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v6]:
	- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
	  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 kernel/pid.c |   26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
@@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (target_pid) {
+		pid = target_pid;
+		if (pid < 0 || pid >= pid_max)
+			return -EINVAL;
+	} else {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
@@ -164,6 +170,15 @@ static int alloc_pidmap(struct pid_names
 		if (rc)
 			break;
 
+		if (target_pid) {
+			rc = -EBUSY;
+			if (!test_and_set_bit(offset, map->page)) {
+				atomic_dec(&map->nr_free);
+				rc = pid;
+			}
+			return rc;
+		}
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -196,6 +211,7 @@ static int alloc_pidmap(struct pid_names
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
+
 	return rc;
 }
 
@@ -272,7 +288,7 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
@ 2009-10-13  4:51     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v6]:
	- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
	  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 kernel/pid.c |   26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
@@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (target_pid) {
+		pid = target_pid;
+		if (pid < 0 || pid >= pid_max)
+			return -EINVAL;
+	} else {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
@@ -164,6 +170,15 @@ static int alloc_pidmap(struct pid_names
 		if (rc)
 			break;
 
+		if (target_pid) {
+			rc = -EBUSY;
+			if (!test_and_set_bit(offset, map->page)) {
+				atomic_dec(&map->nr_free);
+				rc = pid;
+			}
+			return rc;
+		}
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -196,6 +211,7 @@ static int alloc_pidmap(struct pid_names
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
+
 	return rc;
 }
 
@@ -272,7 +288,7 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:51     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:51 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid.h
===================================================================
--- linux-2.6.orig/include/linux/pid.h	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/pid.h	2009-09-10 10:29:10.000000000 -0700
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:29:10.000000000 -0700
@@ -272,13 +272,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -288,7 +289,11 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()
@ 2009-10-13  4:51     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid.h
===================================================================
--- linux-2.6.orig/include/linux/pid.h	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/pid.h	2009-09-10 10:29:10.000000000 -0700
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:29:10.000000000 -0700
@@ -272,13 +272,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -288,7 +289,11 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:13.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:13.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 7/10]: Check invalid clone flags
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

---
 include/linux/sched.h |   10 ++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 13 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
@@ -29,6 +29,16 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
+				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
+				 CLONE_SYSVSEM | CLONE_SETTLS                |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
 /*
  * Scheduling policies
  */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
@@ -942,6 +942,9 @@ static struct task_struct *copy_process(
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 7/10]: Check invalid clone flags
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

---
 include/linux/sched.h |   10 ++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 13 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
@@ -29,6 +29,16 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
+				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
+				 CLONE_SYSVSEM | CLONE_SETTLS                |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
 /*
  * Scheduling policies
  */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
@@ -942,6 +942,9 @@ static struct task_struct *copy_process(
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-10-12 18:59:04.000000000 -0700
@@ -2064,6 +2064,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-12 19:13:24.000000000 -0700
@@ -1335,12 +1335,14 @@ struct task_struct * __cpuinit fork_idle
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1443,6 +1445,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()
@ 2009-10-13  4:52     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-10-12 18:59:04.000000000 -0700
@@ -2064,6 +2064,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-12 19:13:24.000000000 -0700
@@ -1335,12 +1335,14 @@ struct task_struct * __cpuinit fork_idle
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1443,6 +1445,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:54     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:54 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
	System calls are restricted to 6 parameters and the number and sizes
	of parameters needed for sys_clone3() exceed 6 integers. The new
	prototype works around this restriction while providing some
	flexibility if clone3() needs to be further extended in the future.
TODO:
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h    |    1 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/process_32.c       |   34 +++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/types.h              |   10 +++
 kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 142 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-10-12 19:15:08.000000000 -0700
@@ -204,6 +204,16 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 reserved1;
+	u64 parent_tid;
+	u64 child_tid;
+	u64 reserved2;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-12 19:15:08.000000000 -0700
@@ -55,6 +55,7 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-12 19:15:08.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone3		337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-12 19:54:34.000000000 -0700
@@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
+{
+	int rc;
+	struct clone_struct kcs;
+	unsigned long clone_flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack_base;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	if (kcs.reserved1 || kcs.reserved2)
+		return -EINVAL;
+
+	/*
+	 * TODO: Convert clone_flags to 64-bit
+	 */
+	clone_flags = (unsigned long)kcs.flags;
+	child_stack_base = (unsigned long)kcs.child_stack;
+	parent_tid_ptr = (int *)ucs->parent_tid;
+	child_tid_ptr =  (int *)ucs->child_tid;
+	regs = task_pt_regs(current);
+
+	if (!child_stack_base)
+		child_stack_base = user_stack_pointer(regs);
+
+	return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
+			parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-10-12 19:15:08.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-12 19:15:08.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-13  4:54     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
	System calls are restricted to 6 parameters and the number and sizes
	of parameters needed for sys_clone3() exceed 6 integers. The new
	prototype works around this restriction while providing some
	flexibility if clone3() needs to be further extended in the future.
TODO:
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    1 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/process_32.c       |   34 +++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/types.h              |   10 +++
 kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 142 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-10-12 19:15:08.000000000 -0700
@@ -204,6 +204,16 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 reserved1;
+	u64 parent_tid;
+	u64 child_tid;
+	u64 reserved2;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-12 19:15:08.000000000 -0700
@@ -55,6 +55,7 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-12 19:15:08.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone3		337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-12 19:54:34.000000000 -0700
@@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
+{
+	int rc;
+	struct clone_struct kcs;
+	unsigned long clone_flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack_base;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	if (kcs.reserved1 || kcs.reserved2)
+		return -EINVAL;
+
+	/*
+	 * TODO: Convert clone_flags to 64-bit
+	 */
+	clone_flags = (unsigned long)kcs.flags;
+	child_stack_base = (unsigned long)kcs.child_stack;
+	parent_tid_ptr = (int *)ucs->parent_tid;
+	child_tid_ptr =  (int *)ucs->child_tid;
+	regs = task_pt_regs(current);
+
+	if (!child_stack_base)
+		child_stack_base = user_stack_pointer(regs);
+
+	return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
+			parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-10-12 19:15:08.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-12 19:15:08.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 10/10]: Document clone3() syscall
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13  4:55     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:55 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Subject: [RFC][v8][PATCH 10/10]: Document clone3() syscall

This gives a brief overview of the clone3() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev-8jLBTbqmX/OZamtmwQBW5tBPR1lH4CV8@public.gmane.org>
---
 Documentation/clone2 |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

Index: linux-2.6/Documentation/clone2
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700
@@ -0,0 +1,89 @@
+
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 reserved1;
+	u64 parent_tid;
+	u64 child_tid;
+	u64 reserved2;
+};
+
+clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone3() system call:
+
+		- allows additional clone flags (all 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @pids defines the set of pids that should be assigned to the child
+	process in its active and ancestor pid name spaces. The descendant pid
+	namespaces do not matter since a process does not have a pid in
+	descendant namespaces, unless the process is in a new pid namespace
+	in which case the process is a container-init (and must have the pid 1
+	in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	The order pids in @pids corresponds to the nesting order of pid-
+	namespaces, with @pids[0] corresponding to the init_pid_ns.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails with -EBUSY.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone3() returns -1 and sets 'errno' to one of following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
+		this call.
+
+	EINVAL	The number of pids specified in 'clone_struct.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_struct argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that name space.
+
+Example:
+
+	pid_t pids[] = { 77, 99 };
+	struct clone_struct cs;
+
+	cs.flags = (u64) SIGCHLD;
+	cs.child_stack = (u64) setup_child_stack();
+	cs.nr_pids = 2;
+	cs.parent_tid = 0LL;
+	cs.child_tid = 0LL;
+
+	rc = syscall(__NR_clone3, &cs, pids);
+
+	if (rc < 0) {
+		perror("clone3()");
+		exit(1);
+	} else if (rc) {
+		/* Parent */
+	} else {
+		/* Child */
+	}
+

^ permalink raw reply	[flat|nested] 226+ messages in thread

* [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-13  4:55     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13  4:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev



Subject: [RFC][v8][PATCH 10/10]: Document clone3() syscall

This gives a brief overview of the clone3() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
---
 Documentation/clone2 |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

Index: linux-2.6/Documentation/clone2
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700
@@ -0,0 +1,89 @@
+
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 reserved1;
+	u64 parent_tid;
+	u64 child_tid;
+	u64 reserved2;
+};
+
+clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone3() system call:
+
+		- allows additional clone flags (all 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @pids defines the set of pids that should be assigned to the child
+	process in its active and ancestor pid name spaces. The descendant pid
+	namespaces do not matter since a process does not have a pid in
+	descendant namespaces, unless the process is in a new pid namespace
+	in which case the process is a container-init (and must have the pid 1
+	in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	The order pids in @pids corresponds to the nesting order of pid-
+	namespaces, with @pids[0] corresponding to the init_pid_ns.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails with -EBUSY.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone3() returns -1 and sets 'errno' to one of following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
+		this call.
+
+	EINVAL	The number of pids specified in 'clone_struct.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_struct argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that name space.
+
+Example:
+
+	pid_t pids[] = { 77, 99 };
+	struct clone_struct cs;
+
+	cs.flags = (u64) SIGCHLD;
+	cs.child_stack = (u64) setup_child_stack();
+	cs.nr_pids = 2;
+	cs.parent_tid = 0LL;
+	cs.child_tid = 0LL;
+
+	rc = syscall(__NR_clone3, &cs, pids);
+
+	if (rc < 0) {
+		perror("clone3()");
+		exit(1);
+	} else if (rc) {
+		/* Parent */
+	} else {
+		/* Child */
+	}
+

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13  4:50     ` Sukadev Bhattiprolu
@ 2009-10-13  5:19         ` Alexey Dobriyan
  -1 siblings, 0 replies; 226+ messages in thread
From: Alexey Dobriyan @ 2009-10-13  5:19 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Mon, Oct 12, 2009 at 09:50:41PM -0700, Sukadev Bhattiprolu wrote:
> From: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> 
> Remove the pid_max global, and make it a property of the
> pid_namespace.  When a pid_ns is created, it inherits
> the parent's pid_ns.
> 
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.

pid_max is not pid_ns property, it's just sysctl to limit cpu time
during bitmap search. Just like vm.max_map_count limits kernel memory
used by VMAs.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
@ 2009-10-13  5:19         ` Alexey Dobriyan
  0 siblings, 0 replies; 226+ messages in thread
From: Alexey Dobriyan @ 2009-10-13  5:19 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

On Mon, Oct 12, 2009 at 09:50:41PM -0700, Sukadev Bhattiprolu wrote:
> From: Serge Hallyn <serue@us.ibm.com>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> 
> Remove the pid_max global, and make it a property of the
> pid_namespace.  When a pid_ns is created, it inherits
> the parent's pid_ns.
> 
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.

pid_max is not pid_ns property, it's just sysctl to limit cpu time
during bitmap search. Just like vm.max_map_count limits kernel memory
used by VMAs.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
  2009-10-13  4:51     ` Sukadev Bhattiprolu
@ 2009-10-13 11:50         ` Pavel Emelyanov
  -1 siblings, 0 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 11:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
> 

[ snip ]

> @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
>  	return 0;
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)

Please no! Better create another function, that will just
atomic_test_and_set() the proper bit in the map. This one
is heavy enough already.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
@ 2009-10-13 11:50         ` Pavel Emelyanov
  0 siblings, 0 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 11:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
> 

[ snip ]

> @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
>  	return 0;
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)

Please no! Better create another function, that will just
atomic_test_and_set() the proper bit in the map. This one
is heavy enough already.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13  4:50     ` Sukadev Bhattiprolu
@ 2009-10-13 13:09         ` Pavel Emelyanov
  -1 siblings, 0 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 13:09 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> 
> From: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> 
> Remove the pid_max global, and make it a property of the
> pid_namespace.  When a pid_ns is created, it inherits
> the parent's pid_ns.
> 
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.
> 
> Changelog[v2]:
> 	- Port to newer kernel
> 	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> 
> Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Not that I'm about to slow down or block the process, but...
frankly I don't see the reason for doing so. Why should we?
Especially taking into account, that we essentially cannot
change thin in the namespace level 3 and deeper?

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
@ 2009-10-13 13:09         ` Pavel Emelyanov
  0 siblings, 0 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 13:09 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Sukadev Bhattiprolu wrote:
> 
> From: Serge Hallyn <serue@us.ibm.com>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> 
> Remove the pid_max global, and make it a property of the
> pid_namespace.  When a pid_ns is created, it inherits
> the parent's pid_ns.
> 
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.
> 
> Changelog[v2]:
> 	- Port to newer kernel
> 	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> 
> Signed-off-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Not that I'm about to slow down or block the process, but...
frankly I don't see the reason for doing so. Why should we?
Especially taking into account, that we essentially cannot
change thin in the namespace level 3 and deeper?

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13 13:09         ` Pavel Emelyanov
@ 2009-10-13 15:24             ` Serge E. Hallyn
  -1 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-13 15:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> Sukadev Bhattiprolu wrote:
> > 
> > From: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> > 
> > Remove the pid_max global, and make it a property of the
> > pid_namespace.  When a pid_ns is created, it inherits
> > the parent's pid_ns.
> > 
> > Fixing up sysctl (trivial akin to ipc version, but
> > potentially tedious to get right for all CONFIG*
> > combinations) is left for later.
> > 
> > Changelog[v2]:
> > 	- Port to newer kernel
> > 	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> > 
> > Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> 
> Not that I'm about to slow down or block the process, but...

This patch isn't a core part of the clone_with_pid functionality,
just something Eric has asked for.  So I don't object to dropping
it.  But I disagree with Alexey's claim that this isn't a namespace
property.  It should be.

> frankly I don't see the reason for doing so. Why should we?
> Especially taking into account, that we essentially cannot
> change thin in the namespace level 3 and deeper?

What do you mean by that?  With this patchset we're not, it's
true, but we trivially can - even now, userspace can simply not
give the container CAP_SYS_ADMIN or write access to the sysctl
so they can't do any more CLONE_NEWPIDS or change the sysctl.

-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
@ 2009-10-13 15:24             ` Serge E. Hallyn
  0 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-13 15:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Andrew Morton, torvalds,
	mikew, mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	roland, kosaki.motohiro, randy.dunlap, linux-api, Containers,
	sukadev

Quoting Pavel Emelyanov (xemul@openvz.org):
> Sukadev Bhattiprolu wrote:
> > 
> > From: Serge Hallyn <serue@us.ibm.com>
> > Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> > 
> > Remove the pid_max global, and make it a property of the
> > pid_namespace.  When a pid_ns is created, it inherits
> > the parent's pid_ns.
> > 
> > Fixing up sysctl (trivial akin to ipc version, but
> > potentially tedious to get right for all CONFIG*
> > combinations) is left for later.
> > 
> > Changelog[v2]:
> > 	- Port to newer kernel
> > 	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> > 
> > Signed-off-by: Serge Hallyn <serue@us.ibm.com>
> > Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
> 
> Not that I'm about to slow down or block the process, but...

This patch isn't a core part of the clone_with_pid functionality,
just something Eric has asked for.  So I don't object to dropping
it.  But I disagree with Alexey's claim that this isn't a namespace
property.  It should be.

> frankly I don't see the reason for doing so. Why should we?
> Especially taking into account, that we essentially cannot
> change thin in the namespace level 3 and deeper?

What do you mean by that?  With this patchset we're not, it's
true, but we trivially can - even now, userspace can simply not
give the container CAP_SYS_ADMIN or write access to the sysctl
so they can't do any more CLONE_NEWPIDS or change the sysctl.

-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
       [not found]             ` <20091013152453.GA9994-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-13 16:10               ` Pavel Emelyanov
  0 siblings, 0 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 16:10 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA

> This patch isn't a core part of the clone_with_pid functionality,
> just something Eric has asked for.  So I don't object to dropping
> it.  But I disagree with Alexey's claim that this isn't a namespace
> property.  It should be.

OK

>> frankly I don't see the reason for doing so. Why should we?
>> Especially taking into account, that we essentially cannot
>> change thin in the namespace level 3 and deeper?
> 
> What do you mean by that?  With this patchset we're not, it's
> true, but we trivially can - even now, userspace can simply not
> give the container CAP_SYS_ADMIN or write access to the sysctl
> so they can't do any more CLONE_NEWPIDS or change the sysctl.

It's a misprint - I meant "level 2 and deeper". Sysctl is
only pointing at the init_pid_ns variable.

> -serge
> 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13 15:24             ` Serge E. Hallyn
  (?)
@ 2009-10-13 16:10             ` Pavel Emelyanov
  2009-10-13 16:28               ` Serge E. Hallyn
       [not found]               ` <4AD4A676.3010603-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  -1 siblings, 2 replies; 226+ messages in thread
From: Pavel Emelyanov @ 2009-10-13 16:10 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Andrew Morton, torvalds,
	mikew, mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	roland, kosaki.motohiro, randy.dunlap, linux-api, Containers,
	sukadev

> This patch isn't a core part of the clone_with_pid functionality,
> just something Eric has asked for.  So I don't object to dropping
> it.  But I disagree with Alexey's claim that this isn't a namespace
> property.  It should be.

OK

>> frankly I don't see the reason for doing so. Why should we?
>> Especially taking into account, that we essentially cannot
>> change thin in the namespace level 3 and deeper?
> 
> What do you mean by that?  With this patchset we're not, it's
> true, but we trivially can - even now, userspace can simply not
> give the container CAP_SYS_ADMIN or write access to the sysctl
> so they can't do any more CLONE_NEWPIDS or change the sysctl.

It's a misprint - I meant "level 2 and deeper". Sysctl is
only pointing at the init_pid_ns variable.

> -serge
> 


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
       [not found]               ` <4AD4A676.3010603-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2009-10-13 16:28                 ` Serge E. Hallyn
  0 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-13 16:28 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> > This patch isn't a core part of the clone_with_pid functionality,
> > just something Eric has asked for.  So I don't object to dropping
> > it.  But I disagree with Alexey's claim that this isn't a namespace
> > property.  It should be.
> 
> OK
> 
> >> frankly I don't see the reason for doing so. Why should we?
> >> Especially taking into account, that we essentially cannot
> >> change thin in the namespace level 3 and deeper?
> > 
> > What do you mean by that?  With this patchset we're not, it's
> > true, but we trivially can - even now, userspace can simply not
> > give the container CAP_SYS_ADMIN or write access to the sysctl
> > so they can't do any more CLONE_NEWPIDS or change the sysctl.
> 
> It's a misprint - I meant "level 2 and deeper". Sysctl is
> only pointing at the init_pid_ns variable.

Right, and I'm saying that's to be fixed up as with all other
containerized sysctl's.  You're right that this patch doesn't
solve that problem, but you seem to be arguing that it bc it's
not done in this patch, we should act as though it can't be
done.

But again, maybe we're best off dropping this patch (sorry, Suka,
I had suggested you add it...) and focusing on the rest of the set
for now.

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
  2009-10-13 16:10             ` Pavel Emelyanov
@ 2009-10-13 16:28               ` Serge E. Hallyn
       [not found]               ` <4AD4A676.3010603-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  1 sibling, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-13 16:28 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Andrew Morton, torvalds,
	mikew, mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	roland, kosaki.motohiro, randy.dunlap, linux-api, Containers,
	sukadev

Quoting Pavel Emelyanov (xemul@openvz.org):
> > This patch isn't a core part of the clone_with_pid functionality,
> > just something Eric has asked for.  So I don't object to dropping
> > it.  But I disagree with Alexey's claim that this isn't a namespace
> > property.  It should be.
> 
> OK
> 
> >> frankly I don't see the reason for doing so. Why should we?
> >> Especially taking into account, that we essentially cannot
> >> change thin in the namespace level 3 and deeper?
> > 
> > What do you mean by that?  With this patchset we're not, it's
> > true, but we trivially can - even now, userspace can simply not
> > give the container CAP_SYS_ADMIN or write access to the sysctl
> > so they can't do any more CLONE_NEWPIDS or change the sysctl.
> 
> It's a misprint - I meant "level 2 and deeper". Sysctl is
> only pointing at the init_pid_ns variable.

Right, and I'm saying that's to be fixed up as with all other
containerized sysctl's.  You're right that this patch doesn't
solve that problem, but you seem to be arguing that it bc it's
not done in this patch, we should act as though it can't be
done.

But again, maybe we're best off dropping this patch (sorry, Suka,
I had suggested you add it...) and focusing on the rest of the set
for now.

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags
  2009-10-13  4:52     ` Sukadev Bhattiprolu
@ 2009-10-13 18:35         ` Oren Laadan
  -1 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-13 18:35 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
> 
> As pointed out by Oren Laadan, we want to ensure that unused bits in the
> clone-flags remain unused and available for future. To ensure this, define
> a mask of clone-flags and check the flags in the clone() system calls.
> 
> Changelog[v8]:
> 	- New patch in set
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> ---
>  include/linux/sched.h |   10 ++++++++++
>  kernel/fork.c         |    3 +++
>  2 files changed, 13 insertions(+)
> 
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
> +++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
> @@ -29,6 +29,16 @@
>  #define CLONE_NEWNET		0x40000000	/* New network namespace */
>  #define CLONE_IO		0x80000000	/* Clone io context */
>  
> +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
> +				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
> +				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
> +				 CLONE_SYSVSEM | CLONE_SETTLS                |\
> +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
> +				 CLONE_DETACHED | CLONE_UNTRACED             |\
> +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
> +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
> +				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
> +
>  /*
>   * Scheduling policies
>   */
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
> +++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
> @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
>  

We can safely apply these tests to clone3(), because it is a new syscall.

However, I don't know if applying it to clone() can break existing
application that may already be (incorrectly) using invalid flags ?

Oren.

> +	if (clone_flags & ~VALID_CLONE_FLAGS)
> +		return ERR_PTR(-EINVAL);
> +
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
>  
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags
@ 2009-10-13 18:35         ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-13 18:35 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
> 
> As pointed out by Oren Laadan, we want to ensure that unused bits in the
> clone-flags remain unused and available for future. To ensure this, define
> a mask of clone-flags and check the flags in the clone() system calls.
> 
> Changelog[v8]:
> 	- New patch in set
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> 
> ---
>  include/linux/sched.h |   10 ++++++++++
>  kernel/fork.c         |    3 +++
>  2 files changed, 13 insertions(+)
> 
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
> +++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
> @@ -29,6 +29,16 @@
>  #define CLONE_NEWNET		0x40000000	/* New network namespace */
>  #define CLONE_IO		0x80000000	/* Clone io context */
>  
> +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
> +				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
> +				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
> +				 CLONE_SYSVSEM | CLONE_SETTLS                |\
> +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
> +				 CLONE_DETACHED | CLONE_UNTRACED             |\
> +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
> +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
> +				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
> +
>  /*
>   * Scheduling policies
>   */
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
> +++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
> @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
>  

We can safely apply these tests to clone3(), because it is a new syscall.

However, I don't know if applying it to clone() can break existing
application that may already be (incorrectly) using invalid flags ?

Oren.

> +	if (clone_flags & ~VALID_CLONE_FLAGS)
> +		return ERR_PTR(-EINVAL);
> +
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
>  
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-13  4:54     ` Sukadev Bhattiprolu
@ 2009-10-13 18:46         ` Oren Laadan
  -1 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-13 18:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
> 
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
> 
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
> 
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
> 
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
> 
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
> 
> NOTE:
> 	System calls are restricted to 6 parameters and the number and sizes
> 	of parameters needed for sys_clone3() exceed 6 integers. The new
> 	prototype works around this restriction while providing some
> 	flexibility if clone3() needs to be further extended in the future.
> TODO:
> 	- May need additional sanity checks in do_fork_with_pids().
> 
> Changelog[v8]
> 	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
> 	  must be 64-bit.
> 	- clone2() is in use in IA64. Rename system call to clone3().
> 
> Changelog[v7]:
> 	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
> 	  and group parameters into a new 'struct clone_struct' object.
> 
> Changelog[v6]:
> 	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
> 	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
> 	  is constant across architectures.
> 	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
> 	  'unum_pids < 0' check.
> 
> Changelog[v4]:
> 	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
> 
> Changelog[v3]:
> 	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> 	  in the target_pids[] list and setting it 0. See copy_target_pids()).
> 	- (Oren Laadan) Specified target pids should apply only to youngest
> 	  pid-namespaces (see copy_target_pids())
> 	- (Matt Helsley) Update patch description.
> 
> Changelog[v2]:
> 	- Remove unnecessary printk and add a note to callers of
> 	  copy_target_pids() to free target_pids.
> 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> 	  'num_pids == 0' (fall back to normal clone()).
> 	- Move arch-independent code (sanity checks and copy-in of target-pids)
> 	  into kernel/fork.c and simplify sys_clone_with_pids()
> 
> Changelog[v1]:
> 	- Fixed some compile errors (had fixed these errors earlier in my
> 	  git tree but had not refreshed patches before emailing them)
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  arch/x86/include/asm/syscalls.h    |    1 
>  arch/x86/include/asm/unistd_32.h   |    1 
>  arch/x86/kernel/process_32.c       |   34 +++++++++++++
>  arch/x86/kernel/syscall_table_32.S |    1 
>  include/linux/types.h              |   10 +++
>  kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
>  6 files changed, 142 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/include/linux/types.h	2009-10-12 19:15:08.000000000 -0700
> @@ -204,6 +204,16 @@ struct ustat {
>  	char			f_fpack[6];
>  };
>  
> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;
> +	u32 nr_pids;
> +	u32 reserved1;
> +	u64 parent_tid;
> +	u64 child_tid;
> +	u64 reserved2;
> +};
> +
>  #endif	/* __KERNEL__ */
>  #endif /*  __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-12 19:15:08.000000000 -0700
> @@ -55,6 +55,7 @@ struct sel_arg_struct;
>  struct oldold_utsname;
>  struct old_utsname;
>  
> +asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
>  asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
>  			  unsigned long, unsigned long, unsigned long);
>  asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-12 19:15:08.000000000 -0700
> @@ -342,6 +342,7 @@
>  #define __NR_pwritev		334
>  #define __NR_rt_tgsigqueueinfo	335
>  #define __NR_perf_counter_open	336
> +#define __NR_clone3		337
>  
>  #ifdef __KERNEL__
>  
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-12 19:54:34.000000000 -0700
> @@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
>  	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>  
> +asmlinkage long
> +sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> +{
> +	int rc;
> +	struct clone_struct kcs;
> +	unsigned long clone_flags;
> +	int __user *parent_tid_ptr;
> +	int __user *child_tid_ptr;
> +	unsigned long __user child_stack_base;
> +	struct pt_regs *regs;
> +
> +	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> +	if (rc)
> +		return -EFAULT;
> +
> +	if (kcs.reserved1 || kcs.reserved2)
> +		return -EINVAL;
> +
> +	/*
> +	 * TODO: Convert clone_flags to 64-bit
> +	 */
> +	clone_flags = (unsigned long)kcs.flags;

On 32-bit arch this would dismiss upper half of the flags :(
Is there a reason not to use 64-bit ('u64' or 'long long') now ?

[...]

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-13 18:46         ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-13 18:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
> 
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
> 
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
> 
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
> 
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
> 
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
> 
> NOTE:
> 	System calls are restricted to 6 parameters and the number and sizes
> 	of parameters needed for sys_clone3() exceed 6 integers. The new
> 	prototype works around this restriction while providing some
> 	flexibility if clone3() needs to be further extended in the future.
> TODO:
> 	- May need additional sanity checks in do_fork_with_pids().
> 
> Changelog[v8]
> 	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
> 	  must be 64-bit.
> 	- clone2() is in use in IA64. Rename system call to clone3().
> 
> Changelog[v7]:
> 	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
> 	  and group parameters into a new 'struct clone_struct' object.
> 
> Changelog[v6]:
> 	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
> 	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
> 	  is constant across architectures.
> 	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
> 	  'unum_pids < 0' check.
> 
> Changelog[v4]:
> 	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
> 
> Changelog[v3]:
> 	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> 	  in the target_pids[] list and setting it 0. See copy_target_pids()).
> 	- (Oren Laadan) Specified target pids should apply only to youngest
> 	  pid-namespaces (see copy_target_pids())
> 	- (Matt Helsley) Update patch description.
> 
> Changelog[v2]:
> 	- Remove unnecessary printk and add a note to callers of
> 	  copy_target_pids() to free target_pids.
> 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> 	  'num_pids == 0' (fall back to normal clone()).
> 	- Move arch-independent code (sanity checks and copy-in of target-pids)
> 	  into kernel/fork.c and simplify sys_clone_with_pids()
> 
> Changelog[v1]:
> 	- Fixed some compile errors (had fixed these errors earlier in my
> 	  git tree but had not refreshed patches before emailing them)
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/syscalls.h    |    1 
>  arch/x86/include/asm/unistd_32.h   |    1 
>  arch/x86/kernel/process_32.c       |   34 +++++++++++++
>  arch/x86/kernel/syscall_table_32.S |    1 
>  include/linux/types.h              |   10 +++
>  kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
>  6 files changed, 142 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/include/linux/types.h	2009-10-12 19:15:08.000000000 -0700
> @@ -204,6 +204,16 @@ struct ustat {
>  	char			f_fpack[6];
>  };
>  
> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;
> +	u32 nr_pids;
> +	u32 reserved1;
> +	u64 parent_tid;
> +	u64 child_tid;
> +	u64 reserved2;
> +};
> +
>  #endif	/* __KERNEL__ */
>  #endif /*  __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-12 19:15:08.000000000 -0700
> @@ -55,6 +55,7 @@ struct sel_arg_struct;
>  struct oldold_utsname;
>  struct old_utsname;
>  
> +asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
>  asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
>  			  unsigned long, unsigned long, unsigned long);
>  asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-12 19:15:08.000000000 -0700
> @@ -342,6 +342,7 @@
>  #define __NR_pwritev		334
>  #define __NR_rt_tgsigqueueinfo	335
>  #define __NR_perf_counter_open	336
> +#define __NR_clone3		337
>  
>  #ifdef __KERNEL__
>  
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-12 19:54:34.000000000 -0700
> @@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
>  	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>  
> +asmlinkage long
> +sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> +{
> +	int rc;
> +	struct clone_struct kcs;
> +	unsigned long clone_flags;
> +	int __user *parent_tid_ptr;
> +	int __user *child_tid_ptr;
> +	unsigned long __user child_stack_base;
> +	struct pt_regs *regs;
> +
> +	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> +	if (rc)
> +		return -EFAULT;
> +
> +	if (kcs.reserved1 || kcs.reserved2)
> +		return -EINVAL;
> +
> +	/*
> +	 * TODO: Convert clone_flags to 64-bit
> +	 */
> +	clone_flags = (unsigned long)kcs.flags;

On 32-bit arch this would dismiss upper half of the flags :(
Is there a reason not to use 64-bit ('u64' or 'long long') now ?

[...]

Oren.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13 20:50     ` Roland McGrath
  -1 siblings, 0 replies; 226+ messages in thread
From: Roland McGrath @ 2009-10-13 20:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

Some userland debugging things and so forth like to look at the clone_flags
argument, so that is kept simpler for them if it stays in a register
(i.e. its own argument) rather than a user pointer fetch for that argument.
Any problem with:

	sys_clone3(unsigned long clone_flags,
		   struct clone_struct __user *cs, pid_t __user *pids)

?

That also has the side benefit that instead of non-ia64 users forever
asking, "Why is it clone3 when there is no clone2?" you can instead pretend
that it follows the "clone3 because it takes three arguments" convention. ;-)

Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
How about "struct clone_args"?

Also, if you were to replace:

		u64 child_stack;

with:

		u64 child_stack_base;
		u64 child_stack_size;

and use in sys_clone3 (for most arch's):

	child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;

then the same clone3 interface would cover ia64 as well.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13 20:50     ` Roland McGrath
  0 siblings, 0 replies; 226+ messages in thread
From: Roland McGrath @ 2009-10-13 20:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Some userland debugging things and so forth like to look at the clone_flags
argument, so that is kept simpler for them if it stays in a register
(i.e. its own argument) rather than a user pointer fetch for that argument.
Any problem with:

	sys_clone3(unsigned long clone_flags,
		   struct clone_struct __user *cs, pid_t __user *pids)

?

That also has the side benefit that instead of non-ia64 users forever
asking, "Why is it clone3 when there is no clone2?" you can instead pretend
that it follows the "clone3 because it takes three arguments" convention. ;-)

Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
How about "struct clone_args"?

Also, if you were to replace:

		u64 child_stack;

with:

		u64 child_stack_base;
		u64 child_stack_size;

and use in sys_clone3 (for most arch's):

	child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;

then the same clone3 interface would cover ia64 as well.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13 20:50     ` Roland McGrath
@ 2009-10-13 23:27         ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13 23:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

Roland McGrath [roland-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org] wrote:
| Some userland debugging things and so forth like to look at the clone_flags
| argument, so that is kept simpler for them if it stays in a register
| (i.e. its own argument) rather than a user pointer fetch for that argument.
| Any problem with:
| 
| 	sys_clone3(unsigned long clone_flags,
| 		   struct clone_struct __user *cs, pid_t __user *pids)
| 
| ?

My only concern is the support of 64-bit clone flags on 32-bit architectures.
We would need an additional register/parameter, clone_flags_high ?  Also,
hopefully we won't need more than 64 flags, but if we do, the plan AFACIT,
was that we could use one of the reserved fields.

| 
| That also has the side benefit that instead of non-ia64 users forever
| asking, "Why is it clone3 when there is no clone2?" you can instead pretend
| that it follows the "clone3 because it takes three arguments" convention. ;-)
| 
| Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
| How about "struct clone_args"?

Sure :-) In earlier version of patches I had mixed up clone_struct and
clone_arg in comments/descriptions and cleaned up in this version. 
| 
| Also, if you were to replace:
| 
| 		u64 child_stack;
| 
| with:
| 
| 		u64 child_stack_base;
| 		u64 child_stack_size;
| 
| and use in sys_clone3 (for most arch's):
| 
| 	child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;
| 
| then the same clone3 interface would cover ia64 as well.

Ok

| 
| 
| Thanks,
| Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13 23:27         ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13 23:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Roland McGrath [roland@redhat.com] wrote:
| Some userland debugging things and so forth like to look at the clone_flags
| argument, so that is kept simpler for them if it stays in a register
| (i.e. its own argument) rather than a user pointer fetch for that argument.
| Any problem with:
| 
| 	sys_clone3(unsigned long clone_flags,
| 		   struct clone_struct __user *cs, pid_t __user *pids)
| 
| ?

My only concern is the support of 64-bit clone flags on 32-bit architectures.
We would need an additional register/parameter, clone_flags_high ?  Also,
hopefully we won't need more than 64 flags, but if we do, the plan AFACIT,
was that we could use one of the reserved fields.

| 
| That also has the side benefit that instead of non-ia64 users forever
| asking, "Why is it clone3 when there is no clone2?" you can instead pretend
| that it follows the "clone3 because it takes three arguments" convention. ;-)
| 
| Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
| How about "struct clone_args"?

Sure :-) In earlier version of patches I had mixed up clone_struct and
clone_arg in comments/descriptions and cleaned up in this version. 
| 
| Also, if you were to replace:
| 
| 		u64 child_stack;
| 
| with:
| 
| 		u64 child_stack_base;
| 		u64 child_stack_size;
| 
| and use in sys_clone3 (for most arch's):
| 
| 	child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;
| 
| then the same clone3 interface would cover ia64 as well.

Ok

| 
| 
| Thanks,
| Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags
  2009-10-13 18:35         ` Oren Laadan
@ 2009-10-13 23:38             ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13 23:38 UTC (permalink / raw)
  To: Oren Laadan
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan [orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > 
| > Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
| > 
| > As pointed out by Oren Laadan, we want to ensure that unused bits in the
| > clone-flags remain unused and available for future. To ensure this, define
| > a mask of clone-flags and check the flags in the clone() system calls.
| > 
| > Changelog[v8]:
| > 	- New patch in set
| > 
| > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| > 
| > ---
| >  include/linux/sched.h |   10 ++++++++++
| >  kernel/fork.c         |    3 +++
| >  2 files changed, 13 insertions(+)
| > 
| > Index: linux-2.6/include/linux/sched.h
| > ===================================================================
| > --- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
| > +++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
| > @@ -29,6 +29,16 @@
| >  #define CLONE_NEWNET		0x40000000	/* New network namespace */
| >  #define CLONE_IO		0x80000000	/* Clone io context */
| >  
| > +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
| > +				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
| > +				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
| > +				 CLONE_SYSVSEM | CLONE_SETTLS                |\
| > +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
| > +				 CLONE_DETACHED | CLONE_UNTRACED             |\
| > +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
| > +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
| > +				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
| > +
| >  /*
| >   * Scheduling policies
| >   */
| > Index: linux-2.6/kernel/fork.c
| > ===================================================================
| > --- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
| > +++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
| > @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
| >  	struct task_struct *p;
| >  	int cgroup_callbacks_done = 0;
| >  
| 
| We can safely apply these tests to clone3(), because it is a new syscall.
| 
| However, I don't know if applying it to clone() can break existing
| application that may already be (incorrectly) using invalid flags ?

Doing the check in copy_process() seems would apply to all architectures.

As for breaking an existing app, there is only one unused flag, 0x00001000
(bw CLONE_SIGHAND and CLONE_PTRACE). If an application could be using that
and we don't want to break it, maybe we can add following macro to the
VALID_CLONE_FLAGS list ?

	#define CLONE_UNUSED_BIT	0x00001000

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags
@ 2009-10-13 23:38             ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-13 23:38 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov

Oren Laadan [orenl@librato.com] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > 
| > Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
| > 
| > As pointed out by Oren Laadan, we want to ensure that unused bits in the
| > clone-flags remain unused and available for future. To ensure this, define
| > a mask of clone-flags and check the flags in the clone() system calls.
| > 
| > Changelog[v8]:
| > 	- New patch in set
| > 
| > Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| > 
| > ---
| >  include/linux/sched.h |   10 ++++++++++
| >  kernel/fork.c         |    3 +++
| >  2 files changed, 13 insertions(+)
| > 
| > Index: linux-2.6/include/linux/sched.h
| > ===================================================================
| > --- linux-2.6.orig/include/linux/sched.h	2009-10-02 18:53:55.000000000 -0700
| > +++ linux-2.6/include/linux/sched.h	2009-10-02 19:58:21.000000000 -0700
| > @@ -29,6 +29,16 @@
| >  #define CLONE_NEWNET		0x40000000	/* New network namespace */
| >  #define CLONE_IO		0x80000000	/* Clone io context */
| >  
| > +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
| > +				 CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK  |\
| > +				 CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS   |\
| > +				 CLONE_SYSVSEM | CLONE_SETTLS                |\
| > +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
| > +				 CLONE_DETACHED | CLONE_UNTRACED             |\
| > +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
| > +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
| > +				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
| > +
| >  /*
| >   * Scheduling policies
| >   */
| > Index: linux-2.6/kernel/fork.c
| > ===================================================================
| > --- linux-2.6.orig/kernel/fork.c	2009-10-02 19:00:08.000000000 -0700
| > +++ linux-2.6/kernel/fork.c	2009-10-02 19:57:36.000000000 -0700
| > @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
| >  	struct task_struct *p;
| >  	int cgroup_callbacks_done = 0;
| >  
| 
| We can safely apply these tests to clone3(), because it is a new syscall.
| 
| However, I don't know if applying it to clone() can break existing
| application that may already be (incorrectly) using invalid flags ?

Doing the check in copy_process() seems would apply to all architectures.

As for breaking an existing app, there is only one unused flag, 0x00001000
(bw CLONE_SIGHAND and CLONE_PTRACE). If an application could be using that
and we don't want to break it, maybe we can add following macro to the
VALID_CLONE_FLAGS list ?

	#define CLONE_UNUSED_BIT	0x00001000


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13  4:49 ` Sukadev Bhattiprolu
@ 2009-10-13 23:49     ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-13 23:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> 
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
> 

A system call named clone3() taking two parameters is just too weird to
live.  No, please.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13 23:49     ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-13 23:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> 
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
> 

A system call named clone3() taking two parameters is just too weird to
live.  No, please.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13 23:27         ` Sukadev Bhattiprolu
@ 2009-10-13 23:53             ` Roland McGrath
  -1 siblings, 0 replies; 226+ messages in thread
From: Roland McGrath @ 2009-10-13 23:53 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

> My only concern is the support of 64-bit clone flags on 32-bit architectures.

Oy.  I didn't realize there was serious consideration of having more than
32 flags.  IMHO it would be a bad choice, since they could only be used via
clone3.  Having high-bit flags work in clone on 64-bit machines but not on
32-bit machines just seems like a wrongly confusing way for things to be.
If any high-bits flags are constrained even on 64-bit machines to uses in
clone3 calls for sanity purposes, then it seems questionable IMHO to have
them be more flags in the same u64 at all.

Since all new features will be via this struct, various new kinds of things
could potentially be done by other new struct fields independent of flags.
But that would of course require putting enough reserved fields in now and
requiring that they be zero-filled now in anticipation of such future uses,
which is not very pleasant either.

In short, I guess I really am saying that "clone_flags_high" (or
"more_flags" or something) does seem better to me than any of the
possibilities for having more than 32 CLONE_* in the current flags word.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-13 23:53             ` Roland McGrath
  0 siblings, 0 replies; 226+ messages in thread
From: Roland McGrath @ 2009-10-13 23:53 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

> My only concern is the support of 64-bit clone flags on 32-bit architectures.

Oy.  I didn't realize there was serious consideration of having more than
32 flags.  IMHO it would be a bad choice, since they could only be used via
clone3.  Having high-bit flags work in clone on 64-bit machines but not on
32-bit machines just seems like a wrongly confusing way for things to be.
If any high-bits flags are constrained even on 64-bit machines to uses in
clone3 calls for sanity purposes, then it seems questionable IMHO to have
them be more flags in the same u64 at all.

Since all new features will be via this struct, various new kinds of things
could potentially be done by other new struct fields independent of flags.
But that would of course require putting enough reserved fields in now and
requiring that they be zero-filled now in anticipation of such future uses,
which is not very pleasant either.

In short, I guess I really am saying that "clone_flags_high" (or
"more_flags" or something) does seem better to me than any of the
possibilities for having more than 32 CLONE_* in the current flags word.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13 23:53             ` Roland McGrath
@ 2009-10-14  1:13                 ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  1:13 UTC (permalink / raw)
  To: Roland McGrath
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, Pavel Emelyanov

On 10/13/2009 04:53 PM, Roland McGrath wrote:
>> My only concern is the support of 64-bit clone flags on 32-bit architectures.
> 
> Oy.  I didn't realize there was serious consideration of having more than
> 32 flags.  IMHO it would be a bad choice, since they could only be used via
> clone3.  Having high-bit flags work in clone on 64-bit machines but not on
> 32-bit machines just seems like a wrongly confusing way for things to be.
> If any high-bits flags are constrained even on 64-bit machines to uses in
> clone3 calls for sanity purposes, then it seems questionable IMHO to have
> them be more flags in the same u64 at all.
> 
> Since all new features will be via this struct, various new kinds of things
> could potentially be done by other new struct fields independent of flags.
> But that would of course require putting enough reserved fields in now and
> requiring that they be zero-filled now in anticipation of such future uses,
> which is not very pleasant either.
> 
> In short, I guess I really am saying that "clone_flags_high" (or
> "more_flags" or something) does seem better to me than any of the
> possibilities for having more than 32 CLONE_* in the current flags word.
> 

Overall it seems sane to:

a) make it an actual 3-argument call;
b) make the existing flags a u32 forever, and make it a separate
   argument;
c) any new expansion can be via the struct, which may want to have
   an "c3_flags" field first in the structure.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  1:13                 ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  1:13 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

On 10/13/2009 04:53 PM, Roland McGrath wrote:
>> My only concern is the support of 64-bit clone flags on 32-bit architectures.
> 
> Oy.  I didn't realize there was serious consideration of having more than
> 32 flags.  IMHO it would be a bad choice, since they could only be used via
> clone3.  Having high-bit flags work in clone on 64-bit machines but not on
> 32-bit machines just seems like a wrongly confusing way for things to be.
> If any high-bits flags are constrained even on 64-bit machines to uses in
> clone3 calls for sanity purposes, then it seems questionable IMHO to have
> them be more flags in the same u64 at all.
> 
> Since all new features will be via this struct, various new kinds of things
> could potentially be done by other new struct fields independent of flags.
> But that would of course require putting enough reserved fields in now and
> requiring that they be zero-filled now in anticipation of such future uses,
> which is not very pleasant either.
> 
> In short, I guess I really am saying that "clone_flags_high" (or
> "more_flags" or something) does seem better to me than any of the
> possibilities for having more than 32 CLONE_* in the current flags word.
> 

Overall it seems sane to:

a) make it an actual 3-argument call;
b) make the existing flags a u32 forever, and make it a separate
   argument;
c) any new expansion can be via the struct, which may want to have
   an "c3_flags" field first in the structure.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13 23:49     ` H. Peter Anvin
@ 2009-10-14  1:39         ` Matt Helsley
  -1 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-14  1:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> > 
> > This patchset implements a new system call, clone3() that lets a process
> > specify the pids of the child process.
> > 
> 
> A system call named clone3() taking two parameters is just too weird to
> live.  No, please.

Except we can't use clone2() because it conflicts on ia64.  Care to propose
a name you would prefer?

Also I was a bit suprised to discover there are plenty of examples where this
convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  1:39         ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-14  1:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, linux-api, mingo, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> > 
> > This patchset implements a new system call, clone3() that lets a process
> > specify the pids of the child process.
> > 
> 
> A system call named clone3() taking two parameters is just too weird to
> live.  No, please.

Except we can't use clone2() because it conflicts on ia64.  Care to propose
a name you would prefer?

Also I was a bit suprised to discover there are plenty of examples where this
convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  1:39         ` Matt Helsley
@ 2009-10-14  2:24             ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  2:24 UTC (permalink / raw)
  To: Matt Helsley
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On 10/13/2009 06:39 PM, Matt Helsley wrote:
> On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
>> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>
>> A system call named clone3() taking two parameters is just too weird to
>> live.  No, please.
> 
> Except we can't use clone2() because it conflicts on ia64.  Care to propose
> a name you would prefer?
> 
> Also I was a bit suprised to discover there are plenty of examples where this
> convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
> fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.
> 

The -86, -64 and so on are visually obviously not a parameter count.
sys_mmap2 is not user visible, and so doesn't really matter.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  2:24             ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  2:24 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, linux-api, mingo, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

On 10/13/2009 06:39 PM, Matt Helsley wrote:
> On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
>> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>
>> A system call named clone3() taking two parameters is just too weird to
>> live.  No, please.
> 
> Except we can't use clone2() because it conflicts on ia64.  Care to propose
> a name you would prefer?
> 
> Also I was a bit suprised to discover there are plenty of examples where this
> convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
> fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.
> 

The -86, -64 and so on are visually obviously not a parameter count.
sys_mmap2 is not user visible, and so doesn't really matter.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  1:13                 ` H. Peter Anvin
@ 2009-10-14  4:36                     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14  4:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Roland McGrath, Pavel Emelyanov

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| > 
| > Oy.  I didn't realize there was serious consideration of having more than
| > 32 flags.  IMHO it would be a bad choice, since they could only be used via
| > clone3.  Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| > 
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| > 
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| > 
| 
| Overall it seems sane to:
| 
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
|    argument;
| c) any new expansion can be via the struct, which may want to have
|    an "c3_flags" field first in the structure.

Wouldn't that make it a somewhat awkward interface - applications or
libc have to split the clone flags and specify them in two different
fields ? Do we need to preserve the u32 flags forever ? 

The closest example I can think of is the distinction between signal()
and sigaction(). sigaction() interface makes a lot of sense in part bc
it does not try to look like signal(). And strace seems to be able
decode flags like SA_RESTART and SA_SIGINFO with the pointer to the
sigaction buffer. 

Won't the debug tools be able to do the same with a pointer to
struct clone_args ?

If it needs the new features of clone3(), the application has to be
modified. At that point wouldn't it be easy to write a clone3() wrapper
that calls sys_clone3() if it exists or call sys_clone() if not ?

If clone3() supersedes clone(), all architectures could use the clone3()
interface and it would seem best to make clone3() as simple as possible
without having to preserve the u32 flags.

Would it help to use a type clone_flags_64_t to make the distinction
between types more explicit ?

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  4:36                     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14  4:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

H. Peter Anvin [hpa@zytor.com] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| > 
| > Oy.  I didn't realize there was serious consideration of having more than
| > 32 flags.  IMHO it would be a bad choice, since they could only be used via
| > clone3.  Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| > 
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| > 
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| > 
| 
| Overall it seems sane to:
| 
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
|    argument;
| c) any new expansion can be via the struct, which may want to have
|    an "c3_flags" field first in the structure.

Wouldn't that make it a somewhat awkward interface - applications or
libc have to split the clone flags and specify them in two different
fields ? Do we need to preserve the u32 flags forever ? 

The closest example I can think of is the distinction between signal()
and sigaction(). sigaction() interface makes a lot of sense in part bc
it does not try to look like signal(). And strace seems to be able
decode flags like SA_RESTART and SA_SIGINFO with the pointer to the
sigaction buffer. 

Won't the debug tools be able to do the same with a pointer to
struct clone_args ?

If it needs the new features of clone3(), the application has to be
modified. At that point wouldn't it be easy to write a clone3() wrapper
that calls sys_clone3() if it exists or call sys_clone() if not ?

If clone3() supersedes clone(), all architectures could use the clone3()
interface and it would seem best to make clone3() as simple as possible
without having to preserve the u32 flags.

Would it help to use a type clone_flags_64_t to make the distinction
between types more explicit ?

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  4:36                     ` Sukadev Bhattiprolu
@ 2009-10-14  4:38                         ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  4:38 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Roland McGrath, Pavel Emelyanov

On 10/13/2009 09:36 PM, Sukadev Bhattiprolu wrote:
> 
> Would it help to use a type clone_flags_64_t to make the distinction
> between types more explicit ?
> 

The problem with using the same flags in two places, one as a 32-bit and
one as a 64-bit number, is that using one in the wrong place will cause
silent, but deadly, truncation.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  4:38                         ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  4:38 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Roland McGrath, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

On 10/13/2009 09:36 PM, Sukadev Bhattiprolu wrote:
> 
> Would it help to use a type clone_flags_64_t to make the distinction
> between types more explicit ?
> 

The problem with using the same flags in two places, one as a 32-bit and
one as a 64-bit number, is that using one in the wrong place will cause
silent, but deadly, truncation.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  2:24             ` H. Peter Anvin
@ 2009-10-14  4:40                 ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14  4:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Pavel Emelyanov, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| > 
| > Except we can't use clone2() because it conflicts on ia64.  Care to propose
| > a name you would prefer?

Yes, I am running out of ideas :-)

How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  4:40                 ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14  4:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Matt Helsley, randy.dunlap, arnd, Containers, Nathan Lynch,
	linux-kernel, Louis.Rilling, Eric W. Biederman, kosaki.motohiro,
	linux-api, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

H. Peter Anvin [hpa@zytor.com] wrote:
| > 
| > Except we can't use clone2() because it conflicts on ia64.  Care to propose
| > a name you would prefer?

Yes, I am running out of ideas :-)

How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  4:40                 ` Sukadev Bhattiprolu
@ 2009-10-14  4:50                     ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  4:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Pavel Emelyanov, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

On 10/13/2009 09:40 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> | > 
> | > Except we can't use clone2() because it conflicts on ia64.  Care to propose
> | > a name you would prefer?
> 
> Yes, I am running out of ideas :-)
> 
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

never_clone_alone()  ;)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14  4:50                     ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14  4:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, randy.dunlap, arnd, Containers, Nathan Lynch,
	linux-kernel, Louis.Rilling, Eric W. Biederman, kosaki.motohiro,
	linux-api, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

On 10/13/2009 09:40 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [hpa@zytor.com] wrote:
> | > 
> | > Except we can't use clone2() because it conflicts on ia64.  Care to propose
> | > a name you would prefer?
> 
> Yes, I am running out of ideas :-)
> 
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

never_clone_alone()  ;)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
  2009-10-13  4:55     ` Sukadev Bhattiprolu
@ 2009-10-14 12:26         ` Arnd Bergmann
  -1 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-14 12:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
> +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
> +
> +       In addition to doing everything that clone() system call does,
> +       the clone3() system call:
> +
> +               - allows additional clone flags (all 32 bits in the flags
> +                 parameter to clone() are in use)
> +
> +               - allows user to specify a pid for the child process in its
> +                 active and ancestor pid name spaces.

Someone (sorry, can't find the old mail) pointed out last time that
the 'pid_t *__user tidptr' argument needs to be an independent pointer,
in order to allow the same use patterns with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID that you can do with the current clone
implementation.

Moving that argument from clone_struct into the argument list would
also make this a three-argument syscall, which solves the naming problem.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-14 12:26         ` Arnd Bergmann
  0 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-14 12:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
> +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
> +
> +       In addition to doing everything that clone() system call does,
> +       the clone3() system call:
> +
> +               - allows additional clone flags (all 32 bits in the flags
> +                 parameter to clone() are in use)
> +
> +               - allows user to specify a pid for the child process in its
> +                 active and ancestor pid name spaces.

Someone (sorry, can't find the old mail) pointed out last time that
the 'pid_t *__user tidptr' argument needs to be an independent pointer,
in order to allow the same use patterns with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID that you can do with the current clone
implementation.

Moving that argument from clone_struct into the argument list would
also make this a three-argument syscall, which solves the naming problem.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  4:40                 ` Sukadev Bhattiprolu
@ 2009-10-14 16:07                     ` Serge E. Hallyn
  -1 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-14 16:07 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> | > 
> | > Except we can't use clone2() because it conflicts on ia64.  Care to propose
> | > a name you would prefer?
> 
> Yes, I am running out of ideas :-)
> 
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

Well, maybe accept that adding more flags will still require a new syscall,
and call this clone_with_pids(), and only take a single 32-bit flags field?

Or, go back to clone_extended().

-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14 16:07                     ` Serge E. Hallyn
  0 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-14 16:07 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: H. Peter Anvin, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, kosaki.motohiro,
	mingo, Pavel Emelyanov, torvalds, Alexey Dobriyan, roland,
	Eric W. Biederman

Quoting Sukadev Bhattiprolu (sukadev@linux.vnet.ibm.com):
> H. Peter Anvin [hpa@zytor.com] wrote:
> | > 
> | > Except we can't use clone2() because it conflicts on ia64.  Care to propose
> | > a name you would prefer?
> 
> Yes, I am running out of ideas :-)
> 
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

Well, maybe accept that adding more flags will still require a new syscall,
and call this clone_with_pids(), and only take a single 32-bit flags field?

Or, go back to clone_extended().

-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
  2009-10-14 12:26         ` Arnd Bergmann
@ 2009-10-14 18:39             ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14 18:39 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Arnd Bergmann [arndbergmann-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
| On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
| > +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
| > +
| > +       In addition to doing everything that clone() system call does,
| > +       the clone3() system call:
| > +
| > +               - allows additional clone flags (all 32 bits in the flags
| > +                 parameter to clone() are in use)
| > +
| > +               - allows user to specify a pid for the child process in its
| > +                 active and ancestor pid name spaces.
| 
| Someone (sorry, can't find the old mail) pointed out last time that
| the 'pid_t *__user tidptr' argument needs to be an independent pointer,
| in order to allow the same use patterns with CLONE_CHILD_SETTID and
| CLONE_CHILD_CLEARTID that you can do with the current clone
| implementation.

I think it was Oren Laadan.

| 
| Moving that argument from clone_struct into the argument list would
| also make this a three-argument syscall, which solves the naming problem.

How about we pull 'nr_pids' out of clone_args struct and make that
a separate parameter. It seems the odd man out and controls the pid_t
list parameter.
| 
| 	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-14 18:39             ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14 18:39 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Arnd Bergmann [arndbergmann@googlemail.com] wrote:
| On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
| > +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
| > +
| > +       In addition to doing everything that clone() system call does,
| > +       the clone3() system call:
| > +
| > +               - allows additional clone flags (all 32 bits in the flags
| > +                 parameter to clone() are in use)
| > +
| > +               - allows user to specify a pid for the child process in its
| > +                 active and ancestor pid name spaces.
| 
| Someone (sorry, can't find the old mail) pointed out last time that
| the 'pid_t *__user tidptr' argument needs to be an independent pointer,
| in order to allow the same use patterns with CLONE_CHILD_SETTID and
| CLONE_CHILD_CLEARTID that you can do with the current clone
| implementation.

I think it was Oren Laadan.

| 
| Moving that argument from clone_struct into the argument list would
| also make this a three-argument syscall, which solves the naming problem.

How about we pull 'nr_pids' out of clone_args struct and make that
a separate parameter. It seems the odd man out and controls the pid_t
list parameter.
| 
| 	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14  1:13                 ` H. Peter Anvin
@ 2009-10-14 22:36                     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14 22:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Roland McGrath, Pavel Emelyanov

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| > 
| > Oy.  I didn't realize there was serious consideration of having more than
| > 32 flags.  IMHO it would be a bad choice, since they could only be used via
| > clone3.  Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| > 
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| > 
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| > 
| 
| Overall it seems sane to:
| 
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
|    argument;
| c) any new expansion can be via the struct, which may want to have
|    an "c3_flags" field first in the structure.

Ok, So will this work ?

	struct clone_args {
		u32 flags_high;		/* new clone flags (higher bits) */ 
		u32 reserved1;
		u32 nr_pids;
		u32 reserved2;
		u64 child_stack_base;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u64 reserved3;
	};

	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

Even on 64bit architectures the applications have to use sys_clone3() for
the extended features.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14 22:36                     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-14 22:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

H. Peter Anvin [hpa@zytor.com] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| > 
| > Oy.  I didn't realize there was serious consideration of having more than
| > 32 flags.  IMHO it would be a bad choice, since they could only be used via
| > clone3.  Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| > 
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| > 
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| > 
| 
| Overall it seems sane to:
| 
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
|    argument;
| c) any new expansion can be via the struct, which may want to have
|    an "c3_flags" field first in the structure.

Ok, So will this work ?

	struct clone_args {
		u32 flags_high;		/* new clone flags (higher bits) */ 
		u32 reserved1;
		u32 nr_pids;
		u32 reserved2;
		u64 child_stack_base;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u64 reserved3;
	};

	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

Even on 64bit architectures the applications have to use sys_clone3() for
the extended features.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <20091014223634.GB3515-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-14 22:49                       ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14 22:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Roland McGrath, Pavel Emelyanov

On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> | 
> | Overall it seems sane to:
> | 
> | a) make it an actual 3-argument call;
> | b) make the existing flags a u32 forever, and make it a separate
> |    argument;
> | c) any new expansion can be via the struct, which may want to have
> |    an "c3_flags" field first in the structure.
> 
> Ok, So will this work ?
> 
> 	struct clone_args {
> 		u32 flags_high;		/* new clone flags (higher bits) */ 
> 		u32 reserved1;
> 		u32 nr_pids;
> 		u32 reserved2;
> 		u64 child_stack_base;
> 		u64 child_stack_size;
> 		u64 parent_tid_ptr;
> 		u64 child_tid_ptr;
> 		u64 reserved3;
> 	};
> 
> 	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
> 
> Even on 64bit architectures the applications have to use sys_clone3() for
> the extended features.

Yes, although I'd just make flags_high a u64.  The other thing that
might be worthwhile is to have a length field on the structure; that way
we could add new fields at the end if ever necessary in the future.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <20091014223634.GB3515-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-14 22:49                       ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14 22:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Roland McGrath, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [hpa@zytor.com] wrote:
> | 
> | Overall it seems sane to:
> | 
> | a) make it an actual 3-argument call;
> | b) make the existing flags a u32 forever, and make it a separate
> |    argument;
> | c) any new expansion can be via the struct, which may want to have
> |    an "c3_flags" field first in the structure.
> 
> Ok, So will this work ?
> 
> 	struct clone_args {
> 		u32 flags_high;		/* new clone flags (higher bits) */ 
> 		u32 reserved1;
> 		u32 nr_pids;
> 		u32 reserved2;
> 		u64 child_stack_base;
> 		u64 child_stack_size;
> 		u64 parent_tid_ptr;
> 		u64 child_tid_ptr;
> 		u64 reserved3;
> 	};
> 
> 	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
> 
> Even on 64bit architectures the applications have to use sys_clone3() for
> the extended features.

Yes, although I'd just make flags_high a u64.  The other thing that
might be worthwhile is to have a length field on the structure; that way
we could add new fields at the end if ever necessary in the future.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-14 22:49                       ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-14 22:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Roland McGrath, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oren Laadan,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI, Nathan Lynch,
	arnd-r2nGTMty4D4, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> | 
> | Overall it seems sane to:
> | 
> | a) make it an actual 3-argument call;
> | b) make the existing flags a u32 forever, and make it a separate
> |    argument;
> | c) any new expansion can be via the struct, which may want to have
> |    an "c3_flags" field first in the structure.
> 
> Ok, So will this work ?
> 
> 	struct clone_args {
> 		u32 flags_high;		/* new clone flags (higher bits) */ 
> 		u32 reserved1;
> 		u32 nr_pids;
> 		u32 reserved2;
> 		u64 child_stack_base;
> 		u64 child_stack_size;
> 		u64 parent_tid_ptr;
> 		u64 child_tid_ptr;
> 		u64 reserved3;
> 	};
> 
> 	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
> 
> Even on 64bit architectures the applications have to use sys_clone3() for
> the extended features.

Yes, although I'd just make flags_high a u64.  The other thing that
might be worthwhile is to have a length field on the structure; that way
we could add new fields at the end if ever necessary in the future.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-14 22:49                       ` H. Peter Anvin
@ 2009-10-15  0:17                           ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-15  0:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Roland McGrath, Pavel Emelyanov

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
| > H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| > | 
| > | Overall it seems sane to:
| > | 
| > | a) make it an actual 3-argument call;
| > | b) make the existing flags a u32 forever, and make it a separate
| > |    argument;
| > | c) any new expansion can be via the struct, which may want to have
| > |    an "c3_flags" field first in the structure.
| > 
| > Ok, So will this work ?
| > 
| > 	struct clone_args {
| > 		u32 flags_high;		/* new clone flags (higher bits) */ 
| > 		u32 reserved1;
| > 		u32 nr_pids;
| > 		u32 reserved2;
| > 		u64 child_stack_base;
| > 		u64 child_stack_size;
| > 		u64 parent_tid_ptr;
| > 		u64 child_tid_ptr;
| > 		u64 reserved3;
| > 	};
| > 
| > 	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
| > 
| > Even on 64bit architectures the applications have to use sys_clone3() for
| > the extended features.
| 
| Yes, although I'd just make flags_high a u64.

so we allow 96 bits for flags ?

| The other thing that might be worthwhile is to have a length field on
| the structure; that way we could add new fields at the end if ever
| necessary in the future.

So:
	struct clone_args {
		u64 flags_high;		/* new clone flags (higher bits) */ 
		u64 reserved1;

		u32 nr_pids;
		u32 clone_args_size;

		u64 child_stack_base;
		u64 child_stack_size;

		u64 parent_tid_ptr;
		u64 child_tid_ptr;
	};

	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

BTW, on 64-bit architectures, the flags_low would be 64-bits, but the high-
bits there would be ignored right ?

Not sure if we need a second reserved field now that we add ->clone_args_size.

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-15  0:17                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-15  0:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, Nathan Lynch, arnd,
	peterz, Louis.Rilling, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

H. Peter Anvin [hpa@zytor.com] wrote:
| On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
| > H. Peter Anvin [hpa@zytor.com] wrote:
| > | 
| > | Overall it seems sane to:
| > | 
| > | a) make it an actual 3-argument call;
| > | b) make the existing flags a u32 forever, and make it a separate
| > |    argument;
| > | c) any new expansion can be via the struct, which may want to have
| > |    an "c3_flags" field first in the structure.
| > 
| > Ok, So will this work ?
| > 
| > 	struct clone_args {
| > 		u32 flags_high;		/* new clone flags (higher bits) */ 
| > 		u32 reserved1;
| > 		u32 nr_pids;
| > 		u32 reserved2;
| > 		u64 child_stack_base;
| > 		u64 child_stack_size;
| > 		u64 parent_tid_ptr;
| > 		u64 child_tid_ptr;
| > 		u64 reserved3;
| > 	};
| > 
| > 	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
| > 
| > Even on 64bit architectures the applications have to use sys_clone3() for
| > the extended features.
| 
| Yes, although I'd just make flags_high a u64.

so we allow 96 bits for flags ?

| The other thing that might be worthwhile is to have a length field on
| the structure; that way we could add new fields at the end if ever
| necessary in the future.

So:
	struct clone_args {
		u64 flags_high;		/* new clone flags (higher bits) */ 
		u64 reserved1;

		u32 nr_pids;
		u32 clone_args_size;

		u64 child_stack_base;
		u64 child_stack_size;

		u64 parent_tid_ptr;
		u64 child_tid_ptr;
	};

	sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

BTW, on 64-bit architectures, the flags_low would be 64-bits, but the high-
bits there would be ignored right ?

Not sure if we need a second reserved field now that we add ->clone_args_size.

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
  2009-10-13 11:50         ` Pavel Emelyanov
@ 2009-10-15  0:24             ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-15  0:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA

Pavel Emelyanov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
| Sukadev Bhattiprolu wrote:
| > 
| > Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
| > 
| 
| [ snip ]
| 
| > @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
| >  	return 0;
| >  }
| >  
| > -static int alloc_pidmap(struct pid_namespace *pid_ns)
| > +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
| 
| Please no! Better create another function, that will just
| atomic_test_and_set() the proper bit in the map. This one
| is heavy enough already.

I had a set_pidmap() function before:

	http://lkml.org/lkml/2009/8/7/24

but dropped it based on some comments and the current version evolved
from it. I could go back to that if it still makes sense.

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
@ 2009-10-15  0:24             ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-15  0:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Pavel Emelyanov [xemul@openvz.org] wrote:
| Sukadev Bhattiprolu wrote:
| > 
| > Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
| > 
| 
| [ snip ]
| 
| > @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
| >  	return 0;
| >  }
| >  
| > -static int alloc_pidmap(struct pid_namespace *pid_ns)
| > +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
| 
| Please no! Better create another function, that will just
| atomic_test_and_set() the proper bit in the map. This one
| is heavy enough already.

I had a set_pidmap() function before:

	http://lkml.org/lkml/2009/8/7/24

but dropped it based on some comments and the current version evolved
from it. I could go back to that if it still makes sense.

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-13  4:54     ` Sukadev Bhattiprolu
@ 2009-10-16  4:20         ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16  4:20 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Here is an updated patch with the following interface:

	long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
			pid_t *pids);

There are just two other (minor) changes pending to this patchset:

	- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
	- PATCH 10: update documentation to reflect new interface.

If this looks ok, we repost entire patchset next week.
---

Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for sys_clone3() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if clone3() needs to be further extended in the future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  Its probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.


Changelog[v9-rc1]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field
	
Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

---
 arch/x86/include/asm/syscalls.h    |    2 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/process_32.c       |   61 +++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/types.h              |   11 ++++
 kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 171 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-10-15 20:53:22.000000000 -0700
@@ -204,6 +204,17 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 clone_args_size;
+	u64 reserved1;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-15 20:38:53.000000000 -0700
@@ -55,6 +55,8 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone3(unsigned int flags_low,
+			struct clone_args __user *cs, pid_t *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-15 20:38:53.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone3		337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-15 20:38:53.000000000 -0700
@@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
+		pid_t __user *pids)
+{
+	int rc;
+	struct clone_args kcs;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * 	 could be that kernel is newer and has an extended structure.
+	 * 	 When that happens, this check needs to be smarter (and we
+	 * 	 need an additional copy_from_user()). For now, assume exact
+	 * 	 match.
+	 */
+	if (kcs.clone_args_size != sizeof(kcs))
+		return -EINVAL;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kcs.reserved1 || kcs.clone_flags_high)
+		return -EINVAL;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kcs.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kcs.parent_tid_ptr;
+	child_tid_ptr =  (int *)kcs.child_tid_ptr;
+
+	stack_size = (unsigned long)kcs.child_stack_size;
+	child_stack = (unsigned long)kcs.child_stack_base + stack_size;
+
+	regs = task_pt_regs(current);
+
+	if (!child_stack)
+		child_stack = user_stack_pointer(regs);
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
+				pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-10-15 20:38:53.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-15 20:38:50.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-15 20:38:53.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-16  4:20         ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16  4:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Here is an updated patch with the following interface:

	long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
			pid_t *pids);

There are just two other (minor) changes pending to this patchset:

	- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
	- PATCH 10: update documentation to reflect new interface.

If this looks ok, we repost entire patchset next week.
---

Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for sys_clone3() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if clone3() needs to be further extended in the future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  Its probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.


Changelog[v9-rc1]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field
	
Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

---
 arch/x86/include/asm/syscalls.h    |    2 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/process_32.c       |   61 +++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/types.h              |   11 ++++
 kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 171 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-10-15 20:53:22.000000000 -0700
@@ -204,6 +204,17 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 clone_args_size;
+	u64 reserved1;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-10-15 20:38:53.000000000 -0700
@@ -55,6 +55,8 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone3(unsigned int flags_low,
+			struct clone_args __user *cs, pid_t *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-10-15 20:38:53.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone3		337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-10-15 20:38:53.000000000 -0700
@@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
+		pid_t __user *pids)
+{
+	int rc;
+	struct clone_args kcs;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * 	 could be that kernel is newer and has an extended structure.
+	 * 	 When that happens, this check needs to be smarter (and we
+	 * 	 need an additional copy_from_user()). For now, assume exact
+	 * 	 match.
+	 */
+	if (kcs.clone_args_size != sizeof(kcs))
+		return -EINVAL;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kcs.reserved1 || kcs.clone_flags_high)
+		return -EINVAL;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kcs.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kcs.parent_tid_ptr;
+	child_tid_ptr =  (int *)kcs.child_tid_ptr;
+
+	stack_size = (unsigned long)kcs.child_stack_size;
+	child_stack = (unsigned long)kcs.child_stack_base + stack_size;
+
+	regs = task_pt_regs(current);
+
+	if (!child_stack)
+		child_stack = user_stack_pointer(regs);
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
+				pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-10-15 20:38:53.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-10-15 20:38:50.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-10-15 20:38:53.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-16  4:20         ` Sukadev Bhattiprolu
@ 2009-10-16  6:25             ` Michael Kerrisk
  -1 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-16  6:25 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Pavel Emelyanov, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Michael Kerrisk, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	mingo-X9Un+BFzKDI, Alexey Dobriyan, arnd-r2nGTMty4D4,
	Nathan Lynch, roland-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Sukadev

On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> Here is an updated patch with the following interface:
>
>        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
>                        pid_t *pids);
>
> There are just two other (minor) changes pending to this patchset:
>
>        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
>        - PATCH 10: update documentation to reflect new interface.
>
> If this looks ok, we repost entire patchset next week.

I know I'm late to this discussion, but why the name clone3()? It's
not consistent with any other convention used fo syscall naming,
AFAICS. I think a name like clone_ext() or clonex() (for extended)
might make more sense.

Cheers,

Michael


> ---
>
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
>
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
>
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
>
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
>
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
>
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
>
> NOTE:
>        - System calls are restricted to 6 parameters and the number and sizes
>          of parameters needed for sys_clone3() exceed 6 integers. The new
>          prototype works around this restriction while providing some
>          flexibility if clone3() needs to be further extended in the future.
> TODO:
>        - We should convert clone-flags to 64-bit value in all architectures.
>          Its probably best to do that as a separate patchset since clone_flags
>          touches several functions and that patchset seems independent of this
>          new system call.
>
>
> Changelog[v9-rc1]:
>        - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
>          architectures split the new clone-flags into 'low' and 'high'
>          words and pass in the 'lower' flags as the first argument.
>          This would maintain similarity of the clone3() with clone()/
>          clone2(). Also has the side-effect of the name matching the
>          number of parameters :-)
>        - [Roland McGrath] Rename structure to 'clone_args' and add a
>          'child_stack_size' field
>
> Changelog[v8]
>        - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
>          must be 64-bit.
>        - clone2() is in use in IA64. Rename system call to clone3().
>
> Changelog[v7]:
>        - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
>          and group parameters into a new 'struct clone_struct' object.
>
> Changelog[v6]:
>        - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
>          Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
>          is constant across architectures.
>        - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
>          'unum_pids < 0' check.
>
> Changelog[v4]:
>        - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
>
> Changelog[v3]:
>        - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
>          in the target_pids[] list and setting it 0. See copy_target_pids()).
>        - (Oren Laadan) Specified target pids should apply only to youngest
>          pid-namespaces (see copy_target_pids())
>        - (Matt Helsley) Update patch description.
>
> Changelog[v2]:
>        - Remove unnecessary printk and add a note to callers of
>          copy_target_pids() to free target_pids.
>        - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
>        - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
>          'num_pids == 0' (fall back to normal clone()).
>        - Move arch-independent code (sanity checks and copy-in of target-pids)
>          into kernel/fork.c and simplify sys_clone_with_pids()
>
> Changelog[v1]:
>        - Fixed some compile errors (had fixed these errors earlier in my
>          git tree but had not refreshed patches before emailing them)
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>
> ---
>  arch/x86/include/asm/syscalls.h    |    2
>  arch/x86/include/asm/unistd_32.h   |    1
>  arch/x86/kernel/process_32.c       |   61 +++++++++++++++++++++++
>  arch/x86/kernel/syscall_table_32.S |    1
>  include/linux/types.h              |   11 ++++
>  kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
>  6 files changed, 171 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h        2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/include/linux/types.h     2009-10-15 20:53:22.000000000 -0700
> @@ -204,6 +204,17 @@ struct ustat {
>        char                    f_fpack[6];
>  };
>
> +struct clone_args {
> +       u64 clone_flags_high;
> +       u64 child_stack_base;
> +       u64 child_stack_size;
> +       u64 parent_tid_ptr;
> +       u64 child_tid_ptr;
> +       u32 nr_pids;
> +       u32 clone_args_size;
> +       u64 reserved1;
> +};
> +
>  #endif /* __KERNEL__ */
>  #endif /*  __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h      2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h   2009-10-15 20:38:53.000000000 -0700
> @@ -55,6 +55,8 @@ struct sel_arg_struct;
>  struct oldold_utsname;
>  struct old_utsname;
>
> +asmlinkage long sys_clone3(unsigned int flags_low,
> +                       struct clone_args __user *cs, pid_t *pids);
>  asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
>                          unsigned long, unsigned long, unsigned long);
>  asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h     2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h  2009-10-15 20:38:53.000000000 -0700
> @@ -342,6 +342,7 @@
>  #define __NR_pwritev           334
>  #define __NR_rt_tgsigqueueinfo 335
>  #define __NR_perf_counter_open 336
> +#define __NR_clone3            337
>
>  #ifdef __KERNEL__
>
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c      2009-10-15 20:38:53.000000000 -0700
> @@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
>        return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>
> +asmlinkage long
> +sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
> +               pid_t __user *pids)
> +{
> +       int rc;
> +       struct clone_args kcs;
> +       unsigned long flags;
> +       int __user *parent_tid_ptr;
> +       int __user *child_tid_ptr;
> +       unsigned long __user child_stack;
> +       unsigned long stack_size;
> +       struct pt_regs *regs;
> +
> +       rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> +       if (rc)
> +               return -EFAULT;
> +
> +       /*
> +        * TODO: If size of clone_args is not what the kernel expects, it
> +        *       could be that kernel is newer and has an extended structure.
> +        *       When that happens, this check needs to be smarter (and we
> +        *       need an additional copy_from_user()). For now, assume exact
> +        *       match.
> +        */
> +       if (kcs.clone_args_size != sizeof(kcs))
> +               return -EINVAL;
> +
> +       /*
> +        * To avoid future compatibility issues, ensure unused fields are 0.
> +        */
> +       if (kcs.reserved1 || kcs.clone_flags_high)
> +               return -EINVAL;
> +
> +       /*
> +        * TODO: Convert 'clone-flags' to 64-bits on all architectures.
> +        * TODO: When ->clone_flags_high is non-zero, copy it in to the
> +        *       higher word(s) of 'flags':
> +        *
> +        *              flags = (kcs.clone_flags_high << 32) | flags_low;
> +        */
> +       flags = flags_low;
> +       parent_tid_ptr = (int *)kcs.parent_tid_ptr;
> +       child_tid_ptr =  (int *)kcs.child_tid_ptr;
> +
> +       stack_size = (unsigned long)kcs.child_stack_size;
> +       child_stack = (unsigned long)kcs.child_stack_base + stack_size;
> +
> +       regs = task_pt_regs(current);
> +
> +       if (!child_stack)
> +               child_stack = user_stack_pointer(regs);
> +
> +       /*
> +        * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
> +        *       to several functions. Need to convert clone_flags to 64-bit.
> +        */
> +       return do_fork_with_pids(flags, child_stack, regs, stack_size,
> +                               parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
> +                               pids);
> +}
> +
>  /*
>  * sys_execve() executes a new program.
>  */
> Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S   2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/syscall_table_32.S        2009-10-15 20:38:53.000000000 -0700
> @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
>        .long sys_pwritev
>        .long sys_rt_tgsigqueueinfo     /* 335 */
>        .long sys_perf_counter_open
> +       .long sys_clone3
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c        2009-10-15 20:38:50.000000000 -0700
> +++ linux-2.6/kernel/fork.c     2009-10-15 20:38:53.000000000 -0700
> @@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
>  }
>
>  /*
> + * If user specified any 'target-pids' in @upid_setp, copy them from
> + * user and return a pointer to a local copy of the list of pids. The
> + * caller must free the list, when they are done using it.
> + *
> + * If user did not specify any target pids, return NULL (caller should
> + * treat this like normal clone).
> + *
> + * On any errors, return the error code
> + */
> +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
> +{
> +       int j;
> +       int rc;
> +       int size;
> +       int knum_pids;          /* # of pids needed in kernel */
> +       pid_t *target_pids;
> +
> +       if (!unum_pids)
> +               return NULL;
> +
> +       knum_pids = task_pid(current)->level + 1;
> +       if (unum_pids > knum_pids)
> +               return ERR_PTR(-EINVAL);
> +
> +       /*
> +        * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
> +        * and set it to 0. This last entry in target_pids[] corresponds to the
> +        * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
> +        * specified. If CLONE_NEWPID was not specified, this last entry will
> +        * simply be ignored.
> +        */
> +       target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
> +       if (!target_pids)
> +               return ERR_PTR(-ENOMEM);
> +
> +       /*
> +        * A process running in a level 2 pid namespace has three pid namespaces
> +        * and hence three pid numbers. If this process is checkpointed,
> +        * information about these three namespaces are saved. We refer to these
> +        * namespaces as 'known namespaces'.
> +        *
> +        * If this checkpointed process is however restarted in a level 3 pid
> +        * namespace, the restarted process has an extra ancestor pid namespace
> +        * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
> +        *
> +        * During restart, the process requests specific pids for its 'known
> +        * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
> +        *
> +        * Since the requested-pids correspond to 'known namespaces' and since
> +        * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
> +        * namespaces', copy requested pids to the back-end of target_pids[]
> +        * (i.e before the last entry for CLONE_NEWPID mentioned above).
> +        * Any entries in target_pids[] not corresponding to a requested pid
> +        * will be set to zero and kernel assigns a pid in those namespaces.
> +        *
> +        * NOTE: The order of pids in target_pids[] is oldest pid namespace to
> +        *       youngest (target_pids[0] corresponds to init_pid_ns). i.e.
> +        *       the order is:
> +        *
> +        *              - pids for 'unknown-namespaces' (if any)
> +        *              - pids for 'known-namespaces' (requested pids)
> +        *              - 0 in the last entry (for CLONE_NEWPID).
> +        */
> +       j = knum_pids - unum_pids;
> +       size = unum_pids * sizeof(pid_t);
> +
> +       rc = copy_from_user(&target_pids[j], upids, size);
> +       if (rc) {
> +               rc = -EFAULT;
> +               goto out_free;
> +       }
> +
> +       return target_pids;
> +
> +out_free:
> +       kfree(target_pids);
> +       return ERR_PTR(rc);
> +}
> +
> +/*
>  *  Ok, this is the main fork-routine.
>  *
>  * It copies the process, and if successful kick-starts
> @@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
>        struct task_struct *p;
>        int trace = 0;
>        long nr;
> -       pid_t *target_pids = NULL;
> +       pid_t *target_pids;
>
>        /*
>         * Do some preliminary argument and permissions checking before we
> @@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
>                }
>        }
>
> +       target_pids = copy_target_pids(num_pids, upids);
> +       if (target_pids) {
> +               if (IS_ERR(target_pids))
> +                       return PTR_ERR(target_pids);
> +
> +               nr = -EPERM;
> +               if (!capable(CAP_SYS_ADMIN))
> +                       goto out_free;
> +       }
> +
>        /*
>         * When called from kernel_thread, don't do user tracing stuff.
>         */
> @@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
>        } else {
>                nr = PTR_ERR(p);
>        }
> +
> +out_free:
> +       kfree(target_pids);
> +
>        return nr;
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-16  6:25             ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-16  6:25 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev,
	Michael Kerrisk

Hi Sukadev

On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> Here is an updated patch with the following interface:
>
>        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
>                        pid_t *pids);
>
> There are just two other (minor) changes pending to this patchset:
>
>        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
>        - PATCH 10: update documentation to reflect new interface.
>
> If this looks ok, we repost entire patchset next week.

I know I'm late to this discussion, but why the name clone3()? It's
not consistent with any other convention used fo syscall naming,
AFAICS. I think a name like clone_ext() or clonex() (for extended)
might make more sense.

Cheers,

Michael


> ---
>
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
>
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
>
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
>
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
>
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
>
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
>
> NOTE:
>        - System calls are restricted to 6 parameters and the number and sizes
>          of parameters needed for sys_clone3() exceed 6 integers. The new
>          prototype works around this restriction while providing some
>          flexibility if clone3() needs to be further extended in the future.
> TODO:
>        - We should convert clone-flags to 64-bit value in all architectures.
>          Its probably best to do that as a separate patchset since clone_flags
>          touches several functions and that patchset seems independent of this
>          new system call.
>
>
> Changelog[v9-rc1]:
>        - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
>          architectures split the new clone-flags into 'low' and 'high'
>          words and pass in the 'lower' flags as the first argument.
>          This would maintain similarity of the clone3() with clone()/
>          clone2(). Also has the side-effect of the name matching the
>          number of parameters :-)
>        - [Roland McGrath] Rename structure to 'clone_args' and add a
>          'child_stack_size' field
>
> Changelog[v8]
>        - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
>          must be 64-bit.
>        - clone2() is in use in IA64. Rename system call to clone3().
>
> Changelog[v7]:
>        - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
>          and group parameters into a new 'struct clone_struct' object.
>
> Changelog[v6]:
>        - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
>          Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
>          is constant across architectures.
>        - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
>          'unum_pids < 0' check.
>
> Changelog[v4]:
>        - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
>
> Changelog[v3]:
>        - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
>          in the target_pids[] list and setting it 0. See copy_target_pids()).
>        - (Oren Laadan) Specified target pids should apply only to youngest
>          pid-namespaces (see copy_target_pids())
>        - (Matt Helsley) Update patch description.
>
> Changelog[v2]:
>        - Remove unnecessary printk and add a note to callers of
>          copy_target_pids() to free target_pids.
>        - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
>        - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
>          'num_pids == 0' (fall back to normal clone()).
>        - Move arch-independent code (sanity checks and copy-in of target-pids)
>          into kernel/fork.c and simplify sys_clone_with_pids()
>
> Changelog[v1]:
>        - Fixed some compile errors (had fixed these errors earlier in my
>          git tree but had not refreshed patches before emailing them)
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
>
> ---
>  arch/x86/include/asm/syscalls.h    |    2
>  arch/x86/include/asm/unistd_32.h   |    1
>  arch/x86/kernel/process_32.c       |   61 +++++++++++++++++++++++
>  arch/x86/kernel/syscall_table_32.S |    1
>  include/linux/types.h              |   11 ++++
>  kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
>  6 files changed, 171 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h        2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/include/linux/types.h     2009-10-15 20:53:22.000000000 -0700
> @@ -204,6 +204,17 @@ struct ustat {
>        char                    f_fpack[6];
>  };
>
> +struct clone_args {
> +       u64 clone_flags_high;
> +       u64 child_stack_base;
> +       u64 child_stack_size;
> +       u64 parent_tid_ptr;
> +       u64 child_tid_ptr;
> +       u32 nr_pids;
> +       u32 clone_args_size;
> +       u64 reserved1;
> +};
> +
>  #endif /* __KERNEL__ */
>  #endif /*  __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h      2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h   2009-10-15 20:38:53.000000000 -0700
> @@ -55,6 +55,8 @@ struct sel_arg_struct;
>  struct oldold_utsname;
>  struct old_utsname;
>
> +asmlinkage long sys_clone3(unsigned int flags_low,
> +                       struct clone_args __user *cs, pid_t *pids);
>  asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
>                          unsigned long, unsigned long, unsigned long);
>  asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h     2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h  2009-10-15 20:38:53.000000000 -0700
> @@ -342,6 +342,7 @@
>  #define __NR_pwritev           334
>  #define __NR_rt_tgsigqueueinfo 335
>  #define __NR_perf_counter_open 336
> +#define __NR_clone3            337
>
>  #ifdef __KERNEL__
>
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c      2009-10-15 20:38:53.000000000 -0700
> @@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
>        return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>
> +asmlinkage long
> +sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
> +               pid_t __user *pids)
> +{
> +       int rc;
> +       struct clone_args kcs;
> +       unsigned long flags;
> +       int __user *parent_tid_ptr;
> +       int __user *child_tid_ptr;
> +       unsigned long __user child_stack;
> +       unsigned long stack_size;
> +       struct pt_regs *regs;
> +
> +       rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> +       if (rc)
> +               return -EFAULT;
> +
> +       /*
> +        * TODO: If size of clone_args is not what the kernel expects, it
> +        *       could be that kernel is newer and has an extended structure.
> +        *       When that happens, this check needs to be smarter (and we
> +        *       need an additional copy_from_user()). For now, assume exact
> +        *       match.
> +        */
> +       if (kcs.clone_args_size != sizeof(kcs))
> +               return -EINVAL;
> +
> +       /*
> +        * To avoid future compatibility issues, ensure unused fields are 0.
> +        */
> +       if (kcs.reserved1 || kcs.clone_flags_high)
> +               return -EINVAL;
> +
> +       /*
> +        * TODO: Convert 'clone-flags' to 64-bits on all architectures.
> +        * TODO: When ->clone_flags_high is non-zero, copy it in to the
> +        *       higher word(s) of 'flags':
> +        *
> +        *              flags = (kcs.clone_flags_high << 32) | flags_low;
> +        */
> +       flags = flags_low;
> +       parent_tid_ptr = (int *)kcs.parent_tid_ptr;
> +       child_tid_ptr =  (int *)kcs.child_tid_ptr;
> +
> +       stack_size = (unsigned long)kcs.child_stack_size;
> +       child_stack = (unsigned long)kcs.child_stack_base + stack_size;
> +
> +       regs = task_pt_regs(current);
> +
> +       if (!child_stack)
> +               child_stack = user_stack_pointer(regs);
> +
> +       /*
> +        * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
> +        *       to several functions. Need to convert clone_flags to 64-bit.
> +        */
> +       return do_fork_with_pids(flags, child_stack, regs, stack_size,
> +                               parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
> +                               pids);
> +}
> +
>  /*
>  * sys_execve() executes a new program.
>  */
> Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S   2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/syscall_table_32.S        2009-10-15 20:38:53.000000000 -0700
> @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
>        .long sys_pwritev
>        .long sys_rt_tgsigqueueinfo     /* 335 */
>        .long sys_perf_counter_open
> +       .long sys_clone3
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c        2009-10-15 20:38:50.000000000 -0700
> +++ linux-2.6/kernel/fork.c     2009-10-15 20:38:53.000000000 -0700
> @@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
>  }
>
>  /*
> + * If user specified any 'target-pids' in @upid_setp, copy them from
> + * user and return a pointer to a local copy of the list of pids. The
> + * caller must free the list, when they are done using it.
> + *
> + * If user did not specify any target pids, return NULL (caller should
> + * treat this like normal clone).
> + *
> + * On any errors, return the error code
> + */
> +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
> +{
> +       int j;
> +       int rc;
> +       int size;
> +       int knum_pids;          /* # of pids needed in kernel */
> +       pid_t *target_pids;
> +
> +       if (!unum_pids)
> +               return NULL;
> +
> +       knum_pids = task_pid(current)->level + 1;
> +       if (unum_pids > knum_pids)
> +               return ERR_PTR(-EINVAL);
> +
> +       /*
> +        * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
> +        * and set it to 0. This last entry in target_pids[] corresponds to the
> +        * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
> +        * specified. If CLONE_NEWPID was not specified, this last entry will
> +        * simply be ignored.
> +        */
> +       target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
> +       if (!target_pids)
> +               return ERR_PTR(-ENOMEM);
> +
> +       /*
> +        * A process running in a level 2 pid namespace has three pid namespaces
> +        * and hence three pid numbers. If this process is checkpointed,
> +        * information about these three namespaces are saved. We refer to these
> +        * namespaces as 'known namespaces'.
> +        *
> +        * If this checkpointed process is however restarted in a level 3 pid
> +        * namespace, the restarted process has an extra ancestor pid namespace
> +        * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
> +        *
> +        * During restart, the process requests specific pids for its 'known
> +        * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
> +        *
> +        * Since the requested-pids correspond to 'known namespaces' and since
> +        * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
> +        * namespaces', copy requested pids to the back-end of target_pids[]
> +        * (i.e before the last entry for CLONE_NEWPID mentioned above).
> +        * Any entries in target_pids[] not corresponding to a requested pid
> +        * will be set to zero and kernel assigns a pid in those namespaces.
> +        *
> +        * NOTE: The order of pids in target_pids[] is oldest pid namespace to
> +        *       youngest (target_pids[0] corresponds to init_pid_ns). i.e.
> +        *       the order is:
> +        *
> +        *              - pids for 'unknown-namespaces' (if any)
> +        *              - pids for 'known-namespaces' (requested pids)
> +        *              - 0 in the last entry (for CLONE_NEWPID).
> +        */
> +       j = knum_pids - unum_pids;
> +       size = unum_pids * sizeof(pid_t);
> +
> +       rc = copy_from_user(&target_pids[j], upids, size);
> +       if (rc) {
> +               rc = -EFAULT;
> +               goto out_free;
> +       }
> +
> +       return target_pids;
> +
> +out_free:
> +       kfree(target_pids);
> +       return ERR_PTR(rc);
> +}
> +
> +/*
>  *  Ok, this is the main fork-routine.
>  *
>  * It copies the process, and if successful kick-starts
> @@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
>        struct task_struct *p;
>        int trace = 0;
>        long nr;
> -       pid_t *target_pids = NULL;
> +       pid_t *target_pids;
>
>        /*
>         * Do some preliminary argument and permissions checking before we
> @@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
>                }
>        }
>
> +       target_pids = copy_target_pids(num_pids, upids);
> +       if (target_pids) {
> +               if (IS_ERR(target_pids))
> +                       return PTR_ERR(target_pids);
> +
> +               nr = -EPERM;
> +               if (!capable(CAP_SYS_ADMIN))
> +                       goto out_free;
> +       }
> +
>        /*
>         * When called from kernel_thread, don't do user tracing stuff.
>         */
> @@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
>        } else {
>                nr = PTR_ERR(p);
>        }
> +
> +out_free:
> +       kfree(target_pids);
> +
>        return nr;
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]             ` <cfd18e0f0910152325m4a9125c2q18f36f5bd7d4a0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-16 18:06               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 18:06 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
| Hi Sukadev
| 
| On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
| <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
| > Here is an updated patch with the following interface:
| >
| >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
| >                        pid_t *pids);
| >
| > There are just two other (minor) changes pending to this patchset:
| >
| >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
| >        - PATCH 10: update documentation to reflect new interface.
| >
| > If this looks ok, we repost entire patchset next week.
| 
| I know I'm late to this discussion, but why the name clone3()? It's
| not consistent with any other convention used fo syscall naming,
| AFAICS. I think a name like clone_ext() or clonex() (for extended)
| might make more sense.

Sure, we talked about calling it clone_extended() and I can go back
to that.

Only minor concern with that name was if this new call ever needs to
be extended, what would we call it :-). With clone3() we could add a
real/fake parameter and call it clone4() :-p

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]             ` <cfd18e0f0910152325m4a9125c2q18f36f5bd7d4a0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-16 18:06               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 18:06 UTC (permalink / raw)
  To: mtk.manpages
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Michael Kerrisk [mtk.manpages@googlemail.com] wrote:
| Hi Sukadev
| 
| On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
| <sukadev@linux.vnet.ibm.com> wrote:
| > Here is an updated patch with the following interface:
| >
| >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
| >                        pid_t *pids);
| >
| > There are just two other (minor) changes pending to this patchset:
| >
| >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
| >        - PATCH 10: update documentation to reflect new interface.
| >
| > If this looks ok, we repost entire patchset next week.
| 
| I know I'm late to this discussion, but why the name clone3()? It's
| not consistent with any other convention used fo syscall naming,
| AFAICS. I think a name like clone_ext() or clonex() (for extended)
| might make more sense.

Sure, we talked about calling it clone_extended() and I can go back
to that.

Only minor concern with that name was if this new call ever needs to
be extended, what would we call it :-). With clone3() we could add a
real/fake parameter and call it clone4() :-p

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-16 18:06               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 18:06 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oren Laadan,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
| Hi Sukadev
| 
| On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
| <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
| > Here is an updated patch with the following interface:
| >
| >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
| >                        pid_t *pids);
| >
| > There are just two other (minor) changes pending to this patchset:
| >
| >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
| >        - PATCH 10: update documentation to reflect new interface.
| >
| > If this looks ok, we repost entire patchset next week.
| 
| I know I'm late to this discussion, but why the name clone3()? It's
| not consistent with any other convention used fo syscall naming,
| AFAICS. I think a name like clone_ext() or clonex() (for extended)
| might make more sense.

Sure, we talked about calling it clone_extended() and I can go back
to that.

Only minor concern with that name was if this new call ever needs to
be extended, what would we call it :-). With clone3() we could add a
real/fake parameter and call it clone4() :-p

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found] ` <20091013044925.GA28181-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-10-13 23:49     ` H. Peter Anvin
@ 2009-10-16 19:22   ` Daniel Lezcano
  12 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-16 19:22 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>
> To support application checkpoint/restart, a task must have the same pid it
> had when it was checkpointed.  When containers are nested, the tasks within
> the containers exist in multiple pid namespaces and hence have multiple pids
> to specify during restart.
>
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
>
> Patches 1 through 7 are helper patches, needed for choosing a pid for the
> child process.
>
> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
> documentation on the new system call, some/all of which will eventually
> go into a man page.
>   

Sorry for jumping so late in the discussion and for having maybe my
remarks pointless...

If this syscall is only for checkpoint / restart, why this shouldn't be
used with a future generic sys_restart syscall ?
Otherwise, shouldn't be more convenient to have something usable for
everyone, let's say:

cloneat(pid_t pid, pid_t desiredpid, ...);

Where 'desiredpid' is a hint of for the kernel for the pid to be
allocated (zero means the kernel will choose one for us) and the newly
allocated task is the son of 'pid'.
That looks more consistent with the "<syscall>at" family, 'openat',
'faccessat', 'readlinkat', etc ... and usable for something else than
the checkpoint / restart.

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-13  4:49 ` Sukadev Bhattiprolu
  (?)
@ 2009-10-16 19:22 ` Daniel Lezcano
  2009-10-16 19:44     ` Sukadev Bhattiprolu
       [not found]   ` <4AD8C7E4.9000903-GANU6spQydw@public.gmane.org>
  -1 siblings, 2 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-16 19:22 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>
> To support application checkpoint/restart, a task must have the same pid it
> had when it was checkpointed.  When containers are nested, the tasks within
> the containers exist in multiple pid namespaces and hence have multiple pids
> to specify during restart.
>
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
>
> Patches 1 through 7 are helper patches, needed for choosing a pid for the
> child process.
>
> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
> documentation on the new system call, some/all of which will eventually
> go into a man page.
>   

Sorry for jumping so late in the discussion and for having maybe my
remarks pointless...

If this syscall is only for checkpoint / restart, why this shouldn't be
used with a future generic sys_restart syscall ?
Otherwise, shouldn't be more convenient to have something usable for
everyone, let's say:

cloneat(pid_t pid, pid_t desiredpid, ...);

Where 'desiredpid' is a hint of for the kernel for the pid to be
allocated (zero means the kernel will choose one for us) and the newly
allocated task is the son of 'pid'.
That looks more consistent with the "<syscall>at" family, 'openat',
'faccessat', 'readlinkat', etc ... and usable for something else than
the checkpoint / restart.

Thanks
  -- Daniel


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]   ` <4AD8C7E4.9000903-GANU6spQydw@public.gmane.org>
@ 2009-10-16 19:44     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 19:44 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
> Sukadev Bhattiprolu wrote:
>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>
>> To support application checkpoint/restart, a task must have the same pid it
>> had when it was checkpointed.  When containers are nested, the tasks within
>> the containers exist in multiple pid namespaces and hence have multiple pids
>> to specify during restart.
>>
>> This patchset implements a new system call, clone3() that lets a process
>> specify the pids of the child process.
>>
>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>> child process.
>>
>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>> documentation on the new system call, some/all of which will eventually
>> go into a man page.
>>   
>
> Sorry for jumping so late in the discussion and for having maybe my
> remarks pointless...
>
> If this syscall is only for checkpoint / restart, why this shouldn't be
> used with a future generic sys_restart syscall ?

As I tried to explain in PATCH 0/9, the ability to choose a pid is only
for C/R but we are also trying to clone-flags so we won't need yet
another variant of clone() fairly soon.

> Otherwise, shouldn't be more convenient to have something usable for
> everyone, let's say:
>
> cloneat(pid_t pid, pid_t desiredpid, ...);
>
> Where 'desiredpid' is a hint of for the kernel for the pid to be
> allocated (zero means the kernel will choose one for us) and the newly
> allocated task is the son of 'pid'.

Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
P2 ?  I did not know we had a requirement for that. Can you explain the
use-case more ? IOW, why can't P2 create the child P3 by itself ?

Note also that 'desiredpid' must be a list of pids (one for each pid
namespaces that the child will belong to) and hence we need 'nr_pids'
to specify the list. Given that we are limited to 6 parameters to the
syscall, such parameters must be stuffed into 'struct clone_args'.

So we should do something like:

	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
		pid_t *desired_pids)

or (to match the name and parameters, move 'pid' parameter into clone_args)

> That looks more consistent with the "<syscall>at" family, 'openat',
> 'faccessat', 'readlinkat', etc ... and usable for something else than
> the checkpoint / restart.

The subtle difference though is that openat() does not open a file on
behalf of another process and so the 'at' suffix would not apply ?

>
> Thanks
>  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]   ` <4AD8C7E4.9000903-GANU6spQydw@public.gmane.org>
@ 2009-10-16 19:44     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 19:44 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov

Daniel Lezcano [daniel.lezcano@free.fr] wrote:
> Sukadev Bhattiprolu wrote:
>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>
>> To support application checkpoint/restart, a task must have the same pid it
>> had when it was checkpointed.  When containers are nested, the tasks within
>> the containers exist in multiple pid namespaces and hence have multiple pids
>> to specify during restart.
>>
>> This patchset implements a new system call, clone3() that lets a process
>> specify the pids of the child process.
>>
>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>> child process.
>>
>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>> documentation on the new system call, some/all of which will eventually
>> go into a man page.
>>   
>
> Sorry for jumping so late in the discussion and for having maybe my
> remarks pointless...
>
> If this syscall is only for checkpoint / restart, why this shouldn't be
> used with a future generic sys_restart syscall ?

As I tried to explain in PATCH 0/9, the ability to choose a pid is only
for C/R but we are also trying to clone-flags so we won't need yet
another variant of clone() fairly soon.

> Otherwise, shouldn't be more convenient to have something usable for
> everyone, let's say:
>
> cloneat(pid_t pid, pid_t desiredpid, ...);
>
> Where 'desiredpid' is a hint of for the kernel for the pid to be
> allocated (zero means the kernel will choose one for us) and the newly
> allocated task is the son of 'pid'.

Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
P2 ?  I did not know we had a requirement for that. Can you explain the
use-case more ? IOW, why can't P2 create the child P3 by itself ?

Note also that 'desiredpid' must be a list of pids (one for each pid
namespaces that the child will belong to) and hence we need 'nr_pids'
to specify the list. Given that we are limited to 6 parameters to the
syscall, such parameters must be stuffed into 'struct clone_args'.

So we should do something like:

	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
		pid_t *desired_pids)

or (to match the name and parameters, move 'pid' parameter into clone_args)

> That looks more consistent with the "<syscall>at" family, 'openat',
> 'faccessat', 'readlinkat', etc ... and usable for something else than
> the checkpoint / restart.

The subtle difference though is that openat() does not open a file on
behalf of another process and so the 'at' suffix would not apply ?

>
> Thanks
>  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-16 19:44     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-16 19:44 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
> Sukadev Bhattiprolu wrote:
>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>
>> To support application checkpoint/restart, a task must have the same pid it
>> had when it was checkpointed.  When containers are nested, the tasks within
>> the containers exist in multiple pid namespaces and hence have multiple pids
>> to specify during restart.
>>
>> This patchset implements a new system call, clone3() that lets a process
>> specify the pids of the child process.
>>
>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>> child process.
>>
>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>> documentation on the new system call, some/all of which will eventually
>> go into a man page.
>>   
>
> Sorry for jumping so late in the discussion and for having maybe my
> remarks pointless...
>
> If this syscall is only for checkpoint / restart, why this shouldn't be
> used with a future generic sys_restart syscall ?

As I tried to explain in PATCH 0/9, the ability to choose a pid is only
for C/R but we are also trying to clone-flags so we won't need yet
another variant of clone() fairly soon.

> Otherwise, shouldn't be more convenient to have something usable for
> everyone, let's say:
>
> cloneat(pid_t pid, pid_t desiredpid, ...);
>
> Where 'desiredpid' is a hint of for the kernel for the pid to be
> allocated (zero means the kernel will choose one for us) and the newly
> allocated task is the son of 'pid'.

Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
P2 ?  I did not know we had a requirement for that. Can you explain the
use-case more ? IOW, why can't P2 create the child P3 by itself ?

Note also that 'desiredpid' must be a list of pids (one for each pid
namespaces that the child will belong to) and hence we need 'nr_pids'
to specify the list. Given that we are limited to 6 parameters to the
syscall, such parameters must be stuffed into 'struct clone_args'.

So we should do something like:

	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
		pid_t *desired_pids)

or (to match the name and parameters, move 'pid' parameter into clone_args)

> That looks more consistent with the "<syscall>at" family, 'openat',
> 'faccessat', 'readlinkat', etc ... and usable for something else than
> the checkpoint / restart.

The subtle difference though is that openat() does not open a file on
behalf of another process and so the 'at' suffix would not apply ?

>
> Thanks
>  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]               ` <20091016180631.GA31036-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 17:44                 ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 17:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Eric W. Biederman

On Fri, Oct 16, 2009 at 11:06:31AM -0700, Sukadev Bhattiprolu wrote:
> Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
> | Hi Sukadev
> | 
> | On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
> | <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> | > Here is an updated patch with the following interface:
> | >
> | >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
> | >                        pid_t *pids);
> | >
> | > There are just two other (minor) changes pending to this patchset:
> | >
> | >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
> | >        - PATCH 10: update documentation to reflect new interface.
> | >
> | > If this looks ok, we repost entire patchset next week.
> | 
> | I know I'm late to this discussion, but why the name clone3()? It's
> | not consistent with any other convention used fo syscall naming,
> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
> | might make more sense.
> 
> Sure, we talked about calling it clone_extended() and I can go back
> to that.
> 
> Only minor concern with that name was if this new call ever needs to
> be extended, what would we call it :-). With clone3() we could add a
> real/fake parameter and call it clone4() :-p

Perhaps clone64 (somewhat like stat64 for example)?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]               ` <20091016180631.GA31036-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 17:44                 ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 17:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: mtk.manpages, randy.dunlap, arnd, Containers, Nathan Lynch,
	linux-kernel, Louis.Rilling, Eric W. Biederman, kosaki.motohiro,
	hpa, mingo, linux-api, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

On Fri, Oct 16, 2009 at 11:06:31AM -0700, Sukadev Bhattiprolu wrote:
> Michael Kerrisk [mtk.manpages@googlemail.com] wrote:
> | Hi Sukadev
> | 
> | On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
> | <sukadev@linux.vnet.ibm.com> wrote:
> | > Here is an updated patch with the following interface:
> | >
> | >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
> | >                        pid_t *pids);
> | >
> | > There are just two other (minor) changes pending to this patchset:
> | >
> | >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
> | >        - PATCH 10: update documentation to reflect new interface.
> | >
> | > If this looks ok, we repost entire patchset next week.
> | 
> | I know I'm late to this discussion, but why the name clone3()? It's
> | not consistent with any other convention used fo syscall naming,
> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
> | might make more sense.
> 
> Sure, we talked about calling it clone_extended() and I can go back
> to that.
> 
> Only minor concern with that name was if this new call ever needs to
> be extended, what would we call it :-). With clone3() we could add a
> real/fake parameter and call it clone4() :-p

Perhaps clone64 (somewhat like stat64 for example)?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-19 17:44                 ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 17:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Fri, Oct 16, 2009 at 11:06:31AM -0700, Sukadev Bhattiprolu wrote:
> Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
> | Hi Sukadev
> | 
> | On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
> | <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> | > Here is an updated patch with the following interface:
> | >
> | >        long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
> | >                        pid_t *pids);
> | >
> | > There are just two other (minor) changes pending to this patchset:
> | >
> | >        - PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
> | >        - PATCH 10: update documentation to reflect new interface.
> | >
> | > If this looks ok, we repost entire patchset next week.
> | 
> | I know I'm late to this discussion, but why the name clone3()? It's
> | not consistent with any other convention used fo syscall naming,
> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
> | might make more sense.
> 
> Sure, we talked about calling it clone_extended() and I can go back
> to that.
> 
> Only minor concern with that name was if this new call ever needs to
> be extended, what would we call it :-). With clone3() we could add a
> real/fake parameter and call it clone4() :-p

Perhaps clone64 (somewhat like stat64 for example)?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]     ` <20091016194451.GA28706-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 20:34       ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-19 20:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>   
>> Sukadev Bhattiprolu wrote:
>>     
>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>
>>> To support application checkpoint/restart, a task must have the same pid it
>>> had when it was checkpointed.  When containers are nested, the tasks within
>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>> to specify during restart.
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>>> child process.
>>>
>>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>>> documentation on the new system call, some/all of which will eventually
>>> go into a man page.
>>>   
>>>       
>> Sorry for jumping so late in the discussion and for having maybe my
>> remarks pointless...
>>
>> If this syscall is only for checkpoint / restart, why this shouldn't be
>> used with a future generic sys_restart syscall ?
>>     
>
> As I tried to explain in PATCH 0/9, the ability to choose a pid is only
> for C/R but we are also trying to clone-flags so we won't need yet
> another variant of clone() fairly soon.
>
>   
>> Otherwise, shouldn't be more convenient to have something usable for
>> everyone, let's say:
>>
>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>
>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>> allocated (zero means the kernel will choose one for us) and the newly
>> allocated task is the son of 'pid'.
>>     
>
> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
> P2 ?  I did not know we had a requirement for that. Can you explain the
> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>   
I forgot to mention a constraint with the specified pid : P2 has to be 
child of P1.
In other word, you can not specify a pid to clonat which is not your 
descendant (including yourself).
With this constraint I think there is no security issues.

Concerning of forking on behalf of another process, we can consider it 
is up to the caller / programmer to know what it does. If a process in 
the process hierarchy exec'ed a program and we cloneat this process and 
then the program fails because of an "unexpected error", well, we should 
have not done that. A similar example is when the IPC are removed while 
they are used by some other processes.

Here it is a interesting use case:
 * if you created a pid namespace, and, let's say, booted a system 
container where the container init is the "init" process, then with this 
call you can enter the container at any time by doing cloneat() followed 
by an exec of your command. I think that was a requirement when there 
were discussions around "sys_hijack".

Another point. It's another way to extend the exhausted clone  flags as 
the cloneat can be called as a compatibility way, with cloneat(getpid(), 
0, ... )

> Note also that 'desiredpid' must be a list of pids (one for each pid
> namespaces that the child will belong to) and hence we need 'nr_pids'
> to specify the list. Given that we are limited to 6 parameters to the
> syscall, such parameters must be stuffed into 'struct clone_args'.
>
> So we should do something like:
>
> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
> 		pid_t *desired_pids)
>
> or (to match the name and parameters, move 'pid' parameter into clone_args)
>   
Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
was a debate between kernel or userspace proctree creation but it looks 
like it's done from the kernel with this call.

I don't really see a difference between sys_restart(pid_t pid , int fd, 
long flags) where pid_t is the topmost in the hierarchy, fd is a file 
descriptor to a structure "pid_t * + struct clone_args *" and flags is 
"PROCTREE".

IMHO, it is nicer to recursively restore the process tree for the nested 
pid namespaces, that will be really an userspace process tree creation 
and cloneat will be your friend here :)

>> That looks more consistent with the "<syscall>at" family, 'openat',
>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>> the checkpoint / restart.
>>     
>
> The subtle difference though is that openat() does not open a file on
> behalf of another process and so the 'at' suffix would not apply ?
>   
Yes and no, depending of where you put the cursor. If you consider the 
'at' suffix means a process context, then I agree with you, there is a 
difference because the cloneat will be out of the current process 
context. But if you consider the 'at' suffix as a context in general, 
and openat means "relatively to a file descriptor" and cloneat means 
"relatively to a pid namespace" the 'at' suffix may apply. But I agree 
that we are so used to call the posix "fork", that cloneat sounds scary :)

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]     ` <20091016194451.GA28706-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 20:34       ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-19 20:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, randy.dunlap, arnd, Containers, Nathan Lynch,
	Louis.Rilling, Eric W. Biederman, kosaki.motohiro, hpa, mingo,
	linux-api, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [daniel.lezcano@free.fr] wrote:
>   
>> Sukadev Bhattiprolu wrote:
>>     
>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>
>>> To support application checkpoint/restart, a task must have the same pid it
>>> had when it was checkpointed.  When containers are nested, the tasks within
>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>> to specify during restart.
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>>> child process.
>>>
>>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>>> documentation on the new system call, some/all of which will eventually
>>> go into a man page.
>>>   
>>>       
>> Sorry for jumping so late in the discussion and for having maybe my
>> remarks pointless...
>>
>> If this syscall is only for checkpoint / restart, why this shouldn't be
>> used with a future generic sys_restart syscall ?
>>     
>
> As I tried to explain in PATCH 0/9, the ability to choose a pid is only
> for C/R but we are also trying to clone-flags so we won't need yet
> another variant of clone() fairly soon.
>
>   
>> Otherwise, shouldn't be more convenient to have something usable for
>> everyone, let's say:
>>
>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>
>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>> allocated (zero means the kernel will choose one for us) and the newly
>> allocated task is the son of 'pid'.
>>     
>
> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
> P2 ?  I did not know we had a requirement for that. Can you explain the
> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>   
I forgot to mention a constraint with the specified pid : P2 has to be 
child of P1.
In other word, you can not specify a pid to clonat which is not your 
descendant (including yourself).
With this constraint I think there is no security issues.

Concerning of forking on behalf of another process, we can consider it 
is up to the caller / programmer to know what it does. If a process in 
the process hierarchy exec'ed a program and we cloneat this process and 
then the program fails because of an "unexpected error", well, we should 
have not done that. A similar example is when the IPC are removed while 
they are used by some other processes.

Here it is a interesting use case:
 * if you created a pid namespace, and, let's say, booted a system 
container where the container init is the "init" process, then with this 
call you can enter the container at any time by doing cloneat() followed 
by an exec of your command. I think that was a requirement when there 
were discussions around "sys_hijack".

Another point. It's another way to extend the exhausted clone  flags as 
the cloneat can be called as a compatibility way, with cloneat(getpid(), 
0, ... )

> Note also that 'desiredpid' must be a list of pids (one for each pid
> namespaces that the child will belong to) and hence we need 'nr_pids'
> to specify the list. Given that we are limited to 6 parameters to the
> syscall, such parameters must be stuffed into 'struct clone_args'.
>
> So we should do something like:
>
> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
> 		pid_t *desired_pids)
>
> or (to match the name and parameters, move 'pid' parameter into clone_args)
>   
Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
was a debate between kernel or userspace proctree creation but it looks 
like it's done from the kernel with this call.

I don't really see a difference between sys_restart(pid_t pid , int fd, 
long flags) where pid_t is the topmost in the hierarchy, fd is a file 
descriptor to a structure "pid_t * + struct clone_args *" and flags is 
"PROCTREE".

IMHO, it is nicer to recursively restore the process tree for the nested 
pid namespaces, that will be really an userspace process tree creation 
and cloneat will be your friend here :)

>> That looks more consistent with the "<syscall>at" family, 'openat',
>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>> the checkpoint / restart.
>>     
>
> The subtle difference though is that openat() does not open a file on
> behalf of another process and so the 'at' suffix would not apply ?
>   
Yes and no, depending of where you put the cursor. If you consider the 
'at' suffix means a process context, then I agree with you, there is a 
difference because the cloneat will be out of the current process 
context. But if you consider the 'at' suffix as a context in general, 
and openat means "relatively to a file descriptor" and cloneat means 
"relatively to a pid namespace" the 'at' suffix may apply. But I agree 
that we are so used to call the posix "fork", that cloneat sounds scary :)

Thanks
  -- Daniel


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-19 20:34       ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-19 20:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	Eric W. Biederman, kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>   
>> Sukadev Bhattiprolu wrote:
>>     
>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>
>>> To support application checkpoint/restart, a task must have the same pid it
>>> had when it was checkpointed.  When containers are nested, the tasks within
>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>> to specify during restart.
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>>> child process.
>>>
>>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>>> documentation on the new system call, some/all of which will eventually
>>> go into a man page.
>>>   
>>>       
>> Sorry for jumping so late in the discussion and for having maybe my
>> remarks pointless...
>>
>> If this syscall is only for checkpoint / restart, why this shouldn't be
>> used with a future generic sys_restart syscall ?
>>     
>
> As I tried to explain in PATCH 0/9, the ability to choose a pid is only
> for C/R but we are also trying to clone-flags so we won't need yet
> another variant of clone() fairly soon.
>
>   
>> Otherwise, shouldn't be more convenient to have something usable for
>> everyone, let's say:
>>
>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>
>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>> allocated (zero means the kernel will choose one for us) and the newly
>> allocated task is the son of 'pid'.
>>     
>
> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
> P2 ?  I did not know we had a requirement for that. Can you explain the
> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>   
I forgot to mention a constraint with the specified pid : P2 has to be 
child of P1.
In other word, you can not specify a pid to clonat which is not your 
descendant (including yourself).
With this constraint I think there is no security issues.

Concerning of forking on behalf of another process, we can consider it 
is up to the caller / programmer to know what it does. If a process in 
the process hierarchy exec'ed a program and we cloneat this process and 
then the program fails because of an "unexpected error", well, we should 
have not done that. A similar example is when the IPC are removed while 
they are used by some other processes.

Here it is a interesting use case:
 * if you created a pid namespace, and, let's say, booted a system 
container where the container init is the "init" process, then with this 
call you can enter the container at any time by doing cloneat() followed 
by an exec of your command. I think that was a requirement when there 
were discussions around "sys_hijack".

Another point. It's another way to extend the exhausted clone  flags as 
the cloneat can be called as a compatibility way, with cloneat(getpid(), 
0, ... )

> Note also that 'desiredpid' must be a list of pids (one for each pid
> namespaces that the child will belong to) and hence we need 'nr_pids'
> to specify the list. Given that we are limited to 6 parameters to the
> syscall, such parameters must be stuffed into 'struct clone_args'.
>
> So we should do something like:
>
> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
> 		pid_t *desired_pids)
>
> or (to match the name and parameters, move 'pid' parameter into clone_args)
>   
Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
was a debate between kernel or userspace proctree creation but it looks 
like it's done from the kernel with this call.

I don't really see a difference between sys_restart(pid_t pid , int fd, 
long flags) where pid_t is the topmost in the hierarchy, fd is a file 
descriptor to a structure "pid_t * + struct clone_args *" and flags is 
"PROCTREE".

IMHO, it is nicer to recursively restore the process tree for the nested 
pid namespaces, that will be really an userspace process tree creation 
and cloneat will be your friend here :)

>> That looks more consistent with the "<syscall>at" family, 'openat',
>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>> the checkpoint / restart.
>>     
>
> The subtle difference though is that openat() does not open a file on
> behalf of another process and so the 'at' suffix would not apply ?
>   
Yes and no, depending of where you put the cursor. If you consider the 
'at' suffix means a process context, then I agree with you, there is a 
difference because the cloneat will be out of the current process 
context. But if you consider the 'at' suffix as a context in general, 
and openat means "relatively to a file descriptor" and cloneat means 
"relatively to a pid namespace" the 'at' suffix may apply. But I agree 
that we are so used to call the posix "fork", that cloneat sounds scary :)

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                 ` <20091019174405.GE27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-19 21:31                   ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-19 21:31 UTC (permalink / raw)
  To: Matt Helsley
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

On 10/20/2009 02:44 AM, Matt Helsley wrote:
>> |
>> | I know I'm late to this discussion, but why the name clone3()? It's
>> | not consistent with any other convention used fo syscall naming,

This assumption, of course, is just plain wrong.  Look at the wait 
system calls, for example.  However, when a small integer is used like 
that, it pretty much always reflects numbers of arguments.

>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>> | might make more sense.
>>
>> Sure, we talked about calling it clone_extended() and I can go back
>> to that.
>>
>> Only minor concern with that name was if this new call ever needs to
>> be extended, what would we call it :-). With clone3() we could add a
>> real/fake parameter and call it clone4() :-p
>
> Perhaps clone64 (somewhat like stat64 for example)?
>

I think that doesn't exactly reflect the nature of the changes.

clone3() is actually pretty good.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                 ` <20091019174405.GE27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-19 21:31                   ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-19 21:31 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Sukadev Bhattiprolu, mtk.manpages, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On 10/20/2009 02:44 AM, Matt Helsley wrote:
>> |
>> | I know I'm late to this discussion, but why the name clone3()? It's
>> | not consistent with any other convention used fo syscall naming,

This assumption, of course, is just plain wrong.  Look at the wait 
system calls, for example.  However, when a small integer is used like 
that, it pretty much always reflects numbers of arguments.

>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>> | might make more sense.
>>
>> Sure, we talked about calling it clone_extended() and I can go back
>> to that.
>>
>> Only minor concern with that name was if this new call ever needs to
>> be extended, what would we call it :-). With clone3() we could add a
>> real/fake parameter and call it clone4() :-p
>
> Perhaps clone64 (somewhat like stat64 for example)?
>

I think that doesn't exactly reflect the nature of the changes.

clone3() is actually pretty good.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-19 21:31                   ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-19 21:31 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Sukadev Bhattiprolu, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On 10/20/2009 02:44 AM, Matt Helsley wrote:
>> |
>> | I know I'm late to this discussion, but why the name clone3()? It's
>> | not consistent with any other convention used fo syscall naming,

This assumption, of course, is just plain wrong.  Look at the wait 
system calls, for example.  However, when a small integer is used like 
that, it pretty much always reflects numbers of arguments.

>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>> | might make more sense.
>>
>> Sure, we talked about calling it clone_extended() and I can go back
>> to that.
>>
>> Only minor concern with that name was if this new call ever needs to
>> be extended, what would we call it :-). With clone3() we could add a
>> real/fake parameter and call it clone4() :-p
>
> Perhaps clone64 (somewhat like stat64 for example)?
>

I think that doesn't exactly reflect the nature of the changes.

clone3() is actually pretty good.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]     ` <20091013045556.GJ28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-14 12:26         ` Arnd Bergmann
@ 2009-10-19 21:36       ` Pavel Machek
  1 sibling, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-19 21:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Hi!

> This gives a brief overview of the clone3() system call.  We should

Thanks!

> eventually describe more details in existing clone(2) man page or in
> a new man page.

M. Kerrisk (see MAINTAINERS) maintains man pages...

> Changelog[v8]:
> 	- clone2() is already in use in IA64. Rename syscall to clone3()
...
> Index: linux-2.6/Documentation/clone2
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700

clone3?

> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;

u64 seems wrong  on 32 bit platforms. ulong?

> +	u32 nr_pids;

So nr_pids is either 1 or 2?

> +	u32 reserved1;
> +	u64 parent_tid;
> +	u64 child_tid;
> +	u64 reserved2;
> +};
> +


> +	See CLONE_NEWPID section of clone(2) man page for details about pid
> +	namespaces.
> +
> +	The order pids in @pids corresponds to the nesting order of pid-
> +	namespaces, with @pids[0] corresponding to the init_pid_ns.

Ok, so I'm confused.

> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
...
> +	On failure, clone3() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).

Inconsistent with above. Syscalls really return -ERRCODE, errno is
glibc magic.

> +Example:
> +
> +	pid_t pids[] = { 77, 99 };
> +	struct clone_struct cs;
> +
> +	cs.flags = (u64) SIGCHLD;
> +	cs.child_stack = (u64) setup_child_stack();
> +	cs.nr_pids = 2;
> +	cs.parent_tid = 0LL;
> +	cs.child_tid = 0LL;
> +
> +	rc = syscall(__NR_clone3, &cs, pids);

Hmm, is there reason why pids are not at the end of struct
clone_struct? Passing most parameters in special structure, then pids
separately is strange...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]     ` <20091013045556.GJ28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 21:36       ` Pavel Machek
  2009-10-19 21:36       ` Pavel Machek
  1 sibling, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-19 21:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Hi!

> This gives a brief overview of the clone3() system call.  We should

Thanks!

> eventually describe more details in existing clone(2) man page or in
> a new man page.

M. Kerrisk (see MAINTAINERS) maintains man pages...

> Changelog[v8]:
> 	- clone2() is already in use in IA64. Rename syscall to clone3()
...
> Index: linux-2.6/Documentation/clone2
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700

clone3?

> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;

u64 seems wrong  on 32 bit platforms. ulong?

> +	u32 nr_pids;

So nr_pids is either 1 or 2?

> +	u32 reserved1;
> +	u64 parent_tid;
> +	u64 child_tid;
> +	u64 reserved2;
> +};
> +


> +	See CLONE_NEWPID section of clone(2) man page for details about pid
> +	namespaces.
> +
> +	The order pids in @pids corresponds to the nesting order of pid-
> +	namespaces, with @pids[0] corresponding to the init_pid_ns.

Ok, so I'm confused.

> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
...
> +	On failure, clone3() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).

Inconsistent with above. Syscalls really return -ERRCODE, errno is
glibc magic.

> +Example:
> +
> +	pid_t pids[] = { 77, 99 };
> +	struct clone_struct cs;
> +
> +	cs.flags = (u64) SIGCHLD;
> +	cs.child_stack = (u64) setup_child_stack();
> +	cs.nr_pids = 2;
> +	cs.parent_tid = 0LL;
> +	cs.child_tid = 0LL;
> +
> +	rc = syscall(__NR_clone3, &cs, pids);

Hmm, is there reason why pids are not at the end of struct
clone_struct? Passing most parameters in special structure, then pids
separately is strange...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-19 21:36       ` Pavel Machek
  0 siblings, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-19 21:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oren Laadan,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

Hi!

> This gives a brief overview of the clone3() system call.  We should

Thanks!

> eventually describe more details in existing clone(2) man page or in
> a new man page.

M. Kerrisk (see MAINTAINERS) maintains man pages...

> Changelog[v8]:
> 	- clone2() is already in use in IA64. Rename syscall to clone3()
...
> Index: linux-2.6/Documentation/clone2
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700

clone3?

> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;

u64 seems wrong  on 32 bit platforms. ulong?

> +	u32 nr_pids;

So nr_pids is either 1 or 2?

> +	u32 reserved1;
> +	u64 parent_tid;
> +	u64 child_tid;
> +	u64 reserved2;
> +};
> +


> +	See CLONE_NEWPID section of clone(2) man page for details about pid
> +	namespaces.
> +
> +	The order pids in @pids corresponds to the nesting order of pid-
> +	namespaces, with @pids[0] corresponding to the init_pid_ns.

Ok, so I'm confused.

> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
...
> +	On failure, clone3() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).

Inconsistent with above. Syscalls really return -ERRCODE, errno is
glibc magic.

> +Example:
> +
> +	pid_t pids[] = { 77, 99 };
> +	struct clone_struct cs;
> +
> +	cs.flags = (u64) SIGCHLD;
> +	cs.child_stack = (u64) setup_child_stack();
> +	cs.nr_pids = 2;
> +	cs.parent_tid = 0LL;
> +	cs.child_tid = 0LL;
> +
> +	rc = syscall(__NR_clone3, &cs, pids);

Hmm, is there reason why pids are not at the end of struct
clone_struct? Passing most parameters in special structure, then pids
separately is strange...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]       ` <4ADCCD68.9030003-GANU6spQydw@public.gmane.org>
@ 2009-10-19 21:47         ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-19 21:47 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>>   
>>> Sukadev Bhattiprolu wrote:
>>>     
>>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>>
>>>> To support application checkpoint/restart, a task must have the same pid it
>>>> had when it was checkpointed.  When containers are nested, the tasks within
>>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>>> to specify during restart.
>>>>
>>>> This patchset implements a new system call, clone3() that lets a process
>>>> specify the pids of the child process.
>>>>

[...]

>>> Otherwise, shouldn't be more convenient to have something usable for
>>> everyone, let's say:
>>>
>>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>>
>>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>>> allocated (zero means the kernel will choose one for us) and the newly
>>> allocated task is the son of 'pid'.
>>>     
>> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
>> P2 ?  I did not know we had a requirement for that. Can you explain the
>> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>>   
> I forgot to mention a constraint with the specified pid : P2 has to be 
> child of P1.
> In other word, you can not specify a pid to clonat which is not your 
> descendant (including yourself).
> With this constraint I think there is no security issues.

Sounds dangerous. What if your descendant executed a setuid program ?

> 
> Concerning of forking on behalf of another process, we can consider it 
> is up to the caller / programmer to know what it does. If a process in 

Before the user can program with this syscall, _you_ need to define
the semantics of this syscall. For example, do you freeze that other
process to ensure that the clone is consistent ?

> the process hierarchy exec'ed a program and we cloneat this process and 
> then the program fails because of an "unexpected error", well, we should 
> have not done that. A similar example is when the IPC are removed while 
> they are used by some other processes.
> 
> Here it is a interesting use case:
>  * if you created a pid namespace, and, let's say, booted a system 
> container where the container init is the "init" process, then with this 
> call you can enter the container at any time by doing cloneat() followed 
> by an exec of your command. I think that was a requirement when there 
> were discussions around "sys_hijack".

Can you define more precisely what you mean by "enter" the container ?

If you simply want create a new process in the container, you can
achieve the same thing with a daemon, or a smart init process (in
there), or even ptrace tricks.

Also, there is a reason why sys_hijack() was hijacked away ... And
I honestly think that a syscall to force another process to clone
would be shot down by the kernel guys.

> Another point. It's another way to extend the exhausted clone  flags as 
> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> 0, ... )

Which is what the proposed new clone_....() does.

> 
>> Note also that 'desiredpid' must be a list of pids (one for each pid
>> namespaces that the child will belong to) and hence we need 'nr_pids'
>> to specify the list. Given that we are limited to 6 parameters to the
>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>
>> So we should do something like:
>>
>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>> 		pid_t *desired_pids)
>>
>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>   
> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
> was a debate between kernel or userspace proctree creation but it looks 
> like it's done from the kernel with this call.

It isn't multiple clones in one clone. The syscall creates *one* single
process. We just ask the kernel to assign a specific pid to that process.

And because processes that live in nested pid-namespace own multiple
pids (one for each level), we need to specify multiple pids (one for
each level) for this single process that we create.

Then, to create the entire restart process tree, we have to call this
system call as many times as the number of processes to restart.

And yes, this is all done in userspace. The _only_ kernel-space help
is an interface to request specific pid(s) for each restarted process.

> I don't really see a difference between sys_restart(pid_t pid , int fd, 
> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> "PROCTREE".

What you describe here is the way openvz's kernel-side process tree
restart works. This is not the current approach.

> IMHO, it is nicer to recursively restore the process tree for the nested 
> pid namespaces, that will be really an userspace process tree creation 
> and cloneat will be your friend here :)

The restart.c utility from the userspace tools (user-cr) is your
friend here...  I'm sure you'll think it's nice :p

>>> That looks more consistent with the "<syscall>at" family, 'openat',
>>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>>> the checkpoint / restart.
>>>     
>> The subtle difference though is that openat() does not open a file on
>> behalf of another process and so the 'at' suffix would not apply ?
>>   
> Yes and no, depending of where you put the cursor. If you consider the 
> 'at' suffix means a process context, then I agree with you, there is a 
> difference because the cloneat will be out of the current process 
> context. But if you consider the 'at' suffix as a context in general, 
> and openat means "relatively to a file descriptor" and cloneat means 
> "relatively to a pid namespace" the 'at' suffix may apply. But I agree 
> that we are so used to call the posix "fork", that cloneat sounds scary :)

At this point, I really couldn't care less about how we name the
new syscall.

Check out the containers IRC channel for the latest pearls:
clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
the prominent contenders, together with xerox() and ditto()...

There you go; these should be enough to keep the discussion
around on life support for at least another week.

I really really really hope we can settle down on *a* name,
*any* name, and move forward. Amen.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]       ` <4ADCCD68.9030003-GANU6spQydw@public.gmane.org>
@ 2009-10-19 21:47         ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-19 21:47 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov



Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [daniel.lezcano@free.fr] wrote:
>>   
>>> Sukadev Bhattiprolu wrote:
>>>     
>>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>>
>>>> To support application checkpoint/restart, a task must have the same pid it
>>>> had when it was checkpointed.  When containers are nested, the tasks within
>>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>>> to specify during restart.
>>>>
>>>> This patchset implements a new system call, clone3() that lets a process
>>>> specify the pids of the child process.
>>>>

[...]

>>> Otherwise, shouldn't be more convenient to have something usable for
>>> everyone, let's say:
>>>
>>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>>
>>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>>> allocated (zero means the kernel will choose one for us) and the newly
>>> allocated task is the son of 'pid'.
>>>     
>> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
>> P2 ?  I did not know we had a requirement for that. Can you explain the
>> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>>   
> I forgot to mention a constraint with the specified pid : P2 has to be 
> child of P1.
> In other word, you can not specify a pid to clonat which is not your 
> descendant (including yourself).
> With this constraint I think there is no security issues.

Sounds dangerous. What if your descendant executed a setuid program ?

> 
> Concerning of forking on behalf of another process, we can consider it 
> is up to the caller / programmer to know what it does. If a process in 

Before the user can program with this syscall, _you_ need to define
the semantics of this syscall. For example, do you freeze that other
process to ensure that the clone is consistent ?

> the process hierarchy exec'ed a program and we cloneat this process and 
> then the program fails because of an "unexpected error", well, we should 
> have not done that. A similar example is when the IPC are removed while 
> they are used by some other processes.
> 
> Here it is a interesting use case:
>  * if you created a pid namespace, and, let's say, booted a system 
> container where the container init is the "init" process, then with this 
> call you can enter the container at any time by doing cloneat() followed 
> by an exec of your command. I think that was a requirement when there 
> were discussions around "sys_hijack".

Can you define more precisely what you mean by "enter" the container ?

If you simply want create a new process in the container, you can
achieve the same thing with a daemon, or a smart init process (in
there), or even ptrace tricks.

Also, there is a reason why sys_hijack() was hijacked away ... And
I honestly think that a syscall to force another process to clone
would be shot down by the kernel guys.

> Another point. It's another way to extend the exhausted clone  flags as 
> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> 0, ... )

Which is what the proposed new clone_....() does.

> 
>> Note also that 'desiredpid' must be a list of pids (one for each pid
>> namespaces that the child will belong to) and hence we need 'nr_pids'
>> to specify the list. Given that we are limited to 6 parameters to the
>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>
>> So we should do something like:
>>
>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>> 		pid_t *desired_pids)
>>
>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>   
> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
> was a debate between kernel or userspace proctree creation but it looks 
> like it's done from the kernel with this call.

It isn't multiple clones in one clone. The syscall creates *one* single
process. We just ask the kernel to assign a specific pid to that process.

And because processes that live in nested pid-namespace own multiple
pids (one for each level), we need to specify multiple pids (one for
each level) for this single process that we create.

Then, to create the entire restart process tree, we have to call this
system call as many times as the number of processes to restart.

And yes, this is all done in userspace. The _only_ kernel-space help
is an interface to request specific pid(s) for each restarted process.

> I don't really see a difference between sys_restart(pid_t pid , int fd, 
> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> "PROCTREE".

What you describe here is the way openvz's kernel-side process tree
restart works. This is not the current approach.

> IMHO, it is nicer to recursively restore the process tree for the nested 
> pid namespaces, that will be really an userspace process tree creation 
> and cloneat will be your friend here :)

The restart.c utility from the userspace tools (user-cr) is your
friend here...  I'm sure you'll think it's nice :p

>>> That looks more consistent with the "<syscall>at" family, 'openat',
>>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>>> the checkpoint / restart.
>>>     
>> The subtle difference though is that openat() does not open a file on
>> behalf of another process and so the 'at' suffix would not apply ?
>>   
> Yes and no, depending of where you put the cursor. If you consider the 
> 'at' suffix means a process context, then I agree with you, there is a 
> difference because the cloneat will be out of the current process 
> context. But if you consider the 'at' suffix as a context in general, 
> and openat means "relatively to a file descriptor" and cloneat means 
> "relatively to a pid namespace" the 'at' suffix may apply. But I agree 
> that we are so used to call the posix "fork", that cloneat sounds scary :)

At this point, I really couldn't care less about how we name the
new syscall.

Check out the containers IRC channel for the latest pearls:
clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
the prominent contenders, together with xerox() and ditto()...

There you go; these should be enough to keep the discussion
around on life support for at least another week.

I really really really hope we can settle down on *a* name,
*any* name, and move forward. Amen.

Oren.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-19 21:47         ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-19 21:47 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>>   
>>> Sukadev Bhattiprolu wrote:
>>>     
>>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>>
>>>> To support application checkpoint/restart, a task must have the same pid it
>>>> had when it was checkpointed.  When containers are nested, the tasks within
>>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>>> to specify during restart.
>>>>
>>>> This patchset implements a new system call, clone3() that lets a process
>>>> specify the pids of the child process.
>>>>

[...]

>>> Otherwise, shouldn't be more convenient to have something usable for
>>> everyone, let's say:
>>>
>>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>>
>>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>>> allocated (zero means the kernel will choose one for us) and the newly
>>> allocated task is the son of 'pid'.
>>>     
>> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
>> P2 ?  I did not know we had a requirement for that. Can you explain the
>> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>>   
> I forgot to mention a constraint with the specified pid : P2 has to be 
> child of P1.
> In other word, you can not specify a pid to clonat which is not your 
> descendant (including yourself).
> With this constraint I think there is no security issues.

Sounds dangerous. What if your descendant executed a setuid program ?

> 
> Concerning of forking on behalf of another process, we can consider it 
> is up to the caller / programmer to know what it does. If a process in 

Before the user can program with this syscall, _you_ need to define
the semantics of this syscall. For example, do you freeze that other
process to ensure that the clone is consistent ?

> the process hierarchy exec'ed a program and we cloneat this process and 
> then the program fails because of an "unexpected error", well, we should 
> have not done that. A similar example is when the IPC are removed while 
> they are used by some other processes.
> 
> Here it is a interesting use case:
>  * if you created a pid namespace, and, let's say, booted a system 
> container where the container init is the "init" process, then with this 
> call you can enter the container at any time by doing cloneat() followed 
> by an exec of your command. I think that was a requirement when there 
> were discussions around "sys_hijack".

Can you define more precisely what you mean by "enter" the container ?

If you simply want create a new process in the container, you can
achieve the same thing with a daemon, or a smart init process (in
there), or even ptrace tricks.

Also, there is a reason why sys_hijack() was hijacked away ... And
I honestly think that a syscall to force another process to clone
would be shot down by the kernel guys.

> Another point. It's another way to extend the exhausted clone  flags as 
> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> 0, ... )

Which is what the proposed new clone_....() does.

> 
>> Note also that 'desiredpid' must be a list of pids (one for each pid
>> namespaces that the child will belong to) and hence we need 'nr_pids'
>> to specify the list. Given that we are limited to 6 parameters to the
>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>
>> So we should do something like:
>>
>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>> 		pid_t *desired_pids)
>>
>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>   
> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
> was a debate between kernel or userspace proctree creation but it looks 
> like it's done from the kernel with this call.

It isn't multiple clones in one clone. The syscall creates *one* single
process. We just ask the kernel to assign a specific pid to that process.

And because processes that live in nested pid-namespace own multiple
pids (one for each level), we need to specify multiple pids (one for
each level) for this single process that we create.

Then, to create the entire restart process tree, we have to call this
system call as many times as the number of processes to restart.

And yes, this is all done in userspace. The _only_ kernel-space help
is an interface to request specific pid(s) for each restarted process.

> I don't really see a difference between sys_restart(pid_t pid , int fd, 
> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> "PROCTREE".

What you describe here is the way openvz's kernel-side process tree
restart works. This is not the current approach.

> IMHO, it is nicer to recursively restore the process tree for the nested 
> pid namespaces, that will be really an userspace process tree creation 
> and cloneat will be your friend here :)

The restart.c utility from the userspace tools (user-cr) is your
friend here...  I'm sure you'll think it's nice :p

>>> That looks more consistent with the "<syscall>at" family, 'openat',
>>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>>> the checkpoint / restart.
>>>     
>> The subtle difference though is that openat() does not open a file on
>> behalf of another process and so the 'at' suffix would not apply ?
>>   
> Yes and no, depending of where you put the cursor. If you consider the 
> 'at' suffix means a process context, then I agree with you, there is a 
> difference because the cloneat will be out of the current process 
> context. But if you consider the 'at' suffix as a context in general, 
> and openat means "relatively to a file descriptor" and cloneat means 
> "relatively to a pid namespace" the 'at' suffix may apply. But I agree 
> that we are so used to call the posix "fork", that cloneat sounds scary :)

At this point, I really couldn't care less about how we name the
new syscall.

Check out the containers IRC channel for the latest pearls:
clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
the prominent contenders, together with xerox() and ditto()...

There you go; these should be enough to keep the discussion
around on life support for at least another week.

I really really really hope we can settle down on *a* name,
*any* name, and move forward. Amen.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                   ` <4ADCDAA8.5080408-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-19 23:50                     ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 23:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Eric W. Biederman,
	Pavel Emelyanov, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>> |
>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>> | not consistent with any other convention used fo syscall naming,
>
> This assumption, of course, is just plain wrong.  Look at the wait  
> system calls, for example.  However, when a small integer is used like  
> that, it pretty much always reflects numbers of arguments.
>
>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>> | might make more sense.
>>>
>>> Sure, we talked about calling it clone_extended() and I can go back
>>> to that.
>>>
>>> Only minor concern with that name was if this new call ever needs to
>>> be extended, what would we call it :-). With clone3() we could add a
>>> real/fake parameter and call it clone4() :-p
>>
>> Perhaps clone64 (somewhat like stat64 for example)?
>>
>
> I think that doesn't exactly reflect the nature of the changes.

Yes. Without adopting an impractical encoding scheme it's quite
unlikely a small number like 3 or 64 could exactly reflect all the
changes :). I don't think that's a realistic objection though so...

> clone3() is actually pretty good.

I agree.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                   ` <4ADCDAA8.5080408-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-19 23:50                     ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 23:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Matt Helsley, Sukadev Bhattiprolu, mtk.manpages, randy.dunlap,
	arnd, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>> |
>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>> | not consistent with any other convention used fo syscall naming,
>
> This assumption, of course, is just plain wrong.  Look at the wait  
> system calls, for example.  However, when a small integer is used like  
> that, it pretty much always reflects numbers of arguments.
>
>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>> | might make more sense.
>>>
>>> Sure, we talked about calling it clone_extended() and I can go back
>>> to that.
>>>
>>> Only minor concern with that name was if this new call ever needs to
>>> be extended, what would we call it :-). With clone3() we could add a
>>> real/fake parameter and call it clone4() :-p
>>
>> Perhaps clone64 (somewhat like stat64 for example)?
>>
>
> I think that doesn't exactly reflect the nature of the changes.

Yes. Without adopting an impractical encoding scheme it's quite
unlikely a small number like 3 or 64 could exactly reflect all the
changes :). I don't think that's a realistic objection though so...

> clone3() is actually pretty good.

I agree.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-19 23:50                     ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-19 23:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Matt Helsley, Sukadev Bhattiprolu,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>> |
>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>> | not consistent with any other convention used fo syscall naming,
>
> This assumption, of course, is just plain wrong.  Look at the wait  
> system calls, for example.  However, when a small integer is used like  
> that, it pretty much always reflects numbers of arguments.
>
>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>> | might make more sense.
>>>
>>> Sure, we talked about calling it clone_extended() and I can go back
>>> to that.
>>>
>>> Only minor concern with that name was if this new call ever needs to
>>> be extended, what would we call it :-). With clone3() we could add a
>>> real/fake parameter and call it clone4() :-p
>>
>> Perhaps clone64 (somewhat like stat64 for example)?
>>
>
> I think that doesn't exactly reflect the nature of the changes.

Yes. Without adopting an impractical encoding scheme it's quite
unlikely a small number like 3 or 64 could exactly reflect all the
changes :). I don't think that's a realistic objection though so...

> clone3() is actually pretty good.

I agree.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]         ` <4ADCDE7F.4090501-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-20  0:51           ` Matt Helsley
  2009-10-21 15:53           ` Daniel Lezcano
  1 sibling, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-20  0:51 UTC (permalink / raw)
  To: Oren Laadan
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> 
> 
> Daniel Lezcano wrote:
> > Sukadev Bhattiprolu wrote:
> >> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
> >>   
> >>> Sukadev Bhattiprolu wrote:
> >>>     
> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >>>>

<snip>

> > Another point. It's another way to extend the exhausted clone  flags as 
> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> > 0, ... )
> 
> Which is what the proposed new clone_....() does.

Just to be clear -- Suka's proposing to extend the clone flags. However I
don't believe reusing the "pid" parameters as Daniel seemed to suggest
was ever part of Suka's proposed changes.

<snip>

> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> > "PROCTREE".

I think the difference has to do with keeping the code maintainable.

Clone creates the process so it's already involved in allocating and
assigning pids to the new task. Switching pids at sys_restart() would 
add another point in the code where pids are allocated and assigned.
This suggests we may have to worry about introducing new obscure races
for anyone who's working on the pid allocator to be careful of. At
least when all the code is "localized" to the clone paths we can be
reasonably certain of proper maintenance.

<snip>

> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.

clone3() seemed to be the leading contender from what I've read so far.
Does anyone still object to clone3() after reading the whole thread?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-19 21:47         ` Oren Laadan
  (?)
@ 2009-10-20  0:51         ` Matt Helsley
  2009-10-20  3:33             ` Eric W. Biederman
       [not found]           ` <20091020005125.GG27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  -1 siblings, 2 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-20  0:51 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Daniel Lezcano, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, hpa, mingo, Sukadev Bhattiprolu, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> 
> 
> Daniel Lezcano wrote:
> > Sukadev Bhattiprolu wrote:
> >> Daniel Lezcano [daniel.lezcano@free.fr] wrote:
> >>   
> >>> Sukadev Bhattiprolu wrote:
> >>>     
> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >>>>

<snip>

> > Another point. It's another way to extend the exhausted clone  flags as 
> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> > 0, ... )
> 
> Which is what the proposed new clone_....() does.

Just to be clear -- Suka's proposing to extend the clone flags. However I
don't believe reusing the "pid" parameters as Daniel seemed to suggest
was ever part of Suka's proposed changes.

<snip>

> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> > "PROCTREE".

I think the difference has to do with keeping the code maintainable.

Clone creates the process so it's already involved in allocating and
assigning pids to the new task. Switching pids at sys_restart() would 
add another point in the code where pids are allocated and assigned.
This suggests we may have to worry about introducing new obscure races
for anyone who's working on the pid allocator to be careful of. At
least when all the code is "localized" to the clone paths we can be
reasonably certain of proper maintenance.

<snip>

> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.

clone3() seemed to be the leading contender from what I've read so far.
Does anyone still object to clone3() after reading the whole thread?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]           ` <20091020005125.GG27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-20  3:33             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20  3:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
>> 
>> 
>> Daniel Lezcano wrote:
>> > Sukadev Bhattiprolu wrote:
>> >> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>> >>   
>> >>> Sukadev Bhattiprolu wrote:
>> >>>     
>> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>> >>>>
>
> <snip>
>
>> > Another point. It's another way to extend the exhausted clone  flags as 
>> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> > 0, ... )
>> 
>> Which is what the proposed new clone_....() does.
>
> Just to be clear -- Suka's proposing to extend the clone flags. However I
> don't believe reusing the "pid" parameters as Daniel seemed to suggest
> was ever part of Suka's proposed changes.
>
> <snip>
>
>> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> > "PROCTREE".
>
> I think the difference has to do with keeping the code maintainable.
>
> Clone creates the process so it's already involved in allocating and
> assigning pids to the new task. Switching pids at sys_restart() would 
> add another point in the code where pids are allocated and assigned.
> This suggests we may have to worry about introducing new obscure races
> for anyone who's working on the pid allocator to be careful of. At
> least when all the code is "localized" to the clone paths we can be
> reasonably certain of proper maintenance.
>
> <snip>
>
>> I really really really hope we can settle down on *a* name,
>> *any* name, and move forward. Amen.
>
> clone3() seemed to be the leading contender from what I've read so far.
> Does anyone still object to clone3() after reading the whole thread?

I object to what clone3() is.  The name is not particularly interesting.

The sanity checks for assigning pids are missing and there is a todo
about it.  I am not comfortable with assigning pids to a new process
in a pid namespace with other processes user space processes executing
in it.

How we handle a clone extension depends critically on if we want to
create a processes for restart in user space or kernel space.

Could some one give me or point me at a strong case for creating the
processes for restart in user space?

The pid assignment code is currently ugly.  I asked that we just pass
in the min max pid pids that already exist into the core pid
assignment function and a constrained min/max that only admits a
single pid when we are allocating a struct pid for restart.  That was
not done and now we have a weird abortion with unnecessary special cases.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]           ` <20091020005125.GG27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-20  3:33             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20  3:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Oren Laadan, Daniel Lezcano, randy.dunlap, arnd, linux-api,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, Sukadev Bhattiprolu, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

Matt Helsley <matthltc@us.ibm.com> writes:

> On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
>> 
>> 
>> Daniel Lezcano wrote:
>> > Sukadev Bhattiprolu wrote:
>> >> Daniel Lezcano [daniel.lezcano@free.fr] wrote:
>> >>   
>> >>> Sukadev Bhattiprolu wrote:
>> >>>     
>> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>> >>>>
>
> <snip>
>
>> > Another point. It's another way to extend the exhausted clone  flags as 
>> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> > 0, ... )
>> 
>> Which is what the proposed new clone_....() does.
>
> Just to be clear -- Suka's proposing to extend the clone flags. However I
> don't believe reusing the "pid" parameters as Daniel seemed to suggest
> was ever part of Suka's proposed changes.
>
> <snip>
>
>> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> > "PROCTREE".
>
> I think the difference has to do with keeping the code maintainable.
>
> Clone creates the process so it's already involved in allocating and
> assigning pids to the new task. Switching pids at sys_restart() would 
> add another point in the code where pids are allocated and assigned.
> This suggests we may have to worry about introducing new obscure races
> for anyone who's working on the pid allocator to be careful of. At
> least when all the code is "localized" to the clone paths we can be
> reasonably certain of proper maintenance.
>
> <snip>
>
>> I really really really hope we can settle down on *a* name,
>> *any* name, and move forward. Amen.
>
> clone3() seemed to be the leading contender from what I've read so far.
> Does anyone still object to clone3() after reading the whole thread?

I object to what clone3() is.  The name is not particularly interesting.

The sanity checks for assigning pids are missing and there is a todo
about it.  I am not comfortable with assigning pids to a new process
in a pid namespace with other processes user space processes executing
in it.

How we handle a clone extension depends critically on if we want to
create a processes for restart in user space or kernel space.

Could some one give me or point me at a strong case for creating the
processes for restart in user space?

The pid assignment code is currently ugly.  I asked that we just pass
in the min max pid pids that already exist into the core pid
assignment function and a constrained min/max that only admits a
single pid when we are allocating a struct pid for restart.  That was
not done and now we have a weird abortion with unnecessary special cases.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20  3:33             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20  3:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Oren Laadan, Daniel Lezcano, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
>> 
>> 
>> Daniel Lezcano wrote:
>> > Sukadev Bhattiprolu wrote:
>> >> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
>> >>   
>> >>> Sukadev Bhattiprolu wrote:
>> >>>     
>> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>> >>>>
>
> <snip>
>
>> > Another point. It's another way to extend the exhausted clone  flags as 
>> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> > 0, ... )
>> 
>> Which is what the proposed new clone_....() does.
>
> Just to be clear -- Suka's proposing to extend the clone flags. However I
> don't believe reusing the "pid" parameters as Daniel seemed to suggest
> was ever part of Suka's proposed changes.
>
> <snip>
>
>> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> > "PROCTREE".
>
> I think the difference has to do with keeping the code maintainable.
>
> Clone creates the process so it's already involved in allocating and
> assigning pids to the new task. Switching pids at sys_restart() would 
> add another point in the code where pids are allocated and assigned.
> This suggests we may have to worry about introducing new obscure races
> for anyone who's working on the pid allocator to be careful of. At
> least when all the code is "localized" to the clone paths we can be
> reasonably certain of proper maintenance.
>
> <snip>
>
>> I really really really hope we can settle down on *a* name,
>> *any* name, and move forward. Amen.
>
> clone3() seemed to be the leading contender from what I've read so far.
> Does anyone still object to clone3() after reading the whole thread?

I object to what clone3() is.  The name is not particularly interesting.

The sanity checks for assigning pids are missing and there is a todo
about it.  I am not comfortable with assigning pids to a new process
in a pid namespace with other processes user space processes executing
in it.

How we handle a clone extension depends critically on if we want to
create a processes for restart in user space or kernel space.

Could some one give me or point me at a strong case for creating the
processes for restart in user space?

The pid assignment code is currently ugly.  I asked that we just pass
in the min max pid pids that already exist into the core pid
assignment function and a constrained min/max that only admits a
single pid when we are allocating a struct pid for restart.  That was
not done and now we have a weird abortion with unnecessary special cases.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <m1vdiad9jd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20  4:03               ` Sukadev Bhattiprolu
  2009-10-20 14:09               ` Serge E. Hallyn
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20  4:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > clone3() seemed to be the leading contender from what I've read so far.
| > Does anyone still object to clone3() after reading the whole thread?
| 
| I object to what clone3() is.  The name is not particularly interesting.
| 
| The sanity checks for assigning pids are missing and there is a todo
| about it.  I am not comfortable with assigning pids to a new process
| in a pid namespace with other processes user space processes executing
| in it.

Could you clarify ? How is the call to alloc_pidmap() from clone3() different
from the call from clone() itself ?

| 
| How we handle a clone extension depends critically on if we want to
| create a processes for restart in user space or kernel space.
| 
| Could some one give me or point me at a strong case for creating the
| processes for restart in user space?

There has been a lot of discussion on this with reference to the
Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
for instance.

| 
| The pid assignment code is currently ugly.  I asked that we just pass
| in the min max pid pids that already exist into the core pid
| assignment function and a constrained min/max that only admits a
| single pid when we are allocating a struct pid for restart.  That was
| not done and now we have a weird abortion with unnecessary special cases.

I did post a version of the patch attemptint to implement that. As
pointed out in:

	http://lkml.org/lkml/2009/8/17/445

we would need more checks in alloc_pidmap() to cover cases like min or max
being invalid or min being greater than max or max being greater than pid_max
etc. Those checks also made the code ugly (imo).

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <m1vdiad9jd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20  4:03               ` Sukadev Bhattiprolu
  2009-10-20 14:09               ` Serge E. Hallyn
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20  4:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > clone3() seemed to be the leading contender from what I've read so far.
| > Does anyone still object to clone3() after reading the whole thread?
| 
| I object to what clone3() is.  The name is not particularly interesting.
| 
| The sanity checks for assigning pids are missing and there is a todo
| about it.  I am not comfortable with assigning pids to a new process
| in a pid namespace with other processes user space processes executing
| in it.

Could you clarify ? How is the call to alloc_pidmap() from clone3() different
from the call from clone() itself ?

| 
| How we handle a clone extension depends critically on if we want to
| create a processes for restart in user space or kernel space.
| 
| Could some one give me or point me at a strong case for creating the
| processes for restart in user space?

There has been a lot of discussion on this with reference to the
Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
for instance.

| 
| The pid assignment code is currently ugly.  I asked that we just pass
| in the min max pid pids that already exist into the core pid
| assignment function and a constrained min/max that only admits a
| single pid when we are allocating a struct pid for restart.  That was
| not done and now we have a weird abortion with unnecessary special cases.

I did post a version of the patch attemptint to implement that. As
pointed out in:

	http://lkml.org/lkml/2009/8/17/445

we would need more checks in alloc_pidmap() to cover cases like min or max
being invalid or min being greater than max or max being greater than pid_max
etc. Those checks also made the code ugly (imo).

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20  4:03               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20  4:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > clone3() seemed to be the leading contender from what I've read so far.
| > Does anyone still object to clone3() after reading the whole thread?
| 
| I object to what clone3() is.  The name is not particularly interesting.
| 
| The sanity checks for assigning pids are missing and there is a todo
| about it.  I am not comfortable with assigning pids to a new process
| in a pid namespace with other processes user space processes executing
| in it.

Could you clarify ? How is the call to alloc_pidmap() from clone3() different
from the call from clone() itself ?

| 
| How we handle a clone extension depends critically on if we want to
| create a processes for restart in user space or kernel space.
| 
| Could some one give me or point me at a strong case for creating the
| processes for restart in user space?

There has been a lot of discussion on this with reference to the
Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
for instance.

| 
| The pid assignment code is currently ugly.  I asked that we just pass
| in the min max pid pids that already exist into the core pid
| assignment function and a constrained min/max that only admits a
| single pid when we are allocating a struct pid for restart.  That was
| not done and now we have a weird abortion with unnecessary special cases.

I did post a version of the patch attemptint to implement that. As
pointed out in:

	http://lkml.org/lkml/2009/8/17/445

we would need more checks in alloc_pidmap() to cover cases like min or max
being invalid or min being greater than max or max being greater than pid_max
etc. Those checks also made the code ugly (imo).

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]               ` <20091020040315.GA26632-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-20 10:46                 ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 10:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > clone3() seemed to be the leading contender from what I've read so far.
> | > Does anyone still object to clone3() after reading the whole thread?
> | 
> | I object to what clone3() is.  The name is not particularly interesting.
> | 
> | The sanity checks for assigning pids are missing and there is a todo
> | about it.  I am not comfortable with assigning pids to a new process
> | in a pid namespace with other processes user space processes executing
> | in it.
>
> Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> from the call from clone() itself ?

I think it is totally inappropriate to assign pids in a pid namespace
where there are user space processes already running.

> | How we handle a clone extension depends critically on if we want to
> | create a processes for restart in user space or kernel space.
> | 
> | Could some one give me or point me at a strong case for creating the
> | processes for restart in user space?
>
> There has been a lot of discussion on this with reference to the
> Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> for instance.

Just read it.  Thank you.  Now I am certain clone_with_pids() is
not useful functionality to be exporting to userspace.

The only real argument in favor of doing this in user space is greater
flexibility.  I can see checkpointing/restoring a single thread process
without a pid namespace.  Anything more and you are just asking for
trouble.

A design that weakens security.  Increases maintenance costs.  All for
an unreliable result seems like a bad one to me.

> | The pid assignment code is currently ugly.  I asked that we just pass
> | in the min max pid pids that already exist into the core pid
> | assignment function and a constrained min/max that only admits a
> | single pid when we are allocating a struct pid for restart.  That was
> | not done and now we have a weird abortion with unnecessary special cases.
>
> I did post a version of the patch attemptint to implement that. As
> pointed out in:
>
> 	http://lkml.org/lkml/2009/8/17/445
>
> we would need more checks in alloc_pidmap() to cover cases like min or max
> being invalid or min being greater than max or max being greater than pid_max
> etc. Those checks also made the code ugly (imo).

If you need more checks you are doing it wrong.  The code already has min
and max values, and even a start value.  I was just strongly suggesting
we generalize where we get the values from, and then we have not special
cases. 

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]               ` <20091020040315.GA26632-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-20 10:46                 ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 10:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | > clone3() seemed to be the leading contender from what I've read so far.
> | > Does anyone still object to clone3() after reading the whole thread?
> | 
> | I object to what clone3() is.  The name is not particularly interesting.
> | 
> | The sanity checks for assigning pids are missing and there is a todo
> | about it.  I am not comfortable with assigning pids to a new process
> | in a pid namespace with other processes user space processes executing
> | in it.
>
> Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> from the call from clone() itself ?

I think it is totally inappropriate to assign pids in a pid namespace
where there are user space processes already running.

> | How we handle a clone extension depends critically on if we want to
> | create a processes for restart in user space or kernel space.
> | 
> | Could some one give me or point me at a strong case for creating the
> | processes for restart in user space?
>
> There has been a lot of discussion on this with reference to the
> Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> for instance.

Just read it.  Thank you.  Now I am certain clone_with_pids() is
not useful functionality to be exporting to userspace.

The only real argument in favor of doing this in user space is greater
flexibility.  I can see checkpointing/restoring a single thread process
without a pid namespace.  Anything more and you are just asking for
trouble.

A design that weakens security.  Increases maintenance costs.  All for
an unreliable result seems like a bad one to me.

> | The pid assignment code is currently ugly.  I asked that we just pass
> | in the min max pid pids that already exist into the core pid
> | assignment function and a constrained min/max that only admits a
> | single pid when we are allocating a struct pid for restart.  That was
> | not done and now we have a weird abortion with unnecessary special cases.
>
> I did post a version of the patch attemptint to implement that. As
> pointed out in:
>
> 	http://lkml.org/lkml/2009/8/17/445
>
> we would need more checks in alloc_pidmap() to cover cases like min or max
> being invalid or min being greater than max or max being greater than pid_max
> etc. Those checks also made the code ugly (imo).

If you need more checks you are doing it wrong.  The code already has min
and max values, and even a start value.  I was just strongly suggesting
we generalize where we get the values from, and then we have not special
cases. 

Eric


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 10:46                 ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 10:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > clone3() seemed to be the leading contender from what I've read so far.
> | > Does anyone still object to clone3() after reading the whole thread?
> | 
> | I object to what clone3() is.  The name is not particularly interesting.
> | 
> | The sanity checks for assigning pids are missing and there is a todo
> | about it.  I am not comfortable with assigning pids to a new process
> | in a pid namespace with other processes user space processes executing
> | in it.
>
> Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> from the call from clone() itself ?

I think it is totally inappropriate to assign pids in a pid namespace
where there are user space processes already running.

> | How we handle a clone extension depends critically on if we want to
> | create a processes for restart in user space or kernel space.
> | 
> | Could some one give me or point me at a strong case for creating the
> | processes for restart in user space?
>
> There has been a lot of discussion on this with reference to the
> Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> for instance.

Just read it.  Thank you.  Now I am certain clone_with_pids() is
not useful functionality to be exporting to userspace.

The only real argument in favor of doing this in user space is greater
flexibility.  I can see checkpointing/restoring a single thread process
without a pid namespace.  Anything more and you are just asking for
trouble.

A design that weakens security.  Increases maintenance costs.  All for
an unreliable result seems like a bad one to me.

> | The pid assignment code is currently ugly.  I asked that we just pass
> | in the min max pid pids that already exist into the core pid
> | assignment function and a constrained min/max that only admits a
> | single pid when we are allocating a struct pid for restart.  That was
> | not done and now we have a weird abortion with unnecessary special cases.
>
> I did post a version of the patch attemptint to implement that. As
> pointed out in:
>
> 	http://lkml.org/lkml/2009/8/17/445
>
> we would need more checks in alloc_pidmap() to cover cases like min or max
> being invalid or min being greater than max or max being greater than pid_max
> etc. Those checks also made the code ugly (imo).

If you need more checks you are doing it wrong.  The code already has min
and max values, and even a start value.  I was just strongly suggesting
we generalize where we get the values from, and then we have not special
cases. 

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <m1vdiad9jd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2009-10-20  4:03               ` Sukadev Bhattiprolu
@ 2009-10-20 14:09               ` Serge E. Hallyn
  1 sibling, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> >> 
> >> 
> >> Daniel Lezcano wrote:
> >> > Sukadev Bhattiprolu wrote:
> >> >> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
> >> >>   
> >> >>> Sukadev Bhattiprolu wrote:
> >> >>>     
> >> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >> >>>>
> >
> > <snip>
> >
> >> > Another point. It's another way to extend the exhausted clone  flags as 
> >> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> >> > 0, ... )
> >> 
> >> Which is what the proposed new clone_....() does.
> >
> > Just to be clear -- Suka's proposing to extend the clone flags. However I
> > don't believe reusing the "pid" parameters as Daniel seemed to suggest
> > was ever part of Suka's proposed changes.
> >
> > <snip>
> >
> >> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
> >> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> >> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> >> > "PROCTREE".
> >
> > I think the difference has to do with keeping the code maintainable.
> >
> > Clone creates the process so it's already involved in allocating and
> > assigning pids to the new task. Switching pids at sys_restart() would 
> > add another point in the code where pids are allocated and assigned.
> > This suggests we may have to worry about introducing new obscure races
> > for anyone who's working on the pid allocator to be careful of. At
> > least when all the code is "localized" to the clone paths we can be
> > reasonably certain of proper maintenance.
> >
> > <snip>
> >
> >> I really really really hope we can settle down on *a* name,
> >> *any* name, and move forward. Amen.
> >
> > clone3() seemed to be the leading contender from what I've read so far.
> > Does anyone still object to clone3() after reading the whole thread?
> 
> I object to what clone3() is.  The name is not particularly interesting.
> 
> The sanity checks for assigning pids are missing and there is a todo
> about it.  I am not comfortable with assigning pids to a new process
> in a pid namespace with other processes user space processes executing
> in it.
> 
> How we handle a clone extension depends critically on if we want to
> create a processes for restart in user space or kernel space.
> 
> Could some one give me or point me at a strong case for creating the
> processes for restart in user space?
> 
> The pid assignment code is currently ugly.  I asked that we just pass
> in the min max pid pids that already exist into the core pid
> assignment function and a constrained min/max that only admits a
> single pid when we are allocating a struct pid for restart.  That was
> not done and now we have a weird abortion with unnecessary special cases.

I asked you (I believe twice) to clarify how on earth you meant for
that to be done for hierarchical pid namespaces (a task being restored
which needs two of it's 4 pids specified), and you did not reply.

Did you mean for it to be done through procfiles?  If so, does the
task have to keep multiple /proc mounts around, one for each
pidns hierarchy in which it needs to specify a pid?  Or did you have
another idea in mind?  A single procfile into which multiple pids can
be specified in a list?  A completely different interface?

Or do you mean for this to be done only from the kernel?

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <m1vdiad9jd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 14:09               ` Serge E. Hallyn
  2009-10-20 14:09               ` Serge E. Hallyn
  1 sibling, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, kosaki.motohiro, hpa,
	mingo, Sukadev Bhattiprolu, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Matt Helsley <matthltc@us.ibm.com> writes:
> 
> > On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> >> 
> >> 
> >> Daniel Lezcano wrote:
> >> > Sukadev Bhattiprolu wrote:
> >> >> Daniel Lezcano [daniel.lezcano@free.fr] wrote:
> >> >>   
> >> >>> Sukadev Bhattiprolu wrote:
> >> >>>     
> >> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >> >>>>
> >
> > <snip>
> >
> >> > Another point. It's another way to extend the exhausted clone  flags as 
> >> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> >> > 0, ... )
> >> 
> >> Which is what the proposed new clone_....() does.
> >
> > Just to be clear -- Suka's proposing to extend the clone flags. However I
> > don't believe reusing the "pid" parameters as Daniel seemed to suggest
> > was ever part of Suka's proposed changes.
> >
> > <snip>
> >
> >> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
> >> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> >> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> >> > "PROCTREE".
> >
> > I think the difference has to do with keeping the code maintainable.
> >
> > Clone creates the process so it's already involved in allocating and
> > assigning pids to the new task. Switching pids at sys_restart() would 
> > add another point in the code where pids are allocated and assigned.
> > This suggests we may have to worry about introducing new obscure races
> > for anyone who's working on the pid allocator to be careful of. At
> > least when all the code is "localized" to the clone paths we can be
> > reasonably certain of proper maintenance.
> >
> > <snip>
> >
> >> I really really really hope we can settle down on *a* name,
> >> *any* name, and move forward. Amen.
> >
> > clone3() seemed to be the leading contender from what I've read so far.
> > Does anyone still object to clone3() after reading the whole thread?
> 
> I object to what clone3() is.  The name is not particularly interesting.
> 
> The sanity checks for assigning pids are missing and there is a todo
> about it.  I am not comfortable with assigning pids to a new process
> in a pid namespace with other processes user space processes executing
> in it.
> 
> How we handle a clone extension depends critically on if we want to
> create a processes for restart in user space or kernel space.
> 
> Could some one give me or point me at a strong case for creating the
> processes for restart in user space?
> 
> The pid assignment code is currently ugly.  I asked that we just pass
> in the min max pid pids that already exist into the core pid
> assignment function and a constrained min/max that only admits a
> single pid when we are allocating a struct pid for restart.  That was
> not done and now we have a weird abortion with unnecessary special cases.

I asked you (I believe twice) to clarify how on earth you meant for
that to be done for hierarchical pid namespaces (a task being restored
which needs two of it's 4 pids specified), and you did not reply.

Did you mean for it to be done through procfiles?  If so, does the
task have to keep multiple /proc mounts around, one for each
pidns hierarchy in which it needs to specify a pid?  Or did you have
another idea in mind?  A single procfile into which multiple pids can
be specified in a list?  A completely different interface?

Or do you mean for this to be done only from the kernel?

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 14:09               ` Serge E. Hallyn
  0 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> >> 
> >> 
> >> Daniel Lezcano wrote:
> >> > Sukadev Bhattiprolu wrote:
> >> >> Daniel Lezcano [daniel.lezcano-GANU6spQydw@public.gmane.org] wrote:
> >> >>   
> >> >>> Sukadev Bhattiprolu wrote:
> >> >>>     
> >> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >> >>>>
> >
> > <snip>
> >
> >> > Another point. It's another way to extend the exhausted clone  flags as 
> >> > the cloneat can be called as a compatibility way, with cloneat(getpid(), 
> >> > 0, ... )
> >> 
> >> Which is what the proposed new clone_....() does.
> >
> > Just to be clear -- Suka's proposing to extend the clone flags. However I
> > don't believe reusing the "pid" parameters as Daniel seemed to suggest
> > was ever part of Suka's proposed changes.
> >
> > <snip>
> >
> >> > I don't really see a difference between sys_restart(pid_t pid , int fd, 
> >> > long flags) where pid_t is the topmost in the hierarchy, fd is a file 
> >> > descriptor to a structure "pid_t * + struct clone_args *" and flags is 
> >> > "PROCTREE".
> >
> > I think the difference has to do with keeping the code maintainable.
> >
> > Clone creates the process so it's already involved in allocating and
> > assigning pids to the new task. Switching pids at sys_restart() would 
> > add another point in the code where pids are allocated and assigned.
> > This suggests we may have to worry about introducing new obscure races
> > for anyone who's working on the pid allocator to be careful of. At
> > least when all the code is "localized" to the clone paths we can be
> > reasonably certain of proper maintenance.
> >
> > <snip>
> >
> >> I really really really hope we can settle down on *a* name,
> >> *any* name, and move forward. Amen.
> >
> > clone3() seemed to be the leading contender from what I've read so far.
> > Does anyone still object to clone3() after reading the whole thread?
> 
> I object to what clone3() is.  The name is not particularly interesting.
> 
> The sanity checks for assigning pids are missing and there is a todo
> about it.  I am not comfortable with assigning pids to a new process
> in a pid namespace with other processes user space processes executing
> in it.
> 
> How we handle a clone extension depends critically on if we want to
> create a processes for restart in user space or kernel space.
> 
> Could some one give me or point me at a strong case for creating the
> processes for restart in user space?
> 
> The pid assignment code is currently ugly.  I asked that we just pass
> in the min max pid pids that already exist into the core pid
> assignment function and a constrained min/max that only admits a
> single pid when we are allocating a struct pid for restart.  That was
> not done and now we have a weird abortion with unnecessary special cases.

I asked you (I believe twice) to clarify how on earth you meant for
that to be done for hierarchical pid namespaces (a task being restored
which needs two of it's 4 pids specified), and you did not reply.

Did you mean for it to be done through procfiles?  If so, does the
task have to keep multiple /proc mounts around, one for each
pidns hierarchy in which it needs to specify a pid?  Or did you have
another idea in mind?  A single procfile into which multiple pids can
be specified in a list?  A completely different interface?

Or do you mean for this to be done only from the kernel?

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                 ` <m1iqeauyvl.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 14:16                   ` Serge E. Hallyn
  2009-10-20 18:33                   ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> The only real argument in favor of doing this in user space is greater
> flexibility.

Yyyyup.

> I can see checkpointing/restoring a single thread process
> without a pid namespace.  Anything more and you are just asking for
> trouble.
> 
> A design that weakens security.

How does it weaken security?  It requires CAP_SYS_ADMIN, and, when
targeted capabilities are added, it will require CAP_SYS_ADMIN to
only those pid namespaces in which we choose a pid.

> Increases maintenance costs.

Does it really?  It re-uses most of the existing code.  More so than
the only existing in-kernel restart code I've seen does.

> All for
> an unreliable result seems like a bad one to me.

I've personally always been on the fence as to whether we rebuild the
process tree in user-space or kernel.  And I'm still on the fence, as
nothing you've said has convinced me otherwise.

> > | The pid assignment code is currently ugly.  I asked that we just pass
> > | in the min max pid pids that already exist into the core pid
> > | assignment function and a constrained min/max that only admits a
> > | single pid when we are allocating a struct pid for restart.  That was
> > | not done and now we have a weird abortion with unnecessary special cases.
> >
> > I did post a version of the patch attemptint to implement that. As
> > pointed out in:
> >
> > 	http://lkml.org/lkml/2009/8/17/445
> >
> > we would need more checks in alloc_pidmap() to cover cases like min or max
> > being invalid or min being greater than max or max being greater than pid_max
> > etc. Those checks also made the code ugly (imo).
> 
> If you need more checks you are doing it wrong.  The code already has min
> and max values, and even a start value.  I was just strongly suggesting
> we generalize where we get the values from, and then we have not special
> cases. 

(I'm not sure whether this argument is a separate one - regarding the
implementation of choosing the pid - from the kernel-vs-userspace one
or not, so will wait for a response to my other email about your API)

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                 ` <m1iqeauyvl.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 14:16                   ` Serge E. Hallyn
  2009-10-20 18:33                   ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, kosaki.motohiro, hpa,
	mingo, torvalds, Alexey Dobriyan, roland, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
> The only real argument in favor of doing this in user space is greater
> flexibility.

Yyyyup.

> I can see checkpointing/restoring a single thread process
> without a pid namespace.  Anything more and you are just asking for
> trouble.
> 
> A design that weakens security.

How does it weaken security?  It requires CAP_SYS_ADMIN, and, when
targeted capabilities are added, it will require CAP_SYS_ADMIN to
only those pid namespaces in which we choose a pid.

> Increases maintenance costs.

Does it really?  It re-uses most of the existing code.  More so than
the only existing in-kernel restart code I've seen does.

> All for
> an unreliable result seems like a bad one to me.

I've personally always been on the fence as to whether we rebuild the
process tree in user-space or kernel.  And I'm still on the fence, as
nothing you've said has convinced me otherwise.

> > | The pid assignment code is currently ugly.  I asked that we just pass
> > | in the min max pid pids that already exist into the core pid
> > | assignment function and a constrained min/max that only admits a
> > | single pid when we are allocating a struct pid for restart.  That was
> > | not done and now we have a weird abortion with unnecessary special cases.
> >
> > I did post a version of the patch attemptint to implement that. As
> > pointed out in:
> >
> > 	http://lkml.org/lkml/2009/8/17/445
> >
> > we would need more checks in alloc_pidmap() to cover cases like min or max
> > being invalid or min being greater than max or max being greater than pid_max
> > etc. Those checks also made the code ugly (imo).
> 
> If you need more checks you are doing it wrong.  The code already has min
> and max values, and even a start value.  I was just strongly suggesting
> we generalize where we get the values from, and then we have not special
> cases. 

(I'm not sure whether this argument is a separate one - regarding the
implementation of choosing the pid - from the kernel-vs-userspace one
or not, so will wait for a response to my other email about your API)

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 14:16                   ` Serge E. Hallyn
  0 siblings, 0 replies; 226+ messages in thread
From: Serge E. Hallyn @ 2009-10-20 14:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> The only real argument in favor of doing this in user space is greater
> flexibility.

Yyyyup.

> I can see checkpointing/restoring a single thread process
> without a pid namespace.  Anything more and you are just asking for
> trouble.
> 
> A design that weakens security.

How does it weaken security?  It requires CAP_SYS_ADMIN, and, when
targeted capabilities are added, it will require CAP_SYS_ADMIN to
only those pid namespaces in which we choose a pid.

> Increases maintenance costs.

Does it really?  It re-uses most of the existing code.  More so than
the only existing in-kernel restart code I've seen does.

> All for
> an unreliable result seems like a bad one to me.

I've personally always been on the fence as to whether we rebuild the
process tree in user-space or kernel.  And I'm still on the fence, as
nothing you've said has convinced me otherwise.

> > | The pid assignment code is currently ugly.  I asked that we just pass
> > | in the min max pid pids that already exist into the core pid
> > | assignment function and a constrained min/max that only admits a
> > | single pid when we are allocating a struct pid for restart.  That was
> > | not done and now we have a weird abortion with unnecessary special cases.
> >
> > I did post a version of the patch attemptint to implement that. As
> > pointed out in:
> >
> > 	http://lkml.org/lkml/2009/8/17/445
> >
> > we would need more checks in alloc_pidmap() to cover cases like min or max
> > being invalid or min being greater than max or max being greater than pid_max
> > etc. Those checks also made the code ugly (imo).
> 
> If you need more checks you are doing it wrong.  The code already has min
> and max values, and even a start value.  I was just strongly suggesting
> we generalize where we get the values from, and then we have not special
> cases. 

(I'm not sure whether this argument is a separate one - regarding the
implementation of choosing the pid - from the kernel-vs-userspace one
or not, so will wait for a response to my other email about your API)

thanks,
-serge

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                 ` <m1iqeauyvl.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2009-10-20 14:16                   ` Serge E. Hallyn
@ 2009-10-20 18:33                   ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20 18:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
| > from the call from clone() itself ?
| 
| I think it is totally inappropriate to assign pids in a pid namespace
| where there are user space processes already running.

Honestly, I don't understand why it is inappropriate or how this differs
from normal clone() - which also assigns pids in own and ancestor pid
namespaces.

| 
| > | How we handle a clone extension depends critically on if we want to
| > | create a processes for restart in user space or kernel space.
| > | 
| > | Could some one give me or point me at a strong case for creating the
| > | processes for restart in user space?
| >
| > There has been a lot of discussion on this with reference to the
| > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
| > for instance.
| 
| Just read it.  Thank you.

Sorry. I should have mentioned the reason here. (Like you mention below),
flexibility is the main reason.

| Now I am certain clone_with_pids() is not useful functionality to be
| exporting to userspace.
| 
| The only real argument in favor of doing this in user space is greater
| flexibility.  I can see checkpointing/restoring a single thread process
| without a pid namespace.  Anything more and you are just asking for
| trouble.
| 
| A design that weakens security.  Increases maintenance costs.  All for
| an unreliable result seems like a bad one to me.
| 
| > | The pid assignment code is currently ugly.  I asked that we just pass
| > | in the min max pid pids that already exist into the core pid
| > | assignment function and a constrained min/max that only admits a
| > | single pid when we are allocating a struct pid for restart.  That was
| > | not done and now we have a weird abortion with unnecessary special cases.
| >
| > I did post a version of the patch attemptint to implement that. As
| > pointed out in:
| >
| > 	http://lkml.org/lkml/2009/8/17/445
| >
| > we would need more checks in alloc_pidmap() to cover cases like min or max
| > being invalid or min being greater than max or max being greater than pid_max
| > etc. Those checks also made the code ugly (imo).
| 
| If you need more checks you are doing it wrong.  The code already has min
| and max values, and even a start value.  I was just strongly suggesting
| we generalize where we get the values from, and then we have not special
| cases. 

Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
parameters passed in (ie assumes that callers pass it in correctly)
it might be simple. But when user specifies the pid, the 

	min == max == user's target pid

so we will need to check the values either here or in callers.

Yes the code already has values and a start value. But these are
controlled by alloc_pidmap() and not passed in from the user space.

alloc_pidmap() needs to assign the next available pid or a specific
target pid.  Generalizing it to alloc a pid in a range seemed be a
bit of an over kill for currently known usages.

I will post a version of the patch outside this patchset with min
and max parameters and we can see if it can be optimized/beautified.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                 ` <m1iqeauyvl.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 18:33                   ` Sukadev Bhattiprolu
  2009-10-20 18:33                   ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20 18:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
| > from the call from clone() itself ?
| 
| I think it is totally inappropriate to assign pids in a pid namespace
| where there are user space processes already running.

Honestly, I don't understand why it is inappropriate or how this differs
from normal clone() - which also assigns pids in own and ancestor pid
namespaces.

| 
| > | How we handle a clone extension depends critically on if we want to
| > | create a processes for restart in user space or kernel space.
| > | 
| > | Could some one give me or point me at a strong case for creating the
| > | processes for restart in user space?
| >
| > There has been a lot of discussion on this with reference to the
| > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
| > for instance.
| 
| Just read it.  Thank you.

Sorry. I should have mentioned the reason here. (Like you mention below),
flexibility is the main reason.

| Now I am certain clone_with_pids() is not useful functionality to be
| exporting to userspace.
| 
| The only real argument in favor of doing this in user space is greater
| flexibility.  I can see checkpointing/restoring a single thread process
| without a pid namespace.  Anything more and you are just asking for
| trouble.
| 
| A design that weakens security.  Increases maintenance costs.  All for
| an unreliable result seems like a bad one to me.
| 
| > | The pid assignment code is currently ugly.  I asked that we just pass
| > | in the min max pid pids that already exist into the core pid
| > | assignment function and a constrained min/max that only admits a
| > | single pid when we are allocating a struct pid for restart.  That was
| > | not done and now we have a weird abortion with unnecessary special cases.
| >
| > I did post a version of the patch attemptint to implement that. As
| > pointed out in:
| >
| > 	http://lkml.org/lkml/2009/8/17/445
| >
| > we would need more checks in alloc_pidmap() to cover cases like min or max
| > being invalid or min being greater than max or max being greater than pid_max
| > etc. Those checks also made the code ugly (imo).
| 
| If you need more checks you are doing it wrong.  The code already has min
| and max values, and even a start value.  I was just strongly suggesting
| we generalize where we get the values from, and then we have not special
| cases. 

Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
parameters passed in (ie assumes that callers pass it in correctly)
it might be simple. But when user specifies the pid, the 

	min == max == user's target pid

so we will need to check the values either here or in callers.

Yes the code already has values and a start value. But these are
controlled by alloc_pidmap() and not passed in from the user space.

alloc_pidmap() needs to assign the next available pid or a specific
target pid.  Generalizing it to alloc a pid in a range seemed be a
bit of an over kill for currently known usages.

I will post a version of the patch outside this patchset with min
and max parameters and we can see if it can be optimized/beautified.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 18:33                   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-20 18:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
| > from the call from clone() itself ?
| 
| I think it is totally inappropriate to assign pids in a pid namespace
| where there are user space processes already running.

Honestly, I don't understand why it is inappropriate or how this differs
from normal clone() - which also assigns pids in own and ancestor pid
namespaces.

| 
| > | How we handle a clone extension depends critically on if we want to
| > | create a processes for restart in user space or kernel space.
| > | 
| > | Could some one give me or point me at a strong case for creating the
| > | processes for restart in user space?
| >
| > There has been a lot of discussion on this with reference to the
| > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
| > for instance.
| 
| Just read it.  Thank you.

Sorry. I should have mentioned the reason here. (Like you mention below),
flexibility is the main reason.

| Now I am certain clone_with_pids() is not useful functionality to be
| exporting to userspace.
| 
| The only real argument in favor of doing this in user space is greater
| flexibility.  I can see checkpointing/restoring a single thread process
| without a pid namespace.  Anything more and you are just asking for
| trouble.
| 
| A design that weakens security.  Increases maintenance costs.  All for
| an unreliable result seems like a bad one to me.
| 
| > | The pid assignment code is currently ugly.  I asked that we just pass
| > | in the min max pid pids that already exist into the core pid
| > | assignment function and a constrained min/max that only admits a
| > | single pid when we are allocating a struct pid for restart.  That was
| > | not done and now we have a weird abortion with unnecessary special cases.
| >
| > I did post a version of the patch attemptint to implement that. As
| > pointed out in:
| >
| > 	http://lkml.org/lkml/2009/8/17/445
| >
| > we would need more checks in alloc_pidmap() to cover cases like min or max
| > being invalid or min being greater than max or max being greater than pid_max
| > etc. Those checks also made the code ugly (imo).
| 
| If you need more checks you are doing it wrong.  The code already has min
| and max values, and even a start value.  I was just strongly suggesting
| we generalize where we get the values from, and then we have not special
| cases. 

Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
parameters passed in (ie assumes that callers pass it in correctly)
it might be simple. But when user specifies the pid, the 

	min == max == user's target pid

so we will need to check the values either here or in callers.

Yes the code already has values and a start value. But these are
controlled by alloc_pidmap() and not passed in from the user space.

alloc_pidmap() needs to assign the next available pid or a specific
target pid.  Generalizing it to alloc a pid in a range seemed be a
bit of an over kill for currently known usages.

I will post a version of the patch outside this patchset with min
and max parameters and we can see if it can be optimized/beautified.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                   ` <20091020183329.GB22646-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-20 19:26                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 19:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> | > from the call from clone() itself ?
> | 
> | I think it is totally inappropriate to assign pids in a pid namespace
> | where there are user space processes already running.
>
> Honestly, I don't understand why it is inappropriate or how this differs
> from normal clone() - which also assigns pids in own and ancestor pid
> namespaces.

The fact we can specify which pids we want.  I won't claim it is as
exploitable as NULL pointer deferences have been but it has that kind
of feel to it.

> | > | How we handle a clone extension depends critically on if we want to
> | > | create a processes for restart in user space or kernel space.
> | > | 
> | > | Could some one give me or point me at a strong case for creating the
> | > | processes for restart in user space?
> | >
> | > There has been a lot of discussion on this with reference to the
> | > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> | > for instance.
> | 
> | Just read it.  Thank you.
>
> Sorry. I should have mentioned the reason here. (Like you mention below),
> flexibility is the main reason.
>
> | Now I am certain clone_with_pids() is not useful functionality to be
> | exporting to userspace.
> | 
> | The only real argument in favor of doing this in user space is greater
> | flexibility.  I can see checkpointing/restoring a single thread process
> | without a pid namespace.  Anything more and you are just asking for
> | trouble.
> | 
> | A design that weakens security.  Increases maintenance costs.  All for
> | an unreliable result seems like a bad one to me.
> | 
> | > | The pid assignment code is currently ugly.  I asked that we just pass
> | > | in the min max pid pids that already exist into the core pid
> | > | assignment function and a constrained min/max that only admits a
> | > | single pid when we are allocating a struct pid for restart.  That was
> | > | not done and now we have a weird abortion with unnecessary special cases.
> | >
> | > I did post a version of the patch attemptint to implement that. As
> | > pointed out in:
> | >
> | > 	http://lkml.org/lkml/2009/8/17/445
> | >
> | > we would need more checks in alloc_pidmap() to cover cases like min or max
> | > being invalid or min being greater than max or max being greater than pid_max
> | > etc. Those checks also made the code ugly (imo).
> | 
> | If you need more checks you are doing it wrong.  The code already has min
> | and max values, and even a start value.  I was just strongly suggesting
> | we generalize where we get the values from, and then we have not special
> | cases. 
>
> Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
> parameters passed in (ie assumes that callers pass it in correctly)
> it might be simple. But when user specifies the pid, the 
>
> 	min == max == user's target pid
>
> so we will need to check the values either here or in callers.

Agreed.  When you are talking about the target pid.  That code path
needs the extra check.

> Yes the code already has values and a start value. But these are
> controlled by alloc_pidmap() and not passed in from the user space.

I was only thinking passed in from someplace else in kernel/pid.c

> alloc_pidmap() needs to assign the next available pid or a specific
> target pid.  Generalizing it to alloc a pid in a range seemed be a
> bit of an over kill for currently known usages.

alloc_pidmap in assigning the next available pid is allocating a pid
in a range.

> I will post a version of the patch outside this patchset with min
> and max parameters and we can see if it can be optimized/beautified.

Thanks,
Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                   ` <20091020183329.GB22646-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-20 19:26                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 19:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> | > from the call from clone() itself ?
> | 
> | I think it is totally inappropriate to assign pids in a pid namespace
> | where there are user space processes already running.
>
> Honestly, I don't understand why it is inappropriate or how this differs
> from normal clone() - which also assigns pids in own and ancestor pid
> namespaces.

The fact we can specify which pids we want.  I won't claim it is as
exploitable as NULL pointer deferences have been but it has that kind
of feel to it.

> | > | How we handle a clone extension depends critically on if we want to
> | > | create a processes for restart in user space or kernel space.
> | > | 
> | > | Could some one give me or point me at a strong case for creating the
> | > | processes for restart in user space?
> | >
> | > There has been a lot of discussion on this with reference to the
> | > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> | > for instance.
> | 
> | Just read it.  Thank you.
>
> Sorry. I should have mentioned the reason here. (Like you mention below),
> flexibility is the main reason.
>
> | Now I am certain clone_with_pids() is not useful functionality to be
> | exporting to userspace.
> | 
> | The only real argument in favor of doing this in user space is greater
> | flexibility.  I can see checkpointing/restoring a single thread process
> | without a pid namespace.  Anything more and you are just asking for
> | trouble.
> | 
> | A design that weakens security.  Increases maintenance costs.  All for
> | an unreliable result seems like a bad one to me.
> | 
> | > | The pid assignment code is currently ugly.  I asked that we just pass
> | > | in the min max pid pids that already exist into the core pid
> | > | assignment function and a constrained min/max that only admits a
> | > | single pid when we are allocating a struct pid for restart.  That was
> | > | not done and now we have a weird abortion with unnecessary special cases.
> | >
> | > I did post a version of the patch attemptint to implement that. As
> | > pointed out in:
> | >
> | > 	http://lkml.org/lkml/2009/8/17/445
> | >
> | > we would need more checks in alloc_pidmap() to cover cases like min or max
> | > being invalid or min being greater than max or max being greater than pid_max
> | > etc. Those checks also made the code ugly (imo).
> | 
> | If you need more checks you are doing it wrong.  The code already has min
> | and max values, and even a start value.  I was just strongly suggesting
> | we generalize where we get the values from, and then we have not special
> | cases. 
>
> Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
> parameters passed in (ie assumes that callers pass it in correctly)
> it might be simple. But when user specifies the pid, the 
>
> 	min == max == user's target pid
>
> so we will need to check the values either here or in callers.

Agreed.  When you are talking about the target pid.  That code path
needs the extra check.

> Yes the code already has values and a start value. But these are
> controlled by alloc_pidmap() and not passed in from the user space.

I was only thinking passed in from someplace else in kernel/pid.c

> alloc_pidmap() needs to assign the next available pid or a specific
> target pid.  Generalizing it to alloc a pid in a range seemed be a
> bit of an over kill for currently known usages.

alloc_pidmap in assigning the next available pid is allocating a pid
in a range.

> I will post a version of the patch outside this patchset with min
> and max parameters and we can see if it can be optimized/beautified.

Thanks,
Eric


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 19:26                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-20 19:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> | > from the call from clone() itself ?
> | 
> | I think it is totally inappropriate to assign pids in a pid namespace
> | where there are user space processes already running.
>
> Honestly, I don't understand why it is inappropriate or how this differs
> from normal clone() - which also assigns pids in own and ancestor pid
> namespaces.

The fact we can specify which pids we want.  I won't claim it is as
exploitable as NULL pointer deferences have been but it has that kind
of feel to it.

> | > | How we handle a clone extension depends critically on if we want to
> | > | create a processes for restart in user space or kernel space.
> | > | 
> | > | Could some one give me or point me at a strong case for creating the
> | > | processes for restart in user space?
> | >
> | > There has been a lot of discussion on this with reference to the
> | > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> | > for instance.
> | 
> | Just read it.  Thank you.
>
> Sorry. I should have mentioned the reason here. (Like you mention below),
> flexibility is the main reason.
>
> | Now I am certain clone_with_pids() is not useful functionality to be
> | exporting to userspace.
> | 
> | The only real argument in favor of doing this in user space is greater
> | flexibility.  I can see checkpointing/restoring a single thread process
> | without a pid namespace.  Anything more and you are just asking for
> | trouble.
> | 
> | A design that weakens security.  Increases maintenance costs.  All for
> | an unreliable result seems like a bad one to me.
> | 
> | > | The pid assignment code is currently ugly.  I asked that we just pass
> | > | in the min max pid pids that already exist into the core pid
> | > | assignment function and a constrained min/max that only admits a
> | > | single pid when we are allocating a struct pid for restart.  That was
> | > | not done and now we have a weird abortion with unnecessary special cases.
> | >
> | > I did post a version of the patch attemptint to implement that. As
> | > pointed out in:
> | >
> | > 	http://lkml.org/lkml/2009/8/17/445
> | >
> | > we would need more checks in alloc_pidmap() to cover cases like min or max
> | > being invalid or min being greater than max or max being greater than pid_max
> | > etc. Those checks also made the code ugly (imo).
> | 
> | If you need more checks you are doing it wrong.  The code already has min
> | and max values, and even a start value.  I was just strongly suggesting
> | we generalize where we get the values from, and then we have not special
> | cases. 
>
> Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
> parameters passed in (ie assumes that callers pass it in correctly)
> it might be simple. But when user specifies the pid, the 
>
> 	min == max == user's target pid
>
> so we will need to check the values either here or in callers.

Agreed.  When you are talking about the target pid.  That code path
needs the extra check.

> Yes the code already has values and a start value. But these are
> controlled by alloc_pidmap() and not passed in from the user space.

I was only thinking passed in from someplace else in kernel/pid.c

> alloc_pidmap() needs to assign the next available pid or a specific
> target pid.  Generalizing it to alloc a pid in a range seemed be a
> bit of an over kill for currently known usages.

alloc_pidmap in assigning the next available pid is allocating a pid
in a range.

> I will post a version of the patch outside this patchset with min
> and max parameters and we can see if it can be optimized/beautified.

Thanks,
Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <m1r5sxsw7w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 20:13                       ` Oren Laadan
  2009-10-21  6:20                       ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-20 20:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Eric W. Biederman wrote:
> Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> 
>> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
>> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
>> | > from the call from clone() itself ?
>> | 
>> | I think it is totally inappropriate to assign pids in a pid namespace
>> | where there are user space processes already running.
>>
>> Honestly, I don't understand why it is inappropriate or how this differs
>> from normal clone() - which also assigns pids in own and ancestor pid
>> namespaces.
> 
> The fact we can specify which pids we want.  I won't claim it is as
> exploitable as NULL pointer deferences have been but it has that kind
> of feel to it.

This security concern was first brought up by Linus, and to address it
we made clone3() require that the user be privileged to select pids.

But, honestly, a clone3() that allows the caller to request a specific
pid is like a restart() syscall that allows the caller to restore a
process with its original pid:  you would simply checkpoint, and then
alter the pid in the checkpoint image and restart, repeat ad infinitum.

From your security prism, they are equivalent: they practically allow
a user to have an arbitrary process with a selected pid. So it doesn't
really matter - for this security concern - if you select the pid in
the kernel through restart() or from userspace through clone3().

Sure, you can also "choose" a pid today, by repeatedly forking until
you get what you want... but that is harder to exploit. In contrast,
both clone3() and restart() allow pid selection, instantaneously.
Which is why both require privileges if the caller wants to select/
restore pids.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <m1r5sxsw7w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-20 20:13                       ` Oren Laadan
  2009-10-21  6:20                       ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-20 20:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sukadev Bhattiprolu, Matt Helsley, Daniel Lezcano, randy.dunlap,
	arnd, linux-api, Containers, Nathan Lynch, linux-kernel,
	Louis.Rilling, kosaki.motohiro, hpa, mingo, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov



Eric W. Biederman wrote:
> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
> 
>> Eric W. Biederman [ebiederm@xmission.com] wrote:
>> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
>> | > from the call from clone() itself ?
>> | 
>> | I think it is totally inappropriate to assign pids in a pid namespace
>> | where there are user space processes already running.
>>
>> Honestly, I don't understand why it is inappropriate or how this differs
>> from normal clone() - which also assigns pids in own and ancestor pid
>> namespaces.
> 
> The fact we can specify which pids we want.  I won't claim it is as
> exploitable as NULL pointer deferences have been but it has that kind
> of feel to it.

This security concern was first brought up by Linus, and to address it
we made clone3() require that the user be privileged to select pids.

But, honestly, a clone3() that allows the caller to request a specific
pid is like a restart() syscall that allows the caller to restore a
process with its original pid:  you would simply checkpoint, and then
alter the pid in the checkpoint image and restart, repeat ad infinitum.

>From your security prism, they are equivalent: they practically allow
a user to have an arbitrary process with a selected pid. So it doesn't
really matter - for this security concern - if you select the pid in
the kernel through restart() or from userspace through clone3().

Sure, you can also "choose" a pid today, by repeatedly forking until
you get what you want... but that is harder to exploit. In contrast,
both clone3() and restart() allow pid selection, instantaneously.
Which is why both require privileges if the caller wants to select/
restore pids.

Oren.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-20 20:13                       ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-20 20:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sukadev Bhattiprolu, Matt Helsley, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Eric W. Biederman wrote:
> Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> 
>> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
>> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
>> | > from the call from clone() itself ?
>> | 
>> | I think it is totally inappropriate to assign pids in a pid namespace
>> | where there are user space processes already running.
>>
>> Honestly, I don't understand why it is inappropriate or how this differs
>> from normal clone() - which also assigns pids in own and ancestor pid
>> namespaces.
> 
> The fact we can specify which pids we want.  I won't claim it is as
> exploitable as NULL pointer deferences have been but it has that kind
> of feel to it.

This security concern was first brought up by Linus, and to address it
we made clone3() require that the user be privileged to select pids.

But, honestly, a clone3() that allows the caller to request a specific
pid is like a restart() syscall that allows the caller to restore a
process with its original pid:  you would simply checkpoint, and then
alter the pid in the checkpoint image and restart, repeat ad infinitum.

>From your security prism, they are equivalent: they practically allow
a user to have an arbitrary process with a selected pid. So it doesn't
really matter - for this security concern - if you select the pid in
the kernel through restart() or from userspace through clone3().

Sure, you can also "choose" a pid today, by repeatedly forking until
you get what you want... but that is harder to exploit. In contrast,
both clone3() and restart() allow pid selection, instantaneously.
Which is why both require privileges if the caller wants to select/
restore pids.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                     ` <20091019235012.GF27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-21  4:26                       ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-21  4:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	mingo-X9Un+BFzKDI, Sukadev Bhattiprolu,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Tue, Oct 20, 2009 at 1:50 AM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
>> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>>> |
>>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>>> | not consistent with any other convention used fo syscall naming,
>>
>> This assumption, of course, is just plain wrong.  Look at the wait
>> system calls, for example.  However, when a small integer is used like
>> that, it pretty much always reflects numbers of arguments.
>>
>>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>>> | might make more sense.
>>>>
>>>> Sure, we talked about calling it clone_extended() and I can go back
>>>> to that.
>>>>
>>>> Only minor concern with that name was if this new call ever needs to
>>>> be extended, what would we call it :-). With clone3() we could add a
>>>> real/fake parameter and call it clone4() :-p
>>>
>>> Perhaps clone64 (somewhat like stat64 for example)?
>>>
>>
>> I think that doesn't exactly reflect the nature of the changes.
>
> Yes. Without adopting an impractical encoding scheme it's quite
> unlikely a small number like 3 or 64 could exactly reflect all the
> changes :). I don't think that's a realistic objection though so...
>
>> clone3() is actually pretty good.
>
> I agree.

My question here is: what does "3" actually mean? In general, system
calls have not followed any convention of numbering to indicate
successive versions -- clone2() being the one possible exception that
I know of.

The only other conventions used for numbering new versions of system
calls relates either to arguments size (e.g., 32 versus 64) or to
their number number of arguments (dup2(), dup3(), wait3(), wait4(),
accept4(), eventfd2(), inotify_init1(), epoll_create1(), evetfd2(),
signalfd4()). The former convention makes some sense, but the latter
is rather questionable, for a couple of reasons. One is that the
number of arguments for the system call may change in the future
(several of the newer system calls have a flags argument which could
be used to indeicate the presence of an additional, optional
argument). Another reason that the latter convention is questionable
is that the number of arguments exposed to userspace by glibc may be
different from the number of arguments in the raw syscall. For
example, signafd4() has 4 arguments, but the glibc interface
(signalfd() http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html)
has 3.

Using the name clone3() follows no historical convention, which is why
it seems unwise to me. Thus my suggestion of clonex() (like e.g.,
adjtimex()), though quite possibly there could be some better name.

Sukadev, you wrote "With clone3() we could add areal/fake parameter
and call it clone4()". This rasies for me the question: should
clone3() have a flags argument, so as to allow these types of
extensions (i.e., not for clone flags, but rather to indicate changes
in the system call interface). Yes, I understand there is a 64-bit
flags in 'struct clone_struct', but I wonder whether there is any
virtue in having an additional flags argument in the base signature of
the function, as per the following:
http://www.kernel.org/doc/man-pages/online/pages/man2/dup3.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd4.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd2.2.html
see also
http://linux-man-pages.blogspot.com/2008/10/recent-changes-in-file-descriptor.html

By the way, one further thought: why "struct clone_struct"? We know
it's a struct. Therefore, it seems pointless to include that in the
name. Something like "struct clone_args" would seem less redundant and
slightly more meaningful as a name.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                     ` <20091019235012.GF27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-21  4:26                       ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-21  4:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: H. Peter Anvin, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On Tue, Oct 20, 2009 at 1:50 AM, Matt Helsley <matthltc@us.ibm.com> wrote:
> On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
>> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>>> |
>>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>>> | not consistent with any other convention used fo syscall naming,
>>
>> This assumption, of course, is just plain wrong.  Look at the wait
>> system calls, for example.  However, when a small integer is used like
>> that, it pretty much always reflects numbers of arguments.
>>
>>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>>> | might make more sense.
>>>>
>>>> Sure, we talked about calling it clone_extended() and I can go back
>>>> to that.
>>>>
>>>> Only minor concern with that name was if this new call ever needs to
>>>> be extended, what would we call it :-). With clone3() we could add a
>>>> real/fake parameter and call it clone4() :-p
>>>
>>> Perhaps clone64 (somewhat like stat64 for example)?
>>>
>>
>> I think that doesn't exactly reflect the nature of the changes.
>
> Yes. Without adopting an impractical encoding scheme it's quite
> unlikely a small number like 3 or 64 could exactly reflect all the
> changes :). I don't think that's a realistic objection though so...
>
>> clone3() is actually pretty good.
>
> I agree.

My question here is: what does "3" actually mean? In general, system
calls have not followed any convention of numbering to indicate
successive versions -- clone2() being the one possible exception that
I know of.

The only other conventions used for numbering new versions of system
calls relates either to arguments size (e.g., 32 versus 64) or to
their number number of arguments (dup2(), dup3(), wait3(), wait4(),
accept4(), eventfd2(), inotify_init1(), epoll_create1(), evetfd2(),
signalfd4()). The former convention makes some sense, but the latter
is rather questionable, for a couple of reasons. One is that the
number of arguments for the system call may change in the future
(several of the newer system calls have a flags argument which could
be used to indeicate the presence of an additional, optional
argument). Another reason that the latter convention is questionable
is that the number of arguments exposed to userspace by glibc may be
different from the number of arguments in the raw syscall. For
example, signafd4() has 4 arguments, but the glibc interface
(signalfd() http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html)
has 3.

Using the name clone3() follows no historical convention, which is why
it seems unwise to me. Thus my suggestion of clonex() (like e.g.,
adjtimex()), though quite possibly there could be some better name.

Sukadev, you wrote "With clone3() we could add areal/fake parameter
and call it clone4()". This rasies for me the question: should
clone3() have a flags argument, so as to allow these types of
extensions (i.e., not for clone flags, but rather to indicate changes
in the system call interface). Yes, I understand there is a 64-bit
flags in 'struct clone_struct', but I wonder whether there is any
virtue in having an additional flags argument in the base signature of
the function, as per the following:
http://www.kernel.org/doc/man-pages/online/pages/man2/dup3.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd4.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd2.2.html
see also
http://linux-man-pages.blogspot.com/2008/10/recent-changes-in-file-descriptor.html

By the way, one further thought: why "struct clone_struct"? We know
it's a struct. Therefore, it seems pointless to include that in the
name. Something like "struct clone_args" would seem less redundant and
slightly more meaningful as a name.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-21  4:26                       ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-21  4:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: H. Peter Anvin, Sukadev Bhattiprolu,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Tue, Oct 20, 2009 at 1:50 AM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
>> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>>> |
>>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>>> | not consistent with any other convention used fo syscall naming,
>>
>> This assumption, of course, is just plain wrong.  Look at the wait
>> system calls, for example.  However, when a small integer is used like
>> that, it pretty much always reflects numbers of arguments.
>>
>>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>>> | might make more sense.
>>>>
>>>> Sure, we talked about calling it clone_extended() and I can go back
>>>> to that.
>>>>
>>>> Only minor concern with that name was if this new call ever needs to
>>>> be extended, what would we call it :-). With clone3() we could add a
>>>> real/fake parameter and call it clone4() :-p
>>>
>>> Perhaps clone64 (somewhat like stat64 for example)?
>>>
>>
>> I think that doesn't exactly reflect the nature of the changes.
>
> Yes. Without adopting an impractical encoding scheme it's quite
> unlikely a small number like 3 or 64 could exactly reflect all the
> changes :). I don't think that's a realistic objection though so...
>
>> clone3() is actually pretty good.
>
> I agree.

My question here is: what does "3" actually mean? In general, system
calls have not followed any convention of numbering to indicate
successive versions -- clone2() being the one possible exception that
I know of.

The only other conventions used for numbering new versions of system
calls relates either to arguments size (e.g., 32 versus 64) or to
their number number of arguments (dup2(), dup3(), wait3(), wait4(),
accept4(), eventfd2(), inotify_init1(), epoll_create1(), evetfd2(),
signalfd4()). The former convention makes some sense, but the latter
is rather questionable, for a couple of reasons. One is that the
number of arguments for the system call may change in the future
(several of the newer system calls have a flags argument which could
be used to indeicate the presence of an additional, optional
argument). Another reason that the latter convention is questionable
is that the number of arguments exposed to userspace by glibc may be
different from the number of arguments in the raw syscall. For
example, signafd4() has 4 arguments, but the glibc interface
(signalfd() http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html)
has 3.

Using the name clone3() follows no historical convention, which is why
it seems unwise to me. Thus my suggestion of clonex() (like e.g.,
adjtimex()), though quite possibly there could be some better name.

Sukadev, you wrote "With clone3() we could add areal/fake parameter
and call it clone4()". This rasies for me the question: should
clone3() have a flags argument, so as to allow these types of
extensions (i.e., not for clone flags, but rather to indicate changes
in the system call interface). Yes, I understand there is a 64-bit
flags in 'struct clone_struct', but I wonder whether there is any
virtue in having an additional flags argument in the base signature of
the function, as per the following:
http://www.kernel.org/doc/man-pages/online/pages/man2/dup3.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd4.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd2.2.html
see also
http://linux-man-pages.blogspot.com/2008/10/recent-changes-in-file-descriptor.html

By the way, one further thought: why "struct clone_struct"? We know
it's a struct. Therefore, it seems pointless to include that in the
name. Something like "struct clone_args" would seem less redundant and
slightly more meaningful as a name.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <m1r5sxsw7w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2009-10-20 20:13                       ` Oren Laadan
@ 2009-10-21  6:20                       ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21  6:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > I will post a version of the patch outside this patchset with min
| > and max parameters and we can see if it can be optimized/beautified.
| 
| Thanks,
| Eric

Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:

-		nr = alloc_pidmap(tmp);
+		min = max = 0;
+		if (target_pids) {
+			min = target_pids[i];
+			max = min + 1;
+		}
+		nr = alloc_pidmap(tmp, min, max);

It does look like this patch executes a bit more code even in the common
case but let me know if you think this is better.
	
---

From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Tue, 20 Oct 2009 17:01:18 -0700
Subject: [PATCH] Add min and max parameters to alloc_pidmap()

With support for setting a specific pid number for a process, alloc_pidmap()
will need a 'min' and a 'max' parameter.

---
 kernel/pid.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..56e13c1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,30 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int min, int max)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (min < 0 || max < 0 || max > pid_max)
+		return -EINVAL;
+
+	if (!max)
+		max = pid_max;
+
+	pid = min;
+	if (pid && pid >= max)
+		return -EINVAL;
+	else if (!pid) {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
+
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +180,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					/*
+					 * 'last_pid' only makes sense when
+					 * choosing from entire range
+					 */
+					if (!min)
+						pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,22 +196,26 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset)) {
-				rc = -EAGAIN;
-				break;
-			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
+		/*
+		 * If we are back to where we started, well, no pids are
+		 * currently available in selected range.
+		 */
+		if (pid < min || unlikely(last == offset)) {
+			rc = -EAGAIN;
+			break;
+		}
 	}
 	return rc;
 }
@@ -272,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                     ` <m1r5sxsw7w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-21  6:20                       ` Sukadev Bhattiprolu
  2009-10-21  6:20                       ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21  6:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > I will post a version of the patch outside this patchset with min
| > and max parameters and we can see if it can be optimized/beautified.
| 
| Thanks,
| Eric

Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:

-		nr = alloc_pidmap(tmp);
+		min = max = 0;
+		if (target_pids) {
+			min = target_pids[i];
+			max = min + 1;
+		}
+		nr = alloc_pidmap(tmp, min, max);

It does look like this patch executes a bit more code even in the common
case but let me know if you think this is better.
	
---

From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Tue, 20 Oct 2009 17:01:18 -0700
Subject: [PATCH] Add min and max parameters to alloc_pidmap()

With support for setting a specific pid number for a process, alloc_pidmap()
will need a 'min' and a 'max' parameter.

---
 kernel/pid.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..56e13c1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,30 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int min, int max)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (min < 0 || max < 0 || max > pid_max)
+		return -EINVAL;
+
+	if (!max)
+		max = pid_max;
+
+	pid = min;
+	if (pid && pid >= max)
+		return -EINVAL;
+	else if (!pid) {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
+
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +180,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					/*
+					 * 'last_pid' only makes sense when
+					 * choosing from entire range
+					 */
+					if (!min)
+						pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,22 +196,26 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset)) {
-				rc = -EAGAIN;
-				break;
-			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
+		/*
+		 * If we are back to where we started, well, no pids are
+		 * currently available in selected range.
+		 */
+		if (pid < min || unlikely(last == offset)) {
+			rc = -EAGAIN;
+			break;
+		}
 	}
 	return rc;
 }
@@ -272,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21  6:20                       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21  6:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > I will post a version of the patch outside this patchset with min
| > and max parameters and we can see if it can be optimized/beautified.
| 
| Thanks,
| Eric

Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:

-		nr = alloc_pidmap(tmp);
+		min = max = 0;
+		if (target_pids) {
+			min = target_pids[i];
+			max = min + 1;
+		}
+		nr = alloc_pidmap(tmp, min, max);

It does look like this patch executes a bit more code even in the common
case but let me know if you think this is better.
	
---

From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Tue, 20 Oct 2009 17:01:18 -0700
Subject: [PATCH] Add min and max parameters to alloc_pidmap()

With support for setting a specific pid number for a process, alloc_pidmap()
will need a 'min' and a 'max' parameter.

---
 kernel/pid.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..56e13c1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,30 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int min, int max)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (min < 0 || max < 0 || max > pid_max)
+		return -EINVAL;
+
+	if (!max)
+		max = pid_max;
+
+	pid = min;
+	if (pid && pid >= max)
+		return -EINVAL;
+	else if (!pid) {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
+
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +180,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					/*
+					 * 'last_pid' only makes sense when
+					 * choosing from entire range
+					 */
+					if (!min)
+						pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,22 +196,26 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset)) {
-				rc = -EAGAIN;
-				break;
-			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
+		/*
+		 * If we are back to where we started, well, no pids are
+		 * currently available in selected range.
+		 */
+		if (pid < min || unlikely(last == offset)) {
+			rc = -EAGAIN;
+			break;
+		}
 	}
 	return rc;
 }
@@ -272,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]       ` <20091019213636.GB1482-+ZI9xUNit7I@public.gmane.org>
@ 2009-10-21  8:37         ` Arnd Bergmann
  2009-10-21 18:27         ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21  8:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

> 
> > +struct clone_struct {
> > +	u64 flags;
> > +	u64 child_stack;
> 
> u64 seems wrong  on 32 bit platforms. ulong?

That would make it incompatible between 64 bit kernels and
32 bit user space, requiring a wrapper. Better leave it at u64.

> > +	If a pid in the @pids list is 0, the kernel will assign the next
> > +	available pid in the pid namespace, for the process.
> > +
> > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > +	the specified pid in that namespace.  If that pid is already in use
> > +	by another process, the system call fails with -EBUSY.
> ...
> > +	On failure, clone3() returns -1 and sets 'errno' to one of following
> > +	values (the child process is not created).
> 
> Inconsistent with above. Syscalls really return -ERRCODE, errno is
> glibc magic.

Quite the opposite is true.

The man page describes what the user space sees, which is errno. Returning
-ERRCODE to libc from the kernel is part of the architecture specific
kernel ABI and should not be documented in this place but in the architecture
documentation.

> > +	pid_t pids[] = { 77, 99 };
> > +	struct clone_struct cs;
> > +
> > +	cs.flags = (u64) SIGCHLD;
> > +	cs.child_stack = (u64) setup_child_stack();
> > +	cs.nr_pids = 2;
> > +	cs.parent_tid = 0LL;
> > +	cs.child_tid = 0LL;
> > +
> > +	rc = syscall(__NR_clone3, &cs, pids);
> 
> Hmm, is there reason why pids are not at the end of struct
> clone_struct? Passing most parameters in special structure, then pids
> separately is strange...

I suggested doing that, it's a lot easier to handle fixed length data
structures than an array at the end.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]       ` <20091019213636.GB1482-+ZI9xUNit7I@public.gmane.org>
@ 2009-10-21  8:37         ` Arnd Bergmann
  2009-10-21 18:27         ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21  8:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, peterz,
	Louis.Rilling, roland, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

> 
> > +struct clone_struct {
> > +	u64 flags;
> > +	u64 child_stack;
> 
> u64 seems wrong  on 32 bit platforms. ulong?

That would make it incompatible between 64 bit kernels and
32 bit user space, requiring a wrapper. Better leave it at u64.

> > +	If a pid in the @pids list is 0, the kernel will assign the next
> > +	available pid in the pid namespace, for the process.
> > +
> > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > +	the specified pid in that namespace.  If that pid is already in use
> > +	by another process, the system call fails with -EBUSY.
> ...
> > +	On failure, clone3() returns -1 and sets 'errno' to one of following
> > +	values (the child process is not created).
> 
> Inconsistent with above. Syscalls really return -ERRCODE, errno is
> glibc magic.

Quite the opposite is true.

The man page describes what the user space sees, which is errno. Returning
-ERRCODE to libc from the kernel is part of the architecture specific
kernel ABI and should not be documented in this place but in the architecture
documentation.

> > +	pid_t pids[] = { 77, 99 };
> > +	struct clone_struct cs;
> > +
> > +	cs.flags = (u64) SIGCHLD;
> > +	cs.child_stack = (u64) setup_child_stack();
> > +	cs.nr_pids = 2;
> > +	cs.parent_tid = 0LL;
> > +	cs.child_tid = 0LL;
> > +
> > +	rc = syscall(__NR_clone3, &cs, pids);
> 
> Hmm, is there reason why pids are not at the end of struct
> clone_struct? Passing most parameters in special structure, then pids
> separately is strange...

I suggested doing that, it's a lot easier to handle fixed length data
structures than an array at the end.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-21  8:37         ` Arnd Bergmann
  0 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21  8:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Sukadev Bhattiprolu, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

> 
> > +struct clone_struct {
> > +	u64 flags;
> > +	u64 child_stack;
> 
> u64 seems wrong  on 32 bit platforms. ulong?

That would make it incompatible between 64 bit kernels and
32 bit user space, requiring a wrapper. Better leave it at u64.

> > +	If a pid in the @pids list is 0, the kernel will assign the next
> > +	available pid in the pid namespace, for the process.
> > +
> > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > +	the specified pid in that namespace.  If that pid is already in use
> > +	by another process, the system call fails with -EBUSY.
> ...
> > +	On failure, clone3() returns -1 and sets 'errno' to one of following
> > +	values (the child process is not created).
> 
> Inconsistent with above. Syscalls really return -ERRCODE, errno is
> glibc magic.

Quite the opposite is true.

The man page describes what the user space sees, which is errno. Returning
-ERRCODE to libc from the kernel is part of the architecture specific
kernel ABI and should not be documented in this place but in the architecture
documentation.

> > +	pid_t pids[] = { 77, 99 };
> > +	struct clone_struct cs;
> > +
> > +	cs.flags = (u64) SIGCHLD;
> > +	cs.child_stack = (u64) setup_child_stack();
> > +	cs.nr_pids = 2;
> > +	cs.parent_tid = 0LL;
> > +	cs.child_tid = 0LL;
> > +
> > +	rc = syscall(__NR_clone3, &cs, pids);
> 
> Hmm, is there reason why pids are not at the end of struct
> clone_struct? Passing most parameters in special structure, then pids
> separately is strange...

I suggested doing that, it's a lot easier to handle fixed length data
structures than an array at the end.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                       ` <20091021062021.GA2667-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21  9:16                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21  9:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > I will post a version of the patch outside this patchset with min
> | > and max parameters and we can see if it can be optimized/beautified.
> | 
> | Thanks,
> | Eric
>
> Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:
>
> -		nr = alloc_pidmap(tmp);
> +		min = max = 0;
> +		if (target_pids) {
> +			min = target_pids[i];
> +			max = min + 1;
> +		}
> +		nr = alloc_pidmap(tmp, min, max);
>
> It does look like this patch executes a bit more code even in the common
> case but let me know if you think this is better.
> 	
> ---

Not what I was thinking.  The following untested patch is what I was
thinking.  It just exposes last, min, and max to the callers which pass
in different appropriate values.

Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..d9b6f82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,17 +122,17 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	struct pidmap *map;
 
 	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (pid >= max)
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page)) {
 			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -153,7 +153,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -164,16 +163,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -182,6 +181,25 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (target >= pid_max)
+		return -1;
+	if (target < RESERVED_PIDS)
+		return -1;
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                       ` <20091021062021.GA2667-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21  9:16                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21  9:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | > I will post a version of the patch outside this patchset with min
> | > and max parameters and we can see if it can be optimized/beautified.
> | 
> | Thanks,
> | Eric
>
> Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:
>
> -		nr = alloc_pidmap(tmp);
> +		min = max = 0;
> +		if (target_pids) {
> +			min = target_pids[i];
> +			max = min + 1;
> +		}
> +		nr = alloc_pidmap(tmp, min, max);
>
> It does look like this patch executes a bit more code even in the common
> case but let me know if you think this is better.
> 	
> ---

Not what I was thinking.  The following untested patch is what I was
thinking.  It just exposes last, min, and max to the callers which pass
in different appropriate values.

Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..d9b6f82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,17 +122,17 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	struct pidmap *map;
 
 	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (pid >= max)
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page)) {
 			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -153,7 +153,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -164,16 +163,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -182,6 +181,25 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (target >= pid_max)
+		return -1;
+	if (target < RESERVED_PIDS)
+		return -1;
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21  9:16                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21  9:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > I will post a version of the patch outside this patchset with min
> | > and max parameters and we can see if it can be optimized/beautified.
> | 
> | Thanks,
> | Eric
>
> Here is the patch.  alloc_pid() calls alloc_pidmap() as follows:
>
> -		nr = alloc_pidmap(tmp);
> +		min = max = 0;
> +		if (target_pids) {
> +			min = target_pids[i];
> +			max = min + 1;
> +		}
> +		nr = alloc_pidmap(tmp, min, max);
>
> It does look like this patch executes a bit more code even in the common
> case but let me know if you think this is better.
> 	
> ---

Not what I was thinking.  The following untested patch is what I was
thinking.  It just exposes last, min, and max to the callers which pass
in different appropriate values.

Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..d9b6f82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,17 +122,17 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	struct pidmap *map;
 
 	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (pid >= max)
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page)) {
 			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -153,7 +153,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -164,16 +163,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -182,6 +181,25 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (target >= pid_max)
+		return -1;
+	if (target < RESERVED_PIDS)
+		return -1;
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]         ` <200910211037.38413.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-10-21  9:33           ` Pavel Machek
  0 siblings, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-21  9:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Wed 2009-10-21 10:37:38, Arnd Bergmann wrote:
> > 
> > > +struct clone_struct {
> > > +	u64 flags;
> > > +	u64 child_stack;
> > 
> > u64 seems wrong  on 32 bit platforms. ulong?
> 
> That would make it incompatible between 64 bit kernels and
> 32 bit user space, requiring a wrapper. Better leave it at u64.

Ok.

> > > +	If a pid in the @pids list is 0, the kernel will assign the next
> > > +	available pid in the pid namespace, for the process.
> > > +
> > > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > > +	the specified pid in that namespace.  If that pid is already in use
> > > +	by another process, the system call fails with -EBUSY.
> > ...
> > > +	On failure, clone3() returns -1 and sets 'errno' to one of following
> > > +	values (the child process is not created).
> > 
> > Inconsistent with above. Syscalls really return -ERRCODE, errno is
> > glibc magic.
> 
> Quite the opposite is true.

Well, it is still inconsistent. Half the docs talks -ERRCODE, half
talks -1/errno=ERRCODE.

> > > +	pid_t pids[] = { 77, 99 };
> > > +	struct clone_struct cs;
> > > +
> > > +	cs.flags = (u64) SIGCHLD;
> > > +	cs.child_stack = (u64) setup_child_stack();
> > > +	cs.nr_pids = 2;
> > > +	cs.parent_tid = 0LL;
> > > +	cs.child_tid = 0LL;
> > > +
> > > +	rc = syscall(__NR_clone3, &cs, pids);
> > 
> > Hmm, is there reason why pids are not at the end of struct
> > clone_struct? Passing most parameters in special structure, then pids
> > separately is strange...
> 
> I suggested doing that, it's a lot easier to handle fixed length data
> structures than an array at the end.

You could still put the something * pids at the end of (fixed length)
structure.

(Not that I agree with the argument, pid array is variable-length,
anyway, and inlining it at the end of structure should not make code
more complex).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
  2009-10-21  8:37         ` Arnd Bergmann
  (?)
@ 2009-10-21  9:33         ` Pavel Machek
  2009-10-21 13:26             ` Arnd Bergmann
       [not found]           ` <20091021093338.GA11670-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
  -1 siblings, 2 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-21  9:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, peterz,
	Louis.Rilling, roland, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

On Wed 2009-10-21 10:37:38, Arnd Bergmann wrote:
> > 
> > > +struct clone_struct {
> > > +	u64 flags;
> > > +	u64 child_stack;
> > 
> > u64 seems wrong  on 32 bit platforms. ulong?
> 
> That would make it incompatible between 64 bit kernels and
> 32 bit user space, requiring a wrapper. Better leave it at u64.

Ok.

> > > +	If a pid in the @pids list is 0, the kernel will assign the next
> > > +	available pid in the pid namespace, for the process.
> > > +
> > > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > > +	the specified pid in that namespace.  If that pid is already in use
> > > +	by another process, the system call fails with -EBUSY.
> > ...
> > > +	On failure, clone3() returns -1 and sets 'errno' to one of following
> > > +	values (the child process is not created).
> > 
> > Inconsistent with above. Syscalls really return -ERRCODE, errno is
> > glibc magic.
> 
> Quite the opposite is true.

Well, it is still inconsistent. Half the docs talks -ERRCODE, half
talks -1/errno=ERRCODE.

> > > +	pid_t pids[] = { 77, 99 };
> > > +	struct clone_struct cs;
> > > +
> > > +	cs.flags = (u64) SIGCHLD;
> > > +	cs.child_stack = (u64) setup_child_stack();
> > > +	cs.nr_pids = 2;
> > > +	cs.parent_tid = 0LL;
> > > +	cs.child_tid = 0LL;
> > > +
> > > +	rc = syscall(__NR_clone3, &cs, pids);
> > 
> > Hmm, is there reason why pids are not at the end of struct
> > clone_struct? Passing most parameters in special structure, then pids
> > separately is strange...
> 
> I suggested doing that, it's a lot easier to handle fixed length data
> structures than an array at the end.

You could still put the something * pids at the end of (fixed length)
structure.

(Not that I agree with the argument, pid array is variable-length,
anyway, and inlining it at the end of structure should not make code
more complex).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                       ` <cfd18e0f0910202126w1ebfa2e2u2ecbaa23b198d100-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-21 13:03                         ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-21 13:03 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>
> My question here is: what does "3" actually mean? In general, system
> calls have not followed any convention of numbering to indicate
> successive versions -- clone2() being the one possible exception that
> I know of.
>

"3" is number of arguments.  It's better than "extended" or something 
like that simply because "extended" just means "more than", and a number 
at least tells you *how much more than*.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-21  4:26                       ` Michael Kerrisk
  (?)
  (?)
@ 2009-10-21 13:03                       ` H. Peter Anvin
  2009-10-21 19:44                           ` Sukadev Bhattiprolu
                                           ` (2 more replies)
  -1 siblings, 3 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-21 13:03 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Matt Helsley, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>
> My question here is: what does "3" actually mean? In general, system
> calls have not followed any convention of numbering to indicate
> successive versions -- clone2() being the one possible exception that
> I know of.
>

"3" is number of arguments.  It's better than "extended" or something 
like that simply because "extended" just means "more than", and a number 
at least tells you *how much more than*.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]           ` <20091021093338.GA11670-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
@ 2009-10-21 13:26             ` Arnd Bergmann
  0 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21 13:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Pavel Emelyanov, Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, Alexey Dobriyan, Nathan Lynch,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Arnd Bergmann,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wednesday 21 October 2009, Pavel Machek wrote:
> > > Hmm, is there reason why pids are not at the end of struct
> > > clone_struct? Passing most parameters in special structure, then pids
> > > separately is strange...
> > 
> > I suggested doing that, it's a lot easier to handle fixed length data
> > structures than an array at the end.
> 
> You could still put the something * pids at the end of (fixed length)
> structure.
> 
> (Not that I agree with the argument, pid array is variable-length,
> anyway, and inlining it at the end of structure should not make code
> more complex).

Please read up on the v1 to v7 patches, we've been through all this
before. There are obviously a lot of ways to do it, many of which are
equally good. The important thing to note is that the user-visible
interface should *not* pass the structure into the library that provides
this function but build the structure itself, so it doesn't matter that
much.

If you put the array at the end of the data structure, the wrapper
will have to copy the array provided by the user to a temporary
buffer. Passing the array as a separate syscall argument avoids
this.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]           ` <20091021093338.GA11670-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
@ 2009-10-21 13:26             ` Arnd Bergmann
  0 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21 13:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Arnd Bergmann, Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	serue, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, peterz,
	Louis.Rilling, roland, kosaki.motohiro, randy.dunlap, linux-api,
	Containers, sukadev

On Wednesday 21 October 2009, Pavel Machek wrote:
> > > Hmm, is there reason why pids are not at the end of struct
> > > clone_struct? Passing most parameters in special structure, then pids
> > > separately is strange...
> > 
> > I suggested doing that, it's a lot easier to handle fixed length data
> > structures than an array at the end.
> 
> You could still put the something * pids at the end of (fixed length)
> structure.
> 
> (Not that I agree with the argument, pid array is variable-length,
> anyway, and inlining it at the end of structure should not make code
> more complex).

Please read up on the v1 to v7 patches, we've been through all this
before. There are obviously a lot of ways to do it, many of which are
equally good. The important thing to note is that the user-visible
interface should *not* pass the structure into the library that provides
this function but build the structure itself, so it doesn't matter that
much.

If you put the array at the end of the data structure, the wrapper
will have to copy the array provided by the user to a temporary
buffer. Passing the array as a separate syscall argument avoids
this.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-21 13:26             ` Arnd Bergmann
  0 siblings, 0 replies; 226+ messages in thread
From: Arnd Bergmann @ 2009-10-21 13:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Arnd Bergmann, Sukadev Bhattiprolu,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oren Laadan,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

On Wednesday 21 October 2009, Pavel Machek wrote:
> > > Hmm, is there reason why pids are not at the end of struct
> > > clone_struct? Passing most parameters in special structure, then pids
> > > separately is strange...
> > 
> > I suggested doing that, it's a lot easier to handle fixed length data
> > structures than an array at the end.
> 
> You could still put the something * pids at the end of (fixed length)
> structure.
> 
> (Not that I agree with the argument, pid array is variable-length,
> anyway, and inlining it at the end of structure should not make code
> more complex).

Please read up on the v1 to v7 patches, we've been through all this
before. There are obviously a lot of ways to do it, many of which are
equally good. The important thing to note is that the user-visible
interface should *not* pass the structure into the library that provides
this function but build the structure itself, so it doesn't matter that
much.

If you put the array at the end of the data structure, the wrapper
will have to copy the array provided by the user to a temporary
buffer. Passing the array as a separate syscall argument avoids
this.

	Arnd <><

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]         ` <4ADCDE7F.4090501-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-10-20  0:51           ` Matt Helsley
@ 2009-10-21 15:53           ` Daniel Lezcano
  1 sibling, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-21 15:53 UTC (permalink / raw)
  To: Oren Laadan
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
[ ... ]

>> I forgot to mention a constraint with the specified pid : P2 has to be 
>> child of P1.
>> In other word, you can not specify a pid to clonat which is not your 
>> descendant (including yourself).
>> With this constraint I think there is no security issues.
>
> Sounds dangerous. What if your descendant executed a setuid program ?

That does not happen because you inherit the context of the caller.

>> Concerning of forking on behalf of another process, we can consider it 
>> is up to the caller / programmer to know what it does. If a process in 
>
> Before the user can program with this syscall, _you_ need to define
> the semantics of this syscall. 
Yes, you are right. Here it is the proposition of the semantics.

Function prototype is:

pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);

Structure types are:

typedef int clone_flag_t;

struct clone_args {
	clone_flag_t *flags;
	int flags_size;
	u32 reserved1;
	u32 reserved2;
	u64 child_stack_base;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u64 reserved3;
};

With the helper macros:

void CLONE_SET(int flag, clone_flag_t *flags);
void CLONE_CLR(int flag, clone_flag_t *flags);
bool CLONE_ISSET(int flag, clone_flag_t *flags);
void CLONE_ZERO(flag_t *clone_flags);

And:

#define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ 
#define CLONEXT_FS      0x21
#define CLONEXT_FILES   0x22
...

The function clones the current task and reparent the child to the 
process specified in the 'pid' parameter and copy the nsproxy from it 
(if different).

If the 'hint' parameter is different from zero, then the 'hint' value 
will be the pid of the child task, otherwise a value is chosen 
automatically by the system like the usual clone. If the 'hint' is 
specified and the task id is already in use, then the call fails.

The syscall returns the child task id on success, < 0 otherwise.

The specified 'pid' _must_ be a descendant of the caller. This is more 
consistent with the inherited resources with clone in a process 
hierarchy and less dangerous than allowing to cloneat everywhere. The 
caller is at the topmost process hierarchy the cloneat is allowed.

It is not possible for the caller to wait for a process created with 
cloneat where the resulting tasks is not its direct child.

The resources have to be shared across the process hierarchy in order to 
use the right flags to clone. eg. it is not possible to create a thread 
to a child process.

For example:

  P 1
   |
  fork()__________ P 2
   |
   |
   |
   |
 cloneat(P2, 0, { ... flags[0] & CLONE_VM ... } => -EINVAL;


The cloneat syscall can be used for the following use cases:

 * checkpoint / restart:

The restart can be done with a clone(.., CLONE_NEWPID|...);
Then the new pid (aka pid 1) retrieves the proctree from the statefile 
and creates the different tasks with the process hierarchy with the 
cloneat syscall.

The proctree creation can be done from outside of the pid namespace or 
from inside.

Concerning nested pid namespaces, IMHO I would not try to checkpoint / 
restart them. The checkpoint of a nested pid namespace should be 
forbidden except for the leaf of a pid namespaces tree. That should 
allow to do partial process tree checkpoint if the application is aware 
of that and creates a new pid namespace for each subtree it wants to be 
checkpointed.

If this is too restrictive, the struct clone_args can be added with 2 
other fields "unused" for future use.

 * execute a command in a container:

If we have a container with the container init process which is usually 
a child reaper (otherwise daemons are not supported or we have zombie 
factory), we can easily cloneat(initpid, 0, ...) and exec a command. As 
the processes of the container are always reparented to the container 
init, it is safe to do that.

 * clone syscall compatibility + extended clone flags

The cloneat function can be used like the usual clone function with:

    cloneat(getpid(), 0, clone_args);

And the extended clone flags can be used.


[ ... ]

> Can you define more precisely what you mean by "enter" the container ?
>
> If you simply want create a new process in the container, you can
> achieve the same thing with a daemon, or a smart init process (in
> there), or even ptrace tricks.

Yes, you can launch a daemon inside the container, that works for a 
system container because the container is killed by killing the first 
process of the container or by a shutdown inside the container (not 
fully implemented in the kernel).
But this is unreliable for application containers, I won't enter in the 
details but the container exits when the application exits, with a 
daemon inside the container, this is no longer the case because you can 
not detect the application death as the daemon is always there.

With cloneat you restrict the life cycle of the command you launched, 
that is the container exits as soon as all the processes exited the 
container, including the spawned command itself.

> Also, there is a reason why sys_hijack() was hijacked away ... And
> I honestly think that a syscall to force another process to clone
> would be shot down by the kernel guys.
Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

>> Another point. It's another way to extend the exhausted clone  flags as 
>> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> 0, ... )
>
> Which is what the proposed new clone_....() does.
Yes, right. What I meant is we still have the clone extension feature 
you have with clone_with_pids.

>>> Note also that 'desiredpid' must be a list of pids (one for each pid
>>> namespaces that the child will belong to) and hence we need 'nr_pids'
>>> to specify the list. Given that we are limited to 6 parameters to the
>>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>>
>>> So we should do something like:
>>>
>>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>>> 		pid_t *desired_pids)
>>>
>>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>>   
>> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
>> was a debate between kernel or userspace proctree creation but it looks 
>> like it's done from the kernel with this call.
>
> It isn't multiple clones in one clone. The syscall creates *one* single
> process. We just ask the kernel to assign a specific pid to that process.
>
> And because processes that live in nested pid-namespace own multiple
> pids (one for each level), we need to specify multiple pids (one for
> each level) for this single process that we create.
>
> Then, to create the entire restart process tree, we have to call this
> system call as many times as the number of processes to restart.
>
> And yes, this is all done in userspace. The _only_ kernel-space help
> is an interface to request specific pid(s) for each restarted process.
Aaah, Ok ! I understand better now. Thanks for clarifying this point.

>> I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> "PROCTREE".
Ok. Never mind :)


[ ... ]

>
> At this point, I really couldn't care less about how we name the
> new syscall.
>
> Check out the containers IRC channel for the latest pearls:
> clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
> the prominent contenders, together with xerox() and ditto()...
>
> There you go; these should be enough to keep the discussion
> around on life support for at least another week.
>
> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.
Amen and Alea Jacta Est :)

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]         ` <4ADCDE7F.4090501-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-21 15:53           ` Daniel Lezcano
  2009-10-21 15:53           ` Daniel Lezcano
  1 sibling, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-21 15:53 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
[ ... ]

>> I forgot to mention a constraint with the specified pid : P2 has to be 
>> child of P1.
>> In other word, you can not specify a pid to clonat which is not your 
>> descendant (including yourself).
>> With this constraint I think there is no security issues.
>
> Sounds dangerous. What if your descendant executed a setuid program ?

That does not happen because you inherit the context of the caller.

>> Concerning of forking on behalf of another process, we can consider it 
>> is up to the caller / programmer to know what it does. If a process in 
>
> Before the user can program with this syscall, _you_ need to define
> the semantics of this syscall. 
Yes, you are right. Here it is the proposition of the semantics.

Function prototype is:

pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);

Structure types are:

typedef int clone_flag_t;

struct clone_args {
	clone_flag_t *flags;
	int flags_size;
	u32 reserved1;
	u32 reserved2;
	u64 child_stack_base;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u64 reserved3;
};

With the helper macros:

void CLONE_SET(int flag, clone_flag_t *flags);
void CLONE_CLR(int flag, clone_flag_t *flags);
bool CLONE_ISSET(int flag, clone_flag_t *flags);
void CLONE_ZERO(flag_t *clone_flags);

And:

#define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ 
#define CLONEXT_FS      0x21
#define CLONEXT_FILES   0x22
...

The function clones the current task and reparent the child to the 
process specified in the 'pid' parameter and copy the nsproxy from it 
(if different).

If the 'hint' parameter is different from zero, then the 'hint' value 
will be the pid of the child task, otherwise a value is chosen 
automatically by the system like the usual clone. If the 'hint' is 
specified and the task id is already in use, then the call fails.

The syscall returns the child task id on success, < 0 otherwise.

The specified 'pid' _must_ be a descendant of the caller. This is more 
consistent with the inherited resources with clone in a process 
hierarchy and less dangerous than allowing to cloneat everywhere. The 
caller is at the topmost process hierarchy the cloneat is allowed.

It is not possible for the caller to wait for a process created with 
cloneat where the resulting tasks is not its direct child.

The resources have to be shared across the process hierarchy in order to 
use the right flags to clone. eg. it is not possible to create a thread 
to a child process.

For example:

  P 1
   |
  fork()__________ P 2
   |
   |
   |
   |
 cloneat(P2, 0, { ... flags[0] & CLONE_VM ... } => -EINVAL;


The cloneat syscall can be used for the following use cases:

 * checkpoint / restart:

The restart can be done with a clone(.., CLONE_NEWPID|...);
Then the new pid (aka pid 1) retrieves the proctree from the statefile 
and creates the different tasks with the process hierarchy with the 
cloneat syscall.

The proctree creation can be done from outside of the pid namespace or 
from inside.

Concerning nested pid namespaces, IMHO I would not try to checkpoint / 
restart them. The checkpoint of a nested pid namespace should be 
forbidden except for the leaf of a pid namespaces tree. That should 
allow to do partial process tree checkpoint if the application is aware 
of that and creates a new pid namespace for each subtree it wants to be 
checkpointed.

If this is too restrictive, the struct clone_args can be added with 2 
other fields "unused" for future use.

 * execute a command in a container:

If we have a container with the container init process which is usually 
a child reaper (otherwise daemons are not supported or we have zombie 
factory), we can easily cloneat(initpid, 0, ...) and exec a command. As 
the processes of the container are always reparented to the container 
init, it is safe to do that.

 * clone syscall compatibility + extended clone flags

The cloneat function can be used like the usual clone function with:

    cloneat(getpid(), 0, clone_args);

And the extended clone flags can be used.


[ ... ]

> Can you define more precisely what you mean by "enter" the container ?
>
> If you simply want create a new process in the container, you can
> achieve the same thing with a daemon, or a smart init process (in
> there), or even ptrace tricks.

Yes, you can launch a daemon inside the container, that works for a 
system container because the container is killed by killing the first 
process of the container or by a shutdown inside the container (not 
fully implemented in the kernel).
But this is unreliable for application containers, I won't enter in the 
details but the container exits when the application exits, with a 
daemon inside the container, this is no longer the case because you can 
not detect the application death as the daemon is always there.

With cloneat you restrict the life cycle of the command you launched, 
that is the container exits as soon as all the processes exited the 
container, including the spawned command itself.

> Also, there is a reason why sys_hijack() was hijacked away ... And
> I honestly think that a syscall to force another process to clone
> would be shot down by the kernel guys.
Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

>> Another point. It's another way to extend the exhausted clone  flags as 
>> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> 0, ... )
>
> Which is what the proposed new clone_....() does.
Yes, right. What I meant is we still have the clone extension feature 
you have with clone_with_pids.

>>> Note also that 'desiredpid' must be a list of pids (one for each pid
>>> namespaces that the child will belong to) and hence we need 'nr_pids'
>>> to specify the list. Given that we are limited to 6 parameters to the
>>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>>
>>> So we should do something like:
>>>
>>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>>> 		pid_t *desired_pids)
>>>
>>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>>   
>> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
>> was a debate between kernel or userspace proctree creation but it looks 
>> like it's done from the kernel with this call.
>
> It isn't multiple clones in one clone. The syscall creates *one* single
> process. We just ask the kernel to assign a specific pid to that process.
>
> And because processes that live in nested pid-namespace own multiple
> pids (one for each level), we need to specify multiple pids (one for
> each level) for this single process that we create.
>
> Then, to create the entire restart process tree, we have to call this
> system call as many times as the number of processes to restart.
>
> And yes, this is all done in userspace. The _only_ kernel-space help
> is an interface to request specific pid(s) for each restarted process.
Aaah, Ok ! I understand better now. Thanks for clarifying this point.

>> I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> "PROCTREE".
Ok. Never mind :)


[ ... ]

>
> At this point, I really couldn't care less about how we name the
> new syscall.
>
> Check out the containers IRC channel for the latest pearls:
> clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
> the prominent contenders, together with xerox() and ditto()...
>
> There you go; these should be enough to keep the discussion
> around on life support for at least another week.
>
> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.
Amen and Alea Jacta Est :)

Thanks
  -- Daniel



^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21 15:53           ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-21 15:53 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
[ ... ]

>> I forgot to mention a constraint with the specified pid : P2 has to be 
>> child of P1.
>> In other word, you can not specify a pid to clonat which is not your 
>> descendant (including yourself).
>> With this constraint I think there is no security issues.
>
> Sounds dangerous. What if your descendant executed a setuid program ?

That does not happen because you inherit the context of the caller.

>> Concerning of forking on behalf of another process, we can consider it 
>> is up to the caller / programmer to know what it does. If a process in 
>
> Before the user can program with this syscall, _you_ need to define
> the semantics of this syscall. 
Yes, you are right. Here it is the proposition of the semantics.

Function prototype is:

pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);

Structure types are:

typedef int clone_flag_t;

struct clone_args {
	clone_flag_t *flags;
	int flags_size;
	u32 reserved1;
	u32 reserved2;
	u64 child_stack_base;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u64 reserved3;
};

With the helper macros:

void CLONE_SET(int flag, clone_flag_t *flags);
void CLONE_CLR(int flag, clone_flag_t *flags);
bool CLONE_ISSET(int flag, clone_flag_t *flags);
void CLONE_ZERO(flag_t *clone_flags);

And:

#define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ 
#define CLONEXT_FS      0x21
#define CLONEXT_FILES   0x22
...

The function clones the current task and reparent the child to the 
process specified in the 'pid' parameter and copy the nsproxy from it 
(if different).

If the 'hint' parameter is different from zero, then the 'hint' value 
will be the pid of the child task, otherwise a value is chosen 
automatically by the system like the usual clone. If the 'hint' is 
specified and the task id is already in use, then the call fails.

The syscall returns the child task id on success, < 0 otherwise.

The specified 'pid' _must_ be a descendant of the caller. This is more 
consistent with the inherited resources with clone in a process 
hierarchy and less dangerous than allowing to cloneat everywhere. The 
caller is at the topmost process hierarchy the cloneat is allowed.

It is not possible for the caller to wait for a process created with 
cloneat where the resulting tasks is not its direct child.

The resources have to be shared across the process hierarchy in order to 
use the right flags to clone. eg. it is not possible to create a thread 
to a child process.

For example:

  P 1
   |
  fork()__________ P 2
   |
   |
   |
   |
 cloneat(P2, 0, { ... flags[0] & CLONE_VM ... } => -EINVAL;


The cloneat syscall can be used for the following use cases:

 * checkpoint / restart:

The restart can be done with a clone(.., CLONE_NEWPID|...);
Then the new pid (aka pid 1) retrieves the proctree from the statefile 
and creates the different tasks with the process hierarchy with the 
cloneat syscall.

The proctree creation can be done from outside of the pid namespace or 
from inside.

Concerning nested pid namespaces, IMHO I would not try to checkpoint / 
restart them. The checkpoint of a nested pid namespace should be 
forbidden except for the leaf of a pid namespaces tree. That should 
allow to do partial process tree checkpoint if the application is aware 
of that and creates a new pid namespace for each subtree it wants to be 
checkpointed.

If this is too restrictive, the struct clone_args can be added with 2 
other fields "unused" for future use.

 * execute a command in a container:

If we have a container with the container init process which is usually 
a child reaper (otherwise daemons are not supported or we have zombie 
factory), we can easily cloneat(initpid, 0, ...) and exec a command. As 
the processes of the container are always reparented to the container 
init, it is safe to do that.

 * clone syscall compatibility + extended clone flags

The cloneat function can be used like the usual clone function with:

    cloneat(getpid(), 0, clone_args);

And the extended clone flags can be used.


[ ... ]

> Can you define more precisely what you mean by "enter" the container ?
>
> If you simply want create a new process in the container, you can
> achieve the same thing with a daemon, or a smart init process (in
> there), or even ptrace tricks.

Yes, you can launch a daemon inside the container, that works for a 
system container because the container is killed by killing the first 
process of the container or by a shutdown inside the container (not 
fully implemented in the kernel).
But this is unreliable for application containers, I won't enter in the 
details but the container exits when the application exits, with a 
daemon inside the container, this is no longer the case because you can 
not detect the application death as the daemon is always there.

With cloneat you restrict the life cycle of the command you launched, 
that is the container exits as soon as all the processes exited the 
container, including the spawned command itself.

> Also, there is a reason why sys_hijack() was hijacked away ... And
> I honestly think that a syscall to force another process to clone
> would be shot down by the kernel guys.
Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

>> Another point. It's another way to extend the exhausted clone  flags as 
>> the cloneat can be called as a compatibility way, with cloneat(getpid(), 
>> 0, ... )
>
> Which is what the proposed new clone_....() does.
Yes, right. What I meant is we still have the clone extension feature 
you have with clone_with_pids.

>>> Note also that 'desiredpid' must be a list of pids (one for each pid
>>> namespaces that the child will belong to) and hence we need 'nr_pids'
>>> to specify the list. Given that we are limited to 6 parameters to the
>>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>>
>>> So we should do something like:
>>>
>>> 	sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>>> 		pid_t *desired_pids)
>>>
>>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>>   
>> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there 
>> was a debate between kernel or userspace proctree creation but it looks 
>> like it's done from the kernel with this call.
>
> It isn't multiple clones in one clone. The syscall creates *one* single
> process. We just ask the kernel to assign a specific pid to that process.
>
> And because processes that live in nested pid-namespace own multiple
> pids (one for each level), we need to specify multiple pids (one for
> each level) for this single process that we create.
>
> Then, to create the entire restart process tree, we have to call this
> system call as many times as the number of processes to restart.
>
> And yes, this is all done in userspace. The _only_ kernel-space help
> is an interface to request specific pid(s) for each restarted process.
Aaah, Ok ! I understand better now. Thanks for clarifying this point.

>> I don't really see a difference between sys_restart(pid_t pid , int fd, 
>> long flags) where pid_t is the topmost in the hierarchy, fd is a file 
>> descriptor to a structure "pid_t * + struct clone_args *" and flags is 
>> "PROCTREE".
Ok. Never mind :)


[ ... ]

>
> At this point, I really couldn't care less about how we name the
> new syscall.
>
> Check out the containers IRC channel for the latest pearls:
> clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
> the prominent contenders, together with xerox() and ditto()...
>
> There you go; these should be enough to keep the discussion
> around on life support for at least another week.
>
> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.
Amen and Alea Jacta Est :)

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]       ` <20091019213636.GB1482-+ZI9xUNit7I@public.gmane.org>
  2009-10-21  8:37         ` Arnd Bergmann
@ 2009-10-21 18:27         ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Pavel Machek [pavel-+ZI9xUNit7I@public.gmane.org] wrote:
| Hi!
| 
| > This gives a brief overview of the clone3() system call.  We should
| 
| Thanks!
| 
| > eventually describe more details in existing clone(2) man page or in
| > a new man page.
| 
| M. Kerrisk (see MAINTAINERS) maintains man pages...

Ok. I copied linux-api and M. Kerrisk is looking at this patchset.

| 
| > Changelog[v8]:
| > 	- clone2() is already in use in IA64. Rename syscall to clone3()
| ...
| > Index: linux-2.6/Documentation/clone2
| > ===================================================================
| > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
| > +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700
| 
| clone3?

Ah, yes, will fix.

| 
| > +struct clone_struct {
| > +	u64 flags;
| > +	u64 child_stack;
| 
| u64 seems wrong  on 32 bit platforms. ulong?
| 
| > +	u32 nr_pids;
| 
| So nr_pids is either 1 or 2?

No. With pid namespaces, which can be nested to arbitrary levels, each
process has several pid numbers - one in its own namespace and one in
each ancestor pid namespaces.

nr_pids specifies the number of pids the user cares about. IOW, if you
checkpoint a process that is 3 levels below the init-pid-ns, and you
care about the pids for those three levels, you would pass in an
array of 4 pid_ts and nr_pids would be 4.

| 
| > +	u32 reserved1;
| > +	u64 parent_tid;
| > +	u64 child_tid;
| > +	u64 reserved2;
| > +};
| > +
| 
| 
| > +	See CLONE_NEWPID section of clone(2) man page for details about pid
| > +	namespaces.
| > +
| > +	The order pids in @pids corresponds to the nesting order of pid-
| > +	namespaces, with @pids[0] corresponding to the init_pid_ns.
| 
| Ok, so I'm confused.

Hope the above note on pid-namespaces helps, if not let me know.

| 
| > +	If a pid in the @pids list is 0, the kernel will assign the next
| > +	available pid in the pid namespace, for the process.
| > +
| > +	If a pid in the @pids list is non-zero, the kernel tries to assign
| > +	the specified pid in that namespace.  If that pid is already in use
| > +	by another process, the system call fails with -EBUSY.
| ...
| > +	On failure, clone3() returns -1 and sets 'errno' to one of following
| > +	values (the child process is not created).
| 
| Inconsistent with above. Syscalls really return -ERRCODE, errno is
| glibc magic.

Ok.  Will make it consistent to say that it returns one of the error codes.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]       ` <20091019213636.GB1482-+ZI9xUNit7I@public.gmane.org>
@ 2009-10-21 18:27         ` Sukadev Bhattiprolu
  2009-10-21 18:27         ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Louis.Rilling, roland,
	kosaki.motohiro, randy.dunlap, linux-api, Containers, sukadev

Pavel Machek [pavel@ucw.cz] wrote:
| Hi!
| 
| > This gives a brief overview of the clone3() system call.  We should
| 
| Thanks!
| 
| > eventually describe more details in existing clone(2) man page or in
| > a new man page.
| 
| M. Kerrisk (see MAINTAINERS) maintains man pages...

Ok. I copied linux-api and M. Kerrisk is looking at this patchset.

| 
| > Changelog[v8]:
| > 	- clone2() is already in use in IA64. Rename syscall to clone3()
| ...
| > Index: linux-2.6/Documentation/clone2
| > ===================================================================
| > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
| > +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700
| 
| clone3?

Ah, yes, will fix.

| 
| > +struct clone_struct {
| > +	u64 flags;
| > +	u64 child_stack;
| 
| u64 seems wrong  on 32 bit platforms. ulong?
| 
| > +	u32 nr_pids;
| 
| So nr_pids is either 1 or 2?

No. With pid namespaces, which can be nested to arbitrary levels, each
process has several pid numbers - one in its own namespace and one in
each ancestor pid namespaces.

nr_pids specifies the number of pids the user cares about. IOW, if you
checkpoint a process that is 3 levels below the init-pid-ns, and you
care about the pids for those three levels, you would pass in an
array of 4 pid_ts and nr_pids would be 4.

| 
| > +	u32 reserved1;
| > +	u64 parent_tid;
| > +	u64 child_tid;
| > +	u64 reserved2;
| > +};
| > +
| 
| 
| > +	See CLONE_NEWPID section of clone(2) man page for details about pid
| > +	namespaces.
| > +
| > +	The order pids in @pids corresponds to the nesting order of pid-
| > +	namespaces, with @pids[0] corresponding to the init_pid_ns.
| 
| Ok, so I'm confused.

Hope the above note on pid-namespaces helps, if not let me know.

| 
| > +	If a pid in the @pids list is 0, the kernel will assign the next
| > +	available pid in the pid namespace, for the process.
| > +
| > +	If a pid in the @pids list is non-zero, the kernel tries to assign
| > +	the specified pid in that namespace.  If that pid is already in use
| > +	by another process, the system call fails with -EBUSY.
| ...
| > +	On failure, clone3() returns -1 and sets 'errno' to one of following
| > +	values (the child process is not created).
| 
| Inconsistent with above. Syscalls really return -ERRCODE, errno is
| glibc magic.

Ok.  Will make it consistent to say that it returns one of the error codes.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
@ 2009-10-21 18:27         ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oren Laadan,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

Pavel Machek [pavel-+ZI9xUNit7I@public.gmane.org] wrote:
| Hi!
| 
| > This gives a brief overview of the clone3() system call.  We should
| 
| Thanks!
| 
| > eventually describe more details in existing clone(2) man page or in
| > a new man page.
| 
| M. Kerrisk (see MAINTAINERS) maintains man pages...

Ok. I copied linux-api and M. Kerrisk is looking at this patchset.

| 
| > Changelog[v8]:
| > 	- clone2() is already in use in IA64. Rename syscall to clone3()
| ...
| > Index: linux-2.6/Documentation/clone2
| > ===================================================================
| > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
| > +++ linux-2.6/Documentation/clone2	2009-10-12 19:54:38.000000000 -0700
| 
| clone3?

Ah, yes, will fix.

| 
| > +struct clone_struct {
| > +	u64 flags;
| > +	u64 child_stack;
| 
| u64 seems wrong  on 32 bit platforms. ulong?
| 
| > +	u32 nr_pids;
| 
| So nr_pids is either 1 or 2?

No. With pid namespaces, which can be nested to arbitrary levels, each
process has several pid numbers - one in its own namespace and one in
each ancestor pid namespaces.

nr_pids specifies the number of pids the user cares about. IOW, if you
checkpoint a process that is 3 levels below the init-pid-ns, and you
care about the pids for those three levels, you would pass in an
array of 4 pid_ts and nr_pids would be 4.

| 
| > +	u32 reserved1;
| > +	u64 parent_tid;
| > +	u64 child_tid;
| > +	u64 reserved2;
| > +};
| > +
| 
| 
| > +	See CLONE_NEWPID section of clone(2) man page for details about pid
| > +	namespaces.
| > +
| > +	The order pids in @pids corresponds to the nesting order of pid-
| > +	namespaces, with @pids[0] corresponding to the init_pid_ns.
| 
| Ok, so I'm confused.

Hope the above note on pid-namespaces helps, if not let me know.

| 
| > +	If a pid in the @pids list is 0, the kernel will assign the next
| > +	available pid in the pid namespace, for the process.
| > +
| > +	If a pid in the @pids list is non-zero, the kernel tries to assign
| > +	the specified pid in that namespace.  If that pid is already in use
| > +	by another process, the system call fails with -EBUSY.
| ...
| > +	On failure, clone3() returns -1 and sets 'errno' to one of following
| > +	values (the child process is not created).
| 
| Inconsistent with above. Syscalls really return -ERRCODE, errno is
| glibc magic.

Ok.  Will make it consistent to say that it returns one of the error codes.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]           ` <4ADF2E75.1020801-GANU6spQydw@public.gmane.org>
@ 2009-10-21 18:45             ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-21 18:45 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Daniel Lezcano wrote:
> Oren Laadan wrote:
>>
>> Daniel Lezcano wrote:
> [ ... ]
> 
>>> I forgot to mention a constraint with the specified pid : P2 has to
>>> be child of P1.
>>> In other word, you can not specify a pid to clonat which is not your
>>> descendant (including yourself).
>>> With this constraint I think there is no security issues.
>>
>> Sounds dangerous. What if your descendant executed a setuid program ?
> 
> That does not happen because you inherit the context of the caller.
> 
>>> Concerning of forking on behalf of another process, we can consider
>>> it is up to the caller / programmer to know what it does. If a
>>> process in 
>>
>> Before the user can program with this syscall, _you_ need to define
>> the semantics of this syscall. 
> Yes, you are right. Here it is the proposition of the semantics.
> 
> Function prototype is:
> 
> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
> 
> Structure types are:
> 
> typedef int clone_flag_t;
> 
> struct clone_args {
>     clone_flag_t *flags;
>     int flags_size;
>     u32 reserved1;
>     u32 reserved2;
>     u64 child_stack_base;
>     u64 child_stack_size;
>     u64 parent_tid_ptr;
>     u64 child_tid_ptr;
>     u64 reserved3;
> };
> 
> With the helper macros:
> 
> void CLONE_SET(int flag, clone_flag_t *flags);
> void CLONE_CLR(int flag, clone_flag_t *flags);
> bool CLONE_ISSET(int flag, clone_flag_t *flags);
> void CLONE_ZERO(flag_t *clone_flags);
> 
> And:
> 
> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
> 0x21
> #define CLONEXT_FILES   0x22
> ...
> 

The main motivation for your new syscall is to make it possible to
inject a process into a namespace. IOW, what you are proposing is
a new incarnation of sys_hijack().

This is _orthogonal_ to the current discussion, which is about an
extension for clone to allow (a) choosing target pid(s), (b) more
flags, and (c) future extensions.

(Your suggested syscall may, too, allow the request a specific set
of pids for the child process, and reuse the current code for that).

I suggest that you start a new thread about your RFC. This will
reduce distractions on the current thread, and bring more focus to
your proposal. I surely will post some comments there :)

[...]

> The cloneat syscall can be used for the following use cases:
> 
> * checkpoint / restart:
> 
> The restart can be done with a clone(.., CLONE_NEWPID|...);
> Then the new pid (aka pid 1) retrieves the proctree from the statefile
> and creates the different tasks with the process hierarchy with the
> cloneat syscall.

s/cloneat/$CLONE3/
(hint: this is how it's done now)

> 
> The proctree creation can be done from outside of the pid namespace or
> from inside.

Ew .. why would you do that ?

> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
> restart them. The checkpoint of a nested pid namespace should be
> forbidden except for the leaf of a pid namespaces tree. That should

Others (me included) *will* try and may get upset if forbidden...
Seriously, there is no technical reason to restrict this.

 >> Can you define more precisely what you mean by "enter" the container ?
>>
>> If you simply want create a new process in the container, you can
>> achieve the same thing with a daemon, or a smart init process (in
>> there), or even ptrace tricks.
> 
> Yes, you can launch a daemon inside the container, that works for a
> system container because the container is killed by killing the first
> process of the container or by a shutdown inside the container (not
> fully implemented in the kernel).
> But this is unreliable for application containers, I won't enter in the
> details but the container exits when the application exits, with a
> daemon inside the container, this is no longer the case because you can
> not detect the application death as the daemon is always there.
> 
> With cloneat you restrict the life cycle of the command you launched,
> that is the container exits as soon as all the processes exited the
> container, including the spawned command itself.

Then start a daemon _in addition_ to the application, or write a
daemon that will launch the application and monitor it... And also
there is ptrace -

But, please let's take this off to a new thread about adding how to
add a process into a namespace from the outside. FYI, I do think
such an interface may be useful and nicer than the two alternatives
I suggested above.

>> Also, there is a reason why sys_hijack() was hijacked away ... And
>> I honestly think that a syscall to force another process to clone
>> would be shot down by the kernel guys.
> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

Actually, I misread previously; I mean not forcing another process
to clone, but instead forcing another process to become a parent (and
I shall ignore the ethical issues :)

I still suspect it won't be welcome. Several people would have liked
to see CLONE_PARENT go away, too, if that was possible without breaking
userspace applications. Yet another reason to take it to a discussion
of its own.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]           ` <4ADF2E75.1020801-GANU6spQydw@public.gmane.org>
@ 2009-10-21 18:45             ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-21 18:45 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov



Daniel Lezcano wrote:
> Oren Laadan wrote:
>>
>> Daniel Lezcano wrote:
> [ ... ]
> 
>>> I forgot to mention a constraint with the specified pid : P2 has to
>>> be child of P1.
>>> In other word, you can not specify a pid to clonat which is not your
>>> descendant (including yourself).
>>> With this constraint I think there is no security issues.
>>
>> Sounds dangerous. What if your descendant executed a setuid program ?
> 
> That does not happen because you inherit the context of the caller.
> 
>>> Concerning of forking on behalf of another process, we can consider
>>> it is up to the caller / programmer to know what it does. If a
>>> process in 
>>
>> Before the user can program with this syscall, _you_ need to define
>> the semantics of this syscall. 
> Yes, you are right. Here it is the proposition of the semantics.
> 
> Function prototype is:
> 
> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
> 
> Structure types are:
> 
> typedef int clone_flag_t;
> 
> struct clone_args {
>     clone_flag_t *flags;
>     int flags_size;
>     u32 reserved1;
>     u32 reserved2;
>     u64 child_stack_base;
>     u64 child_stack_size;
>     u64 parent_tid_ptr;
>     u64 child_tid_ptr;
>     u64 reserved3;
> };
> 
> With the helper macros:
> 
> void CLONE_SET(int flag, clone_flag_t *flags);
> void CLONE_CLR(int flag, clone_flag_t *flags);
> bool CLONE_ISSET(int flag, clone_flag_t *flags);
> void CLONE_ZERO(flag_t *clone_flags);
> 
> And:
> 
> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
> 0x21
> #define CLONEXT_FILES   0x22
> ...
> 

The main motivation for your new syscall is to make it possible to
inject a process into a namespace. IOW, what you are proposing is
a new incarnation of sys_hijack().

This is _orthogonal_ to the current discussion, which is about an
extension for clone to allow (a) choosing target pid(s), (b) more
flags, and (c) future extensions.

(Your suggested syscall may, too, allow the request a specific set
of pids for the child process, and reuse the current code for that).

I suggest that you start a new thread about your RFC. This will
reduce distractions on the current thread, and bring more focus to
your proposal. I surely will post some comments there :)

[...]

> The cloneat syscall can be used for the following use cases:
> 
> * checkpoint / restart:
> 
> The restart can be done with a clone(.., CLONE_NEWPID|...);
> Then the new pid (aka pid 1) retrieves the proctree from the statefile
> and creates the different tasks with the process hierarchy with the
> cloneat syscall.

s/cloneat/$CLONE3/
(hint: this is how it's done now)

> 
> The proctree creation can be done from outside of the pid namespace or
> from inside.

Ew .. why would you do that ?

> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
> restart them. The checkpoint of a nested pid namespace should be
> forbidden except for the leaf of a pid namespaces tree. That should

Others (me included) *will* try and may get upset if forbidden...
Seriously, there is no technical reason to restrict this.

 >> Can you define more precisely what you mean by "enter" the container ?
>>
>> If you simply want create a new process in the container, you can
>> achieve the same thing with a daemon, or a smart init process (in
>> there), or even ptrace tricks.
> 
> Yes, you can launch a daemon inside the container, that works for a
> system container because the container is killed by killing the first
> process of the container or by a shutdown inside the container (not
> fully implemented in the kernel).
> But this is unreliable for application containers, I won't enter in the
> details but the container exits when the application exits, with a
> daemon inside the container, this is no longer the case because you can
> not detect the application death as the daemon is always there.
> 
> With cloneat you restrict the life cycle of the command you launched,
> that is the container exits as soon as all the processes exited the
> container, including the spawned command itself.

Then start a daemon _in addition_ to the application, or write a
daemon that will launch the application and monitor it... And also
there is ptrace -

But, please let's take this off to a new thread about adding how to
add a process into a namespace from the outside. FYI, I do think
such an interface may be useful and nicer than the two alternatives
I suggested above.

>> Also, there is a reason why sys_hijack() was hijacked away ... And
>> I honestly think that a syscall to force another process to clone
>> would be shot down by the kernel guys.
> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

Actually, I misread previously; I mean not forcing another process
to clone, but instead forcing another process to become a parent (and
I shall ignore the ethical issues :)

I still suspect it won't be welcome. Several people would have liked
to see CLONE_PARENT go away, too, if that was possible without breaking
userspace applications. Yet another reason to take it to a discussion
of its own.

Oren.



^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21 18:45             ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-21 18:45 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Daniel Lezcano wrote:
> Oren Laadan wrote:
>>
>> Daniel Lezcano wrote:
> [ ... ]
> 
>>> I forgot to mention a constraint with the specified pid : P2 has to
>>> be child of P1.
>>> In other word, you can not specify a pid to clonat which is not your
>>> descendant (including yourself).
>>> With this constraint I think there is no security issues.
>>
>> Sounds dangerous. What if your descendant executed a setuid program ?
> 
> That does not happen because you inherit the context of the caller.
> 
>>> Concerning of forking on behalf of another process, we can consider
>>> it is up to the caller / programmer to know what it does. If a
>>> process in 
>>
>> Before the user can program with this syscall, _you_ need to define
>> the semantics of this syscall. 
> Yes, you are right. Here it is the proposition of the semantics.
> 
> Function prototype is:
> 
> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
> 
> Structure types are:
> 
> typedef int clone_flag_t;
> 
> struct clone_args {
>     clone_flag_t *flags;
>     int flags_size;
>     u32 reserved1;
>     u32 reserved2;
>     u64 child_stack_base;
>     u64 child_stack_size;
>     u64 parent_tid_ptr;
>     u64 child_tid_ptr;
>     u64 reserved3;
> };
> 
> With the helper macros:
> 
> void CLONE_SET(int flag, clone_flag_t *flags);
> void CLONE_CLR(int flag, clone_flag_t *flags);
> bool CLONE_ISSET(int flag, clone_flag_t *flags);
> void CLONE_ZERO(flag_t *clone_flags);
> 
> And:
> 
> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
> 0x21
> #define CLONEXT_FILES   0x22
> ...
> 

The main motivation for your new syscall is to make it possible to
inject a process into a namespace. IOW, what you are proposing is
a new incarnation of sys_hijack().

This is _orthogonal_ to the current discussion, which is about an
extension for clone to allow (a) choosing target pid(s), (b) more
flags, and (c) future extensions.

(Your suggested syscall may, too, allow the request a specific set
of pids for the child process, and reuse the current code for that).

I suggest that you start a new thread about your RFC. This will
reduce distractions on the current thread, and bring more focus to
your proposal. I surely will post some comments there :)

[...]

> The cloneat syscall can be used for the following use cases:
> 
> * checkpoint / restart:
> 
> The restart can be done with a clone(.., CLONE_NEWPID|...);
> Then the new pid (aka pid 1) retrieves the proctree from the statefile
> and creates the different tasks with the process hierarchy with the
> cloneat syscall.

s/cloneat/$CLONE3/
(hint: this is how it's done now)

> 
> The proctree creation can be done from outside of the pid namespace or
> from inside.

Ew .. why would you do that ?

> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
> restart them. The checkpoint of a nested pid namespace should be
> forbidden except for the leaf of a pid namespaces tree. That should

Others (me included) *will* try and may get upset if forbidden...
Seriously, there is no technical reason to restrict this.

 >> Can you define more precisely what you mean by "enter" the container ?
>>
>> If you simply want create a new process in the container, you can
>> achieve the same thing with a daemon, or a smart init process (in
>> there), or even ptrace tricks.
> 
> Yes, you can launch a daemon inside the container, that works for a
> system container because the container is killed by killing the first
> process of the container or by a shutdown inside the container (not
> fully implemented in the kernel).
> But this is unreliable for application containers, I won't enter in the
> details but the container exits when the application exits, with a
> daemon inside the container, this is no longer the case because you can
> not detect the application death as the daemon is always there.
> 
> With cloneat you restrict the life cycle of the command you launched,
> that is the container exits as soon as all the processes exited the
> container, including the spawned command itself.

Then start a daemon _in addition_ to the application, or write a
daemon that will launch the application and monitor it... And also
there is ptrace -

But, please let's take this off to a new thread about adding how to
add a process into a namespace from the outside. FYI, I do think
such an interface may be useful and nicer than the two alternatives
I suggested above.

>> Also, there is a reason why sys_hijack() was hijacked away ... And
>> I honestly think that a syscall to force another process to clone
>> would be shot down by the kernel guys.
> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

Actually, I misread previously; I mean not forcing another process
to clone, but instead forcing another process to become a parent (and
I shall ignore the ethical issues :)

I still suspect it won't be welcome. Several people would have liked
to see CLONE_PARENT go away, too, if that was possible without breaking
userspace applications. Yet another reason to take it to a discussion
of its own.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                         ` <m1eioxrtsb.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-21 18:52                           ` Sukadev Bhattiprolu
  2009-10-23  0:42                           ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| Not what I was thinking.  The following untested patch is what I was
| thinking.  It just exposes last, min, and max to the callers which pass
| in different appropriate values.


Minor comments are that the caller has to choose which function to call
and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
unnatural. But I can't think of a better way and I think this will work.
Will test it out.

Pavel Emelyanov - you too had some comments about this part of my code.
Let me know if this works for you. 

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                         ` <m1eioxrtsb.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-21 18:52                           ` Sukadev Bhattiprolu
  2009-10-23  0:42                           ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| Not what I was thinking.  The following untested patch is what I was
| thinking.  It just exposes last, min, and max to the callers which pass
| in different appropriate values.


Minor comments are that the caller has to choose which function to call
and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
unnatural. But I can't think of a better way and I think this will work.
Will test it out.

Pavel Emelyanov - you too had some comments about this part of my code.
Let me know if this works for you. 

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21 18:52                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 18:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| Not what I was thinking.  The following untested patch is what I was
| thinking.  It just exposes last, min, and max to the callers which pass
| in different appropriate values.


Minor comments are that the caller has to choose which function to call
and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
unnatural. But I can't think of a better way and I think this will work.
Will test it out.

Pavel Emelyanov - you too had some comments about this part of my code.
Let me know if this works for you. 

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]             ` <200910211526.50584.arnd-r2nGTMty4D4@public.gmane.org>
  2009-10-21 19:09               ` Pavel Machek
@ 2009-10-21 19:09               ` Pavel Machek
  1 sibling, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-21 19:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Containers, Nathan Lynch,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

> If you put the array at the end of the data structure, the wrapper
> will have to copy the array provided by the user to a temporary
> buffer. Passing the array as a separate syscall argument avoids
> this.

Ok, sounds sane, sorry for noise.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall
       [not found]             ` <200910211526.50584.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-10-21 19:09               ` Pavel Machek
  2009-10-21 19:09               ` Pavel Machek
  1 sibling, 0 replies; 226+ messages in thread
From: Pavel Machek @ 2009-10-21 19:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA

> If you put the array at the end of the data structure, the wrapper
> will have to copy the array provided by the user to a temporary
> buffer. Passing the array as a separate syscall argument avoids
> this.

Ok, sounds sane, sorry for noise.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                         ` <4ADF06B7.50508-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-21 19:44                           ` Sukadev Bhattiprolu
  2009-10-22 10:26                           ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 19:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Eric W. Biederman,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Pavel Emelyanov, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

To me, it is a version number.

mmap() and mmap2() both have 6 parameters.

Besides if wait4() were born before wait3(), would it still be wait4() :-)
But I see that it is hard to get one-convention-that-fits-all.

> It's better than "extended" or something  
> like that simply because "extended" just means "more than", and a number  
> at least tells you *how much more than*.

And extended assumes we wont extend again.

An informal poll of reviewers has clone3() with a slight advantage :-)

	clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling, 
	clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.

I like clone3() but am not insisting on it. I just want a name...

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                         ` <4ADF06B7.50508-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-21 19:44                           ` Sukadev Bhattiprolu
  2009-10-22 10:26                           ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 19:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mtk.manpages, Matt Helsley, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, mingo, linux-api, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

H. Peter Anvin [hpa@zytor.com] wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

To me, it is a version number.

mmap() and mmap2() both have 6 parameters.

Besides if wait4() were born before wait3(), would it still be wait4() :-)
But I see that it is hard to get one-convention-that-fits-all.

> It's better than "extended" or something  
> like that simply because "extended" just means "more than", and a number  
> at least tells you *how much more than*.

And extended assumes we wont extend again.

An informal poll of reviewers has clone3() with a slight advantage :-)

	clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling, 
	clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.

I like clone3() but am not insisting on it. I just want a name...

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-21 19:44                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-21 19:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Matt Helsley,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

To me, it is a version number.

mmap() and mmap2() both have 6 parameters.

Besides if wait4() were born before wait3(), would it still be wait4() :-)
But I see that it is hard to get one-convention-that-fits-all.

> It's better than "extended" or something  
> like that simply because "extended" just means "more than", and a number  
> at least tells you *how much more than*.

And extended assumes we wont extend again.

An informal poll of reviewers has clone3() with a slight advantage :-)

	clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling, 
	clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.

I like clone3() but am not insisting on it. I just want a name...

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                           ` <20091021185242.GB12955-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 21:11                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21 21:11 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | Not what I was thinking.  The following untested patch is what I was
> | thinking.  It just exposes last, min, and max to the callers which pass
> | in different appropriate values.
>
>
> Minor comments are that the caller has to choose which function to call
> and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
> unnatural. But I can't think of a better way and I think this will work.
> Will test it out.

Thanks.  My primary concern was that we don't unnecessarily duplicate all
of the weird allocating a bitmap page code etc.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                           ` <20091021185242.GB12955-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 21:11                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21 21:11 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | Not what I was thinking.  The following untested patch is what I was
> | thinking.  It just exposes last, min, and max to the callers which pass
> | in different appropriate values.
>
>
> Minor comments are that the caller has to choose which function to call
> and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
> unnatural. But I can't think of a better way and I think this will work.
> Will test it out.

Thanks.  My primary concern was that we don't unnecessarily duplicate all
of the weird allocating a bitmap page code etc.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-21 21:11                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-21 21:11 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | Not what I was thinking.  The following untested patch is what I was
> | thinking.  It just exposes last, min, and max to the callers which pass
> | in different appropriate values.
>
>
> Minor comments are that the caller has to choose which function to call
> and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
> unnatural. But I can't think of a better way and I think this will work.
> Will test it out.

Thanks.  My primary concern was that we don't unnecessarily duplicate all
of the weird allocating a bitmap page code etc.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                           ` <20091021194440.GA1283-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 22:03                             ` H. Peter Anvin
  2009-10-22 10:40                             ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-21 22:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Eric W. Biederman,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Pavel Emelyanov, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA

On 10/22/2009 04:44 AM, Sukadev Bhattiprolu wrote:
>>
>> "3" is number of arguments.
>
> To me, it is a version number.
>
> mmap() and mmap2() both have 6 parameters.
>

You keep bringing this up.  mmap2() is (a) a non-user-visible call; (b) 
an exception (a mistake, if you want.)

> Besides if wait4() were born before wait3(), would it still be wait4() :-)

Yes.  wait3() came before wait4(), but there never was a wait2().

> But I see that it is hard to get one-convention-that-fits-all.
>
>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Exactly.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-21 19:44                           ` Sukadev Bhattiprolu
  (?)
  (?)
@ 2009-10-21 22:03                           ` H. Peter Anvin
  -1 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-21 22:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: mtk.manpages, Matt Helsley, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, mingo, linux-api, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

On 10/22/2009 04:44 AM, Sukadev Bhattiprolu wrote:
>>
>> "3" is number of arguments.
>
> To me, it is a version number.
>
> mmap() and mmap2() both have 6 parameters.
>

You keep bringing this up.  mmap2() is (a) a non-user-visible call; (b) 
an exception (a mistake, if you want.)

> Besides if wait4() were born before wait3(), would it still be wait4() :-)

Yes.  wait3() came before wait4(), but there never was a wait2().

> But I see that it is hard to get one-convention-that-fits-all.
>
>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Exactly.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                         ` <4ADF06B7.50508-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
  2009-10-21 19:44                           ` Sukadev Bhattiprolu
@ 2009-10-22 10:26                           ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 10:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

Peter,

On Wed, Oct 21, 2009 at 3:03 PM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)

It appears to me that the number of arguments is 2.

> It's better than "extended" or something like
> that simply because "extended" just means "more than", and a number at least
> tells you *how much more than*.

I'm not sure why you think including a number in the name tells us
"how much more than". Unless you are considering the numbering to be
version numbers, which apparently is not what you mean.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-21 13:03                       ` H. Peter Anvin
  2009-10-21 19:44                           ` Sukadev Bhattiprolu
       [not found]                         ` <4ADF06B7.50508-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-22 10:26                         ` Michael Kerrisk
  2009-10-22 11:38                           ` H. Peter Anvin
       [not found]                           ` <cfd18e0f0910220326w1a71b899s5ff0fce9e0a2b8ed-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 10:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Matt Helsley, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

Peter,

On Wed, Oct 21, 2009 at 3:03 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)

It appears to me that the number of arguments is 2.

> It's better than "extended" or something like
> that simply because "extended" just means "more than", and a number at least
> tells you *how much more than*.

I'm not sure why you think including a number in the name tells us
"how much more than". Unless you are considering the numbering to be
version numbers, which apparently is not what you mean.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                           ` <20091021194440.GA1283-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-21 22:03                             ` H. Peter Anvin
@ 2009-10-22 10:40                             ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 10:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Eric W. Biederman

Sukadev,

On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
>> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>>
>>> My question here is: what does "3" actually mean? In general, system
>>> calls have not followed any convention of numbering to indicate
>>> successive versions -- clone2() being the one possible exception that
>>> I know of.
>>>
>>
>> "3" is number of arguments.
>
> To me, it is a version number.

See my precending mail. Isn't the number of arguments "2".

> mmap() and mmap2() both have 6 parameters.
>
> Besides if wait4() were born before wait3(), would it still be wait4() :-)
> But I see that it is hard to get one-convention-that-fits-all.

Yes -- that's exactly right.

>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Well, if we do things right in this design, we may not need to ever
extend (by creating a new syscall) again. That's why I mentioned the
"flags" argument idea. Did you give this some thought?

> An informal poll of reviewers has clone3() with a slight advantage :-)
>
>        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
>        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
>
> I like clone3() but am not insisting on it. I just want a name...

And I'm not really insisting on a change. As you rightly point out,
there is much inconsistency in the naming conventions that have been
used over the years.

But, because there has been no consistency in the use of numbers, and
because the number of arguments that are presented in a glibc
interface may differ from the number of arguments in an underlying
syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
inclined to think that clonex() or clone_ext() is slighly better than
clone3(). But, certainly, my arguments are not compelling.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                           ` <20091021194440.GA1283-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-22 10:40                             ` Michael Kerrisk
  2009-10-22 10:40                             ` Michael Kerrisk
  1 sibling, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 10:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: H. Peter Anvin, Matt Helsley, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, mingo, linux-api, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

Sukadev,

On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> H. Peter Anvin [hpa@zytor.com] wrote:
>> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>>
>>> My question here is: what does "3" actually mean? In general, system
>>> calls have not followed any convention of numbering to indicate
>>> successive versions -- clone2() being the one possible exception that
>>> I know of.
>>>
>>
>> "3" is number of arguments.
>
> To me, it is a version number.

See my precending mail. Isn't the number of arguments "2".

> mmap() and mmap2() both have 6 parameters.
>
> Besides if wait4() were born before wait3(), would it still be wait4() :-)
> But I see that it is hard to get one-convention-that-fits-all.

Yes -- that's exactly right.

>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Well, if we do things right in this design, we may not need to ever
extend (by creating a new syscall) again. That's why I mentioned the
"flags" argument idea. Did you give this some thought?

> An informal poll of reviewers has clone3() with a slight advantage :-)
>
>        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
>        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
>
> I like clone3() but am not insisting on it. I just want a name...

And I'm not really insisting on a change. As you rightly point out,
there is much inconsistency in the naming conventions that have been
used over the years.

But, because there has been no consistency in the use of numbers, and
because the number of arguments that are presented in a glibc
interface may differ from the number of arguments in an underlying
syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
inclined to think that clonex() or clone_ext() is slighly better than
clone3(). But, certainly, my arguments are not compelling.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-22 10:40                             ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 10:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: H. Peter Anvin, Matt Helsley,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev,

On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
>> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>>
>>> My question here is: what does "3" actually mean? In general, system
>>> calls have not followed any convention of numbering to indicate
>>> successive versions -- clone2() being the one possible exception that
>>> I know of.
>>>
>>
>> "3" is number of arguments.
>
> To me, it is a version number.

See my precending mail. Isn't the number of arguments "2".

> mmap() and mmap2() both have 6 parameters.
>
> Besides if wait4() were born before wait3(), would it still be wait4() :-)
> But I see that it is hard to get one-convention-that-fits-all.

Yes -- that's exactly right.

>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Well, if we do things right in this design, we may not need to ever
extend (by creating a new syscall) again. That's why I mentioned the
"flags" argument idea. Did you give this some thought?

> An informal poll of reviewers has clone3() with a slight advantage :-)
>
>        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
>        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
>
> I like clone3() but am not insisting on it. I just want a name...

And I'm not really insisting on a change. As you rightly point out,
there is much inconsistency in the naming conventions that have been
used over the years.

But, because there has been no consistency in the use of numbers, and
because the number of arguments that are presented in a glibc
interface may differ from the number of arguments in an underlying
syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
inclined to think that clonex() or clone_ext() is slighly better than
clone3(). But, certainly, my arguments are not compelling.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <4ADF56D4.8030405-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-22 11:22               ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-22 11:22 UTC (permalink / raw)
  To: Oren Laadan
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Daniel Lezcano wrote:
>> [ ... ]
>>
>>>> I forgot to mention a constraint with the specified pid : P2 has to
>>>> be child of P1.
>>>> In other word, you can not specify a pid to clonat which is not your
>>>> descendant (including yourself).
>>>> With this constraint I think there is no security issues.
>>> Sounds dangerous. What if your descendant executed a setuid program ?
>> That does not happen because you inherit the context of the caller.
>>
>>>> Concerning of forking on behalf of another process, we can consider
>>>> it is up to the caller / programmer to know what it does. If a
>>>> process in 
>>> Before the user can program with this syscall, _you_ need to define
>>> the semantics of this syscall. 
>> Yes, you are right. Here it is the proposition of the semantics.
>>
>> Function prototype is:
>>
>> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>>
>> Structure types are:
>>
>> typedef int clone_flag_t;
>>
>> struct clone_args {
>>     clone_flag_t *flags;
>>     int flags_size;
>>     u32 reserved1;
>>     u32 reserved2;
>>     u64 child_stack_base;
>>     u64 child_stack_size;
>>     u64 parent_tid_ptr;
>>     u64 child_tid_ptr;
>>     u64 reserved3;
>> };
>>
>> With the helper macros:
>>
>> void CLONE_SET(int flag, clone_flag_t *flags);
>> void CLONE_CLR(int flag, clone_flag_t *flags);
>> bool CLONE_ISSET(int flag, clone_flag_t *flags);
>> void CLONE_ZERO(flag_t *clone_flags);
>>
>> And:
>>
>> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
>> 0x21
>> #define CLONEXT_FILES   0x22
>> ...
>>
>
> The main motivation for your new syscall is to make it possible to
> inject a process into a namespace. IOW, what you are proposing is
> a new incarnation of sys_hijack().
>
> This is _orthogonal_ to the current discussion, which is about an
> extension for clone to allow (a) choosing target pid(s), (b) more
> flags, and (c) future extensions.
>
> (Your suggested syscall may, too, allow the request a specific set
> of pids for the child process, and reuse the current code for that).
>
> I suggest that you start a new thread about your RFC. This will
> reduce distractions on the current thread, and bring more focus to
> your proposal. I surely will post some comments there :)

I can argue exactly the same thing, the main motivation for your new 
syscall is to make it possible to restart a process tree for a 
checkpoint / restart and this is orthogonal with adding extended clone 
flags :)

But my main motivation is to have the possibility to a) choose a target 
__and__ b) clone the process relatively to another one. These 2 features 
allows to do what *we* need, that is recreate a process tree and the 
bonus with this approach is the ability to inject a process into a 
namespace, something asked by several people, eg. debug with gdb an 
application running into another pid namespace (is not supported today).

I am sorry for coming late in the discussion and for distracting.

> [...]
>
>> The cloneat syscall can be used for the following use cases:
>>
>> * checkpoint / restart:
>>
>> The restart can be done with a clone(.., CLONE_NEWPID|...);
>> Then the new pid (aka pid 1) retrieves the proctree from the statefile
>> and creates the different tasks with the process hierarchy with the
>> cloneat syscall.
>
> s/cloneat/$CLONE3/
> (hint: this is how it's done now)
Of course, what is described is what you does with 'clone3' !
Do you think I will come proposing a variant of 'clone3' not doing what 
you need ? :)

>> The proctree creation can be done from outside of the pid namespace or
>> from inside.
>
> Ew .. why would you do that ?
And why not. Is there a semantic specifying how a process tree should be 
recreated ?

>> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
>> restart them. The checkpoint of a nested pid namespace should be
>> forbidden except for the leaf of a pid namespaces tree. That should
>
> Others (me included) *will* try and may get upset if forbidden...
> Seriously, there is no technical reason to restrict this.

Ok.

>  >> Can you define more precisely what you mean by "enter" the container ?
>>> If you simply want create a new process in the container, you can
>>> achieve the same thing with a daemon, or a smart init process (in
>>> there), or even ptrace tricks.
>> Yes, you can launch a daemon inside the container, that works for a
>> system container because the container is killed by killing the first
>> process of the container or by a shutdown inside the container (not
>> fully implemented in the kernel).
>> But this is unreliable for application containers, I won't enter in the
>> details but the container exits when the application exits, with a
>> daemon inside the container, this is no longer the case because you can
>> not detect the application death as the daemon is always there.
>>
>> With cloneat you restrict the life cycle of the command you launched,
>> that is the container exits as soon as all the processes exited the
>> container, including the spawned command itself.
>
> Then start a daemon _in addition_ to the application, or write a
> daemon that will launch the application and monitor it... And also
> there is ptrace -
Already tried :)

http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob;f=src/lxc/lxc_cinit.c;h=8f235483c1a9d9c9e0cc1ba69f1c33f1bc98b8aa;hb=57ff723f6a174a2a01c58c6ac367d118ef12b91c

> But, please let's take this off to a new thread about adding how to
> add a process into a namespace from the outside. FYI, I do think
> such an interface may be useful and nicer than the two alternatives
> I suggested above.
>
>>> Also, there is a reason why sys_hijack() was hijacked away ... And
>>> I honestly think that a syscall to force another process to clone
>>> would be shot down by the kernel guys.
>> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.
>
> Actually, I misread previously; I mean not forcing another process
> to clone, but instead forcing another process to become a parent (and
> I shall ignore the ethical issues :)
>
> I still suspect it won't be welcome. Several people would have liked
> to see CLONE_PARENT go away, too, if that was possible without breaking
> userspace applications. Yet another reason to take it to a discussion
> of its own.

At this point, I am hesitating of creating a new thread for this 
discussion. Because, there will be:
 * clone
 * clone2
 * clone3

and we will discuss again about a new clone syscall with a different API :(

I will not continue arguing on this thread except if someone is in favor 
of cloneat.
Otherwise, I will spawn a new thread later.

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]             ` <4ADF56D4.8030405-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-22 11:22               ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-22 11:22 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Daniel Lezcano wrote:
>> [ ... ]
>>
>>>> I forgot to mention a constraint with the specified pid : P2 has to
>>>> be child of P1.
>>>> In other word, you can not specify a pid to clonat which is not your
>>>> descendant (including yourself).
>>>> With this constraint I think there is no security issues.
>>> Sounds dangerous. What if your descendant executed a setuid program ?
>> That does not happen because you inherit the context of the caller.
>>
>>>> Concerning of forking on behalf of another process, we can consider
>>>> it is up to the caller / programmer to know what it does. If a
>>>> process in 
>>> Before the user can program with this syscall, _you_ need to define
>>> the semantics of this syscall. 
>> Yes, you are right. Here it is the proposition of the semantics.
>>
>> Function prototype is:
>>
>> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>>
>> Structure types are:
>>
>> typedef int clone_flag_t;
>>
>> struct clone_args {
>>     clone_flag_t *flags;
>>     int flags_size;
>>     u32 reserved1;
>>     u32 reserved2;
>>     u64 child_stack_base;
>>     u64 child_stack_size;
>>     u64 parent_tid_ptr;
>>     u64 child_tid_ptr;
>>     u64 reserved3;
>> };
>>
>> With the helper macros:
>>
>> void CLONE_SET(int flag, clone_flag_t *flags);
>> void CLONE_CLR(int flag, clone_flag_t *flags);
>> bool CLONE_ISSET(int flag, clone_flag_t *flags);
>> void CLONE_ZERO(flag_t *clone_flags);
>>
>> And:
>>
>> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
>> 0x21
>> #define CLONEXT_FILES   0x22
>> ...
>>
>
> The main motivation for your new syscall is to make it possible to
> inject a process into a namespace. IOW, what you are proposing is
> a new incarnation of sys_hijack().
>
> This is _orthogonal_ to the current discussion, which is about an
> extension for clone to allow (a) choosing target pid(s), (b) more
> flags, and (c) future extensions.
>
> (Your suggested syscall may, too, allow the request a specific set
> of pids for the child process, and reuse the current code for that).
>
> I suggest that you start a new thread about your RFC. This will
> reduce distractions on the current thread, and bring more focus to
> your proposal. I surely will post some comments there :)

I can argue exactly the same thing, the main motivation for your new 
syscall is to make it possible to restart a process tree for a 
checkpoint / restart and this is orthogonal with adding extended clone 
flags :)

But my main motivation is to have the possibility to a) choose a target 
__and__ b) clone the process relatively to another one. These 2 features 
allows to do what *we* need, that is recreate a process tree and the 
bonus with this approach is the ability to inject a process into a 
namespace, something asked by several people, eg. debug with gdb an 
application running into another pid namespace (is not supported today).

I am sorry for coming late in the discussion and for distracting.

> [...]
>
>> The cloneat syscall can be used for the following use cases:
>>
>> * checkpoint / restart:
>>
>> The restart can be done with a clone(.., CLONE_NEWPID|...);
>> Then the new pid (aka pid 1) retrieves the proctree from the statefile
>> and creates the different tasks with the process hierarchy with the
>> cloneat syscall.
>
> s/cloneat/$CLONE3/
> (hint: this is how it's done now)
Of course, what is described is what you does with 'clone3' !
Do you think I will come proposing a variant of 'clone3' not doing what 
you need ? :)

>> The proctree creation can be done from outside of the pid namespace or
>> from inside.
>
> Ew .. why would you do that ?
And why not. Is there a semantic specifying how a process tree should be 
recreated ?

>> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
>> restart them. The checkpoint of a nested pid namespace should be
>> forbidden except for the leaf of a pid namespaces tree. That should
>
> Others (me included) *will* try and may get upset if forbidden...
> Seriously, there is no technical reason to restrict this.

Ok.

>  >> Can you define more precisely what you mean by "enter" the container ?
>>> If you simply want create a new process in the container, you can
>>> achieve the same thing with a daemon, or a smart init process (in
>>> there), or even ptrace tricks.
>> Yes, you can launch a daemon inside the container, that works for a
>> system container because the container is killed by killing the first
>> process of the container or by a shutdown inside the container (not
>> fully implemented in the kernel).
>> But this is unreliable for application containers, I won't enter in the
>> details but the container exits when the application exits, with a
>> daemon inside the container, this is no longer the case because you can
>> not detect the application death as the daemon is always there.
>>
>> With cloneat you restrict the life cycle of the command you launched,
>> that is the container exits as soon as all the processes exited the
>> container, including the spawned command itself.
>
> Then start a daemon _in addition_ to the application, or write a
> daemon that will launch the application and monitor it... And also
> there is ptrace -
Already tried :)

http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob;f=src/lxc/lxc_cinit.c;h=8f235483c1a9d9c9e0cc1ba69f1c33f1bc98b8aa;hb=57ff723f6a174a2a01c58c6ac367d118ef12b91c

> But, please let's take this off to a new thread about adding how to
> add a process into a namespace from the outside. FYI, I do think
> such an interface may be useful and nicer than the two alternatives
> I suggested above.
>
>>> Also, there is a reason why sys_hijack() was hijacked away ... And
>>> I honestly think that a syscall to force another process to clone
>>> would be shot down by the kernel guys.
>> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.
>
> Actually, I misread previously; I mean not forcing another process
> to clone, but instead forcing another process to become a parent (and
> I shall ignore the ethical issues :)
>
> I still suspect it won't be welcome. Several people would have liked
> to see CLONE_PARENT go away, too, if that was possible without breaking
> userspace applications. Yet another reason to take it to a discussion
> of its own.

At this point, I am hesitating of creating a new thread for this 
discussion. Because, there will be:
 * clone
 * clone2
 * clone3

and we will discuss again about a new clone syscall with a different API :(

I will not continue arguing on this thread except if someone is in favor 
of cloneat.
Otherwise, I will spawn a new thread later.

Thanks
  -- Daniel


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-22 11:22               ` Daniel Lezcano
  0 siblings, 0 replies; 226+ messages in thread
From: Daniel Lezcano @ 2009-10-22 11:22 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Daniel Lezcano wrote:
>> [ ... ]
>>
>>>> I forgot to mention a constraint with the specified pid : P2 has to
>>>> be child of P1.
>>>> In other word, you can not specify a pid to clonat which is not your
>>>> descendant (including yourself).
>>>> With this constraint I think there is no security issues.
>>> Sounds dangerous. What if your descendant executed a setuid program ?
>> That does not happen because you inherit the context of the caller.
>>
>>>> Concerning of forking on behalf of another process, we can consider
>>>> it is up to the caller / programmer to know what it does. If a
>>>> process in 
>>> Before the user can program with this syscall, _you_ need to define
>>> the semantics of this syscall. 
>> Yes, you are right. Here it is the proposition of the semantics.
>>
>> Function prototype is:
>>
>> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>>
>> Structure types are:
>>
>> typedef int clone_flag_t;
>>
>> struct clone_args {
>>     clone_flag_t *flags;
>>     int flags_size;
>>     u32 reserved1;
>>     u32 reserved2;
>>     u64 child_stack_base;
>>     u64 child_stack_size;
>>     u64 parent_tid_ptr;
>>     u64 child_tid_ptr;
>>     u64 reserved3;
>> };
>>
>> With the helper macros:
>>
>> void CLONE_SET(int flag, clone_flag_t *flags);
>> void CLONE_CLR(int flag, clone_flag_t *flags);
>> bool CLONE_ISSET(int flag, clone_flag_t *flags);
>> void CLONE_ZERO(flag_t *clone_flags);
>>
>> And:
>>
>> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */ #define CLONEXT_FS     
>> 0x21
>> #define CLONEXT_FILES   0x22
>> ...
>>
>
> The main motivation for your new syscall is to make it possible to
> inject a process into a namespace. IOW, what you are proposing is
> a new incarnation of sys_hijack().
>
> This is _orthogonal_ to the current discussion, which is about an
> extension for clone to allow (a) choosing target pid(s), (b) more
> flags, and (c) future extensions.
>
> (Your suggested syscall may, too, allow the request a specific set
> of pids for the child process, and reuse the current code for that).
>
> I suggest that you start a new thread about your RFC. This will
> reduce distractions on the current thread, and bring more focus to
> your proposal. I surely will post some comments there :)

I can argue exactly the same thing, the main motivation for your new 
syscall is to make it possible to restart a process tree for a 
checkpoint / restart and this is orthogonal with adding extended clone 
flags :)

But my main motivation is to have the possibility to a) choose a target 
__and__ b) clone the process relatively to another one. These 2 features 
allows to do what *we* need, that is recreate a process tree and the 
bonus with this approach is the ability to inject a process into a 
namespace, something asked by several people, eg. debug with gdb an 
application running into another pid namespace (is not supported today).

I am sorry for coming late in the discussion and for distracting.

> [...]
>
>> The cloneat syscall can be used for the following use cases:
>>
>> * checkpoint / restart:
>>
>> The restart can be done with a clone(.., CLONE_NEWPID|...);
>> Then the new pid (aka pid 1) retrieves the proctree from the statefile
>> and creates the different tasks with the process hierarchy with the
>> cloneat syscall.
>
> s/cloneat/$CLONE3/
> (hint: this is how it's done now)
Of course, what is described is what you does with 'clone3' !
Do you think I will come proposing a variant of 'clone3' not doing what 
you need ? :)

>> The proctree creation can be done from outside of the pid namespace or
>> from inside.
>
> Ew .. why would you do that ?
And why not. Is there a semantic specifying how a process tree should be 
recreated ?

>> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
>> restart them. The checkpoint of a nested pid namespace should be
>> forbidden except for the leaf of a pid namespaces tree. That should
>
> Others (me included) *will* try and may get upset if forbidden...
> Seriously, there is no technical reason to restrict this.

Ok.

>  >> Can you define more precisely what you mean by "enter" the container ?
>>> If you simply want create a new process in the container, you can
>>> achieve the same thing with a daemon, or a smart init process (in
>>> there), or even ptrace tricks.
>> Yes, you can launch a daemon inside the container, that works for a
>> system container because the container is killed by killing the first
>> process of the container or by a shutdown inside the container (not
>> fully implemented in the kernel).
>> But this is unreliable for application containers, I won't enter in the
>> details but the container exits when the application exits, with a
>> daemon inside the container, this is no longer the case because you can
>> not detect the application death as the daemon is always there.
>>
>> With cloneat you restrict the life cycle of the command you launched,
>> that is the container exits as soon as all the processes exited the
>> container, including the spawned command itself.
>
> Then start a daemon _in addition_ to the application, or write a
> daemon that will launch the application and monitor it... And also
> there is ptrace -
Already tried :)

http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob;f=src/lxc/lxc_cinit.c;h=8f235483c1a9d9c9e0cc1ba69f1c33f1bc98b8aa;hb=57ff723f6a174a2a01c58c6ac367d118ef12b91c

> But, please let's take this off to a new thread about adding how to
> add a process into a namespace from the outside. FYI, I do think
> such an interface may be useful and nicer than the two alternatives
> I suggested above.
>
>>> Also, there is a reason why sys_hijack() was hijacked away ... And
>>> I honestly think that a syscall to force another process to clone
>>> would be shot down by the kernel guys.
>> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.
>
> Actually, I misread previously; I mean not forcing another process
> to clone, but instead forcing another process to become a parent (and
> I shall ignore the ethical issues :)
>
> I still suspect it won't be welcome. Several people would have liked
> to see CLONE_PARENT go away, too, if that was possible without breaking
> userspace applications. Yet another reason to take it to a discussion
> of its own.

At this point, I am hesitating of creating a new thread for this 
discussion. Because, there will be:
 * clone
 * clone2
 * clone3

and we will discuss again about a new clone syscall with a different API :(

I will not continue arguing on this thread except if someone is in favor 
of cloneat.
Otherwise, I will spawn a new thread later.

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                           ` <cfd18e0f0910220326w1a71b899s5ff0fce9e0a2b8ed-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 11:38                             ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-22 11:38 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>
>> "3" is number of arguments.
>
> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>
> It appears to me that the number of arguments is 2.
>

It was 3 at one point... I'm not sure when that changed last :-/

>> It's better than "extended" or something like
>> that simply because "extended" just means "more than", and a number at least
>> tells you *how much more than*.
>
> I'm not sure why you think including a number in the name tells us
> "how much more than". Unless you are considering the numbering to be
> version numbers, which apparently is not what you mean.

It is a version number of sorts.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-22 10:26                         ` Michael Kerrisk
@ 2009-10-22 11:38                           ` H. Peter Anvin
  2009-10-22 12:14                             ` Michael Kerrisk
       [not found]                             ` <4AE0444B.5070603-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
       [not found]                           ` <cfd18e0f0910220326w1a71b899s5ff0fce9e0a2b8ed-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-22 11:38 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Matt Helsley, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>
>> "3" is number of arguments.
>
> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>
> It appears to me that the number of arguments is 2.
>

It was 3 at one point... I'm not sure when that changed last :-/

>> It's better than "extended" or something like
>> that simply because "extended" just means "more than", and a number at least
>> tells you *how much more than*.
>
> I'm not sure why you think including a number in the name tells us
> "how much more than". Unless you are considering the numbering to be
> version numbers, which apparently is not what you mean.

It is a version number of sorts.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                             ` <4AE0444B.5070603-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
@ 2009-10-22 12:14                               ` Michael Kerrisk
  0 siblings, 0 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 12:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

Peter,

On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote:
> On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>>
>>> "3" is number of arguments.
>>
>> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>>
>> It appears to me that the number of arguments is 2.
>>
>
> It was 3 at one point... I'm not sure when that changed last :-/
>
>>> It's better than "extended" or something like
>>> that simply because "extended" just means "more than", and a number at
>>> least
>>> tells you *how much more than*.
>>
>> I'm not sure why you think including a number in the name tells us
>> "how much more than". Unless you are considering the numbering to be
>> version numbers, which apparently is not what you mean.
>
> It is a version number of sorts.

So, sometimes, a number in a system call should be the bit width of
some arguments(s), sometimes it should be the number of arguments, and
sometimes (well, just occasionally, as in mmap2() and clone()) -- it
should be a version number? Does the weather play any part in the
decision? ;-)

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
  2009-10-22 11:38                           ` H. Peter Anvin
@ 2009-10-22 12:14                             ` Michael Kerrisk
  2009-10-22 12:19                                 ` H. Peter Anvin
                                                 ` (2 more replies)
       [not found]                             ` <4AE0444B.5070603-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
  1 sibling, 3 replies; 226+ messages in thread
From: Michael Kerrisk @ 2009-10-22 12:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Matt Helsley, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

Peter,

On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>>
>>> "3" is number of arguments.
>>
>> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>>
>> It appears to me that the number of arguments is 2.
>>
>
> It was 3 at one point... I'm not sure when that changed last :-/
>
>>> It's better than "extended" or something like
>>> that simply because "extended" just means "more than", and a number at
>>> least
>>> tells you *how much more than*.
>>
>> I'm not sure why you think including a number in the name tells us
>> "how much more than". Unless you are considering the numbering to be
>> version numbers, which apparently is not what you mean.
>
> It is a version number of sorts.

So, sometimes, a number in a system call should be the bit width of
some arguments(s), sometimes it should be the number of arguments, and
sometimes (well, just occasionally, as in mmap2() and clone()) -- it
should be a version number? Does the weather play any part in the
decision? ;-)

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                               ` <cfd18e0f0910220514y1bd5967aj3a04bc3f5b38948b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 12:19                                 ` H. Peter Anvin
  2009-10-22 13:57                                 ` Matt Helsley
  1 sibling, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-22 12:19 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA,
	Eric W. Biederman

On 10/22/2009 09:14 PM, Michael Kerrisk wrote:
>
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the
> decision? ;-)
>

The notion is that they are *some* kind of description on how the system 
call has been augmented.  The bitwidths and argument numbers are 
non-overlapping and visually very different, so are not subject to 
confusion.  Your argument makes about as much sense as saying the letter 
"a" should have the same meaning in every context.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                               ` <cfd18e0f0910220514y1bd5967aj3a04bc3f5b38948b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 12:19                                 ` H. Peter Anvin
  2009-10-22 13:57                                 ` Matt Helsley
  1 sibling, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-22 12:19 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Matt Helsley, Sukadev Bhattiprolu, randy.dunlap, arnd,
	Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On 10/22/2009 09:14 PM, Michael Kerrisk wrote:
>
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the
> decision? ;-)
>

The notion is that they are *some* kind of description on how the system 
call has been augmented.  The bitwidths and argument numbers are 
non-overlapping and visually very different, so are not subject to 
confusion.  Your argument makes about as much sense as saying the letter 
"a" should have the same meaning in every context.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-22 12:19                                 ` H. Peter Anvin
  0 siblings, 0 replies; 226+ messages in thread
From: H. Peter Anvin @ 2009-10-22 12:19 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Matt Helsley, Sukadev Bhattiprolu,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On 10/22/2009 09:14 PM, Michael Kerrisk wrote:
>
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the
> decision? ;-)
>

The notion is that they are *some* kind of description on how the system 
call has been augmented.  The bitwidths and argument numbers are 
non-overlapping and visually very different, so are not subject to 
confusion.  Your argument makes about as much sense as saying the letter 
"a" should have the same meaning in every context.

	-hpa

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                               ` <cfd18e0f0910220514y1bd5967aj3a04bc3f5b38948b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-10-22 12:19                                 ` H. Peter Anvin
@ 2009-10-22 13:57                                 ` Matt Helsley
  1 sibling, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-22 13:57 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Pavel Emelyanov,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	mingo-X9Un+BFzKDI, Sukadev Bhattiprolu,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Eric W. Biederman

On Thu, Oct 22, 2009 at 02:14:16PM +0200, Michael Kerrisk wrote:
> Peter,
> 
> On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote:
> > On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
> >>>
> >>> "3" is number of arguments.
> >>
> >> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> >>
> >> It appears to me that the number of arguments is 2.
> >>
> >
> > It was 3 at one point... I'm not sure when that changed last :-/
> >
> >>> It's better than "extended" or something like
> >>> that simply because "extended" just means "more than", and a number at
> >>> least
> >>> tells you *how much more than*.
> >>
> >> I'm not sure why you think including a number in the name tells us
> >> "how much more than". Unless you are considering the numbering to be
> >> version numbers, which apparently is not what you mean.
> >
> > It is a version number of sorts.
> 
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the

In this case 3 could be both the number of arguments and the version
number (clone, clone2, clone3). Match 2 conventions with one choice of
name. :)

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                               ` <cfd18e0f0910220514y1bd5967aj3a04bc3f5b38948b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 13:57                                 ` Matt Helsley
  2009-10-22 13:57                                 ` Matt Helsley
  1 sibling, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-22 13:57 UTC (permalink / raw)
  To: mtk.manpages
  Cc: H. Peter Anvin, Matt Helsley, Sukadev Bhattiprolu, randy.dunlap,
	arnd, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	Eric W. Biederman, kosaki.motohiro, mingo, linux-api, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov

On Thu, Oct 22, 2009 at 02:14:16PM +0200, Michael Kerrisk wrote:
> Peter,
> 
> On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
> >>>
> >>> "3" is number of arguments.
> >>
> >> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> >>
> >> It appears to me that the number of arguments is 2.
> >>
> >
> > It was 3 at one point... I'm not sure when that changed last :-/
> >
> >>> It's better than "extended" or something like
> >>> that simply because "extended" just means "more than", and a number at
> >>> least
> >>> tells you *how much more than*.
> >>
> >> I'm not sure why you think including a number in the name tells us
> >> "how much more than". Unless you are considering the numbering to be
> >> version numbers, which apparently is not what you mean.
> >
> > It is a version number of sorts.
> 
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the

In this case 3 could be both the number of arguments and the version
number (clone, clone2, clone3). Match 2 conventions with one choice of
name. :)

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-22 13:57                                 ` Matt Helsley
  0 siblings, 0 replies; 226+ messages in thread
From: Matt Helsley @ 2009-10-22 13:57 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: H. Peter Anvin, Matt Helsley, Sukadev Bhattiprolu,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Thu, Oct 22, 2009 at 02:14:16PM +0200, Michael Kerrisk wrote:
> Peter,
> 
> On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote:
> > On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
> >>>
> >>> "3" is number of arguments.
> >>
> >> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> >>
> >> It appears to me that the number of arguments is 2.
> >>
> >
> > It was 3 at one point... I'm not sure when that changed last :-/
> >
> >>> It's better than "extended" or something like
> >>> that simply because "extended" just means "more than", and a number at
> >>> least
> >>> tells you *how much more than*.
> >>
> >> I'm not sure why you think including a number in the name tells us
> >> "how much more than". Unless you are considering the numbering to be
> >> version numbers, which apparently is not what you mean.
> >
> > It is a version number of sorts.
> 
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the

In this case 3 could be both the number of arguments and the version
number (clone, clone2, clone3). Match 2 conventions with one choice of
name. :)

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                             ` <cfd18e0f0910220340n7c655daap78e395136c56f882-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 18:10                               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-22 18:10 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Eric W. Biederman

Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
| Sukadev,
| 
| On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
| <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
| > H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| >> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
| >>>
| >>> My question here is: what does "3" actually mean? In general, system
| >>> calls have not followed any convention of numbering to indicate
| >>> successive versions -- clone2() being the one possible exception that
| >>> I know of.
| >>>
| >>
| >> "3" is number of arguments.
| >
| > To me, it is a version number.
| 
| See my precending mail. Isn't the number of arguments "2".

Well it was 2 at one point, but I have posted a new version of
just that one patch - please see http://lkml.org/lkml/2009/10/16/3
for comments.

I am working on some updates and will post a new patchset - it
will have 3 parameters to clone3() as shown in the above mail.

| 
| > mmap() and mmap2() both have 6 parameters.
| >
| > Besides if wait4() were born before wait3(), would it still be wait4() :-)
| > But I see that it is hard to get one-convention-that-fits-all.
| 
| Yes -- that's exactly right.
| 
| >> It's better than "extended" or something
| >> like that simply because "extended" just means "more than", and a number
| >> at least tells you *how much more than*.
| >
| > And extended assumes we wont extend again.
| 
| Well, if we do things right in this design, we may not need to ever
| extend (by creating a new syscall) again. That's why I mentioned the
| "flags" argument idea. Did you give this some thought?

Yes, we have done the best we can to avoid extending clone() again
anytime soon (some reserved bytes and clone_args_size field). Would
we still need the flags parameter ? Again its in the new patch
that I pointed to above.

| 
| > An informal poll of reviewers has clone3() with a slight advantage :-)
| >
| >        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
| >        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
| >
| > I like clone3() but am not insisting on it. I just want a name...
| 
| And I'm not really insisting on a change. As you rightly point out,
| there is much inconsistency in the naming conventions that have been
| used over the years.
| 
| But, because there has been no consistency in the use of numbers, and
| because the number of arguments that are presented in a glibc
| interface may differ from the number of arguments in an underlying
| syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
| inclined to think that clonex() or clone_ext() is slighly better than
| clone3(). But, certainly, my arguments are not compelling.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
       [not found]                             ` <cfd18e0f0910220340n7c655daap78e395136c56f882-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-22 18:10                               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-22 18:10 UTC (permalink / raw)
  To: mtk.manpages
  Cc: H. Peter Anvin, Matt Helsley, randy.dunlap, arnd, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, Eric W. Biederman,
	kosaki.motohiro, mingo, linux-api, torvalds, Alexey Dobriyan,
	roland, Pavel Emelyanov

Michael Kerrisk [mtk.manpages@googlemail.com] wrote:
| Sukadev,
| 
| On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
| <sukadev@linux.vnet.ibm.com> wrote:
| > H. Peter Anvin [hpa@zytor.com] wrote:
| >> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
| >>>
| >>> My question here is: what does "3" actually mean? In general, system
| >>> calls have not followed any convention of numbering to indicate
| >>> successive versions -- clone2() being the one possible exception that
| >>> I know of.
| >>>
| >>
| >> "3" is number of arguments.
| >
| > To me, it is a version number.
| 
| See my precending mail. Isn't the number of arguments "2".

Well it was 2 at one point, but I have posted a new version of
just that one patch - please see http://lkml.org/lkml/2009/10/16/3
for comments.

I am working on some updates and will post a new patchset - it
will have 3 parameters to clone3() as shown in the above mail.

| 
| > mmap() and mmap2() both have 6 parameters.
| >
| > Besides if wait4() were born before wait3(), would it still be wait4() :-)
| > But I see that it is hard to get one-convention-that-fits-all.
| 
| Yes -- that's exactly right.
| 
| >> It's better than "extended" or something
| >> like that simply because "extended" just means "more than", and a number
| >> at least tells you *how much more than*.
| >
| > And extended assumes we wont extend again.
| 
| Well, if we do things right in this design, we may not need to ever
| extend (by creating a new syscall) again. That's why I mentioned the
| "flags" argument idea. Did you give this some thought?

Yes, we have done the best we can to avoid extending clone() again
anytime soon (some reserved bytes and clone_args_size field). Would
we still need the flags parameter ? Again its in the new patch
that I pointed to above.

| 
| > An informal poll of reviewers has clone3() with a slight advantage :-)
| >
| >        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
| >        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
| >
| > I like clone3() but am not insisting on it. I just want a name...
| 
| And I'm not really insisting on a change. As you rightly point out,
| there is much inconsistency in the naming conventions that have been
| used over the years.
| 
| But, because there has been no consistency in the use of numbers, and
| because the number of arguments that are presented in a glibc
| interface may differ from the number of arguments in an underlying
| syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
| inclined to think that clonex() or clone_ext() is slighly better than
| clone3(). But, certainly, my arguments are not compelling.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall
@ 2009-10-22 18:10                               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-22 18:10 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: H. Peter Anvin, Matt Helsley,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Michael Kerrisk [mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org] wrote:
| Sukadev,
| 
| On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
| <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
| > H. Peter Anvin [hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org] wrote:
| >> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
| >>>
| >>> My question here is: what does "3" actually mean? In general, system
| >>> calls have not followed any convention of numbering to indicate
| >>> successive versions -- clone2() being the one possible exception that
| >>> I know of.
| >>>
| >>
| >> "3" is number of arguments.
| >
| > To me, it is a version number.
| 
| See my precending mail. Isn't the number of arguments "2".

Well it was 2 at one point, but I have posted a new version of
just that one patch - please see http://lkml.org/lkml/2009/10/16/3
for comments.

I am working on some updates and will post a new patchset - it
will have 3 parameters to clone3() as shown in the above mail.

| 
| > mmap() and mmap2() both have 6 parameters.
| >
| > Besides if wait4() were born before wait3(), would it still be wait4() :-)
| > But I see that it is hard to get one-convention-that-fits-all.
| 
| Yes -- that's exactly right.
| 
| >> It's better than "extended" or something
| >> like that simply because "extended" just means "more than", and a number
| >> at least tells you *how much more than*.
| >
| > And extended assumes we wont extend again.
| 
| Well, if we do things right in this design, we may not need to ever
| extend (by creating a new syscall) again. That's why I mentioned the
| "flags" argument idea. Did you give this some thought?

Yes, we have done the best we can to avoid extending clone() again
anytime soon (some reserved bytes and clone_args_size field). Would
we still need the flags parameter ? Again its in the new patch
that I pointed to above.

| 
| > An informal poll of reviewers has clone3() with a slight advantage :-)
| >
| >        clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
| >        clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
| >
| > I like clone3() but am not insisting on it. I just want a name...
| 
| And I'm not really insisting on a change. As you rightly point out,
| there is much inconsistency in the naming conventions that have been
| used over the years.
| 
| But, because there has been no consistency in the use of numbers, and
| because the number of arguments that are presented in a glibc
| interface may differ from the number of arguments in an underlying
| syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
| inclined to think that clonex() or clone_ext() is slighly better than
| clone3(). But, certainly, my arguments are not compelling.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                         ` <m1eioxrtsb.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2009-10-21 18:52                           ` Sukadev Bhattiprolu
@ 2009-10-23  0:42                           ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23  0:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| +	if (target >= pid_max)
| +		return -1;

I am changing this and the next return to 'return -EINVAL', to match
an earlier patch in my patchset.

| +	if (target < RESERVED_PIDS)

Should we replace RESERVED_PIDS with 0 ? We currently allow new
containers to have pids 1..32K in the first pass and in subsequent
passes assign starting at RESERVED_PIDS.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                         ` <m1eioxrtsb.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-23  0:42                           ` Sukadev Bhattiprolu
  2009-10-23  0:42                           ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23  0:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| +	if (target >= pid_max)
| +		return -1;

I am changing this and the next return to 'return -EINVAL', to match
an earlier patch in my patchset.

| +	if (target < RESERVED_PIDS)

Should we replace RESERVED_PIDS with 0 ? We currently allow new
containers to have pids 1..32K in the first pass and in subsequent
passes assign starting at RESERVED_PIDS.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23  0:42                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23  0:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| +	if (target >= pid_max)
| +		return -1;

I am changing this and the next return to 'return -EINVAL', to match
an earlier patch in my patchset.

| +	if (target < RESERVED_PIDS)

Should we replace RESERVED_PIDS with 0 ? We currently allow new
containers to have pids 1..32K in the first pass and in subsequent
passes assign starting at RESERVED_PIDS.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                           ` <20091023004253.GA7915-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23  1:03                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  1:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> | +{
> | +	if (target >= pid_max)
> | +		return -1;
>
> I am changing this and the next return to 'return -EINVAL', to match
> an earlier patch in my patchset.
>
> | +	if (target < RESERVED_PIDS)
>
> Should we replace RESERVED_PIDS with 0 ? We currently allow new
> containers to have pids 1..32K in the first pass and in subsequent
> passes assign starting at RESERVED_PIDS.

If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
check removes most if not all of the point of RESERVED_PIDS.

In a new fresh pid namespace I have no problem with not performing
the RESERVED_PIDS check.

So I guess that makes the check.

if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
   return -EINVAL;

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                           ` <20091023004253.GA7915-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23  1:03                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  1:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> | +{
> | +	if (target >= pid_max)
> | +		return -1;
>
> I am changing this and the next return to 'return -EINVAL', to match
> an earlier patch in my patchset.
>
> | +	if (target < RESERVED_PIDS)
>
> Should we replace RESERVED_PIDS with 0 ? We currently allow new
> containers to have pids 1..32K in the first pass and in subsequent
> passes assign starting at RESERVED_PIDS.

If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
check removes most if not all of the point of RESERVED_PIDS.

In a new fresh pid namespace I have no problem with not performing
the RESERVED_PIDS check.

So I guess that makes the check.

if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
   return -EINVAL;

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23  1:03                             ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  1:03 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> | +{
> | +	if (target >= pid_max)
> | +		return -1;
>
> I am changing this and the next return to 'return -EINVAL', to match
> an earlier patch in my patchset.
>
> | +	if (target < RESERVED_PIDS)
>
> Should we replace RESERVED_PIDS with 0 ? We currently allow new
> containers to have pids 1..32K in the first pass and in subsequent
> passes assign starting at RESERVED_PIDS.

If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
check removes most if not all of the point of RESERVED_PIDS.

In a new fresh pid namespace I have no problem with not performing
the RESERVED_PIDS check.

So I guess that makes the check.

if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
   return -EINVAL;

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                             ` <m1bpjyrkep.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-23  5:30                               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23  5:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > | +	if (target < RESERVED_PIDS)
| >
| > Should we replace RESERVED_PIDS with 0 ? We currently allow new
| > containers to have pids 1..32K in the first pass and in subsequent
| > passes assign starting at RESERVED_PIDS.
| 
| If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
| check removes most if not all of the point of RESERVED_PIDS.
| 
| In a new fresh pid namespace I have no problem with not performing
| the RESERVED_PIDS check.

In that case can we do this

	if (target_pid < RESERVED_PIDS && !pid_ns->level)
		return -EINVAL;

instead ?
| 
| So I guess that makes the check.
| 
| if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
|    return -EINVAL;

I am just wondering if there is a small corner case where C/R would randomly
fail because of this sequence:

	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
	  times and ->last_pid == RESERVED_PIDS-1.

	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
	  child - its pid == ->last_pid == RESERVED_PIDS

	- C/R code then calls clone3()/set_pidmap() to set the pid of
	  a new child to RESERVED_PID but fails (i.e it fails to restore
	  a pid even when the pid is not in use).

We could argue that mixing alloc_pidmap() and set_pidmap() during restart
is bad since set_pidmap() may fail.

The C/R developer could argue that we are forcing them to specify a pid
even for a short lived process that they wait()s on and thus ensure that
pid is not in use.

Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
would it be ok enforce it only in init_pid_ns ?

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-23  1:03                             ` Eric W. Biederman
  (?)
@ 2009-10-23  5:30                             ` Sukadev Bhattiprolu
  2009-10-23  5:44                                 ` Eric W. Biederman
                                                 ` (2 more replies)
  -1 siblings, 3 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23  5:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > | +	if (target < RESERVED_PIDS)
| >
| > Should we replace RESERVED_PIDS with 0 ? We currently allow new
| > containers to have pids 1..32K in the first pass and in subsequent
| > passes assign starting at RESERVED_PIDS.
| 
| If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
| check removes most if not all of the point of RESERVED_PIDS.
| 
| In a new fresh pid namespace I have no problem with not performing
| the RESERVED_PIDS check.

In that case can we do this

	if (target_pid < RESERVED_PIDS && !pid_ns->level)
		return -EINVAL;

instead ?
| 
| So I guess that makes the check.
| 
| if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
|    return -EINVAL;

I am just wondering if there is a small corner case where C/R would randomly
fail because of this sequence:

	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
	  times and ->last_pid == RESERVED_PIDS-1.

	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
	  child - its pid == ->last_pid == RESERVED_PIDS

	- C/R code then calls clone3()/set_pidmap() to set the pid of
	  a new child to RESERVED_PID but fails (i.e it fails to restore
	  a pid even when the pid is not in use).

We could argue that mixing alloc_pidmap() and set_pidmap() during restart
is bad since set_pidmap() may fail.

The C/R developer could argue that we are forcing them to specify a pid
even for a short lived process that they wait()s on and thus ensure that
pid is not in use.

Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
would it be ok enforce it only in init_pid_ns ?

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                               ` <20091023053001.GA24972-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23  5:44                                 ` Eric W. Biederman
  2009-10-23 19:16                                 ` Oren Laadan
  1 sibling, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  5:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
>
> In that case can we do this
>
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
>
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
>
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
>
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
>
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
>
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).
>
> We could argue that mixing alloc_pidmap() and set_pidmap() during restart
> is bad since set_pidmap() may fail.
>
> The C/R developer could argue that we are forcing them to specify a pid
> even for a short lived process that they wait()s on and thus ensure that
> pid is not in use.
>
> Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> would it be ok enforce it only in init_pid_ns ?

It is mean for initial user space daemons, things that start on boot.

I don't know how much the protection matters at this date, but we have it.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                               ` <20091023053001.GA24972-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23  5:44                                 ` Eric W. Biederman
  2009-10-23 19:16                                 ` Oren Laadan
  1 sibling, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  5:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
>
> In that case can we do this
>
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
>
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
>
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
>
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
>
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
>
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).
>
> We could argue that mixing alloc_pidmap() and set_pidmap() during restart
> is bad since set_pidmap() may fail.
>
> The C/R developer could argue that we are forcing them to specify a pid
> even for a short lived process that they wait()s on and thus ensure that
> pid is not in use.
>
> Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> would it be ok enforce it only in init_pid_ns ?

It is mean for initial user space daemons, things that start on boot.

I don't know how much the protection matters at this date, but we have it.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23  5:44                                 ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23  5:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
>
> In that case can we do this
>
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
>
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
>
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
>
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
>
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
>
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).
>
> We could argue that mixing alloc_pidmap() and set_pidmap() during restart
> is bad since set_pidmap() may fail.
>
> The C/R developer could argue that we are forcing them to specify a pid
> even for a short lived process that they wait()s on and thus ensure that
> pid is not in use.
>
> Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> would it be ok enforce it only in init_pid_ns ?

It is mean for initial user space daemons, things that start on boot.

I don't know how much the protection matters at this date, but we have it.

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                               ` <20091023053001.GA24972-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-23  5:44                                 ` Eric W. Biederman
@ 2009-10-23 19:16                                 ` Oren Laadan
  1 sibling, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA



Sukadev Bhattiprolu wrote:
> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
> 
> In that case can we do this
> 
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
> 
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
> 
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
> 
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
> 
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
> 
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).

Not only for short-lived children. The problem is restart will succeed
or fail depending on the order in which tasks were checkpointed. If
task with pid 290 is restarted after pid 305, restart will fail.

And because chekcpoint scans the task tree in a DFS manner, this is
more likely to happen than not.

I wonder why you'd like to restrict a pid-specific clone like that ?
It is already a privileged syscall, so it could be exempt. I suggest
that only regular clones will be constrained.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                               ` <20091023053001.GA24972-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 19:16                                 ` Oren Laadan
  2009-10-23 19:16                                 ` Oren Laadan
  1 sibling, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Eric W. Biederman, Matt Helsley, Daniel Lezcano, randy.dunlap,
	arnd, linux-api, Containers, Nathan Lynch, linux-kernel,
	Louis.Rilling, kosaki.motohiro, hpa, mingo, torvalds,
	Alexey Dobriyan, roland, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
> 
> In that case can we do this
> 
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
> 
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
> 
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
> 
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
> 
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
> 
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).

Not only for short-lived children. The problem is restart will succeed
or fail depending on the order in which tasks were checkpointed. If
task with pid 290 is restarted after pid 305, restart will fail.

And because chekcpoint scans the task tree in a DFS manner, this is
more likely to happen than not.

I wonder why you'd like to restrict a pid-specific clone like that ?
It is already a privileged syscall, so it could be exempt. I suggest
that only regular clones will be constrained.

Oren.



^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 19:16                                 ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:16 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Eric W. Biederman, Matt Helsley, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | > | +	if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> | 
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> | 
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
> 
> In that case can we do this
> 
> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
> 		return -EINVAL;
> 
> instead ?
> | 
> | So I guess that makes the check.
> | 
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> |    return -EINVAL;
> 
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
> 
> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> 	  times and ->last_pid == RESERVED_PIDS-1.
> 
> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
> 	  child - its pid == ->last_pid == RESERVED_PIDS
> 
> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
> 	  a pid even when the pid is not in use).

Not only for short-lived children. The problem is restart will succeed
or fail depending on the order in which tasks were checkpointed. If
task with pid 290 is restarted after pid 305, restart will fail.

And because chekcpoint scans the task tree in a DFS manner, this is
more likely to happen than not.

I wonder why you'd like to restrict a pid-specific clone like that ?
It is already a privileged syscall, so it could be exempt. I suggest
that only regular clones will be constrained.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
  2009-10-23  5:44                                 ` Eric W. Biederman
@ 2009-10-23 19:21                                     ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23 19:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| > would it be ok enforce it only in init_pid_ns ?
| 
| It is mean for initial user space daemons, things that start on boot.
| 
| I don't know how much the protection matters at this date, but we have it.

Well, since it is not security or other critical restriction, can we allow
set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
subtree C/R of one of the early daemons for debug for instance.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 19:21                                     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23 19:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| > would it be ok enforce it only in init_pid_ns ?
| 
| It is mean for initial user space daemons, things that start on boot.
| 
| I don't know how much the protection matters at this date, but we have it.

Well, since it is not security or other critical restriction, can we allow
set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
subtree C/R of one of the early daemons for debug for instance.

Sukadev

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                 ` <4AE20124.4010108-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-23 19:34                                   ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov



Oren Laadan wrote:
> 
> Sukadev Bhattiprolu wrote:
>> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
>> | > | +	if (target < RESERVED_PIDS)
>> | >
>> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
>> | > containers to have pids 1..32K in the first pass and in subsequent
>> | > passes assign starting at RESERVED_PIDS.
>> | 
>> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
>> | check removes most if not all of the point of RESERVED_PIDS.
>> | 
>> | In a new fresh pid namespace I have no problem with not performing
>> | the RESERVED_PIDS check.
>>
>> In that case can we do this
>>
>> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
>> 		return -EINVAL;
>>
>> instead ?
>> | 
>> | So I guess that makes the check.
>> | 
>> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
>> |    return -EINVAL;
>>
>> I am just wondering if there is a small corner case where C/R would randomly
>> fail because of this sequence:
>>
>> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
>> 	  times and ->last_pid == RESERVED_PIDS-1.
>>
>> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
>> 	  child - its pid == ->last_pid == RESERVED_PIDS
>>
>> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
>> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
>> 	  a pid even when the pid is not in use).
> 
> Not only for short-lived children. The problem is restart will succeed
> or fail depending on the order in which tasks were checkpointed. If
> task with pid 290 is restarted after pid 305, restart will fail.
> 
> And because chekcpoint scans the task tree in a DFS manner, this is
> more likely to happen than not.
> 
> I wonder why you'd like to restrict a pid-specific clone like that ?
> It is already a privileged syscall, so it could be exempt. I suggest
> that only regular clones will be constrained.

I stand corrected by Suka: a pid-specific clone does not change
last_pid. Therefore, given that 'restart' only creates tasks with
pid-specific clone, this should be safe for c/r.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                 ` <4AE20124.4010108-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-23 19:34                                   ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap, arnd, linux-api, Containers, Nathan Lynch,
	linux-kernel, Louis.Rilling, Eric W. Biederman, kosaki.motohiro,
	hpa, mingo, Pavel Emelyanov, torvalds, Alexey Dobriyan, roland



Oren Laadan wrote:
> 
> Sukadev Bhattiprolu wrote:
>> Eric W. Biederman [ebiederm@xmission.com] wrote:
>> | > | +	if (target < RESERVED_PIDS)
>> | >
>> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
>> | > containers to have pids 1..32K in the first pass and in subsequent
>> | > passes assign starting at RESERVED_PIDS.
>> | 
>> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
>> | check removes most if not all of the point of RESERVED_PIDS.
>> | 
>> | In a new fresh pid namespace I have no problem with not performing
>> | the RESERVED_PIDS check.
>>
>> In that case can we do this
>>
>> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
>> 		return -EINVAL;
>>
>> instead ?
>> | 
>> | So I guess that makes the check.
>> | 
>> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
>> |    return -EINVAL;
>>
>> I am just wondering if there is a small corner case where C/R would randomly
>> fail because of this sequence:
>>
>> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
>> 	  times and ->last_pid == RESERVED_PIDS-1.
>>
>> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
>> 	  child - its pid == ->last_pid == RESERVED_PIDS
>>
>> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
>> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
>> 	  a pid even when the pid is not in use).
> 
> Not only for short-lived children. The problem is restart will succeed
> or fail depending on the order in which tasks were checkpointed. If
> task with pid 290 is restarted after pid 305, restart will fail.
> 
> And because chekcpoint scans the task tree in a DFS manner, this is
> more likely to happen than not.
> 
> I wonder why you'd like to restrict a pid-specific clone like that ?
> It is already a privileged syscall, so it could be exempt. I suggest
> that only regular clones will be constrained.

I stand corrected by Suka: a pid-specific clone does not change
last_pid. Therefore, given that 'restart' only creates tasks with
pid-specific clone, this should be safe for c/r.

Oren.


^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 19:34                                   ` Oren Laadan
  0 siblings, 0 replies; 226+ messages in thread
From: Oren Laadan @ 2009-10-23 19:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ, Eric W. Biederman,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA



Oren Laadan wrote:
> 
> Sukadev Bhattiprolu wrote:
>> Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
>> | > | +	if (target < RESERVED_PIDS)
>> | >
>> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
>> | > containers to have pids 1..32K in the first pass and in subsequent
>> | > passes assign starting at RESERVED_PIDS.
>> | 
>> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
>> | check removes most if not all of the point of RESERVED_PIDS.
>> | 
>> | In a new fresh pid namespace I have no problem with not performing
>> | the RESERVED_PIDS check.
>>
>> In that case can we do this
>>
>> 	if (target_pid < RESERVED_PIDS && !pid_ns->level)
>> 		return -EINVAL;
>>
>> instead ?
>> | 
>> | So I guess that makes the check.
>> | 
>> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
>> |    return -EINVAL;
>>
>> I am just wondering if there is a small corner case where C/R would randomly
>> fail because of this sequence:
>>
>> 	- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
>> 	  times and ->last_pid == RESERVED_PIDS-1.
>>
>> 	- C/R code calls normal fork()/alloc_pidmap() for a short-lived
>> 	  child - its pid == ->last_pid == RESERVED_PIDS
>>
>> 	- C/R code then calls clone3()/set_pidmap() to set the pid of
>> 	  a new child to RESERVED_PID but fails (i.e it fails to restore
>> 	  a pid even when the pid is not in use).
> 
> Not only for short-lived children. The problem is restart will succeed
> or fail depending on the order in which tasks were checkpointed. If
> task with pid 290 is restarted after pid 305, restart will fail.
> 
> And because chekcpoint scans the task tree in a DFS manner, this is
> more likely to happen than not.
> 
> I wonder why you'd like to restrict a pid-specific clone like that ?
> It is already a privileged syscall, so it could be exempt. I suggest
> that only regular clones will be constrained.

I stand corrected by Suka: a pid-specific clone does not change
last_pid. Therefore, given that 'restart' only creates tasks with
pid-specific clone, this should be safe for c/r.

Oren.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                     ` <20091023192124.GA11088-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 20:48                                       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23 20:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
| Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| | > would it be ok enforce it only in init_pid_ns ?
| | 
| | It is mean for initial user space daemons, things that start on boot.
| | 
| | I don't know how much the protection matters at this date, but we have it.
| 
| Well, since it is not security or other critical restriction, can we allow
| set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
| subtree C/R of one of the early daemons for debug for instance.

So here is how I have it at present. I would like to remove the RESERVED_PIDS
check in set_pidmap() if its ok to do so.

alloc_pid() does this:

	if (target_pids)
		set_pidmap(tmp, target_pids[i])
	else
		alloc_pidmap(tmp);

Sukadev
---

From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Thu, 22 Oct 2009 16:57:28 -0700
Subject: [PATCH] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Complete rewrite this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..9346755 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                     ` <20091023192124.GA11088-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 20:48                                       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23 20:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| Eric W. Biederman [ebiederm@xmission.com] wrote:
| | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| | > would it be ok enforce it only in init_pid_ns ?
| | 
| | It is mean for initial user space daemons, things that start on boot.
| | 
| | I don't know how much the protection matters at this date, but we have it.
| 
| Well, since it is not security or other critical restriction, can we allow
| set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
| subtree C/R of one of the early daemons for debug for instance.

So here is how I have it at present. I would like to remove the RESERVED_PIDS
check in set_pidmap() if its ok to do so.

alloc_pid() does this:

	if (target_pids)
		set_pidmap(tmp, target_pids[i])
	else
		alloc_pidmap(tmp);

Sukadev
---

>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Thu, 22 Oct 2009 16:57:28 -0700
Subject: [PATCH] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Complete rewrite this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
---
 kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..9346755 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 20:48                                       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-23 20:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
| Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| | > would it be ok enforce it only in init_pid_ns ?
| | 
| | It is mean for initial user space daemons, things that start on boot.
| | 
| | I don't know how much the protection matters at this date, but we have it.
| 
| Well, since it is not security or other critical restriction, can we allow
| set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
| subtree C/R of one of the early daemons for debug for instance.

So here is how I have it at present. I would like to remove the RESERVED_PIDS
check in set_pidmap() if its ok to do so.

alloc_pid() does this:

	if (target_pids)
		set_pidmap(tmp, target_pids[i])
	else
		alloc_pidmap(tmp);

Sukadev
---

>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Thu, 22 Oct 2009 16:57:28 -0700
Subject: [PATCH] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Complete rewrite this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..9346755 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                   ` <4AE20532.6060809-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-23 23:12                                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:12 UTC (permalink / raw)
  To: Oren Laadan
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> writes:


> I stand corrected by Suka: a pid-specific clone does not change
> last_pid. Therefore, given that 'restart' only creates tasks with
> pid-specific clone, this should be safe for c/r.

Exactly.  We can preserve the existing guarantees and keep the ability
to restore any pid (at least in a fresh pid namespace).

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                   ` <4AE20532.6060809-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-23 23:12                                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:12 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap, arnd, linux-api, Containers,
	Nathan Lynch, linux-kernel, Louis.Rilling, kosaki.motohiro, hpa,
	mingo, Pavel Emelyanov, torvalds, Alexey Dobriyan, roland

Oren Laadan <orenl@librato.com> writes:


> I stand corrected by Suka: a pid-specific clone does not change
> last_pid. Therefore, given that 'restart' only creates tasks with
> pid-specific clone, this should be safe for c/r.

Exactly.  We can preserve the existing guarantees and keep the ability
to restore any pid (at least in a fresh pid namespace).

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 23:12                                     ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:12 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	arnd-r2nGTMty4D4, linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA

Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> writes:


> I stand corrected by Suka: a pid-specific clone does not change
> last_pid. Therefore, given that 'restart' only creates tasks with
> pid-specific clone, this should be safe for c/r.

Exactly.  We can preserve the existing guarantees and keep the ability
to restore any pid (at least in a fresh pid namespace).

Eric

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                       ` <20091023204812.GA26524-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 23:26                                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
> | Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> | | > would it be ok enforce it only in init_pid_ns ?
> | | 
> | | It is mean for initial user space daemons, things that start on boot.
> | | 
> | | I don't know how much the protection matters at this date, but we have it.
> | 
> | Well, since it is not security or other critical restriction, can we allow
> | set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
> | subtree C/R of one of the early daemons for debug for instance.
>
> So here is how I have it at present. I would like to remove the RESERVED_PIDS
> check in set_pidmap() if its ok to do so.
>
> alloc_pid() does this:
>
> 	if (target_pids)
> 		set_pidmap(tmp, target_pids[i])
> 	else
> 		alloc_pidmap(tmp);
>
> Sukadev
> ---
>
>>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu <suka@suka.(none)>
> Date: Thu, 22 Oct 2009 16:57:28 -0700
> Subject: [PATCH] Define set_pidmap() function
>
> Define a set_pidmap() interface which is like alloc_pidmap() only that
> caller specifies the pid number to be assigned.
>
> Changelog[v9]:
> 	- Complete rewrite this patch based on Eric Biederman's code.
> Changelog[v7]:
>         - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
> Changelog[v6]:
>         - Separate target_pid > 0 case to minimize the number of checks needed.
> Changelog[v3]:
>         - (Eric Biederman): Avoid set_pidmap() function. Added couple of
>           checks for target_pid in alloc_pidmap() itself.
> Changelog[v2]:
>         - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
>           actually checks for 'pid <= 0' for completeness).
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
>  1 files changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c4d9914..9346755 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
>  	return 0;
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
> +		int max)
>  {
> -	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> +	int i, offset, max_scan, pid;
>  	int rc = -EAGAIN;
>  	struct pidmap *map;
>  
>  	pid = last + 1;
>  	if (pid >= pid_max)
> -		pid = RESERVED_PIDS;
> +		pid = min;
>  	offset = pid & BITS_PER_PAGE_MASK;
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> -	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
> +	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
>  		rc = alloc_pidmap_page(map);
>  		if (rc)
> @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
>  					return pid;
>  				}
>  				offset = find_next_offset(map, offset);
> @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			 * bitmap block and the final block was the same
>  			 * as the starting point, pid is before last_pid.
>  			 */
> -			} while (offset < BITS_PER_PAGE && pid < pid_max &&
> +			} while (offset < BITS_PER_PAGE && pid < max &&
>  					(i != max_scan || pid < last ||
>  					    !((last+1) & BITS_PER_PAGE_MASK)));
>  		}
> -		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
> +		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
>  			++map;
>  			offset = 0;
>  		} else {
>  			map = &pid_ns->pidmap[0];
> -			offset = RESERVED_PIDS;
> +			offset = min;
>  			if (unlikely(last == offset)) {
>  				rc = -EAGAIN;
>  				break;
> @@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	return rc;
>  }
>  
> +static int alloc_pidmap(struct pid_namespace *pid_ns)
> +{
> +	int nr;
> +
> +	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
                                     pid_ns->last_pid,

Looks like I missed that one.

> +	if (nr >= 0)
> +		pid_ns->last_pid = nr;
> +	return nr;
> +}
> +
> +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> +{
> +	if (!target)
> +		return alloc_pidmap(pid_ns);
> +
> +	if (target >= pid_max)
> +		return -EINVAL;
> +
> +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
> +		return -EINVAL;

if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))

Please.

Eric

> +
> +	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> -- 
> 1.6.0.4

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                       ` <20091023204812.GA26524-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 23:26                                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
> | Eric W. Biederman [ebiederm@xmission.com] wrote:
> | | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> | | > would it be ok enforce it only in init_pid_ns ?
> | | 
> | | It is mean for initial user space daemons, things that start on boot.
> | | 
> | | I don't know how much the protection matters at this date, but we have it.
> | 
> | Well, since it is not security or other critical restriction, can we allow
> | set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
> | subtree C/R of one of the early daemons for debug for instance.
>
> So here is how I have it at present. I would like to remove the RESERVED_PIDS
> check in set_pidmap() if its ok to do so.
>
> alloc_pid() does this:
>
> 	if (target_pids)
> 		set_pidmap(tmp, target_pids[i])
> 	else
> 		alloc_pidmap(tmp);
>
> Sukadev
> ---
>
>>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu <suka@suka.(none)>
> Date: Thu, 22 Oct 2009 16:57:28 -0700
> Subject: [PATCH] Define set_pidmap() function
>
> Define a set_pidmap() interface which is like alloc_pidmap() only that
> caller specifies the pid number to be assigned.
>
> Changelog[v9]:
> 	- Complete rewrite this patch based on Eric Biederman's code.
> Changelog[v7]:
>         - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
> Changelog[v6]:
>         - Separate target_pid > 0 case to minimize the number of checks needed.
> Changelog[v3]:
>         - (Eric Biederman): Avoid set_pidmap() function. Added couple of
>           checks for target_pid in alloc_pidmap() itself.
> Changelog[v2]:
>         - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
>           actually checks for 'pid <= 0' for completeness).
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
> ---
>  kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
>  1 files changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c4d9914..9346755 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
>  	return 0;
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
> +		int max)
>  {
> -	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> +	int i, offset, max_scan, pid;
>  	int rc = -EAGAIN;
>  	struct pidmap *map;
>  
>  	pid = last + 1;
>  	if (pid >= pid_max)
> -		pid = RESERVED_PIDS;
> +		pid = min;
>  	offset = pid & BITS_PER_PAGE_MASK;
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> -	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
> +	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
>  		rc = alloc_pidmap_page(map);
>  		if (rc)
> @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
>  					return pid;
>  				}
>  				offset = find_next_offset(map, offset);
> @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			 * bitmap block and the final block was the same
>  			 * as the starting point, pid is before last_pid.
>  			 */
> -			} while (offset < BITS_PER_PAGE && pid < pid_max &&
> +			} while (offset < BITS_PER_PAGE && pid < max &&
>  					(i != max_scan || pid < last ||
>  					    !((last+1) & BITS_PER_PAGE_MASK)));
>  		}
> -		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
> +		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
>  			++map;
>  			offset = 0;
>  		} else {
>  			map = &pid_ns->pidmap[0];
> -			offset = RESERVED_PIDS;
> +			offset = min;
>  			if (unlikely(last == offset)) {
>  				rc = -EAGAIN;
>  				break;
> @@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	return rc;
>  }
>  
> +static int alloc_pidmap(struct pid_namespace *pid_ns)
> +{
> +	int nr;
> +
> +	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
                                     pid_ns->last_pid,

Looks like I missed that one.

> +	if (nr >= 0)
> +		pid_ns->last_pid = nr;
> +	return nr;
> +}
> +
> +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> +{
> +	if (!target)
> +		return alloc_pidmap(pid_ns);
> +
> +	if (target >= pid_max)
> +		return -EINVAL;
> +
> +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
> +		return -EINVAL;

if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))

Please.

Eric

> +
> +	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> -- 
> 1.6.0.4

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-23 23:26                                         ` Eric W. Biederman
  0 siblings, 0 replies; 226+ messages in thread
From: Eric W. Biederman @ 2009-10-23 23:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
> | Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
> | | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> | | > would it be ok enforce it only in init_pid_ns ?
> | | 
> | | It is mean for initial user space daemons, things that start on boot.
> | | 
> | | I don't know how much the protection matters at this date, but we have it.
> | 
> | Well, since it is not security or other critical restriction, can we allow
> | set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
> | subtree C/R of one of the early daemons for debug for instance.
>
> So here is how I have it at present. I would like to remove the RESERVED_PIDS
> check in set_pidmap() if its ok to do so.
>
> alloc_pid() does this:
>
> 	if (target_pids)
> 		set_pidmap(tmp, target_pids[i])
> 	else
> 		alloc_pidmap(tmp);
>
> Sukadev
> ---
>
>>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu <suka@suka.(none)>
> Date: Thu, 22 Oct 2009 16:57:28 -0700
> Subject: [PATCH] Define set_pidmap() function
>
> Define a set_pidmap() interface which is like alloc_pidmap() only that
> caller specifies the pid number to be assigned.
>
> Changelog[v9]:
> 	- Complete rewrite this patch based on Eric Biederman's code.
> Changelog[v7]:
>         - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
> Changelog[v6]:
>         - Separate target_pid > 0 case to minimize the number of checks needed.
> Changelog[v3]:
>         - (Eric Biederman): Avoid set_pidmap() function. Added couple of
>           checks for target_pid in alloc_pidmap() itself.
> Changelog[v2]:
>         - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
>           actually checks for 'pid <= 0' for completeness).
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
>  1 files changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c4d9914..9346755 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
>  	return 0;
>  }
>  
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
> +		int max)
>  {
> -	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> +	int i, offset, max_scan, pid;
>  	int rc = -EAGAIN;
>  	struct pidmap *map;
>  
>  	pid = last + 1;
>  	if (pid >= pid_max)
> -		pid = RESERVED_PIDS;
> +		pid = min;
>  	offset = pid & BITS_PER_PAGE_MASK;
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> -	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
> +	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
>  		rc = alloc_pidmap_page(map);
>  		if (rc)
> @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
>  					return pid;
>  				}
>  				offset = find_next_offset(map, offset);
> @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			 * bitmap block and the final block was the same
>  			 * as the starting point, pid is before last_pid.
>  			 */
> -			} while (offset < BITS_PER_PAGE && pid < pid_max &&
> +			} while (offset < BITS_PER_PAGE && pid < max &&
>  					(i != max_scan || pid < last ||
>  					    !((last+1) & BITS_PER_PAGE_MASK)));
>  		}
> -		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
> +		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
>  			++map;
>  			offset = 0;
>  		} else {
>  			map = &pid_ns->pidmap[0];
> -			offset = RESERVED_PIDS;
> +			offset = min;
>  			if (unlikely(last == offset)) {
>  				rc = -EAGAIN;
>  				break;
> @@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	return rc;
>  }
>  
> +static int alloc_pidmap(struct pid_namespace *pid_ns)
> +{
> +	int nr;
> +
> +	nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
                                     pid_ns->last_pid,

Looks like I missed that one.

> +	if (nr >= 0)
> +		pid_ns->last_pid = nr;
> +	return nr;
> +}
> +
> +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> +{
> +	if (!target)
> +		return alloc_pidmap(pid_ns);
> +
> +	if (target >= pid_max)
> +		return -EINVAL;
> +
> +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
> +		return -EINVAL;

if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))

Please.

Eric

> +
> +	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> -- 
> 1.6.0.4

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                         ` <m1y6n1lmk7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-24  3:38                                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-24  3:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| > +{
| > +	if (!target)
| > +		return alloc_pidmap(pid_ns);

BTW,  we need this now that the RESERVED_PIDS check is is conditional on
->last_pid. But this makes set_pidmap() completely general so should we
have alloc_pid() call set_pidmap() always ? Or we could move this check
into alloc_pid(), but it may be better to have all values of 'target'
checked in one place.

| > +
| > +	if (target >= pid_max)
| > +		return -EINVAL;
| > +
| > +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
| > +		return -EINVAL;
| 
| if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))
| 
| Please.

Ok.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
       [not found]                                         ` <m1y6n1lmk7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-24  3:38                                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-24  3:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano, randy.dunlap, arnd,
	linux-api, Containers, Nathan Lynch, linux-kernel, Louis.Rilling,
	kosaki.motohiro, hpa, mingo, torvalds, Alexey Dobriyan, roland,
	Pavel Emelyanov

Eric W. Biederman [ebiederm@xmission.com] wrote:
| > +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| > +{
| > +	if (!target)
| > +		return alloc_pidmap(pid_ns);

BTW,  we need this now that the RESERVED_PIDS check is is conditional on
->last_pid. But this makes set_pidmap() completely general so should we
have alloc_pid() call set_pidmap() always ? Or we could move this check
into alloc_pid(), but it may be better to have all values of 'target'
checked in one place.

| > +
| > +	if (target >= pid_max)
| > +		return -EINVAL;
| > +
| > +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
| > +		return -EINVAL;
| 
| if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))
| 
| Please.

Ok.

^ permalink raw reply	[flat|nested] 226+ messages in thread

* Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
@ 2009-10-24  3:38                                           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 226+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-24  3:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Oren Laadan, Daniel Lezcano,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| > +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| > +{
| > +	if (!target)
| > +		return alloc_pidmap(pid_ns);

BTW,  we need this now that the RESERVED_PIDS check is is conditional on
->last_pid. But this makes set_pidmap() completely general so should we
have alloc_pid() call set_pidmap() always ? Or we could move this check
into alloc_pid(), but it may be better to have all values of 'target'
checked in one place.

| > +
| > +	if (target >= pid_max)
| > +		return -EINVAL;
| > +
| > +	if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
| > +		return -EINVAL;
| 
| if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))
| 
| Please.

Ok.

^ permalink raw reply	[flat|nested] 226+ messages in thread

end of thread, other threads:[~2009-10-24  3:38 UTC | newest]

Thread overview: 226+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-13  4:49 [RFC][v8][PATCH 0/10] Implement clone3() system call Sukadev Bhattiprolu
2009-10-13  4:49 ` Sukadev Bhattiprolu
2009-10-16 19:22 ` Daniel Lezcano
2009-10-16 19:44   ` Sukadev Bhattiprolu
2009-10-16 19:44     ` Sukadev Bhattiprolu
2009-10-19 20:34     ` Daniel Lezcano
2009-10-19 20:34       ` Daniel Lezcano
2009-10-19 21:47       ` Oren Laadan
2009-10-19 21:47         ` Oren Laadan
2009-10-20  0:51         ` Matt Helsley
2009-10-20  3:33           ` Eric W. Biederman
2009-10-20  3:33             ` Eric W. Biederman
2009-10-20  4:03             ` Sukadev Bhattiprolu
2009-10-20  4:03               ` Sukadev Bhattiprolu
     [not found]               ` <20091020040315.GA26632-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-20 10:46                 ` Eric W. Biederman
2009-10-20 10:46               ` Eric W. Biederman
2009-10-20 10:46                 ` Eric W. Biederman
2009-10-20 14:16                 ` Serge E. Hallyn
2009-10-20 14:16                   ` Serge E. Hallyn
2009-10-20 18:33                 ` Sukadev Bhattiprolu
2009-10-20 18:33                   ` Sukadev Bhattiprolu
     [not found]                   ` <20091020183329.GB22646-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-20 19:26                     ` Eric W. Biederman
2009-10-20 19:26                   ` Eric W. Biederman
2009-10-20 19:26                     ` Eric W. Biederman
2009-10-20 20:13                     ` Oren Laadan
2009-10-20 20:13                       ` Oren Laadan
     [not found]                     ` <m1r5sxsw7w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-20 20:13                       ` Oren Laadan
2009-10-21  6:20                       ` Sukadev Bhattiprolu
2009-10-21  6:20                     ` Sukadev Bhattiprolu
2009-10-21  6:20                       ` Sukadev Bhattiprolu
     [not found]                       ` <20091021062021.GA2667-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21  9:16                         ` Eric W. Biederman
2009-10-21  9:16                       ` Eric W. Biederman
2009-10-21  9:16                         ` Eric W. Biederman
2009-10-21 18:52                         ` Sukadev Bhattiprolu
2009-10-21 18:52                           ` Sukadev Bhattiprolu
2009-10-21 21:11                           ` Eric W. Biederman
2009-10-21 21:11                             ` Eric W. Biederman
     [not found]                           ` <20091021185242.GB12955-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 21:11                             ` Eric W. Biederman
     [not found]                         ` <m1eioxrtsb.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-21 18:52                           ` Sukadev Bhattiprolu
2009-10-23  0:42                           ` Sukadev Bhattiprolu
2009-10-23  0:42                         ` Sukadev Bhattiprolu
2009-10-23  0:42                           ` Sukadev Bhattiprolu
     [not found]                           ` <20091023004253.GA7915-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23  1:03                             ` Eric W. Biederman
2009-10-23  1:03                           ` Eric W. Biederman
2009-10-23  1:03                             ` Eric W. Biederman
2009-10-23  5:30                             ` Sukadev Bhattiprolu
2009-10-23  5:44                               ` Eric W. Biederman
2009-10-23  5:44                                 ` Eric W. Biederman
     [not found]                                 ` <m1ws2mpsuk.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-23 19:21                                   ` Sukadev Bhattiprolu
2009-10-23 19:21                                     ` Sukadev Bhattiprolu
2009-10-23 20:48                                     ` Sukadev Bhattiprolu
2009-10-23 20:48                                       ` Sukadev Bhattiprolu
     [not found]                                       ` <20091023204812.GA26524-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23 23:26                                         ` Eric W. Biederman
2009-10-23 23:26                                       ` Eric W. Biederman
2009-10-23 23:26                                         ` Eric W. Biederman
     [not found]                                         ` <m1y6n1lmk7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-24  3:38                                           ` Sukadev Bhattiprolu
2009-10-24  3:38                                         ` Sukadev Bhattiprolu
2009-10-24  3:38                                           ` Sukadev Bhattiprolu
     [not found]                                     ` <20091023192124.GA11088-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23 20:48                                       ` Sukadev Bhattiprolu
2009-10-23 19:16                               ` Oren Laadan
2009-10-23 19:16                                 ` Oren Laadan
     [not found]                                 ` <4AE20124.4010108-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-23 19:34                                   ` Oren Laadan
2009-10-23 19:34                                 ` Oren Laadan
2009-10-23 19:34                                   ` Oren Laadan
     [not found]                                   ` <4AE20532.6060809-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-23 23:12                                     ` Eric W. Biederman
2009-10-23 23:12                                   ` Eric W. Biederman
2009-10-23 23:12                                     ` Eric W. Biederman
     [not found]                               ` <20091023053001.GA24972-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23  5:44                                 ` Eric W. Biederman
2009-10-23 19:16                                 ` Oren Laadan
     [not found]                             ` <m1bpjyrkep.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-23  5:30                               ` Sukadev Bhattiprolu
     [not found]                 ` <m1iqeauyvl.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-20 14:16                   ` Serge E. Hallyn
2009-10-20 18:33                   ` Sukadev Bhattiprolu
2009-10-20 14:09             ` Serge E. Hallyn
2009-10-20 14:09               ` Serge E. Hallyn
     [not found]             ` <m1vdiad9jd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-20  4:03               ` Sukadev Bhattiprolu
2009-10-20 14:09               ` Serge E. Hallyn
     [not found]           ` <20091020005125.GG27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-20  3:33             ` Eric W. Biederman
     [not found]         ` <4ADCDE7F.4090501-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-20  0:51           ` Matt Helsley
2009-10-21 15:53           ` Daniel Lezcano
2009-10-21 15:53         ` Daniel Lezcano
2009-10-21 15:53           ` Daniel Lezcano
2009-10-21 18:45           ` Oren Laadan
2009-10-21 18:45             ` Oren Laadan
2009-10-22 11:22             ` Daniel Lezcano
2009-10-22 11:22               ` Daniel Lezcano
     [not found]             ` <4ADF56D4.8030405-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-22 11:22               ` Daniel Lezcano
     [not found]           ` <4ADF2E75.1020801-GANU6spQydw@public.gmane.org>
2009-10-21 18:45             ` Oren Laadan
     [not found]       ` <4ADCCD68.9030003-GANU6spQydw@public.gmane.org>
2009-10-19 21:47         ` Oren Laadan
     [not found]     ` <20091016194451.GA28706-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 20:34       ` Daniel Lezcano
     [not found]   ` <4AD8C7E4.9000903-GANU6spQydw@public.gmane.org>
2009-10-16 19:44     ` Sukadev Bhattiprolu
     [not found] ` <20091013044925.GA28181-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13  4:49   ` [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-10-13  4:49     ` Sukadev Bhattiprolu
2009-10-13  4:50   ` [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-10-13  4:50     ` Sukadev Bhattiprolu
2009-10-13  4:50   ` [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property Sukadev Bhattiprolu
2009-10-13  4:50     ` Sukadev Bhattiprolu
     [not found]     ` <20091013045041.GC28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13  5:19       ` Alexey Dobriyan
2009-10-13  5:19         ` Alexey Dobriyan
2009-10-13 13:09       ` Pavel Emelyanov
2009-10-13 13:09         ` Pavel Emelyanov
     [not found]         ` <4AD47C1F.7040703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2009-10-13 15:24           ` Serge E. Hallyn
2009-10-13 15:24             ` Serge E. Hallyn
2009-10-13 16:10             ` Pavel Emelyanov
2009-10-13 16:28               ` Serge E. Hallyn
     [not found]               ` <4AD4A676.3010603-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2009-10-13 16:28                 ` Serge E. Hallyn
     [not found]             ` <20091013152453.GA9994-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13 16:10               ` Pavel Emelyanov
2009-10-13  4:51   ` [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
2009-10-13  4:51     ` Sukadev Bhattiprolu
     [not found]     ` <20091013045104.GD28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13 11:50       ` Pavel Emelyanov
2009-10-13 11:50         ` Pavel Emelyanov
     [not found]         ` <4AD46977.5020303-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2009-10-15  0:24           ` Sukadev Bhattiprolu
2009-10-15  0:24             ` Sukadev Bhattiprolu
2009-10-13  4:51   ` [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-10-13  4:51     ` Sukadev Bhattiprolu
2009-10-13  4:52   ` [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-10-13  4:52     ` Sukadev Bhattiprolu
2009-10-13  4:52   ` [RFC][v8][PATCH 7/10]: Check invalid clone flags Sukadev Bhattiprolu
2009-10-13  4:52     ` Sukadev Bhattiprolu
     [not found]     ` <20091013045234.GG28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13 18:35       ` Oren Laadan
2009-10-13 18:35         ` Oren Laadan
     [not found]         ` <4AD4C88D.7040008-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-13 23:38           ` Sukadev Bhattiprolu
2009-10-13 23:38             ` Sukadev Bhattiprolu
2009-10-13  4:52   ` [RFC][v8][PATCH 8/10]: Define do_fork_with_pids() Sukadev Bhattiprolu
2009-10-13  4:52     ` Sukadev Bhattiprolu
2009-10-13  4:54   ` [RFC][v8][PATCH 9/10]: Define clone3() syscall Sukadev Bhattiprolu
2009-10-13  4:54     ` Sukadev Bhattiprolu
     [not found]     ` <20091013045439.GI28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13 18:46       ` Oren Laadan
2009-10-13 18:46         ` Oren Laadan
2009-10-16  4:20       ` Sukadev Bhattiprolu
2009-10-16  4:20         ` Sukadev Bhattiprolu
     [not found]         ` <20091016042041.GA7220-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-16  6:25           ` Michael Kerrisk
2009-10-16  6:25             ` Michael Kerrisk
     [not found]             ` <cfd18e0f0910152325m4a9125c2q18f36f5bd7d4a0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-16 18:06               ` Sukadev Bhattiprolu
2009-10-16 18:06             ` Sukadev Bhattiprolu
2009-10-16 18:06               ` Sukadev Bhattiprolu
2009-10-19 17:44               ` Matt Helsley
2009-10-19 17:44                 ` Matt Helsley
     [not found]                 ` <20091019174405.GE27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-19 21:31                   ` H. Peter Anvin
2009-10-19 21:31                 ` H. Peter Anvin
2009-10-19 21:31                   ` H. Peter Anvin
2009-10-19 23:50                   ` Matt Helsley
2009-10-19 23:50                     ` Matt Helsley
2009-10-21  4:26                     ` Michael Kerrisk
2009-10-21  4:26                       ` Michael Kerrisk
     [not found]                       ` <cfd18e0f0910202126w1ebfa2e2u2ecbaa23b198d100-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-21 13:03                         ` H. Peter Anvin
2009-10-21 13:03                       ` H. Peter Anvin
2009-10-21 19:44                         ` Sukadev Bhattiprolu
2009-10-21 19:44                           ` Sukadev Bhattiprolu
     [not found]                           ` <20091021194440.GA1283-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 22:03                             ` H. Peter Anvin
2009-10-22 10:40                             ` Michael Kerrisk
2009-10-21 22:03                           ` H. Peter Anvin
2009-10-22 10:40                           ` Michael Kerrisk
2009-10-22 10:40                             ` Michael Kerrisk
2009-10-22 18:10                             ` Sukadev Bhattiprolu
2009-10-22 18:10                               ` Sukadev Bhattiprolu
     [not found]                             ` <cfd18e0f0910220340n7c655daap78e395136c56f882-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-22 18:10                               ` Sukadev Bhattiprolu
     [not found]                         ` <4ADF06B7.50508-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-21 19:44                           ` Sukadev Bhattiprolu
2009-10-22 10:26                           ` Michael Kerrisk
2009-10-22 10:26                         ` Michael Kerrisk
2009-10-22 11:38                           ` H. Peter Anvin
2009-10-22 12:14                             ` Michael Kerrisk
2009-10-22 12:19                               ` H. Peter Anvin
2009-10-22 12:19                                 ` H. Peter Anvin
     [not found]                               ` <cfd18e0f0910220514y1bd5967aj3a04bc3f5b38948b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-22 12:19                                 ` H. Peter Anvin
2009-10-22 13:57                                 ` Matt Helsley
2009-10-22 13:57                               ` Matt Helsley
2009-10-22 13:57                                 ` Matt Helsley
     [not found]                             ` <4AE0444B.5070603-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-22 12:14                               ` Michael Kerrisk
     [not found]                           ` <cfd18e0f0910220326w1a71b899s5ff0fce9e0a2b8ed-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-22 11:38                             ` H. Peter Anvin
     [not found]                     ` <20091019235012.GF27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-21  4:26                       ` Michael Kerrisk
     [not found]                   ` <4ADCDAA8.5080408-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-19 23:50                     ` Matt Helsley
     [not found]               ` <20091016180631.GA31036-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 17:44                 ` Matt Helsley
2009-10-13  4:55   ` [RFC][v8][PATCH 10/10]: Document " Sukadev Bhattiprolu
2009-10-13  4:55     ` Sukadev Bhattiprolu
     [not found]     ` <20091013045556.GJ28435-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-14 12:26       ` Arnd Bergmann
2009-10-14 12:26         ` Arnd Bergmann
     [not found]         ` <200910141426.35338.arnd-r2nGTMty4D4@public.gmane.org>
2009-10-14 18:39           ` Sukadev Bhattiprolu
2009-10-14 18:39             ` Sukadev Bhattiprolu
2009-10-19 21:36       ` Pavel Machek
2009-10-19 21:36     ` Pavel Machek
2009-10-19 21:36       ` Pavel Machek
     [not found]       ` <20091019213636.GB1482-+ZI9xUNit7I@public.gmane.org>
2009-10-21  8:37         ` Arnd Bergmann
2009-10-21 18:27         ` Sukadev Bhattiprolu
2009-10-21  8:37       ` Arnd Bergmann
2009-10-21  8:37         ` Arnd Bergmann
2009-10-21  9:33         ` Pavel Machek
2009-10-21 13:26           ` Arnd Bergmann
2009-10-21 13:26             ` Arnd Bergmann
     [not found]             ` <200910211526.50584.arnd-r2nGTMty4D4@public.gmane.org>
2009-10-21 19:09               ` Pavel Machek
2009-10-21 19:09               ` Pavel Machek
     [not found]           ` <20091021093338.GA11670-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
2009-10-21 13:26             ` Arnd Bergmann
     [not found]         ` <200910211037.38413.arnd-r2nGTMty4D4@public.gmane.org>
2009-10-21  9:33           ` Pavel Machek
2009-10-21 18:27       ` Sukadev Bhattiprolu
2009-10-21 18:27         ` Sukadev Bhattiprolu
2009-10-13 20:50   ` [RFC][v8][PATCH 0/10] Implement clone3() system call Roland McGrath
2009-10-13 20:50     ` Roland McGrath
     [not found]     ` <20091013205015.1ED524F7-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
2009-10-13 23:27       ` Sukadev Bhattiprolu
2009-10-13 23:27         ` Sukadev Bhattiprolu
     [not found]         ` <20091013232736.GA24392-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-13 23:53           ` Roland McGrath
2009-10-13 23:53             ` Roland McGrath
     [not found]             ` <20091013235320.E90022746-nL1rrgvulkc2UH6IwYuUx0EOCMrvLtNR@public.gmane.org>
2009-10-14  1:13               ` H. Peter Anvin
2009-10-14  1:13                 ` H. Peter Anvin
     [not found]                 ` <4AD525B3.2070906-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-14  4:36                   ` Sukadev Bhattiprolu
2009-10-14  4:36                     ` Sukadev Bhattiprolu
     [not found]                     ` <20091014043607.GA32114-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-14  4:38                       ` H. Peter Anvin
2009-10-14  4:38                         ` H. Peter Anvin
2009-10-14 22:36                   ` Sukadev Bhattiprolu
2009-10-14 22:36                     ` Sukadev Bhattiprolu
2009-10-14 22:49                     ` H. Peter Anvin
2009-10-14 22:49                       ` H. Peter Anvin
     [not found]                       ` <4AD6557D.3090501-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-15  0:17                         ` Sukadev Bhattiprolu
2009-10-15  0:17                           ` Sukadev Bhattiprolu
     [not found]                     ` <20091014223634.GB3515-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-14 22:49                       ` H. Peter Anvin
2009-10-13 23:49   ` H. Peter Anvin
2009-10-13 23:49     ` H. Peter Anvin
     [not found]     ` <4AD511F1.7010207-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-14  1:39       ` Matt Helsley
2009-10-14  1:39         ` Matt Helsley
     [not found]         ` <20091014013936.GC27627-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-14  2:24           ` H. Peter Anvin
2009-10-14  2:24             ` H. Peter Anvin
     [not found]             ` <4AD5365E.5090709-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2009-10-14  4:40               ` Sukadev Bhattiprolu
2009-10-14  4:40                 ` Sukadev Bhattiprolu
     [not found]                 ` <20091014044035.GB32114-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-14  4:50                   ` H. Peter Anvin
2009-10-14  4:50                     ` H. Peter Anvin
2009-10-14 16:07                   ` Serge E. Hallyn
2009-10-14 16:07                     ` Serge E. Hallyn
2009-10-16 19:22   ` Daniel Lezcano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.