All of lore.kernel.org
 help / color / mirror / Atom feed
* [v11][PATCH 0/9] Implement clone_with_pids() system call
@ 2009-11-05  5:30 Sukadev Bhattiprolu
  2009-11-05  5:36 ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
                   ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, roland, arnd, hpa, Eric W. Biederman,
	Pavel Emelyanov, Alexey Dobriyan, Containers


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 0/9] Implement clone_with_pids() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone_with_pids() that lets a
process specify the pids of the child process.

Patches 1 through 6 are helper patches needed for choosing a pid for the
child process.

PATCH 8 defines a prototype of the new system call. PATCH 9 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'

Changelog[v10]:
	- [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
	  than the generic system call
	- Rename clone3() to clone_with_pids()
	- Update Documentation/clone_with_pids() to show example usage with
	  the PTREGSCALL implementation.

Changelog[v9]:
	- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
	  of struct pid_namespace
	- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
	  avoid inadvertent truncation clone_flags, preserve the first
	  parameter of clone3() as 'u32 clone_flags' and specify newer
	  flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
	- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
	  remove duplication (see PATCH 3/9].
	  
Changelog[v8]:
	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
	  The name 'clone2()' is in use - renamed new syscall to clone3().
	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
	  (Added [PATCH 7/10] to the patchset).

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Group the arguments to clone2() into a 'struct clone_arg' to
	  workaround the issue of exceeding 6 arguments to the system call.
	  Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT NOTE:

clone() system call has another limitation - all but one bits in clone-flags
are in use and if more new clone-flags are needed, we will need a variant of
the clone() system call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
	  bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset). Based on input from
Arnd Bergmann and others, the new interface of the system call is: 

	struct clone_args {
		u64 clone_flags_high;
		u64 child_stack_base;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u32 nr_pids;
		u32 reserved0;
		u64 reserved1;
	};

	sys_clone_with_pids(u32 flags_low, struct clone_args *cargs, 
			int args_size, pid_t *pids)

Details of the struct clone_args and the usage are explained in the
documentation (PATCH 9/9).

NOTE:
	While this patchset enables support for more clone-flags, actual
	implementation for additional clone-flags is best implemented as
	a separate patchset (PATCH 8/9 identifies some TODOs)

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [v11][PATCH 1/9] Factor out code to allocate pidmap page
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-05  5:36   ` Sukadev Bhattiprolu
  2009-11-05  5:37   ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:36 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 1/9] Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..7d4bb6e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 1/9] Factor out code to allocate pidmap page
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
@ 2009-11-05  5:36 ` Sukadev Bhattiprolu
  2009-11-05  5:37 ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, Oren Laadan, serue


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 1/9] Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..7d4bb6e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 2/9] Have alloc_pidmap() return actual error code
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-05  5:36   ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
@ 2009-11-05  5:37   ` Sukadev Bhattiprolu
  2009-11-05  5:38   ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:37 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 2/9] Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4c20fff..93626b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 7d4bb6e..c4d9914 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 2/9] Have alloc_pidmap() return actual error code
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
  2009-11-05  5:36 ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
@ 2009-11-05  5:37 ` Sukadev Bhattiprolu
  2009-11-05  5:38 ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 2/9] Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4c20fff..93626b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 7d4bb6e..c4d9914 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 3/9] Define set_pidmap() function
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-05  5:36   ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
  2009-11-05  5:37   ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
@ 2009-11-05  5:38   ` Sukadev Bhattiprolu
  2009-11-05  5:38   ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:38 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 3/9] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/pid.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..c849327 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS && 
+				pid_ns->last_pid >= RESERVED_PIDS))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 3/9] Define set_pidmap() function
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
  2009-11-05  5:36 ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
  2009-11-05  5:37 ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
@ 2009-11-05  5:38 ` Sukadev Bhattiprolu
  2009-11-05  5:38 ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 3/9] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..c849327 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS && 
+				pid_ns->last_pid >= RESERVED_PIDS))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 4/9] Add target_pids parameter to alloc_pid()
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-11-05  5:38   ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
@ 2009-11-05  5:38   ` Sukadev Bhattiprolu
  2009-11-05  5:39   ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:38 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 4/9] Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 93626b2..3f1dddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index c849327..7c73096 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -281,13 +281,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	pid_t tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = set_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 4/9] Add target_pids parameter to alloc_pid()
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (2 preceding siblings ...)
  2009-11-05  5:38 ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
@ 2009-11-05  5:38 ` Sukadev Bhattiprolu
  2009-11-05  5:39 ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 4/9] Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 93626b2..3f1dddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index c849327..7c73096 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -281,13 +281,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	pid_t tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = set_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 5/9] Add target_pids parameter to copy_process()
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-11-05  5:38   ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
@ 2009-11-05  5:39   ` Sukadev Bhattiprolu
  2009-11-05  5:40   ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:39 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 5/9] Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3f1dddf..c8a06de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 5/9] Add target_pids parameter to copy_process()
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (3 preceding siblings ...)
  2009-11-05  5:38 ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
@ 2009-11-05  5:39 ` Sukadev Bhattiprolu
  2009-11-05  5:40 ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 5/9] Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3f1dddf..c8a06de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 6/9] Check invalid clone flags
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-11-05  5:39   ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-11-05  5:40   ` Sukadev Bhattiprolu
  2009-11-05  5:40   ` [v11][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:40 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 6/9] Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v9]:
	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
	  to avoid breaking any applications that may have set it. IOW, this
	  patch/check only applies to clone-flags bits 33 and higher.

Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |   12 ++++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..6b319a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define CLONE_UNUSED        	0x00001000	/* Can be reused ? */
+
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
+				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index c8a06de..11f77ed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 6/9] Check invalid clone flags
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (4 preceding siblings ...)
  2009-11-05  5:39 ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-11-05  5:40 ` Sukadev Bhattiprolu
  2009-11-05  5:40 ` [v11][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 6/9] Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v9]:
	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
	  to avoid breaking any applications that may have set it. IOW, this
	  patch/check only applies to clone-flags bits 33 and higher.

Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |   12 ++++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..6b319a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define CLONE_UNUSED        	0x00001000	/* Can be reused ? */
+
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
+				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index c8a06de..11f77ed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 7/9] Define do_fork_with_pids()
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-11-05  5:40   ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
@ 2009-11-05  5:40   ` Sukadev Bhattiprolu
  2009-11-05  5:41   ` [v11][PATCH 8/9] Define clone_with_pids() syscall Sukadev Bhattiprolu
  2009-11-05  5:42   ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:40 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 7/9] Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b319a0..265efe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11f77ed..210e841 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 7/9] Define do_fork_with_pids()
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (5 preceding siblings ...)
  2009-11-05  5:40 ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
@ 2009-11-05  5:40 ` Sukadev Bhattiprolu
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, Oren Laadan, serue


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 7/9] Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b319a0..265efe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11f77ed..210e841 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 8/9] Define clone_with_pids() syscall
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-11-05  5:40   ` [v11][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
@ 2009-11-05  5:41   ` Sukadev Bhattiprolu
  2009-11-05  5:42   ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:41 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 8/9] Define clone_with_pids() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as
clone(), except that it takes a 'pids' paramter. This parameter lets
caller choose specific pid numbers for the child process, in the
process's active and ancestor pid namespaces. (Descendant pid namespaces
in general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).

clone_with_pids() also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and all but
one of these are in use. If more new clone flags are needed, we will be
forced to define a new variant of the clone() system call. To address
this, clone_with_pids() allows at least 64 clone flags with some room
for more if necessary.

To prevent unprivileged processes from misusing this interface,
clone_with_pids() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
is non-NULL.

See Documentation/clone_with_pids in next patch for more details and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for clone_with_pids() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if clone_with_pids() needs to be further extended in the
	  future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  Its probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'

Changelog[v10]:
	- Rename clone3() to clone_with_pids()
	- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
	  implementation

Changelog[v9]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field

Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h    |    1 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   49 ++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/types.h              |   11 +++
 kernel/fork.c                      |  124 +++++++++++++++++++++++++++++++++++-
 8 files changed, 189 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..3205f55 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..a427571 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4cf7956..55dd051 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,55 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	int rc;
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->bx;
+	uca = (struct clone_args __user *)regs->cx;
+	args_size = regs->dx;
+	pids = (pid_t __user *)regs->di;
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+	if (!child_stack)
+		child_stack = regs->sp;
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1357e83 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long ptregs_clone_with_pids
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 265efe5..0095ba0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,8 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern int fetch_clone_args_from_user(struct clone_args __user *, int,
+				struct clone_args *);
 extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
 				unsigned long, int __user *, int __user *,
 				unsigned int, pid_t __user *);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..0d08683 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,17 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+	u64 reserved1;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 210e841..da28baf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+int
+fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
+			struct clone_args *kca)
+{
+	int rc;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * 	 could be that kernel is newer and has an extended structure.
+	 * 	 When that happens, this check needs to be smarter.  For now,
+	 * 	 assume exact match.
+	 */
+	if (args_size != sizeof(struct clone_args))
+		return -EINVAL;
+
+        rc = copy_from_user(kca, uca, args_size);
+        if (rc)
+                return -EFAULT;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 8/9] Define clone_with_pids() syscall
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (7 preceding siblings ...)
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-05  5:41 ` Sukadev Bhattiprolu
  2009-11-06 18:02   ` Serge E. Hallyn
                     ` (2 more replies)
  2009-11-05  5:42 ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
  9 siblings, 3 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 8/9] Define clone_with_pids() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as
clone(), except that it takes a 'pids' paramter. This parameter lets
caller choose specific pid numbers for the child process, in the
process's active and ancestor pid namespaces. (Descendant pid namespaces
in general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).

clone_with_pids() also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and all but
one of these are in use. If more new clone flags are needed, we will be
forced to define a new variant of the clone() system call. To address
this, clone_with_pids() allows at least 64 clone flags with some room
for more if necessary.

To prevent unprivileged processes from misusing this interface,
clone_with_pids() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
is non-NULL.

See Documentation/clone_with_pids in next patch for more details and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for clone_with_pids() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if clone_with_pids() needs to be further extended in the
	  future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  Its probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'

Changelog[v10]:
	- Rename clone3() to clone_with_pids()
	- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
	  implementation

Changelog[v9]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field

Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    1 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   49 ++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/types.h              |   11 +++
 kernel/fork.c                      |  124 +++++++++++++++++++++++++++++++++++-
 8 files changed, 189 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..3205f55 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..a427571 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4cf7956..55dd051 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,55 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	int rc;
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->bx;
+	uca = (struct clone_args __user *)regs->cx;
+	args_size = regs->dx;
+	pids = (pid_t __user *)regs->di;
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+	if (!child_stack)
+		child_stack = regs->sp;
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1357e83 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long ptregs_clone_with_pids
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 265efe5..0095ba0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,8 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern int fetch_clone_args_from_user(struct clone_args __user *, int,
+				struct clone_args *);
 extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
 				unsigned long, int __user *, int __user *,
 				unsigned int, pid_t __user *);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..0d08683 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,17 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+	u64 reserved1;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 210e841..da28baf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+int
+fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
+			struct clone_args *kca)
+{
+	int rc;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * 	 could be that kernel is newer and has an extended structure.
+	 * 	 When that happens, this check needs to be smarter.  For now,
+	 * 	 assume exact match.
+	 */
+	if (args_size != sizeof(struct clone_args))
+		return -EINVAL;
+
+        rc = copy_from_user(kca, uca, args_size);
+        if (rc)
+                return -EFAULT;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-11-05  5:41   ` [v11][PATCH 8/9] Define clone_with_pids() syscall Sukadev Bhattiprolu
@ 2009-11-05  5:42   ` Sukadev Bhattiprolu
  8 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:42 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 9/9] Document clone_with_pids() syscall

This gives a brief overview of the clone_with_pids() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan  <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 Documentation/clone_with_pids |  332 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 332 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/clone_with_pids

diff --git a/Documentation/clone_with_pids b/Documentation/clone_with_pids
new file mode 100644
index 0000000..80e9b20
--- /dev/null
+++ b/Documentation/clone_with_pids
@@ -0,0 +1,332 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+	u64 reserved1;
+};
+
+
+clone_with_pids(u32 flags_low, struct clone_args * __user cargs,
+		int cargs_size, pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone_with_pids() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When clone_with_pids() supports more than 32 clone flags, the
+		additional bits in the clone_flags should be specified in this
+		field.  This field is currently unused and must be set to 0.
+
+	u64 child_stack_base;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields
+		in clone() and clone2() system calls (on IA64).
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to clone_with_pids() (see below). nr_pids should
+		not exceed the current nesting level of the calling process
+		(i.e if the process is in init_pid_ns, nr_pids must be 1,
+		if process is in a pid namespace that is a child of
+		init-pid-ns, nr_pids cannot exceed 2, and so on).
+
+	u32 reserved0;
+	u64 reserved1;
+
+		These fields are intended to extend the functionality of the
+		clone_with_pids() in the future, while preserving backward
+		compatibility. They must be set to 0 for now.
+
+	The @cargs_size parameter specifes the sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility.  For now, this field must be set
+	to the sizeof(struct clone_args) and this size must match the kernel's
+	view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The order of pids in @pids is oldest in pids[0] to youngest pid
+	namespace in pids[nr_pids-1]. If the number of pids specified in the
+	@pids list is fewer than the nesting level of the process, the pids
+	are applied from youngest namespace. i.e if the process is nested in
+	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+	have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+	following values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specifed
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/* 
+ * Example clone_with_pids() usage - Create a child with pid CHILD_TID1 if
+ * program is run in init_pid_ns. If program is run in a child of init_pid_ns,
+ * create the child process with pid CHILD_TID2.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_clone_with_pids	337
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE	8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+        u64 clone_flags_high;
+
+        u64 child_stack_base;
+        u64 child_stack_size;
+
+        u64 parent_tid_ptr;
+        u64 child_tid_ptr;
+
+        u32 nr_pids;
+
+        u32 reserved0;
+        u64 reserved1;
+};
+
+#define exit		_exit
+
+/*
+ * Following clone_with_pids() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_clone_with_pids)
+
+int clone_with_pids(int flags_low, struct clone_args *clone_args, int args_size,
+	 	int *pids)
+{
+	long retval;
+
+	__asm__  __volatile__(
+		 "movl %0, %%ebx\n\t"		/* flags -> 1st (ebx) */
+		 "movl %1, %%ecx\n\t"		/* clone_args -> 2nd (ecx)*/
+		 "movl %2, %%edx\n\t"		/* args_size -> 3rd (edx) */
+		 "movl %3, %%edi\n\t"		/* pids -> 4th (edi)*/
+		 "pushl %%ebp\n\t"		/* save value of ebp */
+		:
+		:"b" (flags_low),
+		 "c" (clone_args),
+		 "d" (args_size),
+		 "D" (pids)
+		);
+
+	__asm__ __volatile__(
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+		 "popl %%ebx\n\t"	/* get subthread function */
+		 "call *%%ebx\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+		:"=a" (retval)
+		:"0" (__NR_clone_with_pids), "i" (__NR_exit)
+		:"ebx", "ecx", "edx"
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg)
+{
+	void *child_stack;
+	void **new_stack;
+
+        child_stack = malloc(STACKSIZE);
+        if (!child_stack) {
+		perror("malloc()");
+		exit(1);
+	}
+        child_stack = (char *)child_stack + (STACKSIZE - 4);
+
+	new_stack = (void **)child_stack;
+	*--new_stack = child_arg;
+	*--new_stack = child_fn;
+
+	return new_stack;
+}
+
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	25
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p,"
+				"actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)cs->child_tid_ptr);
+	if (ctid != CHILD_TID) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID, ctid);
+		exit(1);
+	}
+	sleep(3);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg, 
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack_base 	= (u64)stack;
+	ca->child_tid_ptr 	= (u64)&child_tid;
+	ca->nr_pids 		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = clone_with_pids(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: clone_with_pids() returned %d, error %d\n",
+		 getpid(), gettid(), rc, errno);
+
+	return rc;
+}
+
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2 };
+main()
+{
+	int rc, pid, ret, status;
+	unsigned long flags; 
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_PARENT_SETTID|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		} 
+	}
+}
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
                   ` (8 preceding siblings ...)
  2009-11-05  5:41 ` [v11][PATCH 8/9] Define " Sukadev Bhattiprolu
@ 2009-11-05  5:42 ` Sukadev Bhattiprolu
       [not found]   ` <20091105054204.GI16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-06 18:39   ` Serge E. Hallyn
  9 siblings, 2 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: arnd, Containers, Eric W. Biederman, hpa, Alexey Dobriyan,
	roland, Pavel Emelyanov, serue, Oren Laadan


From: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Subject: [v11][PATCH 9/9] Document clone_with_pids() syscall

This gives a brief overview of the clone_with_pids() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Oren Laadan  <orenl@cs.columbia.edu>
---
 Documentation/clone_with_pids |  332 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 332 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/clone_with_pids

diff --git a/Documentation/clone_with_pids b/Documentation/clone_with_pids
new file mode 100644
index 0000000..80e9b20
--- /dev/null
+++ b/Documentation/clone_with_pids
@@ -0,0 +1,332 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+	u64 reserved1;
+};
+
+
+clone_with_pids(u32 flags_low, struct clone_args * __user cargs,
+		int cargs_size, pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone_with_pids() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When clone_with_pids() supports more than 32 clone flags, the
+		additional bits in the clone_flags should be specified in this
+		field.  This field is currently unused and must be set to 0.
+
+	u64 child_stack_base;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields
+		in clone() and clone2() system calls (on IA64).
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to clone_with_pids() (see below). nr_pids should
+		not exceed the current nesting level of the calling process
+		(i.e if the process is in init_pid_ns, nr_pids must be 1,
+		if process is in a pid namespace that is a child of
+		init-pid-ns, nr_pids cannot exceed 2, and so on).
+
+	u32 reserved0;
+	u64 reserved1;
+
+		These fields are intended to extend the functionality of the
+		clone_with_pids() in the future, while preserving backward
+		compatibility. They must be set to 0 for now.
+
+	The @cargs_size parameter specifes the sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility.  For now, this field must be set
+	to the sizeof(struct clone_args) and this size must match the kernel's
+	view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The order of pids in @pids is oldest in pids[0] to youngest pid
+	namespace in pids[nr_pids-1]. If the number of pids specified in the
+	@pids list is fewer than the nesting level of the process, the pids
+	are applied from youngest namespace. i.e if the process is nested in
+	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+	have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+	following values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specifed
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/* 
+ * Example clone_with_pids() usage - Create a child with pid CHILD_TID1 if
+ * program is run in init_pid_ns. If program is run in a child of init_pid_ns,
+ * create the child process with pid CHILD_TID2.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_clone_with_pids	337
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE	8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+        u64 clone_flags_high;
+
+        u64 child_stack_base;
+        u64 child_stack_size;
+
+        u64 parent_tid_ptr;
+        u64 child_tid_ptr;
+
+        u32 nr_pids;
+
+        u32 reserved0;
+        u64 reserved1;
+};
+
+#define exit		_exit
+
+/*
+ * Following clone_with_pids() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_clone_with_pids)
+
+int clone_with_pids(int flags_low, struct clone_args *clone_args, int args_size,
+	 	int *pids)
+{
+	long retval;
+
+	__asm__  __volatile__(
+		 "movl %0, %%ebx\n\t"		/* flags -> 1st (ebx) */
+		 "movl %1, %%ecx\n\t"		/* clone_args -> 2nd (ecx)*/
+		 "movl %2, %%edx\n\t"		/* args_size -> 3rd (edx) */
+		 "movl %3, %%edi\n\t"		/* pids -> 4th (edi)*/
+		 "pushl %%ebp\n\t"		/* save value of ebp */
+		:
+		:"b" (flags_low),
+		 "c" (clone_args),
+		 "d" (args_size),
+		 "D" (pids)
+		);
+
+	__asm__ __volatile__(
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+		 "popl %%ebx\n\t"	/* get subthread function */
+		 "call *%%ebx\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+		:"=a" (retval)
+		:"0" (__NR_clone_with_pids), "i" (__NR_exit)
+		:"ebx", "ecx", "edx"
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg)
+{
+	void *child_stack;
+	void **new_stack;
+
+        child_stack = malloc(STACKSIZE);
+        if (!child_stack) {
+		perror("malloc()");
+		exit(1);
+	}
+        child_stack = (char *)child_stack + (STACKSIZE - 4);
+
+	new_stack = (void **)child_stack;
+	*--new_stack = child_arg;
+	*--new_stack = child_fn;
+
+	return new_stack;
+}
+
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	25
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p,"
+				"actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)cs->child_tid_ptr);
+	if (ctid != CHILD_TID) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID, ctid);
+		exit(1);
+	}
+	sleep(3);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg, 
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack_base 	= (u64)stack;
+	ca->child_tid_ptr 	= (u64)&child_tid;
+	ca->nr_pids 		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = clone_with_pids(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: clone_with_pids() returned %d, error %d\n",
+		 getpid(), gettid(), rc, errno);
+
+	return rc;
+}
+
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2 };
+main()
+{
+	int rc, pid, ret, status;
+	unsigned long flags; 
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_PARENT_SETTID|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		} 
+	}
+}
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
       [not found]   ` <20091105054124.GH16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-06 18:02     ` Serge E. Hallyn
  2009-11-09 20:37     ` Serge E. Hallyn
  1 sibling, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-06 18:02 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Sukadev Bhattiprolu (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> +	stack_size = (unsigned long)kca.child_stack_size;
> +	child_stack = (unsigned long)kca.child_stack_base + stack_size;
> +
> +	if (!child_stack)
> +		child_stack = regs->sp;

I'm hooking up the s390 version right now.  Do you think you should
make this

	if (!kca.child_stack_base)
		child_stack = regs->sp;

?

I suppose that in general if I pass in a NULL kca.child_stack_base
I'll also pass in a 0 stacksize, but as a user I'd expect that if
I pass in NULL, the size gets ignored.  Instead, if I pass in NULL
plus a size, then the kernel will take (void *)size as the stacktop.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
  2009-11-05  5:41 ` [v11][PATCH 8/9] Define " Sukadev Bhattiprolu
@ 2009-11-06 18:02   ` Serge E. Hallyn
       [not found]     ` <20091106180210.GA31652-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-07 20:18     ` Sukadev Bhattiprolu
       [not found]   ` <20091105054124.GH16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-09 20:37     ` Serge E. Hallyn
  2 siblings, 2 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-06 18:02 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Eric W. Biederman, hpa,
	Alexey Dobriyan, roland, Pavel Emelyanov

Quoting Sukadev Bhattiprolu (sukadev@us.ibm.com):
> +	stack_size = (unsigned long)kca.child_stack_size;
> +	child_stack = (unsigned long)kca.child_stack_base + stack_size;
> +
> +	if (!child_stack)
> +		child_stack = regs->sp;

I'm hooking up the s390 version right now.  Do you think you should
make this

	if (!kca.child_stack_base)
		child_stack = regs->sp;

?

I suppose that in general if I pass in a NULL kca.child_stack_base
I'll also pass in a 0 stacksize, but as a user I'd expect that if
I pass in NULL, the size gets ignored.  Instead, if I pass in NULL
plus a size, then the kernel will take (void *)size as the stacktop.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found]   ` <20091105054204.GI16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-06 18:39     ` Serge E. Hallyn
  0 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-06 18:39 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Sukadev Bhattiprolu (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
...
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the

In the sys_choosepid() discussion, Matt suggested it would be more
user-friendly to have the pid for the youngest pidns be pids[0].
That way the user doesn't have to know their pidns depth.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-05  5:42 ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
       [not found]   ` <20091105054204.GI16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-06 18:39   ` Serge E. Hallyn
  2009-11-06 20:18     ` Matt Helsley
       [not found]     ` <20091106183936.GA32531-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-06 18:39 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Eric W. Biederman, hpa,
	Alexey Dobriyan, roland, Pavel Emelyanov, Oren Laadan

Quoting Sukadev Bhattiprolu (sukadev@us.ibm.com):
...
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the

In the sys_choosepid() discussion, Matt suggested it would be more
user-friendly to have the pid for the youngest pidns be pids[0].
That way the user doesn't have to know their pidns depth.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found]     ` <20091106183936.GA32531-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-06 20:18       ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2009-11-06 20:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> Quoting Sukadev Bhattiprolu (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> ...
> > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > +	the specified pid in that namespace.  If that pid is already in use
> > +	by another process, the system call fails (see EBUSY below).
> > +
> > +	The order of pids in @pids is oldest in pids[0] to youngest pid
> > +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> 
> In the sys_choosepid() discussion, Matt suggested it would be more
> user-friendly to have the pid for the youngest pidns be pids[0].
> That way the user doesn't have to know their pidns depth.

As far as I could see, Suka's solution also does not require knowing
the pidns depth (aka level). He made it so that copy_from_user()
adjusts its destination using the discrepancy between the number of
pids passed and the number of levels.

If userspace passes an array with n pids and there are k namespace levels
then clone_with_pids() makes sure that the kernel sees a pid array like:

index	  0     ... k - (n + 1)        ...          k - 1
	+-----------------------+-------------------------+
pid_t	| 0 ..................0 | <copied from userspace> |
	+-----------------------+-------------------------+

So even though the order is different from choosepid() the calling
task still doesn't need to know its pidns level. Of course, just
like choosepid(), n <= k or userspace will get EINVAL.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-06 18:39   ` Serge E. Hallyn
@ 2009-11-06 20:18     ` Matt Helsley
  2009-11-06 21:45       ` Matt Helsley
       [not found]       ` <20091106201814.GA26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
       [not found]     ` <20091106183936.GA32531-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2009-11-06 20:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sukadev Bhattiprolu, arnd, Containers, linux-kernel,
	Eric W. Biederman, hpa, Alexey Dobriyan, roland, Pavel Emelyanov

On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> Quoting Sukadev Bhattiprolu (sukadev@us.ibm.com):
> ...
> > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > +	the specified pid in that namespace.  If that pid is already in use
> > +	by another process, the system call fails (see EBUSY below).
> > +
> > +	The order of pids in @pids is oldest in pids[0] to youngest pid
> > +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> 
> In the sys_choosepid() discussion, Matt suggested it would be more
> user-friendly to have the pid for the youngest pidns be pids[0].
> That way the user doesn't have to know their pidns depth.

As far as I could see, Suka's solution also does not require knowing
the pidns depth (aka level). He made it so that copy_from_user()
adjusts its destination using the discrepancy between the number of
pids passed and the number of levels.

If userspace passes an array with n pids and there are k namespace levels
then clone_with_pids() makes sure that the kernel sees a pid array like:

index	  0     ... k - (n + 1)        ...          k - 1
	+-----------------------+-------------------------+
pid_t	| 0 ..................0 | <copied from userspace> |
	+-----------------------+-------------------------+

So even though the order is different from choosepid() the calling
task still doesn't need to know its pidns level. Of course, just
like choosepid(), n <= k or userspace will get EINVAL.

Cheers,
	-Matt Helsley


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found]       ` <20091106201814.GA26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-11-06 21:45         ` Matt Helsley
  0 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2009-11-06 21:45 UTC (permalink / raw)
  To: Matt Helsley
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

On Fri, Nov 06, 2009 at 12:18:14PM -0800, Matt Helsley wrote:
> On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> > Quoting Sukadev Bhattiprolu (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > ...
> > > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > > +	the specified pid in that namespace.  If that pid is already in use
> > > +	by another process, the system call fails (see EBUSY below).
> > > +
> > > +	The order of pids in @pids is oldest in pids[0] to youngest pid
> > > +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> > 
> > In the sys_choosepid() discussion, Matt suggested it would be more
> > user-friendly to have the pid for the youngest pidns be pids[0].
> > That way the user doesn't have to know their pidns depth.
> 
> As far as I could see, Suka's solution also does not require knowing
> the pidns depth (aka level). He made it so that copy_from_user()
> adjusts its destination using the discrepancy between the number of
> pids passed and the number of levels.
> 
> If userspace passes an array with n pids and there are k namespace levels
> then clone_with_pids() makes sure that the kernel sees a pid array like:
> 
> index	  0     ... k - (n + 1)        ...          k - 1
> 	+-----------------------+-------------------------+
> pid_t	| 0 ..................0 | <copied from userspace> |
> 	+-----------------------+-------------------------+

(diagram assumes n != k. If n == k then pids[0] is the pid desired
in the initial namespace..)

> 
> So even though the order is different from choosepid() the calling
> task still doesn't need to know its pidns level. Of course, just
> like choosepid(), n <= k or userspace will get EINVAL.

Forgot to mention that I prefer the way choosepid orders the pids.
It's not inspired by the way that the kernel implements pid namespaces
and has more to do with the way userspace sees things (IMHO). I don't
know if it makes more sense to change clone_with_pids() or have
[e]glibc wrappers swap the array contents.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-06 20:18     ` Matt Helsley
@ 2009-11-06 21:45       ` Matt Helsley
  2009-11-07  2:26         ` Sukadev Bhattiprolu
       [not found]         ` <20091106214529.GB26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
       [not found]       ` <20091106201814.GA26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Matt Helsley @ 2009-11-06 21:45 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Serge E. Hallyn, arnd, Containers, linux-kernel,
	Eric W. Biederman, hpa, Alexey Dobriyan, roland, Pavel Emelyanov

On Fri, Nov 06, 2009 at 12:18:14PM -0800, Matt Helsley wrote:
> On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> > Quoting Sukadev Bhattiprolu (sukadev@us.ibm.com):
> > ...
> > > +	If a pid in the @pids list is non-zero, the kernel tries to assign
> > > +	the specified pid in that namespace.  If that pid is already in use
> > > +	by another process, the system call fails (see EBUSY below).
> > > +
> > > +	The order of pids in @pids is oldest in pids[0] to youngest pid
> > > +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> > 
> > In the sys_choosepid() discussion, Matt suggested it would be more
> > user-friendly to have the pid for the youngest pidns be pids[0].
> > That way the user doesn't have to know their pidns depth.
> 
> As far as I could see, Suka's solution also does not require knowing
> the pidns depth (aka level). He made it so that copy_from_user()
> adjusts its destination using the discrepancy between the number of
> pids passed and the number of levels.
> 
> If userspace passes an array with n pids and there are k namespace levels
> then clone_with_pids() makes sure that the kernel sees a pid array like:
> 
> index	  0     ... k - (n + 1)        ...          k - 1
> 	+-----------------------+-------------------------+
> pid_t	| 0 ..................0 | <copied from userspace> |
> 	+-----------------------+-------------------------+

(diagram assumes n != k. If n == k then pids[0] is the pid desired
in the initial namespace..)

> 
> So even though the order is different from choosepid() the calling
> task still doesn't need to know its pidns level. Of course, just
> like choosepid(), n <= k or userspace will get EINVAL.

Forgot to mention that I prefer the way choosepid orders the pids.
It's not inspired by the way that the kernel implements pid namespaces
and has more to do with the way userspace sees things (IMHO). I don't
know if it makes more sense to change clone_with_pids() or have
[e]glibc wrappers swap the array contents.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found]         ` <20091106214529.GB26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-11-07  2:26           ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-07  2:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Matt Helsley [matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| > If userspace passes an array with n pids and there are k namespace levels
| > then clone_with_pids() makes sure that the kernel sees a pid array like:
| > 
| > index	  0     ... k - (n + 1)        ...          k - 1
| > 	+-----------------------+-------------------------+
| > pid_t	| 0 ..................0 | <copied from userspace> |
| > 	+-----------------------+-------------------------+
| 
| (diagram assumes n != k. If n == k then pids[0] is the pid desired
| in the initial namespace..)

True.

Also I was not sure if we should prevent choosing pids in ancestor containers.
since a process is not even supposed to know of ancestor namespaces. Is there
a need for choosing pids in those namespaces.

| 
| > 
| > So even though the order is different from choosepid() the calling
| > task still doesn't need to know its pidns level. Of course, just
| > like choosepid(), n <= k or userspace will get EINVAL.
| 
| Forgot to mention that I prefer the way choosepid orders the pids.
| It's not inspired by the way that the kernel implements pid namespaces
| and has more to do with the way userspace sees things (IMHO).

Hmm, In general we C/R a descendant container. So the way userspace
sees it at that point is "what are the pids of this process in my current
and in any descendant namespaces". IOW, the pid of container from which
we checkpoint seems more interesting first - right ?  If so, the pids[]
are better ordered from older namespace to younger namespace ?

| I don't know if it makes more sense to change clone_with_pids() or have
| [e]glibc wrappers swap the array contents.
| 
| Cheers,
| 	-Matt Helsley
| _______________________________________________
| Containers mailing list
| Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-06 21:45       ` Matt Helsley
@ 2009-11-07  2:26         ` Sukadev Bhattiprolu
  2009-11-07 21:56           ` Oren Laadan
  2009-11-07 21:56           ` Oren Laadan
       [not found]         ` <20091106214529.GB26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-07  2:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: arnd, Containers, linux-kernel, Eric W. Biederman, hpa,
	Alexey Dobriyan, roland, Pavel Emelyanov

Matt Helsley [matthltc@us.ibm.com] wrote:
| > If userspace passes an array with n pids and there are k namespace levels
| > then clone_with_pids() makes sure that the kernel sees a pid array like:
| > 
| > index	  0     ... k - (n + 1)        ...          k - 1
| > 	+-----------------------+-------------------------+
| > pid_t	| 0 ..................0 | <copied from userspace> |
| > 	+-----------------------+-------------------------+
| 
| (diagram assumes n != k. If n == k then pids[0] is the pid desired
| in the initial namespace..)

True.

Also I was not sure if we should prevent choosing pids in ancestor containers.
since a process is not even supposed to know of ancestor namespaces. Is there
a need for choosing pids in those namespaces.

| 
| > 
| > So even though the order is different from choosepid() the calling
| > task still doesn't need to know its pidns level. Of course, just
| > like choosepid(), n <= k or userspace will get EINVAL.
| 
| Forgot to mention that I prefer the way choosepid orders the pids.
| It's not inspired by the way that the kernel implements pid namespaces
| and has more to do with the way userspace sees things (IMHO).

Hmm, In general we C/R a descendant container. So the way userspace
sees it at that point is "what are the pids of this process in my current
and in any descendant namespaces". IOW, the pid of container from which
we checkpoint seems more interesting first - right ?  If so, the pids[]
are better ordered from older namespace to younger namespace ?

| I don't know if it makes more sense to change clone_with_pids() or have
| [e]glibc wrappers swap the array contents.
| 
| Cheers,
| 	-Matt Helsley
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
       [not found]     ` <20091106180210.GA31652-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-07 20:18       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-07 20:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Sukadev Bhattiprolu (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
| > +	stack_size = (unsigned long)kca.child_stack_size;
| > +	child_stack = (unsigned long)kca.child_stack_base + stack_size;
| > +
| > +	if (!child_stack)
| > +		child_stack = regs->sp;
| 
| I'm hooking up the s390 version right now.  Do you think you should
| make this
| 
| 	if (!kca.child_stack_base)
| 		child_stack = regs->sp;
| 
| ?
| 
| I suppose that in general if I pass in a NULL kca.child_stack_base
| I'll also pass in a 0 stacksize, but as a user I'd expect that if
| I pass in NULL, the size gets ignored.  Instead, if I pass in NULL
| plus a size, then the kernel will take (void *)size as the stacktop.

Good point. Like copy_thread() on IA64, how about ignoring 'stack_size'
if base is NULL ?

        child_stack = 0UL;
        if (kca.child_stack_base)
                child_stack = (unsigned long)kca.child_stack_base + stack_size;

        if (!child_stack)
                child_stack = regs->sp;

The other question is whether we should force all architectures to pass in
the stack *base* ? clone(2) man page says:

	Stacks grow  downwards  on  all  processors  that  run  Linux (except
	the  HP  PA  processors), so child_stack usually points to the topmost
	address of the memory space set up for the child stack.

To be compatibile with clone() on most architectures, should we rename
'clone_args.child_stack_base' to 'clone_args.child_stack' and let
architectures use this field like they currently use the 'child_stack'
parameter to clone(2) ?

So x86 would pass in address of top-of-stack while HP-PA can pass in address
of base-of-stack.

Arnd, Roland, Peter please let me know if you have any inputs on this.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
  2009-11-06 18:02   ` Serge E. Hallyn
       [not found]     ` <20091106180210.GA31652-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-11-07 20:18     ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-07 20:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sukadev Bhattiprolu, linux-kernel, arnd, Containers,
	Eric W. Biederman, hpa, Alexey Dobriyan, roland, Pavel Emelyanov

Serge E. Hallyn [serue@us.ibm.com] wrote:
| Quoting Sukadev Bhattiprolu (sukadev@us.ibm.com):
| > +	stack_size = (unsigned long)kca.child_stack_size;
| > +	child_stack = (unsigned long)kca.child_stack_base + stack_size;
| > +
| > +	if (!child_stack)
| > +		child_stack = regs->sp;
| 
| I'm hooking up the s390 version right now.  Do you think you should
| make this
| 
| 	if (!kca.child_stack_base)
| 		child_stack = regs->sp;
| 
| ?
| 
| I suppose that in general if I pass in a NULL kca.child_stack_base
| I'll also pass in a 0 stacksize, but as a user I'd expect that if
| I pass in NULL, the size gets ignored.  Instead, if I pass in NULL
| plus a size, then the kernel will take (void *)size as the stacktop.

Good point. Like copy_thread() on IA64, how about ignoring 'stack_size'
if base is NULL ?

        child_stack = 0UL;
        if (kca.child_stack_base)
                child_stack = (unsigned long)kca.child_stack_base + stack_size;

        if (!child_stack)
                child_stack = regs->sp;

The other question is whether we should force all architectures to pass in
the stack *base* ? clone(2) man page says:

	Stacks grow  downwards  on  all  processors  that  run  Linux (except
	the  HP  PA  processors), so child_stack usually points to the topmost
	address of the memory space set up for the child stack.

To be compatibile with clone() on most architectures, should we rename
'clone_args.child_stack_base' to 'clone_args.child_stack' and let
architectures use this field like they currently use the 'child_stack'
parameter to clone(2) ?

So x86 would pass in address of top-of-stack while HP-PA can pass in address
of base-of-stack.

Arnd, Roland, Peter please let me know if you have any inputs on this.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-07  2:26         ` Sukadev Bhattiprolu
  2009-11-07 21:56           ` Oren Laadan
@ 2009-11-07 21:56           ` Oren Laadan
  1 sibling, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2009-11-07 21:56 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Pavel Emelyanov, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA



Sukadev Bhattiprolu wrote:
> Matt Helsley [matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
> | > If userspace passes an array with n pids and there are k namespace levels
> | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> | > 
> | > index	  0     ... k - (n + 1)        ...          k - 1
> | > 	+-----------------------+-------------------------+
> | > pid_t	| 0 ..................0 | <copied from userspace> |
> | > 	+-----------------------+-------------------------+
> | 
> | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> | in the initial namespace..)
> 
> True.
> 
> Also I was not sure if we should prevent choosing pids in ancestor containers.
> since a process is not even supposed to know of ancestor namespaces. Is there
> a need for choosing pids in those namespaces.

IMHO this is a bit confusing.

A process observes a single namespace - the one in which it "lives".
There is no such thing as descendant namespaces for that process.
There may be ancestor namespaces.

The clone occurs in the context of the process. So the process that
is forking _must_ indicate pids in _ancestor_ namespaces if it wishes
to select pids in those (as is the case in c/r).

> 
> | 
> | > 
> | > So even though the order is different from choosepid() the calling
> | > task still doesn't need to know its pidns level. Of course, just
> | > like choosepid(), n <= k or userspace will get EINVAL.
> | 
> | Forgot to mention that I prefer the way choosepid orders the pids.
> | It's not inspired by the way that the kernel implements pid namespaces
> | and has more to do with the way userspace sees things (IMHO).
> 
> Hmm, In general we C/R a descendant container. So the way userspace
> sees it at that point is "what are the pids of this process in my current
> and in any descendant namespaces". IOW, the pid of container from which
> we checkpoint seems more interesting first - right ?  If so, the pids[]
> are better ordered from older namespace to younger namespace ?

When we checkpoint, we use an external process to record the state of
(current or) descendant namespaces.

When we restart, we run in the context of the restarting process, so
we select a pid in the current and _ancestor_ namespaces.

So the order of pids as it (will) appear in the checkpoint image for
a given process will be from an ancestor down to descendant namespaces.
And this is how we (will) hand it over to eclone().

> 
> | I don't know if it makes more sense to change clone_with_pids() or have
> | [e]glibc wrappers swap the array contents.

I prefer to decide now on an order and stick to it in the kernel and
in glibc.

Oren

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-07  2:26         ` Sukadev Bhattiprolu
@ 2009-11-07 21:56           ` Oren Laadan
       [not found]             ` <4AF5ECFD.3000509-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-11-08 15:09             ` Serge E. Hallyn
  2009-11-07 21:56           ` Oren Laadan
  1 sibling, 2 replies; 39+ messages in thread
From: Oren Laadan @ 2009-11-07 21:56 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Matt Helsley, arnd, Containers, linux-kernel, Eric W. Biederman,
	hpa, Alexey Dobriyan, roland, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> Matt Helsley [matthltc@us.ibm.com] wrote:
> | > If userspace passes an array with n pids and there are k namespace levels
> | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> | > 
> | > index	  0     ... k - (n + 1)        ...          k - 1
> | > 	+-----------------------+-------------------------+
> | > pid_t	| 0 ..................0 | <copied from userspace> |
> | > 	+-----------------------+-------------------------+
> | 
> | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> | in the initial namespace..)
> 
> True.
> 
> Also I was not sure if we should prevent choosing pids in ancestor containers.
> since a process is not even supposed to know of ancestor namespaces. Is there
> a need for choosing pids in those namespaces.

IMHO this is a bit confusing.

A process observes a single namespace - the one in which it "lives".
There is no such thing as descendant namespaces for that process.
There may be ancestor namespaces.

The clone occurs in the context of the process. So the process that
is forking _must_ indicate pids in _ancestor_ namespaces if it wishes
to select pids in those (as is the case in c/r).

> 
> | 
> | > 
> | > So even though the order is different from choosepid() the calling
> | > task still doesn't need to know its pidns level. Of course, just
> | > like choosepid(), n <= k or userspace will get EINVAL.
> | 
> | Forgot to mention that I prefer the way choosepid orders the pids.
> | It's not inspired by the way that the kernel implements pid namespaces
> | and has more to do with the way userspace sees things (IMHO).
> 
> Hmm, In general we C/R a descendant container. So the way userspace
> sees it at that point is "what are the pids of this process in my current
> and in any descendant namespaces". IOW, the pid of container from which
> we checkpoint seems more interesting first - right ?  If so, the pids[]
> are better ordered from older namespace to younger namespace ?

When we checkpoint, we use an external process to record the state of
(current or) descendant namespaces.

When we restart, we run in the context of the restarting process, so
we select a pid in the current and _ancestor_ namespaces.

So the order of pids as it (will) appear in the checkpoint image for
a given process will be from an ancestor down to descendant namespaces.
And this is how we (will) hand it over to eclone().

> 
> | I don't know if it makes more sense to change clone_with_pids() or have
> | [e]glibc wrappers swap the array contents.

I prefer to decide now on an order and stick to it in the kernel and
in glibc.

Oren


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
       [not found]             ` <4AF5ECFD.3000509-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-11-08 15:09               ` Serge E. Hallyn
  0 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-08 15:09 UTC (permalink / raw)
  To: Oren Laadan
  Cc: arnd-r2nGTMty4D4, Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Sukadev Bhattiprolu, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov

Quoting Oren Laadan (orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org):
> Sukadev Bhattiprolu wrote:
> > Matt Helsley [matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
> > | > If userspace passes an array with n pids and there are k namespace levels
> > | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> > | > 
> > | > index	  0     ... k - (n + 1)        ...          k - 1
> > | > 	+-----------------------+-------------------------+
> > | > pid_t	| 0 ..................0 | <copied from userspace> |
> > | > 	+-----------------------+-------------------------+
> > | 
> > | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> > | in the initial namespace..)
> > 
> > True.
> > 
> > Also I was not sure if we should prevent choosing pids in ancestor containers.
> > since a process is not even supposed to know of ancestor namespaces. Is there
> > a need for choosing pids in those namespaces.

Yes, that is necessary.

> > | I don't know if it makes more sense to change clone_with_pids() or have
> > | [e]glibc wrappers swap the array contents.
> 
> I prefer to decide now on an order and stick to it in the kernel and
> in glibc.

Agreed!

I'd forgotten that, as Matt said, we can just specify pids to the depth
that we want, so I guess the current order is fine.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 9/9] Document clone_with_pids() syscall
  2009-11-07 21:56           ` Oren Laadan
       [not found]             ` <4AF5ECFD.3000509-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-11-08 15:09             ` Serge E. Hallyn
  1 sibling, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-08 15:09 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, arnd, Containers, linux-kernel,
	Eric W. Biederman, hpa, Pavel Emelyanov, Alexey Dobriyan, roland

Quoting Oren Laadan (orenl@librato.com):
> Sukadev Bhattiprolu wrote:
> > Matt Helsley [matthltc@us.ibm.com] wrote:
> > | > If userspace passes an array with n pids and there are k namespace levels
> > | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> > | > 
> > | > index	  0     ... k - (n + 1)        ...          k - 1
> > | > 	+-----------------------+-------------------------+
> > | > pid_t	| 0 ..................0 | <copied from userspace> |
> > | > 	+-----------------------+-------------------------+
> > | 
> > | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> > | in the initial namespace..)
> > 
> > True.
> > 
> > Also I was not sure if we should prevent choosing pids in ancestor containers.
> > since a process is not even supposed to know of ancestor namespaces. Is there
> > a need for choosing pids in those namespaces.

Yes, that is necessary.

> > | I don't know if it makes more sense to change clone_with_pids() or have
> > | [e]glibc wrappers swap the array contents.
> 
> I prefer to decide now on an order and stick to it in the kernel and
> in glibc.

Agreed!

I'd forgotten that, as Matt said, we can just specify pids to the depth
that we want, so I guess the current order is fine.

-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
       [not found]   ` <20091105054124.GH16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-06 18:02     ` Serge E. Hallyn
@ 2009-11-09 20:37     ` Serge E. Hallyn
  1 sibling, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-09 20:37 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Martin Schwidefsky, linux-s390-u79uwXL29TY76Z2rM5mHXA,
	Containers, Heiko Carstens, linux-kernel-u79uwXL29TY76Z2rM5mHXA

Here is a stab at the s390 syscall.

From f710be4f1296d50551210bcc9ff6ba25d288bc46 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Date: Fri, 6 Nov 2009 19:03:43 -0500
Subject: [PATCH 1/1] implement s390 clone_with_pids syscall

This does the s390 hook for v11 of clone-with-pids.

I've got a program using the syscall successfully passing
args to the child function and continuing to run - but I
haven't yet gotten that hooked into user-cr/restart.c
successfully.

The core of my user-space code to use it is:

int do_child(void *arg)
{
	int iarg = (int ) arg;

	printf("here i am, i was passed %d, my tid is %d\n", iarg, gettid());
	return 0;
}

 #define do_cwp(flags, pids, args, sz) \
( { \
	register unsigned long int __r2 asm ("2") = (unsigned long int)(flags); \
	register unsigned long int __r3 asm ("3") = (unsigned long int)(args); \
	register unsigned long int __r4 asm ("4") = (unsigned long int)(sz); \
	register unsigned long int __r5 asm ("5") = (unsigned long int)(pids); \
	register unsigned long int __result asm ("2"); \
	__asm__ __volatile__( \
		" lghi %%r1,%5\n" /* put __NR_cwp in r1 for svc 0 */ \
		" svc 0\n" /* do __NR_cwp syscall */ \
		" ltgr %%r2,%%r2\n" /* returned 0? */ \
		" jnz 1f\n" /* if not goto label 1 */ \
		" lg %%r3,0(%%r15)\n"   /* get fnarg off stack into arg 1 */ \
		" lg %%r2,8(%%r15)\n"   /* get fn off stack into r3 for basr*/ \
		" lgr %%r1,%%r15\n" /* tmp store old stack pointer */ \
		" aghi %%r15,-160\n" /* move the stack */ \
		" stg %%r1,0(%%r15)\n" /* and save old stack pointer */ \
		" basr %%r14,%%r3\n" /* call fn(arg) */ \
		" svc 1\n"  /* call exit */ \
		" 1:\n" \
		: "=d" (__result) \
		: "0" (__r2), "d" (__r3), "d" (__r4), "d" (__r5), \
		  "i" (__NR_clone_with_pids) \
		: "1", "cc", "memory"); \
	__result; \
} )

int clone_with_pids(int (*fn)(void *), int flags, int nrpids, int *pids,
		    void *fnarg)
{
	long retval;
	struct clone_args clone_args, *ca = &clone_args;
	int stacksize;
	void *sb;
	u64 *s;
	int i;

	memset(ca, 0, sizeof(struct clone_args));
	stacksize = 4*getpagesize();
	sb  = (void *) malloc(stacksize);
	if (!sb) {
		perror("malloc");
		_exit(1);
	}

	ca->child_stack_base = (u64) sb;
	ca->child_stack_size = stacksize-8;
	s = (u64 *)(sb + ca->child_stack_size);
	*--s = (u64)fnarg;
	*--s = (u64)fn;
	ca->child_stack_size -= 16;
	ca->nr_pids = nrpids;
	retval = do_cwp(flags, pids, ca, sizeof(struct clone_args));

	return retval;
}

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/unistd.h  |    3 +-
 arch/s390/kernel/compat_linux.c |   50 ++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/process.c      |   51 +++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 4 files changed, 104 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..ae9474e 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
-#define NR_syscalls 332
+#define __NR_clone_with_pids	332
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 0debcec..1750fae 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -762,6 +762,56 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
 	return sys_write(fd, buf, count);
 }
 
+asmlinkage long sys32_clone_with_pids(void)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2] & 0xffffffffUL;
+	uca = compat_ptr(regs->gprs[3]);
+	args_size = regs->gprs[4] & 0xffffffffUL;
+	pids = compat_ptr(regs->gprs[5]);
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+	if (!child_stack)
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5417eb5..e27a1b4 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -241,6 +241,57 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE0(clone_with_pids)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2];
+	uca = (struct clone_args __user *)regs->gprs[3];
+	args_size = regs->gprs[4];
+	pids = (pid_t __user *)regs->gprs[5];
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base;
+	if (child_stack)
+		child_stack += stack_size;
+	else
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 30eca07..c6dc240 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
-- 
1.6.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
  2009-11-05  5:41 ` [v11][PATCH 8/9] Define " Sukadev Bhattiprolu
@ 2009-11-09 20:37     ` Serge E. Hallyn
       [not found]   ` <20091105054124.GH16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-11-09 20:37     ` Serge E. Hallyn
  2 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-09 20:37 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Containers, Martin Schwidefsky, Heiko Carstens, linux-s390

Here is a stab at the s390 syscall.

>From f710be4f1296d50551210bcc9ff6ba25d288bc46 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue@us.ibm.com>
Date: Fri, 6 Nov 2009 19:03:43 -0500
Subject: [PATCH 1/1] implement s390 clone_with_pids syscall

This does the s390 hook for v11 of clone-with-pids.

I've got a program using the syscall successfully passing
args to the child function and continuing to run - but I
haven't yet gotten that hooked into user-cr/restart.c
successfully.

The core of my user-space code to use it is:

int do_child(void *arg)
{
	int iarg = (int ) arg;

	printf("here i am, i was passed %d, my tid is %d\n", iarg, gettid());
	return 0;
}

 #define do_cwp(flags, pids, args, sz) \
( { \
	register unsigned long int __r2 asm ("2") = (unsigned long int)(flags); \
	register unsigned long int __r3 asm ("3") = (unsigned long int)(args); \
	register unsigned long int __r4 asm ("4") = (unsigned long int)(sz); \
	register unsigned long int __r5 asm ("5") = (unsigned long int)(pids); \
	register unsigned long int __result asm ("2"); \
	__asm__ __volatile__( \
		" lghi %%r1,%5\n" /* put __NR_cwp in r1 for svc 0 */ \
		" svc 0\n" /* do __NR_cwp syscall */ \
		" ltgr %%r2,%%r2\n" /* returned 0? */ \
		" jnz 1f\n" /* if not goto label 1 */ \
		" lg %%r3,0(%%r15)\n"   /* get fnarg off stack into arg 1 */ \
		" lg %%r2,8(%%r15)\n"   /* get fn off stack into r3 for basr*/ \
		" lgr %%r1,%%r15\n" /* tmp store old stack pointer */ \
		" aghi %%r15,-160\n" /* move the stack */ \
		" stg %%r1,0(%%r15)\n" /* and save old stack pointer */ \
		" basr %%r14,%%r3\n" /* call fn(arg) */ \
		" svc 1\n"  /* call exit */ \
		" 1:\n" \
		: "=d" (__result) \
		: "0" (__r2), "d" (__r3), "d" (__r4), "d" (__r5), \
		  "i" (__NR_clone_with_pids) \
		: "1", "cc", "memory"); \
	__result; \
} )

int clone_with_pids(int (*fn)(void *), int flags, int nrpids, int *pids,
		    void *fnarg)
{
	long retval;
	struct clone_args clone_args, *ca = &clone_args;
	int stacksize;
	void *sb;
	u64 *s;
	int i;

	memset(ca, 0, sizeof(struct clone_args));
	stacksize = 4*getpagesize();
	sb  = (void *) malloc(stacksize);
	if (!sb) {
		perror("malloc");
		_exit(1);
	}

	ca->child_stack_base = (u64) sb;
	ca->child_stack_size = stacksize-8;
	s = (u64 *)(sb + ca->child_stack_size);
	*--s = (u64)fnarg;
	*--s = (u64)fn;
	ca->child_stack_size -= 16;
	ca->nr_pids = nrpids;
	retval = do_cwp(flags, pids, ca, sizeof(struct clone_args));

	return retval;
}

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/unistd.h  |    3 +-
 arch/s390/kernel/compat_linux.c |   50 ++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/process.c      |   51 +++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 4 files changed, 104 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..ae9474e 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
-#define NR_syscalls 332
+#define __NR_clone_with_pids	332
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 0debcec..1750fae 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -762,6 +762,56 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
 	return sys_write(fd, buf, count);
 }
 
+asmlinkage long sys32_clone_with_pids(void)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2] & 0xffffffffUL;
+	uca = compat_ptr(regs->gprs[3]);
+	args_size = regs->gprs[4] & 0xffffffffUL;
+	pids = compat_ptr(regs->gprs[5]);
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+	if (!child_stack)
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5417eb5..e27a1b4 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -241,6 +241,57 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE0(clone_with_pids)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2];
+	uca = (struct clone_args __user *)regs->gprs[3];
+	args_size = regs->gprs[4];
+	pids = (pid_t __user *)regs->gprs[5];
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base;
+	if (child_stack)
+		child_stack += stack_size;
+	else
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 30eca07..c6dc240 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [v11][PATCH 8/9] Define clone_with_pids() syscall
@ 2009-11-09 20:37     ` Serge E. Hallyn
  0 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2009-11-09 20:37 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Containers, Martin Schwidefsky, Heiko Carstens, linux-s390

Here is a stab at the s390 syscall.

From f710be4f1296d50551210bcc9ff6ba25d288bc46 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue@us.ibm.com>
Date: Fri, 6 Nov 2009 19:03:43 -0500
Subject: [PATCH 1/1] implement s390 clone_with_pids syscall

This does the s390 hook for v11 of clone-with-pids.

I've got a program using the syscall successfully passing
args to the child function and continuing to run - but I
haven't yet gotten that hooked into user-cr/restart.c
successfully.

The core of my user-space code to use it is:

int do_child(void *arg)
{
	int iarg = (int ) arg;

	printf("here i am, i was passed %d, my tid is %d\n", iarg, gettid());
	return 0;
}

 #define do_cwp(flags, pids, args, sz) \
( { \
	register unsigned long int __r2 asm ("2") = (unsigned long int)(flags); \
	register unsigned long int __r3 asm ("3") = (unsigned long int)(args); \
	register unsigned long int __r4 asm ("4") = (unsigned long int)(sz); \
	register unsigned long int __r5 asm ("5") = (unsigned long int)(pids); \
	register unsigned long int __result asm ("2"); \
	__asm__ __volatile__( \
		" lghi %%r1,%5\n" /* put __NR_cwp in r1 for svc 0 */ \
		" svc 0\n" /* do __NR_cwp syscall */ \
		" ltgr %%r2,%%r2\n" /* returned 0? */ \
		" jnz 1f\n" /* if not goto label 1 */ \
		" lg %%r3,0(%%r15)\n"   /* get fnarg off stack into arg 1 */ \
		" lg %%r2,8(%%r15)\n"   /* get fn off stack into r3 for basr*/ \
		" lgr %%r1,%%r15\n" /* tmp store old stack pointer */ \
		" aghi %%r15,-160\n" /* move the stack */ \
		" stg %%r1,0(%%r15)\n" /* and save old stack pointer */ \
		" basr %%r14,%%r3\n" /* call fn(arg) */ \
		" svc 1\n"  /* call exit */ \
		" 1:\n" \
		: "=d" (__result) \
		: "0" (__r2), "d" (__r3), "d" (__r4), "d" (__r5), \
		  "i" (__NR_clone_with_pids) \
		: "1", "cc", "memory"); \
	__result; \
} )

int clone_with_pids(int (*fn)(void *), int flags, int nrpids, int *pids,
		    void *fnarg)
{
	long retval;
	struct clone_args clone_args, *ca = &clone_args;
	int stacksize;
	void *sb;
	u64 *s;
	int i;

	memset(ca, 0, sizeof(struct clone_args));
	stacksize = 4*getpagesize();
	sb  = (void *) malloc(stacksize);
	if (!sb) {
		perror("malloc");
		_exit(1);
	}

	ca->child_stack_base = (u64) sb;
	ca->child_stack_size = stacksize-8;
	s = (u64 *)(sb + ca->child_stack_size);
	*--s = (u64)fnarg;
	*--s = (u64)fn;
	ca->child_stack_size -= 16;
	ca->nr_pids = nrpids;
	retval = do_cwp(flags, pids, ca, sizeof(struct clone_args));

	return retval;
}

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/unistd.h  |    3 +-
 arch/s390/kernel/compat_linux.c |   50 ++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/process.c      |   51 +++++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 4 files changed, 104 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..ae9474e 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
-#define NR_syscalls 332
+#define __NR_clone_with_pids	332
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 0debcec..1750fae 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -762,6 +762,56 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
 	return sys_write(fd, buf, count);
 }
 
+asmlinkage long sys32_clone_with_pids(void)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2] & 0xffffffffUL;
+	uca = compat_ptr(regs->gprs[3]);
+	args_size = regs->gprs[4] & 0xffffffffUL;
+	pids = compat_ptr(regs->gprs[5]);
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+	if (!child_stack)
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5417eb5..e27a1b4 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -241,6 +241,57 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE0(clone_with_pids)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	int args_size;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	unsigned int flags_low;
+	struct clone_args __user *uca;
+	pid_t __user *pids;
+
+	flags_low = regs->gprs[2];
+	uca = (struct clone_args __user *)regs->gprs[3];
+	args_size = regs->gprs[4];
+	pids = (pid_t __user *)regs->gprs[5];
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kca.parent_tid_ptr;
+	child_tid_ptr =  (int *)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	child_stack = (unsigned long)kca.child_stack_base;
+	if (child_stack)
+		child_stack += stack_size;
+	else
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 30eca07..c6dc240 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
-- 
1.6.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [v11][PATCH 0/9] Implement clone_with_pids() system call
@ 2009-11-05  5:30 Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-05  5:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Alexey Dobriyan,
	roland-H+wXaHxf7aLQT0dZR+AlfA, Pavel Emelyanov


From: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: [v11][PATCH 0/9] Implement clone_with_pids() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone_with_pids() that lets a
process specify the pids of the child process.

Patches 1 through 6 are helper patches needed for choosing a pid for the
child process.

PATCH 8 defines a prototype of the new system call. PATCH 9 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'

Changelog[v10]:
	- [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
	  than the generic system call
	- Rename clone3() to clone_with_pids()
	- Update Documentation/clone_with_pids() to show example usage with
	  the PTREGSCALL implementation.

Changelog[v9]:
	- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
	  of struct pid_namespace
	- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
	  avoid inadvertent truncation clone_flags, preserve the first
	  parameter of clone3() as 'u32 clone_flags' and specify newer
	  flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
	- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
	  remove duplication (see PATCH 3/9].
	  
Changelog[v8]:
	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
	  The name 'clone2()' is in use - renamed new syscall to clone3().
	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
	  (Added [PATCH 7/10] to the patchset).

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Group the arguments to clone2() into a 'struct clone_arg' to
	  workaround the issue of exceeding 6 arguments to the system call.
	  Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT NOTE:

clone() system call has another limitation - all but one bits in clone-flags
are in use and if more new clone-flags are needed, we will need a variant of
the clone() system call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
	  bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset). Based on input from
Arnd Bergmann and others, the new interface of the system call is: 

	struct clone_args {
		u64 clone_flags_high;
		u64 child_stack_base;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u32 nr_pids;
		u32 reserved0;
		u64 reserved1;
	};

	sys_clone_with_pids(u32 flags_low, struct clone_args *cargs, 
			int args_size, pid_t *pids)

Details of the struct clone_args and the usage are explained in the
documentation (PATCH 9/9).

NOTE:
	While this patchset enables support for more clone-flags, actual
	implementation for additional clone-flags is best implemented as
	a separate patchset (PATCH 8/9 identifies some TODOs)

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-11-09 20:37 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu
2009-11-05  5:36 ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-11-05  5:37 ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-11-05  5:38 ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
2009-11-05  5:38 ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-11-05  5:39 ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-11-05  5:40 ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
2009-11-05  5:40 ` [v11][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found] ` <20091105053053.GA11289-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-11-05  5:36   ` [v11][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-11-05  5:37   ` [v11][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-11-05  5:38   ` [v11][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
2009-11-05  5:38   ` [v11][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-11-05  5:39   ` [v11][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-11-05  5:40   ` [v11][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
2009-11-05  5:40   ` [v11][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
2009-11-05  5:41   ` [v11][PATCH 8/9] Define clone_with_pids() syscall Sukadev Bhattiprolu
2009-11-05  5:42   ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
2009-11-05  5:41 ` [v11][PATCH 8/9] Define " Sukadev Bhattiprolu
2009-11-06 18:02   ` Serge E. Hallyn
     [not found]     ` <20091106180210.GA31652-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-11-07 20:18       ` Sukadev Bhattiprolu
2009-11-07 20:18     ` Sukadev Bhattiprolu
     [not found]   ` <20091105054124.GH16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-11-06 18:02     ` Serge E. Hallyn
2009-11-09 20:37     ` Serge E. Hallyn
2009-11-09 20:37   ` Serge E. Hallyn
2009-11-09 20:37     ` Serge E. Hallyn
2009-11-05  5:42 ` [v11][PATCH 9/9] Document " Sukadev Bhattiprolu
     [not found]   ` <20091105054204.GI16142-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-11-06 18:39     ` Serge E. Hallyn
2009-11-06 18:39   ` Serge E. Hallyn
2009-11-06 20:18     ` Matt Helsley
2009-11-06 21:45       ` Matt Helsley
2009-11-07  2:26         ` Sukadev Bhattiprolu
2009-11-07 21:56           ` Oren Laadan
     [not found]             ` <4AF5ECFD.3000509-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-11-08 15:09               ` Serge E. Hallyn
2009-11-08 15:09             ` Serge E. Hallyn
2009-11-07 21:56           ` Oren Laadan
     [not found]         ` <20091106214529.GB26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-11-07  2:26           ` Sukadev Bhattiprolu
     [not found]       ` <20091106201814.GA26614-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-11-06 21:45         ` Matt Helsley
     [not found]     ` <20091106183936.GA32531-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-11-06 20:18       ` Matt Helsley
  -- strict thread matches above, loose matches on Subject: below --
2009-11-05  5:30 [v11][PATCH 0/9] Implement clone_with_pids() system call Sukadev Bhattiprolu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.