All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][v6][PATCH 0/9] clone_with_pids() syscall
@ 2009-09-10  6:06 Sukadev Bhattiprolu
  2009-09-10  6:08 ` [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
                   ` (10 more replies)
  0 siblings, 11 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev


=== NEW CLONE() SYSTEM CALL:

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone_with_pids() that lets a
process specify the pids of the child process.

Patches 1 through 6 are helpers and we believe they are needed for application
restart, regardless of the kernel implementation of application restart.

Patch 8 defines a prototype of the new system call. Patch 9 adds some
documentation on the new system call.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- [Nathan Lynch] Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- [Pavel Machek] New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT TODO:

clone() system call has another limitation - all available bits in clone-flags
are in use and any new clone-flag will need a variant of the clone() system
call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The basic requirements of a new clone system call
could then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().

	- does copy_from_user() of a few words have a significant impact on
	  the total cost of clone() ?

Based on these requirements and constraints, we have been exploring a couple
of system call interfaces and appreciate any iput.  

1. =====

	#if 64bit
	#define CLONE_FLAGS_WORDS	1
	#else
	#define CLONE_FLAGS_WORDS	2
	#endif

        struct pid_set {
                int num_pids;
                pid_t pids[];
        };

	typedef struct {
		unsigned long flags[CLONE_FLAGS_WORDS];
	} clone_flags_t;

	int clone_extended(clone_flags_t *flags, void *child_stack, int *unused,
		int *parent_tid, int *child_tid, struct pid_set *pid_set);

	Pros:
		- extendible clone_flags (like sigset_t)

	Cons:
		- copy_from_user() needed on all architectures (we maybe able
		  to play some tricks with 'clone_flags_t' to avoid the copy
		  on 64-bit archtitectures till N_CLONE_FLAGS exceeds 64).

		- Both applications and kernel must use interfaces equivalent
		  to sigsetops(3) to test/set/clear clone flags.
2. ======

	struct clone_info {
		int num_clone_high_words;
		int *flags_high;
		struct pid_set pid_set;
	}

        int clone_extended(int flags_low, void *child_stack, void *unused,
		int *parent_tid, int *child_tid, struct clone_info *clone_info);

	Pros:
		- copy_from_user() needed only for new flags and pid_set

	Cons:
		- splitting the high and low clone-flags is awkward ?


Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
@ 2009-09-10  6:08 ` Sukadev Bhattiprolu
  2009-09-10  6:09 ` [RFC][v6][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev



Subject: [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 2/9]: Have alloc_pidmap() return actual error code
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
  2009-09-10  6:08 ` [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
@ 2009-09-10  6:09 ` Sukadev Bhattiprolu
  2009-09-10  6:09 ` [RFC][v6][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev



Subject: [RFC][v6][PATCH 2/9]: Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:06:46.000000000 -0700
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidm
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_names
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespa
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 3/9] Make pid_max a pid_ns property
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
  2009-09-10  6:08 ` [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
  2009-09-10  6:09 ` [RFC][v6][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
@ 2009-09-10  6:09 ` Sukadev Bhattiprolu
  2009-09-10  6:09 ` [RFC][v6][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev



From: Serge Hallyn <serue@us.ibm.com>
Subject: [RFC][v6][PATCH 3/9] Make pid_max a pid_ns property

Remove the pid_max global, and make it a property of the
pid_namespace.  When a pid_ns is created, it inherits
the parent's pid_ns.

Fixing up sysctl (trivial akin to ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.

Changelog[v2]:
	- Port to newer kernel
	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
---
 include/linux/pid_namespace.h |    1 +
 kernel/pid.c                  |    4 ++--
 kernel/pid_namespace.c        |    1 +
 kernel/sysctl.c               |    4 ++--
 4 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/pid_namespace.h	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/include/linux/pid_namespace.h	2009-09-09 19:07:20.000000000 -0700
@@ -30,6 +30,7 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	int pid_max;
 };
 
 extern struct pid_namespace init_pid_ns;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
 static int pidhash_shift;
 struct pid init_struct_pid = INIT_STRUCT_PID;
 
-int pid_max = PID_MAX_DEFAULT;
-
 #define RESERVED_PIDS		300
 
 int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
 	.last_pid = 0,
 	.level = 0,
 	.child_reaper = &init_task,
+	.pid_max = PID_MAX_DEFAULT,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
@@ -151,6 +150,7 @@ static int alloc_pidmap(struct pid_names
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
+	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
 	pid = last + 1;
Index: linux-2.6/kernel/pid_namespace.c
===================================================================
--- linux-2.6.orig/kernel/pid_namespace.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid_namespace.c	2009-09-09 19:07:20.000000000 -0700
@@ -87,6 +87,7 @@ static struct pid_namespace *create_pid_
 
 	kref_init(&ns->kref);
 	ns->level = level;
+	ns->pid_max = parent_pid_ns->pid_max;
 	ns->parent = get_pid_ns(parent_pid_ns);
 
 	set_bit(0, ns->pidmap[0].page);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2009-09-09 19:07:20.000000000 -0700
@@ -55,6 +55,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
+#include <linux/pid_namespace.h>
 
 #ifdef CONFIG_X86
 #include <asm/nmi.h>
@@ -78,7 +79,6 @@ extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
 extern char core_pattern[];
-extern int pid_max;
 extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
@@ -670,7 +670,7 @@ static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PIDMAX,
 		.procname	= "pid_max",
-		.data		= &pid_max,
+		.data		= &init_pid_ns.pid_max,
 		.maxlen		= sizeof (int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec_minmax,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (2 preceding siblings ...)
  2009-09-10  6:09 ` [RFC][v6][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
@ 2009-09-10  6:09 ` Sukadev Bhattiprolu
  2009-09-10  6:10 ` [RFC][v6][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev



Subject: [RFC][v6][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
	  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 kernel/pid.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:28.000000000 -0700
@@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (target_pid) {
+		pid = target_pid;
+		if (pid < 0 || pid >= pid_max)
+			return -EINVAL;
+	} else {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
@@ -168,9 +174,14 @@ static int alloc_pidmap(struct pid_names
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					if (!target_pid)
+						pid_ns->last_pid = pid;
 					return pid;
+				} else if (target_pid) {
+					rc = -EBUSY;
+					break;
 				}
+
 				offset = find_next_offset(map, offset);
 				pid = mk_pid(pid_ns, map, offset);
 			/*
@@ -196,6 +207,7 @@ static int alloc_pidmap(struct pid_names
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
+
 	return rc;
 }
 
@@ -272,7 +284,7 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 5/9]: Add target_pids parameter to alloc_pid()
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (3 preceding siblings ...)
  2009-09-10  6:09 ` [RFC][v6][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
@ 2009-09-10  6:10 ` Sukadev Bhattiprolu
  2009-09-10  6:11 ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	container, sukadev



Subject: [RFC][v6][PATCH 5/9]: Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid.h
===================================================================
--- linux-2.6.orig/include/linux/pid.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/include/linux/pid.h	2009-09-09 19:07:36.000000000 -0700
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:07:36.000000000 -0700
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:07:28.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:36.000000000 -0700
@@ -268,13 +268,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -284,7 +285,11 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process()
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  6:11   ` Sukadev Bhattiprolu
  2009-09-10  6:12   ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:11 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov



Subject: [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:36.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:07:40.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process()
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (4 preceding siblings ...)
  2009-09-10  6:10 ` [RFC][v6][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
@ 2009-09-10  6:11 ` Sukadev Bhattiprolu
  2009-09-10  6:12 ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	Containers, sukadev



Subject: [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:36.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:07:40.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-09-10  6:11   ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-09-10  6:12   ` Sukadev Bhattiprolu
  2009-09-10  6:13   ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
  2009-09-10  6:14   ` [RFC][v6][PATCH 9/9]: Document " Sukadev Bhattiprolu
  3 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:12 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov


Subject: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |    3 +++
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-09-09 19:07:45.000000000 -0700
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				struct pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-09-09 19:07:45.000000000 -0700
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct pid_set {
+	unsigned int num_pids;
+	pid_t pids[];
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:40.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:07:45.000000000 -0700
@@ -1332,12 +1332,13 @@ struct task_struct * __cpuinit fork_idle
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1441,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (5 preceding siblings ...)
  2009-09-10  6:11 ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-09-10  6:12 ` Sukadev Bhattiprolu
       [not found]   ` <20090910061227.GF25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-09-10  7:05   ` Arnd Bergmann
  2009-09-10  6:13 ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	Containers, sukadev


Subject: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-09-09 19:07:45.000000000 -0700
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				struct pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-09-09 19:07:45.000000000 -0700
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct pid_set {
+	unsigned int num_pids;
+	pid_t pids[];
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:40.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:07:45.000000000 -0700
@@ -1332,12 +1332,13 @@ struct task_struct * __cpuinit fork_idle
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1441,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-09-10  6:11   ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
  2009-09-10  6:12   ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
@ 2009-09-10  6:13   ` Sukadev Bhattiprolu
  2009-09-10  6:14   ` [RFC][v6][PATCH 9/9]: Document " Sukadev Bhattiprolu
  3 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:13 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov



Subject: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'pid_set' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

	struct pid_set pid_set = { 3, {0, 97, 99} };
	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign next available pid to the process in init_pid_ns. But kernel will assign
pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
77 or 99 are taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct pid_set'.


TODO:
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h    |    2 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/entry_32.S         |    1 
 arch/x86/kernel/process_32.c       |   21 +++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 kernel/fork.c                      |  107 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 132 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-09-09 19:07:53.000000000 -0700
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-09-09 19:07:53.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/entry_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/entry_32.S	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/entry_32.S	2009-09-09 19:07:53.000000000 -0700
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-09-09 21:30:45.000000000 -0700
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	struct pid_set __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (struct pid_set __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-09-09 19:07:53.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long ptregs_clone_with_pids
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:45.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 21:30:45.000000000 -0700
@@ -1327,6 +1327,96 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(struct pid_set __user *upid_setp)
+{
+	int j;
+	int rc;
+	int size;
+	int unum_pids;		/* # of pids specified by user */
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!upid_setp)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+
+	/* First find the number of pids user is specifying */
+	rc = copy_from_user(&unum_pids, &upid_setp->num_pids, sizeof(int));
+	if (rc)
+		return ERR_PTR(-EFAULT);
+
+	if (!unum_pids)
+		return NULL;
+
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], &upid_setp->pids[0], size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1343,7 +1433,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1377,6 +1467,17 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(pid_setp);
+
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1438,6 +1539,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (6 preceding siblings ...)
  2009-09-10  6:12 ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
@ 2009-09-10  6:13 ` Sukadev Bhattiprolu
  2009-09-10  7:31   ` Arnd Bergmann
       [not found]   ` <20090910061301.GG25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	Containers, sukadev



Subject: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'pid_set' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

	struct pid_set pid_set = { 3, {0, 97, 99} };
	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign next available pid to the process in init_pid_ns. But kernel will assign
pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
77 or 99 are taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct pid_set'.


TODO:
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    2 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/entry_32.S         |    1 
 arch/x86/kernel/process_32.c       |   21 +++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 kernel/fork.c                      |  107 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 132 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-09-09 19:07:53.000000000 -0700
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-09-09 19:07:53.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/entry_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/entry_32.S	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/entry_32.S	2009-09-09 19:07:53.000000000 -0700
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-09-09 21:30:45.000000000 -0700
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	struct pid_set __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (struct pid_set __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-09-09 19:06:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-09-09 19:07:53.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long ptregs_clone_with_pids
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:07:45.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 21:30:45.000000000 -0700
@@ -1327,6 +1327,96 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(struct pid_set __user *upid_setp)
+{
+	int j;
+	int rc;
+	int size;
+	int unum_pids;		/* # of pids specified by user */
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!upid_setp)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+
+	/* First find the number of pids user is specifying */
+	rc = copy_from_user(&unum_pids, &upid_setp->num_pids, sizeof(int));
+	if (rc)
+		return ERR_PTR(-EFAULT);
+
+	if (!unum_pids)
+		return NULL;
+
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], &upid_setp->pids[0], size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1343,7 +1433,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1377,6 +1467,17 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(pid_setp);
+
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1438,6 +1539,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-09-10  6:13   ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
@ 2009-09-10  6:14   ` Sukadev Bhattiprolu
  3 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:14 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov


Subject: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall

This gives a brief overview of the clone_with_pids() system call.  We should
eventually describe more details either in clone(2) or in a new man page.

Signed-off-by: Sukadev Bhattiprolu <sukadev-8jLBTbqmX/OZamtmwQBW5tBPR1lH4CV8@public.gmane.org>
---
 Documentation/clone-with-pids |   58 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

Index: linux-2.6/Documentation/clone-with-pids
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone-with-pids	2009-09-09 21:53:30.000000000 -0700
@@ -0,0 +1,58 @@
+
+struct pid_set {
+	unsigned int num_pids;
+	pid_t pids[];
+};
+
+clone_with_pids(int flags, void *child_stack_base, int *parent_tid_ptr,
+			int *child_tid_ptr, NULL, struct pid_set *pid_setp)
+
+	The clone_with_pids() system call is identical to clone(), except
+	that it allows the user to specify a pid for the child process
+	in each of the child processes' pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from an earlier checkpoint. When restarting the application, the
+	processes in the application must get the same pids they had at the
+	time of the checkpoint.
+
+	The 'pid_setp' parameter defines a set of pids to use, one for each
+	pid-namespace of the child process.  The order pids in '->pids[]'
+	corresponds to the nesting order of pid-namespaces, with ->pids[0]
+	corresponding to the init_pid_ns.
+
+	If a pid in the ->pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the ->pids[] list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails with -EBUSY.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+	following values (the child process is not created).
+
+	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
+		this call.
+
+	EINVAL	The number of pids specified in 'pid_set.num_pids' exceeds
+		the current nesting level of parent process
+
+	EBUSY	A requested 'pid' is in use by another process in that name
+		space.
+
+Example:
+
+	struct pid_set pid_set { 3, {0, 99, 177} };
+	void *child_stack = malloc(STACKSIZE);
+
+	/* set up child_stack, like with clone() */
+	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
+
+	if (rc < 0) {
+		perror("clone_with_pids()");
+		exit(1);
+	}
+

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (8 preceding siblings ...)
       [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  6:14 ` Sukadev Bhattiprolu
  2009-09-10 15:26   ` Randy Dunlap
       [not found]   ` <20090910061413.GH25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-09-11 11:22 ` [RFC][v6][PATCH 0/9] " Peter Zijlstra
  10 siblings, 2 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10  6:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	Containers, sukadev


Subject: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall

This gives a brief overview of the clone_with_pids() system call.  We should
eventually describe more details either in clone(2) or in a new man page.

Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
---
 Documentation/clone-with-pids |   58 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

Index: linux-2.6/Documentation/clone-with-pids
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone-with-pids	2009-09-09 21:53:30.000000000 -0700
@@ -0,0 +1,58 @@
+
+struct pid_set {
+	unsigned int num_pids;
+	pid_t pids[];
+};
+
+clone_with_pids(int flags, void *child_stack_base, int *parent_tid_ptr,
+			int *child_tid_ptr, NULL, struct pid_set *pid_setp)
+
+	The clone_with_pids() system call is identical to clone(), except
+	that it allows the user to specify a pid for the child process
+	in each of the child processes' pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from an earlier checkpoint. When restarting the application, the
+	processes in the application must get the same pids they had at the
+	time of the checkpoint.
+
+	The 'pid_setp' parameter defines a set of pids to use, one for each
+	pid-namespace of the child process.  The order pids in '->pids[]'
+	corresponds to the nesting order of pid-namespaces, with ->pids[0]
+	corresponding to the init_pid_ns.
+
+	If a pid in the ->pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the ->pids[] list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails with -EBUSY.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+	following values (the child process is not created).
+
+	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
+		this call.
+
+	EINVAL	The number of pids specified in 'pid_set.num_pids' exceeds
+		the current nesting level of parent process
+
+	EBUSY	A requested 'pid' is in use by another process in that name
+		space.
+
+Example:
+
+	struct pid_set pid_set { 3, {0, 99, 177} };
+	void *child_stack = malloc(STACKSIZE);
+
+	/* set up child_stack, like with clone() */
+	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
+
+	if (rc < 0) {
+		perror("clone_with_pids()");
+		exit(1);
+	}
+

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
       [not found]   ` <20090910061227.GF25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  7:05     ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-10  7:05 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> +long do_fork(unsigned long clone_flags,
> +             unsigned long stack_start,
> +             struct pt_regs *regs,
> +             unsigned long stack_size,
> +             int __user *parent_tidptr,
> +             int __user *child_tidptr)
> +{
> +       return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
> +                       parent_tidptr, child_tidptr, NULL);
> +}
> +

How about making this one a static inline in linux/sched.h?

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
  2009-09-10  6:12 ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
       [not found]   ` <20090910061227.GF25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  7:05   ` Arnd Bergmann
       [not found]     ` <200909100905.35817.arnd-r2nGTMty4D4@public.gmane.org>
  2009-09-10 21:29     ` Sukadev Bhattiprolu
  1 sibling, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-10  7:05 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, Containers, sukadev

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> +long do_fork(unsigned long clone_flags,
> +             unsigned long stack_start,
> +             struct pt_regs *regs,
> +             unsigned long stack_size,
> +             int __user *parent_tidptr,
> +             int __user *child_tidptr)
> +{
> +       return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
> +                       parent_tidptr, child_tidptr, NULL);
> +}
> +

How about making this one a static inline in linux/sched.h?

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
       [not found]   ` <20090910061301.GG25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10  7:31     ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-10  7:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> 
> +int sys_clone_with_pids(struct pt_regs *regs)
> +{
> +       unsigned long clone_flags;
> +       unsigned long newsp;
> +       int __user *parent_tidptr;
> +       int __user *child_tidptr;
> +       struct pid_set __user *upid_setp;
> +
> +       clone_flags = regs->bx;
> +       newsp = regs->cx;
> +       parent_tidptr = (int __user *)regs->dx;
> +       child_tidptr = (int __user *)regs->di;
> +       upid_setp = (struct pid_set __user *)regs->bp;
> +
> +       if (!newsp)
> +               newsp = regs->sp;
> +
> +       return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
> +                       child_tidptr, upid_setp);
> +}

I wonder if we can avoid spreading copies of this function across
all architectures.

Would it be possible to define it like this?

asmlinkage long
clone_with_pids(int flags, unsigned long child_stack_base, int __user *parent_tid_ptr,
                       int __user *child_tid_ptr, struct pid_set __user *pid_setp)
{
	if (!child_stack_base) {
		struct pt_regs *regs;

		regs = task_pt_regs(current);
		child_stack_base = user_stack_pointer(regs);
	}

	return do_fork_with_pids(clone_flags, child_stack_base, 0,
		parent_tid_ptr, child_tid_ptr, pid_setp);
}

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-10  6:13 ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
@ 2009-09-10  7:31   ` Arnd Bergmann
       [not found]     ` <200909100931.25585.arnd-r2nGTMty4D4@public.gmane.org>
  2009-09-10 21:28     ` Sukadev Bhattiprolu
       [not found]   ` <20090910061301.GG25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-10  7:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, Containers, sukadev

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> 
> +int sys_clone_with_pids(struct pt_regs *regs)
> +{
> +       unsigned long clone_flags;
> +       unsigned long newsp;
> +       int __user *parent_tidptr;
> +       int __user *child_tidptr;
> +       struct pid_set __user *upid_setp;
> +
> +       clone_flags = regs->bx;
> +       newsp = regs->cx;
> +       parent_tidptr = (int __user *)regs->dx;
> +       child_tidptr = (int __user *)regs->di;
> +       upid_setp = (struct pid_set __user *)regs->bp;
> +
> +       if (!newsp)
> +               newsp = regs->sp;
> +
> +       return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
> +                       child_tidptr, upid_setp);
> +}

I wonder if we can avoid spreading copies of this function across
all architectures.

Would it be possible to define it like this?

asmlinkage long
clone_with_pids(int flags, unsigned long child_stack_base, int __user *parent_tid_ptr,
                       int __user *child_tid_ptr, struct pid_set __user *pid_setp)
{
	if (!child_stack_base) {
		struct pt_regs *regs;

		regs = task_pt_regs(current);
		child_stack_base = user_stack_pointer(regs);
	}

	return do_fork_with_pids(clone_flags, child_stack_base, 0,
		parent_tid_ptr, child_tid_ptr, pid_setp);
}

	Arnd <><


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
       [not found]   ` <20090910061413.GH25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 15:26     ` Randy Dunlap
  0 siblings, 0 replies; 41+ messages in thread
From: Randy Dunlap @ 2009-09-10 15:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

On Wed, 9 Sep 2009 23:14:13 -0700 Sukadev Bhattiprolu wrote:

> 
> Subject: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
> 
> This gives a brief overview of the clone_with_pids() system call.  We should
> eventually describe more details either in clone(2) or in a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-8jLBTbqmX/OZamtmwQBW5tBPR1lH4CV8@public.gmane.org>
> ---
>  Documentation/clone-with-pids |   58 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> Index: linux-2.6/Documentation/clone-with-pids
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone-with-pids	2009-09-09 21:53:30.000000000 -0700
> @@ -0,0 +1,58 @@
> +
> +struct pid_set {
> +	unsigned int num_pids;
> +	pid_t pids[];
> +};
> +
> +clone_with_pids(int flags, void *child_stack_base, int *parent_tid_ptr,
> +			int *child_tid_ptr, NULL, struct pid_set *pid_setp)
> +
> +	The clone_with_pids() system call is identical to clone(), except
> +	that it allows the user to specify a pid for the child process
> +	in each of the child processes' pid name spaces.
> +
	                                    namespaces.  {as below}

> +	This system call is meant to be used when restarting an application
> +	from an earlier checkpoint. When restarting the application, the
> +	processes in the application must get the same pids they had at the
> +	time of the checkpoint.
> +
> +	The 'pid_setp' parameter defines a set of pids to use, one for each
> +	pid-namespace of the child process.  The order pids in '->pids[]'

	                                         order of pids

> +	corresponds to the nesting order of pid-namespaces, with ->pids[0]
> +	corresponding to the init_pid_ns.
> +
> +	If a pid in the ->pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the ->pids[] list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
> +	following values (the child process is not created).
> +
> +	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute

		                                                       execute

> +		this call.
> +
> +	EINVAL	The number of pids specified in 'pid_set.num_pids' exceeds
> +		the current nesting level of parent process
> +
> +	EBUSY	A requested 'pid' is in use by another process in that name
> +		space.
> +
> +Example:
> +
> +	struct pid_set pid_set { 3, {0, 99, 177} };
> +	void *child_stack = malloc(STACKSIZE);
> +
> +	/* set up child_stack, like with clone() */
> +	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
> +
> +	if (rc < 0) {
> +		perror("clone_with_pids()");
> +		exit(1);
> +	}

What happens when one of the pids is busy?  Say the last one in the
example above [177].  Are the first 2 children already cloned
or are all pids checked for availability before cloning?
If the latter, is there a race there?
and what value is returned?

---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
  2009-09-10  6:14 ` Sukadev Bhattiprolu
@ 2009-09-10 15:26   ` Randy Dunlap
       [not found]     ` <20090910082659.033ab8fd.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2009-09-10 16:31     ` Sukadev Bhattiprolu
       [not found]   ` <20090910061413.GH25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 41+ messages in thread
From: Randy Dunlap @ 2009-09-10 15:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, Containers, sukadev

On Wed, 9 Sep 2009 23:14:13 -0700 Sukadev Bhattiprolu wrote:

> 
> Subject: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
> 
> This gives a brief overview of the clone_with_pids() system call.  We should
> eventually describe more details either in clone(2) or in a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
> ---
>  Documentation/clone-with-pids |   58 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> Index: linux-2.6/Documentation/clone-with-pids
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone-with-pids	2009-09-09 21:53:30.000000000 -0700
> @@ -0,0 +1,58 @@
> +
> +struct pid_set {
> +	unsigned int num_pids;
> +	pid_t pids[];
> +};
> +
> +clone_with_pids(int flags, void *child_stack_base, int *parent_tid_ptr,
> +			int *child_tid_ptr, NULL, struct pid_set *pid_setp)
> +
> +	The clone_with_pids() system call is identical to clone(), except
> +	that it allows the user to specify a pid for the child process
> +	in each of the child processes' pid name spaces.
> +
	                                    namespaces.  {as below}

> +	This system call is meant to be used when restarting an application
> +	from an earlier checkpoint. When restarting the application, the
> +	processes in the application must get the same pids they had at the
> +	time of the checkpoint.
> +
> +	The 'pid_setp' parameter defines a set of pids to use, one for each
> +	pid-namespace of the child process.  The order pids in '->pids[]'

	                                         order of pids

> +	corresponds to the nesting order of pid-namespaces, with ->pids[0]
> +	corresponding to the init_pid_ns.
> +
> +	If a pid in the ->pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the ->pids[] list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
> +	following values (the child process is not created).
> +
> +	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute

		                                                       execute

> +		this call.
> +
> +	EINVAL	The number of pids specified in 'pid_set.num_pids' exceeds
> +		the current nesting level of parent process
> +
> +	EBUSY	A requested 'pid' is in use by another process in that name
> +		space.
> +
> +Example:
> +
> +	struct pid_set pid_set { 3, {0, 99, 177} };
> +	void *child_stack = malloc(STACKSIZE);
> +
> +	/* set up child_stack, like with clone() */
> +	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
> +
> +	if (rc < 0) {
> +		perror("clone_with_pids()");
> +		exit(1);
> +	}

What happens when one of the pids is busy?  Say the last one in the
example above [177].  Are the first 2 children already cloned
or are all pids checked for availability before cloning?
If the latter, is there a race there?
and what value is returned?

---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
       [not found]     ` <20090910082659.033ab8fd.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 16:31       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 16:31 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: arnd-r2nGTMty4D4, Containers, Nathan Lynch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov


Thanks for the review. Will fix the typos.

Randy Dunlap [randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org] wrote:
| > +Example:
| > +
| > +	struct pid_set pid_set { 3, {0, 99, 177} };
| > +	void *child_stack = malloc(STACKSIZE);
| > +
| > +	/* set up child_stack, like with clone() */
| > +	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
| > +
| > +	if (rc < 0) {
| > +		perror("clone_with_pids()");
| > +		exit(1);
| > +	}
| 
| What happens when one of the pids is busy?  Say the last one in the
| example above [177].  Are the first 2 children already cloned
| or are all pids checked for availability before cloning?

Only _one_ one child process is created (on success). With nested pid
namespaces a process is known by different pids (or pid numbers to be
precise) - one in each pid namespace.

(BTW, looks like we did not document pid-namespaces in the Documentation/
But please see the CLONE_NEWPID section of a recent version of clone(2)).

In short, the above clone_with_pids() tries to create a single child process
with the given pids.

| If the latter, is there a race there?
| and what value is returned?

There is no race - bc just a single process is being created.

If any of the pids are not available, the child process is not created
and the system call returns -EBUSY (if the user requested say 99 and 99
is in use by another process)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 9/9]: Document clone_with_pids() syscall
  2009-09-10 15:26   ` Randy Dunlap
       [not found]     ` <20090910082659.033ab8fd.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2009-09-10 16:31     ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 16:31 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, Containers, sukadev


Thanks for the review. Will fix the typos.

Randy Dunlap [randy.dunlap@oracle.com] wrote:
| > +Example:
| > +
| > +	struct pid_set pid_set { 3, {0, 99, 177} };
| > +	void *child_stack = malloc(STACKSIZE);
| > +
| > +	/* set up child_stack, like with clone() */
| > +	rc = clone_with_pids(clone_flags, child_stack, NULL, NULL, &pid_set);
| > +
| > +	if (rc < 0) {
| > +		perror("clone_with_pids()");
| > +		exit(1);
| > +	}
| 
| What happens when one of the pids is busy?  Say the last one in the
| example above [177].  Are the first 2 children already cloned
| or are all pids checked for availability before cloning?

Only _one_ one child process is created (on success). With nested pid
namespaces a process is known by different pids (or pid numbers to be
precise) - one in each pid namespace.

(BTW, looks like we did not document pid-namespaces in the Documentation/
But please see the CLONE_NEWPID section of a recent version of clone(2)).

In short, the above clone_with_pids() tries to create a single child process
with the given pids.

| If the latter, is there a race there?
| and what value is returned?

There is no race - bc just a single process is being created.

If any of the pids are not available, the child process is not created
and the system call returns -EBUSY (if the user requested say 99 and 99
is in use by another process)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
       [not found]     ` <200909100931.25585.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-09-10 21:28       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 21:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

Arnd Bergmann [arnd-r2nGTMty4D4@public.gmane.org] wrote:
| 
| I wonder if we can avoid spreading copies of this function across
| all architectures.
| 
| Would it be possible to define it like this?

Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
I pass in the pt_regs.

arch/x86/kernel/entry_32.S lists clone() under this comment:

/*
 * System calls that need a pt_regs pointer.
 */

Is there a guideline on what system calls use/need pt_regs ?

Thanks,

Suakdev

| 
| asmlinkage long
| clone_with_pids(int flags, unsigned long child_stack_base, int __user *parent_tid_ptr,
|                        int __user *child_tid_ptr, struct pid_set __user *pid_setp)
| {
| 	if (!child_stack_base) {
| 		struct pt_regs *regs;
| 
| 		regs = task_pt_regs(current);
| 		child_stack_base = user_stack_pointer(regs);
| 	}
| 
| 	return do_fork_with_pids(clone_flags, child_stack_base, 0,
| 		parent_tid_ptr, child_tid_ptr, pid_setp);
| }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-10  7:31   ` Arnd Bergmann
       [not found]     ` <200909100931.25585.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-09-10 21:28     ` Sukadev Bhattiprolu
  2009-09-11 10:31       ` Arnd Bergmann
       [not found]       ` <20090910212837.GA31459-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 21:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, Containers, sukadev

Arnd Bergmann [arnd@arndb.de] wrote:
| 
| I wonder if we can avoid spreading copies of this function across
| all architectures.
| 
| Would it be possible to define it like this?

Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
I pass in the pt_regs.

arch/x86/kernel/entry_32.S lists clone() under this comment:

/*
 * System calls that need a pt_regs pointer.
 */

Is there a guideline on what system calls use/need pt_regs ?

Thanks,

Suakdev

| 
| asmlinkage long
| clone_with_pids(int flags, unsigned long child_stack_base, int __user *parent_tid_ptr,
|                        int __user *child_tid_ptr, struct pid_set __user *pid_setp)
| {
| 	if (!child_stack_base) {
| 		struct pt_regs *regs;
| 
| 		regs = task_pt_regs(current);
| 		child_stack_base = user_stack_pointer(regs);
| 	}
| 
| 	return do_fork_with_pids(clone_flags, child_stack_base, 0,
| 		parent_tid_ptr, child_tid_ptr, pid_setp);
| }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
       [not found]     ` <200909100905.35817.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-09-10 21:29       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 21:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

Arnd Bergmann [arnd-r2nGTMty4D4@public.gmane.org] wrote:
| On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
| > +long do_fork(unsigned long clone_flags,
| > +             unsigned long stack_start,
| > +             struct pt_regs *regs,
| > +             unsigned long stack_size,
| > +             int __user *parent_tidptr,
| > +             int __user *child_tidptr)
| > +{
| > +       return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
| > +                       parent_tidptr, child_tidptr, NULL);
| > +}
| > +
| 
| How about making this one a static inline in linux/sched.h?

Yes. I will inline do_fork().

Sukadev

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 7/9]: Define do_fork_with_pids()
  2009-09-10  7:05   ` Arnd Bergmann
       [not found]     ` <200909100905.35817.arnd-r2nGTMty4D4@public.gmane.org>
@ 2009-09-10 21:29     ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-10 21:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, Containers, sukadev

Arnd Bergmann [arnd@arndb.de] wrote:
| On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
| > +long do_fork(unsigned long clone_flags,
| > +             unsigned long stack_start,
| > +             struct pt_regs *regs,
| > +             unsigned long stack_size,
| > +             int __user *parent_tidptr,
| > +             int __user *child_tidptr)
| > +{
| > +       return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
| > +                       parent_tidptr, child_tidptr, NULL);
| > +}
| > +
| 
| How about making this one a static inline in linux/sched.h?

Yes. I will inline do_fork().

Sukadev

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
       [not found]       ` <20090910212837.GA31459-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-11 10:31         ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 10:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	Pavel Emelyanov

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> I pass in the pt_regs.
> 
> arch/x86/kernel/entry_32.S lists clone() under this comment:
> 
> /*
>  * System calls that need a pt_regs pointer.
>  */
> 
> Is there a guideline on what system calls use/need pt_regs ?

You need pt_regs if you access any registers from the user task
other than the argument registers. In case of clone(), this is
the user stack pointer.

The user_stack_pointer() function is relatively new, before this
you couldn't get the stack pointer out of pt_regs in a generic
way.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-10 21:28     ` Sukadev Bhattiprolu
@ 2009-09-11 10:31       ` Arnd Bergmann
       [not found]         ` <200909111231.30495.arnd-r2nGTMty4D4@public.gmane.org>
       [not found]       ` <20090910212837.GA31459-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 10:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, Containers, sukadev

On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> I pass in the pt_regs.
> 
> arch/x86/kernel/entry_32.S lists clone() under this comment:
> 
> /*
>  * System calls that need a pt_regs pointer.
>  */
> 
> Is there a guideline on what system calls use/need pt_regs ?

You need pt_regs if you access any registers from the user task
other than the argument registers. In case of clone(), this is
the user stack pointer.

The user_stack_pointer() function is relatively new, before this
you couldn't get the stack pointer out of pt_regs in a generic
way.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-11 10:31       ` Arnd Bergmann
@ 2009-09-11 11:00             ` Louis Rilling
  0 siblings, 0 replies; 41+ messages in thread
From: Louis Rilling @ 2009-09-11 11:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, Pavel Emelyanov


[-- Attachment #1.1: Type: text/plain, Size: 925 bytes --]

On 11/09/09 12:31 +0200, Arnd Bergmann wrote:
> On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> > Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> > I pass in the pt_regs.
> > 
> > arch/x86/kernel/entry_32.S lists clone() under this comment:
> > 
> > /*
> >  * System calls that need a pt_regs pointer.
> >  */
> > 
> > Is there a guideline on what system calls use/need pt_regs ?
> 
> You need pt_regs if you access any registers from the user task
> other than the argument registers. In case of clone(), this is
> the user stack pointer.

AFAICS clone() actually needs _all_ registers (see x86 copy_thread() for
instance). Any variant of clone() will have this requirement.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

[-- Attachment #2: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
@ 2009-09-11 11:00             ` Louis Rilling
  0 siblings, 0 replies; 41+ messages in thread
From: Louis Rilling @ 2009-09-11 11:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

On 11/09/09 12:31 +0200, Arnd Bergmann wrote:
> On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> > Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> > I pass in the pt_regs.
> > 
> > arch/x86/kernel/entry_32.S lists clone() under this comment:
> > 
> > /*
> >  * System calls that need a pt_regs pointer.
> >  */
> > 
> > Is there a guideline on what system calls use/need pt_regs ?
> 
> You need pt_regs if you access any registers from the user task
> other than the argument registers. In case of clone(), this is
> the user stack pointer.

AFAICS clone() actually needs _all_ registers (see x86 copy_thread() for
instance). Any variant of clone() will have this requirement.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
       [not found]             ` <20090911110056.GA12824-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2009-09-11 11:12               ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 11:12 UTC (permalink / raw)
  To: Louis Rilling
  Cc: Containers, Nathan Lynch, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, hpa-YMNOUZJC4hwAvxtiuMwx3w, mingo-X9Un+BFzKDI,
	Sukadev Bhattiprolu, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, Pavel Emelyanov

On Friday 11 September 2009, Louis Rilling wrote:
> On 11/09/09 12:31 +0200, Arnd Bergmann wrote:
> > On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> > > Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> > > I pass in the pt_regs.
> > > 
> > > arch/x86/kernel/entry_32.S lists clone() under this comment:
> > > 
> > > /*
> > >  * System calls that need a pt_regs pointer.
> > >  */
> > > 
> > > Is there a guideline on what system calls use/need pt_regs ?
> > 
> > You need pt_regs if you access any registers from the user task
> > other than the argument registers. In case of clone(), this is
> > the user stack pointer.
> 
> AFAICS clone() actually needs all registers (see x86 copy_thread() for
> instance). Any variant of clone() will have this requirement.

Right, but that part (do_fork) is already handled in common code, which
function calls into architecture specific code.

The point I was making is that the sys_clone() and sys_clone_with_pids()
functions themselves only access the argument registers, the stack
pointer and pass those down to do_fork, along with pt_regs.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall
  2009-09-11 11:00             ` Louis Rilling
  (?)
@ 2009-09-11 11:12             ` Arnd Bergmann
  -1 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 11:12 UTC (permalink / raw)
  To: Louis Rilling
  Cc: Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov

On Friday 11 September 2009, Louis Rilling wrote:
> On 11/09/09 12:31 +0200, Arnd Bergmann wrote:
> > On Thursday 10 September 2009, Sukadev Bhattiprolu wrote:
> > > Since this is a variant of clone() and clone is listed as a PTREGSCALL(),
> > > I pass in the pt_regs.
> > > 
> > > arch/x86/kernel/entry_32.S lists clone() under this comment:
> > > 
> > > /*
> > >  * System calls that need a pt_regs pointer.
> > >  */
> > > 
> > > Is there a guideline on what system calls use/need pt_regs ?
> > 
> > You need pt_regs if you access any registers from the user task
> > other than the argument registers. In case of clone(), this is
> > the user stack pointer.
> 
> AFAICS clone() actually needs all registers (see x86 copy_thread() for
> instance). Any variant of clone() will have this requirement.

Right, but that part (do_fork) is already handled in common code, which
function calls into architecture specific code.

The point I was making is that the sys_clone() and sys_clone_with_pids()
functions themselves only access the argument registers, the stack
pointer and pass those down to do_fork, along with pt_regs.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
                   ` (9 preceding siblings ...)
  2009-09-10  6:14 ` Sukadev Bhattiprolu
@ 2009-09-11 11:22 ` Peter Zijlstra
  2009-09-11 11:34   ` Arnd Bergmann
  10 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2009-09-11 11:22 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, container, sukadev

On Wed, 2009-09-09 at 23:06 -0700, Sukadev Bhattiprolu wrote:

> Based on these requirements and constraints, we have been exploring a couple
> of system call interfaces and appreciate any iput.  

Why not have something like:

struct clone_struct {
	u32 size;
	u32 __reserved;
	u64 flags;
	u64 child_stack;
	u32 child_tid;
	u32 parent_tid;
};

struct clone_pid_struct {
	u32 nr;
	pid_t pids[];
};

int clone2(struct clone_struct *cs, struct clone_pid_struct *cps);

If you then get passed a longer clone_struct than you know about, all is
well IFF the tail is 0, otherwise fail with -E2BIG.

If you get passed a short clone_struct, zero out the tail.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 11:22 ` [RFC][v6][PATCH 0/9] " Peter Zijlstra
@ 2009-09-11 11:34   ` Arnd Bergmann
  2009-09-11 11:40     ` Peter Zijlstra
  2009-09-11 16:47     ` Sukadev Bhattiprolu
  0 siblings, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 11:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch,
	container, sukadev

On Friday 11 September 2009, Peter Zijlstra wrote:
> Why not have something like:
> 
> struct clone_struct {
>         u32 size;
>         u32 __reserved;
>         u64 flags;
>         u64 child_stack;
>         u32 child_tid;
>         u32 parent_tid;
> };
> 
> struct clone_pid_struct {
>         u32 nr;
>         pid_t pids[];
> };
> 
> int clone2(struct clone_struct *cs, struct clone_pid_struct *cps);
> 
> If you then get passed a longer clone_struct than you know about, all is
> well IFF the tail is 0, otherwise fail with -E2BIG.
> 
> If you get passed a short clone_struct, zero out the tail.

I would leave out the size argument. We can put a few reserved fields
and flag bits in there for possible extensions, but if we ever run out
of these, just define a new syscall.

Also, if you're passing a struct, why not put nr_pids in there, and
replace clone_pid_struct with a simple array? That would give us

struct clone_struct {
        u64 flags;
        u64 child_stack;
        u32 child_tid;
        u32 parent_tid;
	u32 nr_pids;
	u32 reserved1;
	u64 reserved2;
};
 
int clone2(struct clone_struct *cs, pid_t *pids);

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 11:34   ` Arnd Bergmann
@ 2009-09-11 11:40     ` Peter Zijlstra
  2009-09-11 11:50       ` Arnd Bergmann
  2009-09-11 16:47     ` Sukadev Bhattiprolu
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2009-09-11 11:40 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch,
	container, sukadev

On Fri, 2009-09-11 at 13:34 +0200, Arnd Bergmann wrote:
> On Friday 11 September 2009, Peter Zijlstra wrote:

> > If you then get passed a longer clone_struct than you know about, all is
> > well IFF the tail is 0, otherwise fail with -E2BIG.
> > 
> > If you get passed a short clone_struct, zero out the tail.
> 
> I would leave out the size argument. We can put a few reserved fields
> and flag bits in there for possible extensions, but if we ever run out
> of these, just define a new syscall.

Why? If we can avoid this new syscall isn't that nicer?

> Also, if you're passing a struct, why not put nr_pids in there, and
> replace clone_pid_struct with a simple array? That would give us

Sure..



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 11:40     ` Peter Zijlstra
@ 2009-09-11 11:50       ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-11 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch,
	container, sukadev

On Friday 11 September 2009, Peter Zijlstra wrote:
> On Fri, 2009-09-11 at 13:34 +0200, Arnd Bergmann wrote:
> > On Friday 11 September 2009, Peter Zijlstra wrote:
> 
> > > If you then get passed a longer clone_struct than you know about, all is
> > > well IFF the tail is 0, otherwise fail with -E2BIG.
> > > 
> > > If you get passed a short clone_struct, zero out the tail.
> > 
> > I would leave out the size argument. We can put a few reserved fields
> > and flag bits in there for possible extensions, but if we ever run out
> > of these, just define a new syscall.
> 
> Why? If we can avoid this new syscall isn't that nicer?

There is a limit to how much flexibility I would aim for. In the last
fourty years, we needed three revisions of that call (fork, clone,
clone2). By invalid extrapolation, adding room for another extension
should give us at least twenty years ;-)

Also, the flags field basically has the same purpose are the size field,
so we do not need both. In the worst case, you can define one of the
flags to mean 'the structure is now 168 bytes long'.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 11:34   ` Arnd Bergmann
  2009-09-11 11:40     ` Peter Zijlstra
@ 2009-09-11 16:47     ` Sukadev Bhattiprolu
  2009-09-11 17:00       ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-11 16:47 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, linux-kernel, Oren Laadan, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, container, sukadev

| Also, if you're passing a struct, why not put nr_pids in there, and
| replace clone_pid_struct with a simple array? That would give us
| 
| struct clone_struct {
|         u64 flags;
|         u64 child_stack;
|         u32 child_tid;
|         u32 parent_tid;
| 	u32 nr_pids;
| 	u32 reserved1;
| 	u64 reserved2;
| };
| 
| int clone2(struct clone_struct *cs, pid_t *pids);

My only concern with this approach was the extra copy_from_user() in the
common case (i.e when not using the extended features).  I assume the
overhead of copy_from_user() is small enough to be ignored ?

Thanks,

Sukadev

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 16:47     ` Sukadev Bhattiprolu
@ 2009-09-11 17:00       ` Peter Zijlstra
  2009-09-12 17:19         ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2009-09-11 17:00 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Arnd Bergmann, linux-kernel, Oren Laadan, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, container, sukadev

On Fri, 2009-09-11 at 09:47 -0700, Sukadev Bhattiprolu wrote:
> | Also, if you're passing a struct, why not put nr_pids in there, and
> | replace clone_pid_struct with a simple array? That would give us
> | 
> | struct clone_struct {
> |         u64 flags;
> |         u64 child_stack;
> |         u32 child_tid;
> |         u32 parent_tid;
> | 	u32 nr_pids;
> | 	u32 reserved1;
> | 	u64 reserved2;
> | };
> | 
> | int clone2(struct clone_struct *cs, pid_t *pids);
> 
> My only concern with this approach was the extra copy_from_user() in the
> common case (i.e when not using the extended features).  I assume the
> overhead of copy_from_user() is small enough to be ignored ?

I would think so, esp for small structure, it would be a cache hot copy.
That is, for x86 I doubt it'll show up. No idea what hoops other arch
have to jump through to get copy_from_user() doing what it does.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-11 17:00       ` Peter Zijlstra
@ 2009-09-12 17:19         ` Sukadev Bhattiprolu
  2009-09-13 14:36           ` Arnd Bergmann
  2009-09-14  7:14           ` Peter Zijlstra
  0 siblings, 2 replies; 41+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-12 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnd Bergmann, linux-kernel, Oren Laadan, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, container, sukadev

Peter Zijlstra [peterz@infradead.org] wrote:
| On Fri, 2009-09-11 at 09:47 -0700, Sukadev Bhattiprolu wrote:
| > | Also, if you're passing a struct, why not put nr_pids in there, and
| > | replace clone_pid_struct with a simple array? That would give us
| > | 
| > | struct clone_struct {
| > |         u64 flags;
| > |         u64 child_stack;
| > |         u32 child_tid;
| > |         u32 parent_tid;

BTW, these two tids are __user pointers that kernel copies data into.
They should be u64 to avoid conversions in architecture specific code ?

| > | 	u32 nr_pids;
| > | 	u32 reserved1;
| > | 	u64 reserved2;
| > | };
| > | 
| > | int clone2(struct clone_struct *cs, pid_t *pids);
| > 
| > My only concern with this approach was the extra copy_from_user() in the
| > common case (i.e when not using the extended features).  I assume the
| > overhead of copy_from_user() is small enough to be ignored ?
| 
| I would think so, esp for small structure, it would be a cache hot copy.
| That is, for x86 I doubt it'll show up. No idea what hoops other arch
| have to jump through to get copy_from_user() doing what it does.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-12 17:19         ` Sukadev Bhattiprolu
@ 2009-09-13 14:36           ` Arnd Bergmann
  2009-09-14  7:14           ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2009-09-13 14:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Peter Zijlstra, linux-kernel, Oren Laadan, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, container, sukadev

On Saturday 12 September 2009, Sukadev Bhattiprolu wrote:
> 
> Peter Zijlstra [peterz@infradead.org] wrote:
> | On Fri, 2009-09-11 at 09:47 -0700, Sukadev Bhattiprolu wrote:
> | > | Also, if you're passing a struct, why not put nr_pids in there, and
> | > | replace clone_pid_struct with a simple array? That would give us
> | > | 
> | > | struct clone_struct {
> | > |         u64 flags;
> | > |         u64 child_stack;
> | > |         u32 child_tid;
> | > |         u32 parent_tid;
> 
> BTW, these two tids are __user pointers that kernel copies data into.
> They should be u64 to avoid conversions in architecture specific code ?
> 

No, they are not pointers, the kernel can directly write into the user
data structure. Indirect pointers would not be helpful here IMHO.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][v6][PATCH 0/9] clone_with_pids() syscall
  2009-09-12 17:19         ` Sukadev Bhattiprolu
  2009-09-13 14:36           ` Arnd Bergmann
@ 2009-09-14  7:14           ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2009-09-14  7:14 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Arnd Bergmann, linux-kernel, Oren Laadan, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, container, sukadev

On Sat, 2009-09-12 at 10:19 -0700, Sukadev Bhattiprolu wrote:
> Peter Zijlstra [peterz@infradead.org] wrote:
> | On Fri, 2009-09-11 at 09:47 -0700, Sukadev Bhattiprolu wrote:
> | > | Also, if you're passing a struct, why not put nr_pids in there, and
> | > | replace clone_pid_struct with a simple array? That would give us
> | > | 
> | > | struct clone_struct {
> | > |         u64 flags;
> | > |         u64 child_stack;
> | > |         u32 child_tid;
> | > |         u32 parent_tid;
> 
> BTW, these two tids are __user pointers that kernel copies data into.
> They should be u64 to avoid conversions in architecture specific code ?

Yeah, just take the idea not the details from these posts ;-)


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2009-09-14  7:14 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-10  6:06 [RFC][v6][PATCH 0/9] clone_with_pids() syscall Sukadev Bhattiprolu
2009-09-10  6:08 ` [RFC][v6][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-09-10  6:09 ` [RFC][v6][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-09-10  6:09 ` [RFC][v6][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
2009-09-10  6:09 ` [RFC][v6][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
2009-09-10  6:10 ` [RFC][v6][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-09-10  6:11 ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-09-10  6:12 ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found]   ` <20090910061227.GF25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-10  7:05     ` Arnd Bergmann
2009-09-10  7:05   ` Arnd Bergmann
     [not found]     ` <200909100905.35817.arnd-r2nGTMty4D4@public.gmane.org>
2009-09-10 21:29       ` Sukadev Bhattiprolu
2009-09-10 21:29     ` Sukadev Bhattiprolu
2009-09-10  6:13 ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
2009-09-10  7:31   ` Arnd Bergmann
     [not found]     ` <200909100931.25585.arnd-r2nGTMty4D4@public.gmane.org>
2009-09-10 21:28       ` Sukadev Bhattiprolu
2009-09-10 21:28     ` Sukadev Bhattiprolu
2009-09-11 10:31       ` Arnd Bergmann
     [not found]         ` <200909111231.30495.arnd-r2nGTMty4D4@public.gmane.org>
2009-09-11 11:00           ` Louis Rilling
2009-09-11 11:00             ` Louis Rilling
2009-09-11 11:12             ` Arnd Bergmann
     [not found]             ` <20090911110056.GA12824-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-09-11 11:12               ` Arnd Bergmann
     [not found]       ` <20090910212837.GA31459-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-11 10:31         ` Arnd Bergmann
     [not found]   ` <20090910061301.GG25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-10  7:31     ` Arnd Bergmann
     [not found] ` <20090910060627.GA24343-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-10  6:11   ` [RFC][v6][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-09-10  6:12   ` [RFC][v6][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
2009-09-10  6:13   ` [RFC][v6][PATCH 8/9]: Define clone_with_pids() syscall Sukadev Bhattiprolu
2009-09-10  6:14   ` [RFC][v6][PATCH 9/9]: Document " Sukadev Bhattiprolu
2009-09-10  6:14 ` Sukadev Bhattiprolu
2009-09-10 15:26   ` Randy Dunlap
     [not found]     ` <20090910082659.033ab8fd.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2009-09-10 16:31       ` Sukadev Bhattiprolu
2009-09-10 16:31     ` Sukadev Bhattiprolu
     [not found]   ` <20090910061413.GH25883-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-10 15:26     ` Randy Dunlap
2009-09-11 11:22 ` [RFC][v6][PATCH 0/9] " Peter Zijlstra
2009-09-11 11:34   ` Arnd Bergmann
2009-09-11 11:40     ` Peter Zijlstra
2009-09-11 11:50       ` Arnd Bergmann
2009-09-11 16:47     ` Sukadev Bhattiprolu
2009-09-11 17:00       ` Peter Zijlstra
2009-09-12 17:19         ` Sukadev Bhattiprolu
2009-09-13 14:36           ` Arnd Bergmann
2009-09-14  7:14           ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.