All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: Sukadev Bhattiprolu
	<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	"David C. Hansen"
	<haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC][v3][PATCH 7/7] Define clone_with_pids syscall
Date: Sun, 31 May 2009 13:59:51 -0400	[thread overview]
Message-ID: <4A22C597.8060207@cs.columbia.edu> (raw)
In-Reply-To: <20090531000350.GF4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>


the code looks good. A couple of nits about comments, below.

Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Sukadev Bhattiprolu wrote:
> Serge,
> 
> I have removed restriction on CLONE_NEWPID and may need a new ack.
> 
> Sukadev
> ---
> From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Date: Mon, 4 May 2009 01:17:45 -0700
> Subject: [RFC][v3][PATCH 7/7] Define clone_with_pids syscall
> 
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
> 
> clone_with_pids(), intended for use during restart, is the same as clone(),
> except that it takes a 'target_pid_set' paramter. This parameter lets caller
> choose specific pid numbers for the child process, in the process's active
> and ancestor pid namespaces. (Descendant pid namespaces in general don't
> matter since processes don't have pids in them anyway, but see comments
> in copy_target_pids() regarding CLONE_NEWPID).
> 
> Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
> prevent unprivileged processes from misusing this interface.
> 
> Call clone_with_pids as follows:
> 
> 	pid_t pids[] = { 0, 77, 99 };
> 	struct target_pid_set pid_set;
> 
> 	pid_set.num_pids = sizeof(pids) / sizeof(int);
> 	pid_set.target_pids = &pids;
> 
> 	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
> 
> If a target-pid is 0, the kernel continues to assign a pid for the process in
> that namespace. In the above example, pids[0] is 0, meaning the kernel will
> assign next available pid to the process in init_pid_ns. But kernel will assign
> pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
> 77 or 99 are taken, the system call fails with -EBUSY.
> 
> If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
> the system call fails with -EINVAL.

Perhaps also add:
" If 'pid_set.num_pids' falls short of the current nesting level of
  pid namespaces, the syscall call assumes extra 0's at the head of
  the array."

> 
> Its mostly an exploratory patch seeking feedback on the interface.
> 
> NOTE:
> 	Compared to clone(), clone_with_pids() needs to pass in two more
> 	pieces of information:
> 
> 		- number of pids in the set
> 		- user buffer containing the list of pids.
> 
> 	But since clone() already takes 5 parameters, use a 'struct
> 	target_pid_set'.
> 
> TODO:
> 	- Gently tested.
> 	- May need additional sanity checks in do_fork_with_pids().
> 
> Changelog[v3]:
> 	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> 	  in the target_pids[] list and setting it 0. See copy_target_pids()).
> 	- (Oren Laadan) Specified target pids should apply only to youngest
> 	  pid-namespaces (see copy_target_pids())
> 	- (Matt Helsley) Update patch description.
> 
> Changelog[v2]:
> 	- Remove unnecessary printk and add a note to callers of
> 	  copy_target_pids() to free target_pids.
> 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> 	  'num_pids == 0' (fall back to normal clone()).
> 	- Move arch-independent code (sanity checks and copy-in of target-pids)
> 	  into kernel/fork.c and simplify sys_clone_with_pids()
> 
> Changelog[v1]:
> 	- Fixed some compile errors (had fixed these errors earlier in my
> 	  git tree but had not refreshed patches before emailing them)
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  arch/x86/include/asm/syscalls.h    |    1 +
>  arch/x86/include/asm/unistd_32.h   |    1 +
>  arch/x86/kernel/entry_32.S         |    1 +
>  arch/x86/kernel/process_32.c       |   21 +++++++
>  arch/x86/kernel/syscall_table_32.S |    1 +
>  kernel/fork.c                      |  108 +++++++++++++++++++++++++++++++++++-
>  6 files changed, 132 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> index 7043408..1fdc149 100644
> --- a/arch/x86/include/asm/syscalls.h
> +++ b/arch/x86/include/asm/syscalls.h
> @@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
>  /* kernel/process_32.c */
>  int sys_fork(struct pt_regs *);
>  int sys_clone(struct pt_regs *);
> +int sys_clone_with_pids(struct pt_regs *);
>  int sys_vfork(struct pt_regs *);
>  int sys_execve(struct pt_regs *);
>  
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6e72d74..90f906f 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
>  #define __NR_inotify_init1	332
>  #define __NR_preadv		333
>  #define __NR_pwritev		334
> +#define __NR_clone_with_pids	335
>  
>  #ifdef __KERNEL__
>  
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index c929add..ee92b0d 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -707,6 +707,7 @@ ptregs_##name: \
>  PTREGSCALL(iopl)
>  PTREGSCALL(fork)
>  PTREGSCALL(clone)
> +PTREGSCALL(clone_with_pids)
>  PTREGSCALL(vfork)
>  PTREGSCALL(execve)
>  PTREGSCALL(sigaltstack)
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 76f8f84..1efc3de 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs)
>  	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>  
> +int sys_clone_with_pids(struct pt_regs *regs)
> +{
> +	unsigned long clone_flags;
> +	unsigned long newsp;
> +	int __user *parent_tidptr;
> +	int __user *child_tidptr;
> +	void __user *upid_setp;
> +
> +	clone_flags = regs->bx;
> +	newsp = regs->cx;
> +	parent_tidptr = (int __user *)regs->dx;
> +	child_tidptr = (int __user *)regs->di;
> +	upid_setp = (void __user *)regs->bp;
> +
> +	if (!newsp)
> +		newsp = regs->sp;
> +
> +	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
> +			child_tidptr, upid_setp);
> +}
> +
>  /*
>   * sys_execve() executes a new program.
>   */
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index ff5c873..94c1a58 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
>  	.long sys_inotify_init1
>  	.long sys_preadv
>  	.long sys_pwritev
> +	.long ptregs_clone_with_pids	/* 335 */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a16ef7b..7738aed 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1335,6 +1335,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
>  }
>  
>  /*
> + * If user specified any 'target-pids' in @upid_setp, copy them from
> + * user and return a pointer to a local copy of the list of pids. The
> + * caller must free the list, when they are done using it.
> + *
> + * If user did not specify any target pids, return NULL (caller should
> + * treat this like normal clone).
> + *
> + * On any errors, return the error code
> + */
> +static pid_t *copy_target_pids(void __user *upid_setp)
> +{
> +	int j;
> +	int rc;
> +	int size;
> +	int unum_pids;		/* # of pids specified by user */
> +	int knum_pids;		/* # of pids needed in kernel */
> +	pid_t *target_pids;
> +	struct target_pid_set pid_set;
> +
> +	if (!upid_setp)
> +		return NULL;
> +
> +	rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
> +	if (rc)
> +		return ERR_PTR(-EFAULT);
> +
> +	unum_pids = pid_set.num_pids;
> +	knum_pids = task_pid(current)->level + 1;
> +
> +	if (!unum_pids)
> +		return NULL;
> +
> +	if (unum_pids < 0 || unum_pids > knum_pids)
> +		return ERR_PTR(-EINVAL);
> +
> +	/*
> +	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
> +	 * and set it to 0. This last entry in target_pids[] corresponds to the
> +	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
> +	 * specified. If CLONE_NEWPID was not specified, this last entry will
> +	 * simply be ignored.
> +	 */
> +	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
> +	if (!target_pids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * A process running in a level 2 pid namespace has three pid namespaces
> +	 * and hence three pid numbers. If this process is checkpointed,
> +	 * information about these three namespaces are saved. We refer to these
> +	 * namespaces as 'known namespaces'.

The number of levels saved at checkpoint depends not only on the
checkpointee, but also on the checkpointer's nesting level.

Perhaps:
  [...and hence three pid numbers.] The process can be checkpointed
  from (its) level 0, 1 or 2. If this process is checkpointed from
  level 1, information about two namespaces is saved. [We refer ...]

> +	 *
> +	 * If this checkpointed process is however restarted in a level 3 pid
> +	 * namespace, the restarted process has an extra ancestor pid namespace
> +	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
> +	 *
> +	 * During restart, the process requests specific pids for its 'known
> +	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
> +	 *
> +	 * Since the requested-pids correspond to 'known namespaces' and since
> +	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
> +	 * namespaces', copy requested pids to the back-end of target_pids[]
> +	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
> +	 * Any entries in target_pids[] not corresponding to a requested pid
> +	 * will be set to zero and kernel assigns a pid in those namespaces.
> +	 *
> +	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
> +	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
> +	 * 	 the order is:
> +	 *
> +	 * 		- pids for 'unknown-namespaces' (if any)
> +	 * 		- pids for 'known-namespaces' (requested pids)
> +	 * 		- 0 in the last entry (for CLONE_NEWPID).
> +	 */
> +	j = knum_pids - unum_pids;
> +	size = unum_pids * sizeof(pid_t);
> +
> +	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
> +	if (rc) {
> +		rc = -EFAULT;
> +		goto out_free;
> +	}
> +
> +	return target_pids;
> +
> +out_free:
> +	kfree(target_pids);
> +	return ERR_PTR(rc);
> +}
> +
> +/*
>   *  Ok, this is the main fork-routine.
>   *
>   * It copies the process, and if successful kick-starts
> @@ -1351,7 +1442,7 @@ long do_fork_with_pids(unsigned long clone_flags,
>  	struct task_struct *p;
>  	int trace = 0;
>  	long nr;
> -	pid_t *target_pids = NULL;
> +	pid_t *target_pids;
>  
>  	/*
>  	 * Do some preliminary argument and permissions checking before we
> @@ -1385,6 +1476,17 @@ long do_fork_with_pids(unsigned long clone_flags,
>  		}
>  	}
>  
> +	target_pids = copy_target_pids(pid_setp);
> +
> +	if (target_pids) {
> +		if (IS_ERR(target_pids))
> +			return PTR_ERR(target_pids);
> +
> +		nr = -EPERM;
> +		if (!capable(CAP_SYS_ADMIN))
> +			goto out_free;
> +	}
> +
>  	/*
>  	 * When called from kernel_thread, don't do user tracing stuff.
>  	 */
> @@ -1446,6 +1548,10 @@ long do_fork_with_pids(unsigned long clone_flags,
>  	} else {
>  		nr = PTR_ERR(p);
>  	}
> +
> +out_free:
> +	kfree(target_pids);
> +
>  	return nr;
>  }
>  

  parent reply	other threads:[~2009-05-31 17:59 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-30 23:57 [RFC][v3][PATCH 1/7] Factor out code to allocate pidmap page Sukadev Bhattiprolu
     [not found] ` <20090530235714.GA4083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-31  0:01   ` [RFC][v3][PATCH 2/7] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
     [not found]     ` <20090531000115.GA4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:17       ` Amerigo Wang
2009-05-31  0:01   ` [RFC][v3][PATCH 3/7] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
     [not found]     ` <20090531000144.GB4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:30       ` Amerigo Wang
2009-05-31  0:02   ` [RFC][v3][PATCH 4/7] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
     [not found]     ` <20090531000220.GC4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:34       ` Amerigo Wang
     [not found]         ` <20090601083419.GE4381-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-01 20:52           ` Sukadev Bhattiprolu
     [not found]             ` <20090601205233.GB1812-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 21:01               ` Sukadev Bhattiprolu
2009-05-31  0:02   ` [RFC][v3][PATCH 5/7] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
     [not found]     ` <20090531000237.GD4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:40       ` Amerigo Wang
2009-05-31  0:02   ` [RFC][v3][PATCH 6/7] Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found]     ` <20090531000255.GE4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:19       ` Amerigo Wang
     [not found]         ` <20090601081929.GC4381-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-03  0:35           ` Sukadev Bhattiprolu
     [not found]             ` <20090603003522.GA22704-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-04  3:07               ` Oren Laadan
     [not found]                 ` <Pine.LNX.4.64.0906032306240.25421-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2009-06-04  7:21                   ` Amerigo Wang
     [not found]                     ` <20090604072112.GA6856-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-04 15:41                       ` Sukadev Bhattiprolu
2009-05-31  0:03   ` [RFC][v3][PATCH 7/7] Define clone_with_pids syscall Sukadev Bhattiprolu
     [not found]     ` <20090531000350.GF4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-31 17:59       ` Oren Laadan [this message]
2009-06-01 15:16       ` Serge E. Hallyn
     [not found]         ` <20090601151650.GA20295-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 16:30           ` Oren Laadan
     [not found]             ` <4A240213.70001-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 16:42               ` Serge E. Hallyn
     [not found]                 ` <20090601164232.GA23252-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 16:54                   ` Oren Laadan
     [not found]                     ` <4A2407B2.8030304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 17:19                       ` Serge E. Hallyn
     [not found]                         ` <20090601171943.GA23878-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 19:35                           ` Oren Laadan
     [not found]                             ` <4A242D66.2070907-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 19:54                               ` Serge E. Hallyn
2009-06-01 16:58               ` Oren Laadan
2009-06-13 18:18       ` Sukadev Bhattiprolu
     [not found]         ` <20090613181838.GA2775-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-13 20:22           ` Oren Laadan
2009-06-01  8:16   ` [RFC][v3][PATCH 1/7] Factor out code to allocate pidmap page Amerigo Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A22C597.8060207@cs.columbia.edu \
    --to=orenl-eqauephvms7envbuuze7ea@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org \
    --cc=sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.