* [RFC][patch 00/21] PID Virtualization: Overview and Patches
@ 2005-12-15 14:35 Hubertus Franke
  2005-12-15 14:35 ` [RFC][patch 01/21] PID Virtualization: const parameter for process group Hubertus Franke
                   ` (21 more replies)
  0 siblings, 22 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:35 UTC (permalink / raw)
  To: linux-kernel

This patchset is a followup to the posting by Serge.
http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2

This patchset provides the pid virtualization mentioned in Serge's
posting.

> I'm part of a project implementing checkpoint/restart processes.
> After a process or group of processes is checkpointed, killed, and
> restarted, the changing of pids could confuse them.  There are many
> other such issues, but we wanted to start with pids.
>
> This patchset introduces functions to access task->pid and ->tgid,
> and updates ->pid accessors to use the functions.  This is in
> preparation for a subsequent patchset which will separate the kernel
> and virtualized pidspaces.  This will allow us to virtualize pids
> from users' pov, so that, for instance, a checkpointed set of
> processes could be restarted with particular pids.  Even though their
> kernel pids may already be in use by new processes, the checkpointed
> processes can be started in a new user pidspace with their old
> virtual pid.  This also gives vserver a simpler way to fake vserver
> init processes as pid 1.  Note that this does not change the kernel's
> internal idea of pids, only what users see.
>
> The first 12 patches change all locations which access ->pid and
> ->tgid to use the inlined functions.  The last patch actually
> introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> to __pid and __tgid to make sure any uncaught users error out.
>
> Does something like this, presumably after much working over, seem
> mergeable?

These patches build on top of Serge's posted patches (if necessary,
we can repost them here).

PID Virtualization is based on the concept of a container.
The ultimate goal is to checkpoint/restart containers. 

The mechanism to start a container is
'echo "container_name" > /proc/container', which creates a new
container and associates the calling process with it. All subsequently
forked tasks then belong to that container.
There is a separate pid space associated with each container.
Only processes/tasks belonging to the same container "see" each other.
The exception is an implied default system container that has
a global view.
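
The visibility rule above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: the struct and
function names below are hypothetical and do not appear in the patches,
and the real implementation lives in the kernel.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from the patches. */
struct container {
	const char *name;
	bool is_default;	/* the implied system container with a global view */
};

struct task {
	int kernel_pid;			/* globally unique pid */
	int virtual_pid;		/* pid as seen inside the task's container */
	const struct container *cont;
};

/*
 * A viewer "sees" a target if both belong to the same container,
 * or if the viewer lives in the default system container.
 */
static bool task_visible(const struct task *viewer, const struct task *target)
{
	if (viewer->cont->is_default)
		return true;
	return viewer->cont == target->cont;
}
```

Note that two tasks in different containers may carry the same virtual
pid in this model, since each container has its own pid space.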

The following patches accomplish 4 things:
1) identify the locations at the user/kernel boundary where pids and
   related ids (pgrp, session ids, ...) need to be (de-)virtualized and
   call the appropriate (de-)virtualization functions.
2) provide the virtualization implementation in these functions.
3) implement a container object and a simple /proc interface to create one.
4) provide a per-container /proc/fs.

-- Hubertus Franke    (frankeh@watson.ibm.com)
-- Cedric Le Goater   (clg@fr.ibm.com)
-- Serge E Hallyn     (serue@us.ibm.com)
-- Dave Hansen        (haveblue@us.ibm.com)


* [RFC][patch 01/21] PID Virtualization: const parameter for process group
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
@ 2005-12-15 14:35 ` Hubertus Franke
  2005-12-15 14:35 ` [RFC][patch 02/21] PID Virtualization: task virtual pid access functions Hubertus Franke
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:35 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F1-const-task-parameter.patch --]
[-- Type: text/plain, Size: 881 bytes --]

Change the parameter of the process_group() access function to const.
We try to be more diligent with the "const" attribute in this series.
As a result, not introducing const for this function would
result in many compiler warnings.
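
A minimal sketch of why the const qualifier matters, assuming reduced
stand-in structures (the *_model names are hypothetical and exist only
for this illustration): a const-correct caller can only pass its pointer
on without a warning if the callee also accepts a const pointer.

```c
#include <sys/types.h>

/* Hypothetical stand-ins for the kernel structures, reduced to one field. */
struct signal_model {
	pid_t pgrp;
};

struct task_model {
	struct signal_model *signal;
};

/* Accessor taking a const pointer, mirroring the change in this patch. */
static inline pid_t process_group_model(const struct task_model *tsk)
{
	return tsk->signal->pgrp;
}

/*
 * A const-correct caller. If process_group_model() took a non-const
 * pointer, passing `p` here would discard the qualifier and the
 * compiler would warn.
 */
static inline pid_t virt_process_group_model(const struct task_model *p)
{
	return process_group_model(p);
}
```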

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/sched.h |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-11-30 18:07:18.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-11-30 18:08:02.000000000 -0500
@@ -860,7 +860,7 @@ struct task_struct {
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 };
 
-static inline pid_t process_group(struct task_struct *tsk)
+static inline pid_t process_group(const struct task_struct *tsk)
 {
 	return tsk->signal->pgrp;
 }

--


* [RFC][patch 02/21] PID Virtualization: task virtual pid access functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
  2005-12-15 14:35 ` [RFC][patch 01/21] PID Virtualization: const parameter for process group Hubertus Franke
@ 2005-12-15 14:35 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 03/21] PID Virtualization: return virtual pids where required Hubertus Franke
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:35 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F2-define-task-virt-access-functions.patch --]
[-- Type: text/plain, Size: 1182 bytes --]

Introduce task access functions for the virtual pid domain
for pid/ppid/tgid/process_group/sessionids

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/sched.h |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+)

Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-11-30 18:08:02.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-11-30 18:08:02.000000000 -0500
@@ -888,6 +888,26 @@ static inline pid_t task_tgid(const stru
 	return p->__tgid;
 }
 
+static inline pid_t task_vpid(const struct task_struct *p)
+{
+	return task_pid(p);
+}
+
+static inline pid_t task_vppid(const struct task_struct *p)
+{
+	return task_pid(p->parent);
+}
+
+static inline pid_t task_vtgid(const struct task_struct *p)
+{
+	return task_tgid(p);
+}
+
+static inline pid_t virt_process_group(const struct task_struct *p)
+{
+	return process_group(p);
+}
+
 extern void free_task(struct task_struct *tsk);
 extern void __put_task_struct(struct task_struct *tsk);
 #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)

--


* [RFC][patch 03/21] PID Virtualization: return virtual pids where required
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
  2005-12-15 14:35 ` [RFC][patch 01/21] PID Virtualization: const parameter for process group Hubertus Franke
  2005-12-15 14:35 ` [RFC][patch 02/21] PID Virtualization: task virtual pid access functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 04/21] PID Virtualization: return virtual process group ids Hubertus Franke
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F3-replace-pid-kernel-access-with-virt-access.patch --]
[-- Type: text/plain, Size: 9570 bytes --]

In this patch we identify where in the kernel code a virtual pid (etc.)
conceptually needs to be returned to userspace. This is at the
kernel/user interfaces. We need to identify all locations where
pids are returned; broadly, they fall into 3 categories:
(a) syscall return parameter,
(b) syscall return code,
(c) through a data structure filled in by a syscall
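
Seen from userspace, two of these categories can be observed with
standard POSIX calls. The sketch below is an illustration, not part of
the patch: fork() hands back the child's pid as a syscall return code,
and waitid() reports the same pid through a siginfo_t structure that the
kernel fills in. The helper name pid_paths_agree() is hypothetical, and
the out-parameter case (for example the CLONE_PARENT_SETTID pointer
touched in kernel/fork.c) is not shown.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child's pid is reported consistently by fork()
 * (a syscall return code) and by waitid() (via siginfo_t, a data
 * structure filled in by the syscall). Returns 0 on a mismatch and
 * -1 on error.
 */
static int pid_paths_agree(void)
{
	siginfo_t info;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0)
		_exit(0);	/* child: exit immediately */

	info.si_pid = 0;
	if (waitid(P_PID, (id_t)child, &info, WEXITED) != 0)
		return -1;
	return info.si_pid == child;
}
```

Both values must come from the same pid domain, which is why every such
exit point has to be converted together.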

The process_group virtualization will be done in the following patch.

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 arch/ia64/kernel/signal.c |    2 +-
 fs/binfmt_elf.c           |    8 ++++----
 fs/proc/array.c           |    8 ++++----
 fs/proc/base.c            |    8 ++++----
 kernel/exit.c             |    4 ++--
 kernel/fork.c             |    4 ++--
 kernel/sched.c            |    2 +-
 kernel/signal.c           |   10 +++++-----
 kernel/timer.c            |    4 ++--
 9 files changed, 25 insertions(+), 25 deletions(-)

Index: linux-2.6.15-rc1/arch/ia64/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/arch/ia64/kernel/signal.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/arch/ia64/kernel/signal.c	2005-12-12 15:24:27.000000000 -0500
@@ -270,7 +270,7 @@ ia64_rt_sigreturn (struct sigscratch *sc
 	si.si_signo = SIGSEGV;
 	si.si_errno = 0;
 	si.si_code = SI_KERNEL;
-	si.si_pid = task_pid(current);
+	si.si_pid = task_vpid(current);
 	si.si_uid = current->uid;
 	si.si_addr = sc;
 	force_sig_info(SIGSEGV, &si, current);
Index: linux-2.6.15-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/binfmt_elf.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/fs/binfmt_elf.c	2005-12-12 16:11:37.000000000 -0500
@@ -1270,8 +1270,8 @@ static void fill_prstatus(struct elf_prs
 	prstatus->pr_info.si_signo = prstatus->pr_cursig = signr;
 	prstatus->pr_sigpend = p->pending.signal.sig[0];
 	prstatus->pr_sighold = p->blocked.sig[0];
-	prstatus->pr_pid = task_pid(p);
-	prstatus->pr_ppid = task_pid(p->parent);
+	prstatus->pr_pid = task_vpid(p);
+	prstatus->pr_ppid = task_vppid(p);
 	prstatus->pr_pgrp = process_group(p);
 	prstatus->pr_sid = p->signal->session;
 	if (thread_group_leader(p)) {
@@ -1316,8 +1316,8 @@ static int fill_psinfo(struct elf_prpsin
 			psinfo->pr_psargs[i] = ' ';
 	psinfo->pr_psargs[len] = 0;
 
-	psinfo->pr_pid = task_pid(p);
-	psinfo->pr_ppid = task_pid(p->parent);
+	psinfo->pr_pid = task_vpid(p);
+	psinfo->pr_ppid = task_vppid(p);
 	psinfo->pr_pgrp = process_group(p);
 	psinfo->pr_sid = p->signal->session;
 
Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-12-12 16:11:54.000000000 -0500
@@ -174,10 +174,10 @@ static inline char * task_state(struct t
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		(p->sleep_avg/1024)*100/(1020000000/1024),
-	       	task_tgid(p),
-		task_pid(p), pid_alive(p) ?
-			task_tgid(p->group_leader->real_parent) : 0,
-		pid_alive(p) && p->ptrace ? task_pid(p->parent) : 0,
+	       	task_vtgid(p),
+		task_vpid(p), pid_alive(p) ?
+			task_vtgid(p->group_leader->real_parent) : 0,
+		pid_alive(p) && p->ptrace ? task_vpid(p->parent) : 0,
 		p->uid, p->euid, p->suid, p->fsuid,
 		p->gid, p->egid, p->sgid, p->fsgid);
 	read_unlock(&tasklist_lock);
@@ -390,7 +390,7 @@ static int do_task_stat(struct task_stru
 		it_real_value = task->signal->it_real_value;
 	}
 	ppid = pid_alive(task) ?
-		task_tgid(task->group_leader->real_parent) : 0;
+		task_vtgid(task->group_leader->real_parent) : 0;
 	read_unlock(&tasklist_lock);
 
 	if (!whole || num_threads<2)
@@ -417,7 +417,7 @@ static int do_task_stat(struct task_stru
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
 %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
-		task_pid(task),
+		task_vpid(task),
 		tcomm,
 		state,
 		ppid,
Index: linux-2.6.15-rc1/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/base.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/base.c	2005-12-12 16:07:05.000000000 -0500
@@ -1878,14 +1878,14 @@ static int proc_self_readlink(struct den
 			      int buflen)
 {
 	char tmp[30];
-	sprintf(tmp, "%d", task_tgid(current));
+	sprintf(tmp, "%d", task_vtgid(current));
 	return vfs_readlink(dentry,buffer,buflen,tmp);
 }
 
 static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	char tmp[30];
-	sprintf(tmp, "%d", task_tgid(current));
+	sprintf(tmp, "%d", task_vtgid(current));
 	return ERR_PTR(vfs_follow_link(nd,tmp));
 }	
 
@@ -2100,7 +2100,7 @@ static int get_tgid_list(int index, unsi
 		p = next_task(&init_task);
 
 	for ( ; p != &init_task; p = next_task(p)) {
-		int tgid = task_pid(p);
+		int tgid = task_vpid(p);
 		if (!pid_alive(p))
 			continue;
 		if (--index >= 0)
@@ -2133,7 +2133,7 @@ static int get_tid_list(int index, unsig
 	 * via next_thread().
 	 */
 	if (pid_alive(task)) do {
-		int tid = task_pid(task);
+		int tid = task_vpid(task);
 
 		if (--index >= 0)
 			continue;
Index: linux-2.6.15-rc1/kernel/exit.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/exit.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/exit.c	2005-12-12 16:07:05.000000000 -0500
@@ -1143,7 +1143,7 @@ static int wait_task_zombie(task_t *p, i
 		p->exit_state = EXIT_ZOMBIE;
 		return retval;
 	}
-	retval = task_pid(p);
+	retval = task_vpid(p);
 	if (p->real_parent != p->parent) {
 		write_lock_irq(&tasklist_lock);
 		/* Double-check with lock held.  */
@@ -1278,7 +1278,7 @@ bail_ref:
 	if (!retval && infop)
 		retval = put_user(p->uid, &infop->si_uid);
 	if (!retval)
-		retval = task_pid(p);
+		retval = task_vpid(p);
 	put_task_struct(p);
 
 	BUG_ON(!retval);
Index: linux-2.6.15-rc1/kernel/fork.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/fork.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/fork.c	2005-12-12 16:07:34.000000000 -0500
@@ -850,7 +850,7 @@ asmlinkage long sys_set_tid_address(int 
 {
 	current->clear_child_tid = tidptr;
 
-	return task_pid(current);
+	return task_vpid(current);
 }
 
 /*
@@ -930,7 +930,7 @@ static task_t *copy_process(unsigned lon
 	p->__pid = pid;
 	retval = -EFAULT;
 	if (clone_flags & CLONE_PARENT_SETTID)
-		if (put_user(task_pid(p), parent_tidptr))
+		if (put_user(task_vpid(p), parent_tidptr))
 			goto bad_fork_cleanup;
 
 	p->proc_dentry = NULL;
Index: linux-2.6.15-rc1/kernel/sched.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sched.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sched.c	2005-12-12 16:07:05.000000000 -0500
@@ -1653,7 +1653,7 @@ asmlinkage void schedule_tail(task_t *pr
 	preempt_enable();
 #endif
 	if (current->set_child_tid)
-		put_user(task_pid(current), current->set_child_tid);
+		put_user(task_vpid(current), current->set_child_tid);
 }
 
 /*
Index: linux-2.6.15-rc1/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/signal.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/signal.c	2005-12-12 16:07:05.000000000 -0500
@@ -809,7 +809,7 @@ static int send_signal(int sig, struct s
 			q->info.si_signo = sig;
 			q->info.si_errno = 0;
 			q->info.si_code = SI_USER;
-			q->info.si_pid = task_pid(current);
+			q->info.si_pid = task_vpid(current);
 			q->info.si_uid = current->uid;
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
@@ -1478,7 +1478,7 @@ void do_notify_parent(struct task_struct
 
 	info.si_signo = sig;
 	info.si_errno = 0;
-	info.si_pid = task_pid(tsk);
+	info.si_pid = task_vpid(tsk);
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1543,7 +1543,7 @@ static void do_notify_parent_cldstop(str
 
 	info.si_signo = SIGCHLD;
 	info.si_errno = 0;
-	info.si_pid = task_pid(tsk);
+	info.si_pid = task_vpid(tsk);
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -2254,7 +2254,7 @@ sys_kill(int pid, int sig)
 	info.si_signo = sig;
 	info.si_errno = 0;
 	info.si_code = SI_USER;
-	info.si_pid = task_tgid(current);
+	info.si_pid = task_vtgid(current);
 	info.si_uid = current->uid;
 
 	return kill_something_info(sig, &info, pid);
@@ -2270,7 +2270,7 @@ static int do_tkill(int tgid, int pid, i
 	info.si_signo = sig;
 	info.si_errno = 0;
 	info.si_code = SI_TKILL;
-	info.si_pid = task_tgid(current);
+	info.si_pid = task_vtgid(current);
 	info.si_uid = current->uid;
 
 	read_lock(&tasklist_lock);
Index: linux-2.6.15-rc1/kernel/timer.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/timer.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/timer.c	2005-12-12 16:06:39.000000000 -0500
@@ -941,7 +941,7 @@ asmlinkage unsigned long sys_alarm(unsig
  */
 asmlinkage long sys_getpid(void)
 {
-	return task_tgid(current);
+	return task_vtgid(current);
 }
 
 /*
@@ -1115,7 +1115,7 @@ EXPORT_SYMBOL(schedule_timeout_uninterru
 /* Thread ID - the internal kernel "pid" */
 asmlinkage long sys_gettid(void)
 {
-	return task_pid(current);
+	return task_vpid(current);
 }
 
 static long __sched nanosleep_restart(struct restart_block *restart)

--


* [RFC][patch 04/21] PID Virtualization: return virtual process group ids
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (2 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 03/21] PID Virtualization: return virtual pids where required Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 05/21] PID Virtualization: code enhancements for virtual pids in /proc Hubertus Franke
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F4-replace-process-group-access-with-virt-access.patch --]
[-- Type: text/plain, Size: 3439 bytes --]

In this patch we identify where in the kernel code a virtual
process group conceptually needs to be returned to userspace. This is
simply an extension of the previous patch, which only dealt with
identifying the locations where virtual pid/tgid/ppid values are returned.

As in that patch, these locations are at the kernel/user interfaces,
and broadly they fall into 3 categories:
(a) syscall return parameter,
(b) syscall return code,
(c) through a data structure filled in by a syscall

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/binfmt_elf.c |    4 ++--
 fs/proc/array.c |    2 +-
 kernel/sys.c    |    8 ++++----
 3 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6.15-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/binfmt_elf.c	2005-11-30 18:08:02.000000000 -0500
+++ linux-2.6.15-rc1/fs/binfmt_elf.c	2005-11-30 18:08:03.000000000 -0500
@@ -1272,7 +1272,7 @@ static void fill_prstatus(struct elf_prs
 	prstatus->pr_sighold = p->blocked.sig[0];
 	prstatus->pr_pid = task_vpid(p);
 	prstatus->pr_ppid = task_vppid(p);
-	prstatus->pr_pgrp = process_group(p);
+	prstatus->pr_pgrp = virt_process_group(p);
 	prstatus->pr_sid = p->signal->session;
 	if (thread_group_leader(p)) {
 		/*
@@ -1318,7 +1318,7 @@ static int fill_psinfo(struct elf_prpsin
 
 	psinfo->pr_pid = task_vpid(p);
 	psinfo->pr_ppid = task_vppid(p);
-	psinfo->pr_pgrp = process_group(p);
+	psinfo->pr_pgrp = virt_process_group(p);
 	psinfo->pr_sid = p->signal->session;
 
 	i = p->state ? ffz(~p->state) + 1 : 0;
Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-11-30 18:08:02.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-11-30 18:08:03.000000000 -0500
@@ -374,7 +374,7 @@ static int do_task_stat(struct task_stru
 			tty_pgrp = task->signal->tty->pgrp;
 			tty_nr = new_encode_dev(tty_devnum(task->signal->tty));
 		}
-		pgid = process_group(task);
+		pgid = virt_process_group(task);
 		sid = task->signal->session;
 		cmin_flt = task->signal->cmin_flt;
 		cmaj_flt = task->signal->cmaj_flt;
Index: linux-2.6.15-rc1/kernel/sys.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sys.c	2005-11-30 18:07:18.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sys.c	2005-11-30 18:08:03.000000000 -0500
@@ -1154,7 +1154,7 @@ out:
 asmlinkage long sys_getpgid(pid_t pid)
 {
 	if (!pid) {
-		return process_group(current);
+		return virt_process_group(current);
 	} else {
 		int retval;
 		struct task_struct *p;
@@ -1166,7 +1166,7 @@ asmlinkage long sys_getpgid(pid_t pid)
 		if (p) {
 			retval = security_task_getpgid(p);
 			if (!retval)
-				retval = process_group(p);
+				retval = virt_process_group(p);
 		}
 		read_unlock(&tasklist_lock);
 		return retval;
@@ -1178,7 +1178,7 @@ asmlinkage long sys_getpgid(pid_t pid)
 asmlinkage long sys_getpgrp(void)
 {
 	/* SMP - assuming writes are word atomic this is fine */
-	return process_group(current);
+	return virt_process_group(current);
 }
 
 #endif
@@ -1224,7 +1224,7 @@ asmlinkage long sys_setsid(void)
 	__set_special_pids(task_pid(current), task_pid(current));
 	current->signal->tty = NULL;
 	current->signal->tty_old_pgrp = 0;
-	err = process_group(current);
+	err = virt_process_group(current);
 out:
 	write_unlock_irq(&tasklist_lock);
 	up(&tty_sem);

--


* [RFC][patch 05/21] PID Virtualization: code enhancements for virtual pids in /proc
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (3 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 04/21] PID Virtualization: return virtual process group ids Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 06/21] PID Virtualization: Define pid_to_vpid functions Hubertus Franke
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F5-code-cleanup-procarray.patch --]
[-- Type: text/plain, Size: 1594 bytes --]

To avoid ugly parameter specifications for the sprintf statement,
we pull the ppid and tpid computations out. Later these statements
will get a tiny bit more elaborate, because we need to deal with
the special case of an illegal task_vvpid (not in the same container)
virtualization. This is simply in preparation for that.
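
The shape of this cleanup, sketched in userspace under stated
assumptions (struct task_model and format_ids() are hypothetical names
for this illustration): the conditional values are computed into locals
first, so the format call only receives plain variables and the
conditions have room to grow later.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields task_state() consults. */
struct task_model {
	bool alive;		/* models pid_alive(p) */
	bool traced;		/* models p->ptrace being set */
	int parent_tgid;
	int tracer_pid;
};

static int format_ids(char *buf, size_t len, const struct task_model *p)
{
	int ppid, tpid;

	/* Hoisted out of the argument list, as in the patch. */
	if (p->alive)
		ppid = p->parent_tgid;
	else
		ppid = 0;
	if (p->alive && p->traced)
		tpid = p->tracer_pid;
	else
		tpid = 0;

	return snprintf(buf, len, "PPid:\t%d\nTracerPid:\t%d\n", ppid, tpid);
}
```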

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/proc/array.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-12-12 16:12:09.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-12-12 16:13:36.000000000 -0500
@@ -161,8 +161,17 @@ static inline char * task_state(struct t
 	struct group_info *group_info;
 	int g;
 	struct fdtable *fdt = NULL;
+	pid_t ppid, tpid;
 
 	read_lock(&tasklist_lock);
+	if (pid_alive(p))
+		ppid = task_vtgid(p->group_leader->real_parent);
+	else
+		ppid = 0;
+	if (pid_alive(p) && p->ptrace)
+		tpid = task_vppid(p);
+	else
+		tpid = 0;
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
 		"SleepAVG:\t%lu%%\n"
@@ -175,9 +184,8 @@ static inline char * task_state(struct t
 		get_task_state(p),
 		(p->sleep_avg/1024)*100/(1020000000/1024),
 	       	task_vtgid(p),
-		task_vpid(p), pid_alive(p) ?
-			task_vtgid(p->group_leader->real_parent) : 0,
-		pid_alive(p) && p->ptrace ? task_vpid(p->parent) : 0,
+		task_vpid(p),
+		ppid, tpid,
 		p->uid, p->euid, p->suid, p->fsuid,
 		p->gid, p->egid, p->sgid, p->fsgid);
 	read_unlock(&tasklist_lock);

--


* [RFC][patch 06/21] PID Virtualization: Define pid_to_vpid functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (4 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 05/21] PID Virtualization: code enhancements for virtual pids in /proc Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 07/21] PID Virtualization: Use pid_to_vpid conversion functions Hubertus Franke
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F6-define-pid-to-vpid-translation.patch --]
[-- Type: text/plain, Size: 1188 bytes --]

In this patch we introduce conversion functions to
translate pids into virtual pids. These are just the APIs,
not the implementation yet.
Subsequent patches will utilize these internal functions
to rewrite the task virtual pid/ppid/tgid access functions,
such that finally we only have to rewrite these virtual
conversion functions to actually obtain the pid virtualization.
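
The layering described here can be sketched as follows. This is a
userspace illustration only: the *_model names are hypothetical, and the
offset-based mapping is a placeholder invented for the example, not the
translation scheme of this series (the patch itself uses an identity
mapping, which corresponds to an offset of 0).

```c
#include <sys/types.h>

/* Hypothetical knob for the example; 0 gives the identity mapping. */
static pid_t model_vpid_offset = 0;

/* The single conversion point between the kernel and user pid domains. */
static inline pid_t pid_to_vpid_model(pid_t pid)
{
	return pid - model_vpid_offset;
}

struct task_model {
	pid_t pid;
	pid_t tgid;
};

/* Accessors are written once in terms of the conversion function. */
static inline pid_t task_vpid_model(const struct task_model *p)
{
	return pid_to_vpid_model(p->pid);
}

static inline pid_t task_vtgid_model(const struct task_model *p)
{
	return pid_to_vpid_model(p->tgid);
}
```

Changing the behaviour of pid_to_vpid_model() changes what every
accessor returns, without touching the accessors or their callers.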

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/sched.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-11-30 18:08:02.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-11-30 18:08:03.000000000 -0500
@@ -866,6 +866,20 @@ static inline pid_t process_group(const 
 }
 
 /**
+ *  pid domain translation functions:
+ *	- from kernel to user pid domain
+ */
+static inline pid_t pid_to_vpid(pid_t pid)
+{
+	return pid;
+}
+
+static inline pid_t pgid_to_vpgid(pid_t pid)
+{
+	return pid;
+}
+
+/**
  * pid_alive - check that a task structure is not stale
  * @p: Task structure to be checked.
  *

--


* [RFC][patch 07/21] PID Virtualization: Use pid_to_vpid conversion functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (5 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 06/21] PID Virtualization: Define pid_to_vpid functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 08/21] PID Virtualization: file owner pid virtualization Hubertus Franke
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F7-pid-to-vpid-translation.patch --]
[-- Type: text/plain, Size: 6710 bytes --]

Utilize the pid_to_vpid translation function
to return a virtual pid to userspace.
It needs to be applied where the previously defined task
access functions cannot be used.
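
A typical case is a pid that was recorded earlier in some kernel object,
so that no task pointer is at hand when the value is reported. A rough
userspace sketch, assuming hypothetical *_model names and the identity
mapping from the previous patch: the object keeps the kernel pid, and
the translation happens only when the value is copied out.

```c
#include <sys/types.h>

/* Identity stub, mirroring the conversion function of the previous patch. */
static inline pid_t pid_to_vpid_model(pid_t pid)
{
	return pid;
}

/* Hypothetical kernel-side object that remembers a raw kernel pid. */
struct msq_model {
	pid_t q_lspid;		/* kernel pid of the last sender */
};

/* Hypothetical user-visible record filled in on a stat request. */
struct msqid_user_model {
	pid_t msg_lspid;	/* virtual pid reported to userspace */
};

/* Translate at the boundary; the stored kernel pid is left untouched. */
static void msq_stat_model(const struct msq_model *msq,
			   struct msqid_user_model *out)
{
	out->msg_lspid = pid_to_vpid_model(msq->q_lspid);
}
```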

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 drivers/char/tty_io.c |    2 +-
 fs/binfmt_elf.c       |    4 ++--
 fs/proc/array.c       |    4 ++--
 ipc/msg.c             |    8 ++++----
 ipc/shm.c             |    8 ++++----
 kernel/fork.c         |    9 ++++++---
 kernel/sys.c          |    4 ++--
 7 files changed, 21 insertions(+), 18 deletions(-)

Index: linux-2.6.15-rc1/drivers/char/tty_io.c
===================================================================
--- linux-2.6.15-rc1.orig/drivers/char/tty_io.c	2005-12-12 18:37:36.000000000 -0500
+++ linux-2.6.15-rc1/drivers/char/tty_io.c	2005-12-12 18:46:39.000000000 -0500
@@ -2158,7 +2158,7 @@ static int tiocgpgrp(struct tty_struct *
 	 */
 	if (tty == real_tty && current->signal->tty != real_tty)
 		return -ENOTTY;
-	return put_user(real_tty->pgrp, p);
+	return put_user(pid_to_vpid(real_tty->pgrp), p);
 }
 
 static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-12-12 18:37:42.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-12-12 18:49:10.000000000 -0500
@@ -379,11 +379,11 @@ static int do_task_stat(struct task_stru
 	}
 	if (task->signal) {
 		if (task->signal->tty) {
-			tty_pgrp = task->signal->tty->pgrp;
+			tty_pgrp = pid_to_vpid(task->signal->tty->pgrp);
 			tty_nr = new_encode_dev(tty_devnum(task->signal->tty));
 		}
 		pgid = virt_process_group(task);
-		sid = task->signal->session;
+		sid = pid_to_vpid(task->signal->session);
 		cmin_flt = task->signal->cmin_flt;
 		cmaj_flt = task->signal->cmaj_flt;
 		cutime = task->signal->cutime;
Index: linux-2.6.15-rc1/ipc/msg.c
===================================================================
--- linux-2.6.15-rc1.orig/ipc/msg.c	2005-12-12 18:37:36.000000000 -0500
+++ linux-2.6.15-rc1/ipc/msg.c	2005-12-12 18:46:38.000000000 -0500
@@ -416,8 +416,8 @@ asmlinkage long sys_msgctl (int msqid, i
 		tbuf.msg_cbytes = msq->q_cbytes;
 		tbuf.msg_qnum   = msq->q_qnum;
 		tbuf.msg_qbytes = msq->q_qbytes;
-		tbuf.msg_lspid  = msq->q_lspid;
-		tbuf.msg_lrpid  = msq->q_lrpid;
+		tbuf.msg_lspid  = pid_to_vpid(msq->q_lspid);
+		tbuf.msg_lrpid  = pid_to_vpid(msq->q_lrpid);
 		msg_unlock(msq);
 		if (copy_msqid_to_user(buf, &tbuf, version))
 			return -EFAULT;
@@ -821,8 +821,8 @@ static int sysvipc_msg_proc_show(struct 
 			  msq->q_perm.mode,
 			  msq->q_cbytes,
 			  msq->q_qnum,
-			  msq->q_lspid,
-			  msq->q_lrpid,
+			  pid_to_vpid(msq->q_lspid),
+			  pid_to_vpid(msq->q_lrpid),
 			  msq->q_perm.uid,
 			  msq->q_perm.gid,
 			  msq->q_perm.cuid,
Index: linux-2.6.15-rc1/ipc/shm.c
===================================================================
--- linux-2.6.15-rc1.orig/ipc/shm.c	2005-12-12 18:37:36.000000000 -0500
+++ linux-2.6.15-rc1/ipc/shm.c	2005-12-12 18:46:38.000000000 -0500
@@ -508,8 +508,8 @@ asmlinkage long sys_shmctl (int shmid, i
 		tbuf.shm_atime	= shp->shm_atim;
 		tbuf.shm_dtime	= shp->shm_dtim;
 		tbuf.shm_ctime	= shp->shm_ctim;
-		tbuf.shm_cpid	= shp->shm_cprid;
-		tbuf.shm_lpid	= shp->shm_lprid;
+		tbuf.shm_cpid	= pid_to_vpid(shp->shm_cprid);
+		tbuf.shm_lpid	= pid_to_vpid(shp->shm_lprid);
 		if (!is_file_hugepages(shp->shm_file))
 			tbuf.shm_nattch	= shp->shm_nattch;
 		else
@@ -896,8 +896,8 @@ static int sysvipc_shm_proc_show(struct 
 			  shp->id,
 			  shp->shm_flags,
 			  shp->shm_segsz,
-			  shp->shm_cprid,
-			  shp->shm_lprid,
+			  pid_to_vpid(shp->shm_cprid),
+			  pid_to_vpid(shp->shm_lprid),
 			  is_file_hugepages(shp->shm_file) ? (file_count(shp->shm_file) - 1) : shp->shm_nattch,
 			  shp->shm_perm.uid,
 			  shp->shm_perm.gid,
Index: linux-2.6.15-rc1/kernel/fork.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/fork.c	2005-12-12 18:37:41.000000000 -0500
+++ linux-2.6.15-rc1/kernel/fork.c	2005-12-12 18:46:38.000000000 -0500
@@ -1241,6 +1241,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long pid = alloc_pidmap();
+	long vpid;
 
 	if (pid < 0)
 		return -EAGAIN;
@@ -1271,13 +1272,15 @@ long do_fork(unsigned long clone_flags,
 			set_tsk_thread_flag(p, TIF_SIGPENDING);
 		}
 
+		vpid = pid_to_vpid(pid);
+
 		if (!(clone_flags & CLONE_STOPPED))
 			wake_up_new_task(p, clone_flags);
 		else
 			p->state = TASK_STOPPED;
 
 		if (unlikely (trace)) {
-			current->ptrace_message = pid;
+			current->ptrace_message = vpid;
 			ptrace_notify ((trace << 8) | SIGTRAP);
 		}
 
@@ -1288,9 +1291,9 @@ long do_fork(unsigned long clone_flags,
 		}
 	} else {
 		free_pidmap(pid);
-		pid = PTR_ERR(p);
+		vpid = PTR_ERR(p);
 	}
-	return pid;
+	return vpid;
 }
 
 void __init proc_caches_init(void)
Index: linux-2.6.15-rc1/kernel/sys.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sys.c	2005-12-12 18:37:41.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sys.c	2005-12-12 18:50:31.000000000 -0500
@@ -1186,7 +1186,7 @@ asmlinkage long sys_getpgrp(void)
 asmlinkage long sys_getsid(pid_t pid)
 {
 	if (!pid) {
-		return current->signal->session;
+		return pid_to_vpid(current->signal->session);
 	} else {
 		int retval;
 		struct task_struct *p;
@@ -1198,7 +1198,7 @@ asmlinkage long sys_getsid(pid_t pid)
 		if(p) {
 			retval = security_task_getsid(p);
 			if (!retval)
-				retval = p->signal->session;
+				retval = pid_to_vpid(p->signal->session);
 		}
 		read_unlock(&tasklist_lock);
 		return retval;
Index: linux-2.6.15-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/binfmt_elf.c	2005-12-12 18:37:41.000000000 -0500
+++ linux-2.6.15-rc1/fs/binfmt_elf.c	2005-12-12 18:47:53.000000000 -0500
@@ -1273,7 +1273,7 @@ static void fill_prstatus(struct elf_prs
 	prstatus->pr_pid = task_vpid(p);
 	prstatus->pr_ppid = task_vppid(p);
 	prstatus->pr_pgrp = virt_process_group(p);
-	prstatus->pr_sid = p->signal->session;
+	prstatus->pr_sid = pid_to_vpid(p->signal->session);
 	if (thread_group_leader(p)) {
 		/*
 		 * This is the record for the group leader.  Add in the
@@ -1319,7 +1319,7 @@ static int fill_psinfo(struct elf_prpsin
 	psinfo->pr_pid = task_vpid(p);
 	psinfo->pr_ppid = task_vppid(p);
 	psinfo->pr_pgrp = virt_process_group(p);
-	psinfo->pr_sid = p->signal->session;
+	psinfo->pr_sid = pid_to_vpid(p->signal->session);
 
 	i = p->state ? ffz(~p->state) + 1 : 0;
 	psinfo->pr_state = i;

--

* [RFC][patch 08/21] PID Virtualization: file owner pid virtualization
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (6 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 07/21] PID Virtualization: Use pid_to_vpid conversion functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 09/21] PID Virtualization: define vpid_to_pid functions Hubertus Franke
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F8-pgid-to-vpgid-translation.patch --]
[-- Type: text/plain, Size: 847 bytes --]

Use the internal pgid_to_vpgid function to translate the file owner
id, which may denote a process group. The translation applies to the
owner value that is returned to user space through the fcntl
system call (F_GETOWN).

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/fcntl.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.15-rc1/fs/fcntl.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/fcntl.c	2005-12-14 15:12:28.000000000 -0500
+++ linux-2.6.15-rc1/fs/fcntl.c	2005-12-14 15:13:55.000000000 -0500
@@ -316,7 +316,7 @@ static long do_fcntl(int fd, unsigned in
 		 * current syscall conventions, the only way
 		 * to fix this will be in libc.
 		 */
-		err = filp->f_owner.pid;
+		err = pgid_to_vpgid(filp->f_owner.pid);
 		force_successful_syscall_return();
 		break;
 	case F_SETOWN:

--

* [RFC][patch 09/21] PID Virtualization: define vpid_to_pid functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (7 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 08/21] PID Virtualization: file owner pid virtualization Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 10/21] PID Virtualization: Use " Hubertus Franke
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: F9-define-vpid-to-pid-translation.patch --]
[-- Type: text/plain, Size: 1234 bytes --]

Introduce the reverse conversion functions, namely from the
user-visible virtual pid to the kernel pid.
As before, this patch only defines the API. Subsequent patches use
the API at the appropriate locations, and a later patch provides
the real implementation of the virtualization behind these
functions, together with the pid_to_vpid conversion.
Any pid passed in from userspace through the syscall interface
is virtual and therefore must pass through this conversion
before it can be used as a kernel pid.
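
To illustrate the intended relationship between the two directions,
here is a minimal user-space sketch. It assumes a table-based pidspace;
the demo_* names and the table layout are hypothetical and are not part
of this patch, where both conversions are still identity functions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical mapping entry: one kernel pid and its virtual pid. */
struct demo_pidmap_entry {
	pid_t kernel_pid;
	pid_t virtual_pid;
};

/* Hypothetical pidspace: a flat table of mapping entries. */
struct demo_pidspace {
	const struct demo_pidmap_entry *entries;
	size_t count;
};

/* Kernel pid -> virtual pid; returns -1 if the pid is not in the table. */
pid_t demo_pid_to_vpid(const struct demo_pidspace *ps, pid_t pid)
{
	size_t i;

	assert(ps != NULL);
	for (i = 0; i < ps->count; i++)
		if (ps->entries[i].kernel_pid == pid)
			return ps->entries[i].virtual_pid;
	return -1;
}

/* Virtual pid -> kernel pid; returns -1 if the vpid is not in the table. */
pid_t demo_vpid_to_pid(const struct demo_pidspace *ps, pid_t vpid)
{
	size_t i;

	assert(ps != NULL);
	for (i = 0; i < ps->count; i++)
		if (ps->entries[i].virtual_pid == vpid)
			return ps->entries[i].kernel_pid;
	return -1;
}
```

For any vpid present in the table, converting to a kernel pid and back
yields the same vpid, which is the property the real implementation
would be expected to preserve.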

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/sched.h |   10 ++++++++++
 1 files changed, 10 insertions(+)

Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-11-30 18:08:03.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-11-30 18:08:04.000000000 -0500
@@ -879,6 +879,16 @@ static inline pid_t pgid_to_vpgid(pid_t 
 	return pid;
 }
 
+static inline pid_t vpid_to_pid(pid_t pid)
+{
+	return pid;
+}
+
+static inline pid_t vpgid_to_pgid(pid_t pid)
+{
+	return pid;
+}
+
 /**
  * pid_alive - check that a task structure is not stale
  * @p: Task structure to be checked.

--

* [RFC][patch 10/21] PID Virtualization: Use vpid_to_pid functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (8 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 09/21] PID Virtualization: define vpid_to_pid functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 11/21] PID Virtualization: use vpgid_to_pgid function Hubertus Franke
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: FA-vpid-to-pid-translation.patch --]
[-- Type: text/plain, Size: 7840 bytes --]

We now use the vpid_to_pid function wherever
a pid is passed in from user space and needs to be converted
into a kernel pid.
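
The recurring pattern at these call sites can be sketched in user space
as follows. This is only a model: the demo_* names, the task array and
the fixed offset standing in for the translation are hypothetical. It
mirrors the shape of find_process_by_pid() in this patch, where 0 means
"the caller" and is therefore never translated.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical task record; only the kernel pid matters here. */
struct demo_task {
	pid_t pid;
};

/* Hypothetical translation: kernel pid = virtual pid + fixed offset. */
enum { DEMO_PID_OFFSET = 1000 };

pid_t demo_vpid_to_pid(pid_t vpid)
{
	return vpid + DEMO_PID_OFFSET;
}

/* Look up a task by kernel pid; NULL if no such task exists. */
const struct demo_task *demo_find_task_by_pid(const struct demo_task *tasks,
					      size_t n, pid_t pid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/*
 * Entry-point pattern: a user-supplied id of 0 selects the caller and
 * is not translated; any other value is virtual and is converted to a
 * kernel pid before the lookup.
 */
const struct demo_task *demo_find_process(const struct demo_task *tasks,
					  size_t n,
					  const struct demo_task *current_task,
					  pid_t who)
{
	assert(current_task != NULL);
	if (!who)
		return current_task;
	return demo_find_task_by_pid(tasks, n, demo_vpid_to_pid(who));
}
```

Passing an untranslated kernel pid from user space to such an entry
point simply fails the lookup, which is the isolation the conversion is
meant to provide.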

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 arch/ia64/kernel/ptrace.c |    1 +
 arch/s390/kernel/ptrace.c |    1 +
 drivers/char/tty_io.c     |    1 +
 fs/proc/base.c            |    2 ++
 kernel/capability.c       |    1 +
 kernel/exit.c             |    2 ++
 kernel/ptrace.c           |    1 +
 kernel/sched.c            |    6 +++++-
 kernel/signal.c           |    3 +++
 kernel/sys.c              |   14 ++++++++++++++
 10 files changed, 31 insertions(+), 1 deletion(-)

Index: linux-2.6.15-rc1/arch/ia64/kernel/ptrace.c
===================================================================
--- linux-2.6.15-rc1.orig/arch/ia64/kernel/ptrace.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/arch/ia64/kernel/ptrace.c	2005-12-12 15:24:36.000000000 -0500
@@ -1419,6 +1419,7 @@ sys_ptrace (long request, pid_t pid, uns
 	struct switch_stack *sw;
 	long ret;
 
+	pid = vpid_to_pid(pid);
 	lock_kernel();
 	ret = -EPERM;
 	if (request == PTRACE_TRACEME) {
Index: linux-2.6.15-rc1/arch/s390/kernel/ptrace.c
===================================================================
--- linux-2.6.15-rc1.orig/arch/s390/kernel/ptrace.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/arch/s390/kernel/ptrace.c	2005-12-12 15:24:36.000000000 -0500
@@ -711,6 +711,7 @@ sys_ptrace(long request, long pid, long 
 	struct task_struct *child;
 	int ret;
 
+	pid = vpid_to_pid(pid);
 	lock_kernel();
 
 	if (request == PTRACE_TRACEME) {
Index: linux-2.6.15-rc1/drivers/char/tty_io.c
===================================================================
--- linux-2.6.15-rc1.orig/drivers/char/tty_io.c	2005-12-12 15:24:32.000000000 -0500
+++ linux-2.6.15-rc1/drivers/char/tty_io.c	2005-12-12 15:24:36.000000000 -0500
@@ -2176,6 +2176,7 @@ static int tiocspgrp(struct tty_struct *
 		return -ENOTTY;
 	if (get_user(pgrp, p))
 		return -EFAULT;
+	pgrp = vpid_to_pid(pgrp);
 	if (pgrp < 0)
 		return -EINVAL;
 	if (session_of_pgrp(pgrp) != current->signal->session)
Index: linux-2.6.15-rc1/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/base.c	2005-12-12 15:24:27.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/base.c	2005-12-12 15:24:36.000000000 -0500
@@ -1975,6 +1975,7 @@ struct dentry *proc_pid_lookup(struct in
 	tgid = name_to_int(dentry);
 	if (tgid == ~0U)
 		goto out;
+	tgid = vpid_to_pid(tgid);
 
 	read_lock(&tasklist_lock);
 	task = find_task_by_pid(tgid);
@@ -2032,6 +2033,7 @@ static struct dentry *proc_task_lookup(s
 	unsigned tid;
 
 	tid = name_to_int(dentry);
+	tid = vpid_to_pid(tid);
 	if (tid == ~0U)
 		goto out;
 
Index: linux-2.6.15-rc1/kernel/capability.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/capability.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/capability.c	2005-12-12 15:24:36.000000000 -0500
@@ -63,6 +63,7 @@ asmlinkage long sys_capget(cap_user_head
      if (pid < 0) 
              return -EINVAL;
 
+     pid = vpid_to_pid(pid);
      spin_lock(&task_capability_lock);
      read_lock(&tasklist_lock); 
 
Index: linux-2.6.15-rc1/kernel/exit.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/exit.c	2005-12-12 15:24:27.000000000 -0500
+++ linux-2.6.15-rc1/kernel/exit.c	2005-12-12 15:24:36.000000000 -0500
@@ -1529,10 +1529,12 @@ asmlinkage long sys_waitid(int which, pi
 	case P_PID:
 		if (pid <= 0)
 			return -EINVAL;
+		pid = vpid_to_pid(pid);
 		break;
 	case P_PGID:
 		if (pid <= 0)
 			return -EINVAL;
+		pid = vpid_to_pid(pid);
 		pid = -pid;
 		break;
 	default:
Index: linux-2.6.15-rc1/kernel/sched.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sched.c	2005-12-12 15:24:27.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sched.c	2005-12-12 15:24:36.000000000 -0500
@@ -3680,7 +3680,11 @@ task_t *idle_task(int cpu)
  */
 static inline task_t *find_process_by_pid(pid_t pid)
 {
-	return pid ? find_task_by_pid(pid) : current;
+	if (pid) {
+		pid = vpid_to_pid(pid);
+		return find_task_by_pid(pid);
+	}
+	return current;
 }
 
 /* Actually do priority change: must hold rq lock. */
Index: linux-2.6.15-rc1/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/signal.c	2005-12-12 15:24:27.000000000 -0500
+++ linux-2.6.15-rc1/kernel/signal.c	2005-12-12 15:26:42.000000000 -0500
@@ -1218,9 +1218,9 @@ static int kill_something_info(int sig, 
 		read_unlock(&tasklist_lock);
 		return count ? retval : -ESRCH;
 	} else if (pid < 0) {
-		return kill_pg_info(sig, info, -pid);
+		return kill_pg_info(sig, info, vpid_to_pid(-pid));
 	} else {
-		return kill_proc_info(sig, info, pid);
+		return kill_proc_info(sig, info, vpid_to_pid(pid));
 	}
 }
 
@@ -2273,6 +2273,8 @@ static int do_tkill(int tgid, int pid, i
 	info.si_pid = task_vtgid(current);
 	info.si_uid = current->uid;
 
+	pid  = vpid_to_pid(pid);
+	tgid = vpid_to_pid(tgid);
 	read_lock(&tasklist_lock);
 	p = find_task_by_pid(pid);
 	if (p && (tgid <= 0 || task_tgid(p) == tgid)) {
@@ -2340,6 +2342,7 @@ sys_rt_sigqueueinfo(int pid, int sig, si
 	info.si_signo = sig;
 
 	/* POSIX.1b doesn't mention process groups.  */
+	pid = vpid_to_pid(pid);
 	return kill_proc_info(sig, &info, pid);
 }
 
Index: linux-2.6.15-rc1/kernel/sys.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sys.c	2005-12-12 15:24:32.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sys.c	2005-12-12 15:24:36.000000000 -0500
@@ -268,6 +268,8 @@ asmlinkage long sys_setpriority(int whic
 		case PRIO_PROCESS:
 			if (!who)
 				who = task_pid(current);
+			else
+				who = vpid_to_pid(who);
 			p = find_task_by_pid(who);
 			if (p)
 				error = set_one_prio(p, niceval, error);
@@ -275,6 +277,8 @@ asmlinkage long sys_setpriority(int whic
 		case PRIO_PGRP:
 			if (!who)
 				who = process_group(current);
+			else
+				who = vpid_to_pid(who);
 			do_each_task_pid(who, PIDTYPE_PGID, p) {
 				error = set_one_prio(p, niceval, error);
 			} while_each_task_pid(who, PIDTYPE_PGID, p);
@@ -321,6 +325,8 @@ asmlinkage long sys_getpriority(int whic
 		case PRIO_PROCESS:
 			if (!who)
 				who = task_pid(current);
+			else
+				who = vpid_to_pid(who);
 			p = find_task_by_pid(who);
 			if (p) {
 				niceval = 20 - task_nice(p);
@@ -331,6 +337,8 @@ asmlinkage long sys_getpriority(int whic
 		case PRIO_PGRP:
 			if (!who)
 				who = process_group(current);
+			else
+				who = vpid_to_pid(who);
 			do_each_task_pid(who, PIDTYPE_PGID, p) {
 				niceval = 20 - task_nice(p);
 				if (niceval > retval)
@@ -1087,8 +1095,12 @@ asmlinkage long sys_setpgid(pid_t pid, p
 
 	if (!pid)
 		pid = task_pid(current);
+	else
+		pid = vpid_to_pid(pid);
 	if (!pgid)
 		pgid = pid;
+	else
+		pgid = vpid_to_pid(pgid);
 	if (pgid < 0)
 		return -EINVAL;
 
@@ -1159,6 +1171,7 @@ asmlinkage long sys_getpgid(pid_t pid)
 		int retval;
 		struct task_struct *p;
 
+		pid = vpid_to_pid(pid);
 		read_lock(&tasklist_lock);
 		p = find_task_by_pid(pid);
 
@@ -1191,6 +1204,7 @@ asmlinkage long sys_getsid(pid_t pid)
 		int retval;
 		struct task_struct *p;
 
+		pid = vpid_to_pid(pid);
 		read_lock(&tasklist_lock);
 		p = find_task_by_pid(pid);
 
Index: linux-2.6.15-rc1/kernel/ptrace.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/ptrace.c	2005-12-12 11:46:47.000000000 -0500
+++ linux-2.6.15-rc1/kernel/ptrace.c	2005-12-12 15:24:36.000000000 -0500
@@ -439,6 +439,7 @@ static int ptrace_get_task_struct(long r
 	/*
 	 * You may not mess with init
 	 */
+	pid = vpid_to_pid(pid);
 	if (pid == 1)
 		return -EPERM;
 

--

* [RFC][patch 11/21] PID Virtualization: use vpgid_to_pgid function
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (9 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 10/21] PID Virtualization: Use " Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 12/21] PID Virtualization: Context for pid_to_vpid conversition functions Hubertus Franke
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: FB-vpgid-to-pgid-translation.patch --]
[-- Type: text/plain, Size: 1924 bytes --]

Same as the previous patch for pids, but here we focus on virtual
ids that are interpreted as process group ids. Since a process
group id can be passed as a negative value, these call sites use
vpgid_to_pgid, which is meant to cope with the negative value.
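
A minimal sketch of what sign-aware handling could look like for the
wait4() argument, assuming a fixed-offset translation. The demo_* names
and the offset are hypothetical. Note one assumption beyond the patch:
the patch only exempts -1 from translation, while this sketch also
leaves 0 ("caller's own process group") untouched.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical translation: kernel id = virtual id + fixed offset. */
enum { DEMO_ID_OFFSET = 1000 };

/* Translate a strictly positive virtual id into a kernel id. */
pid_t demo_translate_positive(pid_t vid)
{
	assert(vid > 0);
	return vid + DEMO_ID_OFFSET;
}

/*
 * Sign-preserving conversion: a negative value denotes a process
 * group, so translate its magnitude and restore the sign afterwards.
 */
pid_t demo_vpgid_to_pgid(pid_t id)
{
	assert(id != 0);
	if (id < 0)
		return -demo_translate_positive(-id);
	return demo_translate_positive(id);
}

/*
 * wait4()-style argument handling: -1 (any child) and 0 (caller's
 * process group) are sentinels, not ids, and pass through unchanged.
 */
pid_t demo_wait4_target(pid_t pid)
{
	if (pid == -1 || pid == 0)
		return pid;
	return demo_vpgid_to_pgid(pid);
}
```

With the offset of 1000 assumed above, a virtual argument of 5 becomes
1005 and -5 becomes -1005, while -1 and 0 are returned as they are.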

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/fcntl.c          |    1 +
 kernel/capability.c |    1 +
 kernel/exit.c       |    2 ++
 3 files changed, 4 insertions(+)

Index: linux-2.6.15-rc1/fs/fcntl.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/fcntl.c	2005-12-14 15:13:55.000000000 -0500
+++ linux-2.6.15-rc1/fs/fcntl.c	2005-12-14 15:16:34.000000000 -0500
@@ -267,6 +267,7 @@ int f_setown(struct file *filp, unsigned
 	if (err)
 		return err;
 
+	arg = vpgid_to_pgid(arg);
 	f_modown(filp, arg, current->uid, current->euid, force);
 	return 0;
 }
Index: linux-2.6.15-rc1/kernel/capability.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/capability.c	2005-12-14 15:14:38.000000000 -0500
+++ linux-2.6.15-rc1/kernel/capability.c	2005-12-14 15:14:42.000000000 -0500
@@ -188,6 +188,7 @@ asmlinkage long sys_capset(cap_user_head
      if (get_user(pid, &header->pid))
 	     return -EFAULT; 
 
+     pid = vpgid_to_pgid(pid);
      if (pid && pid != task_pid(current) && !capable(CAP_SETPCAP))
              return -EPERM;
 
Index: linux-2.6.15-rc1/kernel/exit.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/exit.c	2005-12-14 15:14:38.000000000 -0500
+++ linux-2.6.15-rc1/kernel/exit.c	2005-12-14 15:14:42.000000000 -0500
@@ -1556,6 +1556,8 @@ asmlinkage long sys_wait4(pid_t pid, int
 	if (options & ~(WNOHANG|WUNTRACED|WCONTINUED|
 			__WNOTHREAD|__WCLONE|__WALL))
 		return -EINVAL;
+	if (pid != -1)
+		pid = vpgid_to_pgid(pid);
 	ret = do_wait(pid, options | WEXITED, NULL, stat_addr, ru);
 
 	/* avoid REGPARM breakage on x86: */

--

* [RFC][patch 12/21] PID Virtualization: Context for pid_to_vpid conversition functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (10 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 11/21] PID Virtualization: use vpgid_to_pgid function Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 13/21] PID Virtualization: Documentation Hubertus Franke
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: FC-context-for-pid2vpid.patch --]
[-- Type: text/plain, Size: 14377 bytes --]

A pid_to_vpid conversion requires the context task relative to which
the conversion should take place. For instance, the virtual init process
of a container is vpid=1 relative to the tasks of that container,
vpid=-1 from within a different container, and vpid=pid in the global context.

By default we assume that the virtual access functions are called
within the context of the task's own container.
This patch provides the context for the pid_to_vpid translations,
since vpids are always relative to a context task.
In this patch we therefore only identify the places where the context
differs from the default.
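
The init-process example above can be modelled in user space as below.
This is a sketch under assumptions: the demo_* structures and the
convention that a NULL container pointer means "global context" are
hypothetical, and the _ctx functions in this patch are still identity
placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical container: identified by the kernel pid of its init. */
struct demo_container {
	pid_t init_kernel_pid;
};

/* Hypothetical task: a NULL container stands for the global context. */
struct demo_task {
	pid_t kernel_pid;
	const struct demo_container *container;
};

/*
 * Virtual pid of container c's init process as seen from context task
 * ctx: the kernel pid in the global context, 1 inside the same
 * container, and -1 (not visible) from any other container.
 */
pid_t demo_init_vpid_ctx(const struct demo_container *c,
			 const struct demo_task *ctx)
{
	assert(c != NULL && ctx != NULL);
	if (ctx->container == NULL)
		return c->init_kernel_pid;
	if (ctx->container == c)
		return 1;
	return -1;
}
```

The same kernel pid thus yields three different answers depending only
on the context task, which is why the explicit ctx argument is needed
wherever the caller is not the task being described.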

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/binfmt_elf.c       |    4 ++--
 fs/fcntl.c            |    2 +-
 fs/proc/array.c       |   18 +++++++++---------
 fs/proc/base.c        |    4 ++--
 include/linux/sched.h |   41 +++++++++++++++++++++++++++++++++++++----
 ipc/msg.c             |    8 ++++----
 ipc/sem.c             |    2 +-
 ipc/shm.c             |    8 ++++----
 kernel/exit.c         |    4 ++--
 kernel/fork.c         |    3 ++-
 kernel/signal.c       |    4 ++--
 kernel/sys.c          |    7 ++++---
 kernel/timer.c        |    2 +-
 13 files changed, 71 insertions(+), 36 deletions(-)

Index: linux-2.6.15-rc1/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/base.c	2005-12-14 15:14:38.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/base.c	2005-12-14 15:16:46.000000000 -0500
@@ -2102,7 +2102,7 @@ static int get_tgid_list(int index, unsi
 		p = next_task(&init_task);
 
 	for ( ; p != &init_task; p = next_task(p)) {
-		int tgid = task_vpid(p);
+		int tgid = task_vpid_ctx(p, current);
 		if (!pid_alive(p))
 			continue;
 		if (--index >= 0)
@@ -2135,7 +2135,7 @@ static int get_tid_list(int index, unsig
 	 * via next_thread().
 	 */
 	if (pid_alive(task)) do {
-		int tid = task_vpid(task);
+		int tid = task_vpid_ctx(task, current);
 
 		if (--index >= 0)
 			continue;
Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-12-14 15:12:36.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-12-14 15:16:46.000000000 -0500
@@ -165,11 +165,11 @@ static inline char * task_state(struct t
 
 	read_lock(&tasklist_lock);
 	if (pid_alive(p))
-		ppid = task_vtgid(p->group_leader->real_parent);
+		ppid = task_vtgid_ctx(p->group_leader->real_parent, current);
 	else
 		ppid = 0;
 	if (pid_alive(p) && p->ptrace)
-		tpid = task_vppid(p);
+		tpid = task_vppid_ctx(p, current);
 	else
 		tpid = 0;
 	buffer += sprintf(buffer,
@@ -183,8 +183,8 @@ static inline char * task_state(struct t
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		(p->sleep_avg/1024)*100/(1020000000/1024),
-	       	task_vtgid(p),
-		task_vpid(p),
+	       	task_vtgid_ctx(p,current),
+		task_vpid_ctx(p,current),
 		ppid, tpid,
 		p->uid, p->euid, p->suid, p->fsuid,
 		p->gid, p->egid, p->sgid, p->fsgid);
@@ -379,11 +379,11 @@ static int do_task_stat(struct task_stru
 	}
 	if (task->signal) {
 		if (task->signal->tty) {
-			tty_pgrp = pid_to_vpid(task->signal->tty->pgrp);
+			tty_pgrp = pid_to_vpid_ctx(task->signal->tty->pgrp, current);
 			tty_nr = new_encode_dev(tty_devnum(task->signal->tty));
 		}
-		pgid = virt_process_group(task);
-		sid = pid_to_vpid(task->signal->session);
+		pgid = pid_to_vpid_ctx(process_group(task), current);
+		sid = pid_to_vpid_ctx(task->signal->session, current);
 		cmin_flt = task->signal->cmin_flt;
 		cmaj_flt = task->signal->cmaj_flt;
 		cutime = task->signal->cutime;
@@ -398,7 +398,7 @@ static int do_task_stat(struct task_stru
 		it_real_value = task->signal->it_real_value;
 	}
 	ppid = pid_alive(task) ?
-		task_vtgid(task->group_leader->real_parent) : 0;
+		pid_to_vpid_ctx(task_tgid(task->group_leader->real_parent), current) : 0;
 	read_unlock(&tasklist_lock);
 
 	if (!whole || num_threads<2)
@@ -425,7 +425,7 @@ static int do_task_stat(struct task_stru
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
 %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
-		task_vpid(task),
+		task_vpid_ctx(task,current),
 		tcomm,
 		state,
 		ppid,
Index: linux-2.6.15-rc1/fs/fcntl.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/fcntl.c	2005-12-14 15:16:34.000000000 -0500
+++ linux-2.6.15-rc1/fs/fcntl.c	2005-12-14 15:17:01.000000000 -0500
@@ -317,7 +317,7 @@ static long do_fcntl(int fd, unsigned in
 		 * current syscall conventions, the only way
 		 * to fix this will be in libc.
 		 */
-		err = pgid_to_vpgid(filp->f_owner.pid);
+		err = pgid_to_vpgid_ctx(filp->f_owner.pid, current);
 		force_successful_syscall_return();
 		break;
 	case F_SETOWN:
Index: linux-2.6.15-rc1/ipc/msg.c
===================================================================
--- linux-2.6.15-rc1.orig/ipc/msg.c	2005-12-14 15:12:36.000000000 -0500
+++ linux-2.6.15-rc1/ipc/msg.c	2005-12-14 15:16:46.000000000 -0500
@@ -416,8 +416,8 @@ asmlinkage long sys_msgctl (int msqid, i
 		tbuf.msg_cbytes = msq->q_cbytes;
 		tbuf.msg_qnum   = msq->q_qnum;
 		tbuf.msg_qbytes = msq->q_qbytes;
-		tbuf.msg_lspid  = pid_to_vpid(msq->q_lspid);
-		tbuf.msg_lrpid  = pid_to_vpid(msq->q_lrpid);
+		tbuf.msg_lspid  = pid_to_vpid_ctx(msq->q_lspid, current);
+		tbuf.msg_lrpid  = pid_to_vpid_ctx(msq->q_lrpid, current);
 		msg_unlock(msq);
 		if (copy_msqid_to_user(buf, &tbuf, version))
 			return -EFAULT;
@@ -821,8 +821,8 @@ static int sysvipc_msg_proc_show(struct 
 			  msq->q_perm.mode,
 			  msq->q_cbytes,
 			  msq->q_qnum,
-			  pid_to_vpid(msq->q_lspid),
-			  pid_to_vpid(msq->q_lrpid),
+			  pid_to_vpid_ctx(msq->q_lspid, current),
+			  pid_to_vpid_ctx(msq->q_lrpid, current),
 			  msq->q_perm.uid,
 			  msq->q_perm.gid,
 			  msq->q_perm.cuid,
Index: linux-2.6.15-rc1/ipc/shm.c
===================================================================
--- linux-2.6.15-rc1.orig/ipc/shm.c	2005-12-14 15:12:36.000000000 -0500
+++ linux-2.6.15-rc1/ipc/shm.c	2005-12-14 15:16:46.000000000 -0500
@@ -508,8 +508,8 @@ asmlinkage long sys_shmctl (int shmid, i
 		tbuf.shm_atime	= shp->shm_atim;
 		tbuf.shm_dtime	= shp->shm_dtim;
 		tbuf.shm_ctime	= shp->shm_ctim;
-		tbuf.shm_cpid	= pid_to_vpid(shp->shm_cprid);
-		tbuf.shm_lpid	= pid_to_vpid(shp->shm_lprid);
+		tbuf.shm_cpid	= pid_to_vpid_ctx(shp->shm_cprid, current);
+		tbuf.shm_lpid	= pid_to_vpid_ctx(shp->shm_lprid, current);
 		if (!is_file_hugepages(shp->shm_file))
 			tbuf.shm_nattch	= shp->shm_nattch;
 		else
@@ -896,8 +896,8 @@ static int sysvipc_shm_proc_show(struct 
 			  shp->id,
 			  shp->shm_flags,
 			  shp->shm_segsz,
-			  pid_to_vpid(shp->shm_cprid),
-			  pid_to_vpid(shp->shm_lprid),
+			  pid_to_vpid_ctx(shp->shm_cprid, current),
+			  pid_to_vpid_ctx(shp->shm_lprid, current),
 			  is_file_hugepages(shp->shm_file) ? (file_count(shp->shm_file) - 1) : shp->shm_nattch,
 			  shp->shm_perm.uid,
 			  shp->shm_perm.gid,
Index: linux-2.6.15-rc1/ipc/sem.c
===================================================================
--- linux-2.6.15-rc1.orig/ipc/sem.c	2005-12-14 15:12:27.000000000 -0500
+++ linux-2.6.15-rc1/ipc/sem.c	2005-12-14 15:16:46.000000000 -0500
@@ -721,7 +721,7 @@ static int semctl_main(int semid, int se
 		err = curr->semval;
 		goto out_unlock;
 	case GETPID:
-		err = curr->sempid;
+		err = pid_to_vpid_ctx(curr->sempid, current);
 		goto out_unlock;
 	case GETNCNT:
 		err = count_semncnt(sma,semnum);
Index: linux-2.6.15-rc1/kernel/exit.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/exit.c	2005-12-14 15:14:42.000000000 -0500
+++ linux-2.6.15-rc1/kernel/exit.c	2005-12-14 15:16:46.000000000 -0500
@@ -1143,7 +1143,7 @@ static int wait_task_zombie(task_t *p, i
 		p->exit_state = EXIT_ZOMBIE;
 		return retval;
 	}
-	retval = task_vpid(p);
+	retval = task_vpid_ctx(p, current);
 	if (p->real_parent != p->parent) {
 		write_lock_irq(&tasklist_lock);
 		/* Double-check with lock held.  */
@@ -1278,7 +1278,7 @@ bail_ref:
 	if (!retval && infop)
 		retval = put_user(p->uid, &infop->si_uid);
 	if (!retval)
-		retval = task_vpid(p);
+		retval = task_vpid_ctx(p, current);
 	put_task_struct(p);
 
 	BUG_ON(!retval);
Index: linux-2.6.15-rc1/kernel/fork.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/fork.c	2005-12-14 15:12:36.000000000 -0500
+++ linux-2.6.15-rc1/kernel/fork.c	2005-12-14 15:16:46.000000000 -0500
@@ -928,9 +928,10 @@ static task_t *copy_process(unsigned lon
 	p->did_exec = 0;
 	copy_flags(clone_flags, p);
 	p->__pid = pid;
+
 	retval = -EFAULT;
 	if (clone_flags & CLONE_PARENT_SETTID)
-		if (put_user(task_vpid(p), parent_tidptr))
+		if (put_user(task_vpid_ctx(p, current), parent_tidptr))
 			goto bad_fork_cleanup;
 
 	p->proc_dentry = NULL;
Index: linux-2.6.15-rc1/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/signal.c	2005-12-14 15:14:38.000000000 -0500
+++ linux-2.6.15-rc1/kernel/signal.c	2005-12-14 15:16:46.000000000 -0500
@@ -1478,7 +1478,7 @@ void do_notify_parent(struct task_struct
 
 	info.si_signo = sig;
 	info.si_errno = 0;
-	info.si_pid = task_vpid(tsk);
+	info.si_pid = task_vpid_ctx(tsk, tsk->parent);
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1543,7 +1543,7 @@ static void do_notify_parent_cldstop(str
 
 	info.si_signo = SIGCHLD;
 	info.si_errno = 0;
-	info.si_pid = task_vpid(tsk);
+	info.si_pid = task_vpid_ctx(tsk, tsk->parent);
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
Index: linux-2.6.15-rc1/kernel/sys.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/sys.c	2005-12-14 15:14:38.000000000 -0500
+++ linux-2.6.15-rc1/kernel/sys.c	2005-12-14 15:16:46.000000000 -0500
@@ -1179,7 +1179,8 @@ asmlinkage long sys_getpgid(pid_t pid)
 		if (p) {
 			retval = security_task_getpgid(p);
 			if (!retval)
-				retval = virt_process_group(p);
+				retval = pid_to_vpid_ctx(process_group(p),
+							 current);
 		}
 		read_unlock(&tasklist_lock);
 		return retval;
@@ -1199,7 +1200,7 @@ asmlinkage long sys_getpgrp(void)
 asmlinkage long sys_getsid(pid_t pid)
 {
 	if (!pid) {
-		return pid_to_vpid(current->signal->session);
+		return pid_to_vpid_ctx(current->signal->session,current);
 	} else {
 		int retval;
 		struct task_struct *p;
@@ -1212,7 +1213,7 @@ asmlinkage long sys_getsid(pid_t pid)
 		if(p) {
 			retval = security_task_getsid(p);
 			if (!retval)
-				retval = pid_to_vpid(p->signal->session);
+				retval = pid_to_vpid_ctx(p->signal->session, current);
 		}
 		read_unlock(&tasklist_lock);
 		return retval;
Index: linux-2.6.15-rc1/kernel/timer.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/timer.c	2005-12-14 15:12:33.000000000 -0500
+++ linux-2.6.15-rc1/kernel/timer.c	2005-12-14 15:16:46.000000000 -0500
@@ -968,7 +968,7 @@ asmlinkage long sys_getppid(void)
 
 	parent = me->group_leader->real_parent;
 	for (;;) {
-		pid = task_tgid(parent);
+		pid = task_vtgid_ctx(parent, current);
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
 {
 		struct task_struct *old = parent;
Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-12-14 15:14:37.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-12-14 15:16:46.000000000 -0500
@@ -869,14 +869,29 @@ static inline pid_t process_group(const 
  *  pid domain translation functions:
  *	- from kernel to user pid domain
  */
+static inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx)
+{
+	return pid;
+}
+
 static inline pid_t pid_to_vpid(pid_t pid)
 {
+	return pid_to_vpid_ctx(pid, current);
+}
+
+static inline pid_t pgid_to_vpgid_ctx(pid_t pid, const struct task_struct *ctx)
+{
+	int isgrp = (pid < 0) ;
+
+	if (isgrp) pid = -pid;
+	pid = pid_to_vpid_ctx(pid, ctx);
+	if (isgrp) pid = -pid;
 	return pid;
 }
 
 static inline pid_t pgid_to_vpgid(pid_t pid)
 {
-	return pid;
+	return pgid_to_vpgid_ctx(pid, current);
 }
 
 static inline pid_t vpid_to_pid(pid_t pid)
@@ -912,19 +927,37 @@ static inline pid_t task_tgid(const stru
 	return p->__tgid;
 }
 
-static inline pid_t task_vpid(const struct task_struct *p)
+static inline pid_t task_vpid_ctx(const struct task_struct *p,
+				   const struct task_struct *ctx)
 {
 	return task_pid(p);
 }
 
+static inline pid_t task_vpid(const struct task_struct *p)
+{
+	return task_vpid_ctx(p, p);
+}
+
+static inline pid_t task_vppid_ctx(const struct task_struct *p,
+			      	   const struct task_struct *ctx)
+{
+	return task_vpid_ctx(p->parent, ctx);
+}
+
 static inline pid_t task_vppid(const struct task_struct *p)
 {
-	return task_pid(p->parent);
+	return task_vppid_ctx(p, p);
+}
+
+static inline pid_t task_vtgid_ctx(const struct task_struct *p,
+				    const struct task_struct *ctx)
+{
+	return pid_to_vpid_ctx(task_tgid(p), ctx);
 }
 
 static inline pid_t task_vtgid(const struct task_struct *p)
 {
-	return task_tgid(p);
+	return task_vtgid_ctx(p, p);
 }
 
 static inline pid_t virt_process_group(const struct task_struct *p)
Index: linux-2.6.15-rc1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/binfmt_elf.c	2005-12-14 15:12:36.000000000 -0500
+++ linux-2.6.15-rc1/fs/binfmt_elf.c	2005-12-14 15:16:46.000000000 -0500
@@ -1273,7 +1273,7 @@ static void fill_prstatus(struct elf_prs
 	prstatus->pr_pid = task_vpid(p);
 	prstatus->pr_ppid = task_vppid(p);
 	prstatus->pr_pgrp = virt_process_group(p);
-	prstatus->pr_sid = pid_to_vpid(p->signal->session);
+	prstatus->pr_sid = pid_to_vpid_ctx(p->signal->session, p);
 	if (thread_group_leader(p)) {
 		/*
 		 * This is the record for the group leader.  Add in the
@@ -1319,7 +1319,7 @@ static int fill_psinfo(struct elf_prpsin
 	psinfo->pr_pid = task_vpid(p);
 	psinfo->pr_ppid = task_vppid(p);
 	psinfo->pr_pgrp = virt_process_group(p);
-	psinfo->pr_sid = pid_to_vpid(p->signal->session);
+	psinfo->pr_sid = pid_to_vpid_ctx(p->signal->session, p);
 
 	i = p->state ? ffz(~p->state) + 1 : 0;
 	psinfo->pr_state = i;

--

* [RFC][patch 13/21] PID Virtualization: Documentation
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (11 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 12/21] PID Virtualization: Context for pid_to_vpid conversition functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 14/21] PID Virtualization: pidspace Hubertus Franke
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G0-documentation.patch --]
[-- Type: text/plain, Size: 2299 bytes --]

First (incomplete) attempt at documentation.
Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 Documentation/pidvirtualization.txt |   64 ++++++++++++++++++++++++++++++++++++
 1 files changed, 64 insertions(+)

Index: linux-2.6.15-rc1/Documentation/containers.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc1/Documentation/containers.txt	2005-12-08 01:30:42.000000000 -0500
@@ -0,0 +1,64 @@
+This document describes the basics of containers.
+
+Hubertus Franke	<frankeh@watson.ibm.com>
+Serge E Hallyn	<serue@us.ibm.com>
+Cedric Legoater <clg@fr.ibm.com>
+
+Applications and associated processes can be containerized into
+"isolated" soft partitions. The goal is to make containers
+transparently migratable. To do so, certain resource identifiers
+need to be virtualized.
+These include
+	- pids, gids,
+	- SysV ids
+	- procfs
+Only resources belonging to a container can be accessed within
+the container.
+
+A "container" is created through a helper program <contexe>,
+that is supplied separately.
+A process moves itself to a container by writing
+the name of the container to create to /proc/container.
+Doing so makes the calling process the pseudo init process
+of the container.
+
+
+For example "contexe -j2 /bin/bash" spawns a bash within
+a new container <cont_2> and makes the contexe process
+the container's virtual initproc.
+
+
+PID-VIRTUALIZATION:
+-------------------
+
+Let Process <A> be the currently running process ( e.g. bash with pid 913 )
+Each container has an associated pidspace id. Each pidspace
+id is managed like the standard pid range in linux.
+
+We obtain the following tree, where <pidspace | vpid > denotes the
+internal pid which is obtained by bitmasking.
+
+A some older bash < 0 | 913 >
+	|
+	\/
+B == contexe == < 0 | 1087 >      ( also container->init_proc := A
+				   	 container->init_pid  := 1087
+	|
+	\/
+C == /bin/bash == < 1 | 2 >
+
+
+Let's define the results we expect here.
+
+C in context of B:      vpid = 2
+B in context of C:	vpid = 1
+
+B in context of A:	vpid = pid = 1087
+C in context of A:	vpid = pid = < 1 | 2 >
+
+A in context of B:	vpid = pid = 913
+A in context of C:	vpid = -1
+
+< More to Follow >
+
+

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][patch 14/21] PID Virtualization: pidspace
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (12 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 13/21] PID Virtualization: Documentation Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 15/21] PID Virtualization: container object and functions Hubertus Franke
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G1-pidspace.patch --]
[-- Type: text/plain, Size: 8233 bytes --]

This patch introduces pidspaces to provide pid virtualization
capabilities. A pidspace will be allocated for each container
and destroyed (resources freed) when the container is 
terminated.

The global pid range (32 bit) is partitioned into
PID_MAX_LIMIT sized pidspaces. The virtualization
is defined as kernel_pid ::= < pidspace_id, vpid >
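
The packing of < pidspace_id, vpid > into one kernel pid can be sketched
in plain userspace C as follows. This is only an illustration, not code
from the patch: the sketch_* names are hypothetical, and the shift of 22
assumes a 64-bit build without CONFIG_BASE_SMALL, which is the case that
yields 4 million pids per pidspace.

```c
#include <assert.h>

/* Assumption: 64-bit build without CONFIG_BASE_SMALL, so the shift is 22. */
#define SKETCH_PID_MAX_LIMIT_SHIFT 22
#define SKETCH_PID_MAX_LIMIT (1 << SKETCH_PID_MAX_LIMIT_SHIFT)

/* Upper bits of a kernel pid select the pidspace. */
static int sketch_pid_to_pidspace(int pid)
{
	return pid >> SKETCH_PID_MAX_LIMIT_SHIFT;
}

/* Combine a pidspace id and a virtual pid into one kernel pid. */
static int sketch_vpid_to_pid(int pidspace_id, int vpid)
{
	return (pidspace_id << SKETCH_PID_MAX_LIMIT_SHIFT) | vpid;
}

/* Lower bits of a kernel pid are the virtual pid inside its pidspace. */
static int sketch_pid_to_vpid(int pid)
{
	return pid & (SKETCH_PID_MAX_LIMIT - 1);
}
```

With these assumptions, vpid 2 in pidspace 1 maps to kernel pid
(1 << 22) | 2, while any pid in pidspace 0 is its own vpid, which is why
the default pidspace needs no translation.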

In this patch we are utilizing the existing pid management,
i.e. allocation and hashing. We are providing a pidspace, as managed 
previously, for each pidspace id. 

This patch eliminates the explicit management of vpids and allows
continued usage of the existing pid hashing and lookup functions.

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/pid.h     |   27 +++++++++++-
 include/linux/threads.h |   17 +++++--
 kernel/fork.c           |    2 
 kernel/pid.c            |  105 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 135 insertions(+), 16 deletions(-)

Index: linux-2.6.15-rc1/kernel/fork.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/fork.c	2005-12-12 18:39:51.000000000 -0500
+++ linux-2.6.15-rc1/kernel/fork.c	2005-12-12 18:40:32.000000000 -0500
@@ -1241,7 +1241,7 @@ long do_fork(unsigned long clone_flags,
 {
 	struct task_struct *p;
 	int trace = 0;
-	long pid = alloc_pidmap();
+	long pid = alloc_pidmap(DEFAULT_PIDSPACE);
 	long vpid;
 
 	if (pid < 0)
Index: linux-2.6.15-rc1/include/linux/pid.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/pid.h	2005-12-12 18:37:35.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/pid.h	2005-12-12 18:40:04.000000000 -0500
@@ -36,7 +36,7 @@ extern void FASTCALL(detach_pid(struct t
  */
 extern struct pid *FASTCALL(find_pid(enum pid_type, int));
 
-extern int alloc_pidmap(void);
+extern int alloc_pidmap(int pidspace_id);
 extern void FASTCALL(free_pidmap(int));
 extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
 
@@ -51,5 +51,30 @@ extern void switch_exec_pids(struct task
 			prefetch((task)->pids[type].pid_list.next),	\
 			hlist_unhashed(&(task)->pids[type].pid_chain));	\
 	}								\
+/*
+ * Pidspace related definition for translation  real <-> virtual
+ * and initialization functions
+ */
+
+#define DEFAULT_PIDSPACE	0
+
+extern int pidspace_init(int pidspace_id);
+extern int pidspace_free(int pidspace_id);
+
+static inline int pid_to_pidspace(int pid)
+{
+	return (pid >> PID_MAX_LIMIT_SHIFT);
+}
+
+static inline int pidspace_vpid_to_pid(int pidspace_id, pid_t pid)
+{
+	return (pidspace_id << PID_MAX_LIMIT_SHIFT) | pid;
+}
+
+static inline int pidspace_pid_to_vpid(pid_t pid)
+{
+	return (pid & (PID_MAX_LIMIT-1));
+}
+
 
 #endif /* _LINUX_PID_H */
Index: linux-2.6.15-rc1/include/linux/threads.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/threads.h	2005-12-12 18:37:35.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/threads.h	2005-12-12 18:40:04.000000000 -0500
@@ -25,12 +25,21 @@
 /*
  * This controls the default maximum pid allocated to a process
  */
-#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
+#define PID_MAX_DEFAULT_SHIFT	(CONFIG_BASE_SMALL ? 12 : 15)
+#define PID_MAX_DEFAULT 	(1<< PID_MAX_DEFAULT_SHIFT)
 
 /*
- * A maximum of 4 million PIDs should be enough for a while:
+ * The entire global pid range is divided into pidspaces
+ * each able to hold up to PID_MAX_LIMIT pids.
+ * A maximum of 512 pidspaces should be enough for a while
+ * A maximum of 4 million PIDs per pidspace should be enough for a while:
+ * we keep high bit reserved for negative values
  */
-#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
-	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
+#define PID_MAX_LIMIT_SHIFT (CONFIG_BASE_SMALL ? PAGE_SHIFT + 8 : \
+	(sizeof(long) > 4 ? 22 : PID_MAX_DEFAULT_SHIFT))
+#define PID_MAX_LIMIT 		(1<<PID_MAX_LIMIT_SHIFT)
+
+#define MAX_NR_PIDSPACES 	(PID_MAX_LIMIT_SHIFT > 22 ?   \
+				 1<<(32-PID_MAX_LIMIT_SHIFT-1) : 512)
 
 #endif
Index: linux-2.6.15-rc1/kernel/pid.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/pid.c	2005-12-12 18:37:35.000000000 -0500
+++ linux-2.6.15-rc1/kernel/pid.c	2005-12-12 18:40:04.000000000 -0500
@@ -35,6 +35,7 @@ int pid_max = PID_MAX_DEFAULT;
 int last_pid;
 
 #define RESERVED_PIDS		300
+#define RESERVED_PIDS_NON_DFLT    1
 
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
@@ -57,29 +58,103 @@ typedef struct pidmap {
 	void *page;
 } pidmap_t;
 
-static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+struct pidspace {
+	int last_pid;
+	pidmap_t *pidmap_array;
+};
+
+static pidmap_t dflt_pidmap_array[PIDMAP_ENTRIES] =
 	 { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
 
+static struct pidspace pid_spaces[MAX_NR_PIDSPACES] =
+	{ { 0, dflt_pidmap_array } };
+
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 
+int pidspace_init(int pidspace_id)
+{
+	pidmap_t *map;
+	struct pidspace *pid_space =  &pid_spaces[pidspace_id];
+	int i;
+	int rc;
+
+	if (unlikely(pid_space->pidmap_array))
+		return -EBUSY;
+
+	map = kmalloc(PIDMAP_ENTRIES*sizeof(pidmap_t), GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	for (i=0 ; i< PIDMAP_ENTRIES ; i++)
+		map[i] = (pidmap_t){ ATOMIC_INIT(BITS_PER_PAGE), NULL };
+
+	/*
+	 * Free the pidspace if someone raced with us
+	 * installing it:
+	 */
+
+	spin_lock(&pidmap_lock);
+	if (pid_space->pidmap_array) {
+		kfree(map);
+		rc = -EAGAIN;
+	} else {
+		pid_space->pidmap_array = map;
+		pid_space->last_pid = RESERVED_PIDS_NON_DFLT;
+		rc = 0;
+	}
+	spin_unlock(&pidmap_lock);
+	return rc;
+}
+
+int pidspace_free(int pidspace_id)
+{
+	struct pidspace *pid_space =  &pid_spaces[pidspace_id];
+	pidmap_t *map;
+	int i;
+
+	spin_lock(&pidmap_lock);
+	BUG_ON(pid_space->pidmap_array == NULL);
+	map = pid_space->pidmap_array;
+	pid_space->pidmap_array = NULL;
+	spin_unlock(&pidmap_lock);
+
+	for ( i=0; i<PIDMAP_ENTRIES; i++)
+		free_page((unsigned long)map[i].page);
+	kfree(map);
+	return 0;
+}
+
 fastcall void free_pidmap(int pid)
 {
-	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
-	int offset = pid & BITS_PER_PAGE_MASK;
+	pidmap_t *map, *pidmap_array;
+	int offset;
+
+	pidmap_array = pid_spaces[pid_to_pidspace(pid)].pidmap_array;
+	pid = pidspace_pid_to_vpid(pid);
+	map = pidmap_array + pid / BITS_PER_PAGE;
+	offset = pid & BITS_PER_PAGE_MASK;
 
 	clear_bit(offset, map->page);
 	atomic_inc(&map->nr_free);
 }
 
-int alloc_pidmap(void)
+int alloc_pidmap(int pidspace_id)
 {
-	int i, offset, max_scan, pid, last = last_pid;
-	pidmap_t *map;
+	int i, offset, max_scan, pid, last;
+	struct pidspace *pid_space;
+	pidmap_t *map, *pidmap_array;
 
+	pid_space = &pid_spaces[pidspace_id];
+	last = pid_space->last_pid;
 	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (pid >= pid_max) {
+		if (pidspace_id == DEFAULT_PIDSPACE)
+			pid = RESERVED_PIDS;
+		else
+			pid = RESERVED_PIDS_NON_DFLT;
+	}
 	offset = pid & BITS_PER_PAGE_MASK;
+	pidmap_array = pid_space->pidmap_array;
 	map = &pidmap_array[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
@@ -102,7 +177,12 @@ int alloc_pidmap(void)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					last_pid = pid;
+					pid_space->last_pid = pid;
+					if (pidspace_id == 0) {
+						last_pid = pid;
+						return pid;
+					}
+					pid = pidspace_vpid_to_pid(pidspace_id, pid);
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -122,7 +202,10 @@ int alloc_pidmap(void)
 			offset = 0;
 		} else {
 			map = &pidmap_array[0];
-			offset = RESERVED_PIDS;
+			if (pidspace_id == DEFAULT_PIDSPACE)
+				offset = RESERVED_PIDS;
+			else
+				offset = RESERVED_PIDS_NON_DFLT;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -279,6 +362,8 @@ void __init pidmap_init(void)
 {
 	int i;
 
+	pidmap_t *pidmap_array = dflt_pidmap_array;
+
 	pidmap_array->page = (void *)get_zeroed_page(GFP_KERNEL);
 	set_bit(0, pidmap_array->page);
 	atomic_dec(&pidmap_array->nr_free);

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][patch 15/21] PID Virtualization: container object and functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (13 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 14/21] PID Virtualization: pidspace Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 16/21] PID Virtualization: container attach/detach calls Hubertus Franke
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G2-container.patch --]
[-- Type: text/plain, Size: 7481 bytes --]

Introduce the container object and its management functions,
in particular the creation/deletion of containers and the
linkage between the container object and the task.
By default, if the task->container object is NULL, then the task belongs
to the default global container. 

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/container.h |   37 ++++++++++++
 include/linux/sched.h     |   10 +++
 kernel/Makefile           |    3 
 kernel/container.c        |  140 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/pid.c              |   14 ++--
 5 files changed, 198 insertions(+), 6 deletions(-)

Index: linux-2.6.15-rc1/include/linux/container.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc1/include/linux/container.h	2005-12-08 19:43:25.000000000 -0500
@@ -0,0 +1,37 @@
+
+#ifndef _LINUX_CONTAINER_H
+#define _LINUX_CONTAINER_H
+
+/* number of containers will depend on many constraints, which will have to
+ * be integrated here as they become apparent
+ */
+
+
+#define MAX_NR_CONTAINERS		MAX_NR_PIDSPACES
+
+#define MAX_CONTAINER_NAME_LEN 		32
+
+struct container_struct {
+	spinlock_t	    lock;
+	char		    name[MAX_CONTAINER_NAME_LEN];
+	int		    pidspace_id;
+	struct task_struct *init_proc;			/* root proc   */
+	int		    init_pid;			/* pid of root */
+	atomic_t	    tcount;			/* thread count */
+
+	/* and all the other things that will be necessary to track
+	 * for a container
+	 */
+};
+
+/****************************************************************
+ *      Container Management Functions
+ ****************************************************************/
+
+extern struct container_struct *container_find(const char *container_name);
+extern int    container_new     (const char *container_name);
+extern void   container_attach  (struct task_struct *task);
+extern void   container_detach  (struct task_struct *task);
+
+#endif
+
Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-12-08 19:40:49.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-12-08 19:43:25.000000000 -0500
@@ -36,6 +36,7 @@
 #include <linux/seccomp.h>
 
 #include <linux/auxvec.h>	/* For AT_VECTOR_SIZE */
+#include <linux/container.h>
 
 struct exec_domain;
 
@@ -858,6 +859,7 @@ struct task_struct {
 	int cpuset_mems_generation;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
+	struct container_struct *container;
 };
 
 static inline pid_t process_group(const struct task_struct *tsk)
@@ -965,6 +967,14 @@ static inline pid_t virt_process_group(c
 	return process_group(p);
 }
 
+static inline unsigned int task_pidspace_id(const struct task_struct *p)
+{
+	if (p->container)
+		return p->container->pidspace_id;
+	else
+		return DEFAULT_PIDSPACE;
+}
+
 extern void free_task(struct task_struct *tsk);
 extern void __put_task_struct(struct task_struct *tsk);
 #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
Index: linux-2.6.15-rc1/kernel/Makefile
===================================================================
--- linux-2.6.15-rc1.orig/kernel/Makefile	2005-12-08 19:40:49.000000000 -0500
+++ linux-2.6.15-rc1/kernel/Makefile	2005-12-08 19:43:25.000000000 -0500
@@ -7,7 +7,8 @@ obj-y     = sched.o fork.o exec_domain.o
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
-	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
+	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
+	    container.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
Index: linux-2.6.15-rc1/kernel/container.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc1/kernel/container.c	2005-12-08 19:44:26.000000000 -0500
@@ -0,0 +1,140 @@
+/*
+ * Management of Containers
+ *
+ * Copyright (C) Hubertus Franke, IBM Corp. 2005 <frankeh@watson.ibm.com>
+ *
+ */
+
+/* Changes
+ *
+ * 11/22/2005:  Created
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/timer.h>
+#include <linux/mm.h>
+#include <linux/container.h>
+
+#define DPRINTK( fmt, args... ) // printk( "%s: " fmt, __FUNCTION__, ##args )
+
+static struct container_struct *containers[MAX_NR_CONTAINERS];
+static DEFINE_SPINLOCK(container_lock);
+
+/****************************************************************
+ *      Container Management
+ ****************************************************************/
+
+void container_attach(struct task_struct *task)
+{
+	struct container_struct *container = task->container;
+
+	if (!container)
+		return;
+	atomic_inc(&container->tcount);
+
+	DPRINTK("c=<%p:%s> atask=<%x:%x:%s>\n",
+		container, container->name,
+		task_pid(task), task_vpid(task), task->comm);
+}
+
+void container_detach(struct task_struct *task)
+{
+	struct container_struct *container = task->container;
+	unsigned long flags;
+	int empty;
+
+	if (!container)
+		return;
+
+	DPRINTK("c=<%p:%s> dtask=<%x:%x:%s>\n",
+		container, container->name,
+		task_pid(task), task_vpid(task), task->comm);
+
+	task->container = NULL;
+	if (unlikely(task == container->init_proc)) {
+		container->init_proc = NULL;
+		container->init_pid  = 0;
+		memset(container->name, 0, MAX_CONTAINER_NAME_LEN);
+	}
+	empty = atomic_dec_and_test(&container->tcount);
+	if (!empty)
+		return;
+
+	/* we are the last process, so lets destroy the container */
+
+	DPRINTK("c=<%p:%s> destroy container exiting root proc\n",
+		container, container->name);
+
+	spin_lock_irqsave(&container_lock,flags);
+	containers[container->pidspace_id] = NULL;
+	pidspace_free(container->pidspace_id);
+
+	spin_lock(&container->lock);
+	/* ANYTHING UNDER THE LOCK */
+	spin_unlock(&container->lock);
+
+	spin_unlock_irqrestore(&container_lock,flags);
+
+	kfree(container);
+}
+
+/*
+ * create a new container and make the caller the virtual init_proc
+ * of the container
+ */
+
+int container_new(const char *container_name)
+{
+	struct container_struct *newc = NULL;
+	unsigned long flags;
+	int i;
+	int rc;
+
+	newc = kmalloc(sizeof(struct container_struct),GFP_KERNEL);
+	if (newc == NULL)
+		return -ENOMEM;
+	memset(newc,0,sizeof(struct container_struct));
+	strncpy(newc->name, container_name, MAX_CONTAINER_NAME_LEN-1);
+	newc->init_proc = current;
+	newc->init_pid  = task_pid(current);
+	atomic_set(&newc->tcount,0);
+
+	spin_lock_irqsave(&container_lock,flags);
+	for ( i=1; i<MAX_NR_CONTAINERS; i++) {
+		struct container_struct *cptr = containers[i];
+
+		if (cptr == NULL)
+			break;
+		if (strncmp(container_name, cptr->name, MAX_CONTAINER_NAME_LEN) == 0) {
+			rc = -EEXIST;
+			goto out_unlock_free;
+		}
+	}
+	if ( i == MAX_NR_CONTAINERS ) {
+		rc = -ENOMEM;
+		goto out_unlock_free;
+	}
+
+	spin_lock_init(&newc->lock);
+	pidspace_init(i);
+	newc->pidspace_id = i;
+	containers[i] = newc;
+	DPRINTK("created container #%d: %s\n", newc->pidspace_id, newc->name);
+	current->container = newc;
+	container_attach(current);
+	rc = 0;
+	goto out_unlock;
+
+out_unlock_free:
+	kfree(newc);
+out_unlock:
+	spin_unlock_irqrestore(&container_lock,flags);
+	return rc;
+}
+

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][patch 16/21] PID Virtualization: container attach/detach calls
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (14 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 15/21] PID Virtualization: container object and functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 17/21] PID Virtualization: /proc/container filesystem Hubertus Franke
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G3-container-fork-exit.patch --]
[-- Type: text/plain, Size: 1926 bytes --]

Call the container attach and detach functions at their respective
locations. This happens during the fork and exit functions.

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 kernel/exit.c |    1 +
 kernel/fork.c |    5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6.15-rc1/kernel/exit.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/exit.c	2005-12-12 18:39:51.000000000 -0500
+++ linux-2.6.15-rc1/kernel/exit.c	2005-12-12 18:41:09.000000000 -0500
@@ -101,6 +101,7 @@ repeat: 
 		zap_leader = (leader->exit_signal == -1);
 	}
 
+	container_detach(p);
 	sched_exit(p);
 	write_unlock_irq(&tasklist_lock);
 	spin_unlock(&p->proc_lock);
Index: linux-2.6.15-rc1/kernel/fork.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/fork.c	2005-12-12 18:40:32.000000000 -0500
+++ linux-2.6.15-rc1/kernel/fork.c	2005-12-12 18:41:36.000000000 -0500
@@ -43,6 +43,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/container.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1001,6 +1002,7 @@ static task_t *copy_process(unsigned lon
 		goto bad_fork_cleanup_mm;
 	if ((retval = copy_namespace(clone_flags, p)))
 		goto bad_fork_cleanup_keys;
+	container_attach(p);
 	retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
 	if (retval)
 		goto bad_fork_cleanup_namespace;
@@ -1178,6 +1180,7 @@ bad_fork_cleanup_policy:
 	mpol_free(p->mempolicy);
 #endif
 bad_fork_cleanup:
+	container_detach(p);
 	if (p->binfmt)
 		module_put(p->binfmt->module);
 bad_fork_cleanup_put_domain:
@@ -1241,7 +1244,7 @@ long do_fork(unsigned long clone_flags,
 {
 	struct task_struct *p;
 	int trace = 0;
-	long pid = alloc_pidmap(DEFAULT_PIDSPACE);
+	long pid = alloc_pidmap(task_pidspace_id(current));
 	long vpid;
 
 	if (pid < 0)

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][patch 17/21] PID Virtualization: /proc/container filesystem
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (15 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 16/21] PID Virtualization: container attach/detach calls Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 18/21] PID Virtualization: Implementation of low level virtualization functions Hubertus Franke
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G4-container-procfs.patch --]
[-- Type: text/plain, Size: 4320 bytes --]

Provide the /proc/container directory to
containerize a process or retrieve an associated container.
We need a reasonably quick mechanism to trigger container creation.

A process becomes the root of a container if it writes
a unique name to the /proc/container file. If the process does
not already belong to a container and the name is unique, 
a container is created and the calling process becomes the root.
Reading from the file returns the name of the container.
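
From userspace, the write/read protocol described above might be driven
roughly like this. It is a sketch only: the sketch_* helper names are
hypothetical, /proc/container exists only on a kernel carrying this RFC
patch, and the path is passed in as a parameter so the helpers can be
exercised against an ordinary file. Per container_write() in this patch,
the name has to be shorter than MAX_CONTAINER_NAME_LEN (32) bytes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write a container name to the control file; returns 0 on success. */
static int sketch_container_join(const char *path, const char *name)
{
	size_t len = strlen(name);
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, name, len);
	close(fd);
	return (n == (ssize_t)len) ? 0 : -1;
}

/* Read the container name back into buf; returns 0 on success. */
static int sketch_container_name(const char *path, char *buf, size_t buflen)
{
	ssize_t n;
	int fd;

	if (buflen == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, buflen - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	/* The kernel side appends a newline; strip it if present. */
	if (n > 0 && buf[n - 1] == '\n')
		buf[n - 1] = '\0';
	return 0;
}
```

A caller on a patched kernel would presumably pass "/proc/container" as
the path; a second join attempt from the same process is expected to fail,
since container_write() returns -EPERM once current->container is set.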

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 kernel/Makefile        |    2 
 kernel/container_api.c |  116 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+), 1 deletion(-)

Index: linux-2.6.15-rc1/kernel/Makefile
===================================================================
--- linux-2.6.15-rc1.orig/kernel/Makefile	2005-12-08 19:43:25.000000000 -0500
+++ linux-2.6.15-rc1/kernel/Makefile	2005-12-08 19:44:33.000000000 -0500
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
-	    container.o
+	    container.o container_api.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
Index: linux-2.6.15-rc1/kernel/container_api.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc1/kernel/container_api.c	2005-12-08 19:46:10.000000000 -0500
@@ -0,0 +1,116 @@
+/*
+ * External Interface to containers
+ *
+ * This is only for quick bootstrapping the container support
+ * A proper external API needs to be found
+ *
+ * Copyright (C) Hubertus Franke, IBM Corp. 2005 <frankeh@watson.ibm.com>
+ *
+ */
+
+/* Changes
+ *
+ * 11/22/2005:  Created
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/timer.h>
+#include <linux/mm.h>
+#include <linux/container.h>
+
+MODULE_LICENSE("GPL");
+
+#define DPRINTK( fmt, args...)  // printk( "%s: " fmt, __FUNCTION__, ##args)
+
+/****************************************************************
+ *		P R O C   F S   S T U F F
+ ****************************************************************/
+
+static ssize_t container_write(struct file *file, const char __user *ubuf,
+			       size_t count, loff_t *p)
+{
+	const char *delims = " \t\n";
+	char kbuf[MAX_CONTAINER_NAME_LEN];
+	char *cptr;
+	char *cname;
+	int rc;
+
+	if (current->container)
+		return -EPERM;
+	if (count >= MAX_CONTAINER_NAME_LEN)
+		return -EINVAL;
+	if (copy_from_user(kbuf, ubuf, count))
+		return -EFAULT;
+	kbuf[count] = '\0';	/* count < MAX_CONTAINER_NAME_LEN, checked above */
+
+	cptr = kbuf;
+	cname = strsep(&cptr,delims);
+	DPRINTK("<%s:%d>: <%s>\n", current->comm, task_pid(current), cname);
+	rc = container_new(cname);
+	if (rc < 0)
+		return rc;
+	return count;
+}
+
+static ssize_t container_read(struct file *file, char __user *ubuf,
+	       		      size_t count, loff_t *ppos)
+{
+	char kbuf[MAX_CONTAINER_NAME_LEN];
+	int len;
+	char *cname;
+	loff_t __ppos = *ppos;
+
+	cname = current->container ? current->container->name : "";
+	len = sprintf(kbuf,"%s\n",cname);
+	if (__ppos >= len)
+		return 0;
+	if (count > len-__ppos)
+		count = len-__ppos;
+	if (copy_to_user(ubuf, kbuf+__ppos, count))
+		return -EFAULT;
+	*ppos = __ppos + count;
+	DPRINTK("%s: caller <%s:%d>: <%s>\n",
+		current->comm, task_pid(current), cname);
+	return count;
+}
+
+static struct file_operations container_proc_operations = {
+	.read  = container_read,
+	.write = container_write,
+};
+
+/****************************************************************
+ *
+ ****************************************************************/
+
+static int __init container_init(void)
+{
+	int rc = 0;
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("container", S_IWUGO|S_IRUGO, NULL);
+	if (entry)
+		entry->proc_fops = &container_proc_operations;
+	else
+		rc = -EINVAL;
+
+	/* Other initialization */
+
+	if (rc)
+		remove_proc_entry("container", NULL);
+	return rc;
+}
+
+static void __exit container_exit(void)
+{
+}
+
+module_init(container_init);
+module_exit(container_exit);
+

--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][patch 18/21] PID Virtualization: Implementation of low level virtualization functions
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (16 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 17/21] PID Virtualization: /proc/container filesystem Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 19/21] PID Virtualization: Handle special case vpid return cases Hubertus Franke
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G5-virtfunct-impl.patch --]
[-- Type: text/plain, Size: 4146 bytes --]

We finally utilize the pid space implementation to obtain a real virtualization
inside the pid/vpid conversion functions. Care has been taken to retain
the fast path (either in global context or in the same pidspace) as inline, 
while the exception case (typically involves checking for container root)
is handled separately.
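
The fast path and the exception path can be modelled in userspace as
below. This is a simplified sketch rather than the kernel code: the
sketch_* types and names are hypothetical, the shift of 22 is an
assumption (64-bit, no CONFIG_BASE_SMALL), and locking is ignored. The
helper sketch_build_example() sets up the A/B/C tree from the
documentation patch (A = bash <0|913>, B = contexe <0|1087>,
C = /bin/bash <1|2>).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SHIFT 22
#define SKETCH_MASK  ((1 << SKETCH_SHIFT) - 1)

struct sketch_task;

struct sketch_container {
	int pidspace_id;
	const struct sketch_task *init_proc;	/* root proc */
	int init_pid;				/* kernel pid of root */
};

struct sketch_task {
	int pid;				/* kernel pid <pidspace | vpid> */
	const struct sketch_container *container; /* NULL: global container */
};

/* Translate a kernel pid into the vpid seen by ctx, or -1 if invisible. */
static int sketch_pid_to_vpid_ctx(int pid, const struct sketch_task *ctx)
{
	int psid_ctx, psid_pid, vpid;

	if (!ctx->container)		/* fast path: global context */
		return pid;

	psid_ctx = ctx->pid >> SKETCH_SHIFT;
	psid_pid = pid >> SKETCH_SHIFT;
	vpid = pid & SKETCH_MASK;

	if (psid_ctx == psid_pid)	/* fast path: same pidspace */
		return vpid;

	/* exception path: the init proc lives physically in pidspace 0 */
	if (ctx == ctx->container->init_proc)
		return psid_pid == ctx->container->pidspace_id ? vpid : -1;
	if (vpid == ctx->container->init_pid)
		return 1;		/* container root appears as pid 1 */
	return -1;
}

/* Build the A/B/C example tree from the documentation patch. */
static void sketch_build_example(struct sketch_container *cont,
				 struct sketch_task *a,
				 struct sketch_task *b,
				 struct sketch_task *c)
{
	cont->pidspace_id = 1;
	cont->init_proc = b;
	cont->init_pid = 1087;
	a->pid = 913;
	a->container = NULL;
	b->pid = 1087;
	b->container = cont;
	c->pid = (1 << SKETCH_SHIFT) | 2;
	c->container = cont;
}
```

Under these assumptions the model reproduces the expectations listed in
the documentation patch: C seen from B is 2, B seen from C is 1, A seen
from C is -1, and everything seen from A keeps its kernel pid.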

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 include/linux/sched.h |   49 +++++++++++++++++++++++++++++++++++++++++++------
 kernel/container.c    |   28 ++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 6 deletions(-)

Index: linux-2.6.15-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc1.orig/include/linux/sched.h	2005-12-08 19:43:25.000000000 -0500
+++ linux-2.6.15-rc1/include/linux/sched.h	2005-12-08 19:46:18.000000000 -0500
@@ -871,9 +871,25 @@ static inline pid_t process_group(const 
  *  pid domain translation functions:
  *	- from kernel to user pid domain
  */
+
+extern pid_t __pid_to_vpid_ctx_excp(pid_t pid, int psid_pid,
+				     const struct task_struct *ctx);
+
 static inline pid_t pid_to_vpid_ctx(pid_t pid, const struct task_struct *ctx)
 {
-	return pid;
+	int psid_pid, psid_ctx;
+
+	if (!ctx->container)
+		return pid;
+
+	psid_ctx = pid_to_pidspace(ctx->__pid);
+	psid_pid = pid_to_pidspace(pid);
+	pid      = pidspace_pid_to_vpid(pid);
+
+	if (likely(psid_ctx == psid_pid))
+		return pid;
+
+	return __pid_to_vpid_ctx_excp(pid, psid_pid, ctx);
 }
 
 static inline pid_t pid_to_vpid(pid_t pid)
@@ -885,9 +901,11 @@ static inline pid_t pgid_to_vpgid_ctx(pi
 {
 	int isgrp = (pid < 0) ;
 
-	if (isgrp) pid = -pid;
+	if (isgrp)
+		pid = -pid;
 	pid = pid_to_vpid_ctx(pid, ctx);
-	if (isgrp) pid = -pid;
+	if (isgrp && pid != -1)
+		pid = -pid;
 	return pid;
 }
 
@@ -896,13 +914,32 @@ static inline pid_t pgid_to_vpgid(pid_t 
 	return pgid_to_vpgid_ctx(pid, current);
 }
 
+extern pid_t __vpid_to_pid_excp(pid_t pid);
+
 static inline pid_t vpid_to_pid(pid_t pid)
 {
-	return pid;
+	if (!current->container)
+		return pid;
+
+	if (pid == 1)
+		return current->container->init_pid;
+
+	if (!pid_to_pidspace(pid)) {
+		int psid = pid_to_pidspace(current->__pid);
+		return pidspace_vpid_to_pid(psid, pid);
+	}
+	return __vpid_to_pid_excp(pid);
 }
 
 static inline pid_t vpgid_to_pgid(pid_t pid)
 {
+	int isgrp = (pid < 0) ;
+
+	if (isgrp)
+		pid = -pid;
+	pid = vpid_to_pid(pid);
+	if (isgrp && pid != -1)
+		pid = -pid;
 	return pid;
 }
 
@@ -932,7 +969,7 @@ static inline pid_t task_tgid(const stru
 static inline pid_t task_vpid_ctx(const struct task_struct *p,
 				   const struct task_struct *ctx)
 {
-	return task_pid(p);
+	return pid_to_vpid_ctx(task_pid(p), ctx);
 }
 
 static inline pid_t task_vpid(const struct task_struct *p)
@@ -964,7 +1001,7 @@ static inline pid_t task_vtgid(const str
 
 static inline pid_t virt_process_group(const struct task_struct *p)
 {
-	return process_group(p);
+	return pid_to_vpid(process_group(p));
 }
 
 static inline unsigned int task_pidspace_id(const struct task_struct *p)
Index: linux-2.6.15-rc1/kernel/container.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/container.c	2005-12-08 19:44:26.000000000 -0500
+++ linux-2.6.15-rc1/kernel/container.c	2005-12-08 19:46:18.000000000 -0500
@@ -138,3 +138,31 @@ out_unlock:
 	return rc;
 }
 
+pid_t __pid_to_vpid_ctx_excp(pid_t pid, int pidspace_id,
+			     const struct task_struct *ctx)
+{
+	/* Figure out whether pid, which is virtual to pidspace_id,
+	 * is meaningful to ctx (which is in a different pidspace).
+	 * Needed since a container's init_proc resides physically in psid=0.
+	 */
+	if (unlikely(ctx == ctx->container->init_proc)) {
+		if (pidspace_id != ctx->container->pidspace_id)
+			pid = -1;
+		return pid;
+	}
+	if (pid == ctx->container->init_pid)
+		return 1;
+	return -1;
+}
+
+pid_t __vpid_to_pid_excp(pid_t pid)
+{
+	/* We only let a real pid pass as a vpid if it lies in the
+	 * pidspace of current's container; anything else is rejected.
+	 */
+	if (current->container->pidspace_id == pid_to_pidspace(pid))
+		return pid;
+	return -1;
+}
+
+

--
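
For readers following the group-id helpers in this patch, here is a minimal
user-space sketch of the sign handling in vpgid_to_pgid() and
pgid_to_vpgid_ctx(): a negative argument denotes a process group, the
magnitude is translated, and the sign is restored unless the translation
failed with -1. toy_vpid_to_pid() and its offset of 1000 are hypothetical
stand-ins for the real vpid_to_pid(); they are not part of the patch.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical stand-in for vpid_to_pid(): virtual pids 1..99 map to
 * kernel pids 1001..1099; anything else is not visible (-1). */
static pid_t toy_vpid_to_pid(pid_t vpid)
{
	assert(vpid >= 0);	/* callers strip the sign first */
	if (vpid >= 1 && vpid <= 99)
		return 1000 + vpid;
	return -1;
}

/* Same shape as vpgid_to_pgid() in the patch. */
static pid_t toy_vpgid_to_pgid(pid_t vpgid)
{
	int isgrp = (vpgid < 0);

	if (isgrp)
		vpgid = -vpgid;
	vpgid = toy_vpid_to_pid(vpgid);
	/* Do not negate the failure value: -(-1) would alias pid 1. */
	if (isgrp && vpgid != -1)
		vpgid = -vpgid;
	return vpgid;
}
```

The `pid != -1` test is presumably there so that a failed lookup of a group
id stays -1 instead of turning into 1, which would look like init.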


* [RFC][patch 19/21] PID Virtualization: Handle special case vpid return cases
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (17 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 18/21] PID Virtualization: Implementation of low level virtualization functions Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 20/21] PID Virtualization: per container /proc filesystem Hubertus Franke
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: G6-vpid-rc-special-handling.patch --]
[-- Type: text/plain, Size: 2828 bytes --]

Certain places that return virtual pids need special handling to return
the appropriate information back to the user, in particular when the
translation fails (negative vpid) because the task is not visible.

Signed-off-by: Hubertus Franke <frankeh@watson.ibm.com>
--

 fs/proc/array.c |   17 ++++++++++-------
 fs/proc/base.c  |    2 ++
 kernel/signal.c |    8 ++++++--
 3 files changed, 18 insertions(+), 9 deletions(-)

Index: linux-2.6.15-rc1/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/base.c	2005-12-12 16:44:15.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/base.c	2005-12-12 16:44:33.000000000 -0500
@@ -2103,6 +2103,8 @@ static int get_tgid_list(int index, unsi
 
 	for ( ; p != &init_task; p = next_task(p)) {
 		int tgid = task_vpid_ctx(p, current);
+		if (tgid < 0)
+			continue;
 		if (!pid_alive(p))
 			continue;
 		if (--index >= 0)
Index: linux-2.6.15-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/array.c	2005-12-12 16:44:15.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/array.c	2005-12-12 16:44:15.000000000 -0500
@@ -164,13 +164,16 @@ static inline char * task_state(struct t
 	pid_t ppid, tpid;
 
 	read_lock(&tasklist_lock);
-	if (pid_alive(p))
+	if (pid_alive(p)) {
 		ppid = task_vtgid_ctx(p->group_leader->real_parent, current);
-	else
+		if (ppid < 0) ppid = 1;
+	} else {
 		ppid = 0;
-	if (pid_alive(p) && p->ptrace)
+	}
+	if (pid_alive(p) && p->ptrace) {
 		tpid = task_vppid_ctx(p, current);
-	else
+		if (tpid < 0) tpid = 0;
+	} else
 		tpid = 0;
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
@@ -183,8 +186,8 @@ static inline char * task_state(struct t
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		(p->sleep_avg/1024)*100/(1020000000/1024),
-	       	task_vtgid_ctx(p,current),
-		task_vpid_ctx(p,current),
+	       	task_vtgid_ctx(p, current),
+		task_vpid_ctx(p, current),
 		ppid, tpid,
 		p->uid, p->euid, p->suid, p->fsuid,
 		p->gid, p->egid, p->sgid, p->fsgid);
Index: linux-2.6.15-rc1/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/signal.c	2005-12-12 16:44:15.000000000 -0500
+++ linux-2.6.15-rc1/kernel/signal.c	2005-12-12 16:44:32.000000000 -0500
@@ -2266,6 +2266,12 @@ static int do_tkill(int tgid, int pid, i
 	struct siginfo info;
 	struct task_struct *p;
 
+	pid  = vpid_to_pid(pid);
+	if (pid < 0)
+		return pid;
+	tgid = vpid_to_pid(tgid);
+	if (tgid < 0)
+		return tgid;
 	error = -ESRCH;
 	info.si_signo = sig;
 	info.si_errno = 0;
@@ -2273,8 +2279,6 @@ static int do_tkill(int tgid, int pid, i
 	info.si_pid = task_vtgid(current);
 	info.si_uid = current->uid;
 
-	pid  = vpid_to_pid(pid);
-	tgid = vpid_to_pid(tgid);
 	read_lock(&tasklist_lock);
 	p = find_task_by_pid(pid);
 	if (p && (tgid <= 0 || task_tgid(p) == tgid)) {

--
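
The fallbacks applied in this patch can be summarized in a small user-space
sketch. The function names below are made up for illustration; only the
rules themselves (invisible parent reported as 1, invisible tracer reported
as 0, invisible tasks skipped in the listing) are taken from the hunks above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Parent pid as reported in the status file: an invisible parent
 * (negative vpid) is reported as vpid 1. */
static pid_t report_ppid(pid_t vppid)
{
	return vppid < 0 ? 1 : vppid;
}

/* Tracer pid: an invisible tracer is reported as "not traced" (0). */
static pid_t report_tracer_pid(pid_t vtpid)
{
	return vtpid < 0 ? 0 : vtpid;
}

/* Listing: copy only visible vpids, the way get_tgid_list() now skips
 * entries with tgid < 0. Returns the number of entries copied. */
static size_t list_visible(const pid_t *vpids, size_t n, pid_t *out)
{
	size_t i, count = 0;

	assert(vpids != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (vpids[i] < 0)
			continue;
		out[count++] = vpids[i];
	}
	return count;
}
```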


* [RFC][patch 20/21] PID Virtualization: per container /proc filesystem
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (18 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 19/21] PID Virtualization: Handle special case vpid return cases Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 14:36 ` [RFC][patch 21/21] PID Virtualization: pidspace parent : signal behavior Hubertus Franke
  2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: Cedric Le Goater, Serge E Hallyn

[-- Attachment #1: G7-percontainer-procfs.patch --]
[-- Type: text/plain, Size: 2748 bytes --]

Provide the interception and virtualization of the proc interface.
In particular, from within a container, processes need to be identified
by their virtual pids under /proc, and the processes shown need to be
limited to the ones in the container.
NOTE: This is only temporary, since it exhibits some performance problems.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E Hallyn <serue@us.ibm.com>
--

 fs/proc/base.c  |    2 ++
 fs/proc/inode.c |   28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

Index: linux-2.6.15-rc1/fs/proc/inode.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/inode.c	2005-12-12 11:46:46.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/inode.c	2005-12-12 16:27:15.000000000 -0500
@@ -190,6 +190,33 @@ out_mod:
 	return NULL;
 }			
 
+/* This service performs checks on virtualization marker to allow multiple
+ * dentries with the same name in the dcache.
+ */
+
+#define procpid_check_marker(task, data) (task->container == data)
+static int proc_root_compare(struct dentry *dentry, struct qstr *a,
+			      struct qstr *b)
+{
+	/* CAUTION: to evaluate pointer of target dentry, we assume parameter
+	 * 'a' is its 'd_name' field. This is always the case anyway.
+	 */
+	struct dentry* d = (struct dentry *)
+		((unsigned long) a -
+		((unsigned long) &dentry->d_name - (unsigned long) dentry));
+	int result = 1;
+
+	if (a->len == b->len && !memcmp(a->name, b->name, a->len))
+		result = !procpid_check_marker(current, d->d_fsdata);
+
+	return result;
+}
+
+static struct dentry_operations root_dentry_operations =
+{
+	.d_compare	= proc_root_compare,
+};
+
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
 	struct inode * root_inode;
@@ -213,6 +240,7 @@ int proc_fill_super(struct super_block *
 	s->s_root = d_alloc_root(root_inode);
 	if (!s->s_root)
 		goto out_no_root;
+	s->s_root->d_op = &root_dentry_operations;
 	return 0;
 
 out_no_root:
Index: linux-2.6.15-rc1/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc1.orig/fs/proc/base.c	2005-12-12 16:27:11.000000000 -0500
+++ linux-2.6.15-rc1/fs/proc/base.c	2005-12-12 16:27:15.000000000 -0500
@@ -1497,6 +1497,7 @@ static struct dentry *proc_lookupfd(stru
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
 	ei->op.proc_get_link = proc_fd_link;
+	dentry->d_fsdata = current->container;
 	dentry->d_op = &tid_fd_dentry_operations;
 	d_add(dentry, inode);
 	return NULL;
@@ -2002,6 +2003,7 @@ struct dentry *proc_pid_lookup(struct in
 	inode->i_nlink = 4;
 #endif
 
+	dentry->d_fsdata = current->container;
 	dentry->d_op = &pid_base_dentry_operations;
 
 	died = 0;

--
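
As a rough illustration of what proc_root_compare() does, the sketch below
recovers the enclosing dentry from a pointer to its d_name member (the same
idea the kernel's container_of() macro expresses) and then applies the
d_compare convention, where 0 means "match". The toy_* types are simplified
assumptions, not the kernel's real layouts, and the current container is
passed in explicitly instead of being read from current.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy versions of the kernel types; field names follow the patch. */
struct toy_qstr {
	const char *name;
	unsigned int len;
};

struct toy_dentry {
	int d_flags;		/* padding so d_name is not at offset 0 */
	struct toy_qstr d_name;
	void *d_fsdata;		/* container that created the entry */
};

/* Recover the enclosing dentry from a pointer to its d_name member. */
static struct toy_dentry *dentry_of_name(struct toy_qstr *a)
{
	assert(a != NULL);
	return (struct toy_dentry *)
		((char *)a - offsetof(struct toy_dentry, d_name));
}

/* d_compare convention: 0 means "match", non-zero means "no match". */
static int toy_root_compare(struct toy_qstr *a, struct toy_qstr *b,
			    void *current_container)
{
	struct toy_dentry *d = dentry_of_name(a);

	if (a->len == b->len && !memcmp(a->name, b->name, a->len))
		return d->d_fsdata != current_container;
	return 1;
}
```

With this rule, two dcache entries with the same name can coexist, and a
lookup only matches the one created from the caller's own container.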


* [RFC][patch 21/21] PID Virtualization: pidspace parent : signal behavior
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (19 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 20/21] PID Virtualization: per container /proc filesystem Hubertus Franke
@ 2005-12-15 14:36 ` Hubertus Franke
  2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
  21 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 14:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: Cedric Le Goater

[-- Attachment #1: G8-prohibit-init-kill.patch --]
[-- Type: text/plain, Size: 825 bytes --]

Make sure the parent process of a pidspace discards signals sent
from processes in that pidspace.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>

--

 kernel/signal.c |    4 ++++
 1 files changed, 4 insertions(+)

Index: linux-2.6.15-rc1/kernel/signal.c
===================================================================
--- linux-2.6.15-rc1.orig/kernel/signal.c	2005-12-08 01:50:37.000000000 -0500
+++ linux-2.6.15-rc1/kernel/signal.c	2005-12-08 01:50:37.000000000 -0500
@@ -651,6 +651,10 @@ static int check_kill_permission(int sig
 	if (!valid_signal(sig))
 		return error;
 	error = -EPERM;
+
+	if (task_vpid_ctx(t, current) == 1)
+		return error;
+
 	if ((info == SEND_SIG_NOINFO || (!is_si_special(info) && SI_FROMUSER(info)))
 	    && ((sig != SIGCONT) ||
 		(current->signal->session != t->signal->session))

--


* Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
                   ` (20 preceding siblings ...)
  2005-12-15 14:36 ` [RFC][patch 21/21] PID Virtualization: pidspace parent : signal behavior Hubertus Franke
@ 2005-12-15 19:49 ` Gerrit Huizenga
  2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
                     ` (2 more replies)
  21 siblings, 3 replies; 37+ messages in thread
From: Gerrit Huizenga @ 2005-12-15 19:49 UTC (permalink / raw)
  To: Hubertus Franke, ckrm-tech
  Cc: linux-kernel, lse-tech, vserver, Andrew Morton, Rik van Riel, pagg


On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> This patchset is a followup to the posting by Serge.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2
> 
> In this patchset here, we are providing the pid virtualization mentioned
> in serge's posting.
> 
> > I'm part of a project implementing checkpoint/restart processes.
> > After a process or group of processes is checkpointed, killed, and
> > restarted, the changing of pids could confuse them.  There are many
> > other such issues, but we wanted to start with pids.
> >
> > This patchset introduces functions to access task->pid and ->tgid,
> > and updates ->pid accessors to use the functions.  This is in
> > preparation for a subsequent patchset which will separate the kernel
> > and virtualized pidspaces.  This will allow us to virtualize pids
> > from users' pov, so that, for instance, a checkpointed set of
> > processes could be restarted with particular pids.  Even though their
> > kernel pids may already be in use by new processes, the checkpointed
> > processes can be started in a new user pidspace with their old
> > virtual pid.  This also gives vserver a simpler way to fake vserver
> > init processes as pid 1.  Note that this does not change the kernel's
> > internal idea of pids, only what users see.
> >
> > The first 12 patches change all locations which access ->pid and
> > ->tgid to use the inlined functions.  The last patch actually
> > introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> > to __pid and __tgid to make sure any uncaught users error out.
> >
> > Does something like this, presumably after much working over, seem
> > mergeable?
> 
> These patches build on top of serge's posted patches (if necessary
> we can repost them here).
> 
> PID Virtualization is based on the concept of a container.
> The ultimate goal is to checkpoint/restart containers. 
> 
> The mechanism to start a container 
> is to 'echo "container_name" > /proc/container'  which creates a new
> container and associates the calling process with it. All subsequently
> forked tasks then belong to that container.
> There is a separate pid space associated with each container.
> Only processes/task belonging to the same container "see" each other.
> The exception is an implied default system container that has 
> a global view.
> 
> The following patches accomplish 3 things:
> 1) identify the locations at the user/kernel boundary where pids and 
>    related ids ( pgrp, sessionids, .. ) need to be (de-)virtualized and
>    call appropriate (de-)virtualization functions.
> 2) provide the virtualization implementation in these functions.
> 3) implement a container object and a simple /proc interface to create one
> 4) provide a per container /proc/fs
> 
> -- Hubertus Franke    (frankeh@watson.ibm.com)
> -- Cedric Le Goater   (clg@fr.ibm.com)
> -- Serge E Hallyn     (serue@us.ibm.com)
> -- Dave Hansen        (haveblue@us.ibm.com)

I think this is actually quite interesting in a number of ways - it
might actually be a way of cleanly addressing several current out
of tree problems, several of which are independently (occasionally) striving
for mainline adoption:  vserver, openvz, cluster checkpoint/restart.

I think perhaps this could also be the basis for a CKRM "class"
grouping as well.  Rather than maintaining an independent class
affiliation for tasks, why not have a class devolve (evolve?) into
a "container" as described here.  The container provides much of
the same grouping capabilities as a class as far as I can see.  The
right information would be available for scheduling and IO resource
management.  The memory component of CKRM is perhaps a bit tricky
still, but an overall strategy (can I use that word here? ;-) might
be to use these "containers" as the single intrinsic grouping mechanism
for vserver, openvz, application checkpoint/restart, resource
management, and possibly others?

Opinions, especially from the CKRM folks?  This might even be useful
to the PAGG folks as a grouping mechanism, similar to their jobs or
containers.

"This patchset solves multiple problems".

gerrit


* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
@ 2005-12-15 20:02   ` Dave Hansen
  2005-12-15 20:12     ` Gerrit Huizenga
  2005-12-15 22:52     ` Matt Helsley
  2005-12-15 22:02   ` Hubertus Franke
  2005-12-16  2:20   ` [ckrm-tech] " Matt Helsley
  2 siblings, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2005-12-15 20:02 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Hubertus Franke, ckrm-tech, Linux Kernel Mailing List, LSE,
	vserver, Andrew Morton, Rik van Riel, pagg

On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well.  Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here.

Wasn't one of the grand schemes of CKRM to be able to have application
instances be shared?  For instance, running a single DB2, Oracle, or
Apache server, and still accounting for all of the classes separately.
If so, that wouldn't work with a scheme that requires process
separation.

But, sharing the application instances is probably mostly (only)
important for databases anyway.  I would imagine that most of the
overhead in a server like an Apache instance is for the page cache for
content, as well as a bit for Apache's executables themselves.  The
container schemes should be able to share page cache for both cases.
The main issues would be managing multiple configurations, and the
increased overhead from having more processes around than with a single
server.

There might also be some serious restrictions on containerized
applications.  For instance, taking a running application, moving it out
of one container, and into another might not be feasible.  Is this
something that is common or desired in the current CKRM framework?

-- Dave



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
@ 2005-12-15 20:12     ` Gerrit Huizenga
  2005-12-15 22:52     ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Gerrit Huizenga @ 2005-12-15 20:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hubertus Franke, ckrm-tech, Linux Kernel Mailing List, LSE,
	vserver, Andrew Morton, Rik van Riel, pagg


On Thu, 15 Dec 2005 12:02:41 PST, Dave Hansen wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well.  Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here.
> 
> Wasn't one of the grand schemes of CKRM to be able to have application
> instances be shared?  For instance, running a single DB2, Oracle, or
> Apache server, and still accounting for all of the classes separately.
> If so, that wouldn't work with a scheme that requires process
> separation.
 
 Yes, it is.  However, that may be a sub-case where a single, large
 server application actually jumps around from container to container.
 I consider that a detail (well, our DB2 folks don't but I'm all for
 solving one problem at a time ;-) and we can work some of that out
 later.  They are less concerned about the application being shared
 or part of multiple "classes" simultaneously, as opposed to being
 appropriately resource constrained based on the (large) transactions
 that they are handling on behalf of a user.  So, if it were possible
 to jump from one container to another dynamically, then the appropriate
 resource management stuff could be handled at some other level.

> There might also be some serious restrictions on containerized
> applications.  For instance, taking a running application, moving it out
> of one container, and into another might not be feasible.  Is this
> something that is common or desired in the current CKRM framework?

 Desired, but primarily for large server applications.  And, I don't
 think I see much in this patch set that makes that infeasible.  If
 containers are going to work, you are going to have to have a mechanism
 to get applications into them and to move them anyway, right?  While
 it would be nice if that were dirt-cheap, if it isn't, applications
 may have to adapt their usage of them based on the cost.  Not a big
 deal as I see it.

gerrit


* Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
  2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
@ 2005-12-15 22:02   ` Hubertus Franke
  2005-12-16  2:20   ` [ckrm-tech] " Matt Helsley
  2 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-15 22:02 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: ckrm-tech, linux-kernel, lse-tech, vserver, Andrew Morton,
	Rik van Riel, pagg

On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:

> > PID Virtualization is based on the concept of a container.
> > The ultimate goal is to checkpoint/restart containers. 
> > 
> > The mechanism to start a container 
> > is to 'echo "container_name" > /proc/container'  which creates a new
> > container and associates the calling process with it. All subsequently
> > forked tasks then belong to that container.
> > There is a separate pid space associated with each container.
> > Only processes/task belonging to the same container "see" each other.
> > The exception is an implied default system container that has 
> > a global view.
> > 
> > The following patches accomplish 3 things:
> > 1) identify the locations at the user/kernel boundary where pids and 
> >    related ids ( pgrp, sessionids, .. ) need to be (de-)virtualized and
> >    call appropriate (de-)virtualization functions.
> > 2) provide the virtualization implementation in these functions.
> > 3) implement a container object and a simple /proc interface to create one
> > 4) provide a per container /proc/fs
> > 
> > -- Hubertus Franke    (frankeh@watson.ibm.com)
> > -- Cedric Le Goater   (clg@fr.ibm.com)
> > -- Serge E Hallyn     (serue@us.ibm.com)
> > -- Dave Hansen        (haveblue@us.ibm.com)
> 
> I think this is actually quite interesting in a number of ways - it
> might actually be a way of cleanly addressing several current out
> > of tree problems, several of which are independently (occasionally) striving
> for mainline adoption:  vserver, openvz, cluster checkpoint/restart.

Indeed, the entire set might be able to benefit with respect to pid
virtualization. I think we are quite open to embracing a larger set of
applications of pid virtualization.

> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well.  Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here.  The container provides much of
> the same grouping capabilities as a class as far as I can see.  The
> > right information would be available for scheduling and IO resource
> management.  The memory component of CKRM is perhaps a bit tricky
> still, but an overall strategy (can I use that word here? ;-) might
> be to use these "containers" as the single intrinsic grouping mechanism
> for vserver, openvz, application checkpoint/restart, resource
> management, and possibly others?
> 
> Opinions, especially from the CKRM folks?  This might even be useful
> to the PAGG folks as a grouping mechanism, similar to their jobs or
> containers.
> 
Not being too alien to the CKRM concept: yes, there is some nice synergy
here, as well as with PAGG and SGI's jobs. CKRM provides resource
constraints and runtime enforcement based on some grouping of
processes. Similar to containers, class membership is inherited (if
that's still the case from the last time I looked at it) until explicitly
changed. Containers in particular provide another dimension,
namely the ability to constrain the "visibility" of resources and objects,
in this particular case pids as the first resource used.

> "This patchset solves multiple problems".

> gerrit
> 
-- 
Hubertus Franke <frankeh@watson.ibm.com>



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
  2005-12-15 20:12     ` Gerrit Huizenga
@ 2005-12-15 22:52     ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Matt Helsley @ 2005-12-15 22:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Gerrit Huizenga, Hubertus Franke, CKRM-Tech,
	Linux Kernel Mailing List, LSE, vserver, Andrew Morton,
	Rik van Riel, pagg

On Thu, 2005-12-15 at 12:02 -0800, Dave Hansen wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well.  Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here.
> 
> Wasn't one of the grand schemes of CKRM to be able to have application
> instances be shared?  For instance, running a single DB2, Oracle, or
> Apache server, and still accounting for all of the classes separately.
> If so, that wouldn't work with a scheme that requires process
> separation.

	f-series CKRM manages tasks via the task struct -- this means it
manages each thread and not a process. Since, generally speaking, each
thread is assigned the same class as the main thread this effectively
manages processes. So yes, separate DB2, Oracle, Apache, etc. threads
could be assigned to different classes. This is definitely something a
strict container could not do.

> But, sharing the application instances is probably mostly (only)
> important for databases anyway.  I would imagine that most of the

<nit>
I wouldn't say only for databases. Human-interaction-bound processes can
share instances (gnome-terminal). Granted, these probably would never
need to span a container or a class...
</nit>

> overhead in a server like an Apache instance is for the page cache for
> content, as well as a bit for Apache's executables themselves.  The
> container schemes should be able to share page cache for both cases.
> The main issues would be managing multiple configurations, and the
> increased overhead from having more processes around than with a single
> server.
> 
> There might also be some serious restrictions on containerized
> applications.  For instance, taking a running application, moving it out
> of one container, and into another might not be feasible.  Is this
> something that is common or desired in the current CKRM framework?
> 
> -- Dave

	Yes, being able to move a process from one class to another is
important. This can happen as a consequence of the system administrator
deciding to change the distribution of resources without having to
restart services. The change in distribution can be done by changing
shares of a class, manually moving processes between classes, by making
or deleting classes, or a combination of these operations.

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
  2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
  2005-12-15 22:02   ` Hubertus Franke
@ 2005-12-16  2:20   ` Matt Helsley
  2005-12-16  3:28     ` Gerrit Huizenga
  2 siblings, 1 reply; 37+ messages in thread
From: Matt Helsley @ 2005-12-16  2:20 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Hubertus Franke, CKRM-Tech, LKML, lse-tech, vserver,
	Andrew Morton, Rik van Riel, pagg

On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > This patchset is a followup to the posting by Serge.
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2
> > 
> > In this patchset here, we are providing the pid virtualization mentioned
> > in serge's posting.
> > 
> > > I'm part of a project implementing checkpoint/restart processes.
> > > After a process or group of processes is checkpointed, killed, and
> > > restarted, the changing of pids could confuse them.  There are many
> > > other such issues, but we wanted to start with pids.
> > >
> > > This patchset introduces functions to access task->pid and ->tgid,
> > > and updates ->pid accessors to use the functions.  This is in
> > > preparation for a subsequent patchset which will separate the kernel
> > > and virtualized pidspaces.  This will allow us to virtualize pids
> > > from users' pov, so that, for instance, a checkpointed set of
> > > processes could be restarted with particular pids.  Even though their
> > > kernel pids may already be in use by new processes, the checkpointed
> > > processes can be started in a new user pidspace with their old
> > > virtual pid.  This also gives vserver a simpler way to fake vserver
> > > init processes as pid 1.  Note that this does not change the kernel's
> > > internal idea of pids, only what users see.
> > >
> > > The first 12 patches change all locations which access ->pid and
> > > ->tgid to use the inlined functions.  The last patch actually
> > > introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> > > to __pid and __tgid to make sure any uncaught users error out.
> > >
> > > Does something like this, presumably after much working over, seem
> > > mergeable?
> > 
> > These patches build on top of serge's posted patches (if necessary
> > we can repost them here).
> > 
> > PID Virtualization is based on the concept of a container.
> > The ultimate goal is to checkpoint/restart containers. 
> > 
> > The mechanism to start a container 
> > is to 'echo "container_name" > /proc/container'  which creates a new
> > container and associates the calling process with it. All subsequently
> > forked tasks then belong to that container.
> > There is a separate pid space associated with each container.
> > Only processes/task belonging to the same container "see" each other.
> > The exception is an implied default system container that has 
> > a global view.

<snip>

> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well.  Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here.  The container provides much of
> the same grouping capabilities as a class as far as I can see.  The
> right information would be available for scheduling and IO resource
> management.  The memory component of CKRM is perhaps a bit tricky
> still, but an overall strategy (can I use that word here? ;-) might
> be to use these "containers" as the single intrinsic grouping mechanism
> for vserver, openvz, application checkpoint/restart, resource
> management, and possibly others?
> 
> Opinions, especially from the CKRM folks?  This might even be useful
> to the PAGG folks as a grouping mechanism, similar to their jobs or
> containers.
> 
> "This patchset solves multiple problems".
> 
> gerrit

CKRM classes seem too different from containers to merge the two
concepts:

- Classes don't assign class-unique pids to tasks.

- Tasks can move between classes.

- Tasks move between classes without any need for checkpoint/restart.

- Classes show up in a filesystem interface rather than using a file
in /proc to create them. (trivial interface difference)

- There are no "visibility boundaries" to enforce between tasks in
different classes.

- Classes are hierarchical.

- Unless I am mistaken, a container groups processes (Can one thread run
in container A and another in container B?) while a class groups tasks.
Since a task represents a thread or a process, one thread could be in
class A and another in class B.

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16  2:20   ` [ckrm-tech] " Matt Helsley
@ 2005-12-16  3:28     ` Gerrit Huizenga
  2005-12-16 17:35       ` Dave Hansen
  2005-12-17  1:38       ` Matt Helsley
  0 siblings, 2 replies; 37+ messages in thread
From: Gerrit Huizenga @ 2005-12-16  3:28 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Hubertus Franke, CKRM-Tech, LKML, lse-tech, vserver,
	Andrew Morton, Rik van Riel, pagg


On Thu, 15 Dec 2005 18:20:52 PST, Matt Helsley wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > > PID Virtualization is based on the concept of a container.
> > > The ultimate goal is to checkpoint/restart containers. 
> > > 
> > > The mechanism to start a container 
> > > is to 'echo "container_name" > /proc/container'  which creates a new
> > > container and associates the calling process with it. All subsequently
> > > forked tasks then belong to that container.
> > > There is a separate pid space associated with each container.
> > > Only processes/task belonging to the same container "see" each other.
> > > The exception is an implied default system container that has 
> > > a global view.
> 
> <snip>
> 
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well.  Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here.  The container provides much of
> > the same grouping capabilities as a class as far as I can see.  The
> > right information would be available for scheduling and IO resource
> > management.  The memory component of CKRM is perhaps a bit tricky
> > still, but an overall strategy (can I use that word here? ;-) might
> > be to use these "containers" as the single intrinsic grouping mechanism
> > for vserver, openvz, application checkpoint/restart, resource
> > management, and possibly others?
> > 
> > Opinions, especially from the CKRM folks?  This might even be useful
> > to the PAGG folks as a grouping mechanism, similar to their jobs or
> > containers.
> > 
> > "This patchset solves multiple problems".
> > 
> > gerrit
> 
> CKRM classes seem too different from containers to merge the two
> concepts:

I agree that the implementation of pid virtualization and classes have
different characteristics.  However, you bring up interesting points
about the differences...  But I question whether or not they are
relevant to an implementation of resource management.  I'm going out
on a limb here looking at a possibly radical change which might
simplify things so there is only one grouping mechanism in kernel.
I could be wrong but...
 
> - Classes don't assign class-unique pids to tasks.

What part of this is important to resource management?  A container
ID is like a class ID.  Yes, I think container ID's are assigned to
processes rather than tasks, but is that really all that important?

> - Tasks can move between classes.
 
In the pid virtualization, I would think that tasks can move between
containers as well, although it isn't all that useful for most things.
For instance, checkpoint/restart needs to checkpoint a process and all
of its threads if it wants to restart it.  So there may be restrictions
on what you can checkpoint/restart.  Vserver probably wants isolation
at a process boundary, rather than a task boundary.  Most resource
management, e.g. Java, probably doesn't care about task vs. process.

> - Tasks move between classes without any need for checkpoint/restart.
 
That *should* be possible with a generalized container solution.
For instance, just like with classes, you have to move things into
containers in the first place.  And, you could in theory have a classification
engine that helped choose which container to put a task/process in
at creation/instantiation/significant event...

> - Classes show up in a filesystem interface rather than using a file
> in /proc to create them. (trivial interface difference)
 
Yep - there will probably be a /proc or /configfs interface to containers
at some point, I would expect.  No significant difference there.

> - There are no "visibility boundaries" to enforce between tasks in
> different classes.
 
Are there in virtualized pids?  There *can* be - e.g. ps can distinguish,
but it is possible for tasks to interact across container boundaries.
Not ideal for vserver, checkpoint/restart, for instance (makes c/r a
little harder or more limited - signals heading outside the container
may "disappear" when you checkpoint/restart but for apps that c/r, that
probably isn't all that likely).

> - Classes are hierarchical.
 
Conceptually they are.  But are they in the CKRM f series?  I thought
that was one area for simplification.  And, how important is that *really*
for most applications?

> - Unless I am mistaken, a container groups processes (Can one thread run
> in container A and another in container B?) while a class groups tasks.
> Since a task represents a thread or a process one thread could be in
> class A and another in class B.

Definitely useful, and one question is whether pid virtualization is
container isolation, or simply virtualization to enable container
isolation.  If it is an enabling technology, perhaps it doesn't have
that restriction and could be used either way based on resource management
needs or based on vserver or c/r needs...

Debate away... ;-)

gerrit


* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16  3:28     ` Gerrit Huizenga
@ 2005-12-16 17:35       ` Dave Hansen
  2005-12-16 20:45         ` Gerrit Huizenga
  2005-12-16 23:47         ` Hubertus Franke
  2005-12-17  1:38       ` Matt Helsley
  1 sibling, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2005-12-16 17:35 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Matt Helsley, Hubertus Franke, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> In the pid virtualization, I would think that tasks can move between
> containers as well,

I don't think tasks can not be permitted to move between containers.  As
a simple exercise, imagine that you have two processes with the same
pid, one in container A and one in container B.  You wish to have them
both run in container A.  They can't both have the same pid.  What do
you do?
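
The exercise can be made concrete with a toy model.  What follows is a
minimal sketch in C, not code from the patchset: struct container,
vpid_in_use() and container_move() are hypothetical names, and a
container is reduced to nothing more than the set of virtual pids it
has in use.  It only shows that a move has to be refused whenever the
destination pidspace already uses the same vpid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 8   /* arbitrary limit for the sketch */

/* Hypothetical model: a container is only the set of vpids it has in use. */
struct container {
	int vpids[MAX_TASKS];
	size_t count;
};

/* True if some task in container c already uses this vpid. */
bool vpid_in_use(const struct container *c, int vpid)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->vpids[i] == vpid)
			return true;
	return false;
}

/*
 * Try to move the task with the given vpid from src to dst.
 * Returns 0 on success, -1 if the task is not in src, dst is full,
 * or dst already has a task with the same vpid (the collision case).
 */
int container_move(struct container *src, struct container *dst, int vpid)
{
	assert(src != NULL && dst != NULL);

	if (!vpid_in_use(src, vpid) || vpid_in_use(dst, vpid))
		return -1;
	if (dst->count == MAX_TASKS)
		return -1;

	for (size_t i = 0; i < src->count; i++) {
		if (src->vpids[i] == vpid) {
			/* Fill the hole with the last entry; order is irrelevant. */
			src->vpids[i] = src->vpids[src->count - 1];
			src->count--;
			break;
		}
	}
	dst->vpids[dst->count++] = vpid;
	return 0;
}
```

With pid 1 present in both A and B, container_move(&B, &A, 1) fails,
and the only ways out are renumbering the task (which userspace would
notice) or refusing the move.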

I've been talking a lot lately about how important filesystem isolation
between containers is to implement containers properly.  Isolating the
filesystem namespaces makes it much easier to do things like fs-based
shared memory during a checkpoint/resume.  If we want to allow tasks to
move around, we'll have to throw out this entire concept.  That means
that a _lot_ of things get a notch closer to the too-costly-to-implement
category.

-- Dave



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16 17:35       ` Dave Hansen
@ 2005-12-16 20:45         ` Gerrit Huizenga
  2005-12-16 21:10           ` Dave Hansen
  2005-12-16 23:47         ` Hubertus Franke
  1 sibling, 1 reply; 37+ messages in thread
From: Gerrit Huizenga @ 2005-12-16 20:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Matt Helsley, Hubertus Franke, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg


On Fri, 16 Dec 2005 09:35:19 PST, Dave Hansen wrote:
> On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> > In the pid virtualization, I would think that tasks can move between
> > containers as well,
> 
> I don't think tasks can not be permitted to move between containers.  As
> a simple exercise, imagine that you have two processes with the same
> pid, one in container A and one in container B.  You wish to have them
> both run in container A.  They can't both have the same pid.  What do
> you do?
> 
> I've been talking a lot lately about how important filesystem isolation
> between containers is to implement containers properly.  Isolating the
> filesystem namespaces makes it much easier to do things like fs-based
> shared memory during a checkpoint/resume.  If we want to allow tasks to
> move around, we'll have to throw out this entire concept.  That means
> that a _lot_ of things get a notch closer to the too-costly-to-implement
> category.

Interesting...  So how do tasks get *into* a container?  And can they
ever get back "out" of a container?  Are most processes on the system
initially not in a container?  And then they can be stuffed in a container?
And then containers can be moved around or be isolated from each other?

And, is pid virtualization the point where this happens?  Or is that
a slightly higher level?  In other words, is pid virtualization the
full implementation of container isolation?  Or is it a significant
element on which additional policy, restrictions, and usage models
can be built?

gerrit


* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16 20:45         ` Gerrit Huizenga
@ 2005-12-16 21:10           ` Dave Hansen
  2005-12-16 23:40             ` Hubertus Franke
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2005-12-16 21:10 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Matt Helsley, Hubertus Franke, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Fri, 2005-12-16 at 12:45 -0800, Gerrit Huizenga wrote:
> Interesting...  So how do tasks get *into* a container?

Only by inheritance.  

> And can they ever get back "out" of a container?

No.  Think of the pids again.  Even the "outside" of a container, things
like the real init, have to have unique pids.  What if the process's pid
is the same as one in use in the default container?

> Are most processes on the system
> initially not in a container?  And then they can be stuffed in a container?
> And then containers can be moved around or be isolated from each other?

The current idea is that processes are assigned at fork-time.  The
isolation is for the lifetime of the process.

> And, is pid virtualization the point where this happens?  Or is that
> a slightly higher level?  In other words, is pid virtualization the
> full implementation of container isolation?  Or is it a significant
> element on which additional policy, restrictions, and usage models
> can be built?

pid virtualization is simply the one that's easiest to understand, and
the one that demonstrates the largest number of issues.  It is a small
piece of the puzzle, but an important one.

-- Dave



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16 21:10           ` Dave Hansen
@ 2005-12-16 23:40             ` Hubertus Franke
  0 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-16 23:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Gerrit Huizenga, Matt Helsley, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Fri, 2005-12-16 at 13:10 -0800, Dave Hansen wrote:
> On Fri, 2005-12-16 at 12:45 -0800, Gerrit Huizenga wrote:
> > Interesting...  So how do tasks get *into* a container?
> 
> Only by inheritance.  

That is only true today. There is no reason, other than the heavy code
complexity it might introduce (I haven't thought that through), why we
can't at some point move a process group/tree into a container.
This works because for the global container V=R in pid space terms
(read: vpid=realpid). Moving an entire group into a container requires
assigning new kernel pids to each task while keeping the vpid part
constant. There are lots of kpid-related references to update, though.
I don't know whether that's worth the trouble, particularly at this stage.

> 
> > And can they ever get back "out" of a container?
> 
> No.  Think of the pids again.  Even the "outside" of a container, things
> like the real init, have to have unique pids.  What if the process's pid
> is the same as one in use in the default container?

Correct. As in my answer above, moving from the global container into a
new one can be accomplished because in a fresh container all pids are
available, so we can simply reoccupy the same vpids in the new pidspace.
This keeps all user-level "references" and pid values valid.
The only way we could EVER go back is if we could guarantee that the
pids in the global space are free, hence they would have to be reserved.
NOWAY... particularly if migration is involved later on.
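
The asymmetry between moving in and moving out can be sketched as
follows.  This is a minimal illustration in C under stated assumptions,
not the patchset's data structures: struct vtask, alloc_kpid(),
move_into_fresh_container() and can_return_to_global() are hypothetical
names, and the kernel pid allocator is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model: kpid is kernel-wide, vpid is what userspace sees. */
struct vtask {
	int kpid;
	int vpid;
	int container;   /* 0 = global container, where vpid == kpid */
};

/* Stand-in allocator: hands out the next unused kernel pid. */
int alloc_kpid(int *next_kpid)
{
	return (*next_kpid)++;
}

/*
 * Global -> fresh container: the vpid stays constant, so every pid value
 * userspace already holds remains valid; only the kernel pid changes.
 */
void move_into_fresh_container(struct vtask *t, int cid, int *next_kpid)
{
	assert(t->container == 0 && t->kpid == t->vpid);
	assert(cid != 0);

	t->kpid = alloc_kpid(next_kpid);
	t->container = cid;
}

/*
 * Container -> global would need kpid == vpid again, which is possible
 * only if no global task has taken that pid in the meantime.
 */
bool can_return_to_global(const struct vtask *t,
			  const struct vtask *all, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (all[i].container == 0 && all[i].kpid == t->vpid)
			return false;
	return true;
}
```

Once the old global pid has been released and reused,
can_return_to_global() stays false, which is why the pids would have to
be reserved for the way back to be guaranteed.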

> 
> > Are most processes on the system
> > initially not in a container?  And then they can be stuffed in a container?
> > And then containers can be moved around or be isolated from each other?
> 
> The current idea is that processes are assigned at fork-time.  The
> isolation is for the lifetime of the process.
> 
> > And, is pid virtualization the point where this happens?  Or is that
> > a slightly higher level?  In other words, is pid virtualization the
> > full implementation of container isolation?  Or is it a significant
> > element on which additional policy, restrictions, and usage models
> > can be built?
> 
> pid virtualization is simply the one that's easiest to understand, and
> the one that demonstrates the largest number of issues.  It is a small
> piece of the puzzle, but an important one.
> 

Ditto..

> -- Dave
> 
> 
-- 
Hubertus Franke <frankeh@watson.ibm.com>



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16 17:35       ` Dave Hansen
  2005-12-16 20:45         ` Gerrit Huizenga
@ 2005-12-16 23:47         ` Hubertus Franke
  2005-12-17  1:18           ` Matt Helsley
  1 sibling, 1 reply; 37+ messages in thread
From: Hubertus Franke @ 2005-12-16 23:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Gerrit Huizenga, Matt Helsley, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
> On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> > In the pid virtualization, I would think that tasks can move between
> > containers as well,
> 
> I don't think tasks can not be permitted to move between containers.  As
> a simple exercise, imagine that you have two processes with the same
> pid, one in container A and one in container B.  You wish to have them
> both run in container A.  They can't both have the same pid.  What do
> you do?
> 

Dave, I think you meant "I don't think tasks can <strike>not</strike> be
permitted"...
Anyway, you make the constraints very clear: unless one can guarantee
that the pidspaces don't have any overlaps in vpid usage, there is NOWAY
that we can allow this. Otherwise vpids that have been handed to
userspace (think sys_getpid()) need to be revoked (think coherence
here). That violates the transparency requirements.

> I've been talking a lot lately about how important filesystem isolation
> between containers is to implement containers properly.  Isolating the
> filesystem namespaces makes it much easier to do things like fs-based
> shared memory during a checkpoint/resume.  If we want to allow tasks to
> move around, we'll have to throw out this entire concept.  That means
> that a _lot_ of things get a notch closer to the too-costly-to-implement
> category.
> 

Not only that. As the example of pids already shows, while on the surface
these might seem like desirable features (particularly since they came up
wrt the CKRM discussion), there are significant technical limitations
to them.

-- 
Hubertus Franke <frankeh@watson.ibm.com>



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16 23:47         ` Hubertus Franke
@ 2005-12-17  1:18           ` Matt Helsley
  2005-12-17  3:03             ` [Lse-tech] " Hubertus Franke
  0 siblings, 1 reply; 37+ messages in thread
From: Matt Helsley @ 2005-12-17  1:18 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Dave Hansen, Gerrit Huizenga, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Fri, 2005-12-16 at 18:47 -0500, Hubertus Franke wrote:
> On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
<snip>
> > I've been talking a lot lately about how important filesystem isolation
> > between containers is to implement containers properly.  Isolating the
> > filesystem namespaces makes it much easier to do things like fs-based
> > shared memory during a checkpoint/resume.  If we want to allow tasks to
> > move around, we'll have to throw out this entire concept.  That means
> > that a _lot_ of things get a notch closer to the too-costly-to-implement
> > category.
> > 
> 
> Not only that. As the example of pids already shows, while on the surface
> these might seem like desirable features (particularly since they came up
> wrt the CKRM discussion), there are significant technical limitations
> to them.

	Perhaps merging the container process grouping functionality is not a
good idea. 

	However, I think CKRM could be made minimally consistent with
containers using a few small modifications. I suspect all that is
necessary is:

1) Expanding the pid syntax accepted and reported when accessing the
members file to include an optional container id:

        # classify init in container 0 to a class
        echo 0:1 >> ${RCFS}/class_foo/members
        echo :1 >> ${RCFS}/class_foo/members
        
        # while in container 0 classify init in container 0 to a class
        echo 1 >> ${RCFS}/class_foo/members
        
        # while in container 0 classify init in container 3 to a class
        echo 3:1 >> ${RCFS}/class_foo/bar_class/members
        
        Then, when listed from container 0, pids would show up as cid:pid
        $ cat ${RCFS}/class_foo/members
        0:1
        5:2
        ...
        3:4
        
        Processes listing members from within container n would see
        bare pids, and only the pids in that container.

2) Limiting the pids and container ids accepted as input to the members
file from processes doing classification from within containers:

        # classify init in the current container to a class
	echo :1 >> ${RCFS}/class_foo/members
        echo 1 >> ${RCFS}/class_foo/members

	# returns an error when not in container 0
	echo 0:1 >> ${RCFS}/class_foo/members
	# returns an error when not in container 1
	echo 1:1 >> ${RCFS}/class_foo/members
	...

(Incidentally, these kinds of details are what I was referring to earlier
in this thread as "visibility boundaries".)
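
The input rules in 1) and 2) can be sketched as a small parser.  This
is an illustration in C under stated assumptions, not CKRM code:
struct member_ref and parse_member() are hypothetical names, the
caller's container id is passed in explicitly, and the error codes
(-EINVAL for bad syntax, -EPERM for naming a foreign container) are my
choice rather than anything specified above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing one line written to a members file. */
struct member_ref {
	int cid;
	int pid;
};

/*
 * Accepts "pid", ":pid" or "cid:pid", optionally followed by one newline.
 * A missing cid means the caller's own container.  Callers outside
 * container 0 may only name their own container.
 */
int parse_member(const char *s, int caller_cid, struct member_ref *out)
{
	const char *colon, *pid_str;
	char *end;
	long cid = caller_cid, pid;

	assert(s != NULL && out != NULL);
	colon = strchr(s, ':');
	pid_str = colon ? colon + 1 : s;

	if (colon && colon != s) {	/* explicit container id */
		errno = 0;
		cid = strtol(s, &end, 10);
		if (errno || end != colon || cid < 0 || cid > INT_MAX)
			return -EINVAL;
	}

	errno = 0;
	pid = strtol(pid_str, &end, 10);
	if (errno || end == pid_str || pid <= 0 || pid > INT_MAX)
		return -EINVAL;
	if (*end == '\n')		/* echo appends a newline */
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (caller_cid != 0 && cid != caller_cid)
		return -EPERM;		/* the visibility boundary */

	out->cid = (int)cid;
	out->pid = (int)pid;
	return 0;
}
```

From container 0, "3:1" resolves to init in container 3; from any other
container the same string is rejected, while "1" and ":1" always refer
to the caller's own container.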

	I think this would be sufficient to make CKRM and containers play
nicely with each other. I suspect further kernel-enforced constraints
between CKRM and containers may constitute policy and not functionality.

	<shameless_plug>I also suspect that with the right userspace
classification engine a wide variety of useful container resource
management policies could be enforced based on these simple
modifications.</shameless_plug>

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-16  3:28     ` Gerrit Huizenga
  2005-12-16 17:35       ` Dave Hansen
@ 2005-12-17  1:38       ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Matt Helsley @ 2005-12-17  1:38 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Hubertus Franke, CKRM-Tech, LKML, lse-tech, vserver,
	Andrew Morton, Rik van Riel, pagg

On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 18:20:52 PST, Matt Helsley wrote:
> > On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > > On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > > > PID Virtualization is based on the concept of a container.
> > > > The ultimate goal is to checkpoint/restart containers. 
> > > > 
> > > > The mechanism to start a container 
> > > > is to 'echo "container_name" > /proc/container'  which creates a new
> > > > container and associates the calling process with it. All subsequently
> > > > forked tasks then belong to that container.
> > > > There is a separate pid space associated with each container.
> > > > Only processes/task belonging to the same container "see" each other.
> > > > The exception is an implied default system container that has 
> > > > a global view.
> > 
> > <snip>
> > 
> > > I think perhaps this could also be the basis for a CKRM "class"
> > > grouping as well.  Rather than maintaining an independent class
> > > affiliation for tasks, why not have a class devolve (evolve?) into
> > > a "container" as described here.  The container provides much of
> > > the same grouping capabilities as a class as far as I can see.  The
> > > > right information would be available for scheduling and IO resource
> > > management.  The memory component of CKRM is perhaps a bit tricky
> > > still, but an overall strategy (can I use that word here? ;-) might
> > > be to use these "containers" as the single intrinsic grouping mechanism
> > > for vserver, openvz, application checkpoint/restart, resource
> > > management, and possibly others?
> > > 
> > > Opinions, especially from the CKRM folks?  This might even be useful
> > > to the PAGG folks as a grouping mechanism, similar to their jobs or
> > > containers.
> > > 
> > > "This patchset solves multiple problems".
> > > 
> > > gerrit
> > 
> > CKRM classes seem too different from containers to merge the two
> > concepts:
> 
> I agree that the implementation of pid virtualization and classes have
> different characteristics.  However, you bring up interesting points
> about the differences...  But I question whether or not they are
> relevant to an implementation of resource management.  I'm going out
> on a limb here looking at a possibly radical change which might
> simplify things so there is only one grouping mechanism in kernel.
> I could be wrong but...

<snip>

> > - Classes don't assign class-unique pids to tasks.
> 
> What part of this is important to resource management?  A container
> ID is like a class ID.  Yes, I think container ID's are assigned to
> processes rather than tasks, but is that really all that important?

	Perhaps you misunderstood my point. Upon inserting a task into a
container you must assign it a pid unique within the container.
Inserting a task into a class requires no analogous operation. While
there is no conflict here neither is there commonality.

<snip>

> For instance, checkpoint/restart needs to checkpoint a process and all
> of its threads if it wants to restart it.  So there may be restrictions
> on what you can checkpoint/restart.  Vserver probably wants isolation
> at a process boundary, rather than a task boundary.  Most resource
> management, e.g. Java, probably doesn't care about task vs. process.

	I really don't see how Java itself is a good example of most resource
management. As I see it, Java tries to present a runtime environment for
applications, and it is the applications that administrators are
concerned with.

	A process could allocate different roles to each thread or dole out
uniform pieces of work to each thread. Being able to manage the resource
usage of these threads could be useful -- so while Java may not "care"
about task vs. process an administrator might.

> > - Tasks move between classes without any need for checkpoint/restart.
>  
> That *should* be possible with a generalized container solution.
> For instance, just like with classes, you have to move things into
> containers in the first place.  And, you could in theory have a classification
> engine that helped choose which container to put a task/process in
> at creation/instantiation/significant event...

	Since arbitrary movement (time, source, and destination) is not
possible the classification analogy does not fit. This is one very big
difference between classes and containers that suggests merging the two
might not be best.

<snip>

> > - There are no "visibility boundaries" to enforce between tasks in
> > different classes.
>  
> Are there in virtualized pids?  There *can* be - e.g. ps can distinguish,
> but it is possible for tasks to interact across container boundaries.

	Right. I didn't say they were entirely invisible to each other. If they
were entirely visible to each other then these boundaries I'm talking
about wouldn't exist and a container would be more similar to a class. 

	These boundaries are probably delineated in miscellaneous areas of the
kernel like getpid(), kill(), any /proc file that shows a set of pids,
etc. Each of these would have to correctly limit the set of pids
displayed and/or accepted as input.
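
	As a rough illustration of such a boundary, here is a minimal sketch in
C of the lookup a kill()-style call might perform.  Everything in it is
hypothetical: struct ptask and lookup_visible() are made-up names, and
it assumes that callers in container 0 address tasks by kernel pid
(the "global view") while everyone else addresses tasks by virtual pid
within their own container.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task model for the sketch. */
struct ptask {
	int kpid;   /* kernel-wide pid */
	int vpid;   /* pid as seen inside the task's container */
	int cid;    /* container id, 0 = default/global container */
};

/*
 * Resolve a pid supplied by a caller in container caller_cid.
 * Returns the matching task, or NULL if no task is visible under
 * that pid from the caller's container.
 */
const struct ptask *lookup_visible(const struct ptask *tasks, size_t n,
				   int caller_cid, int pid)
{
	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct ptask *t = &tasks[i];

		if (caller_cid == 0) {
			/* Assumed global view: match on the kernel pid. */
			if (t->kpid == pid)
				return t;
		} else if (t->cid == caller_cid && t->vpid == pid) {
			/* Inside a container: only own tasks, by vpid. */
			return t;
		}
	}
	return NULL;
}
```

	A class has no equivalent of this check, which is the difference the
paragraph above is pointing at.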

	A CKRM class on the other hand has no such boundaries to present to
userspace and hence does not alter code in such diverse places. I think
this is a consequence of the fact it doesn't virtualize resources for
the purposes of checkpoint/restart (esp. well-known and user-visible
resources like pids, filehandles, etc).

<snip>

> > - Classes are hierarchical.
>  
> Conceptually they are.  But are they in the CKRM f series?  I thought
> that was one area for simplification.  And, how important is that *really*
> for most applications?

	Hierarchy still exists in f-series. It's something Chandra has been
considering removing in order to simplify the code. I think hierarchy
offers a chance for administrators to better organize their classes. I
think the goal should be to enable administrators to let users manage a
class and/or subclasses of their own -- though implementing rcfs via
configfs limits config items to root currently. Perhaps this could be
useful for CKRM inside containers if each container had a virtual root
user id of its own with a corresponding non-zero id in container 0...

> > - Unless I am mistaken, a container groups processes (Can one thread run
> > in container A and another in container B?) while a class groups tasks.
> > Since a task represents a thread or a process one thread could be in
> > class A and another in class B.
> 
> Definitely useful, and one question is whether pid virtualization is

	Above you suggested that most resource management ("e.g. Java") doesn't
care about process vs. threads. Here you say it could be useful.

> container isolation, or simply virtualization to enable container
> isolation.  If it is an enabling technology, perhaps it doesn't have
> that restriction and could be used either way based on resource management
> needs or based on vserver or c/r needs...

	I thought that the point of pid virtualization was to enable
checkpoint/restart and that, as a consequence, moving processes to other
containers is impossible.

> Debate away... ;-)
> 
> gerrit

	The strongest dissimilarity between the two I can see is the lack of
task movement between containers. The core similarity is the ability to
group. However, they don't group quite the same things -- from what I
can see containers group _trees of tasks_ with process (thread group)
granularity while classes group _tasks_ with thread granularity.

	At the very least I think we need to know the full extent of isolation
and interaction that are planned/necessary for containers before further
considering any merge proposals.

Cheers,
	-Matt Helsley



* Re: [Lse-tech] Re: [ckrm-tech] Re: [RFC][patch 00/21] PID Virtualization: Overview and Patches
  2005-12-17  1:18           ` Matt Helsley
@ 2005-12-17  3:03             ` Hubertus Franke
  0 siblings, 0 replies; 37+ messages in thread
From: Hubertus Franke @ 2005-12-17  3:03 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Dave Hansen, Gerrit Huizenga, CKRM-Tech, LKML, LSE, vserver,
	Andrew Morton, Rik van Riel, pagg

On Fri, 2005-12-16 at 17:18 -0800, Matt Helsley wrote:
> On Fri, 2005-12-16 at 18:47 -0500, Hubertus Franke wrote:
> > On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
> <snip>
> > > I've been talking a lot lately about how important filesystem isolation
> > > between containers is to implement containers properly.  Isolating the
> > > filesystem namespaces makes it much easier to do things like fs-based
> > > shared memory during a checkpoint/resume.  If we want to allow tasks to
> > > move around, we'll have to throw out this entire concept.  That means
> > > that a _lot_ of things get a notch closer to the too-costly-to-implement
> > > category.
> > > 
> > 
> > Not only that. As the example of pids already shows, while on the surface
> > these might seem like desirable features (particularly since they came up
> > wrt the CKRM discussion), there are significant technical limitations
> > to them.
> 
> 	Perhaps merging the container process grouping functionality is not a
> good idea. 
> 
> 	However, I think CKRM could be made minimally consistent with
> containers using a few small modifications. I suspect all that is
> necessary is:
> 
<snip>
> 	I think this would be sufficient to make CKRM and containers play
> nicely with each other. I suspect further kernel-enforced constraints
> between CKRM and containers may constitute policy and not functionality.
> 

I think that as a first step mutual coexistence is already quite
useful. 
Once I containerize applications, having the ability to actually
constrain and manage the resources consumed by that application would
be a real plus. In that sense a container and CKRM class coincide.
So even enforcing that "alignment" at a higher level through some 
awareness in the classification engine for instance would be quite
useful. Are they the same kernel object? NO, because of the
life cycle management of a process: once moved into a container,
it stays there.


> 
> Cheers,
> 	-Matt Helsley

Prost ...

Hubertus Franke <frankeh@watson.ibm.com>



end of thread, other threads:[~2005-12-17  3:03 UTC | newest]

Thread overview: 37+ messages
2005-12-15 14:35 [RFC][patch 00/21] PID Virtualization: Overview and Patches Hubertus Franke
2005-12-15 14:35 ` [RFC][patch 01/21] PID Virtualization: const parameter for process group Hubertus Franke
2005-12-15 14:35 ` [RFC][patch 02/21] PID Virtualization: task virtual pid access functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 03/21] PID Virtualization: return virtual pids where required Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 04/21] PID Virtualization: return virtual process group ids Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 05/21] PID Virtualization: code enhancements for virtual pids in /proc Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 06/21] PID Virtualization: Define pid_to_vpid functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 07/21] PID Virtualization: Use pid_to_vpid conversion functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 08/21] PID Virtualization: file owner pid virtualization Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 09/21] PID Virtualization: define vpid_to_pid functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 10/21] PID Virtualization: Use " Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 11/21] PID Virtualization: use vpgid_to_pgid function Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 12/21] PID Virtualization: Context for pid_to_vpid conversition functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 13/21] PID Virtualization: Documentation Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 14/21] PID Virtualization: pidspace Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 15/21] PID Virtualization: container object and functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 16/21] PID Virtualization: container attach/detach calls Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 17/21] PID Virtualization: /proc/container filesystem Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 18/21] PID Virtualization: Implementation of low level virtualization functions Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 19/21] PID Virtualization: Handle special case vpid return cases Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 20/21] PID Virtualization: per container /proc filesystem Hubertus Franke
2005-12-15 14:36 ` [RFC][patch 21/21] PID Virtualization: pidspace parent : signal behavior Hubertus Franke
2005-12-15 19:49 ` [RFC][patch 00/21] PID Virtualization: Overview and Patches Gerrit Huizenga
2005-12-15 20:02   ` [ckrm-tech] " Dave Hansen
2005-12-15 20:12     ` Gerrit Huizenga
2005-12-15 22:52     ` Matt Helsley
2005-12-15 22:02   ` Hubertus Franke
2005-12-16  2:20   ` [ckrm-tech] " Matt Helsley
2005-12-16  3:28     ` Gerrit Huizenga
2005-12-16 17:35       ` Dave Hansen
2005-12-16 20:45         ` Gerrit Huizenga
2005-12-16 21:10           ` Dave Hansen
2005-12-16 23:40             ` Hubertus Franke
2005-12-16 23:47         ` Hubertus Franke
2005-12-17  1:18           ` Matt Helsley
2005-12-17  3:03             ` [Lse-tech] " Hubertus Franke
2005-12-17  1:38       ` Matt Helsley
