linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/23] proc cleanup.
@ 2006-02-23 15:52 Eric W. Biederman
  2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 15:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


When working on pid namespaces I keep tripping over /proc.
Its hard coded inode numbers and the amount of cruft
accumulated over the years make it hard to deal with.

So to put /proc out of my misery here is a series of patches that
removes the worst of the warts.

The first patch, which introduces task_refs, is used later to address
one of the worst faults: how much low kernel memory /proc allows
an unprivileged process to pin.  There are other patches to clean up
the permission checking, to clean up how /proc interacts with the rest
of the kernel, and patches to simply clean /proc up.

At least some of the cleanups go back to cruft that was introduced
in 2.2.  It was a challenge to track down and understand the
thinking at the time because even the historic git archive I have
doesn't go back that far :(

Ultimately the biggest cleanup is that this patchset removes 
the hard coded inode numbers from /proc.  There are still a few
theoretical issues about non-unique inode numbers but the /proc code
doesn't care, and it is no worse than the current situation with the
file descriptor inode numbers.  I would have loved to have made the
inode number the address of the inode data structure in the kernel but
I can't because on alpha __kernel_ino_t is an unsigned int!  Oh well,
the current situation keeps the inode numbers small and readable, and
32bit.

Eric
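
The point about alpha can be illustrated with a small userspace sketch. It assumes a 32-bit inode number type and 64-bit object addresses; `ino32_t`, `ino_from_addr` and `ino_collides` are hypothetical names used only for this illustration and are not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode number type that is only 32 bits wide. */
typedef uint32_t ino32_t;

/* Derive an inode number from an object address by truncation. */
static ino32_t ino_from_addr(uint64_t addr)
{
	return (ino32_t)addr;	/* the upper 32 bits are lost here */
}

/* Two distinct addresses collide when they differ only in the upper bits. */
static int ino_collides(uint64_t a, uint64_t b)
{
	return a != b && ino_from_addr(a) == ino_from_addr(b);
}
```

Under these assumptions, two addresses that differ only above bit 31 would map to the same inode number, so the address alone cannot serve as a unique 32-bit inode number.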


* [PATCH 01/23] tref: Implement task references.
  2006-02-23 15:52 [PATCH 00/23] proc cleanup Eric W. Biederman
@ 2006-02-23 15:54 ` Eric W. Biederman
  2006-02-23 15:56   ` [PATCH 02/23] proc: Fix the .. inode number on /proc/<pid>/fd Eric W. Biederman
  2006-02-23 16:49   ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
  2006-02-25 12:27 ` [PATCH 00/23] proc cleanup Andrew Morton
  2006-02-27 15:26 ` Serge E. Hallyn
  2 siblings, 2 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 15:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel


Holding a reference to a task_struct pins about 10K of low memory even
after that task has exited, which seems to be 1 or 2 orders of
magnitude more memory than any other data structure in the kernel.
Without a reference to the task_struct you risk problems with
pid wrap around.

Even worse, because we allow session and process group leaders to exit,
there is no task_struct you can hold onto to prevent pid wrap around
problems for those kinds of structures.

The task_ref is a small intermediate data structure that other
structures can point to, and that solves these problems.  A task_ref will
always point at the first user of a pid value or contain a NULL
pointer if there are no longer any users of that pid.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 include/linux/pid.h      |    4 +
 include/linux/task_ref.h |   69 ++++++++++++++++++++++++
 kernel/Makefile          |    2 -
 kernel/fork.c            |    7 ++
 kernel/pid.c             |   12 ++++
 kernel/task_ref.c        |  131 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 224 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/task_ref.h
 create mode 100644 kernel/task_ref.c

8622b332e1e3c5ca2e451828f127e91729ae497f
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 099e70e..2849b7d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_PID_H
 #define _LINUX_PID_H
 
+struct task_ref;
+
 enum pid_type
 {
 	PIDTYPE_PID,
@@ -17,6 +19,8 @@ struct pid
 	struct hlist_node pid_chain;
 	/* list of pids with the same nr, only one of them is in the hash */
 	struct list_head pid_list;
+	/* Does a weak reference of this type exist to the task struct? */
+	struct task_ref *ref;
 };
 
 #define pid_task(elem, type) \
diff --git a/include/linux/task_ref.h b/include/linux/task_ref.h
new file mode 100644
index 0000000..e8446bd
--- /dev/null
+++ b/include/linux/task_ref.h
@@ -0,0 +1,69 @@
+#ifndef _LINUX_TASK_REF_H
+#define _LINUX_TASK_REF_H
+
+/* What is a task_ref?
+ *
+ * A task_ref is a structure that holds a pointer to a task_struct, but
+ * instead of holding a reference count to the task_struct a backwards
+ * pointer from the task_struct to the task_ref is maintained.  When
+ * the task exits that reference is broken and the task_struct
+ * pointer in the task_ref is cleared to NULL.
+ *
+ * This allows tracking a task_struct without pinning it in memory.  A
+ * task_struct plus a stack consumes around 10K of low kernel memory.
+ * More precisely this is THREAD_SIZE + sizeof(struct task_struct).
+ * By comparison a task_ref is between 16 and 20 bytes.
+ *
+ * The task_ref allows tracking not only individual pids but also any pid_type.
+ * This means we can stop using individual pids in kernel data
+ * structures and directly track the processes those pids refer to.
+ * The advantage is that this allows the kernel to avoid pid wrap
+ * problems with its internal references.
+ *
+ *
+ * Using a pointer to a pointer can be awkward, especially if you
+ * always must test to see if that pointer is NULL before using it.
+ *
+ * I simplify things by including the init_tref object
+ * and the tref_init, tref_set, tref_reset, and tref_fini functions
+ * for manipulating a task_ref pointer.  They take care of reference
+ * counting and ensuring that a task_ref pointer will point to
+ * init_tref if it does not have something useful to point to.
+ *
+ */
+
+struct task_struct;
+enum pid_type;
+
+struct task_ref
+{
+	atomic_t count;
+	enum pid_type type;
+	pid_t pid;
+	struct task_struct *task;
+};
+
+/* Note: to read a usable task value from struct task_ref
+ * the tasklist_lock must be held.  The atomic property of single
+ * word reads will keep any value you read consistent but it doesn't
+ * protect you from the race of the task exiting on another cpu and
+ * having its task_struct freed or reused.  Holding the tasklist_lock
+ * prevents the task from going away as you dereference the task pointer.
+ */
+
+extern struct task_ref init_tref;
+
+extern void tref_put(struct task_ref *ref);
+extern struct task_ref *tref_get(struct task_ref *ref);
+extern struct task_ref *tref_get_by_task(struct task_struct *task, enum pid_type type);
+extern struct task_ref *tref_get_by_pid(int pid, enum pid_type type);
+
+extern void tref_init(struct task_ref **dst);
+extern void tref_set(struct task_ref **dst, struct task_ref *ref);
+extern void tref_reset(struct task_ref **dst);
+extern void tref_fini(struct task_ref **dst);
+
+extern struct task_struct *get_tref_task(const struct task_ref *tref);
+
+
+#endif /* _LINUX_TASK_REF_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 4ae0fbd..d8c0970 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -5,7 +5,7 @@
 obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o workqueue.o pid.o \
+	    signal.o sys.o kmod.o workqueue.o pid.o task_ref.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o
diff --git a/kernel/fork.c b/kernel/fork.c
index fbea12d..3f56d5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -157,6 +157,7 @@ void __init fork_init(unsigned long memp
 
 static struct task_struct *dup_task_struct(struct task_struct *orig)
 {
+	int type;
 	struct task_struct *tsk;
 	struct thread_info *ti;
 
@@ -179,6 +180,12 @@ static struct task_struct *dup_task_stru
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 	atomic_set(&tsk->usage,2);
 	atomic_set(&tsk->fs_excl, 0);
+
+	/* Initially there are no weak references to this task */
+	for (type = 0; type < PIDTYPE_MAX; type++) {
+		tsk->pids[type].nr = 0;
+		tsk->pids[type].ref = NULL;
+	}
 	return tsk;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 7781d99..f365dbb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,6 +26,7 @@
 #include <linux/init.h>
 #include <linux/bootmem.h>
 #include <linux/hash.h>
+#include <linux/task_ref.h>
 
 #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
 static struct hlist_head *pid_hash[PIDTYPE_MAX];
@@ -151,6 +152,7 @@ int fastcall attach_pid(task_t *task, en
 	task_pid = &task->pids[type];
 	pid = find_pid(type, nr);
 	task_pid->nr = nr;
+	task_pid->ref = NULL;
 	if (pid == NULL) {
 		INIT_LIST_HEAD(&task_pid->pid_list);
 		hlist_add_head_rcu(&task_pid->pid_chain,
@@ -165,18 +167,28 @@ int fastcall attach_pid(task_t *task, en
 
 static fastcall int __detach_pid(task_t *task, enum pid_type type)
 {
+	task_t *task_next;
 	struct pid *pid, *pid_next;
+	struct task_ref *ref;
 	int nr = 0;
 
 	pid = &task->pids[type];
+	ref = pid->ref;
 	if (!hlist_unhashed(&pid->pid_chain)) {
 
 		if (list_empty(&pid->pid_list)) {
+			if (ref)
+				ref->task = NULL;
 			nr = pid->nr;
 			hlist_del_rcu(&pid->pid_chain);
 		} else {
+			task_next = pid_task(pid->pid_list.next, type);
 			pid_next = list_entry(pid->pid_list.next,
 						struct pid, pid_list);
+			/* Update the reference to point at the next task */
+			if (ref)
+				ref->task = task_next;
+			pid_next->ref = ref;
 			/* insert next pid from pid_list to hash */
 			hlist_replace_rcu(&pid->pid_chain,
 					  &pid_next->pid_chain);
diff --git a/kernel/task_ref.c b/kernel/task_ref.c
new file mode 100644
index 0000000..2f0a880
--- /dev/null
+++ b/kernel/task_ref.c
@@ -0,0 +1,131 @@
+#include <linux/sched.h>
+#include <linux/task_ref.h>
+
+struct task_ref init_tref = {
+	.count = ATOMIC_INIT(1),
+	.type  = PIDTYPE_PID,
+	.pid   = 0,
+	.task  = NULL,
+};
+
+void tref_put(struct task_ref *ref)
+{
+	might_sleep();
+	if (atomic_dec_and_test(&ref->count)) {
+		struct task_struct *task;
+		BUG_ON(ref == &init_tref);
+		/* Carefully serialize against __detach_pid and tref_get_by_pid */
+		write_lock_irq(&tasklist_lock);
+		task = ref->task;
+		if (task)
+			task->pids[ref->type].ref = NULL;
+		write_unlock_irq(&tasklist_lock);
+		kfree(ref);
+	}
+}
+
+struct task_ref *tref_get(struct task_ref *ref)
+{
+	atomic_inc(&ref->count);
+	return ref;
+}
+
+struct task_ref *tref_get_by_task(struct task_struct *task, enum pid_type type)
+{
+	struct task_ref *new_ref, *ref = NULL;
+	struct pid *pid;
+	might_sleep();
+	
+	/* Get the pid hash table entry */
+	pid = &task->pids[type];
+
+	/* Safely get an existing reference */
+	read_lock(&tasklist_lock);
+	ref = pid->ref;
+	if (ref)
+		tref_get(ref);
+	read_unlock(&tasklist_lock);
+	if (ref)
+		goto out;
+
+	/* There was not an existing task ref so allocate one */
+	new_ref = kmalloc(sizeof(*new_ref), GFP_KERNEL);
+	if (new_ref) {
+		/* Carefully serialize against __detach_pid and tref_put */
+		write_lock_irq(&tasklist_lock);
+		ref = pid->ref;
+		if (ref)
+			tref_get(ref);
+		else if (pid->nr) {
+			atomic_set(&new_ref->count, 1);
+			new_ref->type = type;
+			new_ref->pid  = pid->nr;
+			new_ref->task = task;
+			pid->ref = ref = new_ref;
+		}
+		write_unlock_irq(&tasklist_lock);
+		if (ref != new_ref)
+			kfree(new_ref);
+	}
+out:
+	if (!ref)
+		ref = tref_get(&init_tref);
+	return ref;
+}
+
+struct task_ref *tref_get_by_pid(int pid, enum pid_type type)
+{
+	struct task_struct *task;
+	struct task_ref *tref;
+
+	/* Look up and pin the task */
+	read_lock(&tasklist_lock);
+	task = find_task_by_pid_type(type, pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	/* Now get the tref */
+	if (task) {
+		tref = tref_get_by_task(task, type);
+		put_task_struct(task);
+	}
+	else
+		tref = tref_get(&init_tref);
+	return tref;
+}
+
+void tref_init(struct task_ref **dst)
+{
+	*dst = tref_get(&init_tref);
+}
+
+void tref_set(struct task_ref **dst, struct task_ref *ref)
+{
+	tref_put(*dst);
+	*dst = ref;
+}
+
+void tref_reset(struct task_ref **dst)
+{
+	tref_put(*dst);
+	*dst = tref_get(&init_tref);
+}
+
+void tref_fini(struct task_ref **dst)
+{
+	tref_put(*dst);
+	*dst = NULL;
+}
+
+
+struct task_struct *get_tref_task(const struct task_ref *tref)
+{
+	struct task_struct *task;
+	read_lock(&tasklist_lock);
+	task = tref->task;
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+	return task;
+}
-- 
1.2.2.g709a
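
The weak-reference idea in this patch can be modelled in userspace. The following is a minimal single-threaded sketch, not the kernel code: `struct weak_ref`, `struct task`, `weak_ref_get`, `weak_ref_put` and `task_exit` are hypothetical stand-ins, there is no locking (the patch serializes on tasklist_lock), and the hand-off to the next task sharing the same pid value is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct task;			/* stand-in for task_struct */

struct weak_ref {		/* stand-in for struct task_ref */
	int count;		/* number of holders of the weak_ref itself */
	struct task *task;	/* NULL once the task has exited */
};

struct task {
	int pid;
	struct weak_ref *ref;	/* back pointer, NULL if nobody is watching */
};

/* Return the task's weak_ref, allocating it on first use. */
static struct weak_ref *weak_ref_get(struct task *t)
{
	if (!t->ref) {
		struct weak_ref *r = malloc(sizeof(*r));
		if (!r)
			return NULL;
		r->count = 0;
		r->task = t;
		t->ref = r;
	}
	t->ref->count++;
	return t->ref;
}

/* Task exit: break the link in both directions so the task can be freed. */
static void task_exit(struct task *t)
{
	if (t->ref) {
		t->ref->task = NULL;
		t->ref = NULL;
	}
}

/* Drop one holder; free the weak_ref when the last holder is gone. */
static void weak_ref_put(struct weak_ref *r)
{
	if (--r->count == 0) {
		if (r->task)
			r->task->ref = NULL;
		free(r);
	}
}
```

In this model a holder keeps only the small weak_ref alive; after task_exit() the holder sees a NULL task pointer instead of a stale or recycled task.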



* [PATCH 02/23] proc: Fix the .. inode number on /proc/<pid>/fd
  2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
@ 2006-02-23 15:56   ` Eric W. Biederman
  2006-02-23 15:57     ` [PATCH 03/23] proc: Remove useless BKL in proc_pid_readlink Eric W. Biederman
  2006-02-23 16:49   ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
  1 sibling, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 15:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

c901696b26aa347532930dc5ab12ecb54e473722
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 20feb75..4cbbd2d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1149,7 +1149,8 @@ static struct inode_operations proc_pid_
 
 static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
 {
-	struct inode *inode = filp->f_dentry->d_inode;
+	struct dentry *dentry = filp->f_dentry;
+	struct inode *inode = dentry->d_inode;
 	struct task_struct *p = proc_task(inode);
 	unsigned int fd, tid, ino;
 	int retval;
@@ -1170,7 +1171,7 @@ static int proc_readfd(struct file * fil
 				goto out;
 			filp->f_pos++;
 		case 1:
-			ino = fake_ino(tid, PROC_TID_INO);
+			ino = parent_ino(dentry);
 			if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
 				goto out;
 			filp->f_pos++;
-- 
1.2.2.g709a



* [PATCH 03/23] proc: Remove useless BKL in proc_pid_readlink.
  2006-02-23 15:56   ` [PATCH 02/23] proc: Fix the .. inode number on /proc/<pid>/fd Eric W. Biederman
@ 2006-02-23 15:57     ` Eric W. Biederman
  2006-02-23 15:58       ` [PATCH 04/23] proc: Remove unnecessary and misleading assignments from proc_pid_make_inode Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 15:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


We already call everything except do_proc_readlink
outside of the BKL in proc_pid_followlink, and there
appears to be nothing in do_proc_readlink that needs
any special protection.

So remove this leftover from one of the BKL cleanup
efforts.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

da9fe7b5227340bea1f4bd1e246af4a921ce765a
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4cbbd2d..24a3526 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1120,7 +1120,6 @@ static int proc_pid_readlink(struct dent
 	struct dentry *de;
 	struct vfsmount *mnt = NULL;
 
-	lock_kernel();
 
 	if (current->fsuid != inode->i_uid && !capable(CAP_DAC_OVERRIDE))
 		goto out;
@@ -1136,7 +1135,6 @@ static int proc_pid_readlink(struct dent
 	dput(de);
 	mntput(mnt);
 out:
-	unlock_kernel();
 	return error;
 }
 
-- 
1.2.2.g709a



* [PATCH 04/23] proc: Remove unnecessary and misleading assignments from proc_pid_make_inode.
  2006-02-23 15:57     ` [PATCH 03/23] proc: Remove useless BKL in proc_pid_readlink Eric W. Biederman
@ 2006-02-23 15:58       ` Eric W. Biederman
  2006-02-23 16:00         ` [PATCH 05/23] proc: Simplify the ownership rules for /proc Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 15:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


The removed fields are already set by proc_alloc_inode.  
Initializing them in proc_pid_make_inode implies they need
to be set.  At least ei->pde was not set on all paths making
it look like proc_pid_make_inode was buggy.  So just remove 
the redundant assignments.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

2b0fa5317e60458090cfa528e9421ecd3de38f6b
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 24a3526..56ca519 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1310,7 +1310,6 @@ static struct inode *proc_pid_make_inode
 
 	/* Common stuff */
 	ei = PROC_I(inode);
-	ei->task = NULL;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	inode->i_ino = fake_ino(task->pid, ino);
 
@@ -1335,7 +1334,6 @@ out:
 	return inode;
 
 out_unlock:
-	ei->pde = NULL;
 	iput(inode);
 	return NULL;
 }
-- 
1.2.2.g709a



* [PATCH 05/23] proc: Simplify the ownership rules for /proc
  2006-02-23 15:58       ` [PATCH 04/23] proc: Remove unnecessary and misleading assignments from proc_pid_make_inode Eric W. Biederman
@ 2006-02-23 16:00         ` Eric W. Biederman
  2006-02-23 16:02           ` Eric W. Biederman
  2006-02-23 16:04           ` [PATCH 06/23] proc: Replace proc_inode.type with proc_inode.fd Eric W. Biederman
  0 siblings, 2 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Currently in /proc if the task is dumpable all of the files are owned by
the task's effective user.  Otherwise the files are owned by root.
The exception is the /proc/<tgid>/ or /proc/<tgid>/task/<pid> directory:
in that case we always make the directory owned by the effective user.

However the special case for directories is pointless except as a way
to read the effective user, because the permissions on both of those
directories are world readable and executable.

/proc/<tgid>/status provides a much better way to read a process's effective
userid, so it is silly to try to provide that on the directory.

So this patch simplifies the code by removing a pointless special case and
gets us one step closer to being able to remove the hard coded /proc inode
numbers.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

453d43f2b9e9fee71c23007f1cfe5dbedd9d3790
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 56ca519..c35f340 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1324,7 +1324,7 @@ static struct inode *proc_pid_make_inode
 	ei->type = ino;
 	inode->i_uid = 0;
 	inode->i_gid = 0;
-	if (ino == PROC_TGID_INO || ino == PROC_TID_INO || task_dumpable(task)) {
+	if (task_dumpable(task)) {
 		inode->i_uid = task->euid;
 		inode->i_gid = task->egid;
 	}
@@ -1353,7 +1353,7 @@ static int pid_revalidate(struct dentry 
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = proc_task(inode);
 	if (pid_alive(task)) {
-		if (proc_type(inode) == PROC_TGID_INO || proc_type(inode) == PROC_TID_INO || task_dumpable(task)) {
+		if (task_dumpable(task)) {
 			inode->i_uid = task->euid;
 			inode->i_gid = task->egid;
 		} else {
-- 
1.2.2.g709a
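
The ownership rule before and after this patch can be summarized as two small functions. This is a userspace sketch only: `struct owner`, `owner_old` and `owner_new` are hypothetical names, and the boolean `is_task_dir` stands in for the PROC_TGID_INO / PROC_TID_INO comparison that the patch removes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; the kernel stores these in inode->i_uid and i_gid. */
struct owner {
	unsigned int uid;
	unsigned int gid;
};

/* Old rule: the two task directories always took the effective ids. */
static struct owner owner_old(bool is_task_dir, bool dumpable,
			      unsigned int euid, unsigned int egid)
{
	struct owner o = { 0, 0 };	/* default to root */

	if (is_task_dir || dumpable) {
		o.uid = euid;
		o.gid = egid;
	}
	return o;
}

/* New rule: only dumpability decides the owner. */
static struct owner owner_new(bool dumpable,
			      unsigned int euid, unsigned int egid)
{
	struct owner o = { 0, 0 };	/* default to root */

	if (dumpable) {
		o.uid = euid;
		o.gid = egid;
	}
	return o;
}
```

The two rules differ only for a task directory of a non-dumpable task, which becomes owned by root under the new rule.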



* [PATCH 05/23] proc: Simplify the ownership rules for /proc
  2006-02-23 16:00         ` [PATCH 05/23] proc: Simplify the ownership rules for /proc Eric W. Biederman
@ 2006-02-23 16:02           ` Eric W. Biederman
  2006-02-23 16:04           ` [PATCH 06/23] proc: Replace proc_inode.type with proc_inode.fd Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Currently in /proc if the task is dumpable all of the files are owned by
the task's effective user.  Otherwise the files are owned by root.
The exception is the /proc/<tgid>/ or /proc/<tgid>/task/<pid> directory:
in that case we always make the directory owned by the effective user.

However the special case for directories is pointless except as a way
to read the effective user, because the permissions on both of those
directories are world readable and executable.

/proc/<tgid>/status provides a much better way to read a process's effective
userid, so it is silly to try to provide that on the directory.

So this patch simplifies the code by removing a pointless special case and
gets us one step closer to being able to remove the hard coded /proc inode
numbers.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

453d43f2b9e9fee71c23007f1cfe5dbedd9d3790
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 56ca519..c35f340 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1324,7 +1324,7 @@ static struct inode *proc_pid_make_inode
 	ei->type = ino;
 	inode->i_uid = 0;
 	inode->i_gid = 0;
-	if (ino == PROC_TGID_INO || ino == PROC_TID_INO || task_dumpable(task)) {
+	if (task_dumpable(task)) {
 		inode->i_uid = task->euid;
 		inode->i_gid = task->egid;
 	}
@@ -1353,7 +1353,7 @@ static int pid_revalidate(struct dentry 
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = proc_task(inode);
 	if (pid_alive(task)) {
-		if (proc_type(inode) == PROC_TGID_INO || proc_type(inode) == PROC_TID_INO || task_dumpable(task)) {
+		if (task_dumpable(task)) {
 			inode->i_uid = task->euid;
 			inode->i_gid = task->egid;
 		} else {
-- 
1.2.2.g709a



* [PATCH 06/23] proc: Replace proc_inode.type with proc_inode.fd
  2006-02-23 16:00         ` [PATCH 05/23] proc: Simplify the ownership rules for /proc Eric W. Biederman
  2006-02-23 16:02           ` Eric W. Biederman
@ 2006-02-23 16:04           ` Eric W. Biederman
  2006-02-23 16:05             ` [PATCH 07/23] proc: Remove bogus proc_task_permission Eric W. Biederman
  1 sibling, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


The sole remaining use of proc_inode.type is to discover the file descriptor
number, so just store the file descriptor number instead of deriving it
from the inode type.  This removes any /proc limits on the maximum number
of file descriptors, and clears the path to make the hard coded /proc
inode numbers go away.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c          |    6 +++---
 fs/proc/inode.c         |    2 +-
 fs/proc/internal.h      |    4 ++--
 include/linux/proc_fs.h |    2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

7c7a69a8f4291176a595da2c8046ddef15bc6135
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c35f340..8357c52 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -290,7 +290,7 @@ static int proc_fd_link(struct inode *in
 	struct task_struct *task = proc_task(inode);
 	struct files_struct *files;
 	struct file *file;
-	int fd = proc_type(inode) - PROC_TID_FD_DIR;
+	int fd = proc_fd(inode);
 
 	files = get_files_struct(task);
 	if (files) {
@@ -1321,7 +1321,6 @@ static struct inode *proc_pid_make_inode
 	 */
 	get_task_struct(task);
 	ei->task = task;
-	ei->type = ino;
 	inode->i_uid = 0;
 	inode->i_gid = 0;
 	if (task_dumpable(task)) {
@@ -1371,7 +1370,7 @@ static int tid_fd_revalidate(struct dent
 {
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = proc_task(inode);
-	int fd = proc_type(inode) - PROC_TID_FD_DIR;
+	int fd = proc_fd(inode);
 	struct files_struct *files;
 
 	files = get_files_struct(task);
@@ -1478,6 +1477,7 @@ static struct dentry *proc_lookupfd(stru
 	if (!inode)
 		goto out;
 	ei = PROC_I(inode);
+	ei->fd = fd;
 	files = get_files_struct(task);
 	if (!files)
 		goto out_unlock;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 075d3e9..8f532d7 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -95,7 +95,7 @@ static struct inode *proc_alloc_inode(st
 	if (!ei)
 		return NULL;
 	ei->task = NULL;
-	ei->type = 0;
+	ei->fd = 0;
 	ei->op.proc_get_link = NULL;
 	ei->pde = NULL;
 	inode = &ei->vfs_inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 95a1cf3..8ea21d3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -46,7 +46,7 @@ static inline struct task_struct *proc_t
 	return PROC_I(inode)->task;
 }
 
-static inline int proc_type(struct inode *inode)
+static inline int proc_fd(struct inode *inode)
 {
-	return PROC_I(inode)->type;
+	return PROC_I(inode)->fd;
 }
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index aa6322d..cab152d 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -247,7 +247,7 @@ extern void kclist_add(struct kcore_list
 
 struct proc_inode {
 	struct task_struct *task;
-	int type;
+	int fd;
 	union {
 		int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
 		int (*proc_read)(struct task_struct *task, char *page);
-- 
1.2.2.g709a
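
Why storing the descriptor directly lifts the limit can be sketched as follows. The constants here are assumptions chosen for illustration and are not taken from this patch: `TYPE_LIMIT` and `FD_DIR_BASE` are hypothetical bounds on a type code, and `struct proc_inode_model` is a simplified stand-in for struct proc_inode.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only. */
#define TYPE_LIMIT	0x10000	/* hypothetical: type codes must stay below this */
#define FD_DIR_BASE	0x8000	/* hypothetical base code for fd entries */

/* Old scheme: the fd is folded into the type code and recovered by subtraction. */
static int type_from_fd(int fd)
{
	return FD_DIR_BASE + fd;
}

static int fd_from_type(int type)
{
	return type - FD_DIR_BASE;
}

/* The old scheme only works while the code stays inside the type range. */
static bool fd_fits_old_scheme(int fd)
{
	return fd >= 0 && fd < TYPE_LIMIT - FD_DIR_BASE;
}

/* New scheme: a dedicated field holds the fd, so no code range is reserved. */
struct proc_inode_model {
	int fd;
};
```

With a dedicated field, any descriptor number that fits in an int can be stored without reserving a range of type codes for it.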



* [PATCH 07/23] proc: Remove bogus proc_task_permission.
  2006-02-23 16:04           ` [PATCH 06/23] proc: Replace proc_inode.type with proc_inode.fd Eric W. Biederman
@ 2006-02-23 16:05             ` Eric W. Biederman
  2006-02-23 16:06               ` [PATCH 08/23] proc: Kill proc_mem_inode_operations Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


First, we can access every /proc/<tgid>/task/<pid> directory as
/proc/<pid> so proc_task_permission is not usefully limiting
visibility.

Second, having related filesystem information should have nothing to
do with process visibility.  kill does not implement any checks
like that.

It looks like proc_task_permission was added when the /proc/<tgid>/task
directories were added and someone misunderstood what proc_permission
was trying to accomplish.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   63 --------------------------------------------------------
 1 files changed, 0 insertions(+), 63 deletions(-)

e1ab81806f60fd8ccda2773f9cdadd05990b5e81
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8357c52..8b938ef 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -350,54 +350,6 @@ static int proc_root_link(struct inode *
 	return result;
 }
 
-
-/* Same as proc_root_link, but this addionally tries to get fs from other
- * threads in the group */
-static int proc_task_root_link(struct inode *inode, struct dentry **dentry,
-				struct vfsmount **mnt)
-{
-	struct fs_struct *fs;
-	int result = -ENOENT;
-	struct task_struct *leader = proc_task(inode);
-
-	task_lock(leader);
-	fs = leader->fs;
-	if (fs) {
-		atomic_inc(&fs->count);
-		task_unlock(leader);
-	} else {
-		/* Try to get fs from other threads */
-		task_unlock(leader);
-		read_lock(&tasklist_lock);
-		if (pid_alive(leader)) {
-			struct task_struct *task = leader;
-
-			while ((task = next_thread(task)) != leader) {
-				task_lock(task);
-				fs = task->fs;
-				if (fs) {
-					atomic_inc(&fs->count);
-					task_unlock(task);
-					break;
-				}
-				task_unlock(task);
-			}
-		}
-		read_unlock(&tasklist_lock);
-	}
-
-	if (fs) {
-		read_lock(&fs->lock);
-		*mnt = mntget(fs->rootmnt);
-		*dentry = dget(fs->root);
-		read_unlock(&fs->lock);
-		result = 0;
-		put_fs_struct(fs);
-	}
-	return result;
-}
-
-
 #define MAY_PTRACE(task) \
 	(task == current || \
 	(task->parent == current && \
@@ -586,20 +538,6 @@ static int proc_permission(struct inode 
 	return proc_check_root(inode);
 }
 
-static int proc_task_permission(struct inode *inode, int mask, struct nameidata *nd)
-{
-	struct dentry *root;
-	struct vfsmount *vfsmnt;
-
-	if (generic_permission(inode, mask, NULL) != 0)
-		return -EACCES;
-
-	if (proc_task_root_link(inode, &root, &vfsmnt))
-		return -ENOENT;
-
-	return proc_check_chroot(root, vfsmnt);
-}
-
 extern struct seq_operations proc_pid_maps_op;
 static int maps_open(struct inode *inode, struct file *file)
 {
@@ -1531,7 +1469,6 @@ static struct inode_operations proc_fd_i
 
 static struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
-	.permission	= proc_task_permission,
 };
 
 #ifdef CONFIG_SECURITY
-- 
1.2.2.g709a



* [PATCH 08/23] proc: Kill proc_mem_inode_operations.
  2006-02-23 16:05             ` [PATCH 07/23] proc: Remove bogus proc_task_permission Eric W. Biederman
@ 2006-02-23 16:06               ` Eric W. Biederman
  2006-02-23 16:08                 ` [PATCH 09/23] proc: Properly filter out files that are not visible to a process Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


The inode operations only exist to support the proc_permission
function.  Currently mem_read and mem_write have all the same
permission checks as ptrace.  The fs check makes no sense
in this context, and we can trivially get around it by
calling ptrace.

So simplify the code by killing this strange special case.

I admit the code has had this check since 2.2 but even
there it doesn't seem to make sense.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

c5af674b972bf21e1bc69b8d9c343e3158d2b3c0
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8b938ef..1d1feb7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -881,10 +881,6 @@ static struct file_operations proc_oom_a
 	.write		= oom_adjust_write,
 };
 
-static struct inode_operations proc_mem_inode_operations = {
-	.permission	= proc_permission,
-};
-
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -1645,7 +1641,6 @@ static struct dentry *proc_pident_lookup
 #endif
 		case PROC_TID_MEM:
 		case PROC_TGID_MEM:
-			inode->i_op = &proc_mem_inode_operations;
 			inode->i_fop = &proc_mem_operations;
 			break;
 #ifdef CONFIG_SECCOMP
-- 
1.2.2.g709a



* [PATCH 09/23] proc: Properly filter out files that are not visible to a process.
  2006-02-23 16:06               ` [PATCH 08/23] proc: Kill proc_mem_inode_operations Eric W. Biederman
@ 2006-02-23 16:08                 ` Eric W. Biederman
  2006-02-23 16:10                   ` [PATCH 10/23] proc: Fix the link count for /proc/<pid>/task Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel



Long ago and far away in 2.2 we started checking to ensure the files
we displayed in /proc were visible to the current process.  It was
an unsophisticated time and no one was worried about functions
full of FIXMEs in a stable kernel.  As time passed the function
became sacred and was enshrined in the shrine of how things
have always been.  The fixes came in, but only to keep the function
working, with no one really remembering or documenting why we did things
that way.

The intent and the functionality make a lot of sense.  Don't let
/proc be a way to access files a process can see no other way.
The implementation however is completely wrong.

We are currently checking the root directories of the two processes,
we are not checking the actual file descriptors themselves.

We are strangely checking with a permission method instead of just when
we use the data.

This patch fixes the logic to actually check the files being returned and
makes a note that implementing a permission method for this part of
/proc is almost certainly the wrong way to implement a permission check.
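
For readers following along, here is a minimal userspace sketch of the
second half of the new check: walk from the file's mount up to the
caller's root mount, then test that the resulting dentry lies below the
caller's root directory.  The toy_* types and functions are hypothetical
stand-ins, not the kernel's struct dentry, struct vfsmount, or
is_subdir(); locking, reference counting, and the shared files_struct
shortcut are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dentry (hypothetical). */
struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself at the top of a tree */
};

/* Toy stand-in for struct vfsmount (hypothetical). */
struct toy_mount {
	struct toy_mount *parent;	/* points to itself for the topmost mount */
	struct toy_dentry *mountpoint;	/* where this mount sits in its parent */
};

/* True if d is root itself or lies somewhere below root. */
static bool toy_is_subdir(const struct toy_dentry *d,
			  const struct toy_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)
			return false;	/* reached the top without meeting root */
		d = d->parent;
	}
}

/* True if (d, mnt) can be reached from the caller's (root, rootmnt). */
static bool toy_visible(const struct toy_dentry *d,
			const struct toy_mount *mnt,
			const struct toy_dentry *root,
			const struct toy_mount *rootmnt)
{
	assert(d && mnt && root && rootmnt);

	/* Climb the mount tree until we are in the caller's root mount. */
	while (mnt != rootmnt) {
		if (mnt == mnt->parent)
			return false;	/* different mount tree entirely */
		d = mnt->mountpoint;
		mnt = mnt->parent;
	}
	/* Same mount: the dentry must sit below the caller's root. */
	return toy_is_subdir(d, root);
}
```

In this model a file on a mount stacked somewhere below the caller's
root is visible, while a file outside a chroot or in an unrelated mount
tree is not.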

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |  153 ++++++++++++++++++++++++++++++++------------------------
 1 files changed, 87 insertions(+), 66 deletions(-)

2c3a592c549bde816af8d38fc41542ce32f24bef
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1d1feb7..81c2f2a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -74,6 +74,16 @@
 #include <linux/poll.h>
 #include "internal.h"
 
+/* NOTE:   
+ *	Implementing inode permission operations in /proc is almost
+ *	certainly an error.  Permission checks need to happen during
+ *	each system call not at open time.  The reason is that most of
+ *	what we wish to check for permissions in /proc varies at runtime.
+ *	 
+ *	The classic example of a problem is opening file descriptors
+ *	in /proc for a task before it execs a suid executable.
+ */
+
 /*
  * For hysterical raisins we keep the same inumbers as in the old procfs.
  * Feel free to change the macro below - just keep the range distinct from
@@ -479,65 +489,6 @@ static int proc_oom_score(struct task_st
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-/* permission checks */
-
-/* If the process being read is separated by chroot from the reading process,
- * don't let the reader access the threads.
- */
-static int proc_check_chroot(struct dentry *root, struct vfsmount *vfsmnt)
-{
-	struct dentry *de, *base;
-	struct vfsmount *our_vfsmnt, *mnt;
-	int res = 0;
-	read_lock(&current->fs->lock);
-	our_vfsmnt = mntget(current->fs->rootmnt);
-	base = dget(current->fs->root);
-	read_unlock(&current->fs->lock);
-
-	spin_lock(&vfsmount_lock);
-	de = root;
-	mnt = vfsmnt;
-
-	while (vfsmnt != our_vfsmnt) {
-		if (vfsmnt == vfsmnt->mnt_parent)
-			goto out;
-		de = vfsmnt->mnt_mountpoint;
-		vfsmnt = vfsmnt->mnt_parent;
-	}
-
-	if (!is_subdir(de, base))
-		goto out;
-	spin_unlock(&vfsmount_lock);
-
-exit:
-	dput(base);
-	mntput(our_vfsmnt);
-	dput(root);
-	mntput(mnt);
-	return res;
-out:
-	spin_unlock(&vfsmount_lock);
-	res = -EACCES;
-	goto exit;
-}
-
-static int proc_check_root(struct inode *inode)
-{
-	struct dentry *root;
-	struct vfsmount *vfsmnt;
-
-	if (proc_root_link(inode, &root, &vfsmnt)) /* Ewww... */
-		return -ENOENT;
-	return proc_check_chroot(root, vfsmnt);
-}
-
-static int proc_permission(struct inode *inode, int mask, struct nameidata *nd)
-{
-	if (generic_permission(inode, mask, NULL) != 0)
-		return -EACCES;
-	return proc_check_root(inode);
-}
-
 extern struct seq_operations proc_pid_maps_op;
 static int maps_open(struct inode *inode, struct file *file)
 {
@@ -1001,6 +952,70 @@ static struct file_operations proc_secco
 };
 #endif /* CONFIG_SECCOMP */
 
+static int proc_check_dentry_visible(struct inode *inode,
+	struct dentry *dentry, struct vfsmount *mnt)
+{
+	/* Verify that the current process can already see the
+	 * file pointed at by the file descriptor.
+	 * This prevents /proc from being an accidental information leak.
+	 * 
+	 * This prevents access to files that are not visible due to
+	 * being on the other side of a chroot, in a different
+	 * namespace, or are simply process local (like pipes).
+	 */
+	struct dentry *root;
+	struct vfsmount *rootmnt;
+	struct task_struct *task;
+	struct files_struct *task_files, *files;
+	int error = -EACCES;
+
+	/* See if the the two tasks share a commone set of 
+	 * file descriptors.  If so everything is visible.
+	 */
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+	files = get_files_struct(current);
+	task_files = get_files_struct(task);
+	if (files && task_files && (files == task_files))
+		error = 0;
+	if (task_files)
+		put_files_struct(task_files);
+	if (files)
+		put_files_struct(files);
+	put_task_struct(task);
+	if (!error)
+		goto out;
+
+	/* If the two tasks don't share a common set of file
+	 * descriptors see if the destination dentry is already
+	 * visible in the current tasks filesystem namespace.
+	 */
+	read_lock(&current->fs->lock);
+	rootmnt = mntget(current->fs->rootmnt);
+	root = dget(current->fs->root);
+	read_unlock(&current->fs->lock);
+
+	spin_lock(&vfsmount_lock);
+	while (mnt != rootmnt) {
+		if (mnt == mnt->mnt_parent)
+			goto out_unlock;
+		dentry = mnt->mnt_mountpoint;
+		mnt = mnt->mnt_parent;
+	}
+	if (!is_subdir(dentry, root))
+		goto out_unlock;
+	error = 0;
+out_unlock:
+	spin_unlock(&vfsmount_lock);
+
+	dput(root);
+	mntput(rootmnt);
+out:
+	return error;
+
+}
+
 static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
@@ -1011,12 +1026,16 @@ static void *proc_pid_follow_link(struct
 
 	if (current->fsuid != inode->i_uid && !capable(CAP_DAC_OVERRIDE))
 		goto out;
-	error = proc_check_root(inode);
-	if (error)
-		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(inode, &nd->dentry, &nd->mnt);
 	nd->last_type = LAST_BIND;
+	if (error)
+		goto out;
+
+	/* Only return files this task can already see */
+	error = proc_check_dentry_visible(inode, nd->dentry, nd->mnt);
+	if (error)
+		path_release(nd);
 out:
 	return ERR_PTR(error);
 }
@@ -1057,15 +1076,18 @@ static int proc_pid_readlink(struct dent
 
 	if (current->fsuid != inode->i_uid && !capable(CAP_DAC_OVERRIDE))
 		goto out;
-	error = proc_check_root(inode);
-	if (error)
-		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(inode, &de, &mnt);
 	if (error)
 		goto out;
 
+	/* Only return files this task can already see */
+	error = proc_check_dentry_visible(inode, de, mnt);
+	if (error)
+		goto out_put;
+
 	error = do_proc_readlink(de, mnt, buffer, buflen);
+out_put:
 	dput(de);
 	mntput(mnt);
 out:
@@ -1460,7 +1482,6 @@ static struct file_operations proc_task_
  */
 static struct inode_operations proc_fd_inode_operations = {
 	.lookup		= proc_lookupfd,
-	.permission	= proc_permission,
 };
 
 static struct inode_operations proc_task_inode_operations = {
-- 
1.2.2.g709a



* [PATCH 10/23] proc: Fix the link count for /proc/<pid>/task
  2006-02-23 16:08                 ` [PATCH 09/23] proc: Properly filter out files that are not visible to a process Eric W. Biederman
@ 2006-02-23 16:10                   ` Eric W. Biederman
  2006-02-23 16:12                     ` [PATCH 11/23] proc: Move proc_maps_operations into task_mmu.c Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Use getattr to get an accurate link count when needed.  This is cheaper
and more accurate than trying to derive it by walking the thread list
of a process.
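
A minimal userspace sketch of the idea, assuming a toy model rather than
the real VFS: the inode keeps a fixed base link count of 2, and the
thread count is added only when attributes are requested.  The toy_*
names and fields are hypothetical; the real code uses generic_fillattr()
and the signal struct's count under task_lock().

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: just the two facts the link count depends on (hypothetical). */
struct toy_task {
	int alive;
	int nr_threads;
};

/* Toy inode for /proc/<pid>/task (hypothetical). */
struct toy_inode {
	unsigned int nlink;		/* fixed base count: "." and ".." */
	struct toy_task *task;
};

struct toy_stat {
	unsigned int nlink;
};

/* Compute the link count on demand instead of caching it in the inode. */
static int toy_task_getattr(const struct toy_inode *inode, struct toy_stat *st)
{
	assert(inode && st);

	st->nlink = inode->nlink;	/* copy of the static attributes */
	if (inode->task && inode->task->alive)
		st->nlink += (unsigned int)inode->task->nr_threads;
	return 0;
}
```

Because nothing is cached, a stat taken after threads come and go
reflects the count at that moment, and a dead task simply reports the
base value.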

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

eec5c7327c53b862025a595985a72fa01509c5e4
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 81c2f2a..1a39258 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1466,6 +1466,7 @@ out:
 
 static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir);
 static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd);
+static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat);
 
 static struct file_operations proc_fd_operations = {
 	.read		= generic_read_dir,
@@ -1486,6 +1487,7 @@ static struct inode_operations proc_fd_i
 
 static struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
+	.getattr	= proc_task_getattr,
 };
 
 #ifdef CONFIG_SECURITY
@@ -1592,7 +1594,7 @@ static struct dentry *proc_pident_lookup
 	 */
 	switch(p->type) {
 		case PROC_TGID_TASK:
-			inode->i_nlink = 2 + get_tid_list(2, NULL, dir);
+			inode->i_nlink = 2;
 			inode->i_op = &proc_task_inode_operations;
 			inode->i_fop = &proc_task_operations;
 			break;
@@ -2189,7 +2191,6 @@ static int proc_task_readdir(struct file
 	}
 
 	nr_tids = get_tid_list(pos, tid_array, inode);
-	inode->i_nlink = pos + nr_tids;
 
 	for (i = 0; i < nr_tids; i++) {
 		unsigned long j = PROC_NUMBUF;
@@ -2209,3 +2210,19 @@ out:
 	filp->f_pos = pos;
 	return retval;
 }
+
+static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *p = proc_task(inode);
+	generic_fillattr(inode, stat);
+
+	if (pid_alive(p)) {
+		task_lock(p);
+		if (p->signal)
+			stat->nlink += atomic_read(&p->signal->count);
+		task_unlock(p);
+	}
+		
+	return 0;
+}
-- 
1.2.2.g709a



* [PATCH 11/23] proc: Move proc_maps_operations into task_mmu.c
  2006-02-23 16:10                   ` [PATCH 10/23] proc: Fix the link count for /proc/<pid>/task Eric W. Biederman
@ 2006-02-23 16:12                     ` Eric W. Biederman
  2006-02-23 16:15                       ` [PATCH 12/23] proc: Rewrite the proc dentry flush on exit optimization Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


All of the functions for proc_maps_operations are already
defined in task_mmu.c so move the operations structure to
keep the functionality together.

Since task_nommu.c implements a dummy version of
/proc/<pid>/maps, give it a simplified version
of proc_maps_operations that it can modify
to best suit its needs.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c       |   61 --------------------------------------------------
 fs/proc/internal.h   |    4 +++
 fs/proc/task_mmu.c   |   54 ++++++++++++++++++++++++++++++++++++++++++--
 fs/proc/task_nommu.c |   21 ++++++++++++++++-
 4 files changed, 75 insertions(+), 65 deletions(-)

b5c8957160dd4dd4a680f3bc99cdcda6af7bf1de
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1a39258..4bdc859 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -489,67 +489,6 @@ static int proc_oom_score(struct task_st
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-extern struct seq_operations proc_pid_maps_op;
-static int maps_open(struct inode *inode, struct file *file)
-{
-	struct task_struct *task = proc_task(inode);
-	int ret = seq_open(file, &proc_pid_maps_op);
-	if (!ret) {
-		struct seq_file *m = file->private_data;
-		m->private = task;
-	}
-	return ret;
-}
-
-static struct file_operations proc_maps_operations = {
-	.open		= maps_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-};
-
-#ifdef CONFIG_NUMA
-extern struct seq_operations proc_pid_numa_maps_op;
-static int numa_maps_open(struct inode *inode, struct file *file)
-{
-	struct task_struct *task = proc_task(inode);
-	int ret = seq_open(file, &proc_pid_numa_maps_op);
-	if (!ret) {
-		struct seq_file *m = file->private_data;
-		m->private = task;
-	}
-	return ret;
-}
-
-static struct file_operations proc_numa_maps_operations = {
-	.open		= numa_maps_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-};
-#endif
-
-#ifdef CONFIG_MMU
-extern struct seq_operations proc_pid_smaps_op;
-static int smaps_open(struct inode *inode, struct file *file)
-{
-	struct task_struct *task = proc_task(inode);
-	int ret = seq_open(file, &proc_pid_smaps_op);
-	if (!ret) {
-		struct seq_file *m = file->private_data;
-		m->private = task;
-	}
-	return ret;
-}
-
-static struct file_operations proc_smaps_operations = {
-	.open		= smaps_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-};
-#endif
-
 extern struct seq_operations mounts_op;
 struct proc_mounts {
 	struct seq_file m;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 8ea21d3..ac95bfc 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -37,6 +37,10 @@ extern int proc_tgid_stat(struct task_st
 extern int proc_pid_status(struct task_struct *, char *);
 extern int proc_pid_statm(struct task_struct *, char *);
 
+extern struct file_operations proc_maps_operations;
+extern struct file_operations proc_numa_maps_operations;
+extern struct file_operations proc_smaps_operations;
+
 void free_proc_entry(struct proc_dir_entry *de);
 
 int proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0eaad41..56cd932 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -375,27 +375,75 @@ static void *m_next(struct seq_file *m, 
 	return (vma != tail_vma)? tail_vma: NULL;
 }
 
-struct seq_operations proc_pid_maps_op = {
+static struct seq_operations proc_pid_maps_op = {
 	.start	= m_start,
 	.next	= m_next,
 	.stop	= m_stop,
 	.show	= show_map
 };
 
-struct seq_operations proc_pid_smaps_op = {
+static struct seq_operations proc_pid_smaps_op = {
 	.start	= m_start,
 	.next	= m_next,
 	.stop	= m_stop,
 	.show	= show_smap
 };
 
+static int do_maps_open(struct inode *inode, struct file *file, 
+			struct seq_operations *ops)
+{
+	struct task_struct *task = proc_task(inode);
+	int ret = seq_open(file, ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = task;
+	}
+	return ret;
+}
+
+static int maps_open(struct inode *inode, struct file *file)
+{
+	return do_maps_open(inode, file, &proc_pid_maps_op);
+}
+
+struct file_operations proc_maps_operations = {
+	.open		= maps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_NUMA
 extern int show_numa_map(struct seq_file *m, void *v);
 
-struct seq_operations proc_pid_numa_maps_op = {
+static struct seq_operations proc_pid_numa_maps_op = {
         .start  = m_start,
         .next   = m_next,
         .stop   = m_stop,
         .show   = show_numa_map
 };
+
+static int numa_maps_open(struct inode *inode, struct file *file)
+{
+	return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+}
+
+struct file_operations proc_numa_maps_operations = {
+	.open		= numa_maps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
 #endif
+
+static int smaps_open(struct inode *inode, struct file *file)
+{
+	return do_maps_open(inode, file, &proc_pid_smaps_op);
+}
+
+struct file_operations proc_smaps_operations = {
+	.open		= smaps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 8f68827..af69f28 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -156,9 +156,28 @@ static void *m_next(struct seq_file *m, 
 {
 	return NULL;
 }
-struct seq_operations proc_pid_maps_op = {
+static struct seq_operations proc_pid_maps_op = {
 	.start	= m_start,
 	.next	= m_next,
 	.stop	= m_stop,
 	.show	= show_map
 };
+
+static int maps_open(struct inode *inode, struct file *file)
+{
+	int ret;
+	ret = seq_open(file, &proc_pid_maps_op);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = NULL;
+	}
+	return ret;
+}
+
+struct file_operations proc_maps_operations = {
+	.open		= maps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
-- 
1.2.2.g709a



* [PATCH 12/23] proc: Rewrite the proc dentry flush on exit optimization.
  2006-02-23 16:12                     ` [PATCH 11/23] proc: Move proc_maps_operations into task_mmu.c Eric W. Biederman
@ 2006-02-23 16:15                       ` Eric W. Biederman
  2006-02-23 16:16                         ` [PATCH 13/23] proc: Close the race of a process dying during lookup Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


To keep the dcache from filling up with dead /proc entries we flush
them on process exit.  However, over the years that code has gotten
hairy, with a dentry pointer and a lock in task_struct, and has been
misdocumented as a correctness feature.

I have rewritten this code to look and see if we have a corresponding
entry in the dcache and if so flush it on process exit.  This removes
the extra fields in the task_struct and allows me to trivially handle
the case of a /proc/<tgid>/task/<pid> entry as well as the current
/proc/<pid> entries. 
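
A minimal userspace sketch of the lookup-then-flush approach, assuming a
toy dcache held in a flat array.  All toy_* names are hypothetical; the
real code uses d_lookup(), shrink_dcache_parent(), d_drop() and dput()
on real dentries.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX 16

/* One cached name under a parent directory (hypothetical layout). */
struct toy_entry {
	int used;		/* 0 once the entry has been dropped */
	int parent;		/* index of the parent entry, -1 for the root */
	char name[16];
};

struct toy_dcache {
	struct toy_entry e[TOY_MAX];
	int n;
};

static int toy_add(struct toy_dcache *dc, int parent, const char *name)
{
	size_t len = strlen(name);

	assert(dc->n < TOY_MAX && len < sizeof(dc->e[0].name));
	dc->e[dc->n].used = 1;
	dc->e[dc->n].parent = parent;
	memcpy(dc->e[dc->n].name, name, len + 1);
	return dc->n++;
}

/* Return the index of a live entry called name under parent, or -1. */
static int toy_lookup(const struct toy_dcache *dc, int parent, const char *name)
{
	for (int i = 0; i < dc->n; i++)
		if (dc->e[i].used && dc->e[i].parent == parent &&
		    strcmp(dc->e[i].name, name) == 0)
			return i;
	return -1;
}

/* Drop an entry and everything cached beneath it. */
static void toy_drop_tree(struct toy_dcache *dc, int idx)
{
	dc->e[idx].used = 0;
	for (int i = 0; i < dc->n; i++)
		if (dc->e[i].used && dc->e[i].parent == idx)
			toy_drop_tree(dc, i);
}

/* Flush <root>/<pid> and, for non-leaders, <root>/<tgid>/task/<pid>. */
static void toy_flush_task(struct toy_dcache *dc, int root, int pid, int tgid)
{
	char buf[16];
	int leader, dir, idx;

	snprintf(buf, sizeof(buf), "%d", pid);
	idx = toy_lookup(dc, root, buf);
	if (idx >= 0)
		toy_drop_tree(dc, idx);

	if (pid == tgid)
		return;		/* group leader: nothing more to chase */

	snprintf(buf, sizeof(buf), "%d", tgid);
	leader = toy_lookup(dc, root, buf);
	if (leader < 0)
		return;
	dir = toy_lookup(dc, leader, "task");
	if (dir < 0)
		return;
	snprintf(buf, sizeof(buf), "%d", pid);
	idx = toy_lookup(dc, dir, buf);
	if (idx >= 0)
		toy_drop_tree(dc, idx);
}
```

The point of the design is that the task no longer needs to remember
its dentry: the name is rebuilt from the pid and looked up, and a miss
just means there was nothing cached to flush.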

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/exec.c                 |    9 ---
 fs/proc/base.c            |  143 ++++++++++++++++++++++-----------------------
 include/linux/init_task.h |    1 
 include/linux/proc_fs.h   |    6 +-
 include/linux/sched.h     |    3 -
 kernel/exit.c             |   12 ----
 kernel/fork.c             |    3 -
 7 files changed, 72 insertions(+), 105 deletions(-)

8ab2fe374434b424ef579e6e8d644a2b81ec4459
diff --git a/fs/exec.c b/fs/exec.c
index fc1c7a2..8033939 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -657,7 +657,6 @@ static int de_thread(struct task_struct 
 	 */
 	if (!thread_group_leader(current)) {
 		struct task_struct *parent;
-		struct dentry *proc_dentry1, *proc_dentry2;
 		unsigned long ptrace;
 
 		/*
@@ -669,10 +668,6 @@ static int de_thread(struct task_struct 
 		while (leader->exit_state != EXIT_ZOMBIE)
 			yield();
 
-		spin_lock(&leader->proc_lock);
-		spin_lock(&current->proc_lock);
-		proc_dentry1 = proc_pid_unhash(current);
-		proc_dentry2 = proc_pid_unhash(leader);
 		write_lock_irq(&tasklist_lock);
 
 		BUG_ON(leader->tgid != current->tgid);
@@ -729,10 +724,6 @@ static int de_thread(struct task_struct 
 		leader->exit_state = EXIT_DEAD;
 
 		write_unlock_irq(&tasklist_lock);
-		spin_unlock(&leader->proc_lock);
-		spin_unlock(&current->proc_lock);
-		proc_pid_flush(proc_dentry1);
-		proc_pid_flush(proc_dentry2);
         }
 
 	/*
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4bdc859..9fab7fe 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -911,7 +911,7 @@ static int proc_check_dentry_visible(str
 	/* See if the the two tasks share a commone set of 
 	 * file descriptors.  If so everything is visible.
 	 */
-	task = get_proc_task(inode);
+	task = proc_task(inode);
 	if (!task)
 		goto out;
 	files = get_files_struct(current);
@@ -922,7 +922,6 @@ static int proc_check_dentry_visible(str
 		put_files_struct(task_files);
 	if (files)
 		put_files_struct(files);
-	put_task_struct(task);
 	if (!error)
 		goto out;
 
@@ -1291,16 +1290,6 @@ static int tid_fd_revalidate(struct dent
 	return 0;
 }
 
-static void pid_base_iput(struct dentry *dentry, struct inode *inode)
-{
-	struct task_struct *task = proc_task(inode);
-	spin_lock(&task->proc_lock);
-	if (task->proc_dentry == dentry)
-		task->proc_dentry = NULL;
-	spin_unlock(&task->proc_lock);
-	iput(inode);
-}
-
 static int pid_delete_dentry(struct dentry * dentry)
 {
 	/* Is the task we represent dead?
@@ -1322,13 +1311,6 @@ static struct dentry_operations pid_dent
 	.d_delete	= pid_delete_dentry,
 };
 
-static struct dentry_operations pid_base_dentry_operations =
-{
-	.d_revalidate	= pid_revalidate,
-	.d_iput		= pid_base_iput,
-	.d_delete	= pid_delete_dentry,
-};
-
 /* Lookups */
 
 static unsigned name_to_int(struct dentry *dentry)
@@ -1787,57 +1769,78 @@ static struct inode_operations proc_self
 };
 
 /**
- * proc_pid_unhash -  Unhash /proc/@pid entry from the dcache.
- * @p: task that should be flushed.
+ * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
+ *
+ * @task: task that should be flushed.
+ *
+ * Looks in the dcache for
+ * /proc/@pid 
+ * /proc/@tgid/task/@pid
+ * if either directory is present flushes it and all of its children
+ * from the dcache.
  *
- * Drops the /proc/@pid dcache entry from the hash chains.
+ * It is safe and reasonable to cache /proc entries for a task until
+ * that task exits.  After that they just clog up the dcache with
+ * useless entries, possibly causing useful dcache entries to be
+ * flushed instead.  This routine is provided to flush those useless
+ * dcache entries at process exit time.
  *
- * Dropping /proc/@pid entries and detach_pid must be synchroneous,
- * otherwise e.g. /proc/@pid/exe might point to the wrong executable,
- * if the pid value is immediately reused. This is enforced by
- * - caller must acquire spin_lock(p->proc_lock)
- * - must be called before detach_pid()
- * - proc_pid_lookup acquires proc_lock, and checks that
- *   the target is not dead by looking at the attach count
- *   of PIDTYPE_PID.
+ * NOTE: This routine is just an optimization so it does not guarantee
+ *       that no dcache entries will exist at process exit time it
+ *       just makes it very unlikely that any will persist.
  */
-
-struct dentry *proc_pid_unhash(struct task_struct *p)
+void proc_flush_task(struct task_struct *task)
 {
-	struct dentry *proc_dentry;
+	struct dentry *dentry, *leader, *dir;
+	char buf[30];
+	struct qstr name;
+
+	name.name = buf;
+	name.len = snprintf(buf, sizeof(buf), "%d", task->pid);
+	name.hash = full_name_hash(name.name, name.len);
+
+	dentry = d_lookup(proc_mnt->mnt_root, &name);
+	if (dentry) {
+		shrink_dcache_parent(dentry);
+		d_drop(dentry);
+		dput(dentry);
+	}
+	
+	if (thread_group_leader(task))
+		goto out;
 
-	proc_dentry = p->proc_dentry;
-	if (proc_dentry != NULL) {
+	name.name = buf;
+	name.len = snprintf(buf, sizeof(buf), "%d", task->tgid);
+	name.hash = full_name_hash(name.name, name.len);
 
-		spin_lock(&dcache_lock);
-		spin_lock(&proc_dentry->d_lock);
-		if (!d_unhashed(proc_dentry)) {
-			dget_locked(proc_dentry);
-			__d_drop(proc_dentry);
-			spin_unlock(&proc_dentry->d_lock);
-		} else {
-			spin_unlock(&proc_dentry->d_lock);
-			proc_dentry = NULL;
-		}
-		spin_unlock(&dcache_lock);
-	}
-	return proc_dentry;
-}
+	leader = d_lookup(proc_mnt->mnt_root, &name);
+	if (!leader)
+		goto out;
 
-/**
- * proc_pid_flush - recover memory used by stale /proc/@pid/x entries
- * @proc_dentry: directoy to prune.
- *
- * Shrink the /proc directory that was used by the just killed thread.
- */
+	name.name = "task";
+	name.len = strlen(name.name);
+	name.hash = full_name_hash(name.name, name.len);
 	
-void proc_pid_flush(struct dentry *proc_dentry)
-{
-	might_sleep();
-	if(proc_dentry != NULL) {
-		shrink_dcache_parent(proc_dentry);
-		dput(proc_dentry);
+	dir = d_lookup(leader, &name);
+	if (!dir)
+		goto out_put_leader;
+
+	name.name = buf;
+	name.len = snprintf(buf, sizeof(buf), "%d", task->pid);
+	name.hash = full_name_hash(name.name, name.len);
+
+	dentry = d_lookup(dir, &name);
+	if (dentry) {
+		shrink_dcache_parent(dentry);
+		d_drop(dentry);
+		dput(dentry);
 	}
+	
+	dput(dir);	
+out_put_leader:
+	dput(leader);	
+out:
+	return;
 }
 
 /* SMP-safe */
@@ -1847,7 +1850,6 @@ struct dentry *proc_pid_lookup(struct in
 	struct inode *inode;
 	struct proc_inode *ei;
 	unsigned tgid;
-	int died;
 
 	if (dentry->d_name.len == 4 && !memcmp(dentry->d_name.name,"self",4)) {
 		inode = new_inode(dir->i_sb);
@@ -1893,23 +1895,16 @@ struct dentry *proc_pid_lookup(struct in
 	inode->i_nlink = 4;
 #endif
 
-	dentry->d_op = &pid_base_dentry_operations;
+	dentry->d_op = &pid_dentry_operations;
 
-	died = 0;
 	d_add(dentry, inode);
-	spin_lock(&task->proc_lock);
-	task->proc_dentry = dentry;
 	if (!pid_alive(task)) {
-		dentry = proc_pid_unhash(task);
-		died = 1;
+		d_drop(dentry);
+		shrink_dcache_parent(dentry);
+		goto out;
 	}
-	spin_unlock(&task->proc_lock);
 
 	put_task_struct(task);
-	if (died) {
-		proc_pid_flush(dentry);
-		goto out;
-	}
 	return NULL;
 out:
 	return ERR_PTR(-ENOENT);
@@ -1952,7 +1947,7 @@ static struct dentry *proc_task_lookup(s
 	inode->i_nlink = 3;
 #endif
 
-	dentry->d_op = &pid_base_dentry_operations;
+	dentry->d_op = &pid_dentry_operations;
 
 	d_add(dentry, inode);
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index dcfd2ec..62deb30 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -117,7 +117,6 @@ extern struct group_info init_groups;
 		.signal = {{0}}},					\
 	.blocked	= {{0}},					\
 	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
-	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index cab152d..302c24e 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -97,9 +97,8 @@ extern void proc_misc_init(void);
 
 struct mm_struct;
 
+void proc_flush_task(struct task_struct *task);
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *);
-struct dentry *proc_pid_unhash(struct task_struct *p);
-void proc_pid_flush(struct dentry *proc_dentry);
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
 unsigned long task_vsize(struct mm_struct *);
 int task_statm(struct mm_struct *, int *, int *, int *, int *);
@@ -209,8 +208,7 @@ static inline void proc_net_remove(const
 #define proc_net_create(name, mode, info)	({ (void)(mode), NULL; })
 static inline void proc_net_remove(const char *name) {}
 
-static inline struct dentry *proc_pid_unhash(struct task_struct *p) { return NULL; }
-static inline void proc_pid_flush(struct dentry *proc_dentry) { }
+static inline void proc_flush_task(struct task_struct *task) { }
 
 static inline struct proc_dir_entry *create_proc_entry(const char *name,
 	mode_t mode, struct proc_dir_entry *parent) { return NULL; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6f51e3..9fb7688 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -824,8 +824,6 @@ struct task_struct {
    	u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings */
 	spinlock_t alloc_lock;
-/* Protection of proc_dentry: nesting proc_lock, dcache_lock, write_lock_irq(&tasklist_lock); */
-	spinlock_t proc_lock;
 
 #ifdef CONFIG_DEBUG_MUTEXES
 	/* mutex deadlock detection */
@@ -838,7 +836,6 @@ struct task_struct {
 /* VM state */
 	struct reclaim_state *reclaim_state;
 
-	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
 
 	struct io_context *io_context;
diff --git a/kernel/exit.c b/kernel/exit.c
index 531aadc..64956e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -63,12 +63,9 @@ void release_task(struct task_struct * p
 {
 	int zap_leader;
 	task_t *leader;
-	struct dentry *proc_dentry;
 
 repeat: 
 	atomic_dec(&p->user->processes);
-	spin_lock(&p->proc_lock);
-	proc_dentry = proc_pid_unhash(p);
 	write_lock_irq(&tasklist_lock);
 	if (unlikely(p->ptrace))
 		__ptrace_unlink(p);
@@ -104,8 +101,7 @@ repeat: 
 
 	sched_exit(p);
 	write_unlock_irq(&tasklist_lock);
-	spin_unlock(&p->proc_lock);
-	proc_pid_flush(proc_dentry);
+	proc_flush_task(p);
 	release_thread(p);
 	put_task_struct(p);
 
@@ -118,15 +114,9 @@ repeat: 
 
 void unhash_process(struct task_struct *p)
 {
-	struct dentry *proc_dentry;
-
-	spin_lock(&p->proc_lock);
-	proc_dentry = proc_pid_unhash(p);
 	write_lock_irq(&tasklist_lock);
 	__unhash_process(p);
 	write_unlock_irq(&tasklist_lock);
-	spin_unlock(&p->proc_lock);
-	proc_pid_flush(proc_dentry);
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f56d5a..fae6510 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -984,13 +984,10 @@ static task_t *copy_process(unsigned lon
 		if (put_user(p->pid, parent_tidptr))
 			goto bad_fork_cleanup;
 
-	p->proc_dentry = NULL;
-
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
 	p->vfork_done = NULL;
 	spin_lock_init(&p->alloc_lock);
-	spin_lock_init(&p->proc_lock);
 
 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);
-- 
1.2.2.g709a



* [PATCH 13/23] proc: Close the race of a process dying during lookup.
  2006-02-23 16:15                       ` [PATCH 12/23] proc: Rewrite the proc dentry flush on exit optimization Eric W. Biederman
@ 2006-02-23 16:16                         ` Eric W. Biederman
  2006-02-23 16:18                           ` [PATCH 14/23] proc: Make PROC_NUMBUF the buffer size for holding integers as strings Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


proc_lookup and task exiting are not synchronized, although some of the
previous code may have suggested that they are.  Every time before we reuse
a dentry, namei.c calls d_op->d_revalidate, which prevents us from reusing a
stale dcache entry.  Unfortunately it does not prevent us from returning a
stale dcache entry.  This race has been explicitly plugged in proc_pid_lookup
but there is nothing to confine it to just that proc lookup function.

So to prevent the race I call revalidate explicitly in all of the
proc lookup functions after I call d_add, and report an error if
the revalidate does not succeed.

Years ago Al Viro did something similar but those changes got lost in
the churn.
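
A minimal userspace sketch of the pattern, assuming a toy dentry and
task.  The toy_* names are hypothetical, and the race_window callback
exists only so the sketch can simulate a task exiting between the add
and the check; the real lookups call pid_revalidate() or
tid_fd_revalidate() straight after d_add().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	bool alive;
};

struct toy_dentry {
	struct toy_task *task;
	bool hashed;		/* true while the entry is findable */
};

/* Keep the entry only while its task is alive; unhash it otherwise. */
static bool toy_revalidate(struct toy_dentry *d)
{
	if (d->task && d->task->alive)
		return true;
	d->hashed = false;
	return false;
}

/* Simulated exit, for use as a race_window callback. */
static void toy_kill(struct toy_task *t)
{
	t->alive = false;
}

/*
 * Returns 0 on success, -1 (standing in for -ENOENT) if the task is
 * dead before the lookup starts or dies before the lookup returns.
 */
static int toy_lookup(struct toy_dentry *d, struct toy_task *task,
		      void (*race_window)(struct toy_task *))
{
	assert(d && task);

	if (!task->alive)
		return -1;

	d->task = task;
	d->hashed = true;		/* the d_add() step */

	if (race_window)
		race_window(task);	/* the task may exit right here */

	/* Re-check after the add so a dead task is never returned. */
	return toy_revalidate(d) ? 0 : -1;
}
```

Passing toy_kill as race_window makes the lookup fail and leaves the
entry unhashed, which is the outcome the explicit revalidate call is
meant to guarantee.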

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   54 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 29 insertions(+), 25 deletions(-)

aea0459c7bef967ce3345449db5183d4b2dafefe
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9fab7fe..36cddda 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1340,6 +1340,7 @@ static struct dentry *proc_lookupfd(stru
 {
 	struct task_struct *task = proc_task(dir);
 	unsigned fd = name_to_int(dentry);
+	struct dentry *result = ERR_PTR(-ENOENT);
 	struct file * file;
 	struct files_struct * files;
 	struct inode *inode;
@@ -1374,15 +1375,18 @@ static struct dentry *proc_lookupfd(stru
 	ei->op.proc_get_link = proc_fd_link;
 	dentry->d_op = &tid_fd_dentry_operations;
 	d_add(dentry, inode);
-	return NULL;
+	/* Close the race of the process dying before we return the dentry */
+	if (tid_fd_revalidate(dentry, NULL))
+		result = NULL;
+out:
+	return result;
 
 out_unlock2:
 	rcu_read_unlock();
 	put_files_struct(files);
 out_unlock:
 	iput(inode);
-out:
-	return ERR_PTR(-ENOENT);
+	goto out;
 }
 
 static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir);
@@ -1482,12 +1486,12 @@ static struct dentry *proc_pident_lookup
 					 struct pid_entry *ents)
 {
 	struct inode *inode;
-	int error;
+	struct dentry *error;
 	struct task_struct *task = proc_task(dir);
 	struct pid_entry *p;
 	struct proc_inode *ei;
 
-	error = -ENOENT;
+	error = ERR_PTR(-ENOENT);
 	inode = NULL;
 
 	if (!pid_alive(task))
@@ -1502,7 +1506,7 @@ static struct dentry *proc_pident_lookup
 	if (!p->name)
 		goto out;
 
-	error = -EINVAL;
+	error = ERR_PTR(-EINVAL);
 	inode = proc_pid_make_inode(dir->i_sb, task, p->type);
 	if (!inode)
 		goto out;
@@ -1663,14 +1667,16 @@ static struct dentry *proc_pident_lookup
 		default:
 			printk("procfs: impossible type (%d)",p->type);
 			iput(inode);
-			return ERR_PTR(-EINVAL);
+			error = ERR_PTR(-EINVAL);
+			goto out;
 	}
 	dentry->d_op = &pid_dentry_operations;
 	d_add(dentry, inode);
-	return NULL;
-
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		error = NULL;
 out:
-	return ERR_PTR(error);
+	return error;
 }
 
 static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd){
@@ -1846,6 +1852,7 @@ out:
 /* SMP-safe */
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
 {
+	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
 	struct inode *inode;
 	struct proc_inode *ei;
@@ -1879,12 +1886,9 @@ struct dentry *proc_pid_lookup(struct in
 		goto out;
 
 	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TGID_INO);
+	if (!inode)
+		goto out_put_task;
 
-
-	if (!inode) {
-		put_task_struct(task);
-		goto out;
-	}
 	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
 	inode->i_op = &proc_tgid_base_inode_operations;
 	inode->i_fop = &proc_tgid_base_operations;
@@ -1898,21 +1902,20 @@ struct dentry *proc_pid_lookup(struct in
 	dentry->d_op = &pid_dentry_operations;
 
 	d_add(dentry, inode);
-	if (!pid_alive(task)) {
-		d_drop(dentry);
-		shrink_dcache_parent(dentry);
-		goto out;
-	}
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		result = NULL;
 
+out_put_task:
 	put_task_struct(task);
-	return NULL;
 out:
-	return ERR_PTR(-ENOENT);
+	return result;
 }
 
 /* SMP-safe */
 static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
 {
+	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
 	struct task_struct *leader = proc_task(dir);
 	struct inode *inode;
@@ -1950,13 +1953,14 @@ static struct dentry *proc_task_lookup(s
 	dentry->d_op = &pid_dentry_operations;
 
 	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		result = NULL;
 
-	put_task_struct(task);
-	return NULL;
 out_drop_task:
 	put_task_struct(task);
 out:
-	return ERR_PTR(-ENOENT);
+	return result;
 }
 
 #define PROC_NUMBUF 10
-- 
1.2.2.g709a



* [PATCH 14/23] proc: Make PROC_NUMBUF the buffer size for holding a integers as strings.
  2006-02-23 16:16                         ` [PATCH 13/23] proc: Close the race of a process dying durning lookup Eric W. Biederman
@ 2006-02-23 16:18                           ` Eric W. Biederman
  2006-02-23 16:20                             ` [PATCH 15/23] proc: refactor reading directories of tasks Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Currently in /proc at several different places we define buffers to
hold a process id or a file descriptor.  In most of them we use
either a hard coded number or a different define.  Modify them all to
use PROC_NUMBUF, so the code has a chance of being maintained.
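
The clamp in oom_adjust_write can be sketched in user space as follows.
This is only an illustration, not the kernel code: DEMO_NUMBUF and
demo_parse_int are hypothetical names, memcpy stands in for
copy_from_user, and strtol stands in for simple_strtol.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PROC_NUMBUF; the value is only for this demo. */
#define DEMO_NUMBUF 10

/*
 * Copy at most sizeof(buffer) - 1 bytes of untrusted input into a
 * zeroed buffer, then parse it.  The size is named once, so changing
 * DEMO_NUMBUF cannot leave a stale literal such as 6 or 8 behind.
 */
static long demo_parse_int(const char *input, size_t count)
{
	char buffer[DEMO_NUMBUF];

	assert(input != NULL);
	memset(buffer, 0, sizeof(buffer));
	if (count > sizeof(buffer) - 1)
		count = sizeof(buffer) - 1;	/* always leave room for the NUL */
	memcpy(buffer, input, count);
	return strtol(buffer, NULL, 0);
}
```

Input longer than the buffer is silently truncated before parsing, which
is the same behaviour the patched oom_adjust_write has.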

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   30 +++++++++++++++---------------
 1 files changed, 15 insertions(+), 15 deletions(-)

43e5ec3ac1c2badfd746bc7f17f60235dfde549f
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 36cddda..1ab12e5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -181,6 +181,9 @@ enum pid_directory_inos {
 	PROC_TID_FD_DIR = 0x8000,	/* 0x8000-0xffff */
 };
 
+/* Worst case buffer size needed for holding an integer. */
+#define PROC_NUMBUF 10
+
 struct pid_entry {
 	int type;
 	int len;
@@ -725,12 +728,12 @@ static ssize_t oom_adjust_read(struct fi
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task = proc_task(file->f_dentry->d_inode);
-	char buffer[8];
+	char buffer[PROC_NUMBUF];
 	size_t len;
 	int oom_adjust = task->oomkilladj;
 	loff_t __ppos = *ppos;
 
-	len = sprintf(buffer, "%i\n", oom_adjust);
+	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
 	if (__ppos >= len)
 		return 0;
 	if (count > len-__ppos)
@@ -745,14 +748,14 @@ static ssize_t oom_adjust_write(struct f
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task = proc_task(file->f_dentry->d_inode);
-	char buffer[8], *end;
+	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
 
 	if (!capable(CAP_SYS_RESOURCE))
 		return -EPERM;
-	memset(buffer, 0, 8);
-	if (count > 6)
-		count = 6;
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
 	oom_adjust = simple_strtol(buffer, &end, 0);
@@ -1037,8 +1040,6 @@ static struct inode_operations proc_pid_
 	.follow_link	= proc_pid_follow_link
 };
 
-#define NUMBUF 10
-
 static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
 {
 	struct dentry *dentry = filp->f_dentry;
@@ -1046,7 +1047,7 @@ static int proc_readfd(struct file * fil
 	struct task_struct *p = proc_task(inode);
 	unsigned int fd, tid, ino;
 	int retval;
-	char buf[NUMBUF];
+	char buf[PROC_NUMBUF];
 	struct files_struct * files;
 	struct fdtable *fdt;
 
@@ -1082,7 +1083,7 @@ static int proc_readfd(struct file * fil
 					continue;
 				rcu_read_unlock();
 
-				j = NUMBUF;
+				j = PROC_NUMBUF;
 				i = fd;
 				do {
 					j--;
@@ -1091,7 +1092,7 @@ static int proc_readfd(struct file * fil
 				} while (i);
 
 				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
-				if (filldir(dirent, buf+j, NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
+				if (filldir(dirent, buf+j, PROC_NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
 					rcu_read_lock();
 					break;
 				}
@@ -1757,14 +1758,14 @@ static struct inode_operations proc_tid_
 static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
 			      int buflen)
 {
-	char tmp[30];
+	char tmp[PROC_NUMBUF];
 	sprintf(tmp, "%d", current->tgid);
 	return vfs_readlink(dentry,buffer,buflen,tmp);
 }
 
 static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
-	char tmp[30];
+	char tmp[PROC_NUMBUF];
 	sprintf(tmp, "%d", current->tgid);
 	return ERR_PTR(vfs_follow_link(nd,tmp));
 }	
@@ -1798,7 +1799,7 @@ static struct inode_operations proc_self
 void proc_flush_task(struct task_struct *task)
 {
 	struct dentry *dentry, *leader, *dir;
-	char buf[30];
+	char buf[PROC_NUMBUF];
 	struct qstr name;
 
 	name.name = buf;
@@ -1963,7 +1964,6 @@ out:
 	return result;
 }
 
-#define PROC_NUMBUF 10
 #define PROC_MAXPIDS 20
 
 /*
-- 
1.2.2.g709a



* [PATCH 15/23] proc: refactor reading directories of tasks.
  2006-02-23 16:18                           ` [PATCH 14/23] proc: Make PROC_NUMBUF the buffer size for holding a integers as strings Eric W. Biederman
@ 2006-02-23 16:20                             ` Eric W. Biederman
  2006-02-23 16:23                               ` [PATCH 16/23] proc: Don't lock task_structs indefinitely Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel



There are a couple of problems this patch addresses.
- /proc/<tgid>/task currently does not work correctly if you stop reading
  in the middle of a directory.

- /proc/ currently requires a full pass through the task list with
  the tasklist lock held, to determine there are no more processes to read.

- The hand rolled integer to string conversion does not properly handle
  running out of buffer space.

- We seem to be batching reading of pids from the tasklist without reason,
  and complicating the logic of the code.

This patch addresses that by changing how tasks are processed.  A
first_<task_type> function is built that handles restarts, and a
next_<task_type> function is built that just advances to the next
task.

first_<task_type> when it detects a restart usually uses
find_task_by_pid.  If that doesn't work because there has been
a seek on the directory, or we have already given a complete
directory listing, it first checks the number of tasks of that
type, and only if we are under that count does it walk through
all of the tasks to find the one we are interested in.
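
The first/next split described above can be sketched in user space
roughly like this.  It is a simplified model under loud assumptions: an
array stands in for the task list, there is no locking or reference
counting, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; only the id matters here. */
struct demo_task {
	int id;
};

/* Hypothetical lookup by id, playing the role of find_task_by_pid. */
static const struct demo_task *demo_find(const struct demo_task *tasks,
					 size_t count, int id)
{
	for (size_t i = 0; i < count; i++)
		if (tasks[i].id == id)
			return &tasks[i];
	return NULL;
}

/*
 * Pick the entry a directory read should resume from.
 * saved_id: id cached by the previous short read, or 0 if none.
 * nr:       entries already consumed, used when saved_id is unusable.
 */
static const struct demo_task *demo_first(const struct demo_task *tasks,
					  size_t count, int saved_id, size_t nr)
{
	const struct demo_task *pos = NULL;

	if (saved_id)
		pos = demo_find(tasks, count, saved_id);
	if (pos)
		return pos;		/* fast path: resume by id */
	if (nr >= count)
		return NULL;		/* past the end: no walk needed */
	return &tasks[nr];		/* slow path: skip nr entries */
}

/* Advance to the following entry, or return NULL at the end. */
static const struct demo_task *demo_next(const struct demo_task *tasks,
					 size_t count,
					 const struct demo_task *pos)
{
	size_t idx;

	assert(pos != NULL);
	idx = (size_t)(pos - tasks) + 1;
	return idx < count ? &tasks[idx] : NULL;
}
```

A caller then needs only one loop of the form
for (pos = demo_first(...); pos; pos = demo_next(...)), which is the
shape the reworked readdir functions take.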

The code that fills in the directory is simpler because there is
only a single for loop.

The hand rolled integer to string conversion is replaced by snprintf
which should handle the out of buffer case correctly.
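
As a user-space illustration of the difference, the sketch below shows
a right-to-left conversion in the style of the removed code with the
bounds check it lacked, next to an snprintf wrapper.  Both function
names are hypothetical.  Note that standard snprintf returns the length
the full string would have had, so a caller still has to compare that
value with the buffer size to notice truncation.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the decimal digits of value right to left into buf[0..size-1].
 * Returns the index of the first digit, or -1 if the buffer is too
 * small.  No NUL terminator is written, as in the removed kernel loop.
 */
static int demo_itoa_checked(char *buf, int size, unsigned int value)
{
	int j = size;

	assert(buf != NULL);
	do {
		if (j == 0)
			return -1;	/* out of buffer space */
		buf[--j] = (char)('0' + value % 10);
	} while ((value /= 10) != 0);
	return j;
}

/*
 * snprintf never writes past size bytes.  Return the string length,
 * or -1 if the value did not fit and the output was truncated.
 */
static int demo_format(char *buf, size_t size, int value)
{
	int len = snprintf(buf, size, "%d", value);

	if (len < 0 || (size_t)len >= size)
		return -1;
	return len;
}
```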

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |  285 ++++++++++++++++++++++++++++++++++----------------------
 1 files changed, 174 insertions(+), 111 deletions(-)

fa086d704f447a54a1f3566471d8a32f173701af
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1ab12e5..f507887 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1479,8 +1479,6 @@ static struct file_operations proc_tgid_
 static struct inode_operations proc_tgid_attr_inode_operations;
 #endif
 
-static int get_tid_list(int index, unsigned int *tids, struct inode *dir);
-
 /* SMP-safe */
 static struct dentry *proc_pident_lookup(struct inode *dir, 
 					 struct dentry *dentry,
@@ -1964,89 +1962,89 @@ out:
 	return result;
 }
 
-#define PROC_MAXPIDS 20
-
 /*
- * Get a few tgid's to return for filldir - we need to hold the
- * tasklist lock while doing this, and we must release it before
- * we actually do the filldir itself, so we use a temp buffer..
+ * Find the first tgid to return to user space.
+ *
+ * Usually this is just whatever follows &init_task, but if the user's
+ * buffer was too small to hold the full list or there was a seek into
+ * the middle of the directory we have more work to do.
+ *
+ * In the case of a short read we start with find_task_by_pid.
+ * 
+ * In the case of a seek we start with &init_task and walk nr
+ * threads past it.
  */
-static int get_tgid_list(int index, unsigned long version, unsigned int *tgids)
+static struct task_struct *first_tgid(int tgid, int nr)
 {
-	struct task_struct *p;
-	int nr_tgids = 0;
-
-	index--;
+	struct task_struct *pos = NULL;
 	read_lock(&tasklist_lock);
-	p = NULL;
-	if (version) {
-		p = find_task_by_pid(version);
-		if (p && !thread_group_leader(p))
-			p = NULL;
-	}
+	if (tgid && nr) {
+		pos = find_task_by_pid(tgid);
+		if (pos && !thread_group_leader(pos))
+			pos = NULL;
+		if (pos)
+			nr = 0;
+	}
+	/* If nr exceeds the number of processes get out quickly */
+	if (nr && nr >= nr_processes())
+		goto done;
 
-	if (p)
-		index = 0;
-	else
-		p = next_task(&init_task);
+	/* If we haven't found our starting place yet start with
+	 * the init_task and walk nr tasks forward.
+	 */
+	if (!pos && (nr >= 0))
+		pos = next_task(&init_task);
 
-	for ( ; p != &init_task; p = next_task(p)) {
-		int tgid = p->pid;
-		if (!pid_alive(p))
-			continue;
-		if (--index >= 0)
+	/* The pid_alive test serves two purposes.
+	 * - The first is to verify the task is actually valid.
+	 * - The second is to ensure we don't go around the list
+	 *   of processes more than once.  pid_alive always
+	 *   fails for init_task as it has pid == 0 and is unhashed.
+	 */
+	for (; pos && pid_alive(pos); pos = next_task(pos)) {
+		if (--nr > 0)
 			continue;
-		tgids[nr_tgids] = tgid;
-		nr_tgids++;
-		if (nr_tgids >= PROC_MAXPIDS)
-			break;
+		get_task_struct(pos);
+		goto done;
 	}
+	pos = NULL;
+done:
 	read_unlock(&tasklist_lock);
-	return nr_tgids;
+	return pos;
 }
 
-/*
- * Get a few tid's to return for filldir - we need to hold the
- * tasklist lock while doing this, and we must release it before
- * we actually do the filldir itself, so we use a temp buffer..
+/* 
+ * Find the next task in the task list.
+ * Return NULL if we loop or there is any error.
+ *
+ * The reference to the input task_struct is released.
  */
-static int get_tid_list(int index, unsigned int *tids, struct inode *dir)
+static struct task_struct *next_tgid(struct task_struct *start)
 {
-	struct task_struct *leader_task = proc_task(dir);
-	struct task_struct *task = leader_task;
-	int nr_tids = 0;
-
-	index -= 2;
+	struct task_struct *pos;
 	read_lock(&tasklist_lock);
-	/*
-	 * The starting point task (leader_task) might be an already
-	 * unlinked task, which cannot be used to access the task-list
-	 * via next_thread().
-	 */
-	if (pid_alive(task)) do {
-		int tid = task->pid;
-
-		if (--index >= 0)
-			continue;
-		if (tids != NULL)
-			tids[nr_tids] = tid;
-		nr_tids++;
-		if (nr_tids >= PROC_MAXPIDS)
-			break;
-	} while ((task = next_thread(task)) != leader_task);
+	pos = start;
+	if (pid_alive(start))
+		pos = next_task(start);
+	if (pid_alive(pos)) {
+		get_task_struct(pos);
+		goto done;
+	}
+	pos = NULL;
+done:		
 	read_unlock(&tasklist_lock);
-	return nr_tids;
+	put_task_struct(start);
+	return pos;
 }
 
 /* for the /proc/ directory itself, after non-process stuff has been done */
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
-	unsigned int tgid_array[PROC_MAXPIDS];
 	char buf[PROC_NUMBUF];
 	unsigned int nr = filp->f_pos - FIRST_PROCESS_ENTRY;
-	unsigned int nr_tgids, i;
-	int next_tgid;
-
+	struct task_struct *task;
+	int tgid;
+	
 	if (!nr) {
 		ino_t ino = fake_ino(0,PROC_TGID_INO);
 		if (filldir(dirent, "self", 4, filp->f_pos, ino, DT_LNK) < 0)
@@ -2054,62 +2052,123 @@ int proc_pid_readdir(struct file * filp,
 		filp->f_pos++;
 		nr++;
 	}
+	nr -= 1;
 
 	/* f_version caches the tgid value that the last readdir call couldn't
 	 * return. lseek aka telldir automagically resets f_version to 0.
 	 */
-	next_tgid = filp->f_version;
+	tgid = filp->f_version;
 	filp->f_version = 0;
-	for (;;) {
-		nr_tgids = get_tgid_list(nr, next_tgid, tgid_array);
-		if (!nr_tgids) {
-			/* no more entries ! */
+	for (task = first_tgid(tgid, nr);
+	     task;
+	     task = next_tgid(task), filp->f_pos++) {
+		int len;
+		ino_t ino;
+		tgid = task->pid;
+		len = snprintf(buf, sizeof(buf), "%d", tgid);
+		ino = fake_ino(tgid, PROC_TGID_INO);
+		if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
+			/* returning this tgid failed, save it as the first
+			 * pid for the next readdir call */
+			filp->f_version = tgid;
+			put_task_struct(task);
 			break;
 		}
-		next_tgid = 0;
+	}
+	return 0;
+}
 
-		/* do not use the last found pid, reserve it for next_tgid */
-		if (nr_tgids == PROC_MAXPIDS) {
-			nr_tgids--;
-			next_tgid = tgid_array[nr_tgids];
-		}
+/*
+ * Find the first tid of a thread group to return to user space.
+ *
+ * Usually this is just the thread group leader, but if the user's
+ * buffer was too small or there was a seek into the middle of the
+ * directory we have more work to do.
+ *
+ * In the case of a short read we start with find_task_by_pid.
+ *
+ * In the case of a seek we start with the leader and walk nr
+ * threads past it.
+ */
+static struct task_struct *first_tid(struct task_struct *leader, int tid, int nr)
+{
+	struct task_struct *pos = NULL;
+	read_lock(&tasklist_lock);
 
-		for (i=0;i<nr_tgids;i++) {
-			int tgid = tgid_array[i];
-			ino_t ino = fake_ino(tgid,PROC_TGID_INO);
-			unsigned long j = PROC_NUMBUF;
-
-			do
-				buf[--j] = '0' + (tgid % 10);
-			while ((tgid /= 10) != 0);
-
-			if (filldir(dirent, buf+j, PROC_NUMBUF-j, filp->f_pos, ino, DT_DIR) < 0) {
-				/* returning this tgid failed, save it as the first
-				 * pid for the next readir call */
-				filp->f_version = tgid_array[i];
-				goto out;
-			}
-			filp->f_pos++;
-			nr++;
-		}
+	/* Attempt to start with the pid of a thread */
+	if (tid && (nr > 0)) {
+		pos = find_task_by_pid(tid);
+		if (pos && (pos->group_leader != leader))
+			pos = NULL;
+		if (pos)
+			nr = 0;
+	}
+
+	/* If nr exceeds the number of threads there is nothing to do */
+	if (nr) {
+		int threads = 0;
+		task_lock(leader);
+		if (leader->signal)
+			threads = atomic_read(&leader->signal->count);
+		task_unlock(leader);
+		if (nr >= threads)
+			goto done;
 	}
-out:
-	return 0;
+
+	/* If we haven't found our starting place yet start with the
+	 * leader and walk nr threads forward.
+	 */
+	if (!pos && (nr >= 0))
+		pos = leader;
+
+	for (; pos && pid_alive(pos); pos = next_thread(pos)) {
+		if (--nr > 0)
+			continue;
+		get_task_struct(pos);
+		goto done;
+	}
+	pos = NULL;
+done:
+	read_unlock(&tasklist_lock);
+	return pos;
 }
 
+/*
+ * Find the next thread in the thread list.
+ * Return NULL if there is an error or no next thread.
+ * 
+ * The reference to the input task_struct is released.
+ */
+static struct task_struct *next_tid(struct task_struct *start)
+{
+	struct task_struct *pos;
+	read_lock(&tasklist_lock);
+	pos = start;
+	if (pid_alive(start))
+		pos = next_thread(start);
+	if (pid_alive(pos) && (pos != start->group_leader))
+		get_task_struct(pos);
+	else
+		pos = NULL;
+	read_unlock(&tasklist_lock);
+	put_task_struct(start);
+	return pos;
+}
+  
 /* for the /proc/TGID/task/ directories */
 static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
-	unsigned int tid_array[PROC_MAXPIDS];
 	char buf[PROC_NUMBUF];
-	unsigned int nr_tids, i;
 	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
+	struct task_struct *leader = proc_task(inode);
+	struct task_struct *task;
 	int retval = -ENOENT;
 	ino_t ino;
+	int tid;
 	unsigned long pos = filp->f_pos;  /* avoiding "long long" filp->f_pos */
 
-	if (!pid_alive(proc_task(inode)))
+	if (!pid_alive(leader))
 		goto out;
 	retval = 0;
 
@@ -2128,21 +2187,25 @@ static int proc_task_readdir(struct file
 		/* fall through */
 	}
 
-	nr_tids = get_tid_list(pos, tid_array, inode);
-
-	for (i = 0; i < nr_tids; i++) {
-		unsigned long j = PROC_NUMBUF;
-		int tid = tid_array[i];
-
-		ino = fake_ino(tid,PROC_TID_INO);
-
-		do
-			buf[--j] = '0' + (tid % 10);
-		while ((tid /= 10) != 0);
-
-		if (filldir(dirent, buf+j, PROC_NUMBUF-j, pos, ino, DT_DIR) < 0)
+	/* f_version caches the tid value that the last readdir call couldn't
+	 * return. lseek aka telldir automagically resets f_version to 0.
+	 */
+	tid = filp->f_version;
+	filp->f_version = 0;
+	for (task = first_tid(leader, tid, pos - 2);
+	     task;
+	     task = next_tid(task), pos++) {
+		int len;
+		tid = task->pid;
+		len = snprintf(buf, sizeof(buf), "%d", tid);
+		ino = fake_ino(tid, PROC_TID_INO);
+		if (filldir(dirent, buf, len, pos, ino, DT_DIR) < 0) {
+			/* returning this tid failed, save it as the first
+			 * tid for the next readdir call */
+			filp->f_version = tid;
+			put_task_struct(task);
 			break;
-		pos++;
+		}
 	}
 out:
 	filp->f_pos = pos;
-- 
1.2.2.g709a



* [PATCH 16/23] proc: Don't lock task_structs indefinitely.
  2006-02-23 16:20                             ` [PATCH 15/23] proc: refactor reading directories of tasks Eric W. Biederman
@ 2006-02-23 16:23                               ` Eric W. Biederman
  2006-02-23 16:24                                 ` [PATCH 17/23] proc: Give the root directory a task Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Every inode in /proc holds a reference to a struct task_struct.
If a directory or file is opened and remains open after the
task exits this pinning continues.  With 8K stacks on a
32bit machine the amount pinned per file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about
100 processes.  With 80 processes, with 1000 file descriptors each,
I can trigger the OOM killer on a 32bit kernel, because I have
pinned about 800MB of useless data.

This patch replaces the struct task_struct pointer with a pointer
to a struct task_ref which has a struct task_struct pointer.  The
difference is that the task_ref is updated when a task is removed,
so the pinning of dead tasks does not happen.

The code now has to contend with the fact that the task may
exit at any time, which is a little, but not much, more complicated.
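
The calling convention this leads to can be sketched in user space as
below.  It is a single-threaded model under stated assumptions: the
demo_* names are hypothetical, a plain int stands in for the atomic
reference count, and the locking that makes the real lookup safe is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct demo_task {
	int refs;	/* plain counter; the kernel uses atomics */
	int pid;
};

/* Small object that an inode keeps instead of pinning the task. */
struct demo_tref {
	struct demo_task *task;	/* cleared when the task exits */
};

/* Take a temporary reference, or return NULL if the task has exited. */
static struct demo_task *demo_get_task(struct demo_tref *ref)
{
	struct demo_task *task = ref->task;

	if (task)
		task->refs++;
	return task;
}

/* Drop a reference taken by demo_get_task. */
static void demo_put_task(struct demo_task *task)
{
	assert(task->refs > 0);
	task->refs--;
}

/* Called at exit: long-lived holders of ref no longer reach the task. */
static void demo_task_exit(struct demo_tref *ref)
{
	ref->task = NULL;
}

/* Typical caller: use the task briefly, fail with -ESRCH if it is gone. */
static int demo_read_pid(struct demo_tref *ref)
{
	struct demo_task *task = demo_get_task(ref);
	int pid;

	if (!task)
		return -ESRCH;
	pid = task->pid;
	demo_put_task(task);
	return pid;
}
```

Each access takes a reference only for as long as it needs the task, so
an open file descriptor by itself keeps just the small reference object
alive.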

With this change it takes about 600 processes each opening up
1000 file descriptors before I can trigger the OOM killer.  Much
better.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c          |  268 ++++++++++++++++++++++++++++++++---------------
 fs/proc/inode.c         |    9 +-
 fs/proc/internal.h      |   15 ++-
 fs/proc/task_mmu.c      |   63 +++++++----
 include/linux/proc_fs.h |    8 +
 mm/mempolicy.c          |    6 +
 6 files changed, 252 insertions(+), 117 deletions(-)

eefe36bb6eeb5d815f702a3e1ecd3b026cd2d9d7
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f507887..86aa5c5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -300,11 +300,15 @@ static struct pid_entry tid_attr_stuff[]
 
 static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
 {
-	struct task_struct *task = proc_task(inode);
-	struct files_struct *files;
+	struct task_struct *task = get_proc_task(inode);
+	struct files_struct *files = NULL;
 	struct file *file;
 	int fd = proc_fd(inode);
 
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
 	files = get_files_struct(task);
 	if (files) {
 		rcu_read_lock();
@@ -335,8 +339,14 @@ static struct fs_struct *get_fs_struct(s
 
 static int proc_cwd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
 {
-	struct fs_struct *fs = get_fs_struct(proc_task(inode));
+	struct task_struct *task = get_proc_task(inode);
+	struct fs_struct *fs = NULL;
 	int result = -ENOENT;
+
+	if (task) {
+		fs = get_fs_struct(task);
+		put_task_struct(task);
+	}
 	if (fs) {
 		read_lock(&fs->lock);
 		*mnt = mntget(fs->pwdmnt);
@@ -350,8 +360,14 @@ static int proc_cwd_link(struct inode *i
 
 static int proc_root_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
 {
-	struct fs_struct *fs = get_fs_struct(proc_task(inode));
+	struct task_struct *task = get_proc_task(inode);
+	struct fs_struct *fs = NULL;
 	int result = -ENOENT;
+
+	if (task) {
+		fs = get_fs_struct(task);
+		put_task_struct(task);
+	}
 	if (fs) {
 		read_lock(&fs->lock);
 		*mnt = mntget(fs->rootmnt);
@@ -500,16 +516,19 @@ struct proc_mounts {
 
 static int mounts_open(struct inode *inode, struct file *file)
 {
-	struct task_struct *task = proc_task(inode);
-	struct namespace *namespace;
+	struct task_struct *task = get_proc_task(inode);
+	struct namespace *namespace = NULL;
 	struct proc_mounts *p;
 	int ret = -EINVAL;
 
-	task_lock(task);
-	namespace = task->namespace;
-	if (namespace)
-		get_namespace(namespace);
-	task_unlock(task);
+	if (task) {
+		task_lock(task);
+		namespace = task->namespace;
+		if (namespace)
+			get_namespace(namespace);
+		task_unlock(task);
+		put_task_struct(task);
+	}
 
 	if (namespace) {
 		ret = -ENOMEM;
@@ -571,18 +590,27 @@ static ssize_t proc_info_read(struct fil
 	struct inode * inode = file->f_dentry->d_inode;
 	unsigned long page;
 	ssize_t length;
-	struct task_struct *task = proc_task(inode);
+	struct task_struct *task = get_proc_task(inode);
+
+	length = -ESRCH;
+	if (!task)
+		goto out_no_task;
 
 	if (count > PROC_BLOCK_SIZE)
 		count = PROC_BLOCK_SIZE;
-	if (!(page = __get_free_page(GFP_KERNEL)))
-		return -ENOMEM;
 
+	length = -ENOMEM;
+	if (!(page = __get_free_page(GFP_KERNEL)))
+		goto out;
+	
 	length = PROC_I(inode)->op.proc_read(task, (char*)page);
-
+	
 	if (length >= 0)
 		length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
 	free_page(page);
+out:
+	put_task_struct(task);
+out_no_task:
 	return length;
 }
 
@@ -599,7 +627,7 @@ static int mem_open(struct inode* inode,
 static ssize_t mem_read(struct file * file, char __user * buf,
 			size_t count, loff_t *ppos)
 {
-	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
 	char *page;
 	unsigned long src = *ppos;
 	int ret = -ESRCH;
@@ -666,15 +694,20 @@ static ssize_t mem_write(struct file * f
 {
 	int copied = 0;
 	char *page;
-	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
 	unsigned long dst = *ppos;
 
+	copied = -ESRCH;
+	if (!task)
+		goto out_no_task;
+
 	if (!MAY_PTRACE(task) || !ptrace_may_attach(task))
-		return -ESRCH;
+		goto out;
 
+	copied = -ENOMEM;
 	page = (char *)__get_free_page(GFP_USER);
 	if (!page)
-		return -ENOMEM;
+		goto out;
 
 	while (count > 0) {
 		int this_len, retval;
@@ -697,6 +730,9 @@ static ssize_t mem_write(struct file * f
 	}
 	*ppos = dst;
 	free_page((unsigned long) page);
+out:
+	put_task_struct(task);
+out_no_task:
 	return copied;
 }
 #endif
@@ -727,12 +763,17 @@ static struct file_operations proc_mem_o
 static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
-	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
 	char buffer[PROC_NUMBUF];
 	size_t len;
-	int oom_adjust = task->oomkilladj;
+	int oom_adjust;
 	loff_t __ppos = *ppos;
 
+	if (!task)
+		return -ESRCH;
+	oom_adjust = task->oomkilladj;
+	put_task_struct(task);
+
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
 	if (__ppos >= len)
 		return 0;
@@ -747,7 +788,7 @@ static ssize_t oom_adjust_read(struct fi
 static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
-	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	struct task_struct *task;
 	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
 
@@ -763,7 +804,11 @@ static ssize_t oom_adjust_write(struct f
 		return -EINVAL;
 	if (*end == '\n')
 		end++;
+	task = get_proc_task(file->f_dentry->d_inode);
+	if (!task)
+		return -ESRCH;
 	task->oomkilladj = oom_adjust;
+	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
 	return end - buffer;
@@ -780,12 +825,15 @@ static ssize_t proc_loginuid_read(struct
 				  size_t count, loff_t *ppos)
 {
 	struct inode * inode = file->f_dentry->d_inode;
-	struct task_struct *task = proc_task(inode);
+	struct task_struct *task = get_proc_task(inode);
 	ssize_t length;
 	char tmpbuf[TMPBUFLEN];
 
+	if (!task)
+		return -ESRCH;
 	length = scnprintf(tmpbuf, TMPBUFLEN, "%u",
 				audit_get_loginuid(task->audit_context));
+	put_task_struct(task);
 	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
 }
 
@@ -795,13 +843,12 @@ static ssize_t proc_loginuid_write(struc
 	struct inode * inode = file->f_dentry->d_inode;
 	char *page, *tmp;
 	ssize_t length;
-	struct task_struct *task = proc_task(inode);
 	uid_t loginuid;
 
 	if (!capable(CAP_AUDIT_CONTROL))
 		return -EPERM;
 
-	if (current != task)
+	if (current != proc_tref(inode)->task)
 		return -EPERM;
 
 	if (count > PAGE_SIZE)
@@ -824,7 +871,7 @@ static ssize_t proc_loginuid_write(struc
 		goto out_free_page;
 
 	}
-	length = audit_set_loginuid(task, loginuid);
+	length = audit_set_loginuid(current, loginuid);
 	if (likely(length == 0))
 		length = count;
 
@@ -843,13 +890,16 @@ static struct file_operations proc_login
 static ssize_t seccomp_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-	struct task_struct *tsk = proc_task(file->f_dentry->d_inode);
+	struct task_struct *tsk = get_proc_task(file->f_dentry->d_inode);
 	char __buf[20];
 	loff_t __ppos = *ppos;
 	size_t len;
 
+	if (!tsk)
+		return -ESRCH;
 	/* no need to print the trailing zero, so use only len */
 	len = sprintf(__buf, "%u\n", tsk->seccomp.mode);
+	put_task_struct(tsk);
 	if (__ppos >= len)
 		return 0;
 	if (count > len - __ppos)
@@ -863,13 +913,19 @@ static ssize_t seccomp_read(struct file 
 static ssize_t seccomp_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *tsk = proc_task(file->f_dentry->d_inode);
+	struct task_struct *tsk = get_proc_task(file->f_dentry->d_inode);
 	char __buf[20], *end;
 	unsigned int seccomp_mode;
+	ssize_t result;
+
+	result = -ESRCH;
+	if (!tsk)
+		goto out_no_task;
 
 	/* can set it only once to be even more secure */
+	result = -EPERM;
 	if (unlikely(tsk->seccomp.mode))
-		return -EPERM;
+		goto out;
 
 	memset(__buf, 0, sizeof(__buf));
 	count = min(count, sizeof(__buf) - 1);
@@ -878,14 +934,20 @@ static ssize_t seccomp_write(struct file
 	seccomp_mode = simple_strtoul(__buf, &end, 0);
 	if (*end == '\n')
 		end++;
+	result = -EINVAL;
 	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
 		tsk->seccomp.mode = seccomp_mode;
 		set_tsk_thread_flag(tsk, TIF_SECCOMP);
 	} else
-		return -EINVAL;
+		goto out;
+	result = -EIO;
 	if (unlikely(!(end - __buf)))
-		return -EIO;
-	return end - __buf;
+		goto out;
+	result = end - __buf;
+out:
+	put_task_struct(tsk);
+out_no_task:
+	return result;
 }
 
 static struct file_operations proc_seccomp_operations = {
@@ -914,7 +976,7 @@ static int proc_check_dentry_visible(str
 	/* See if the the two tasks share a commone set of 
 	 * file descriptors.  If so everything is visible.
 	 */
-	task = proc_task(inode);
+	task = get_proc_task(inode);
 	if (!task)
 		goto out;
 	files = get_files_struct(current);
@@ -925,6 +987,7 @@ static int proc_check_dentry_visible(str
 		put_files_struct(task_files);
 	if (files)
 		put_files_struct(files);
+	put_task_struct(task);
 	if (!error)
 		goto out;
 
@@ -1044,7 +1107,7 @@ static int proc_readfd(struct file * fil
 {
 	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *p = proc_task(inode);
+	struct task_struct *p = get_proc_task(inode);
 	unsigned int fd, tid, ino;
 	int retval;
 	char buf[PROC_NUMBUF];
@@ -1052,8 +1115,8 @@ static int proc_readfd(struct file * fil
 	struct fdtable *fdt;
 
 	retval = -ENOENT;
-	if (!pid_alive(p))
-		goto out;
+	if (!p)
+		goto out_no_task;
 	retval = 0;
 	tid = p->pid;
 
@@ -1102,6 +1165,8 @@ static int proc_readfd(struct file * fil
 			put_files_struct(files);
 	}
 out:
+	put_task_struct(p);
+out_no_task:
 	return retval;
 }
 
@@ -1113,16 +1178,18 @@ static int proc_pident_readdir(struct fi
 	int pid;
 	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
+	struct task_struct *task = get_proc_task(inode);
 	struct pid_entry *p;
 	ino_t ino;
 	int ret;
 
 	ret = -ENOENT;
-	if (!pid_alive(proc_task(inode)))
+	if (!task)
 		goto out;
 
 	ret = 0;
-	pid = proc_task(inode)->pid;
+	pid = task->pid;
+	put_task_struct(task);
 	i = filp->f_pos;
 	switch (i) {
 	case 0:
@@ -1208,14 +1275,13 @@ static struct inode *proc_pid_make_inode
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	inode->i_ino = fake_ino(task->pid, ino);
 
-	if (!pid_alive(task))
-		goto out_unlock;
-
 	/*
 	 * grab the reference to task.
 	 */
-	get_task_struct(task);
-	ei->task = task;
+	tref_set(&ei->tref, tref_get_by_task(task, PIDTYPE_PID));
+	if (!ei->tref->task)
+		goto out_unlock;
+
 	inode->i_uid = 0;
 	inode->i_gid = 0;
 	if (task_dumpable(task)) {
@@ -1245,8 +1311,8 @@ out_unlock:
 static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *task = proc_task(inode);
-	if (pid_alive(task)) {
+	struct task_struct *task = get_proc_task(inode);
+	if (task) {
 		if (task_dumpable(task)) {
 			inode->i_uid = task->euid;
 			inode->i_gid = task->egid;
@@ -1255,6 +1321,7 @@ static int pid_revalidate(struct dentry 
 			inode->i_gid = 0;
 		}
 		security_task_to_inode(task, inode);
+		put_task_struct(task);
 		return 1;
 	}
 	d_drop(dentry);
@@ -1264,28 +1331,31 @@ static int pid_revalidate(struct dentry 
 static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *task = proc_task(inode);
+	struct task_struct *task = get_proc_task(inode);
 	int fd = proc_fd(inode);
 	struct files_struct *files;
 
-	files = get_files_struct(task);
-	if (files) {
-		rcu_read_lock();
-		if (fcheck_files(files, fd)) {
+	if (task) {
+		files = get_files_struct(task);
+		if (files) {
+			rcu_read_lock();
+			if (fcheck_files(files, fd)) {
+				rcu_read_unlock();
+				put_files_struct(files);
+				if (task_dumpable(task)) {
+					inode->i_uid = task->euid;
+					inode->i_gid = task->egid;
+				} else {
+					inode->i_uid = 0;
+					inode->i_gid = 0;
+				}
+				security_task_to_inode(task, inode);
+				return 1;
+			}
 			rcu_read_unlock();
 			put_files_struct(files);
-			if (task_dumpable(task)) {
-				inode->i_uid = task->euid;
-				inode->i_gid = task->egid;
-			} else {
-				inode->i_uid = 0;
-				inode->i_gid = 0;
-			}
-			security_task_to_inode(task, inode);
-			return 1;
 		}
-		rcu_read_unlock();
-		put_files_struct(files);
+		put_task_struct(task);
 	}
 	d_drop(dentry);
 	return 0;
@@ -1297,7 +1367,7 @@ static int pid_delete_dentry(struct dent
 	 * If so, then don't put the dentry on the lru list,
 	 * kill it immediately.
 	 */
-	return !pid_alive(proc_task(dentry->d_inode));
+	return !proc_tref(dentry->d_inode)->task;
 }
 
 static struct dentry_operations tid_fd_dentry_operations =
@@ -1339,7 +1409,7 @@ out:
 /* SMP-safe */
 static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
 {
-	struct task_struct *task = proc_task(dir);
+	struct task_struct *task = get_proc_task(dir);
 	unsigned fd = name_to_int(dentry);
 	struct dentry *result = ERR_PTR(-ENOENT);
 	struct file * file;
@@ -1347,10 +1417,10 @@ static struct dentry *proc_lookupfd(stru
 	struct inode *inode;
 	struct proc_inode *ei;
 
+	if (!task)
+		goto out_no_task;
 	if (fd == ~0U)
 		goto out;
-	if (!pid_alive(task))
-		goto out;
 
 	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_FD_DIR+fd);
 	if (!inode)
@@ -1380,6 +1450,8 @@ static struct dentry *proc_lookupfd(stru
 	if (tid_fd_revalidate(dentry, NULL))
 		result = NULL;
 out:
+	put_task_struct(task);
+out_no_task:
 	return result;
 
 out_unlock2:
@@ -1423,12 +1495,17 @@ static ssize_t proc_pid_attr_read(struct
 	struct inode * inode = file->f_dentry->d_inode;
 	unsigned long page;
 	ssize_t length;
-	struct task_struct *task = proc_task(inode);
+	struct task_struct *task = get_proc_task(inode);
+
+	length = -ESRCH;
+	if (!task)
+		goto out_no_task;
 
 	if (count > PAGE_SIZE)
 		count = PAGE_SIZE;
+	length = -ENOMEM;
 	if (!(page = __get_free_page(GFP_KERNEL)))
-		return -ENOMEM;
+		goto out;
 
 	length = security_getprocattr(task, 
 				      (char*)file->f_dentry->d_name.name, 
@@ -1436,6 +1513,9 @@ static ssize_t proc_pid_attr_read(struct
 	if (length >= 0)
 		length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
 	free_page(page);
+out:
+	put_task_struct(task);
+out_no_task:
 	return length;
 }
 
@@ -1445,26 +1525,36 @@ static ssize_t proc_pid_attr_write(struc
 	struct inode * inode = file->f_dentry->d_inode;
 	char *page; 
 	ssize_t length; 
-	struct task_struct *task = proc_task(inode); 
+	struct task_struct *task = get_proc_task(inode); 
 
+	length = -ESRCH;
+	if (!task)
+		goto out_no_task;
 	if (count > PAGE_SIZE) 
 		count = PAGE_SIZE; 
-	if (*ppos != 0) {
-		/* No partial writes. */
-		return -EINVAL;
-	}
+
+	/* No partial writes. */
+	length = -EINVAL;
+	if (*ppos != 0)
+		goto out;
+
+	length = -ENOMEM;
 	page = (char*)__get_free_page(GFP_USER); 
 	if (!page) 
-		return -ENOMEM;
+		goto out;
+
 	length = -EFAULT; 
 	if (copy_from_user(page, buf, count)) 
-		goto out;
+		goto out_free;
 
 	length = security_setprocattr(task, 
 				      (char*)file->f_dentry->d_name.name, 
 				      (void*)page, count);
-out:
+out_free:
 	free_page((unsigned long) page);
+out:
+	put_task_struct(task);
+out_no_task:
 	return length;
 } 
 
@@ -1486,15 +1576,15 @@ static struct dentry *proc_pident_lookup
 {
 	struct inode *inode;
 	struct dentry *error;
-	struct task_struct *task = proc_task(dir);
+	struct task_struct *task = get_proc_task(dir);
 	struct pid_entry *p;
 	struct proc_inode *ei;
 
 	error = ERR_PTR(-ENOENT);
 	inode = NULL;
 
-	if (!pid_alive(task))
-		goto out;
+	if (!task)
+		goto out_no_task;
 
 	for (p = ents; p->name; p++) {
 		if (p->len != dentry->d_name.len)
@@ -1675,6 +1765,8 @@ static struct dentry *proc_pident_lookup
 	if (pid_revalidate(dentry, NULL))
 		error = NULL;
 out:
+	put_task_struct(task);
+out_no_task:
 	return error;
 }
 
@@ -1916,10 +2008,13 @@ static struct dentry *proc_task_lookup(s
 {
 	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
-	struct task_struct *leader = proc_task(dir);
+	struct task_struct *leader = get_proc_task(dir);
 	struct inode *inode;
 	unsigned tid;
 
+	if (!leader)
+		goto out_no_task;
+
 	tid = name_to_int(dentry);
 	if (tid == ~0U)
 		goto out;
@@ -1959,6 +2054,8 @@ static struct dentry *proc_task_lookup(s
 out_drop_task:
 	put_task_struct(task);
 out:
+	put_task_struct(leader);
+out_no_task:
 	return result;
 }
 
@@ -2161,15 +2258,15 @@ static int proc_task_readdir(struct file
 	char buf[PROC_NUMBUF];
 	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *leader = proc_task(inode);
+	struct task_struct *leader = get_proc_task(inode);
 	struct task_struct *task;
 	int retval = -ENOENT;
 	ino_t ino;
 	int tid;
 	unsigned long pos = filp->f_pos;  /* avoiding "long long" filp->f_pos */
 
-	if (!pid_alive(leader))
-		goto out;
+	if (!leader)
+		goto out_no_task;
 	retval = 0;
 
 	switch (pos) {
@@ -2209,20 +2306,23 @@ static int proc_task_readdir(struct file
 	}
 out:
 	filp->f_pos = pos;
+	put_task_struct(leader);
+out_no_task:
 	return retval;
 }
 
 static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *p = proc_task(inode);
+	struct task_struct *p = get_proc_task(inode);
 	generic_fillattr(inode, stat);
 
-	if (pid_alive(p)) {
+	if (p) {
 		task_lock(p);
 		if (p->signal)
 			stat->nlink += atomic_read(&p->signal->count);
 		task_unlock(p);
+		put_task_struct(p);
 	}
 		
 	return 0;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8f532d7..95c3cc5 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -58,14 +58,11 @@ static void de_put(struct proc_dir_entry
 static void proc_delete_inode(struct inode *inode)
 {
 	struct proc_dir_entry *de;
-	struct task_struct *tsk;
 
 	truncate_inode_pages(&inode->i_data, 0);
 
-	/* Let go of any associated process */
-	tsk = PROC_I(inode)->task;
-	if (tsk)
-		put_task_struct(tsk);
+	/* Stop tracking associated processes */
+	tref_fini(&PROC_I(inode)->tref);
 
 	/* Let go of any associated proc directory entry */
 	de = PROC_I(inode)->pde;
@@ -94,7 +91,7 @@ static struct inode *proc_alloc_inode(st
 	ei = (struct proc_inode *)kmem_cache_alloc(proc_inode_cachep, SLAB_KERNEL);
 	if (!ei)
 		return NULL;
-	ei->task = NULL;
+	tref_init(&ei->tref);
 	ei->fd = 0;
 	ei->op.proc_get_link = NULL;
 	ei->pde = NULL;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ac95bfc..73b6384 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -10,6 +10,7 @@
  */
 
 #include <linux/proc_fs.h>
+#include <linux/task_ref.h>
 
 struct vmalloc_info {
 	unsigned long	used;
@@ -41,13 +42,23 @@ extern struct file_operations proc_maps_
 extern struct file_operations proc_numa_maps_operations;
 extern struct file_operations proc_smaps_operations;
 
+extern struct file_operations proc_maps_operations;
+extern struct file_operations proc_numa_maps_operations;
+extern struct file_operations proc_smaps_operations;
+
+
 void free_proc_entry(struct proc_dir_entry *de);
 
 int proc_init_inodecache(void);
 
-static inline struct task_struct *proc_task(struct inode *inode)
+static inline struct task_ref *proc_tref(struct inode *inode)
+{
+	return PROC_I(inode)->tref;
+}
+
+static inline struct task_struct *get_proc_task(struct inode *inode)
 {
-	return PROC_I(inode)->task;
+	return get_tref_task(proc_tref(inode));
 }
 
 static inline int proc_fd(struct inode *inode)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 56cd932..4772543 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -75,9 +75,13 @@ int proc_exe_link(struct inode *inode, s
 {
 	struct vm_area_struct * vma;
 	int result = -ENOENT;
-	struct task_struct *task = proc_task(inode);
-	struct mm_struct * mm = get_task_mm(task);
+	struct task_struct *task = get_proc_task(inode);
+	struct mm_struct * mm = NULL;
 
+	if (task) {
+		mm = get_task_mm(task);
+		put_task_struct(task);
+	}
 	if (!mm)
 		goto out;
 	down_read(&mm->mmap_sem);
@@ -296,12 +300,16 @@ static int show_smap(struct seq_file *m,
 
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
-	struct task_struct *task = m->private;
+	struct proc_maps_private *priv = m->private; 
 	unsigned long last_addr = m->version;
 	struct mm_struct *mm;
-	struct vm_area_struct *vma, *tail_vma;
+	struct vm_area_struct *vma;
 	loff_t l = *pos;
 
+	/* Clear the per syscall fields in priv */
+	priv->task = NULL;
+	priv->tail_vma = NULL;
+
 	/*
 	 * We remember last_addr rather than next_addr to hit with
 	 * mmap_cache most of the time. We have zero last_addr at
@@ -312,11 +320,15 @@ static void *m_start(struct seq_file *m,
 	if (last_addr == -1UL)
 		return NULL;
 
-	mm = get_task_mm(task);
+	priv->task = get_tref_task(priv->tref);
+	if (!priv->task)
+		return NULL;
+
+	mm = get_task_mm(priv->task);
 	if (!mm)
 		return NULL;
 
-	tail_vma = get_gate_vma(task);
+	priv->tail_vma = get_gate_vma(priv->task);
 	down_read(&mm->mmap_sem);
 
 	/* Start with last addr hint */
@@ -338,35 +350,37 @@ static void *m_start(struct seq_file *m,
 	}
 
 	if (l != mm->map_count)
-		tail_vma = NULL; /* After gate vma */
+		priv->tail_vma = NULL; /* After gate vma */
 
 out:
 	if (vma)
 		return vma;
 
 	/* End of vmas has been reached */
-	m->version = (tail_vma != NULL)? 0: -1UL;
+	m->version = (priv->tail_vma != NULL)? 0: -1UL;
 	up_read(&mm->mmap_sem);
 	mmput(mm);
-	return tail_vma;
+	return priv->tail_vma;
 }
 
 static void m_stop(struct seq_file *m, void *v)
 {
-	struct task_struct *task = m->private;
+	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
-	if (vma && vma != get_gate_vma(task)) {
+	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
+	if (priv->task)
+		put_task_struct(priv->task);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 {
-	struct task_struct *task = m->private;
+	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
-	struct vm_area_struct *tail_vma = get_gate_vma(task);
+	struct vm_area_struct *tail_vma = priv->tail_vma;
 
 	(*pos)++;
 	if (vma && (vma != tail_vma) && vma->vm_next)
@@ -392,11 +406,18 @@ static struct seq_operations proc_pid_sm
 static int do_maps_open(struct inode *inode, struct file *file, 
 			struct seq_operations *ops)
 {
-	struct task_struct *task = proc_task(inode);
-	int ret = seq_open(file, ops);
-	if (!ret) {
-		struct seq_file *m = file->private_data;
-		m->private = task;
+	struct proc_maps_private *priv;
+	int ret = -ENOMEM;
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (priv) {
+		priv->tref = proc_tref(inode);
+		ret = seq_open(file, ops);
+		if (!ret) {
+			struct seq_file *m = file->private_data;
+			m->private = priv;
+		} else {
+			kfree(priv);
+		}
 	}
 	return ret;
 }
@@ -410,7 +431,7 @@ struct file_operations proc_maps_operati
 	.open		= maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release,
+	.release	= seq_release_private,
 };
 
 #ifdef CONFIG_NUMA
@@ -432,7 +453,7 @@ struct file_operations proc_numa_maps_op
 	.open		= numa_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release,
+	.release	= seq_release_private,
 };
 #endif
 
@@ -445,5 +466,5 @@ struct file_operations proc_smaps_operat
 	.open		= smaps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release,
+	.release	= seq_release_private,
 };
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 302c24e..f6b491f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -244,7 +244,7 @@ extern void kclist_add(struct kcore_list
 #endif
 
 struct proc_inode {
-	struct task_struct *task;
+	struct task_ref *tref;
 	int fd;
 	union {
 		int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
@@ -264,4 +264,10 @@ static inline struct proc_dir_entry *PDE
 	return PROC_I(inode)->pde;
 }
 
+struct proc_maps_private {
+	struct task_ref *tref;
+	struct task_struct *task;
+	struct vm_area_struct *tail_vma;
+};
+
 #endif /* _LINUX_PROC_FS_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 880831b..3eed61c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1765,7 +1765,7 @@ static void gather_stats(struct page *pa
 
 int show_numa_map(struct seq_file *m, void *v)
 {
-	struct task_struct *task = m->private;
+	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
 	struct numa_maps *md;
 	int n;
@@ -1783,7 +1783,7 @@ int show_numa_map(struct seq_file *m, vo
 
 	if (md->pages) {
 		mpol_to_str(buffer, sizeof(buffer),
-			    get_vma_policy(task, vma, vma->vm_start));
+			    get_vma_policy(priv->task, vma, vma->vm_start));
 
 		seq_printf(m, "%08lx %s pages=%lu mapped=%lu maxref=%lu",
 			   vma->vm_start, buffer, md->pages,
@@ -1801,7 +1801,7 @@ int show_numa_map(struct seq_file *m, vo
 	kfree(md);
 
 	if (m->count < m->size)
-		m->version = (vma != get_gate_vma(task)) ? vma->vm_start : 0;
+		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
 
-- 
1.2.2.g709a


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 17/23] proc: Give the root directory a task.
  2006-02-23 16:23                               ` [PATCH 16/23] proc: Don't lock task_structs indefinitely Eric W. Biederman
@ 2006-02-23 16:24                                 ` Eric W. Biederman
  2006-02-23 16:25                                   ` [PATCH 18/23] proc: Reorder the functions in base.c Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Helper functions in base.c like proc_pident_readdir and proc_pident_lookup
assume the directories have an associated task, and cannot currently be used on
the /proc root directory because it does not have such a task.

This small change allows base.c to be simplified, and later, when multiple
pid spaces are introduced, it makes getting the needed context information trivial.
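
As a rough illustration of the assumption described above, here is a minimal
user-space sketch. Every toy_* name is a hypothetical stand-in invented for
this example, not the kernel's task_ref API; the helper only mimics the
shape of the base.c helpers, which fail when the directory has no task.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's definitions. */
struct toy_task {
	int pid;
	int usage;		/* reference count */
};

struct toy_tref {
	struct toy_task *task;	/* NULL when no task is attached */
};

struct toy_inode {
	struct toy_tref *tref;
};

/* Take a counted reference on the task behind a tref, or return NULL. */
static struct toy_task *toy_get_tref_task(struct toy_tref *tref)
{
	struct toy_task *task = tref->task;

	if (task)
		task->usage++;
	return task;
}

/* Drop a reference taken by toy_get_tref_task(). */
static void toy_put_task(struct toy_task *task)
{
	task->usage--;
}

/* Shaped like a base.c helper: bail out when the directory has no task. */
static int toy_pident_readdir(struct toy_inode *dir)
{
	struct toy_task *task = toy_get_tref_task(dir->tref);
	int pid;

	if (!task)
		return -ENOENT;
	pid = task->pid;
	toy_put_task(task);
	return pid;
}
```

In this sketch a root inode whose tref has no task makes the helper return
-ENOENT; once the tref is seeded with a task, the same helper works on the
root without any special casing.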

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/root.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

516b38bf721663deded53591272bdfc8f032f885
diff --git a/fs/proc/root.c b/fs/proc/root.c
index c3fd361..a3ceff3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -17,6 +17,7 @@
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/smp_lock.h>
+#include <linux/mount.h>
 
 #include "internal.h"
 
@@ -29,6 +30,18 @@ struct proc_dir_entry *proc_sys_root;
 static struct super_block *proc_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
+	if (proc_mnt) {
+		/* Seed the root directory with a task so it doesn't need
+		 * to be special in base.c.  I would do this earlier but
+		 * the only task alive when /proc is mounted the first time
+		 * is the init_task and it is never considered alive.
+		 */
+		struct proc_inode *ei;
+		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
+		if (!ei->tref->task)
+			tref_set(&ei->tref,
+				tref_get_by_pid(1, PIDTYPE_PID));
+	}
 	return get_sb_single(fs_type, flags, data, proc_fill_super);
 }
 
-- 
1.2.2.g709a


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 18/23] proc: Reorder the functions in base.c
  2006-02-23 16:24                                 ` [PATCH 17/23] proc: Give the root directory a task Eric W. Biederman
@ 2006-02-23 16:25                                   ` Eric W. Biederman
  2006-02-23 16:27                                     ` [PATCH 19/23] proc: Modify proc_pident_lookup to be completely table driven Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Group the functions by what they implement instead of
by type of operation.  As it existed, base.c was quickly approaching
the point where it could not be followed.

No functionality or code changes, aside from adding/removing
forward declarations, are implemented in this patch.
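
For readers unfamiliar with forward declaring a static table in C, the
following is a minimal sketch of the idiom. The toy_* names are hypothetical
and exist only for this example; it assumes plain C, where a tentative
definition of a static object may precede its initialized definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, standing in for a kernel ops struct. */
struct toy_ops {
	int (*lookup)(int);
};

/*
 * Tentative definition: code placed before the real initializer
 * can already take the table's address.
 */
static struct toy_ops toy_task_ops;

/* Early user of the table, defined before the table's contents. */
static struct toy_ops *toy_pick_ops(void)
{
	return &toy_task_ops;
}

static int toy_task_lookup(int tid)
{
	return tid + 1;
}

/* The actual definition, grouped with the functions it refers to. */
static struct toy_ops toy_task_ops = {
	.lookup = toy_task_lookup,
};
```

This lets each table sit next to the functions that implement it, while
earlier code can still refer to it by name.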

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c | 1062 ++++++++++++++++++++++++++++----------------------------
 1 files changed, 533 insertions(+), 529 deletions(-)

1f2a8bf86e95c6c85808dfc18d75b1412290bef1
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 86aa5c5..71edbad 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -193,139 +193,6 @@ struct pid_entry {
 
 #define E(type,name,mode) {(type),sizeof(name)-1,(name),(mode)}
 
-static struct pid_entry tgid_base_stuff[] = {
-	E(PROC_TGID_TASK,      "task",    S_IFDIR|S_IRUGO|S_IXUGO),
-	E(PROC_TGID_FD,        "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
-	E(PROC_TGID_ENVIRON,   "environ", S_IFREG|S_IRUSR),
-	E(PROC_TGID_AUXV,      "auxv",	  S_IFREG|S_IRUSR),
-	E(PROC_TGID_STATUS,    "status",  S_IFREG|S_IRUGO),
-	E(PROC_TGID_CMDLINE,   "cmdline", S_IFREG|S_IRUGO),
-	E(PROC_TGID_STAT,      "stat",    S_IFREG|S_IRUGO),
-	E(PROC_TGID_STATM,     "statm",   S_IFREG|S_IRUGO),
-	E(PROC_TGID_MAPS,      "maps",    S_IFREG|S_IRUGO),
-#ifdef CONFIG_NUMA
-	E(PROC_TGID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
-#endif
-	E(PROC_TGID_MEM,       "mem",     S_IFREG|S_IRUSR|S_IWUSR),
-#ifdef CONFIG_SECCOMP
-	E(PROC_TGID_SECCOMP,   "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
-#endif
-	E(PROC_TGID_CWD,       "cwd",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_ROOT,      "root",    S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_EXE,       "exe",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_MOUNTS,    "mounts",  S_IFREG|S_IRUGO),
-#ifdef CONFIG_MMU
-	E(PROC_TGID_SMAPS,     "smaps",   S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_SECURITY
-	E(PROC_TGID_ATTR,      "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
-#endif
-#ifdef CONFIG_KALLSYMS
-	E(PROC_TGID_WCHAN,     "wchan",   S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_SCHEDSTATS
-	E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_CPUSETS
-	E(PROC_TGID_CPUSET,    "cpuset",  S_IFREG|S_IRUGO),
-#endif
-	E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
-	E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
-#ifdef CONFIG_AUDITSYSCALL
-	E(PROC_TGID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
-#endif
-	{0,0,NULL,0}
-};
-static struct pid_entry tid_base_stuff[] = {
-	E(PROC_TID_FD,         "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
-	E(PROC_TID_ENVIRON,    "environ", S_IFREG|S_IRUSR),
-	E(PROC_TID_AUXV,       "auxv",	  S_IFREG|S_IRUSR),
-	E(PROC_TID_STATUS,     "status",  S_IFREG|S_IRUGO),
-	E(PROC_TID_CMDLINE,    "cmdline", S_IFREG|S_IRUGO),
-	E(PROC_TID_STAT,       "stat",    S_IFREG|S_IRUGO),
-	E(PROC_TID_STATM,      "statm",   S_IFREG|S_IRUGO),
-	E(PROC_TID_MAPS,       "maps",    S_IFREG|S_IRUGO),
-#ifdef CONFIG_NUMA
-	E(PROC_TID_NUMA_MAPS,  "numa_maps",    S_IFREG|S_IRUGO),
-#endif
-	E(PROC_TID_MEM,        "mem",     S_IFREG|S_IRUSR|S_IWUSR),
-#ifdef CONFIG_SECCOMP
-	E(PROC_TID_SECCOMP,    "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
-#endif
-	E(PROC_TID_CWD,        "cwd",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_ROOT,       "root",    S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_EXE,        "exe",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_MOUNTS,     "mounts",  S_IFREG|S_IRUGO),
-#ifdef CONFIG_MMU
-	E(PROC_TID_SMAPS,      "smaps",   S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_SECURITY
-	E(PROC_TID_ATTR,       "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
-#endif
-#ifdef CONFIG_KALLSYMS
-	E(PROC_TID_WCHAN,      "wchan",   S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_SCHEDSTATS
-	E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
-#endif
-#ifdef CONFIG_CPUSETS
-	E(PROC_TID_CPUSET,     "cpuset",  S_IFREG|S_IRUGO),
-#endif
-	E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
-	E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
-#ifdef CONFIG_AUDITSYSCALL
-	E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
-#endif
-	{0,0,NULL,0}
-};
-
-#ifdef CONFIG_SECURITY
-static struct pid_entry tgid_attr_stuff[] = {
-	E(PROC_TGID_ATTR_CURRENT,  "current",  S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TGID_ATTR_PREV,     "prev",     S_IFREG|S_IRUGO),
-	E(PROC_TGID_ATTR_EXEC,     "exec",     S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TGID_ATTR_FSCREATE, "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
-	{0,0,NULL,0}
-};
-static struct pid_entry tid_attr_stuff[] = {
-	E(PROC_TID_ATTR_CURRENT,   "current",  S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TID_ATTR_PREV,      "prev",     S_IFREG|S_IRUGO),
-	E(PROC_TID_ATTR_EXEC,      "exec",     S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TID_ATTR_FSCREATE,  "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
-	{0,0,NULL,0}
-};
-#endif
-
-#undef E
-
-static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
-{
-	struct task_struct *task = get_proc_task(inode);
-	struct files_struct *files = NULL;
-	struct file *file;
-	int fd = proc_fd(inode);
-
-	if (task) {
-		files = get_files_struct(task);
-		put_task_struct(task);
-	}
-	files = get_files_struct(task);
-	if (files) {
-		rcu_read_lock();
-		file = fcheck_files(files, fd);
-		if (file) {
-			*mnt = mntget(file->f_vfsmnt);
-			*dentry = dget(file->f_dentry);
-			rcu_read_unlock();
-			put_files_struct(files);
-			return 0;
-		}
-		rcu_read_unlock();
-		put_files_struct(files);
-	}
-	return -ENOENT;
-}
-
 static struct fs_struct *get_fs_struct(struct task_struct *task)
 {
 	struct fs_struct *fs;
@@ -1103,229 +970,158 @@ static struct inode_operations proc_pid_
 	.follow_link	= proc_pid_follow_link
 };
 
-static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
+/* building an inode */
+
+static int task_dumpable(struct task_struct *task)
 {
-	struct dentry *dentry = filp->f_dentry;
-	struct inode *inode = dentry->d_inode;
-	struct task_struct *p = get_proc_task(inode);
-	unsigned int fd, tid, ino;
-	int retval;
-	char buf[PROC_NUMBUF];
-	struct files_struct * files;
-	struct fdtable *fdt;
+	int dumpable = 0;
+	struct mm_struct *mm;
 
-	retval = -ENOENT;
-	if (!p)
-		goto out_no_task;
-	retval = 0;
-	tid = p->pid;
+	task_lock(task);
+	mm = task->mm;
+	if (mm)
+		dumpable = mm->dumpable;
+	task_unlock(task);
+	if(dumpable == 1)
+		return 1;
+	return 0;
+}
 
-	fd = filp->f_pos;
-	switch (fd) {
-		case 0:
-			if (filldir(dirent, ".", 1, 0, inode->i_ino, DT_DIR) < 0)
-				goto out;
-			filp->f_pos++;
-		case 1:
-			ino = parent_ino(dentry);
-			if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
-				goto out;
-			filp->f_pos++;
-		default:
-			files = get_files_struct(p);
-			if (!files)
-				goto out;
-			rcu_read_lock();
-			fdt = files_fdtable(files);
-			for (fd = filp->f_pos-2;
-			     fd < fdt->max_fds;
-			     fd++, filp->f_pos++) {
-				unsigned int i,j;
 
-				if (!fcheck_files(files, fd))
-					continue;
-				rcu_read_unlock();
+static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, int ino)
+{
+	struct inode * inode;
+	struct proc_inode *ei;
 
-				j = PROC_NUMBUF;
-				i = fd;
-				do {
-					j--;
-					buf[j] = '0' + (i % 10);
-					i /= 10;
-				} while (i);
+	/* We need a new inode */
+	
+	inode = new_inode(sb);
+	if (!inode)
+		goto out;
 
-				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
-				if (filldir(dirent, buf+j, PROC_NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
-					rcu_read_lock();
-					break;
-				}
-				rcu_read_lock();
-			}
-			rcu_read_unlock();
-			put_files_struct(files);
+	/* Common stuff */
+	ei = PROC_I(inode);
+	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+	inode->i_ino = fake_ino(task->pid, ino);
+
+	/*
+	 * grab the reference to task.
+	 */
+	tref_set(&ei->tref, tref_get_by_task(task, PIDTYPE_PID));
+	if (!ei->tref->task)
+		goto out_unlock;
+
+	inode->i_uid = 0;
+	inode->i_gid = 0;
+	if (task_dumpable(task)) {
+		inode->i_uid = task->euid;
+		inode->i_gid = task->egid;
 	}
+	security_task_to_inode(task, inode);
+
 out:
-	put_task_struct(p);
-out_no_task:
-	return retval;
+	return inode;
+
+out_unlock:
+	iput(inode);
+	return NULL;
 }
 
-static int proc_pident_readdir(struct file *filp,
-		void *dirent, filldir_t filldir,
-		struct pid_entry *ents, unsigned int nents)
+/* dentry stuff */
+
+/*
+ *	Exceptional case: normally we are not allowed to unhash a busy
+ * directory. In this case, however, we can do it - no aliasing problems
+ * due to the way we treat inodes.
+ *
+ * Rewrite the inode's ownerships here because the owning task may have
+ * performed a setuid(), etc.
+ */
+static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
-	int i;
-	int pid;
-	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = get_proc_task(inode);
-	struct pid_entry *p;
-	ino_t ino;
-	int ret;
-
-	ret = -ENOENT;
-	if (!task)
-		goto out;
-
-	ret = 0;
-	pid = task->pid;
-	put_task_struct(task);
-	i = filp->f_pos;
-	switch (i) {
-	case 0:
-		ino = inode->i_ino;
-		if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
-			goto out;
-		i++;
-		filp->f_pos++;
-		/* fall through */
-	case 1:
-		ino = parent_ino(dentry);
-		if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
-			goto out;
-		i++;
-		filp->f_pos++;
-		/* fall through */
-	default:
-		i -= 2;
-		if (i >= nents) {
-			ret = 1;
-			goto out;
-		}
-		p = ents + i;
-		while (p->name) {
-			if (filldir(dirent, p->name, p->len, filp->f_pos,
-				    fake_ino(pid, p->type), p->mode >> 12) < 0)
-				goto out;
-			filp->f_pos++;
-			p++;
+	if (task) {
+		if (task_dumpable(task)) {
+			inode->i_uid = task->euid;
+			inode->i_gid = task->egid;
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
 		}
+		security_task_to_inode(task, inode);
+		put_task_struct(task);
+		return 1;
 	}
-
-	ret = 1;
-out:
-	return ret;
-}
-
-static int proc_tgid_base_readdir(struct file * filp,
-			     void * dirent, filldir_t filldir)
-{
-	return proc_pident_readdir(filp,dirent,filldir,
-				   tgid_base_stuff,ARRAY_SIZE(tgid_base_stuff));
+	d_drop(dentry);
+	return 0;
 }
 
-static int proc_tid_base_readdir(struct file * filp,
-			     void * dirent, filldir_t filldir)
+static int pid_delete_dentry(struct dentry * dentry)
 {
-	return proc_pident_readdir(filp,dirent,filldir,
-				   tid_base_stuff,ARRAY_SIZE(tid_base_stuff));
+	/* Is the task we represent dead?
+	 * If so, then don't put the dentry on the lru list,
+	 * kill it immediately.
+	 */
+	return !proc_tref(dentry->d_inode)->task;
 }
 
-/* building an inode */
-
-static int task_dumpable(struct task_struct *task)
+static struct dentry_operations pid_dentry_operations =
 {
-	int dumpable = 0;
-	struct mm_struct *mm;
-
-	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		dumpable = mm->dumpable;
-	task_unlock(task);
-	if(dumpable == 1)
-		return 1;
-	return 0;
-}
+	.d_revalidate	= pid_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
 
+/* Lookups */
 
-static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, int ino)
+static unsigned name_to_int(struct dentry *dentry)
 {
-	struct inode * inode;
-	struct proc_inode *ei;
+	const char *name = dentry->d_name.name;
+	int len = dentry->d_name.len;
+	unsigned n = 0;
 
-	/* We need a new inode */
-	
-	inode = new_inode(sb);
-	if (!inode)
+	if (len > 1 && *name == '0')
 		goto out;
-
-	/* Common stuff */
-	ei = PROC_I(inode);
-	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
-	inode->i_ino = fake_ino(task->pid, ino);
-
-	/*
-	 * grab the reference to task.
-	 */
-	tref_set(&ei->tref, tref_get_by_task(task, PIDTYPE_PID));
-	if (!ei->tref->task)
-		goto out_unlock;
-
-	inode->i_uid = 0;
-	inode->i_gid = 0;
-	if (task_dumpable(task)) {
-		inode->i_uid = task->euid;
-		inode->i_gid = task->egid;
+	while (len-- > 0) {
+		unsigned c = *name++ - '0';
+		if (c > 9)
+			goto out;
+		if (n >= (~0U-9)/10)
+			goto out;
+		n *= 10;
+		n += c;
 	}
-	security_task_to_inode(task, inode);
-
+	return n;
 out:
-	return inode;
-
-out_unlock:
-	iput(inode);
-	return NULL;
+	return ~0U;
 }
 
-/* dentry stuff */
-
-/*
- *	Exceptional case: normally we are not allowed to unhash a busy
- * directory. In this case, however, we can do it - no aliasing problems
- * due to the way we treat inodes.
- *
- * Rewrite the inode's ownerships here because the owning task may have
- * performed a setuid(), etc.
- */
-static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
+static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
 {
-	struct inode *inode = dentry->d_inode;
 	struct task_struct *task = get_proc_task(inode);
+	struct files_struct *files = NULL;
+	struct file *file;
+	int fd = proc_fd(inode);
+
 	if (task) {
-		if (task_dumpable(task)) {
-			inode->i_uid = task->euid;
-			inode->i_gid = task->egid;
-		} else {
-			inode->i_uid = 0;
-			inode->i_gid = 0;
-		}
-		security_task_to_inode(task, inode);
+		files = get_files_struct(task);
 		put_task_struct(task);
-		return 1;
 	}
-	d_drop(dentry);
-	return 0;
+	files = get_files_struct(task);
+	if (files) {
+		rcu_read_lock();
+		file = fcheck_files(files, fd);
+		if (file) {
+			*mnt = mntget(file->f_vfsmnt);
+			*dentry = dget(file->f_dentry);
+			rcu_read_unlock();
+			put_files_struct(files);
+			return 0;
+		}
+		rcu_read_unlock();
+		put_files_struct(files);
+	}
+	return -ENOENT;
 }
 
 static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
@@ -1361,51 +1157,12 @@ static int tid_fd_revalidate(struct dent
 	return 0;
 }
 
-static int pid_delete_dentry(struct dentry * dentry)
-{
-	/* Is the task we represent dead?
-	 * If so, then don't put the dentry on the lru list,
-	 * kill it immediately.
-	 */
-	return !proc_tref(dentry->d_inode)->task;
-}
-
 static struct dentry_operations tid_fd_dentry_operations =
 {
 	.d_revalidate	= tid_fd_revalidate,
 	.d_delete	= pid_delete_dentry,
 };
 
-static struct dentry_operations pid_dentry_operations =
-{
-	.d_revalidate	= pid_revalidate,
-	.d_delete	= pid_delete_dentry,
-};
-
-/* Lookups */
-
-static unsigned name_to_int(struct dentry *dentry)
-{
-	const char *name = dentry->d_name.name;
-	int len = dentry->d_name.len;
-	unsigned n = 0;
-
-	if (len > 1 && *name == '0')
-		goto out;
-	while (len-- > 0) {
-		unsigned c = *name++ - '0';
-		if (c > 9)
-			goto out;
-		if (n >= (~0U-9)/10)
-			goto out;
-		n *= 10;
-		n += c;
-	}
-	return n;
-out:
-	return ~0U;
-}
-
 /* SMP-safe */
 static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
 {
@@ -1450,32 +1207,90 @@ static struct dentry *proc_lookupfd(stru
 	if (tid_fd_revalidate(dentry, NULL))
 		result = NULL;
 out:
-	put_task_struct(task);
+	put_task_struct(task);
+out_no_task:
+	return result;
+
+out_unlock2:
+	rcu_read_unlock();
+	put_files_struct(files);
+out_unlock:
+	iput(inode);
+	goto out;
+}
+
+static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_dentry;
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *p = get_proc_task(inode);
+	unsigned int fd, tid, ino;
+	int retval;
+	char buf[PROC_NUMBUF];
+	struct files_struct * files;
+	struct fdtable *fdt;
+
+	retval = -ENOENT;
+	if (!p)
+		goto out_no_task;
+	retval = 0;
+	tid = p->pid;
+
+	fd = filp->f_pos;
+	switch (fd) {
+		case 0:
+			if (filldir(dirent, ".", 1, 0, inode->i_ino, DT_DIR) < 0)
+				goto out;
+			filp->f_pos++;
+		case 1:
+			ino = parent_ino(dentry);
+			if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+				goto out;
+			filp->f_pos++;
+		default:
+			files = get_files_struct(p);
+			if (!files)
+				goto out;
+			rcu_read_lock();
+			fdt = files_fdtable(files);
+			for (fd = filp->f_pos-2;
+			     fd < fdt->max_fds;
+			     fd++, filp->f_pos++) {
+				unsigned int i,j;
+
+				if (!fcheck_files(files, fd))
+					continue;
+				rcu_read_unlock();
+
+				j = PROC_NUMBUF;
+				i = fd;
+				do {
+					j--;
+					buf[j] = '0' + (i % 10);
+					i /= 10;
+				} while (i);
+
+				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
+				if (filldir(dirent, buf+j, PROC_NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
+					rcu_read_lock();
+					break;
+				}
+				rcu_read_lock();
+			}
+			rcu_read_unlock();
+			put_files_struct(files);
+	}
+out:
+	put_task_struct(p);
 out_no_task:
-	return result;
-
-out_unlock2:
-	rcu_read_unlock();
-	put_files_struct(files);
-out_unlock:
-	iput(inode);
-	goto out;
+	return retval;
 }
 
-static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir);
-static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd);
-static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat);
-
 static struct file_operations proc_fd_operations = {
 	.read		= generic_read_dir,
 	.readdir	= proc_readfd,
 };
 
-static struct file_operations proc_task_operations = {
-	.read		= generic_read_dir,
-	.readdir	= proc_task_readdir,
-};
-
 /*
  * proc directories can do almost nothing..
  */
@@ -1483,86 +1298,11 @@ static struct inode_operations proc_fd_i
 	.lookup		= proc_lookupfd,
 };
 
-static struct inode_operations proc_task_inode_operations = {
-	.lookup		= proc_task_lookup,
-	.getattr	= proc_task_getattr,
-};
+static struct file_operations proc_task_operations;
+static struct inode_operations proc_task_inode_operations;
 
 #ifdef CONFIG_SECURITY
-static ssize_t proc_pid_attr_read(struct file * file, char __user * buf,
-				  size_t count, loff_t *ppos)
-{
-	struct inode * inode = file->f_dentry->d_inode;
-	unsigned long page;
-	ssize_t length;
-	struct task_struct *task = get_proc_task(inode);
-
-	length = -ESRCH;
-	if (!task)
-		goto out_no_task;
-
-	if (count > PAGE_SIZE)
-		count = PAGE_SIZE;
-	length = -ENOMEM;
-	if (!(page = __get_free_page(GFP_KERNEL)))
-		goto out;
-
-	length = security_getprocattr(task, 
-				      (char*)file->f_dentry->d_name.name, 
-				      (void*)page, count);
-	if (length >= 0)
-		length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
-	free_page(page);
-out:
-	put_task_struct(task);
-out_no_task:
-	return length;
-}
-
-static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
-				   size_t count, loff_t *ppos)
-{ 
-	struct inode * inode = file->f_dentry->d_inode;
-	char *page; 
-	ssize_t length; 
-	struct task_struct *task = get_proc_task(inode); 
-
-	length = -ESRCH;
-	if (!task)
-		goto out_no_task;
-	if (count > PAGE_SIZE) 
-		count = PAGE_SIZE; 
-
-	/* No partial writes. */
-	length = -EINVAL;
-	if (*ppos != 0)
-		goto out;
-
-	length = -ENOMEM;
-	page = (char*)__get_free_page(GFP_USER); 
-	if (!page) 
-		goto out;
-
-	length = -EFAULT; 
-	if (copy_from_user(page, buf, count)) 
-		goto out_free;
-
-	length = security_setprocattr(task, 
-				      (char*)file->f_dentry->d_name.name, 
-				      (void*)page, count);
-out_free:
-	free_page((unsigned long) page);
-out:
-	put_task_struct(task);
-out_no_task:
-	return length;
-} 
-
-static struct file_operations proc_pid_attr_operations = {
-	.read		= proc_pid_attr_read,
-	.write		= proc_pid_attr_write,
-};
-
+static struct file_operations proc_pid_attr_operations;
 static struct file_operations proc_tid_attr_operations;
 static struct inode_operations proc_tid_attr_inode_operations;
 static struct file_operations proc_tgid_attr_operations;
@@ -1770,33 +1510,153 @@ out_no_task:
 	return error;
 }
 
-static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd){
-	return proc_pident_lookup(dir, dentry, tgid_base_stuff);
+static int proc_pident_readdir(struct file *filp,
+		void *dirent, filldir_t filldir,
+		struct pid_entry *ents, unsigned int nents)
+{
+	int i;
+	int pid;
+	struct dentry *dentry = filp->f_dentry;
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task = get_proc_task(inode);
+	struct pid_entry *p;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	if (!task)
+		goto out;
+
+	ret = 0;
+	pid = task->pid;
+	put_task_struct(task);
+	i = filp->f_pos;
+	switch (i) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
+			goto out;
+		i++;
+		filp->f_pos++;
+		/* fall through */
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
+			goto out;
+		i++;
+		filp->f_pos++;
+		/* fall through */
+	default:
+		i -= 2;
+		if (i >= nents) {
+			ret = 1;
+			goto out;
+		}
+		p = ents + i;
+		while (p->name) {
+			if (filldir(dirent, p->name, p->len, filp->f_pos,
+				    fake_ino(pid, p->type), p->mode >> 12) < 0)
+				goto out;
+			filp->f_pos++;
+			p++;
+		}
+	}
+
+	ret = 1;
+out:
+	return ret;
 }
 
-static struct dentry *proc_tid_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd){
-	return proc_pident_lookup(dir, dentry, tid_base_stuff);
+#ifdef CONFIG_SECURITY
+static ssize_t proc_pid_attr_read(struct file * file, char __user * buf,
+				  size_t count, loff_t *ppos)
+{
+	struct inode * inode = file->f_dentry->d_inode;
+	unsigned long page;
+	ssize_t length;
+	struct task_struct *task = get_proc_task(inode);
+
+	length = -ESRCH;
+	if (!task)
+		goto out_no_task;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+	length = -ENOMEM;
+	if (!(page = __get_free_page(GFP_KERNEL)))
+		goto out;
+
+	length = security_getprocattr(task, 
+				      (char*)file->f_dentry->d_name.name, 
+				      (void*)page, count);
+	if (length >= 0)
+		length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
+	free_page(page);
+out:
+	put_task_struct(task);
+out_no_task:
+	return length;
 }
 
-static struct file_operations proc_tgid_base_operations = {
-	.read		= generic_read_dir,
-	.readdir	= proc_tgid_base_readdir,
-};
+static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
+				   size_t count, loff_t *ppos)
+{ 
+	struct inode * inode = file->f_dentry->d_inode;
+	char *page; 
+	ssize_t length; 
+	struct task_struct *task = get_proc_task(inode); 
 
-static struct file_operations proc_tid_base_operations = {
-	.read		= generic_read_dir,
-	.readdir	= proc_tid_base_readdir,
-};
+	length = -ESRCH;
+	if (!task)
+		goto out_no_task;
+	if (count > PAGE_SIZE) 
+		count = PAGE_SIZE; 
 
-static struct inode_operations proc_tgid_base_inode_operations = {
-	.lookup		= proc_tgid_base_lookup,
+	/* No partial writes. */
+	length = -EINVAL;
+	if (*ppos != 0)
+		goto out;
+
+	length = -ENOMEM;
+	page = (char*)__get_free_page(GFP_USER); 
+	if (!page) 
+		goto out;
+
+	length = -EFAULT; 
+	if (copy_from_user(page, buf, count)) 
+		goto out_free;
+
+	length = security_setprocattr(task, 
+				      (char*)file->f_dentry->d_name.name, 
+				      (void*)page, count);
+out_free:
+	free_page((unsigned long) page);
+out:
+	put_task_struct(task);
+out_no_task:
+	return length;
+} 
+
+static struct file_operations proc_pid_attr_operations = {
+	.read		= proc_pid_attr_read,
+	.write		= proc_pid_attr_write,
 };
 
-static struct inode_operations proc_tid_base_inode_operations = {
-	.lookup		= proc_tid_base_lookup,
+static struct pid_entry tgid_attr_stuff[] = {
+	E(PROC_TGID_ATTR_CURRENT,  "current",  S_IFREG|S_IRUGO|S_IWUGO),
+	E(PROC_TGID_ATTR_PREV,     "prev",     S_IFREG|S_IRUGO),
+	E(PROC_TGID_ATTR_EXEC,     "exec",     S_IFREG|S_IRUGO|S_IWUGO),
+	E(PROC_TGID_ATTR_FSCREATE, "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
+	{0,0,NULL,0}
+};
+static struct pid_entry tid_attr_stuff[] = {
+	E(PROC_TID_ATTR_CURRENT,   "current",  S_IFREG|S_IRUGO|S_IWUGO),
+	E(PROC_TID_ATTR_PREV,      "prev",     S_IFREG|S_IRUGO),
+	E(PROC_TID_ATTR_EXEC,      "exec",     S_IFREG|S_IRUGO|S_IWUGO),
+	E(PROC_TID_ATTR_FSCREATE,  "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
+	{0,0,NULL,0}
 };
 
-#ifdef CONFIG_SECURITY
 static int proc_tgid_attr_readdir(struct file * filp,
 			     void * dirent, filldir_t filldir)
 {
@@ -1865,6 +1725,73 @@ static struct inode_operations proc_self
 	.follow_link	= proc_self_follow_link,
 };
 
+/*
+ * Thread groups
+ */
+static struct pid_entry tgid_base_stuff[] = {
+	E(PROC_TGID_TASK,      "task",    S_IFDIR|S_IRUGO|S_IXUGO),
+	E(PROC_TGID_FD,        "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
+	E(PROC_TGID_ENVIRON,   "environ", S_IFREG|S_IRUSR),
+	E(PROC_TGID_AUXV,      "auxv",	  S_IFREG|S_IRUSR),
+	E(PROC_TGID_STATUS,    "status",  S_IFREG|S_IRUGO),
+	E(PROC_TGID_CMDLINE,   "cmdline", S_IFREG|S_IRUGO),
+	E(PROC_TGID_STAT,      "stat",    S_IFREG|S_IRUGO),
+	E(PROC_TGID_STATM,     "statm",   S_IFREG|S_IRUGO),
+	E(PROC_TGID_MAPS,      "maps",    S_IFREG|S_IRUGO),
+#ifdef CONFIG_NUMA
+	E(PROC_TGID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+#endif
+	E(PROC_TGID_MEM,       "mem",     S_IFREG|S_IRUSR|S_IWUSR),
+#ifdef CONFIG_SECCOMP
+	E(PROC_TGID_SECCOMP,   "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+#endif
+	E(PROC_TGID_CWD,       "cwd",     S_IFLNK|S_IRWXUGO),
+	E(PROC_TGID_ROOT,      "root",    S_IFLNK|S_IRWXUGO),
+	E(PROC_TGID_EXE,       "exe",     S_IFLNK|S_IRWXUGO),
+	E(PROC_TGID_MOUNTS,    "mounts",  S_IFREG|S_IRUGO),
+#ifdef CONFIG_MMU
+	E(PROC_TGID_SMAPS,     "smaps",   S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_SECURITY
+	E(PROC_TGID_ATTR,      "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
+#endif
+#ifdef CONFIG_KALLSYMS
+	E(PROC_TGID_WCHAN,     "wchan",   S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_SCHEDSTATS
+	E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TGID_CPUSET,    "cpuset",  S_IFREG|S_IRUGO),
+#endif
+	E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
+	E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
+#ifdef CONFIG_AUDITSYSCALL
+	E(PROC_TGID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
+#endif
+	{0,0,NULL,0}
+};
+
+static int proc_tgid_base_readdir(struct file * filp,
+			     void * dirent, filldir_t filldir)
+{
+	return proc_pident_readdir(filp,dirent,filldir,
+				   tgid_base_stuff,ARRAY_SIZE(tgid_base_stuff));
+}
+
+static struct file_operations proc_tgid_base_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_tgid_base_readdir,
+};
+
+static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd){
+	return proc_pident_lookup(dir, dentry, tgid_base_stuff);
+}
+
+static struct inode_operations proc_tgid_base_inode_operations = {
+	.lookup		= proc_tgid_base_lookup,
+};
+
 /**
  * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
  *
@@ -2003,62 +1930,6 @@ out:
 	return result;
 }
 
-/* SMP-safe */
-static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
-{
-	struct dentry *result = ERR_PTR(-ENOENT);
-	struct task_struct *task;
-	struct task_struct *leader = get_proc_task(dir);
-	struct inode *inode;
-	unsigned tid;
-
-	if (!leader)
-		goto out_no_task;
-
-	tid = name_to_int(dentry);
-	if (tid == ~0U)
-		goto out;
-
-	read_lock(&tasklist_lock);
-	task = find_task_by_pid(tid);
-	if (task)
-		get_task_struct(task);
-	read_unlock(&tasklist_lock);
-	if (!task)
-		goto out;
-	if (leader->tgid != task->tgid)
-		goto out_drop_task;
-
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_INO);
-
-
-	if (!inode)
-		goto out_drop_task;
-	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
-	inode->i_op = &proc_tid_base_inode_operations;
-	inode->i_fop = &proc_tid_base_operations;
-	inode->i_flags|=S_IMMUTABLE;
-#ifdef CONFIG_SECURITY
-	inode->i_nlink = 4;
-#else
-	inode->i_nlink = 3;
-#endif
-
-	dentry->d_op = &pid_dentry_operations;
-
-	d_add(dentry, inode);
-	/* Close the race of the process dying before we return the dentry */
-	if (pid_revalidate(dentry, NULL))
-		result = NULL;
-
-out_drop_task:
-	put_task_struct(task);
-out:
-	put_task_struct(leader);
-out_no_task:
-	return result;
-}
-
 /*
  * Find the first tgid to return to user space.
  *
@@ -2176,6 +2047,129 @@ int proc_pid_readdir(struct file * filp,
 }
 
 /*
+ * Tasks
+ */
+static struct pid_entry tid_base_stuff[] = {
+	E(PROC_TID_FD,         "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
+	E(PROC_TID_ENVIRON,    "environ", S_IFREG|S_IRUSR),
+	E(PROC_TID_AUXV,       "auxv",	  S_IFREG|S_IRUSR),
+	E(PROC_TID_STATUS,     "status",  S_IFREG|S_IRUGO),
+	E(PROC_TID_CMDLINE,    "cmdline", S_IFREG|S_IRUGO),
+	E(PROC_TID_STAT,       "stat",    S_IFREG|S_IRUGO),
+	E(PROC_TID_STATM,      "statm",   S_IFREG|S_IRUGO),
+	E(PROC_TID_MAPS,       "maps",    S_IFREG|S_IRUGO),
+#ifdef CONFIG_NUMA
+	E(PROC_TID_NUMA_MAPS,  "numa_maps",    S_IFREG|S_IRUGO),
+#endif
+	E(PROC_TID_MEM,        "mem",     S_IFREG|S_IRUSR|S_IWUSR),
+#ifdef CONFIG_SECCOMP
+	E(PROC_TID_SECCOMP,    "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+#endif
+	E(PROC_TID_CWD,        "cwd",     S_IFLNK|S_IRWXUGO),
+	E(PROC_TID_ROOT,       "root",    S_IFLNK|S_IRWXUGO),
+	E(PROC_TID_EXE,        "exe",     S_IFLNK|S_IRWXUGO),
+	E(PROC_TID_MOUNTS,     "mounts",  S_IFREG|S_IRUGO),
+#ifdef CONFIG_MMU
+	E(PROC_TID_SMAPS,      "smaps",   S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_SECURITY
+	E(PROC_TID_ATTR,       "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
+#endif
+#ifdef CONFIG_KALLSYMS
+	E(PROC_TID_WCHAN,      "wchan",   S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_SCHEDSTATS
+	E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TID_CPUSET,     "cpuset",  S_IFREG|S_IRUGO),
+#endif
+	E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
+	E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
+#ifdef CONFIG_AUDITSYSCALL
+	E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
+#endif
+	{0,0,NULL,0}
+};
+
+
+static int proc_tid_base_readdir(struct file * filp,
+			     void * dirent, filldir_t filldir)
+{
+	return proc_pident_readdir(filp,dirent,filldir,
+				   tid_base_stuff,ARRAY_SIZE(tid_base_stuff));
+}
+
+static struct file_operations proc_tid_base_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_tid_base_readdir,
+};
+
+static struct dentry *proc_tid_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd){
+	return proc_pident_lookup(dir, dentry, tid_base_stuff);
+}
+
+static struct inode_operations proc_tid_base_inode_operations = {
+	.lookup		= proc_tid_base_lookup,
+};
+
+/* SMP-safe */
+static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
+{
+	struct dentry *result = ERR_PTR(-ENOENT);
+	struct task_struct *task;
+	struct task_struct *leader = get_proc_task(dir);
+	struct inode *inode;
+	unsigned tid;
+
+	if (!leader)
+		goto out_no_task;
+
+	tid = name_to_int(dentry);
+	if (tid == ~0U)
+		goto out;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_pid(tid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+	if (!task)
+		goto out;
+	if (leader->tgid != task->tgid)
+		goto out_drop_task;
+
+	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_INO);
+
+
+	if (!inode)
+		goto out_drop_task;
+	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+	inode->i_op = &proc_tid_base_inode_operations;
+	inode->i_fop = &proc_tid_base_operations;
+	inode->i_flags|=S_IMMUTABLE;
+#ifdef CONFIG_SECURITY
+	inode->i_nlink = 4;
+#else
+	inode->i_nlink = 3;
+#endif
+
+	dentry->d_op = &pid_dentry_operations;
+
+	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		result = NULL;
+
+out_drop_task:
+	put_task_struct(task);
+out:
+	put_task_struct(leader);
+out_no_task:
+	return result;
+}
+
+/*
  * Find the first tid of a thread group to return to user space.
  *
  * Usually this is just the thread group leader, but if the users
@@ -2327,3 +2321,13 @@ static int proc_task_getattr(struct vfsm
 		
 	return 0;
 }
+
+static struct inode_operations proc_task_inode_operations = {
+	.lookup		= proc_task_lookup,
+	.getattr	= proc_task_getattr,
+};
+
+static struct file_operations proc_task_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_task_readdir,
+};
-- 
1.2.2.g709a



* [PATCH 19/23] proc: Modify proc_pident_lookup to be completely table driven.
  2006-02-23 16:25                                   ` [PATCH 18/23] proc: Reorder the functions in base.c Eric W. Biederman
@ 2006-02-23 16:27                                     ` Eric W. Biederman
  2006-02-23 16:28                                       ` [PATCH 20/23] proc: Make the generation of the self symlink " Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Currently proc_pident_lookup gets the names and types from a table,
and then uses a huge switch statement to pick the inode and file
operations it needs.  That is silly and is becoming increasingly hard
to maintain, so I just put all of the information in the table.
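
As a rough userspace illustration of the idea (not the kernel code
itself), the sketch below keeps the operations next to the name in each
table row, so the lookup only has to copy whatever the matching row
carries.  The names struct ops, struct node, ENT, find_entry and
instantiate are made up for this example; only the shape of the loop
and the "copy if non-NULL" step follow the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for inode_operations / file_operations. */
struct ops { const char *label; };

static const struct ops default_ops   = { "default" };
static const struct ops info_file_ops = { "info_file" };
static const struct ops maps_ops      = { "maps" };
static const struct ops link_iops     = { "pid_link" };

/* One row per directory entry: the name plus everything lookup needs. */
struct entry {
	const char *name;
	size_t len;
	const struct ops *iop;	/* NULL means: keep the default */
	const struct ops *fop;	/* NULL means: keep the default */
};

#define ENT(NAME, IOP, FOP) { (NAME), sizeof(NAME) - 1, (IOP), (FOP) }

static const struct entry table[] = {
	ENT("status", NULL,       &info_file_ops),
	ENT("maps",   NULL,       &maps_ops),
	ENT("cwd",    &link_iops, NULL),
	{ NULL, 0, NULL, NULL }	/* terminator: name == NULL */
};

/* Hypothetical stand-in for the inode being filled in. */
struct node {
	const struct ops *iop;
	const struct ops *fop;
};

/* Linear scan: compare the length first, then the bytes. */
static const struct entry *find_entry(const struct entry *ents,
				      const char *name, size_t len)
{
	const struct entry *p;

	for (p = ents; p->name; p++) {
		if (p->len != len)
			continue;
		if (!memcmp(name, p->name, len))
			return p;
	}
	return NULL;
}

/* No switch on the entry type: just copy what the row provides. */
static void instantiate(struct node *n, const struct entry *p)
{
	n->iop = &default_ops;
	n->fop = &default_ops;
	if (p->iop)
		n->iop = p->iop;
	if (p->fop)
		n->fop = p->fop;
}
```

Adding a new entry in this scheme means adding one row to the table;
find_entry and instantiate do not change.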

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c          |  332 +++++++++++++++--------------------------------
 include/linux/proc_fs.h |   10 +
 2 files changed, 109 insertions(+), 233 deletions(-)

aa1e6b9f7d2bf0d4a39fcb1ebd3c539e1287ea45
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 71edbad..b27175a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -189,9 +189,37 @@ struct pid_entry {
 	int len;
 	char *name;
 	mode_t mode;
-};
+	struct inode_operations *iop;
+	struct file_operations *fop;
+	union proc_op op;
+};
+
+#define NOD(TYPE, NAME, MODE, IOP, FOP, OP) {		\
+	.type = (TYPE),					\
+	.len  = sizeof(NAME) - 1,			\
+	.name = (NAME),					\
+	.mode = MODE,					\
+	.iop  = IOP,					\
+	.fop  = FOP,					\
+	.op   = OP,					\
+}
+
+#define DIR(TYPE, NAME, MODE, OTYPE)						\
+	NOD(TYPE, NAME, (S_IFDIR|(MODE)),					\
+		&proc_##OTYPE##_inode_operations, &proc_##OTYPE##_operations,	\
+		{} )
+#define LNK(TYPE, NAME, OTYPE)					\
+	NOD(TYPE, NAME, (S_IFLNK|S_IRWXUGO),			\
+		&proc_pid_link_inode_operations, NULL,		\
+		{ .proc_get_link = &proc_##OTYPE##_link } )
+#define REG(TYPE, NAME, MODE, OTYPE)			\
+	NOD(TYPE, NAME, (S_IFREG|(MODE)), NULL,		\
+		&proc_##OTYPE##_operations, {})
+#define INF(TYPE, NAME, MODE, OTYPE)			\
+	NOD(TYPE, NAME, (S_IFREG|(MODE)), 		\
+		NULL, &proc_info_file_operations,	\
+		{ .proc_read = &proc_##OTYPE } )
 
-#define E(type,name,mode) {(type),sizeof(name)-1,(name),(mode)}
 
 static struct fs_struct *get_fs_struct(struct task_struct *task)
 {
@@ -1298,17 +1326,6 @@ static struct inode_operations proc_fd_i
 	.lookup		= proc_lookupfd,
 };
 
-static struct file_operations proc_task_operations;
-static struct inode_operations proc_task_inode_operations;
-
-#ifdef CONFIG_SECURITY
-static struct file_operations proc_pid_attr_operations;
-static struct file_operations proc_tid_attr_operations;
-static struct inode_operations proc_tid_attr_inode_operations;
-static struct file_operations proc_tgid_attr_operations;
-static struct inode_operations proc_tgid_attr_inode_operations;
-#endif
-
 /* SMP-safe */
 static struct dentry *proc_pident_lookup(struct inode *dir, 
 					 struct dentry *dentry,
@@ -1326,6 +1343,10 @@ static struct dentry *proc_pident_lookup
 	if (!task)
 		goto out_no_task;
 
+	/*
+	 * Yes, it does not scale. And it should not. Don't add
+	 * new entries into /proc/<tgid>/ without very good reasons.
+	 */
 	for (p = ents; p->name; p++) {
 		if (p->len != dentry->d_name.len)
 			continue;
@@ -1342,163 +1363,13 @@ static struct dentry *proc_pident_lookup
 
 	ei = PROC_I(inode);
 	inode->i_mode = p->mode;
-	/*
-	 * Yes, it does not scale. And it should not. Don't add
-	 * new entries into /proc/<tgid>/ without very good reasons.
-	 */
-	switch(p->type) {
-		case PROC_TGID_TASK:
-			inode->i_nlink = 2;
-			inode->i_op = &proc_task_inode_operations;
-			inode->i_fop = &proc_task_operations;
-			break;
-		case PROC_TID_FD:
-		case PROC_TGID_FD:
-			inode->i_nlink = 2;
-			inode->i_op = &proc_fd_inode_operations;
-			inode->i_fop = &proc_fd_operations;
-			break;
-		case PROC_TID_EXE:
-		case PROC_TGID_EXE:
-			inode->i_op = &proc_pid_link_inode_operations;
-			ei->op.proc_get_link = proc_exe_link;
-			break;
-		case PROC_TID_CWD:
-		case PROC_TGID_CWD:
-			inode->i_op = &proc_pid_link_inode_operations;
-			ei->op.proc_get_link = proc_cwd_link;
-			break;
-		case PROC_TID_ROOT:
-		case PROC_TGID_ROOT:
-			inode->i_op = &proc_pid_link_inode_operations;
-			ei->op.proc_get_link = proc_root_link;
-			break;
-		case PROC_TID_ENVIRON:
-		case PROC_TGID_ENVIRON:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_environ;
-			break;
-		case PROC_TID_AUXV:
-		case PROC_TGID_AUXV:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_auxv;
-			break;
-		case PROC_TID_STATUS:
-		case PROC_TGID_STATUS:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_status;
-			break;
-		case PROC_TID_STAT:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_tid_stat;
-			break;
-		case PROC_TGID_STAT:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_tgid_stat;
-			break;
-		case PROC_TID_CMDLINE:
-		case PROC_TGID_CMDLINE:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_cmdline;
-			break;
-		case PROC_TID_STATM:
-		case PROC_TGID_STATM:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_statm;
-			break;
-		case PROC_TID_MAPS:
-		case PROC_TGID_MAPS:
-			inode->i_fop = &proc_maps_operations;
-			break;
-#ifdef CONFIG_NUMA
-		case PROC_TID_NUMA_MAPS:
-		case PROC_TGID_NUMA_MAPS:
-			inode->i_fop = &proc_numa_maps_operations;
-			break;
-#endif
-		case PROC_TID_MEM:
-		case PROC_TGID_MEM:
-			inode->i_fop = &proc_mem_operations;
-			break;
-#ifdef CONFIG_SECCOMP
-		case PROC_TID_SECCOMP:
-		case PROC_TGID_SECCOMP:
-			inode->i_fop = &proc_seccomp_operations;
-			break;
-#endif /* CONFIG_SECCOMP */
-		case PROC_TID_MOUNTS:
-		case PROC_TGID_MOUNTS:
-			inode->i_fop = &proc_mounts_operations;
-			break;
-#ifdef CONFIG_MMU
-		case PROC_TID_SMAPS:
-		case PROC_TGID_SMAPS:
-			inode->i_fop = &proc_smaps_operations;
-			break;
-#endif
-#ifdef CONFIG_SECURITY
-		case PROC_TID_ATTR:
-			inode->i_nlink = 2;
-			inode->i_op = &proc_tid_attr_inode_operations;
-			inode->i_fop = &proc_tid_attr_operations;
-			break;
-		case PROC_TGID_ATTR:
-			inode->i_nlink = 2;
-			inode->i_op = &proc_tgid_attr_inode_operations;
-			inode->i_fop = &proc_tgid_attr_operations;
-			break;
-		case PROC_TID_ATTR_CURRENT:
-		case PROC_TGID_ATTR_CURRENT:
-		case PROC_TID_ATTR_PREV:
-		case PROC_TGID_ATTR_PREV:
-		case PROC_TID_ATTR_EXEC:
-		case PROC_TGID_ATTR_EXEC:
-		case PROC_TID_ATTR_FSCREATE:
-		case PROC_TGID_ATTR_FSCREATE:
-			inode->i_fop = &proc_pid_attr_operations;
-			break;
-#endif
-#ifdef CONFIG_KALLSYMS
-		case PROC_TID_WCHAN:
-		case PROC_TGID_WCHAN:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_wchan;
-			break;
-#endif
-#ifdef CONFIG_SCHEDSTATS
-		case PROC_TID_SCHEDSTAT:
-		case PROC_TGID_SCHEDSTAT:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_pid_schedstat;
-			break;
-#endif
-#ifdef CONFIG_CPUSETS
-		case PROC_TID_CPUSET:
-		case PROC_TGID_CPUSET:
-			inode->i_fop = &proc_cpuset_operations;
-			break;
-#endif
-		case PROC_TID_OOM_SCORE:
-		case PROC_TGID_OOM_SCORE:
-			inode->i_fop = &proc_info_file_operations;
-			ei->op.proc_read = proc_oom_score;
-			break;
-		case PROC_TID_OOM_ADJUST:
-		case PROC_TGID_OOM_ADJUST:
-			inode->i_fop = &proc_oom_adjust_operations;
-			break;
-#ifdef CONFIG_AUDITSYSCALL
-		case PROC_TID_LOGINUID:
-		case PROC_TGID_LOGINUID:
-			inode->i_fop = &proc_loginuid_operations;
-			break;
-#endif
-		default:
-			printk("procfs: impossible type (%d)",p->type);
-			iput(inode);
-			error = ERR_PTR(-EINVAL);
-			goto out;
-	}
+	if (S_ISDIR(inode->i_mode))
+		inode->i_nlink = 2;	/* Use getattr to fix if necessary */
+	if (p->iop)
+		inode->i_op = p->iop;
+	if (p->fop)
+		inode->i_fop = p->fop;
+	ei->op = p->op;
 	dentry->d_op = &pid_dentry_operations;
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
@@ -1643,18 +1514,18 @@ static struct file_operations proc_pid_a
 };
 
 static struct pid_entry tgid_attr_stuff[] = {
-	E(PROC_TGID_ATTR_CURRENT,  "current",  S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TGID_ATTR_PREV,     "prev",     S_IFREG|S_IRUGO),
-	E(PROC_TGID_ATTR_EXEC,     "exec",     S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TGID_ATTR_FSCREATE, "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
-	{0,0,NULL,0}
+	REG(PROC_TGID_ATTR_CURRENT,  "current",  S_IRUGO|S_IWUGO, pid_attr),
+	REG(PROC_TGID_ATTR_PREV,     "prev",     S_IRUGO,         pid_attr),
+	REG(PROC_TGID_ATTR_EXEC,     "exec",     S_IRUGO|S_IWUGO, pid_attr),
+	REG(PROC_TGID_ATTR_FSCREATE, "fscreate", S_IRUGO|S_IWUGO, pid_attr),
+	{}
 };
 static struct pid_entry tid_attr_stuff[] = {
-	E(PROC_TID_ATTR_CURRENT,   "current",  S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TID_ATTR_PREV,      "prev",     S_IFREG|S_IRUGO),
-	E(PROC_TID_ATTR_EXEC,      "exec",     S_IFREG|S_IRUGO|S_IWUGO),
-	E(PROC_TID_ATTR_FSCREATE,  "fscreate", S_IFREG|S_IRUGO|S_IWUGO),
-	{0,0,NULL,0}
+	REG(PROC_TID_ATTR_CURRENT,   "current",  S_IRUGO|S_IWUGO, pid_attr),
+	REG(PROC_TID_ATTR_PREV,      "prev",     S_IRUGO,         pid_attr),
+	REG(PROC_TID_ATTR_EXEC,      "exec",     S_IRUGO|S_IWUGO, pid_attr),
+	REG(PROC_TID_ATTR_FSCREATE,  "fscreate", S_IRUGO|S_IWUGO, pid_attr),
+	{}
 };
 
 static int proc_tgid_attr_readdir(struct file * filp,
@@ -1728,48 +1599,51 @@ static struct inode_operations proc_self
 /*
  * Thread groups
  */
+static struct file_operations proc_task_operations;
+static struct inode_operations proc_task_inode_operations;
+
 static struct pid_entry tgid_base_stuff[] = {
-	E(PROC_TGID_TASK,      "task",    S_IFDIR|S_IRUGO|S_IXUGO),
-	E(PROC_TGID_FD,        "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
-	E(PROC_TGID_ENVIRON,   "environ", S_IFREG|S_IRUSR),
-	E(PROC_TGID_AUXV,      "auxv",	  S_IFREG|S_IRUSR),
-	E(PROC_TGID_STATUS,    "status",  S_IFREG|S_IRUGO),
-	E(PROC_TGID_CMDLINE,   "cmdline", S_IFREG|S_IRUGO),
-	E(PROC_TGID_STAT,      "stat",    S_IFREG|S_IRUGO),
-	E(PROC_TGID_STATM,     "statm",   S_IFREG|S_IRUGO),
-	E(PROC_TGID_MAPS,      "maps",    S_IFREG|S_IRUGO),
+	DIR(PROC_TGID_TASK,    "task",    S_IRUGO|S_IXUGO, task),
+	DIR(PROC_TGID_FD,      "fd",      S_IRUSR|S_IXUSR, fd),
+	INF(PROC_TGID_ENVIRON, "environ", S_IRUSR, pid_environ),
+	INF(PROC_TGID_AUXV,    "auxv",	  S_IRUSR, pid_auxv),
+	INF(PROC_TGID_STATUS,  "status",  S_IRUGO, pid_status),
+	INF(PROC_TGID_CMDLINE, "cmdline", S_IRUGO, pid_cmdline),
+	INF(PROC_TGID_STAT,    "stat",    S_IRUGO, tgid_stat),
+	INF(PROC_TGID_STATM,   "statm",   S_IRUGO, pid_statm),
+	REG(PROC_TGID_MAPS,    "maps",    S_IRUGO, maps),
 #ifdef CONFIG_NUMA
-	E(PROC_TGID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+	REG(PROC_TGID_NUMA_MAPS, "numa_maps", S_IRUGO, numa_maps),
 #endif
-	E(PROC_TGID_MEM,       "mem",     S_IFREG|S_IRUSR|S_IWUSR),
+	REG(PROC_TGID_MEM,     "mem",     S_IRUSR|S_IWUSR, mem),
 #ifdef CONFIG_SECCOMP
-	E(PROC_TGID_SECCOMP,   "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+	REG(PROC_TGID_SECCOMP, "seccomp", S_IRUSR|S_IWUSR, seccomp),
 #endif
-	E(PROC_TGID_CWD,       "cwd",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_ROOT,      "root",    S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_EXE,       "exe",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TGID_MOUNTS,    "mounts",  S_IFREG|S_IRUGO),
+	LNK(PROC_TGID_CWD,     "cwd",     cwd),
+	LNK(PROC_TGID_ROOT,    "root",    root),
+	LNK(PROC_TGID_EXE,     "exe",     exe),
+	REG(PROC_TGID_MOUNTS,  "mounts",  S_IRUGO, mounts),
 #ifdef CONFIG_MMU
-	E(PROC_TGID_SMAPS,     "smaps",   S_IFREG|S_IRUGO),
+	REG(PROC_TGID_SMAPS,   "smaps",   S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	E(PROC_TGID_ATTR,      "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
+	DIR(PROC_TGID_ATTR,    "attr",    S_IRUGO|S_IXUGO, tgid_attr),
 #endif
 #ifdef CONFIG_KALLSYMS
-	E(PROC_TGID_WCHAN,     "wchan",   S_IFREG|S_IRUGO),
+	INF(PROC_TGID_WCHAN,   "wchan",   S_IRUGO, pid_wchan),
 #endif
 #ifdef CONFIG_SCHEDSTATS
-	E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
+	INF(PROC_TGID_SCHEDSTAT, "schedstat", S_IRUGO, pid_schedstat),
 #endif
 #ifdef CONFIG_CPUSETS
-	E(PROC_TGID_CPUSET,    "cpuset",  S_IFREG|S_IRUGO),
+	REG(PROC_TGID_CPUSET,  "cpuset",  S_IRUGO, cpuset),
 #endif
-	E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
-	E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
+	INF(PROC_TGID_OOM_SCORE, "oom_score", S_IRUGO, oom_score),
+	REG(PROC_TGID_OOM_ADJUST,"oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
 #ifdef CONFIG_AUDITSYSCALL
-	E(PROC_TGID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
+	REG(PROC_TGID_LOGINUID, "loginuid", S_IWUSR|S_IRUGO, loginuid),
 #endif
-	{0,0,NULL,0}
+	{}
 };
 
 static int proc_tgid_base_readdir(struct file * filp,
@@ -2050,46 +1924,46 @@ int proc_pid_readdir(struct file * filp,
  * Tasks
  */
 static struct pid_entry tid_base_stuff[] = {
-	E(PROC_TID_FD,         "fd",      S_IFDIR|S_IRUSR|S_IXUSR),
-	E(PROC_TID_ENVIRON,    "environ", S_IFREG|S_IRUSR),
-	E(PROC_TID_AUXV,       "auxv",	  S_IFREG|S_IRUSR),
-	E(PROC_TID_STATUS,     "status",  S_IFREG|S_IRUGO),
-	E(PROC_TID_CMDLINE,    "cmdline", S_IFREG|S_IRUGO),
-	E(PROC_TID_STAT,       "stat",    S_IFREG|S_IRUGO),
-	E(PROC_TID_STATM,      "statm",   S_IFREG|S_IRUGO),
-	E(PROC_TID_MAPS,       "maps",    S_IFREG|S_IRUGO),
+	DIR(PROC_TID_FD,       "fd",      S_IRUSR|S_IXUSR, fd),
+	INF(PROC_TID_ENVIRON,  "environ", S_IRUSR, pid_environ),
+	INF(PROC_TID_AUXV,     "auxv",	  S_IRUSR, pid_auxv),
+	INF(PROC_TID_STATUS,   "status",  S_IRUGO, pid_status),
+	INF(PROC_TID_CMDLINE,  "cmdline", S_IRUGO, pid_cmdline),
+	INF(PROC_TID_STAT,     "stat",    S_IRUGO, tid_stat),
+	INF(PROC_TID_STATM,    "statm",   S_IRUGO, pid_statm),
+	REG(PROC_TID_MAPS,     "maps",    S_IRUGO, maps),
 #ifdef CONFIG_NUMA
-	E(PROC_TID_NUMA_MAPS,  "numa_maps",    S_IFREG|S_IRUGO),
+	REG(PROC_TID_NUMA_MAPS, "numa_maps", S_IRUGO, numa_maps),
 #endif
-	E(PROC_TID_MEM,        "mem",     S_IFREG|S_IRUSR|S_IWUSR),
+	REG(PROC_TID_MEM,      "mem",     S_IRUSR|S_IWUSR, mem),
 #ifdef CONFIG_SECCOMP
-	E(PROC_TID_SECCOMP,    "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+	REG(PROC_TID_SECCOMP,  "seccomp", S_IRUSR|S_IWUSR, seccomp),
 #endif
-	E(PROC_TID_CWD,        "cwd",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_ROOT,       "root",    S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_EXE,        "exe",     S_IFLNK|S_IRWXUGO),
-	E(PROC_TID_MOUNTS,     "mounts",  S_IFREG|S_IRUGO),
+	LNK(PROC_TID_CWD,      "cwd",     cwd),
+	LNK(PROC_TID_ROOT,     "root",    root),
+	LNK(PROC_TID_EXE,      "exe",     exe),
+	REG(PROC_TID_MOUNTS,   "mounts",  S_IRUGO, mounts),
 #ifdef CONFIG_MMU
-	E(PROC_TID_SMAPS,      "smaps",   S_IFREG|S_IRUGO),
+	REG(PROC_TID_SMAPS,    "smaps",   S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	E(PROC_TID_ATTR,       "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
+	DIR(PROC_TID_ATTR,     "attr",    S_IRUGO|S_IXUGO, tid_attr),
 #endif
 #ifdef CONFIG_KALLSYMS
-	E(PROC_TID_WCHAN,      "wchan",   S_IFREG|S_IRUGO),
+	INF(PROC_TID_WCHAN,    "wchan",   S_IRUGO, pid_wchan),
 #endif
 #ifdef CONFIG_SCHEDSTATS
-	E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
+	INF(PROC_TID_SCHEDSTAT, "schedstat", S_IRUGO, pid_schedstat),
 #endif
 #ifdef CONFIG_CPUSETS
-	E(PROC_TID_CPUSET,     "cpuset",  S_IFREG|S_IRUGO),
+	REG(PROC_TID_CPUSET,   "cpuset",  S_IRUGO, cpuset),
 #endif
-	E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
-	E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
+	INF(PROC_TID_OOM_SCORE, "oom_score", S_IRUGO, oom_score),
+	REG(PROC_TID_OOM_ADJUST, "oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
 #ifdef CONFIG_AUDITSYSCALL
-	E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
+	REG(PROC_TID_LOGINUID, "loginuid", S_IWUSR|S_IRUGO, loginuid),
 #endif
-	{0,0,NULL,0}
+	{}
 };
 
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index f6b491f..561db5f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -243,13 +243,15 @@ static inline void kclist_add(struct kco
 extern void kclist_add(struct kcore_list *, void *, size_t);
 #endif
 
+union proc_op {
+	int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
+	int (*proc_read)(struct task_struct *task, char *page);
+};
+
 struct proc_inode {
 	struct task_ref *tref;
 	int fd;
-	union {
-		int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
-		int (*proc_read)(struct task_struct *task, char *page);
-	} op;
+	union proc_op op;
 	struct proc_dir_entry *pde;
 	struct inode vfs_inode;
 };
-- 
1.2.2.g709a



* [PATCH 20/23] proc: Make the generation of the self symlink table driven.
  2006-02-23 16:27                                     ` [PATCH 19/23] proc: Modify proc_pident_lookup to be completely table driven Eric W. Biederman
@ 2006-02-23 16:28                                       ` Eric W. Biederman
  2006-02-23 16:30                                         ` [PATCH 21/23] proc: Factor out an instantiate method from every lookup method Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


By not rolling our own inode for the self symlink we get a little
more code reuse, things get a little simpler, and we don't have
special cases to contend with later.
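
To illustrate what "table driven" means here, a minimal user-space
sketch follows.  It is not the kernel code: struct toy_entry, TOY_ENT,
toy_base_stuff, toy_find and toy_count are hypothetical names loosely
modelled on struct pid_entry and proc_base_stuff, and the octal mode
constant assumes the Linux value of S_IFLNK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified analogue of struct pid_entry. */
struct toy_entry {
	size_t len;
	const char *name;
	unsigned mode;
};

#define TOY_ENT(NAME, MODE) { sizeof(NAME) - 1, (NAME), (MODE) }
#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* 0120777 is S_IFLNK | 0777 on Linux (assumed here, not taken from headers). */
const struct toy_entry toy_base_stuff[] = {
	TOY_ENT("self", 0120777),
	{ 0, NULL, 0 }		/* sentinel: name == NULL ends the table */
};

/* Walk the table until the sentinel, matching on length and bytes. */
const struct toy_entry *toy_find(const struct toy_entry *ents,
				 const char *name, size_t len)
{
	const struct toy_entry *p;

	assert(ents && name);
	for (p = ents; p->name; p++)
		if (p->len == len && !memcmp(p->name, name, len))
			return p;
	return NULL;
}

/* Number of real entries, i.e. the array size minus the sentinel. */
size_t toy_count(void)
{
	return TOY_ARRAY_SIZE(toy_base_stuff) - 1;
}
```

With a table like this, lookup and readdir can share one description of
each entry, and the directory entry type falls out of mode >> 12 rather
than being spelled out separately.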

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   40 +++++++++++++++++-----------------------
 1 files changed, 17 insertions(+), 23 deletions(-)

a0e7796422bcb45748f230194723ef1b4afe1228
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b27175a..a6bff2f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1666,6 +1666,12 @@ static struct inode_operations proc_tgid
 	.lookup		= proc_tgid_base_lookup,
 };
 
+static struct pid_entry proc_base_stuff[] = {
+	NOD(PROC_TGID_INO,     "self",	S_IFLNK|S_IRWXUGO,
+		&proc_self_inode_operations, NULL, {}),
+	{}
+};
+
 /**
  * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
  *
@@ -1747,24 +1753,12 @@ struct dentry *proc_pid_lookup(struct in
 	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
 	struct inode *inode;
-	struct proc_inode *ei;
 	unsigned tgid;
 
-	if (dentry->d_name.len == 4 && !memcmp(dentry->d_name.name,"self",4)) {
-		inode = new_inode(dir->i_sb);
-		if (!inode)
-			return ERR_PTR(-ENOMEM);
-		ei = PROC_I(inode);
-		inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
-		inode->i_ino = fake_ino(0, PROC_TGID_INO);
-		ei->pde = NULL;
-		inode->i_mode = S_IFLNK|S_IRWXUGO;
-		inode->i_uid = inode->i_gid = 0;
-		inode->i_size = 64;
-		inode->i_op = &proc_self_inode_operations;
-		d_add(dentry, inode);
-		return NULL;
-	}
+	result = proc_pident_lookup(dir, dentry, proc_base_stuff);
+	if (!IS_ERR(result) || PTR_ERR(result) != -ENOENT)
+		goto out;
+
 	tgid = name_to_int(dentry);
 	if (tgid == ~0U)
 		goto out;
@@ -1887,14 +1881,13 @@ int proc_pid_readdir(struct file * filp,
 	struct task_struct *task;
 	int tgid;
 	
-	if (!nr) {
-		ino_t ino = fake_ino(0,PROC_TGID_INO);
-		if (filldir(dirent, "self", 4, filp->f_pos, ino, DT_LNK) < 0)
-			return 0;
-		filp->f_pos++;
-		nr++;
+	for (;nr < (ARRAY_SIZE(proc_base_stuff) - 1); filp->f_pos++, nr++) {
+		struct pid_entry *p = &proc_base_stuff[nr];
+		if (filldir(dirent, p->name, p->len, filp->f_pos,
+			    fake_ino(0, p->type), p->mode >> 12) < 0)
+			goto out;
 	}
-	nr -= 1;
+	nr -= (ARRAY_SIZE(proc_base_stuff) - 1);
 
 	/* f_version caches the tgid value that the last readdir call couldn't
 	 * return. lseek aka telldir automagically resets f_version to 0.
@@ -1917,6 +1910,7 @@ int proc_pid_readdir(struct file * filp,
 			break;
 		}
 	}
+out:
 	return 0;
 }
 
-- 
1.2.2.g709a



* [PATCH 21/23] proc: Factor out an instantiate method from every lookup method.
  2006-02-23 16:28                                       ` [PATCH 20/23] proc: Make the generation of the self symlink " Eric W. Biederman
@ 2006-02-23 16:30                                         ` Eric W. Biederman
  2006-02-23 16:32                                           ` [PATCH 22/23] proc: Remove the hard coded inode numbers Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


To remove the hard coded proc inode numbers it is necessary to be able
to create the proc inodes during readdir, so that the inode numbers
reported by readdir and stat stay in sync.  The instantiate methods are
the subset of lookup that is needed to accomplish that.

This first step just splits the lookup methods into 2 functions.
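
The shape of the split can be sketched in user space as below.  This is
only an illustration under stated assumptions: the toy_* types and
functions are hypothetical stand-ins, not the kernel interfaces, and
the inode counter merely imitates "whatever number the allocator hands
out".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for kernel objects; every name here is hypothetical. */
struct toy_task   { int pid; int alive; };
struct toy_inode  { unsigned long ino; int pid; };
struct toy_dentry { const char *name; struct toy_inode *inode; };

static unsigned long toy_next_ino = 1;

/* Instantiate: the task is already known; just build and attach the inode.
 * Returns 0 on success, -1 on failure. */
int toy_instantiate(struct toy_dentry *dentry, struct toy_task *task)
{
	struct toy_inode *inode;

	assert(dentry && task);
	if (!task->alive)
		return -1;
	inode = malloc(sizeof(*inode));
	if (!inode)
		return -1;
	inode->ino = toy_next_ino++;
	inode->pid = task->pid;
	dentry->inode = inode;
	return 0;
}

/* Lookup: parse the name, find the task, then delegate to instantiate. */
int toy_lookup(struct toy_dentry *dentry, struct toy_task *tasks, size_t n)
{
	char *end;
	long pid = strtol(dentry->name, &end, 10);
	size_t i;

	if (end == dentry->name || *end != '\0')
		return -1;	/* not a plain decimal name */
	for (i = 0; i < n; i++)
		if (tasks[i].pid == pid)
			return toy_instantiate(dentry, &tasks[i]);
	return -1;
}

/* Readdir-style caller: it already holds the task, so it skips the name
 * parsing and reports the number of the inode that stat will later see.
 * Returns 0 if no inode could be created. */
unsigned long toy_readdir_ino(struct toy_dentry *dentry, struct toy_task *task)
{
	if (!dentry->inode && toy_instantiate(dentry, task) < 0)
		return 0;
	return dentry->inode->ino;
}
```

The point of the split is visible in toy_readdir_ino: because readdir
can create (or reuse) the same inode that lookup would, the two paths
cannot disagree about the inode number.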

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |  212 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 125 insertions(+), 87 deletions(-)

572832049c16973d7a8201eaaf2eb514210d943d
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6bff2f..5dfc754 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1191,21 +1191,15 @@ static struct dentry_operations tid_fd_d
 	.d_delete	= pid_delete_dentry,
 };
 
-/* SMP-safe */
-static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
+static struct dentry *proc_fd_instantiate(struct inode *dir,
+	struct dentry *dentry, struct task_struct *task, void *ptr)
 {
-	struct task_struct *task = get_proc_task(dir);
-	unsigned fd = name_to_int(dentry);
-	struct dentry *result = ERR_PTR(-ENOENT);
-	struct file * file;
-	struct files_struct * files;
+	int fd = *(int *)ptr;
+	struct file *file;
+	struct files_struct *files;
 	struct inode *inode;
 	struct proc_inode *ei;
-
-	if (!task)
-		goto out_no_task;
-	if (fd == ~0U)
-		goto out;
+	struct dentry *error = ERR_PTR(-ENOENT);
 
 	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_FD_DIR+fd);
 	if (!inode)
@@ -1214,18 +1208,19 @@ static struct dentry *proc_lookupfd(stru
 	ei->fd = fd;
 	files = get_files_struct(task);
 	if (!files)
-		goto out_unlock;
+		goto out_iput;
 	inode->i_mode = S_IFLNK;
 	rcu_read_lock();
 	file = fcheck_files(files, fd);
 	if (!file)
-		goto out_unlock2;
+		goto out_unlock;
 	if (file->f_mode & 1)
 		inode->i_mode |= S_IRUSR | S_IXUSR;
 	if (file->f_mode & 2)
 		inode->i_mode |= S_IWUSR | S_IXUSR;
 	rcu_read_unlock();
 	put_files_struct(files);
+
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
 	ei->op.proc_get_link = proc_fd_link;
@@ -1233,18 +1228,36 @@ static struct dentry *proc_lookupfd(stru
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
 	if (tid_fd_revalidate(dentry, NULL))
-		result = NULL;
+		error = NULL;
 out:
-	put_task_struct(task);
-out_no_task:
-	return result;
-
-out_unlock2:
+	return error;
+out_unlock:
 	rcu_read_unlock();
 	put_files_struct(files);
-out_unlock:
+out_iput:
 	iput(inode);
 	goto out;
+	
+
+	
+}
+/* SMP-safe */
+static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
+{
+	struct task_struct *task = get_proc_task(dir);
+	unsigned fd = name_to_int(dentry);
+	struct dentry *result = ERR_PTR(-ENOENT);
+
+	if (!task)
+		goto out_no_task;
+	if (fd == ~0U)
+		goto out;
+
+	result = proc_fd_instantiate(dir, dentry, task, &fd);
+out:
+	put_task_struct(task);
+out_no_task:
+	return result;
 }
 
 static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
@@ -1326,6 +1339,36 @@ static struct inode_operations proc_fd_i
 	.lookup		= proc_lookupfd,
 };
 
+static struct dentry *proc_pident_instantiate(struct inode *dir, 
+	struct dentry *dentry, struct task_struct *task, void *ptr)
+{
+	struct pid_entry *p = ptr;
+	struct inode *inode;
+	struct proc_inode *ei;
+	struct dentry *error = ERR_PTR(-EINVAL);
+
+	inode = proc_pid_make_inode(dir->i_sb, task, p->type);
+	if (!inode)
+		goto out;
+
+	ei = PROC_I(inode);
+	inode->i_mode = p->mode;
+	if (S_ISDIR(inode->i_mode))
+		inode->i_nlink = 2;	/* Use getattr to fix if necessary */
+	if (p->iop)
+		inode->i_op = p->iop;
+	if (p->fop)
+		inode->i_fop = p->fop;
+	ei->op = p->op;
+	dentry->d_op = &pid_dentry_operations;
+	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		error = NULL;
+out:
+	return error;
+}
+
 /* SMP-safe */
 static struct dentry *proc_pident_lookup(struct inode *dir, 
 					 struct dentry *dentry,
@@ -1335,7 +1378,6 @@ static struct dentry *proc_pident_lookup
 	struct dentry *error;
 	struct task_struct *task = get_proc_task(dir);
 	struct pid_entry *p;
-	struct proc_inode *ei;
 
 	error = ERR_PTR(-ENOENT);
 	inode = NULL;
@@ -1356,25 +1398,7 @@ static struct dentry *proc_pident_lookup
 	if (!p->name)
 		goto out;
 
-	error = ERR_PTR(-EINVAL);
-	inode = proc_pid_make_inode(dir->i_sb, task, p->type);
-	if (!inode)
-		goto out;
-
-	ei = PROC_I(inode);
-	inode->i_mode = p->mode;
-	if (S_ISDIR(inode->i_mode))
-		inode->i_nlink = 2;	/* Use getattr to fix if necessary */
-	if (p->iop)
-		inode->i_op = p->iop;
-	if (p->fop)
-		inode->i_fop = p->fop;
-	ei->op = p->op;
-	dentry->d_op = &pid_dentry_operations;
-	d_add(dentry, inode);
-	/* Close the race of the process dying before we return the dentry */
-	if (pid_revalidate(dentry, NULL))
-		error = NULL;
+	error = proc_pident_instantiate(dir, dentry, task, p);
 out:
 	put_task_struct(task);
 out_no_task:
@@ -1747,12 +1771,40 @@ out:
 	return;
 }
 
+struct dentry *proc_pid_instantiate(struct inode *dir, 
+	struct dentry * dentry, struct task_struct *task, void *ptr)
+{
+	struct dentry *error = ERR_PTR(-ENOENT);
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TGID_INO);
+	if (!inode)
+		goto out;
+
+	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+	inode->i_op = &proc_tgid_base_inode_operations;
+	inode->i_fop = &proc_tgid_base_operations;
+	inode->i_flags|=S_IMMUTABLE;
+	inode->i_nlink = 4;
+#ifdef CONFIG_SECURITY
+	inode->i_nlink += 1;
+#endif
+
+	dentry->d_op = &pid_dentry_operations;
+
+	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		error = NULL;
+out:
+	return error;
+}
+	
 /* SMP-safe */
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
 {
 	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
-	struct inode *inode;
 	unsigned tgid;
 
 	result = proc_pident_lookup(dir, dentry, proc_base_stuff);
@@ -1771,28 +1823,7 @@ struct dentry *proc_pid_lookup(struct in
 	if (!task)
 		goto out;
 
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TGID_INO);
-	if (!inode)
-		goto out_put_task;
-
-	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
-	inode->i_op = &proc_tgid_base_inode_operations;
-	inode->i_fop = &proc_tgid_base_operations;
-	inode->i_flags|=S_IMMUTABLE;
-#ifdef CONFIG_SECURITY
-	inode->i_nlink = 5;
-#else
-	inode->i_nlink = 4;
-#endif
-
-	dentry->d_op = &pid_dentry_operations;
-
-	d_add(dentry, inode);
-	/* Close the race of the process dying before we return the dentry */
-	if (pid_revalidate(dentry, NULL))
-		result = NULL;
-
-out_put_task:
+	result = proc_pid_instantiate(dir, dentry, task, NULL);
 	put_task_struct(task);
 out:
 	return result;
@@ -1981,13 +2012,41 @@ static struct inode_operations proc_tid_
 	.lookup		= proc_tid_base_lookup,
 };
 
+
+static struct dentry *proc_task_instantiate(struct inode *dir,
+	struct dentry *dentry, struct task_struct *task, void *ptr)
+{
+	struct dentry *error = ERR_PTR(-ENOENT);
+	struct inode *inode;
+	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_INO);
+
+	if (!inode)
+		goto out;
+	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+	inode->i_op = &proc_tid_base_inode_operations;
+	inode->i_fop = &proc_tid_base_operations;
+	inode->i_flags|=S_IMMUTABLE;
+	inode->i_nlink = 3;
+#ifdef CONFIG_SECURITY
+	inode->i_nlink += 1;
+#endif
+
+	dentry->d_op = &pid_dentry_operations;
+
+	d_add(dentry, inode);
+	/* Close the race of the process dying before we return the dentry */
+	if (pid_revalidate(dentry, NULL))
+		error = NULL;
+out:
+	return error;
+}
+
 /* SMP-safe */
 static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
 {
 	struct dentry *result = ERR_PTR(-ENOENT);
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
-	struct inode *inode;
 	unsigned tid;
 
 	if (!leader)
@@ -2007,28 +2066,7 @@ static struct dentry *proc_task_lookup(s
 	if (leader->tgid != task->tgid)
 		goto out_drop_task;
 
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_INO);
-
-
-	if (!inode)
-		goto out_drop_task;
-	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
-	inode->i_op = &proc_tid_base_inode_operations;
-	inode->i_fop = &proc_tid_base_operations;
-	inode->i_flags|=S_IMMUTABLE;
-#ifdef CONFIG_SECURITY
-	inode->i_nlink = 4;
-#else
-	inode->i_nlink = 3;
-#endif
-
-	dentry->d_op = &pid_dentry_operations;
-
-	d_add(dentry, inode);
-	/* Close the race of the process dying before we return the dentry */
-	if (pid_revalidate(dentry, NULL))
-		result = NULL;
-
+	result = proc_task_instantiate(dir, dentry, task, NULL);
 out_drop_task:
 	put_task_struct(task);
 out:
-- 
1.2.2.g709a



* [PATCH 22/23] proc: Remove the hard coded inode numbers.
  2006-02-23 16:30                                         ` [PATCH 21/23] proc: Factor out an instantiate method from every lookup method Eric W. Biederman
@ 2006-02-23 16:32                                           ` Eric W. Biederman
  2006-02-23 16:34                                             ` [PATCH 23/23] proc: Merge proc_tid_attr and proc_tgid_attr Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


The hard coded inode numbers in proc currently limit its
maintainability, its flexibility, and what can be done with
the rest of the system.  /proc limits pid-max to 32768 on 32 bit
systems, it limits fd-max to 32768 on all systems, and placing
the pid in the inode number really gets in the way of implementing
multiple pid namespaces.

Ever since people started adding to the middle of the file type
enumeration we haven't been maintaining the historical inode numbers;
all we have really succeeded in doing is keeping the pid in the proc
inode number.  The pid is already available in the directory name
so no information is lost by removing it from the inode number.

So if something in user space cares that we remove the pid
from the /proc inode number it is almost certainly broken.
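
As a rough illustration of where those limits come from, the sketch
below redoes the packing of the fake_ino() macro this patch removes,
using a 32 bit inode number.  The toy_* names are hypothetical, and the
fixed-width unsigned types are a simplification (the kernel macro works
on plain ints, which is presumably why the practical pid limit is
32768 rather than 65536).

```c
#include <assert.h>
#include <stdint.h>

/* Start of the per-fd range in the removed enum (0x8000-0xffff). */
#define TOY_FD_DIR 0x8000u

/* Same packing as the removed fake_ino(): pid in the high 16 bits,
 * file type (or fd slot) in the low 16 bits of a 32 bit number. */
uint32_t toy_fake_ino(uint32_t pid, uint32_t ino)
{
	return (pid << 16) | ino;
}

/* Recover the pid half of a packed inode number. */
uint32_t toy_ino_pid(uint32_t packed)
{
	assert((packed & 0xffffu) <= 0xffffu);	/* low half always fits */
	return packed >> 16;
}

/* Recover the type/fd half of a packed inode number. */
uint32_t toy_ino_type(uint32_t packed)
{
	return packed & 0xffffu;
}
```

Under this packing a pid of 65536 shifts entirely out of a 32 bit
number and aliases pid 0, and an fd of 32768 or more spills out of the
0x8000-0xffff range into the pid bits.  Letting new_inode pick the
number, as this patch does, removes both couplings.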

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |  356 ++++++++++++++++++++++++++------------------------------
 1 files changed, 164 insertions(+), 192 deletions(-)

f8049ad0ffd4f86a6fe758efed772419ca738e1c
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5dfc754..ae63eeb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -92,100 +92,10 @@
  * about magical ranges too.
  */
 
-#define fake_ino(pid,ino) (((pid)<<16)|(ino))
-
-enum pid_directory_inos {
-	PROC_TGID_INO = 2,
-	PROC_TGID_TASK,
-	PROC_TGID_STATUS,
-	PROC_TGID_MEM,
-#ifdef CONFIG_SECCOMP
-	PROC_TGID_SECCOMP,
-#endif
-	PROC_TGID_CWD,
-	PROC_TGID_ROOT,
-	PROC_TGID_EXE,
-	PROC_TGID_FD,
-	PROC_TGID_ENVIRON,
-	PROC_TGID_AUXV,
-	PROC_TGID_CMDLINE,
-	PROC_TGID_STAT,
-	PROC_TGID_STATM,
-	PROC_TGID_MAPS,
-	PROC_TGID_NUMA_MAPS,
-	PROC_TGID_MOUNTS,
-	PROC_TGID_WCHAN,
-#ifdef CONFIG_MMU
-	PROC_TGID_SMAPS,
-#endif
-#ifdef CONFIG_SCHEDSTATS
-	PROC_TGID_SCHEDSTAT,
-#endif
-#ifdef CONFIG_CPUSETS
-	PROC_TGID_CPUSET,
-#endif
-#ifdef CONFIG_SECURITY
-	PROC_TGID_ATTR,
-	PROC_TGID_ATTR_CURRENT,
-	PROC_TGID_ATTR_PREV,
-	PROC_TGID_ATTR_EXEC,
-	PROC_TGID_ATTR_FSCREATE,
-#endif
-#ifdef CONFIG_AUDITSYSCALL
-	PROC_TGID_LOGINUID,
-#endif
-	PROC_TGID_OOM_SCORE,
-	PROC_TGID_OOM_ADJUST,
-	PROC_TID_INO,
-	PROC_TID_STATUS,
-	PROC_TID_MEM,
-#ifdef CONFIG_SECCOMP
-	PROC_TID_SECCOMP,
-#endif
-	PROC_TID_CWD,
-	PROC_TID_ROOT,
-	PROC_TID_EXE,
-	PROC_TID_FD,
-	PROC_TID_ENVIRON,
-	PROC_TID_AUXV,
-	PROC_TID_CMDLINE,
-	PROC_TID_STAT,
-	PROC_TID_STATM,
-	PROC_TID_MAPS,
-	PROC_TID_NUMA_MAPS,
-	PROC_TID_MOUNTS,
-	PROC_TID_WCHAN,
-#ifdef CONFIG_MMU
-	PROC_TID_SMAPS,
-#endif
-#ifdef CONFIG_SCHEDSTATS
-	PROC_TID_SCHEDSTAT,
-#endif
-#ifdef CONFIG_CPUSETS
-	PROC_TID_CPUSET,
-#endif
-#ifdef CONFIG_SECURITY
-	PROC_TID_ATTR,
-	PROC_TID_ATTR_CURRENT,
-	PROC_TID_ATTR_PREV,
-	PROC_TID_ATTR_EXEC,
-	PROC_TID_ATTR_FSCREATE,
-#endif
-#ifdef CONFIG_AUDITSYSCALL
-	PROC_TID_LOGINUID,
-#endif
-	PROC_TID_OOM_SCORE,
-	PROC_TID_OOM_ADJUST,
-
-	/* Add new entries before this */
-	PROC_TID_FD_DIR = 0x8000,	/* 0x8000-0xffff */
-};
-
 /* Worst case buffer size needed for holding an integer. */
 #define PROC_NUMBUF 10
 
 struct pid_entry {
-	int type;
 	int len;
 	char *name;
 	mode_t mode;
@@ -194,8 +104,7 @@ struct pid_entry {
 	union proc_op op;
 };
 
-#define NOD(TYPE, NAME, MODE, IOP, FOP, OP) {		\
-	.type = (TYPE),					\
+#define NOD(NAME, MODE, IOP, FOP, OP) {			\
 	.len  = sizeof(NAME) - 1,			\
 	.name = (NAME),					\
 	.mode = MODE,					\
@@ -204,19 +113,19 @@ struct pid_entry {
 	.op   = OP,					\
 }
 
-#define DIR(TYPE, NAME, MODE, OTYPE)						\
-	NOD(TYPE, NAME, (S_IFDIR|(MODE)),					\
+#define DIR(NAME, MODE, OTYPE)							\
+	NOD(NAME, (S_IFDIR|(MODE)),						\
 		&proc_##OTYPE##_inode_operations, &proc_##OTYPE##_operations,	\
 		{} )
-#define LNK(TYPE, NAME, OTYPE)					\
-	NOD(TYPE, NAME, (S_IFLNK|S_IRWXUGO),			\
+#define LNK(NAME, OTYPE)					\
+	NOD(NAME, (S_IFLNK|S_IRWXUGO),				\
 		&proc_pid_link_inode_operations, NULL,		\
 		{ .proc_get_link = &proc_##OTYPE##_link } )
-#define REG(TYPE, NAME, MODE, OTYPE)			\
-	NOD(TYPE, NAME, (S_IFREG|(MODE)), NULL,		\
+#define REG(NAME, MODE, OTYPE)				\
+	NOD(NAME, (S_IFREG|(MODE)), NULL,		\
 		&proc_##OTYPE##_operations, {})
-#define INF(TYPE, NAME, MODE, OTYPE)			\
-	NOD(TYPE, NAME, (S_IFREG|(MODE)), 		\
+#define INF(NAME, MODE, OTYPE)				\
+	NOD(NAME, (S_IFREG|(MODE)), 			\
 		NULL, &proc_info_file_operations,	\
 		{ .proc_read = &proc_##OTYPE } )
 
@@ -1016,7 +925,7 @@ static int task_dumpable(struct task_str
 }
 
 
-static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, int ino)
+static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
@@ -1030,7 +939,7 @@ static struct inode *proc_pid_make_inode
 	/* Common stuff */
 	ei = PROC_I(inode);
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
-	inode->i_ino = fake_ino(task->pid, ino);
+	/* Use the default inode number assigned by new_inode */
 
 	/*
 	 * grab the reference to task.
@@ -1102,6 +1011,50 @@ static struct dentry_operations pid_dent
 
 /* Lookups */
 
+typedef struct dentry *instantiate_t(struct inode *, struct dentry *, struct task_struct *, void *);
+
+static int proc_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	char *name, int len,
+	instantiate_t instantiate, struct task_struct *task, void *ptr)
+{
+	struct dentry *child, *dir = filp->f_dentry;
+	struct inode *inode;
+	struct qstr qname;
+	ino_t ino = 0;
+	unsigned type = DT_UNKNOWN;
+
+	qname.name = name;
+	qname.len  = len;
+	qname.hash = full_name_hash(name, len);
+	
+	child = d_lookup(dir, &qname);
+	if (!child) {
+		struct dentry *new;
+		new = d_alloc(dir, &qname);
+		if (new) {
+			child = instantiate(dir->d_inode, new, task, ptr);
+			if (child)
+				dput(new);
+			else
+				child = new;
+		}
+	}
+	if (!child || IS_ERR(child) || !child->d_inode)
+		goto end_instantiate;
+	inode = child->d_inode;
+	if (inode) {
+		ino = inode->i_ino;
+		type = inode->i_mode >> 12;
+	}
+	dput(child);
+end_instantiate:
+	if (!ino)
+		ino = find_inode_number(dir, &qname);
+	if (!ino)
+		ino = 1;
+	return filldir(dirent, name, len, filp->f_pos, ino, type);
+}
+
 static unsigned name_to_int(struct dentry *dentry)
 {
 	const char *name = dentry->d_name.name;
@@ -1201,7 +1154,7 @@ static struct dentry *proc_fd_instantiat
 	struct proc_inode *ei;
 	struct dentry *error = ERR_PTR(-ENOENT);
 
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_FD_DIR+fd);
+	inode = proc_pid_make_inode(dir->i_sb, task);
 	if (!inode)
 		goto out;
 	ei = PROC_I(inode);
@@ -1260,6 +1213,15 @@ out_no_task:
 	return result;
 }
 
+static int proc_fd_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	struct task_struct *task, int fd)
+{
+	char name[PROC_NUMBUF];
+	int len = snprintf(name, sizeof(name), "%d", fd);
+	return proc_fill_cache(filp, dirent, filldir, name, len, 
+				proc_fd_instantiate, task, &fd);
+}
+
 static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
 {
 	struct dentry *dentry = filp->f_dentry;
@@ -1267,7 +1229,6 @@ static int proc_readfd(struct file * fil
 	struct task_struct *p = get_proc_task(inode);
 	unsigned int fd, tid, ino;
 	int retval;
-	char buf[PROC_NUMBUF];
 	struct files_struct * files;
 	struct fdtable *fdt;
 
@@ -1297,22 +1258,12 @@ static int proc_readfd(struct file * fil
 			for (fd = filp->f_pos-2;
 			     fd < fdt->max_fds;
 			     fd++, filp->f_pos++) {
-				unsigned int i,j;
 
 				if (!fcheck_files(files, fd))
 					continue;
 				rcu_read_unlock();
 
-				j = PROC_NUMBUF;
-				i = fd;
-				do {
-					j--;
-					buf[j] = '0' + (i % 10);
-					i /= 10;
-				} while (i);
-
-				ino = fake_ino(tid, PROC_TID_FD_DIR + fd);
-				if (filldir(dirent, buf+j, PROC_NUMBUF-j, fd+2, ino, DT_LNK) < 0) {
+				if (proc_fd_fill_cache(filp, dirent, filldir, p, fd) < 0) {
 					rcu_read_lock();
 					break;
 				}
@@ -1347,7 +1298,7 @@ static struct dentry *proc_pident_instan
 	struct proc_inode *ei;
 	struct dentry *error = ERR_PTR(-EINVAL);
 
-	inode = proc_pid_make_inode(dir->i_sb, task, p->type);
+	inode = proc_pid_make_inode(dir->i_sb, task);
 	if (!inode)
 		goto out;
 
@@ -1405,6 +1356,13 @@ out_no_task:
 	return error;
 }
 
+static int proc_pident_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	struct task_struct *task, struct pid_entry *p)
+{
+	return proc_fill_cache(filp, dirent, filldir, p->name, p->len, 
+				proc_pident_instantiate, task, p);
+}
+
 static int proc_pident_readdir(struct file *filp,
 		void *dirent, filldir_t filldir,
 		struct pid_entry *ents, unsigned int nents)
@@ -1420,11 +1378,10 @@ static int proc_pident_readdir(struct fi
 
 	ret = -ENOENT;
 	if (!task)
-		goto out;
+		goto out_no_task;
 
 	ret = 0;
 	pid = task->pid;
-	put_task_struct(task);
 	i = filp->f_pos;
 	switch (i) {
 	case 0:
@@ -1449,8 +1406,7 @@ static int proc_pident_readdir(struct fi
 		}
 		p = ents + i;
 		while (p->name) {
-			if (filldir(dirent, p->name, p->len, filp->f_pos,
-				    fake_ino(pid, p->type), p->mode >> 12) < 0)
+			if (proc_pident_fill_cache(filp, dirent, filldir, task, p) < 0)
 				goto out;
 			filp->f_pos++;
 			p++;
@@ -1459,6 +1415,8 @@ static int proc_pident_readdir(struct fi
 
 	ret = 1;
 out:
+	put_task_struct(task);		
+out_no_task:
 	return ret;
 }
 
@@ -1538,17 +1496,17 @@ static struct file_operations proc_pid_a
 };
 
 static struct pid_entry tgid_attr_stuff[] = {
-	REG(PROC_TGID_ATTR_CURRENT,  "current",  S_IRUGO|S_IWUGO, pid_attr),
-	REG(PROC_TGID_ATTR_PREV,     "prev",     S_IRUGO,         pid_attr),
-	REG(PROC_TGID_ATTR_EXEC,     "exec",     S_IRUGO|S_IWUGO, pid_attr),
-	REG(PROC_TGID_ATTR_FSCREATE, "fscreate", S_IRUGO|S_IWUGO, pid_attr),
+	REG("current",  S_IRUGO|S_IWUGO, pid_attr),
+	REG("prev",     S_IRUGO,         pid_attr),
+	REG("exec",     S_IRUGO|S_IWUGO, pid_attr),
+	REG("fscreate", S_IRUGO|S_IWUGO, pid_attr),
 	{}
 };
 static struct pid_entry tid_attr_stuff[] = {
-	REG(PROC_TID_ATTR_CURRENT,   "current",  S_IRUGO|S_IWUGO, pid_attr),
-	REG(PROC_TID_ATTR_PREV,      "prev",     S_IRUGO,         pid_attr),
-	REG(PROC_TID_ATTR_EXEC,      "exec",     S_IRUGO|S_IWUGO, pid_attr),
-	REG(PROC_TID_ATTR_FSCREATE,  "fscreate", S_IRUGO|S_IWUGO, pid_attr),
+	REG("current",  S_IRUGO|S_IWUGO, pid_attr),
+	REG("prev",     S_IRUGO,         pid_attr),
+	REG("exec",     S_IRUGO|S_IWUGO, pid_attr),
+	REG("fscreate", S_IRUGO|S_IWUGO, pid_attr),
 	{}
 };
 
@@ -1627,45 +1585,45 @@ static struct file_operations proc_task_
 static struct inode_operations proc_task_inode_operations;
 
 static struct pid_entry tgid_base_stuff[] = {
-	DIR(PROC_TGID_TASK,    "task",    S_IRUGO|S_IXUGO, task),
-	DIR(PROC_TGID_FD,      "fd",      S_IRUSR|S_IXUSR, fd),
-	INF(PROC_TGID_ENVIRON, "environ", S_IRUSR, pid_environ),
-	INF(PROC_TGID_AUXV,    "auxv",	  S_IRUSR, pid_auxv),
-	INF(PROC_TGID_STATUS,  "status",  S_IRUGO, pid_status),
-	INF(PROC_TGID_CMDLINE, "cmdline", S_IRUGO, pid_cmdline),
-	INF(PROC_TGID_STAT,    "stat",    S_IRUGO, tgid_stat),
-	INF(PROC_TGID_STATM,   "statm",   S_IRUGO, pid_statm),
-	REG(PROC_TGID_MAPS,    "maps",    S_IRUGO, maps),
+	DIR("task",      S_IRUGO|S_IXUGO, task),
+	DIR("fd",        S_IRUSR|S_IXUSR, fd),
+	INF("environ",   S_IRUSR, pid_environ),
+	INF("auxv",	 S_IRUSR, pid_auxv),
+	INF("status",    S_IRUGO, pid_status),
+	INF("cmdline",   S_IRUGO, pid_cmdline),
+	INF("stat",      S_IRUGO, tgid_stat),
+	INF("statm",     S_IRUGO, pid_statm),
+	REG("maps",      S_IRUGO, maps),
 #ifdef CONFIG_NUMA
-	REG(PROC_TGID_NUMA_MAPS, "numa_maps", S_IRUGO, numa_maps),
+	REG("numa_maps", S_IRUGO, numa_maps),
 #endif
-	REG(PROC_TGID_MEM,     "mem",     S_IRUSR|S_IWUSR, mem),
+	REG("mem",       S_IRUSR|S_IWUSR, mem),
 #ifdef CONFIG_SECCOMP
-	REG(PROC_TGID_SECCOMP, "seccomp", S_IRUSR|S_IWUSR, seccomp),
+	REG("seccomp",   S_IRUSR|S_IWUSR, seccomp),
 #endif
-	LNK(PROC_TGID_CWD,     "cwd",     cwd),
-	LNK(PROC_TGID_ROOT,    "root",    root),
-	LNK(PROC_TGID_EXE,     "exe",     exe),
-	REG(PROC_TGID_MOUNTS,  "mounts",  S_IRUGO, mounts),
+	LNK("cwd",       cwd),
+	LNK("root",      root),
+	LNK("exe",       exe),
+	REG("mounts",    S_IRUGO, mounts),
 #ifdef CONFIG_MMU
-	REG(PROC_TGID_SMAPS,   "smaps",   S_IRUGO, smaps),
+	REG("smaps",     S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	DIR(PROC_TGID_ATTR,    "attr",    S_IRUGO|S_IXUGO, tgid_attr),
+	DIR("attr",      S_IRUGO|S_IXUGO, tgid_attr),
 #endif
 #ifdef CONFIG_KALLSYMS
-	INF(PROC_TGID_WCHAN,   "wchan",   S_IRUGO, pid_wchan),
+	INF("wchan",     S_IRUGO, pid_wchan),
 #endif
 #ifdef CONFIG_SCHEDSTATS
-	INF(PROC_TGID_SCHEDSTAT, "schedstat", S_IRUGO, pid_schedstat),
+	INF("schedstat", S_IRUGO, pid_schedstat),
 #endif
 #ifdef CONFIG_CPUSETS
-	REG(PROC_TGID_CPUSET,  "cpuset",  S_IRUGO, cpuset),
+	REG("cpuset",    S_IRUGO, cpuset),
 #endif
-	INF(PROC_TGID_OOM_SCORE, "oom_score", S_IRUGO, oom_score),
-	REG(PROC_TGID_OOM_ADJUST,"oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
+	INF("oom_score", S_IRUGO, oom_score),
+	REG("oom_adj",   S_IRUGO|S_IWUSR, oom_adjust),
 #ifdef CONFIG_AUDITSYSCALL
-	REG(PROC_TGID_LOGINUID, "loginuid", S_IWUSR|S_IRUGO, loginuid),
+	REG("loginuid",  S_IWUSR|S_IRUGO, loginuid),
 #endif
 	{}
 };
@@ -1691,7 +1649,7 @@ static struct inode_operations proc_tgid
 };
 
 static struct pid_entry proc_base_stuff[] = {
-	NOD(PROC_TGID_INO,     "self",	S_IFLNK|S_IRWXUGO,
+	NOD("self",	 S_IFLNK|S_IRWXUGO,
 		&proc_self_inode_operations, NULL, {}),
 	{}
 };
@@ -1777,7 +1735,7 @@ struct dentry *proc_pid_instantiate(stru
 	struct dentry *error = ERR_PTR(-ENOENT);
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TGID_INO);
+	inode = proc_pid_make_inode(dir->i_sb, task);
 	if (!inode)
 		goto out;
 
@@ -1799,7 +1757,7 @@ struct dentry *proc_pid_instantiate(stru
 out:
 	return error;
 }
-	
+
 /* SMP-safe */
 struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
 {
@@ -1904,18 +1862,29 @@ done:		
 	return pos;
 }
 
+static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	struct task_struct *task, int tgid)
+{
+	char name[PROC_NUMBUF];
+	int len = snprintf(name, sizeof(name), "%d", tgid);
+	return proc_fill_cache(filp, dirent, filldir, name, len, 
+				proc_pid_instantiate, task, NULL);
+}
+
 /* for the /proc/ directory itself, after non-process stuff has been done */
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
-	char buf[PROC_NUMBUF];
 	unsigned int nr = filp->f_pos - FIRST_PROCESS_ENTRY;
+	struct task_struct *reaper = get_proc_task(filp->f_dentry->d_inode);
 	struct task_struct *task;
 	int tgid;
 	
+	if (!reaper)
+		goto out_no_task;
+
 	for (;nr < (ARRAY_SIZE(proc_base_stuff) - 1); filp->f_pos++, nr++) {
 		struct pid_entry *p = &proc_base_stuff[nr];
-		if (filldir(dirent, p->name, p->len, filp->f_pos,
-			    fake_ino(0, p->type), p->mode >> 12) < 0)
+		if (proc_pident_fill_cache(filp, dirent, filldir, reaper, p) < 0)
 			goto out;
 	}
 	nr -= (ARRAY_SIZE(proc_base_stuff) - 1);
@@ -1928,12 +1897,8 @@ int proc_pid_readdir(struct file * filp,
 	for (task = first_tgid(tgid, nr);
 	     task;
 	     task = next_tgid(task), filp->f_pos++) {
-		int len;
-		ino_t ino;
 		tgid = task->pid;
-		len = snprintf(buf, sizeof(buf), "%d", tgid);
-		ino = fake_ino(tgid, PROC_TGID_INO);
-		if (filldir(dirent, buf, len, filp->f_pos, ino, DT_DIR) < 0) {
+		if (proc_pid_fill_cache(filp, dirent, filldir, task, tgid) < 0) {
 			/* returning this tgid failed, save it as the first
 			 * pid for the next readir call */
 			filp->f_version = tgid;
@@ -1942,6 +1907,8 @@ int proc_pid_readdir(struct file * filp,
 		}
 	}
 out:
+	put_task_struct(reaper);
+out_no_task:
 	return 0;
 }
 
@@ -1949,44 +1916,44 @@ out:
  * Tasks
  */
 static struct pid_entry tid_base_stuff[] = {
-	DIR(PROC_TID_FD,       "fd",      S_IRUSR|S_IXUSR, fd),
-	INF(PROC_TID_ENVIRON,  "environ", S_IRUSR, pid_environ),
-	INF(PROC_TID_AUXV,     "auxv",	  S_IRUSR, pid_auxv),
-	INF(PROC_TID_STATUS,   "status",  S_IRUGO, pid_status),
-	INF(PROC_TID_CMDLINE,  "cmdline", S_IRUGO, pid_cmdline),
-	INF(PROC_TID_STAT,     "stat",    S_IRUGO, tid_stat),
-	INF(PROC_TID_STATM,    "statm",   S_IRUGO, pid_statm),
-	REG(PROC_TID_MAPS,     "maps",    S_IRUGO, maps),
+	DIR("fd",        S_IRUSR|S_IXUSR, fd),
+	INF("environ",   S_IRUSR, pid_environ),
+	INF("auxv",	 S_IRUSR, pid_auxv),
+	INF("status",    S_IRUGO, pid_status),
+	INF("cmdline",   S_IRUGO, pid_cmdline),
+	INF("stat",      S_IRUGO, tid_stat),
+	INF("statm",     S_IRUGO, pid_statm),
+	REG("maps",      S_IRUGO, maps),
 #ifdef CONFIG_NUMA
-	REG(PROC_TID_NUMA_MAPS, "numa_maps", S_IRUGO, numa_maps),
+	REG("numa_maps", S_IRUGO, numa_maps),
 #endif
-	REG(PROC_TID_MEM,      "mem",     S_IRUSR|S_IWUSR, mem),
+	REG("mem",       S_IRUSR|S_IWUSR, mem),
 #ifdef CONFIG_SECCOMP
-	REG(PROC_TID_SECCOMP,  "seccomp", S_IRUSR|S_IWUSR, seccomp),
+	REG("seccomp",   S_IRUSR|S_IWUSR, seccomp),
 #endif
-	LNK(PROC_TID_CWD,      "cwd",     cwd),
-	LNK(PROC_TID_ROOT,     "root",    root),
-	LNK(PROC_TID_EXE,      "exe",     exe),
-	REG(PROC_TID_MOUNTS,   "mounts",  S_IRUGO, mounts),
+	LNK("cwd",       cwd),
+	LNK("root",      root),
+	LNK("exe",       exe),
+	REG("mounts",    S_IRUGO, mounts),
 #ifdef CONFIG_MMU
-	REG(PROC_TID_SMAPS,    "smaps",   S_IRUGO, smaps),
+	REG("smaps",     S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	DIR(PROC_TID_ATTR,     "attr",    S_IRUGO|S_IXUGO, tid_attr),
+	DIR("attr",      S_IRUGO|S_IXUGO, tid_attr),
 #endif
 #ifdef CONFIG_KALLSYMS
-	INF(PROC_TID_WCHAN,    "wchan",   S_IRUGO, pid_wchan),
+	INF("wchan",     S_IRUGO, pid_wchan),
 #endif
 #ifdef CONFIG_SCHEDSTATS
-	INF(PROC_TID_SCHEDSTAT, "schedstat", S_IRUGO, pid_schedstat),
+	INF("schedstat", S_IRUGO, pid_schedstat),
 #endif
 #ifdef CONFIG_CPUSETS
-	REG(PROC_TID_CPUSET,   "cpuset",  S_IRUGO, cpuset),
+	REG("cpuset",    S_IRUGO, cpuset),
 #endif
-	INF(PROC_TID_OOM_SCORE, "oom_score", S_IRUGO, oom_score),
-	REG(PROC_TID_OOM_ADJUST, "oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
+	INF("oom_score", S_IRUGO, oom_score),
+	REG("oom_adj",   S_IRUGO|S_IWUSR, oom_adjust),
 #ifdef CONFIG_AUDITSYSCALL
-	REG(PROC_TID_LOGINUID, "loginuid", S_IWUSR|S_IRUGO, loginuid),
+	REG("loginuid",  S_IWUSR|S_IRUGO, loginuid),
 #endif
 	{}
 };
@@ -2018,7 +1985,7 @@ static struct dentry *proc_task_instanti
 {
 	struct dentry *error = ERR_PTR(-ENOENT);
 	struct inode *inode;
-	inode = proc_pid_make_inode(dir->i_sb, task, PROC_TID_INO);
+	inode = proc_pid_make_inode(dir->i_sb, task);
 
 	if (!inode)
 		goto out;
@@ -2152,10 +2119,18 @@ static struct task_struct *next_tid(stru
 	return pos;
 }
   
+static int proc_task_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
+	struct task_struct *task, int tid)
+{
+	char name[PROC_NUMBUF];
+	int len = snprintf(name, sizeof(name), "%d", tid);
+	return proc_fill_cache(filp, dirent, filldir, name, len, 
+				proc_task_instantiate, task, NULL);
+}
+
 /* for the /proc/TGID/task/ directories */
 static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
-	char buf[PROC_NUMBUF];
 	struct dentry *dentry = filp->f_dentry;
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *leader = get_proc_task(inode);
@@ -2192,11 +2167,8 @@ static int proc_task_readdir(struct file
 	for (task = first_tid(leader, tid, pos - 2);
 	     task;
 	     task = next_tid(task), pos++) {
-		int len;
 		tid = task->pid;
-		len = snprintf(buf, sizeof(buf), "%d", tid);
-		ino = fake_ino(tid, PROC_TID_INO);
-		if (filldir(dirent, buf, len, pos, ino, DT_DIR < 0)) {
+		if (proc_task_fill_cache(filp, dirent, filldir, task, tid) < 0) {
 			/* returning this tgid failed, save it as the first
 			 * pid for the next readir call */
 			filp->f_version = tid;
-- 
1.2.2.g709a


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 23/23] proc: Merge proc_tid_attr and proc_tgid_attr
  2006-02-23 16:32                                           ` [PATCH 22/23] proc: Remove the hard coded inode numbers Eric W. Biederman
@ 2006-02-23 16:34                                             ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


The implementations are exactly the same and there is currently nothing to
distinguish proc_tid_attr from proc_tgid_attr, so it is pointless
to have two separate implementations.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


---

 fs/proc/base.c |   50 +++++++++++---------------------------------------
 1 files changed, 11 insertions(+), 39 deletions(-)

283d43b12524e2ae07f6a47fa28392ed77bc9e00
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ae63eeb..e4a4f85 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1495,14 +1495,7 @@ static struct file_operations proc_pid_a
 	.write		= proc_pid_attr_write,
 };
 
-static struct pid_entry tgid_attr_stuff[] = {
-	REG("current",  S_IRUGO|S_IWUGO, pid_attr),
-	REG("prev",     S_IRUGO,         pid_attr),
-	REG("exec",     S_IRUGO|S_IWUGO, pid_attr),
-	REG("fscreate", S_IRUGO|S_IWUGO, pid_attr),
-	{}
-};
-static struct pid_entry tid_attr_stuff[] = {
+static struct pid_entry attr_dir_stuff[] = {
 	REG("current",  S_IRUGO|S_IWUGO, pid_attr),
 	REG("prev",     S_IRUGO,         pid_attr),
 	REG("exec",     S_IRUGO|S_IWUGO, pid_attr),
@@ -1510,49 +1503,28 @@ static struct pid_entry tid_attr_stuff[]
 	{}
 };
 
-static int proc_tgid_attr_readdir(struct file * filp,
+static int proc_attr_dir_readdir(struct file * filp,
 			     void * dirent, filldir_t filldir)
 {
 	return proc_pident_readdir(filp,dirent,filldir,
-				   tgid_attr_stuff,ARRAY_SIZE(tgid_attr_stuff));
+				   attr_dir_stuff,ARRAY_SIZE(attr_dir_stuff));
 }
 
-static int proc_tid_attr_readdir(struct file * filp,
-			     void * dirent, filldir_t filldir)
-{
-	return proc_pident_readdir(filp,dirent,filldir,
-				   tid_attr_stuff,ARRAY_SIZE(tid_attr_stuff));
-}
-
-static struct file_operations proc_tgid_attr_operations = {
+static struct file_operations proc_attr_dir_operations = {
 	.read		= generic_read_dir,
-	.readdir	= proc_tgid_attr_readdir,
+	.readdir	= proc_attr_dir_readdir,
 };
 
-static struct file_operations proc_tid_attr_operations = {
-	.read		= generic_read_dir,
-	.readdir	= proc_tid_attr_readdir,
-};
-
-static struct dentry *proc_tgid_attr_lookup(struct inode *dir,
+static struct dentry *proc_attr_dir_lookup(struct inode *dir,
 				struct dentry *dentry, struct nameidata *nd)
 {
-	return proc_pident_lookup(dir, dentry, tgid_attr_stuff);
+	return proc_pident_lookup(dir, dentry, attr_dir_stuff);
 }
 
-static struct dentry *proc_tid_attr_lookup(struct inode *dir,
-				struct dentry *dentry, struct nameidata *nd)
-{
-	return proc_pident_lookup(dir, dentry, tid_attr_stuff);
-}
-
-static struct inode_operations proc_tgid_attr_inode_operations = {
-	.lookup		= proc_tgid_attr_lookup,
+static struct inode_operations proc_attr_dir_inode_operations = {
+	.lookup		= proc_attr_dir_lookup,
 };
 
-static struct inode_operations proc_tid_attr_inode_operations = {
-	.lookup		= proc_tid_attr_lookup,
-};
 #endif
 
 /*
@@ -1609,7 +1581,7 @@ static struct pid_entry tgid_base_stuff[
 	REG("smaps",     S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	DIR("attr",      S_IRUGO|S_IXUGO, tgid_attr),
+	DIR("attr",      S_IRUGO|S_IXUGO, attr_dir),
 #endif
 #ifdef CONFIG_KALLSYMS
 	INF("wchan",     S_IRUGO, pid_wchan),
@@ -1939,7 +1911,7 @@ static struct pid_entry tid_base_stuff[]
 	REG("smaps",     S_IRUGO, smaps),
 #endif
 #ifdef CONFIG_SECURITY
-	DIR("attr",      S_IRUGO|S_IXUGO, tid_attr),
+	DIR("attr",      S_IRUGO|S_IXUGO, attr_dir),
 #endif
 #ifdef CONFIG_KALLSYMS
 	INF("wchan",     S_IRUGO, pid_wchan),
-- 
1.2.2.g709a


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
  2006-02-23 15:56   ` [PATCH 02/23] proc: Fix the .. inode number on /proc/<pid>/fd Eric W. Biederman
@ 2006-02-23 16:49   ` Eric W. Biederman
  2006-03-02 19:16     ` Oleg Nesterov
  2006-03-03 19:23     ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-23 16:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Oleg Nesterov

ebiederm@xmission.com (Eric W. Biederman) writes:

> Holding a reference to a task_struct pins about 10K of low memory even
> after that task has exited, which seems to be 1 or 2 orders of
> magnitude more memory than any other data structure in the kernel.
> Without a reference to a task_struct you risk problems with
> pid wrap around.
>
> Even worse because we allow session and process group leaders to exit
> there is no task_struct you can hold onto to prevent pid wrap around
> problems for those kinds of structures.
>
> The task_ref is a small intermediate data structure that other
> structures can point at, and that solves these problems.  A task_ref will
> always point at the first user of a pid value or contain a NULL
> pointer if there are no longer any users of that pid.

I forgot to note that there is a correctness dependence on my
kill switch_exec_pids patch.  Without it task_refs will
stop being able to track a pid when we pass it on to
a new process in de_thread.
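
For readers following along, here is a minimal userspace sketch of the
task_ref idea quoted above.  It is only a model: the struct layout, the
payload size and the function bodies are assumptions for illustration,
not the code from the patch, and it ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; names only loosely follow the patch. */
struct task;

struct task_ref {
	int count;              /* number of holders of this reference */
	int pid;                /* pid value being tracked */
	struct task *task;      /* NULL once no task uses the pid */
};

struct task {
	int pid;
	struct task_ref *ref;   /* back pointer, cleared when the ref dies */
	char payload[10240];    /* models the ~10K pinned per task_struct */
};

/* Get (or lazily create) the small reference object for a live task. */
static struct task_ref *tref_get_by_task(struct task *t)
{
	if (!t->ref) {
		t->ref = calloc(1, sizeof(*t->ref));
		if (!t->ref)
			return NULL;
		t->ref->pid = t->pid;
		t->ref->task = t;
	}
	t->ref->count++;
	return t->ref;
}

/* Task exit: detach from the ref, then free the big structure. */
static void task_exit(struct task *t)
{
	if (t->ref)
		t->ref->task = NULL;    /* holders now see "no such task" */
	free(t);
}

/* Drop one reference; the last holder frees the small object. */
static void tref_put(struct task_ref *ref)
{
	if (--ref->count == 0) {
		if (ref->task)
			ref->task->ref = NULL;
		free(ref);
	}
}
```

The point of the model is that a holder keeps only the small task_ref
alive after the task exits, and reads a NULL task pointer instead of
pinning the large structure.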

I built this patchset against Linus's latest kernel and not -mm, so I think
I may have one or two trivial conflicts with Oleg's changes as
well.  In particular I have some changes to unhash_process() that Oleg
has removed, but simply removing that hunk should be all the resolution
that is needed.  Hopefully that won't be a problem.

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/23] proc cleanup.
  2006-02-23 15:52 [PATCH 00/23] proc cleanup Eric W. Biederman
  2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
@ 2006-02-25 12:27 ` Andrew Morton
  2006-02-25 13:34   ` Eric W. Biederman
  2006-02-25 15:20   ` Eric W. Biederman
  2006-02-27 15:26 ` Serge E. Hallyn
  2 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2006-02-25 12:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel

ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> When working on pid namespaces I keep tripping over /proc.
>  Its hard coded inode numbers and the amount of cruft
>  accumulated over the years make it hard to deal with.
> 
>  So to put /proc out of my misery here is a series of patches that
>  removes the worst of the warts.

An additional 2.7k of vmlinux.  A shame.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/23] proc cleanup.
  2006-02-25 12:27 ` [PATCH 00/23] proc cleanup Andrew Morton
@ 2006-02-25 13:34   ` Eric W. Biederman
  2006-02-25 15:20   ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-25 13:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@osdl.org> writes:

> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>
>> When working on pid namespaces I keep tripping over /proc.
>>  Its hard coded inode numbers and the amount of cruft
>>  accumulated over the years make it hard to deal with.
>> 
>>  So to put /proc out of my misery here is a series of patches that
>>  removes the worst of the warts.
>
> An additional 2.7k of vmlinux.  A shame.

Yes.  I guess so.

Do you want me to run the bloat-o-meter to see
where the size increase comes from?

Looking at the diffstat there was barely a code size
increase.

 fs/exec.c                 |    9
 fs/proc/base.c            | 2374 ++++++++++++++++++++++------------------------
 fs/proc/inode.c           |   11
 fs/proc/internal.h        |   23
 fs/proc/root.c            |   13
 fs/proc/task_mmu.c        |  101 +
 fs/proc/task_nommu.c      |   21
 include/linux/init_task.h |    1
 include/linux/pid.h       |    4
 include/linux/proc_fs.h   |   26
 include/linux/sched.h     |    3
 include/linux/task_ref.h  |   69 +
 kernel/Makefile           |    2
 kernel/exit.c             |   12
 kernel/fork.c             |   10
 kernel/pid.c              |   12
 kernel/task_ref.c         |  131 ++
 mm/mempolicy.c            |    6
 18 files changed, 1533 insertions(+), 1295 deletions(-)


Eric






^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/23] proc cleanup.
  2006-02-25 12:27 ` [PATCH 00/23] proc cleanup Andrew Morton
  2006-02-25 13:34   ` Eric W. Biederman
@ 2006-02-25 15:20   ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-25 15:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@osdl.org> writes:

> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>
>> When working on pid namespaces I keep tripping over /proc.
>>  Its hard coded inode numbers and the amount of cruft
>>  accumulated over the years make it hard to deal with.
>> 
>>  So to put /proc out of my misery here is a series of patches that
>>  removes the worst of the warts.
>
> An additional 2.7k of vmlinux.  A shame.

It looks like that is at least compiler dependent; with gcc-3.3.5 I get:

   text    data     bss     dec     hex filename
2601428  502342  226092 3329862  32cf46 ../linux-2.6-ns-mirror-build1/vmlinux
2602548  502494  226092 3331134  32d43e ../linux-2.6-ns-mirror-build2/vmlinux

So it looks like about 1.1K of text and about 150 bytes of data.
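
As a quick check of that arithmetic, the two `size` rows above can be
subtracted segment by segment.  This is only a sketch: the struct and
helper names are made up for illustration, and it assumes build1 is the
unpatched kernel and build2 the patched one.

```c
#include <assert.h>

/* One row of `size` output: text, data and bss segment sizes in bytes. */
struct seg_size {
	long text;
	long data;
	long bss;
};

/* Figures copied from the two vmlinux rows quoted above. */
static const struct seg_size build1 = { 2601428, 502342, 226092 };
static const struct seg_size build2 = { 2602548, 502494, 226092 };

/* Per-segment growth going from build `a` to build `b`. */
static struct seg_size seg_delta(struct seg_size a, struct seg_size b)
{
	struct seg_size d = { b.text - a.text, b.data - a.data, b.bss - a.bss };
	return d;
}
```

With those rows the deltas come out as 1120 bytes of text, 152 bytes of
data and no change in bss, 1272 bytes in total.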

Investigating this quickly.  Because of the refactoring it
is hard to pin this down to any major culprit.  But that is
also good news in that it doesn't look like an inline function
is responsible for this growth :)

It looks like the culprit for small amounts of growth is
the work to see if a task still exists, and other similar
checks that were needed but missing.

For the big chunks it looks like the culprit is the work to populate the
dcache during readdir, which keeps the inode numbers in
sync and should help readdir+stat performance.

The other big culprit is proc_flush_task, which is both
more comprehensive and simpler than proc_pid_flush+proc_pid_unhash (28+107 bytes).
But unfortunately at 605 bytes it is a lot bigger.

So short of getting better dcache helpers for the case
where readdir populates the dcache it doesn't look like the code
size will come down much.

The one practical thing that will help a little is that it
looks like with just a little more work we can replace
all of read_lock(&tasklist_lock) with rcu_read_lock().

add/remove: 33/23 grow/shrink: 28/16 up/down: 4968/-3619 (1349)
function                                     old     new   delta
proc_flush_task                                -     605    +605
proc_check_dentry_visible                      -     325    +325
proc_fill_cache                                -     256    +256
proc_fd_instantiate                            -     243    +243
tref_get_by_task                               -     200    +200
first_tid                                      -     179    +179
tgid_base_stuff                              336     504    +168
tid_base_stuff                               320     480    +160
first_tgid                                     -     159    +159
proc_pident_instantiate                        -     158    +158
proc_task_instantiate                          -     128    +128
proc_pid_instantiate                           -     128    +128
attr_dir_stuff                                 -     120    +120
proc_attr_dir_operations                       -     108    +108
tref_get_by_pid                                -     107    +107
proc_task_getattr                              -     105    +105
next_tid                                       -     100    +100
next_tgid                                      -      89     +89
tref_put                                       -      87     +87
proc_attr_dir_inode_operations                 -      84     +84
do_maps_open                                   -      80     +80
proc_fd_fill_cache                             -      63     +63
proc_task_fill_cache                           -      60     +60
proc_pid_fill_cache                            -      60     +60
__detach_pid                                 136     195     +59
proc_info_read                               111     164     +53
oom_adjust_read                              162     215     +53
proc_get_sb                                   26      78     +52
seccomp_write                                168     218     +50
seccomp_read                                 164     214     +50
oom_adjust_write                             164     214     +50
proc_pid_attr_read                           124     172     +48
proc_base_stuff                                -      48     +48
proc_pid_attr_write                          148     194     +46
proc_fd_link                                 122     168     +46
proc_exe_link                                152     198     +46
mounts_open                                  157     200     +43
proc_root_link                                99     141     +42
proc_cwd_link                                 99     141     +42
tid_fd_revalidate                            207     247     +40
proc_pident_fill_cache                         -      40     +40
get_tref_task                                  -      33     +33
dup_task_struct                              137     170     +33
m_stop                                        59      88     +29
m_start                                      235     264     +29
tref_reset                                     -      28     +28
proc_attr_dir_readdir                          -      28     +28
proc_pident_readdir                          270     296     +26
proc_attr_dir_lookup                           -      22     +22
tref_set                                       -      21     +21
tref_fini                                      -      21     +21
tref_init                                      -      18     +18
proc_pid_follow_link                          98     116     +18
init_tref                                      -      16     +16
init_task                                   1328    1344     +16
pid_revalidate                               178     192     +14
attach_pid                                   149     162     +13
tref_get                                       -       8      +8
mem_read                                     430     438      +8
proc_alloc_inode                              98     102      +4
proc_task_readdir                            320     323      +3
pid_delete_dentry                             24      21      -3
m_next                                        70      61      -9
proc_readfd                                  327     307     -20
copy_process                                3190    3170     -20
smaps_open                                    43      22     -21
maps_open                                     43      22     -21
proc_tid_attr_lookup                          22       -     -22
proc_tgid_attr_lookup                         22       -     -22
proc_delete_inode                            129     105     -24
pid_base_dentry_operations                    24       -     -24
proc_tid_attr_readdir                         28       -     -28
proc_tgid_attr_readdir                        28       -     -28
proc_pid_flush                                28       -     -28
release_task                                 257     228     -29
proc_permission                               38       -     -38
proc_pid_make_inode                          205     166     -39
unhash_process                                73      33     -40
de_thread                                   1310    1266     -44
proc_check_root                               55       -     -55
proc_task_lookup                             244     188     -56
pid_base_iput                                 62       -     -62
proc_pid_readdir                             303     229     -74
tid_attr_stuff                                80       -     -80
tgid_attr_stuff                               80       -     -80
proc_task_permission                          82       -     -82
proc_tid_attr_inode_operations                84       -     -84
proc_tgid_attr_inode_operations               84       -     -84
proc_mem_inode_operations                     84       -     -84
get_tid_list                                  97       -     -97
proc_pid_unhash                              107       -    -107
proc_tid_attr_operations                     108       -    -108
proc_tgid_attr_operations                    108       -    -108
proc_lookupfd                                240      99    -141
get_tgid_list                                146       -    -146
proc_task_root_link                          218       -    -218
proc_check_chroot                            245       -    -245
switch_exec_pids                             290       -    -290
proc_pid_lookup                              503     145    -358
proc_pident_lookup                           742     142    -600

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/23] proc cleanup.
  2006-02-23 15:52 [PATCH 00/23] proc cleanup Eric W. Biederman
  2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
  2006-02-25 12:27 ` [PATCH 00/23] proc cleanup Andrew Morton
@ 2006-02-27 15:26 ` Serge E. Hallyn
  2006-02-27 15:56   ` Eric W. Biederman
  2 siblings, 1 reply; 49+ messages in thread
From: Serge E. Hallyn @ 2006-02-27 15:26 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> When working on pid namespaces I keep tripping over /proc.
> Its hard coded inode numbers and the amount of cruft
> accumulated over the years make it hard to deal with.
> 
> So to put /proc out of my misery here is a series of patches that
> removes the worst of the warts.
> 
> The first patch which introduces task_refs is used later to address
> one of the worst faults how much low kernel memory it allows

Glad to see the task_refs patches in particular resubmitted.

This is a long set including some big patches, so it's hard to just
sit down and audit for errors, but looking at before- and after- they
look nice.

Resulting kernel passes ltp stresstests and zseries.

-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/23] proc cleanup.
  2006-02-27 15:26 ` Serge E. Hallyn
@ 2006-02-27 15:56   ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-02-27 15:56 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Andrew Morton, linux-kernel

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> 
>> When working on pid namespaces I keep tripping over /proc.
>> Its hard coded inode numbers and the amount of cruft
>> accumulated over the years make it hard to deal with.
>> 
>> So to put /proc out of my misery here is a series of patches that
>> removes the worst of the warts.
>> 
>> The first patch which introduces task_refs is used later to address
>> one of the worst faults how much low kernel memory it allows
>
> Glad to see the task_refs patches in particular resubmitted.
>
> This is a long set including some big patches, so it's hard to just
> sit down and audit for errors, but looking at before- and after- they
> look nice.
>
> Resulting kernel passes ltp stresstests and zseries.

Very oddly there was a hiccup the first time I sent them to Andrew.
So I am resending them to Andrew in slightly smaller chunks.
So far so good...

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-02-23 16:49   ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
@ 2006-03-02 19:16     ` Oleg Nesterov
  2006-03-02 20:37       ` Oleg Nesterov
  2006-03-02 22:19       ` Eric W. Biederman
  2006-03-03 19:23     ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-02 19:16 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel

Eric W. Biederman wrote:
>
> Holding a reference to a task_struct pins about 10K of low memory even after
> that task has exited, which seems to be 1 or 2 orders of magnitude more
> memory than any other data structure in the kernel.  Without a reference
> to a task_struct you risk problems with pid wrap around.

I think there is another, much simpler solution. We can make a "reference" to the
pid itself to protect it against free_pidmap(), so that this pid can't be reused.

	struct pid_ref
	{
		pid_t			pid;
		atomic_t		count;
		struct hlist_node	chain;
	};

	// allocated in pidhash_init()
	static struct hlist_head *ref_hash;

	struct pid_ref *find_pid_ref(pid_t pid)
	{
		struct hlist_node *elem;
		struct pid_ref *ref;

		hlist_for_each_entry(ref, elem, &ref_hash[pid_hashfn(pid)], chain)
			if (ref->pid == pid)
				return ref;

		return NULL;
	}

	// just s/free_pidmap/__free_pidmap/
	static void __free_pidmap(int pid)
	{
		pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
		int offset = pid & BITS_PER_PAGE_MASK;

		clear_bit(offset, map->page);
		atomic_inc(&map->nr_free);
	}

	fastcall void free_pidmap(int pid)
	{
		if (!find_pid_ref(pid))
			__free_pidmap(pid);
	}


	static int pid_inuse(pid_t pid)
	{
		int type;

		for (type = 0; type < PIDTYPE_MAX; ++type)
			if (find_pid(type, pid))
				return 1;

		return 0;
	}

	// simple, non-optimized version
	struct pid_ref *mk_pid_ref(pid_t pid)
	{
		struct pid_ref *ref;

		write_lock_irq(&tasklist_lock);
		ref = find_pid_ref(pid);
		if (ref)
			atomic_inc(&ref->count);
		else if (pid_inuse(pid)) {
			ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
			if (ref) {
				ref->pid = pid;
				atomic_set(&ref->count, 1);
				hlist_add_head(&ref->chain,
					&ref_hash[pid_hashfn(pid)]);
			}
		}
		write_unlock_irq(&tasklist_lock);

		return ref;
	}

	void put_pid_ref(struct pid_ref *ref)
	{
		if (!ref || !atomic_dec_and_test(&ref->count))
			return;

		write_lock_irq(&tasklist_lock);
		if (!atomic_read(&ref->count)) {
			if (!pid_inuse(ref->pid))
				__free_pidmap(ref->pid);
			hlist_del(&ref->chain);
			kfree(ref);
		}
		write_unlock_irq(&tasklist_lock);
	}

That's all. The only modified function is free_pidmap(), and the change is
trivial. Example of usage:

	 struct fown_struct {
		...
	-	int pid;
	+	struct pid_ref *ref;
	+	enum pid_type	type;
		...
	 }

	 void file_free(struct file *f)
	 {
	+	put_pid_ref(f->f_owner.ref);
		...
	 }

	void f_modown(struct file *filp, int pid, uid_t uid, uid_t euid, int force)
	{
		struct pid_ref *old, *ref;
		enum pid_type type = PIDTYPE_PID;

		if (pid < 0) {
			pid = -pid;
			type = PIDTYPE_PGID;
		}
		ref = mk_pid_ref(pid);

		write_lock_irq(&filp->f_owner.lock);
		old = ref;
		if (force || !filp->f_owner.ref) {
			old = filp->f_owner.ref;
			filp->f_owner.ref = ref;
			filp->f_owner.type = type;
			filp->f_owner.uid = uid;
			filp->f_owner.euid = euid;
		}
		write_unlock_irq(&filp->f_owner.lock);

		put_pid_ref(old);
	}

	void send_sigio(struct fown_struct *fown, int fd, int band)
	{
		struct task_struct *p;

		read_lock(&fown->lock);
		if (!fown->ref)
			goto out_unlock_fown;

		read_lock(&tasklist_lock);

		do_each_task_pid(fown->ref->pid, fown->type, p)
			send_sigio_to_task(p, fown, fd, band);
		while_each_task_pid(fown->ref->pid, fown->type, p);

		read_unlock(&tasklist_lock);
	out_unlock_fown:
		read_unlock(&fown->lock);
	}

What do you think?

Oleg.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-02 19:16     ` Oleg Nesterov
@ 2006-03-02 20:37       ` Oleg Nesterov
  2006-03-02 22:19       ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-02 20:37 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton, linux-kernel

Oleg Nesterov wrote:
> 
>         void put_pid_ref(struct pid_ref *ref)
>         {
>                 if (!ref || !atomic_dec_and_test(&ref->count))
>                         return;
> 
>                 write_lock_irq(&tasklist_lock);
>                 if (!atomic_read(&ref->count)) {

Ok, this is racy, but the fix is possible.

Oleg.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-02 19:16     ` Oleg Nesterov
  2006-03-02 20:37       ` Oleg Nesterov
@ 2006-03-02 22:19       ` Eric W. Biederman
  2006-03-03 16:56         ` Oleg Nesterov
  2006-03-06 21:06         ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-02 22:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel

Oleg Nesterov <oleg@tv-sign.ru> writes:

> Eric W. Biederman wrote:
>>
>> Holding a reference to a task_struct pins about 10K of low memory even after
>> that task has exited, which seems to be 1 or 2 orders of magnitude more
>> memory than any other data structure in the kernel.  Without a reference
>> to a task_struct you risk problems with pid wrap around.
>
> I think there is another, much simpler solution. We can make a "reference" to
> the
> pid itself to protect it against free_pidmap(), so that this pid can't be
> reused.

I kind of like the idea of not releasing the pid when someone is using it.
However with my trivial hostile program, 32 or 33 living processes
each holding 1000 references to dead processes can completely saturate the
default pid map.  And it won't be obvious why alloc_pidmap is failing.
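
The numbers behind that claim can be sketched as follows.  This assumes
the default pid map holds 32768 pids (PID_MAX_DEFAULT on most
configurations) and ignores reserved pids; the helper names are made up
for illustration.

```c
#include <assert.h>

/* Assumed default size of the pid map (PID_MAX_DEFAULT on most configs). */
#define PID_MAP_SIZE 32768

/* Pids kept busy when each living holder pins `refs` dead pids. */
static long pinned_pids(long holders, long refs)
{
	return holders * (refs + 1);    /* +1: the holder's own live pid */
}

/* Smallest number of holders that keeps every pid in the map busy. */
static long holders_to_saturate(long refs)
{
	long per_holder = refs + 1;

	return (PID_MAP_SIZE + per_holder - 1) / per_holder;
}
```

With 1000 references per process, 32 holders keep 32032 pids busy and
33 holders keep 33033 busy, which is more than the map holds.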

Your resource consumption with the extra hash table is higher than
mine until very high process counts.

In addition it doesn't really help with the problem that inspired
this work: that of having multiple pid spaces.  I could make it work
by throwing a pspace reference in struct pid_ref, but without some
fancy footwork it would prevent cleanup until all of the outside
references are gone.

> What do you think?

So I kind of like it but I don't feel it does as good a job solving
the problems I am solving.  

In this instance I'm not at all certain I like having NULL ref
pointers.  It increases the number of cases you have to deal with when
reading back the pid, but that is minor and something task_refs suffer
from more than pid_refs.

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-02 22:19       ` Eric W. Biederman
@ 2006-03-03 16:56         ` Oleg Nesterov
  2006-03-03 17:48           ` Eric W. Biederman
  2006-03-04 11:16           ` Eric W. Biederman
  2006-03-06 21:06         ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-03 16:56 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel

Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> writes:
>
> > I think there is another, much simpler solution. We can make a "reference" to
> > the
> > pid itself to protect it against free_pidmap(), so that this pid can't be
> > reused.
>
> However with my trivial hostile program, 32 or 33 living processes
> each holding 1000 references to dead processes can completely saturate the
> default pid map.  And it won't be obvious why alloc_pidmap is failing.

Yes, this is a problem. Please see the new version below. Instead of delaying
pid releasing, free_pidmap() just invalidates pid_ref. The code becomes even
simpler.

> Your resource consumption with the extra hash table is higher than
> mine until very high process counts.

The size of ref_array[] could be arbitrarily low (we can't use pid_hashfn() in
this case, of course). And tref adds 4 * sizeof(void*) to every task, and it
is much more complicated.

> In addition it doesn't really help with the problem that inspired
> this work.  That of having multiple pid spaces.  I could make it work
> by throwing a pspace reference in struct pid_ref, but without some
> fancy footwork it would prevent cleanup until all of the outside
> references are gone.
>
> So I kind of like it but I don't feel it does as good a job solving
> the problems I am solving.

Ok. I missed the virtualization/pspace discussion completely, so you are
very probably right.

Oleg.

struct pid_ref
{
	pid_t			pid;
	int			count;
	struct hlist_node	chain;
};

// allocated in pidhash_init()
static struct hlist_head *ref_hash;

static struct pid_ref *find_pid_ref(pid_t pid)
{
	struct hlist_node *elem;
	struct pid_ref *ref;

	hlist_for_each_entry(ref, elem, &ref_hash[pid_hashfn(pid)], chain)
		if (ref->pid == pid)
			return ref;

	return NULL;
}

// This is the only function modified.
fastcall void free_pidmap(int pid)
{
	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
	int offset = pid & BITS_PER_PAGE_MASK;
	struct pid_ref *ref;

	clear_bit(offset, map->page);
	atomic_inc(&map->nr_free);

	ref = find_pid_ref(pid);
	if (unlikely(ref != NULL)) {
		hlist_del_init(&ref->chain);
		ref->pid = 0;
	}
}

static inline int pid_inuse(pid_t pid)
{
	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
	int offset = pid & BITS_PER_PAGE_MASK;

	return test_bit(offset, map->page);
}

struct pid_ref *alloc_pid_ref(pid_t pid)
{
	struct pid_ref *ref;

	write_lock_irq(&tasklist_lock);
	ref = find_pid_ref(pid);
	if (ref)
		ref->count++;
	else if (pid_inuse(pid)) {
		ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
		if (ref) {
			ref->pid = pid;
			ref->count = 1;
			hlist_add_head(&ref->chain,
				&ref_hash[pid_hashfn(pid)]);
		}
	}
	write_unlock_irq(&tasklist_lock);

	return ref;
}

void free_pid_ref(struct pid_ref *ref)
{
	if (!ref)
		return;

	write_lock_irq(&tasklist_lock);
	if (!--ref->count) {
		hlist_del_init(&ref->chain);
		kfree(ref);
	}
	write_unlock_irq(&tasklist_lock);
}
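
The behaviour of this second version can be modelled in userspace as
below.  This is a sketch under stated assumptions: a plain singly
linked list stands in for the hlist and hash table, a bool array stands
in for the pidmap, MAX_PID and unlink_ref() are made up, and there is
no tasklist_lock since the model is single threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_PID 64                      /* tiny pid space for the model */

struct pid_ref {
	int pid;                        /* 0 once the pid has been freed */
	int count;
	struct pid_ref *next;           /* stands in for the hlist chain */
};

static bool pid_bitmap[MAX_PID];        /* models the pidmap bits */
static struct pid_ref *ref_list;        /* models ref_hash as one bucket */

static struct pid_ref *find_pid_ref(int pid)
{
	for (struct pid_ref *r = ref_list; r; r = r->next)
		if (r->pid == pid)
			return r;
	return NULL;
}

/* Remove a ref from the list; does nothing if it is not linked. */
static void unlink_ref(struct pid_ref *ref)
{
	for (struct pid_ref **p = &ref_list; *p; p = &(*p)->next)
		if (*p == ref) {
			*p = ref->next;
			ref->next = NULL;
			return;
		}
}

/* Freeing a pid releases it at once and invalidates any ref to it. */
static void free_pidmap(int pid)
{
	struct pid_ref *ref = find_pid_ref(pid);

	pid_bitmap[pid] = false;
	if (ref) {
		unlink_ref(ref);
		ref->pid = 0;
	}
}

/* Take a reference on a pid; NULL if the pid is not in use. */
static struct pid_ref *alloc_pid_ref(int pid)
{
	struct pid_ref *ref = find_pid_ref(pid);

	if (ref) {
		ref->count++;
	} else if (pid_bitmap[pid]) {
		ref = malloc(sizeof(*ref));
		if (ref) {
			ref->pid = pid;
			ref->count = 1;
			ref->next = ref_list;
			ref_list = ref;
		}
	}
	return ref;
}

static void free_pid_ref(struct pid_ref *ref)
{
	if (ref && --ref->count == 0) {
		unlink_ref(ref);        /* no-op if already invalidated */
		free(ref);
	}
}
```

In this model a stale ref reads back pid 0 after the pid is freed, and
a later reuse of the same pid value gets a fresh ref rather than
matching the stale one.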

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-03 16:56         ` Oleg Nesterov
@ 2006-03-03 17:48           ` Eric W. Biederman
  2006-03-04 11:16           ` Eric W. Biederman
  1 sibling, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-03 17:48 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel, Serge E. Hallyn

Oleg Nesterov <oleg@tv-sign.ru> writes:

> Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@tv-sign.ru> writes:
>>
>> > I think there is another, much simpler solution. We can make a "reference"
> to
>> > the
>> > pid itself to protect it against free_pidmap(), so that this pid can't be
>> > reused.
>>
>> However with my trivial hostile program, 32 or 33 living processes
>> each holding 1000 references to dead processes can completely saturate the
>> default pid map.  And it won't be obvious why alloc_pidmap is failing.
>
> Yes, this is a problem. Please see the new version below. Instead of delaying
> pid releasing, free_pidmap() just invalidates pid_ref. The code becomes even
> simpler.

And it removes most of my interaction problem with multiple pid spaces.

>> Your resource consumption with the extra hash table is higher than
>> mine until very high process counts.
>
> The size of ref_array[] could be arbitrarily low (we can't use pid_hashfn() in
> this case, of course). And tref adds 4 * sizeof(void*) to every task, and it
> is much more complicated.

I guess the worst case behavior would be triggered by a find in /proc,
which would probably populate a ref for every pid, and it isn't that
uncommon.  So I suspect we really want ref_array to be able to
use pid_hashfn(), as it is likely to get an equal amount of use.

More comments when I have time.

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-02-23 16:49   ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
  2006-03-02 19:16     ` Oleg Nesterov
@ 2006-03-03 19:23     ` Oleg Nesterov
  2006-03-04 10:51       ` Eric W. Biederman
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-03 19:23 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel

Eric W. Biederman wrote:
>
> +++ devel-akpm/kernel/task_ref.c	2006-02-27 20:28:59.000000000 -0800
> @@ -0,0 +1,131 @@
> +#include <linux/sched.h>
> +#include <linux/task_ref.h>
> +
> +struct task_ref init_tref = {
> +	.count = ATOMIC_INIT(1),
> +	.type  = PIDTYPE_PID,
> +	.pid   = 0,
> +	.task  = NULL,
> +};

Make it static? Actually, I don't understand why init_tref is better
than NULL. Yes, NULL will add some checks into task_ref.c, but we can
avoid some costly atomic ops.

> +void tref_put(struct task_ref *ref)
> +{
> +	might_sleep();
> +	if (atomic_dec_and_test(&ref->count)) {
> +		struct task_struct *task;
> +		BUG_ON(ref == &init_tref);
> +		/* Carefully serialize against __detach_pid and tref_get_by_pid */
> +		write_lock_irq(&tasklist_lock);
> +		task = ref->task;
> +		if (task)
> +			task->pids[ref->type].ref = NULL;
> +		write_unlock_irq(&tasklist_lock);
> +		kfree(ref);
> +	}

I think this is racy. Suppose ref->count == 1. What if another cpu does
tref_get_by_task() between atomic_dec_and_test() and write_lock_irq() ?
It takes tasklist_lock, increments ->count again, and returns the pointer
to the memory which will be freed soon.
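
One common way to close this kind of window is to make the final decrement happen under the same lock that the lookup side takes, so nobody can re-take a reference between "count hit zero" and "object unpublished". The userspace sketch below shows that shape only; `struct ref`, `struct owner`, `table_lock`, `ref_get_by_owner()` and `ref_put()` are hypothetical stand-ins (a pthread mutex plays the role of the kernel lock), not names from the patch. The kernel's atomic_dec_and_lock() follows a similar pattern for spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these names come from the patch. */
struct owner;

struct ref {
	atomic_int count;
	struct owner *owner;	/* back pointer to the publishing object */
};

struct owner {
	struct ref *ref;	/* published ref, protected by table_lock */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup side: reuse or create the published ref, always under the lock. */
struct ref *ref_get_by_owner(struct owner *o)
{
	struct ref *r;

	pthread_mutex_lock(&table_lock);
	r = o->ref;
	if (r) {
		atomic_fetch_add(&r->count, 1);
	} else {
		r = malloc(sizeof(*r));
		if (r) {
			atomic_init(&r->count, 1);
			r->owner = o;
			o->ref = r;
		}
	}
	pthread_mutex_unlock(&table_lock);
	return r;
}

/* Drop one reference. Returns true if the object was freed. */
bool ref_put(struct ref *r)
{
	int old = atomic_load(&r->count);

	/* Fast path: not the last reference, so no lock is needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(&r->count, &old, old - 1))
			return false;
	}

	/* Slow path: the drop to zero is only ever observed under the lock. */
	pthread_mutex_lock(&table_lock);
	if (atomic_fetch_sub(&r->count, 1) != 1) {
		/* A lookup re-took a reference before we got the lock. */
		pthread_mutex_unlock(&table_lock);
		return false;
	}
	if (r->owner)
		r->owner->ref = NULL;	/* unpublish before freeing */
	pthread_mutex_unlock(&table_lock);
	free(r);
	return true;
}
```

Because a lookup can only bump the count while holding `table_lock`, and the count can only reach zero while `table_lock` is held, a lookup either sees the ref with a nonzero count or does not see it at all.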

> +struct task_ref *tref_get_by_pid(int pid, enum pid_type type)
> +{
> +	struct task_struct *task;
> +	struct task_ref *tref;
> +
> +	/* Look up and pin the task */
> +	read_lock(&tasklist_lock);
> +	task = find_task_by_pid_type(type, pid);
> +	if (task)
> +		get_task_struct(task);
> +	read_unlock(&tasklist_lock);
> +
> +	/* Now get the tref */
> +	if (task) {
> +		tref = tref_get_by_task(task, type);
> +		put_task_struct(task);
> +	}
> +	else
> +		tref = tref_get(&init_tref);
> +	return tref;
> +}

I believe this could be simplified; we don't need to get/put task_struct:

	rcu_read_lock();

	task = find_task_by_pid_type(type, pid);
	if (task)
		tref = tref_get_by_task(task, type);
	else
		tref = tref_get(&init_tref);

	rcu_read_unlock();

Oleg.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-03 19:23     ` Oleg Nesterov
@ 2006-03-04 10:51       ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-04 10:51 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel

Oleg Nesterov <oleg@tv-sign.ru> writes:

> Eric W. Biederman wrote:
>>
>> +++ devel-akpm/kernel/task_ref.c	2006-02-27 20:28:59.000000000 -0800
>> @@ -0,0 +1,131 @@
>> +#include <linux/sched.h>
>> +#include <linux/task_ref.h>
>> +
>> +struct task_ref init_tref = {
>> +	.count = ATOMIC_INIT(1),
>> +	.type  = PIDTYPE_PID,
>> +	.pid   = 0,
>> +	.task  = NULL,
>> +};
>
> Make it static? Actually, I don't understand why init_tref is better
> than NULL. Yes, NULL will add some checks into task_ref.c, but we can
> avoid some costly atomic ops.

init_tref isn't for task_ref.c so much as it is there to allow the
users of task_refs to avoid the checks.  I have enough helper
functions that an appropriate helper, instead of a well-prescribed
value, may be an equally valid approach.

>> +void tref_put(struct task_ref *ref)
>> +{
>> +	might_sleep();
>> +	if (atomic_dec_and_test(&ref->count)) {
>> +		struct task_struct *task;
>> +		BUG_ON(ref == &init_tref);
>> +		/* Carefully serialize against __detach_pid and tref_get_by_pid */
>> +		write_lock_irq(&tasklist_lock);
>> +		task = ref->task;
>> +		if (task)
>> +			task->pids[ref->type].ref = NULL;
>> +		write_unlock_irq(&tasklist_lock);
>> +		kfree(ref);
>> +	}
>
> I think this is racy. Suppose ref->count == 1. What if another cpu does
> tref_get_by_task() between atomic_dec_and_test() and write_lock_irq() ?
> It takes tasklist_lock, increments ->count again, and returns the pointer
> to the memory which will be freed soon.

Agreed.  Grumble.  This needs to get fixed.

>> +struct task_ref *tref_get_by_pid(int pid, enum pid_type type)
>> +{
>> +	struct task_struct *task;
>> +	struct task_ref *tref;
>> +
>> +	/* Look up and pin the task */
>> +	read_lock(&tasklist_lock);
>> +	task = find_task_by_pid_type(type, pid);
>> +	if (task)
>> +		get_task_struct(task);
>> +	read_unlock(&tasklist_lock);
>> +
>> +	/* Now get the tref */
>> +	if (task) {
>> +		tref = tref_get_by_task(task, type);
>> +		put_task_struct(task);
>> +	}
>> +	else
>> +		tref = tref_get(&init_tref);
>> +	return tref;
>> +}
>
> I believe this could be simplified; we don't need to get/put task_struct:
>
> 	rcu_read_lock();
>
> 	task = find_task_by_pid_type(type, pid);
> 	if (task)
> 		tref = tref_get_by_task(task, type);
> 	else
> 		tref = tref_get(&init_tref);
>
> 	rcu_read_unlock();

Nope, kmalloc can sleep, and we can't sleep inside rcu_read_lock().

Eric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-03 16:56         ` Oleg Nesterov
  2006-03-03 17:48           ` Eric W. Biederman
@ 2006-03-04 11:16           ` Eric W. Biederman
  2006-03-04 12:31             ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-04 11:16 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel

Oleg Nesterov <oleg@tv-sign.ru> writes:

> Ok. I missed the virtualization/pspace discussion completely, so you are
> very probably right.

So I think the pid_ref code is likely still short several helper functions,
but is probably usable.  Using it is slightly more costly but I doubt the
pid hash table has any significant performance penalties.

The important property to preserve from a maintenance standpoint is
that the helper functions take enough information that when I go back
and implement pid spaces I will need at most to tweak the pid_ref
implementation and the pid_ref helper functions, and will not need to
go back through and change all of the users (again).

> Oleg.
>
> struct pid_ref
> {
> 	pid_t			pid;
> 	int			count;
> 	struct hlist_node	chain;
> };
>
> // allocated in pidhash_init()
> static struct hlist_head *ref_hash;
>
> static struct pid_ref *find_pid_ref(pid_t pid)
> {
> 	struct hlist_node *elem;
> 	struct pid_ref *ref;
>
> 	hlist_for_each_entry(ref, elem, &ref_hash[pid_hashfn(pid)], chain)
> 		if (ref->pid == pid)
> 			return ref;
>
> 	return NULL;
> }
>
> // This is the only function modified.
> fastcall void free_pidmap(int pid)
> {
> 	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
> 	int offset = pid & BITS_PER_PAGE_MASK;
> 	struct pid_ref *ref;
>
> 	clear_bit(offset, map->page);
> 	atomic_inc(&map->nr_free);
>
> 	ref = find_pid_ref(pid);
> 	if (unlikely(ref != NULL)) {
> 		hlist_del_init(&ref->chain);
> 		ref->pid = 0;
> 	}
> }

Ouch!  I believe free_pidmap now needs the tasklist_lock so
we can free the pid and kill the pid_ref atomically.  Otherwise
the pid could potentially get reused before we free the pid reference.
I think that means ensuring all of the callers take tasklist_lock.
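
To illustrate why the release and the invalidation belong in one critical section, here is a small userspace sketch. Everything in it is hypothetical (`id_lock`, `id_release()`, `id_ref_get()`, a plain list instead of a hash, a pthread mutex instead of a kernel lock); it only mirrors the idea that marking the id free and detaching its ref happen together, so a reused id can never be matched to a stale ref.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ID 64		/* hypothetical tiny id space; id 0 is reserved */

struct id_ref {
	int id;			/* reset to 0 once the id is released */
	int count;
	struct id_ref *next;
};

static bool in_use[MAX_ID];
static struct id_ref *live_refs;	/* one list instead of a hash, for brevity */
static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller holds id_lock. Returns the link pointing at the ref, or NULL. */
static struct id_ref **find_link(int id)
{
	struct id_ref **link;

	for (link = &live_refs; *link; link = &(*link)->next)
		if ((*link)->id == id)
			return link;
	return NULL;
}

/* Ids passed to these helpers must be in the range 1..MAX_ID-1. */
void id_mark_used(int id)
{
	pthread_mutex_lock(&id_lock);
	in_use[id] = true;
	pthread_mutex_unlock(&id_lock);
}

struct id_ref *id_ref_get(int id)
{
	struct id_ref **link, *ref = NULL;

	pthread_mutex_lock(&id_lock);
	link = find_link(id);
	if (link) {
		ref = *link;
		ref->count++;
	} else if (in_use[id]) {
		ref = malloc(sizeof(*ref));
		if (ref) {
			ref->id = id;
			ref->count = 1;
			ref->next = live_refs;
			live_refs = ref;
		}
	}
	pthread_mutex_unlock(&id_lock);
	return ref;
}

/* Release the id and invalidate its ref in one critical section. */
void id_release(int id)
{
	struct id_ref **link, *ref;

	pthread_mutex_lock(&id_lock);
	in_use[id] = false;
	link = find_link(id);
	if (link) {
		ref = *link;
		*link = ref->next;	/* unlink: a reused id never finds it */
		ref->next = NULL;
		ref->id = 0;
	}
	pthread_mutex_unlock(&id_lock);
}

void id_ref_put(struct id_ref *ref)
{
	struct id_ref **link;

	if (!ref)
		return;
	pthread_mutex_lock(&id_lock);
	if (--ref->count == 0) {
		/* Still listed only if its id was never released. */
		link = ref->id ? find_link(ref->id) : NULL;
		if (link)
			*link = ref->next;
		free(ref);
	}
	pthread_mutex_unlock(&id_lock);
}
```

If `id_release()` cleared `in_use[id]` before taking the lock, another thread could mark the id used again and call `id_ref_get()` in between, and would be handed the old ref.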

> static inline int pid_inuse(pid_t pid)
> {
> 	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
> 	int offset = pid & BITS_PER_PAGE_MASK;
>
> 	return test_bit(offset, map->page);
> }
>
> struct pid_ref *alloc_pid_ref(pid_t pid)
> {
> 	struct pid_ref *ref;
>
> 	write_lock_irq(&tasklist_lock);
> 	ref = find_pid_ref(pid);
> 	if (ref)
> 		ref->count++;
> 	else if (pid_inuse(pid)) {
> 		ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
> 		if (ref) {
> 			ref->pid = pid;
> 			ref->count = 1;
> 			hlist_add_head(&ref->chain,
> 				&ref_hash[pid_hashfn(pid)]);
> 		}
> 	}
> 	write_unlock_irq(&tasklist_lock);
>
> 	return ref;
> }

I need a helper that does this from a task structure but that
is simple enough. 

> void free_pid_ref(struct pid_ref *ref)
> {
> 	if (!ref)
> 		return;
>
> 	write_lock_irq(&tasklist_lock);
> 	if (!--ref->count) {
> 		hlist_del_init(&ref->chain);
> 		kfree(ref);
> 	}
> 	write_unlock_irq(&tasklist_lock);
> }

I think calling this put_pid_ref instead of free_pid_ref
is more accurate.  The whole alloc/free _pid_ref instead
of the more traditional get/put kind of throws me.  Since
an allocation/free is possible I can see where this comes from 
but I don't feel right about those names.

Eric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-04 11:16           ` Eric W. Biederman
@ 2006-03-04 12:31             ` Oleg Nesterov
  2006-03-04 17:30               ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-04 12:31 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel

"Eric W. Biederman" wrote:
> 
> Oleg Nesterov <oleg@tv-sign.ru> writes:
> 
> > fastcall void free_pidmap(int pid)
> > {
> >       pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
> >       int offset = pid & BITS_PER_PAGE_MASK;
> >       struct pid_ref *ref;
> >
> >       clear_bit(offset, map->page);
> >       atomic_inc(&map->nr_free);
> >
> >       ref = find_pid_ref(pid);
> >       if (unlikely(ref != NULL)) {
> >               hlist_del_init(&ref->chain);
> >               ref->pid = 0;
> >       }
> > }
> 
> Ouch!  I believe free_pidmap now needs the tasklist_lock so
> we can free the pid and kill the pid_ref atomically.  Otherwise
> the pid could potentially get reused before we free the pid reference.
> I think that means ensuring all of the callers take tasklist_lock.

Yes, you are right. And do_fork() does free_pidmap() lockless in
the error path. This path is not performance critical, so maybe
it is ok to add write_lock(tasklist) here.
 
> > void free_pid_ref(struct pid_ref *ref)
> > {
> >       if (!ref)
> >               return;
> >
> >       write_lock_irq(&tasklist_lock);
> >       if (!--ref->count) {
> >               hlist_del_init(&ref->chain);
> >               kfree(ref);
> >       }
> >       write_unlock_irq(&tasklist_lock);
> > }
> 
> I think calling this put_pid_ref instead of free_pid_ref
> is more accurate.  The whole alloc/free _pid_ref instead
> of the more traditional get/put kind of throws me.  Since
> an allocation/free is possible I can see where this comes from
> but I don't feel right about those names.

Agree.

Oleg.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-04 12:31             ` Oleg Nesterov
@ 2006-03-04 17:30               ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-04 17:30 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton, linux-kernel

Oleg Nesterov wrote:
> 
> "Eric W. Biederman" wrote:
> >
> > Oleg Nesterov <oleg@tv-sign.ru> writes:
> >
> > > fastcall void free_pidmap(int pid)
> > > {
> > >       pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
> > >       int offset = pid & BITS_PER_PAGE_MASK;
> > >       struct pid_ref *ref;
> > >
> > >       clear_bit(offset, map->page);
> > >       atomic_inc(&map->nr_free);
> > >
> > >       ref = find_pid_ref(pid);
> > >       if (unlikely(ref != NULL)) {
> > >               hlist_del_init(&ref->chain);
> > >               ref->pid = 0;
> > >       }
> > > }
> >
> > Ouch!  I believe free_pidmap now needs the tasklist_lock so
> > we can free the pid and kill the pid_ref atomically.  Otherwise
> > the pid could potentially get reused before we free the pid reference.
> > I think that means ensuring all of the callers take tasklist_lock.
> 
> Yes, you are right. And do_fork() does free_pidmap() lockless in
> the error path. This path is not performance critical, so maybe
> it is ok to add write_lock(tasklist) here.

Even better: don't use tasklist_lock at all. We can use pidmap_lock
instead, see the patch below.

I have added a simple find_task_by_pid_ref() helper, note that it
doesn't need pidmap_lock, and it doesn't need to check ref->pid != 0.

If the caller holds read_lock(tasklist), then this helper can't return an
unhashed task_struct; without the lock that is possible anyway.

Oleg.

(for review only)

--- 2.6.16-rc5/include/linux/sched.h~1_REF	2006-03-01 22:00:30.000000000 +0300
+++ 2.6.16-rc5/include/linux/sched.h	2006-03-04 22:56:44.000000000 +0300
@@ -1012,8 +1012,6 @@ extern struct task_struct init_task;
 
 extern struct   mm_struct init_mm;
 
-#define find_task_by_pid(nr)	find_task_by_pid_type(PIDTYPE_PID, nr)
-extern struct task_struct *find_task_by_pid_type(int type, int pid);
 extern void set_special_pids(pid_t session, pid_t pgrp);
 extern void __set_special_pids(pid_t session, pid_t pgrp);
 
--- 2.6.16-rc5/include/linux/pid.h~1_REF	2006-03-01 22:00:29.000000000 +0300
+++ 2.6.16-rc5/include/linux/pid.h	2006-03-04 23:02:51.000000000 +0300
@@ -35,6 +35,9 @@ extern void FASTCALL(detach_pid(struct t
  * held.
  */
 extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+extern struct task_struct *find_task_by_pid_type(int type, int pid);
+
+#define find_task_by_pid(nr)	find_task_by_pid_type(PIDTYPE_PID, nr)
 
 extern int alloc_pidmap(void);
 extern void FASTCALL(free_pidmap(int));
@@ -51,4 +54,23 @@ extern void FASTCALL(free_pidmap(int));
 			hlist_unhashed(&(task)->pids[type].pid_chain));	\
 	}								\
 
+struct pid_ref
+{
+	pid_t			pid;
+	int			count;
+	struct hlist_node	chain;
+};
+
+extern struct pid_ref *alloc_pid_ref(pid_t pid);
+extern void put_pid_ref(struct pid_ref *ref);
+
+static inline struct task_struct *find_task_by_pid_ref(struct pid_ref *ref,
+							enum pid_type type)
+{
+	if (!ref)
+		return NULL;
+
+	return find_task_by_pid_type(type, ref->pid);
+}
+
 #endif /* _LINUX_PID_H */
--- 2.6.16-rc5/kernel/pid.c~1_REF	2006-03-01 22:03:25.000000000 +0300
+++ 2.6.16-rc5/kernel/pid.c	2006-03-04 22:27:51.000000000 +0300
@@ -28,9 +28,12 @@
 #include <linux/hash.h>
 
 #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
-static struct hlist_head *pid_hash[PIDTYPE_MAX];
+static struct hlist_head *pid_hash[PIDTYPE_MAX + 1];
 static int pidhash_shift;
 
+#define ref_hashfn(pid)		pid_hashfn(pid)
+#define ref_hash		pid_hash[PIDTYPE_MAX]
+
 int pid_max = PID_MAX_DEFAULT;
 int last_pid;
 
@@ -62,13 +65,35 @@ static pidmap_t pidmap_array[PIDMAP_ENTR
 
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 
+static struct pid_ref *find_pid_ref(pid_t pid)
+{
+	struct hlist_node *elem;
+	struct pid_ref *ref;
+
+	hlist_for_each_entry(ref, elem, &ref_hash[ref_hashfn(pid)], chain)
+		if (ref->pid == pid)
+			return ref;
+
+	return NULL;
+}
+
 fastcall void free_pidmap(int pid)
 {
 	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
 	int offset = pid & BITS_PER_PAGE_MASK;
+	struct pid_ref *ref;
+	unsigned long flags;
 
 	clear_bit(offset, map->page);
 	atomic_inc(&map->nr_free);
+
+	spin_lock_irqsave(&pidmap_lock, flags);
+	ref = find_pid_ref(pid);
+	if (unlikely(ref != NULL)) {
+		hlist_del_init(&ref->chain);
+		ref->pid = 0;
+	}
+	spin_unlock_irqrestore(&pidmap_lock, flags);
 }
 
 int alloc_pidmap(void)
@@ -217,6 +242,48 @@ task_t *find_task_by_pid_type(int type, 
 
 EXPORT_SYMBOL(find_task_by_pid_type);
 
+static inline int pid_inuse(pid_t pid)
+{
+	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+	int offset = pid & BITS_PER_PAGE_MASK;
+
+	return likely(map->page) && test_bit(offset, map->page);
+}
+
+struct pid_ref *alloc_pid_ref(pid_t pid)
+{
+	struct pid_ref *ref;
+
+	spin_lock_irq(&pidmap_lock);
+	ref = find_pid_ref(pid);
+	if (ref)
+		ref->count++;
+	else if (pid_inuse(pid)) {
+		ref = kmalloc(sizeof(*ref), GFP_ATOMIC);
+		if (ref) {
+			ref->pid = pid;
+			ref->count = 1;
+			hlist_add_head(&ref->chain,
+				&ref_hash[ref_hashfn(pid)]);
+		}
+	}
+	spin_unlock_irq(&pidmap_lock);
+
+	return ref;
+}
+
+void put_pid_ref(struct pid_ref *ref)
+{
+	if (!ref)
+		return;
+
+	spin_lock_irq(&pidmap_lock);
+	if (!--ref->count) {
+		hlist_del_init(&ref->chain);
+		kfree(ref);
+	}
+	spin_unlock_irq(&pidmap_lock);
+}
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
@@ -233,9 +300,9 @@ void __init pidhash_init(void)
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
 		pidhash_size, pidhash_shift,
-		PIDTYPE_MAX * pidhash_size * sizeof(struct hlist_head));
+		(PIDTYPE_MAX + 1) * pidhash_size * sizeof(struct hlist_head));
 
-	for (i = 0; i < PIDTYPE_MAX; i++) {
+	for (i = 0; i < PIDTYPE_MAX + 1; i++) {
 		pid_hash[i] = alloc_bootmem(pidhash_size *
 					sizeof(*(pid_hash[i])));
 		if (!pid_hash[i])

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-02 22:19       ` Eric W. Biederman
  2006-03-03 16:56         ` Oleg Nesterov
@ 2006-03-06 21:06         ` Oleg Nesterov
  2006-03-06 22:18           ` Eric W. Biederman
                             ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-06 21:06 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

I think I have a really good idea.

Forget about task ref for a moment. I think we can greatly
simplify the pids management. We don't need PIDTYPE_MAX hash tables,
we need only one.

The plan:

	kill PIDTYPE_TGID
	(copy_process/unhash_process need a simple fix)

	kill 'struct pid'

Now,

	struct task_struct
	{
		...
		struct list_head	pids[PIDTYPE_MAX];
		struct list_head	tgrp;
		...
	};

	static inline struct task_struct *next_thread(struct task_struct *p)
	{
		return list_entry(p->tgrp.next, struct task_struct, tgrp);
	}

	struct pid_head
	{
		pid_t			nr;
		struct hlist_node	chain;
		struct list_head	tasks[PIDTYPE_MAX];
	};

kernel/pid.c:

	static kmem_cache_t *pid_cachep;

	static struct hlist_head *pid_hash;
	#define	pid_bucket(nr)	(pid_hash + pid_hashfn(nr))

	// alloc_pidmap() becomes static,
	// do_fork() calls this instead
	struct pid_head *alloc_pid(void)
	{
		struct pid_head *pid;

		pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);

		if (likely(pid)) {
			enum pid_type type;

			for (type = 0; type < PIDTYPE_MAX; ++type)
				INIT_LIST_HEAD(pid->tasks + type);

			pid->nr = alloc_pidmap();
			hlist_add_head_rcu(&pid->chain, pid_bucket(pid->nr));
		}

		return pid;
	}

	void free_pid(struct pid_head *pid)
	{
		free_pidmap(pid->nr);
		kmem_cache_free(pid_cachep, pid);
	}

	static struct pid_head *find_pid(pid_t nr)
	{
		struct hlist_node *node;
		struct pid_head *pid;

		hlist_for_each_entry_rcu(pid, node, pid_bucket(nr), chain)
			if (pid->nr == nr)
				return pid;

		return NULL;
	}

	struct list_head *find_pid_list(enum pid_type type, pid_t nr)
	{
		struct pid_head *pid;

		pid = find_pid(nr);
		if (pid)
			return pid->tasks + type;

		return NULL;
	}

	void attach_pid(struct task_struct *task, enum pid_type type, pid_t nr)
	{
		struct list_head *list;

		list = find_pid_list(type, nr);
		BUG_ON(!list);
		list_add_tail_rcu(task->pids + type, list);
	}

	static inline struct list_head *__detach_pid(struct task_struct *task,
								enum pid_type type)
	{
		struct list_head *list;

		list = task->pids + type;
		list_del_rcu(list);	// it doesn't touch ->next
		return list->next;
	}

	void detach_pid(struct task_struct *task, enum pid_type type)
	{
		struct list_head *head;
		struct pid_head *pid;

		head = __detach_pid(task, type);
		if (!list_empty(head))
			return;

		pid = list_entry(head, struct pid_head, tasks[type]);

		for (type = 0; type < PIDTYPE_MAX; ++type)
			if (!list_empty(pid->tasks + type))
				return;

		free_pid(pid);
	}

We don't need ugly do_each_task_pid/while_each_task_pid anymore:

	#define	for_each_task_pid(head, pid, type, task)		\
		if ((head = find_pid_list(type, pid)))			\
			list_for_each_entry(task, head, pids[type])
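
A side note on the macro shape: a macro that expands to a bare `if` can capture an `else` that the caller writes after it. A variant that folds the lookup into the `for` initializer avoids that. The sketch below is a self-contained userspace miniature of "look up a head, then walk its list"; `struct bucket`, `find_bucket()`, `for_each_item()` and `sum_items()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of "look up a head, then walk its list". */
struct item {
	int value;
	struct item *next;
};

struct bucket {
	int key;
	struct item *first;
};

static struct item item3 = { 3, NULL };
static struct item item2 = { 2, &item3 };
static struct item item1 = { 1, &item2 };
static struct bucket buckets[] = {
	{ 10, &item1 },		/* key 10 holds 1 -> 2 -> 3 */
	{ 20, NULL },		/* key 20 exists but is empty */
};

static struct bucket *find_bucket(int key)
{
	size_t i;

	for (i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
		if (buckets[i].key == key)
			return &buckets[i];
	return NULL;
}

/*
 * The lookup runs once in the for-initializer; a failed lookup simply
 * yields zero iterations.  No "if" is exposed to the caller, so a
 * following "else" cannot bind to the macro.
 */
#define for_each_item(b, key, it)					\
	for ((b) = find_bucket(key), (it) = (b) ? (b)->first : NULL;	\
	     (it) != NULL;						\
	     (it) = (it)->next)

int sum_items(int key)
{
	struct bucket *b;
	struct item *it;
	int sum = 0;

	for_each_item(b, key, it)
		sum += it->value;
	return sum;
}
```

A missing key and an empty bucket both produce an empty loop, which is the same behavior callers get from the `if`-guarded form.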


And now we can implement pid_ref almost for free, just add ->count
to 'struct pid_head'.

What do you think?

Oleg.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-06 21:06         ` Oleg Nesterov
@ 2006-03-06 22:18           ` Eric W. Biederman
  2006-03-07 20:44             ` Oleg Nesterov
  2006-03-07  1:39           ` Eric W. Biederman
  2006-03-07 13:12           ` Eric W. Biederman
  2 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-06 22:18 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

Oleg Nesterov <oleg@tv-sign.ru> writes:

> I think I have a really good idea.
>
> Forget about task ref for a moment. I think we can greatly
> simplify the pids management. We don't need PIDTYPE_MAX hash tables,
> we need only one.

I like it.  If we run top we wind up with the same number of dynamic
allocations as with task_refs (because /proc uses them).  The amount of
memory utilized is lower.  Probes for unused sessions and process
groups are a little more expensive but not noticeably so.

Unless we can implement do_each_task_pid/while_each_task_pid in terms
of for_each_task_pid, I am nervous about making the conversion.

During fork is a very nice time to allocate these as it allows the
rest of the code to assume they are always available.

I think we had something similar several years ago, that's where
the name struct pid came from.  But it used a separate head for each
type of pid, and it used a separate structure for what we now embed
in struct task.

It completely breaks my patch for multiple pid spaces. Oh well it
isn't merged anyway. :)

> And now we can implement pid_ref almost for free, just add ->count
> to 'struct pid_head'.
>
> What do you think?

I will take a good hard look at it once I send off my patches to shore
up task_refs in the -mm tree.

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-06 21:06         ` Oleg Nesterov
  2006-03-06 22:18           ` Eric W. Biederman
@ 2006-03-07  1:39           ` Eric W. Biederman
  2006-03-07 20:38             ` Oleg Nesterov
  2006-03-07 13:12           ` Eric W. Biederman
  2 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-07  1:39 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

Oleg Nesterov <oleg@tv-sign.ru> writes:

> I think I have a really good idea.
>
> Forget about task ref for a moment. I think we can greatly
> simplify the pids management. We don't need PIDTYPE_MAX hash tables,
> we need only one.
>
> The plan:
>
> 	kill PIDTYPE_TGID
> 	(copy_process/unhash_process need a simple fix)

Worth doing.  But I think it is an independent problem.
I almost wonder if it is possible to let the thread group leader
exit, at which point we really would need PIDTYPE_TGID.

Or do you need this to have the thread list embedded in the task_struct?

> 	kill 'struct pid'
>
> Now,
>
> 	struct task_struct
> 	{
> 		...
> 		struct list_head	pids[PIDTYPE_MAX];
> 		struct list_head	tgrp;
> 		...
> 	};
>
> 	static inline struct task_struct *next_thread(struct task_struct *p)
> 	{
> 		return list_entry(p->tgrp.next, struct task_struct, tgrp);
> 	}
>
> 	struct pid_head
> 	{
> 		pid_t			nr;
> 		struct hlist_node	chain;
> 		struct list_head	tasks[PIDTYPE_MAX];
> 	};

pid_head is decent but I am very tempted to call this struct pid.
Especially if we start getting a lot of pointers to them a simple
name that makes sense is useful.

> kernel/pid.c:
>
> 	static kmem_cache_t *pid_cachep;
>
> 	static struct hlist_head *pid_hash;
> 	#define	pid_bucket(nr)	(pid_hash + pid_hashfn(nr))
>
> 	// alloc_pidmap() becomes static,
> 	// do_fork() calls this instead
> 	struct pid_head *alloc_pid(void)
> 	{
> 		struct pid_head *pid;
>
> 		pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
>
> 		if (likely(pid)) {
> 			enum pid_type type;
>
> 			for (type = 0; type < PIDTYPE_MAX; ++type)
> 				INIT_LIST_HEAD(pid->tasks + type);
>
> 			pid->nr = alloc_pidmap();
> 			hlist_add_head_rcu(&pid->chain, pid_bucket(pid->nr));
> 		}
>
> 		return pid;
> 	}

Hmm.  I guess that works.  I'm tempted to still return just a pid_t.
I guess I can't see how the struct pid_head will be used.

There may be another problem here as well.  I don't think we have a lock
at this point that makes us safe to update the hash table.

> 	static struct pid_head *find_pid(pid_t nr)
> 	{
> 		struct hlist_node *node;
> 		struct pid_head *pid;
>
> 		hlist_for_each_entry_rcu(pid, node, pid_bucket(nr), chain)
> 			if (pid->nr == nr)
> 				return pid;
>
> 		return NULL;
> 	}

I'm pretty certain there are uses for find_pid outside of pid.c.

> And now we can implement pid_ref almost for free, just add ->count
> to 'struct pid_head'.

If we want to kill the tasklist_lock we also want to add a lock
to struct pid_head.  Otherwise I don't see how we can safely bump
the count above zero.  But using the tasklist_lock for the first
version shouldn't be a problem.


Eric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-06 21:06         ` Oleg Nesterov
  2006-03-06 22:18           ` Eric W. Biederman
  2006-03-07  1:39           ` Eric W. Biederman
@ 2006-03-07 13:12           ` Eric W. Biederman
  2006-03-07 21:02             ` Oleg Nesterov
  2 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-07 13:12 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel, Ingo Molnar


Horrible incremental patch...

Ok.  I just threw this together and it seems to work.

I need to head back to bed, but I figured since I got
this to boot successfully that I would throw this out
so someone could look at it while I sleep.

This patch is way too big and ugly to go in like this and
it would probably make even more sense if it would
simply replace part of what is in the -mm tree.  
Andrew any suggestions?

There is a neat trick in here for implementing an
rcu findable data structure that is also reference
counted, that allows the reference count to be safely
incremented without a lock.
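
The message does not spell the trick out. One well-known way to get this property is "increment unless zero": a reader that found the object through an RCU-style lookup only takes a reference if the count has not already dropped to zero, and treats a zero count as "not found". The kernel offers atomic_inc_not_zero() for this; whether the patch below does exactly that is an assumption. A minimal userspace sketch with C11 atomics, using the hypothetical names `struct counted`, `get_unless_zero()` and `put_ref()`:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; a count of 0 means "being torn down". */
struct counted {
	atomic_int count;
};

/* Take a reference only if the object is still live. */
bool get_unless_zero(struct counted *c)
{
	int old = atomic_load(&c->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true when the caller dropped the last one. */
bool put_ref(struct counted *c)
{
	return atomic_fetch_sub(&c->count, 1) == 1;
}
```

The sketch only covers the counter. For the lookup to be safe, the memory must also stay valid while readers may still be examining it, which in the kernel means deferring the actual free until after an RCU grace period.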

Oleg, anyway, you can now look and see how I have butchered
your great idea :)

Eric



---

 drivers/char/tty_io.c   |   29 ++++---
 fs/exec.c               |    2 -
 fs/fcntl.c              |   10 ++-
 fs/ioprio.c             |   10 ++-
 fs/proc/base.c          |   17 +++-
 fs/proc/inode.c         |    4 +
 fs/proc/internal.h      |    7 +-
 fs/proc/task_mmu.c      |    4 +
 include/linux/pid.h     |   43 ++++++-----
 include/linux/proc_fs.h |    4 +
 include/linux/sched.h   |    4 +
 kernel/Makefile         |    2 -
 kernel/capability.c     |    5 +
 kernel/cpuset.c         |   11 +--
 kernel/exit.c           |   25 ++++--
 kernel/fork.c           |   23 +++---
 kernel/pid.c            |  186 +++++++++++++++++++++++++++--------------------
 kernel/signal.c         |    5 +
 kernel/sys.c            |   17 +++-
 19 files changed, 226 insertions(+), 182 deletions(-)

e5062e605992e32e15a75eade0fee34c8355353a
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index dc8d79d..7433c2a 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -1092,7 +1092,8 @@ static void do_tty_hangup(void *data)
 	
 	read_lock(&tasklist_lock);
 	if (tty->session > 0) {
-		do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+		struct pid *pid;
+		for_each_task_pid(tty->session, PIDTYPE_SID, p, pid) {
 			if (p->signal->tty == tty)
 				p->signal->tty = NULL;
 			if (!p->signal->leader)
@@ -1101,7 +1102,7 @@ static void do_tty_hangup(void *data)
 			group_send_sig_info(SIGCONT, SEND_SIG_PRIV, p);
 			if (tty->pgrp > 0)
 				p->signal->tty_old_pgrp = tty->pgrp;
-		} while_each_task_pid(tty->session, PIDTYPE_SID, p);
+		}
 	}
 	read_unlock(&tasklist_lock);
 
@@ -1184,6 +1185,7 @@ void disassociate_ctty(int on_exit)
 {
 	struct tty_struct *tty;
 	struct task_struct *p;
+	struct pid *pid;
 	int tty_pgrp = -1;
 
 	lock_kernel();
@@ -1218,9 +1220,9 @@ void disassociate_ctty(int on_exit)
 
 	/* Now clear signal->tty under the lock */
 	read_lock(&tasklist_lock);
-	do_each_task_pid(current->signal->session, PIDTYPE_SID, p) {
+	for_each_task_pid(current->signal->session, PIDTYPE_SID, p, pid) {
 		p->signal->tty = NULL;
-	} while_each_task_pid(current->signal->session, PIDTYPE_SID, p);
+	}
 	read_unlock(&tasklist_lock);
 	mutex_unlock(&tty_mutex);
 	unlock_kernel();
@@ -1922,15 +1924,16 @@ static void release_dev(struct file * fi
 	 */
 	if (tty_closing || o_tty_closing) {
 		struct task_struct *p;
+		struct pid *pid;
 
 		read_lock(&tasklist_lock);
-		do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+		for_each_task_pid(tty->session, PIDTYPE_SID, p, pid) {
 			p->signal->tty = NULL;
-		} while_each_task_pid(tty->session, PIDTYPE_SID, p);
+		}
 		if (o_tty)
-			do_each_task_pid(o_tty->session, PIDTYPE_SID, p) {
+			for_each_task_pid(o_tty->session, PIDTYPE_SID, p, pid) {
 				p->signal->tty = NULL;
-			} while_each_task_pid(o_tty->session, PIDTYPE_SID, p);
+			}
 		read_unlock(&tasklist_lock);
 	}
 
@@ -2358,14 +2361,15 @@ static int tiocsctty(struct tty_struct *
 		 * tty for another session group!
 		 */
 		if ((arg == 1) && capable(CAP_SYS_ADMIN)) {
+			struct pid *pid;
 			/*
 			 * Steal it away
 			 */
 
 			read_lock(&tasklist_lock);
-			do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+			for_each_task_pid(tty->session, PIDTYPE_SID, p, pid) {
 				p->signal->tty = NULL;
-			} while_each_task_pid(tty->session, PIDTYPE_SID, p);
+			}
 			read_unlock(&tasklist_lock);
 		} else
 			return -EPERM;
@@ -2680,6 +2684,7 @@ static void __do_SAK(void *arg)
 	int		i;
 	struct file	*filp;
 	struct tty_ldisc *disc;
+	struct pid *pid;
 	struct fdtable *fdt;
 	
 	if (!tty)
@@ -2698,12 +2703,12 @@ static void __do_SAK(void *arg)
 	rcu_read_lock();
 	read_lock(&tasklist_lock);
 	/* Kill the entire session */
-	do_each_task_pid(session, PIDTYPE_SID, p) {
+	for_each_task_pid(session, PIDTYPE_SID, p, pid) {
 		printk(KERN_NOTICE "SAK: killed process %d"
 			" (%s): p->signal->session==tty->session\n",
 			p->pid, p->comm);
 		send_sig(SIGKILL, p, 1);
-	} while_each_task_pid(session, PIDTYPE_SID, p);
+	}
 	/* Now kill any processes that happen to have the
 	 * tty open.
 	 */
diff --git a/fs/exec.c b/fs/exec.c
index d961639..c7fe6e8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -708,7 +708,7 @@ static int de_thread(struct task_struct 
 		 * Note: The old leader also uses thispid until release_task
 		 *       is called.  Odd but simple and correct.
 		 */
-		detach_pid(current, PIDTYPE_PID);
+		detach_pid(current, PIDTYPE_PID, current->pid);
 		current->pid = leader->pid;
 		attach_pid(current, PIDTYPE_PID,  current->pid);
 		attach_pid(current, PIDTYPE_PGID, current->signal->pgrp);
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 03c7895..deb4ca7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -485,9 +485,10 @@ void send_sigio(struct fown_struct *fown
 			send_sigio_to_task(p, fown, fd, band);
 		}
 	} else {
-		do_each_task_pid(-pid, PIDTYPE_PGID, p) {
+		struct pid *pidp;
+		for_each_task_pid(-pid, PIDTYPE_PGID, p, pidp) {
 			send_sigio_to_task(p, fown, fd, band);
-		} while_each_task_pid(-pid, PIDTYPE_PGID, p);
+		}
 	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
@@ -520,9 +521,10 @@ int send_sigurg(struct fown_struct *fown
 			send_sigurg_to_task(p, fown);
 		}
 	} else {
-		do_each_task_pid(-pid, PIDTYPE_PGID, p) {
+		struct pid *pidp;
+		for_each_task_pid(-pid, PIDTYPE_PGID, p, pidp) {
 			send_sigurg_to_task(p, fown);
-		} while_each_task_pid(-pid, PIDTYPE_PGID, p);
+		}
 	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
diff --git a/fs/ioprio.c b/fs/ioprio.c
index ca77008..96e4044 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -51,6 +51,7 @@ asmlinkage long sys_ioprio_set(int which
 	int data = IOPRIO_PRIO_DATA(ioprio);
 	struct task_struct *p, *g;
 	struct user_struct *user;
+	struct pid *pid;
 	int ret;
 
 	switch (class) {
@@ -85,11 +86,11 @@ asmlinkage long sys_ioprio_set(int which
 		case IOPRIO_WHO_PGRP:
 			if (!who)
 				who = process_group(current);
-			do_each_task_pid(who, PIDTYPE_PGID, p) {
+			for_each_task_pid(who, PIDTYPE_PGID, p, pid) {
 				ret = set_task_ioprio(p, ioprio);
 				if (ret)
 					break;
-			} while_each_task_pid(who, PIDTYPE_PGID, p);
+			}
 			break;
 		case IOPRIO_WHO_USER:
 			if (!who)
@@ -123,6 +124,7 @@ asmlinkage long sys_ioprio_get(int which
 {
 	struct task_struct *g, *p;
 	struct user_struct *user;
+	struct pid *pid;
 	int ret = -ESRCH;
 
 	read_lock_irq(&tasklist_lock);
@@ -138,12 +140,12 @@ asmlinkage long sys_ioprio_get(int which
 		case IOPRIO_WHO_PGRP:
 			if (!who)
 				who = process_group(current);
-			do_each_task_pid(who, PIDTYPE_PGID, p) {
+			for_each_task_pid(who, PIDTYPE_PGID, p, pid) {
 				if (ret == -ESRCH)
 					ret = p->ioprio;
 				else
 					ret = ioprio_best(ret, p->ioprio);
-			} while_each_task_pid(who, PIDTYPE_PGID, p);
+			}
 			break;
 		case IOPRIO_WHO_USER:
 			if (!who)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index dc70dfc..67383a6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -908,7 +908,7 @@ static ssize_t proc_loginuid_write(struc
 	if (!capable(CAP_AUDIT_CONTROL))
 		return -EPERM;
 
-	if (current != proc_tref(inode)->task)
+	if (current != pid_task(proc_pid(inode), PIDTYPE_PID))
 		return -EPERM;
 
 	if (count > PAGE_SIZE)
@@ -1264,10 +1264,13 @@ static struct inode *proc_pid_make_inode
 	/*
 	 * grab the reference to task.
 	 */
-	ei->tref = tref_get_by_task(task);
-	if (!tref_task(ei->tref))
+	rcu_read_lock();
+	if (pid_alive(task))
+		ei->pid = get_pid(find_pid(task->pid));
+	rcu_read_unlock();
+	if (!ei->pid)
 		goto out_unlock;
-
+	
 	inode->i_uid = 0;
 	inode->i_gid = 0;
 	if (task_dumpable(task)) {
@@ -1349,11 +1352,15 @@ static int tid_fd_revalidate(struct dent
 
 static int pid_delete_dentry(struct dentry * dentry)
 {
+	struct task_struct *task;
 	/* Is the task we represent dead?
 	 * If so, then don't put the dentry on the lru list,
 	 * kill it immediately.
 	 */
-	return !proc_tref(dentry->d_inode)->task;
+	rcu_read_lock();
+	task = pid_task(proc_pid(dentry->d_inode), PIDTYPE_PID);
+	rcu_read_unlock();
+	return !task;
 }
 
 static struct dentry_operations tid_fd_dentry_operations =
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 31e0475..6dcef08 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -62,7 +62,7 @@ static void proc_delete_inode(struct ino
 	truncate_inode_pages(&inode->i_data, 0);
 
 	/* Stop tracking associated processes */
-	tref_put(PROC_I(inode)->tref);
+	put_pid(PROC_I(inode)->pid);
 
 	/* Let go of any associated proc directory entry */
 	de = PROC_I(inode)->pde;
@@ -91,7 +91,7 @@ static struct inode *proc_alloc_inode(st
 	ei = (struct proc_inode *)kmem_cache_alloc(proc_inode_cachep, SLAB_KERNEL);
 	if (!ei)
 		return NULL;
-	ei->tref = NULL;
+	ei->pid = NULL;
 	ei->fd = 0;
 	ei->op.proc_get_link = NULL;
 	ei->pde = NULL;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 37f1648..146a434 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -10,7 +10,6 @@
  */
 
 #include <linux/proc_fs.h>
-#include <linux/task_ref.h>
 
 struct vmalloc_info {
 	unsigned long	used;
@@ -51,14 +50,14 @@ void free_proc_entry(struct proc_dir_ent
 
 int proc_init_inodecache(void);
 
-static inline struct task_ref *proc_tref(struct inode *inode)
+static inline struct pid *proc_pid(struct inode *inode)
 {
-	return PROC_I(inode)->tref;
+	return PROC_I(inode)->pid;
 }
 
 static inline struct task_struct *get_proc_task(struct inode *inode)
 {
-	return get_tref_task(proc_tref(inode));
+	return get_pid_task(proc_pid(inode), PIDTYPE_PID);
 }
 
 static inline int proc_fd(struct inode *inode)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0491eb7..11592dc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -320,7 +320,7 @@ static void *m_start(struct seq_file *m,
 	if (last_addr == -1UL)
 		return NULL;
 
-	priv->task = get_tref_task(priv->tref);
+	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
 	if (!priv->task)
 		return NULL;
 
@@ -416,7 +416,7 @@ static int do_maps_open(struct inode *in
 	int ret = -ENOMEM;
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
 	if (priv) {
-		priv->tref = proc_tref(inode);
+		priv->pid = proc_pid(inode);
 		ret = seq_open(file, ops);
 		if (!ret) {
 			struct seq_file *m = file->private_data;
diff --git a/include/linux/pid.h b/include/linux/pid.h
index da5cd89..11992c1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -1,7 +1,7 @@
 #ifndef _LINUX_PID_H
 #define _LINUX_PID_H
 
-struct task_ref;
+#include <linux/rcupdate.h>
 
 enum pid_type
 {
@@ -13,17 +13,25 @@ enum pid_type
 
 struct pid
 {
+	atomic_t count;
 	/* Try to keep pid_chain in the same cacheline as nr for find_pid */
 	int nr;
 	struct hlist_node pid_chain;
 	/* list of pids with the same nr, only one of them is in the hash */
-	struct list_head pid_list;
-	/* Does a weak reference of this type exist to the task struct? */
-	struct task_ref *tref;
+	struct list_head tasks[PIDTYPE_MAX];
+	struct rcu_head rcu;
 };
 
-#define pid_task(elem, type) \
-	list_entry(elem, struct task_struct, pids[type].pid_list)
+static inline struct pid *get_pid(struct pid *pid)
+{
+	if (pid)
+		atomic_inc(&pid->count);
+	return pid;
+}
+
+extern void FASTCALL(put_pid(struct pid *pid));
+extern struct task_struct *FASTCALL(pid_task(struct pid *pid, enum pid_type type));
+extern struct task_struct *FASTCALL(get_pid_task(struct pid *pid, enum pid_type type));
 
 /*
  * attach_pid() and detach_pid() must be called with the tasklist_lock
@@ -31,30 +39,23 @@ struct pid
  */
 extern int FASTCALL(attach_pid(struct task_struct *task, enum pid_type type, int nr));
 
-extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type));
+extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type, int nr));
 
 /*
  * look up a PID in the hash table. Must be called with the tasklist_lock
  * held.
  */
-extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+extern struct pid *FASTCALL(find_pid(int nr));
 
 extern struct task_struct *find_task_by_pid_type(int type, int pid);
 #define find_task_by_pid(nr)	find_task_by_pid_type(PIDTYPE_PID, nr)
 
-extern int alloc_pidmap(void);
-extern void FASTCALL(free_pidmap(int));
+extern struct pid *alloc_pid(void);
+extern void FASTCALL(free_pid(struct pid *pid));
+
+#define for_each_task_pid(who, type, task, pid)	\
+	if ((pid = find_pid(who)))		\
+		list_for_each_entry_rcu(task, &pid->tasks[type], pid_list[type])
 
-#define do_each_task_pid(who, type, task)				\
-	if ((task = find_task_by_pid_type(type, who))) {		\
-		prefetch((task)->pids[type].pid_list.next);		\
-		do {
-
-#define while_each_task_pid(who, type, task)				\
-		} while (task = pid_task((task)->pids[type].pid_list.next,\
-						type),			\
-			prefetch((task)->pids[type].pid_list.next),	\
-			hlist_unhashed(&(task)->pids[type].pid_chain));	\
-	}								\
 
 #endif /* _LINUX_PID_H */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 7f5a400..b5fc62f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -247,7 +247,7 @@ extern void kclist_add(struct kcore_list
 #endif
 
 struct proc_inode {
-	struct task_ref *tref;
+	struct pid *pid;
 	int fd;
 	union {
 		int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
@@ -268,7 +268,7 @@ static inline struct proc_dir_entry *PDE
 }
 
 struct proc_maps_private {
-	struct task_ref *tref;
+	struct pid *pid;
 	struct task_struct *task;
 	struct vm_area_struct *tail_vma;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e2500c..a54e2c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,7 +764,7 @@ struct task_struct {
 	struct task_struct *group_leader;	/* threadgroup leader */
 
 	/* PID/PID hash table linkage. */
-	struct pid pids[PIDTYPE_MAX];
+	struct list_head pid_list[PIDTYPE_MAX];
 	struct list_head threads;
 
 	struct completion *vfork_done;		/* for vfork() */
@@ -902,7 +902,7 @@ static inline pid_t process_group(struct
  */
 static inline int pid_alive(struct task_struct *p)
 {
-	return p->pids[PIDTYPE_PID].pid_list.prev != LIST_POISON2;
+	return p->pid_list[PIDTYPE_PID].prev != LIST_POISON2;
 }
 
 extern void free_task(struct task_struct *tsk);
diff --git a/kernel/Makefile b/kernel/Makefile
index 1905c80..3401e54 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -5,7 +5,7 @@
 obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o workqueue.o pid.o task_ref.o \
+	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o
diff --git a/kernel/capability.c b/kernel/capability.c
index bfa3c92..adae2ab 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -97,10 +97,11 @@ static inline int cap_set_pg(int pgrp, k
 			      kernel_cap_t *permitted)
 {
 	task_t *g, *target;
+	struct pid *pid;
 	int ret = -EPERM;
 	int found = 0;
 
-	do_each_task_pid(pgrp, PIDTYPE_PGID, g) {
+	for_each_task_pid(pgrp, PIDTYPE_PGID, g, pid) {
 		target = g;
 		while_each_thread(g, target) {
 			if (!security_capset_check(target, effective,
@@ -113,7 +114,7 @@ static inline int cap_set_pg(int pgrp, k
 			}
 			found = 1;
 		}
-	} while_each_task_pid(pgrp, PIDTYPE_PGID, g);
+	}
 
 	if (!found)
 	     ret = 0;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d81dd44..fb2ddd1 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -49,7 +49,6 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
-#include <linux/task_ref.h>
 
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
@@ -2380,7 +2379,7 @@ void __cpuset_memory_pressure_bump(void)
 static int proc_cpuset_show(struct seq_file *m, void *v)
 {
 	struct cpuset *cs;
-	struct task_ref *tref;
+	struct pid *pid;
 	struct task_struct *tsk;
 	char *buf;
 	int retval;
@@ -2391,8 +2390,8 @@ static int proc_cpuset_show(struct seq_f
 		goto out;
 
 	retval = -ESRCH;
-	tref = m->private;
-	tsk = get_tref_task(tref);
+	pid = m->private;
+	tsk = get_pid_task(pid, PIDTYPE_PID);
 	if (!tsk)
 		goto out_free;
 
@@ -2418,8 +2417,8 @@ out:
 
 static int cpuset_open(struct inode *inode, struct file *file)
 {
-	struct task_ref *tref = PROC_I(inode)->tref;
-	return single_open(file, proc_cpuset_show, tref);
+	struct pid *pid = PROC_I(inode)->pid;
+	return single_open(file, proc_cpuset_show, pid);
 }
 
 struct file_operations proc_cpuset_operations = {
diff --git a/kernel/exit.c b/kernel/exit.c
index df406fe..e69c6d8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,11 +50,11 @@ static void exit_mm(struct task_struct *
 static void __unhash_process(struct task_struct *p)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
+	detach_pid(p, PIDTYPE_PID, p->pid);
 	list_del_rcu(&p->threads);
 	if (thread_group_leader(p)) {
-		detach_pid(p, PIDTYPE_PGID);
-		detach_pid(p, PIDTYPE_SID);
+		detach_pid(p, PIDTYPE_PGID, p->signal->pgrp);
+		detach_pid(p, PIDTYPE_SID,  p->signal->session);
 
 		list_del_init(&p->tasks);
 		__get_cpu_var(process_counts)--;
@@ -178,15 +178,16 @@ repeat:
 int session_of_pgrp(int pgrp)
 {
 	struct task_struct *p;
+	struct pid *pid;
 	int sid = -1;
 
 	read_lock(&tasklist_lock);
-	do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, pid) {
 		if (p->signal->session > 0) {
 			sid = p->signal->session;
 			goto out;
 		}
-	} while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+	}
 	p = find_task_by_pid(pgrp);
 	if (p)
 		sid = p->signal->session;
@@ -207,9 +208,10 @@ out:
 static int will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
 {
 	struct task_struct *p;
+	struct pid *pid;
 	int ret = 1;
 
-	do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, pid) {
 		if (p == ignored_task
 				|| p->exit_state
 				|| p->real_parent->pid == 1)
@@ -219,7 +221,7 @@ static int will_become_orphaned_pgrp(int
 			ret = 0;
 			break;
 		}
-	} while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+	}
 	return ret;	/* (sighing) "Often!" */
 }
 
@@ -238,8 +240,9 @@ static int has_stopped_jobs(int pgrp)
 {
 	int retval = 0;
 	struct task_struct *p;
+	struct pid *pid;
 
-	do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, pid) {
 		if (p->state != TASK_STOPPED)
 			continue;
 
@@ -255,7 +258,7 @@ static int has_stopped_jobs(int pgrp)
 
 		retval = 1;
 		break;
-	} while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+	}
 	return retval;
 }
 
@@ -305,12 +308,12 @@ void __set_special_pids(pid_t session, p
 	struct task_struct *curr = current->group_leader;
 
 	if (curr->signal->session != session) {
-		detach_pid(curr, PIDTYPE_SID);
+		detach_pid(curr, PIDTYPE_SID, curr->signal->session);
 		curr->signal->session = session;
 		attach_pid(curr, PIDTYPE_SID, session);
 	}
 	if (process_group(curr) != pgrp) {
-		detach_pid(curr, PIDTYPE_PGID);
+		detach_pid(curr, PIDTYPE_PGID, curr->signal->pgrp);
 		curr->signal->pgrp = pgrp;
 		attach_pid(curr, PIDTYPE_PGID, pgrp);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index eb5e7ec..0144135 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -172,7 +172,6 @@ void __init fork_init(unsigned long memp
 
 static struct task_struct *dup_task_struct(struct task_struct *orig)
 {
-	int type;
 	struct task_struct *tsk;
 	struct thread_info *ti;
 
@@ -196,11 +195,7 @@ static struct task_struct *dup_task_stru
 	atomic_set(&tsk->usage,2);
 	atomic_set(&tsk->fs_excl, 0);
 	tsk->btrace_seq = 0;
-	/* Initially there are no weak references to this task */
-	for (type = 0; type < PIDTYPE_MAX; type++) {
-		tsk->pids[type].nr = 0;
-		tsk->pids[type].tref = NULL;
-	}
+
 	return tsk;
 }
 
@@ -1332,17 +1327,19 @@ long do_fork(unsigned long clone_flags,
 {
 	struct task_struct *p;
 	int trace = 0;
-	long pid = alloc_pidmap();
+	struct pid *pid = alloc_pid();
+	long nr;
 
-	if (pid < 0)
+	if (!pid)
 		return -EAGAIN;
+	nr = pid->nr;
 	if (unlikely(current->ptrace)) {
 		trace = fork_traceflag (clone_flags);
 		if (trace)
 			clone_flags |= CLONE_PTRACE;
 	}
 
-	p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid);
+	p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, nr);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1369,7 +1366,7 @@ long do_fork(unsigned long clone_flags,
 			p->state = TASK_STOPPED;
 
 		if (unlikely (trace)) {
-			current->ptrace_message = pid;
+			current->ptrace_message = nr;
 			ptrace_notify ((trace << 8) | SIGTRAP);
 		}
 
@@ -1379,10 +1376,10 @@ long do_fork(unsigned long clone_flags,
 				ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
 		}
 	} else {
-		free_pidmap(pid);
-		pid = PTR_ERR(p);
+		free_pid(pid);
+		nr = PTR_ERR(p);
 	}
-	return pid;
+	return nr;
 }
 
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
diff --git a/kernel/pid.c b/kernel/pid.c
index a3cc593..4875fc6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -27,11 +27,11 @@
 #include <linux/kgdb.h>
 #include <linux/bootmem.h>
 #include <linux/hash.h>
-#include <linux/task_ref.h>
 
 #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
-static struct hlist_head *pid_hash[PIDTYPE_MAX];
+static struct hlist_head *pid_hash;
 static int pidhash_shift;
+static kmem_cache_t *pid_cachep;
 
 int pid_max = PID_MAX_DEFAULT;
 int last_pid;
@@ -64,7 +64,7 @@ static pidmap_t pidmap_array[PIDMAP_ENTR
 
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 
-fastcall void free_pidmap(int pid)
+static fastcall void free_pidmap(int pid)
 {
 	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
 	int offset = pid & BITS_PER_PAGE_MASK;
@@ -73,7 +73,7 @@ fastcall void free_pidmap(int pid)
 	atomic_inc(&map->nr_free);
 }
 
-int alloc_pidmap(void)
+static int alloc_pidmap(void)
 {
 	int i, offset, max_scan, pid, last = last_pid;
 	pidmap_t *map;
@@ -133,13 +133,68 @@ int alloc_pidmap(void)
 	return -1;
 }
 
-struct pid * fastcall find_pid(enum pid_type type, int nr)
+fastcall void put_pid(struct pid *pid)
+{
+	if (!pid)
+		return;
+	if ((atomic_read(&pid->count) == 1) ||
+	     atomic_dec_and_test(&pid->count))
+		kmem_cache_free(pid_cachep, pid);
+}
+
+static void rcu_put_pid(struct rcu_head *rhp)
+{
+	struct pid *pid = container_of(rhp, struct pid, rcu);
+	put_pid(pid);
+}
+
+fastcall void free_pid(struct pid *pid)
+{
+	spin_lock(&pidmap_lock);
+	hlist_del_rcu(&pid->pid_chain);
+	spin_unlock(&pidmap_lock);
+
+	free_pidmap(pid->nr);
+	call_rcu(&pid->rcu, rcu_put_pid);
+}
+
+struct pid *alloc_pid(void)
+{
+	struct pid *pid;
+	enum pid_type type;
+	int nr = -1;
+
+	pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
+	if (!pid)
+		goto out;
+	
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		INIT_LIST_HEAD(&pid->tasks[type]);
+	
+	nr = alloc_pidmap();
+	if (nr < 0)
+		goto out_free;
+
+	atomic_set(&pid->count, 1);
+	pid->nr = nr;
+	spin_lock(&pidmap_lock);
+	hlist_add_head_rcu(&pid->pid_chain, &pid_hash[pid_hashfn(pid->nr)]);
+	spin_unlock(&pidmap_lock);
+out:
+	return pid;
+out_free:
+	kmem_cache_free(pid_cachep, pid);
+	pid = NULL;
+	goto out;
+}
+
+struct pid * fastcall find_pid(int nr)
 {
 	struct hlist_node *elem;
 	struct pid *pid;
 
 	hlist_for_each_entry_rcu(pid, elem,
-			&pid_hash[type][pid_hashfn(nr)], pid_chain) {
+			&pid_hash[pid_hashfn(nr)], pid_chain) {
 		WARN_ON(!pid->nr); /* to be removed soon */
 		if (pid->nr == nr)
 			return pid;
@@ -149,93 +204,60 @@ struct pid * fastcall find_pid(enum pid_
 
 int fastcall attach_pid(task_t *task, enum pid_type type, int nr)
 {
-	struct pid *pid, *task_pid;
+	struct pid *pid;
 
 	WARN_ON(!task->pid); /* to be removed soon */
 	WARN_ON(!nr); /* to be removed soon */
 
-	task_pid = &task->pids[type];
-	pid = find_pid(type, nr);
-	task_pid->nr = nr;
-	task_pid->tref = NULL;
-	if (pid == NULL) {
-		INIT_LIST_HEAD(&task_pid->pid_list);
-		hlist_add_head_rcu(&task_pid->pid_chain,
-				   &pid_hash[type][pid_hashfn(nr)]);
-	} else {
-		INIT_HLIST_NODE(&task_pid->pid_chain);
-		list_add_tail_rcu(&task_pid->pid_list, &pid->pid_list);
-	}
+	pid = find_pid(nr);
+	list_add_tail_rcu(&task->pid_list[type], &pid->tasks[type]);
 
 	return 0;
 }
 
-static fastcall int __detach_pid(task_t *task, enum pid_type type)
+void fastcall detach_pid(task_t *task, enum pid_type type, int nr)
 {
-	task_t *task_next;
-	struct pid *pid, *pid_next;
-	struct task_ref *tref;
-	int nr = 0;
-
-	pid = &task->pids[type];
-	tref = pid->tref;
-	pid->tref = NULL;
-	if (!hlist_unhashed(&pid->pid_chain)) {
-
-		if (list_empty(&pid->pid_list)) {
-			if (tref)
-				tref->task = NULL;
-			nr = pid->nr;
-			hlist_del_rcu(&pid->pid_chain);
-		} else {
-			task_next = pid_task(pid->pid_list.next, type);
-			pid_next = list_entry(pid->pid_list.next,
-						struct pid, pid_list);
-			pid_next->tref = tref_get(tref);
-			/* insert next pid from pid_list to hash */
-			hlist_replace_rcu(&pid->pid_chain,
-					  &pid_next->pid_chain);
-			/* Update the reference to point at the next task */
-			if (tref)
-				rcu_assign_pointer(tref->task, task_next);
-		}
-	}
+	struct pid *pid;
 
-	list_del_rcu(&pid->pid_list);
-	tref_put(tref);
-	pid->nr = 0;
+	list_del_rcu(&task->pid_list[type]);
 
-	return nr;
+	pid = find_pid(nr);
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		if (!list_empty(&pid->tasks[type]))
+			return;
+	
+	free_pid(pid);
 }
 
-void fastcall detach_pid(task_t *task, enum pid_type type)
+struct task_struct * fastcall pid_task(struct pid *pid, enum pid_type type)
 {
-	int tmp, nr;
-
-	nr = __detach_pid(task, type);
-	if (!nr)
-		return;
-
-	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
-		if (tmp != type && find_pid(tmp, nr))
-			return;
-
-	free_pidmap(nr);
+	struct task_struct *result = NULL;
+	if (pid) {
+		struct list_head *list, *next;
+		list = rcu_dereference(&pid->tasks[type]);
+		next = rcu_dereference(list->next);
+		if (list != next)
+			result = list_entry(next, struct task_struct, pid_list[type]);
+	}
+	return result;
 }
 
 task_t *find_task_by_pid_type(int type, int nr)
 {
-	struct pid *pid;
-
-	pid = find_pid(type, nr);
-	if (!pid)
-		return NULL;
-
-	return pid_task(&pid->pid_list, type);
+	return pid_task(find_pid(nr), type);
 }
 
 EXPORT_SYMBOL(find_task_by_pid_type);
 
+struct task_struct * fastcall get_pid_task(struct pid *pid, enum pid_type type)
+{
+	struct task_struct *result;
+	rcu_read_lock();
+	result = rcu_get_task_struct(pid_task(pid, type));
+	rcu_read_unlock();
+	return result;
+}
+
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
@@ -243,7 +265,7 @@ EXPORT_SYMBOL(find_task_by_pid_type);
  */
 void __init pidhash_init(void)
 {
-	int i, j, pidhash_size;
+	int i, pidhash_size;
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
@@ -252,16 +274,14 @@ void __init pidhash_init(void)
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
 		pidhash_size, pidhash_shift,
-		PIDTYPE_MAX * pidhash_size * sizeof(struct hlist_head));
+		pidhash_size * sizeof(struct hlist_head));
+
+	pid_hash = alloc_bootmem(pidhash_size *	sizeof(*(pid_hash)));
+	if (!pid_hash)
+		panic("Could not alloc pidhash!\n");
+	for (i = 0; i < pidhash_size; i++)
+		INIT_HLIST_HEAD(&pid_hash[i]);
 
-	for (i = 0; i < PIDTYPE_MAX; i++) {
-		pid_hash[i] = alloc_bootmem(pidhash_size *
-					sizeof(*(pid_hash[i])));
-		if (!pid_hash[i])
-			panic("Could not alloc pidhash!\n");
-		for (j = 0; j < pidhash_size; j++)
-			INIT_HLIST_HEAD(&pid_hash[i][j]);
-	}
 #ifdef CONFIG_KGDB
 	kgdb_pid_init_done++;
 #endif
@@ -273,4 +293,8 @@ void __init pidmap_init(void)
 	/* Reserve PID 0. We never call free_pidmap(0) */
 	set_bit(0, pidmap_array->page);
 	atomic_dec(&pidmap_array->nr_free);
+
+	pid_cachep = kmem_cache_create("pid", sizeof(struct pid),
+					__alignof__(struct pid),
+					SLAB_PANIC, NULL, NULL);
 }
diff --git a/kernel/signal.c b/kernel/signal.c
index 2dfaa50..6390879 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1060,6 +1060,7 @@ int group_send_sig_info(int sig, struct 
 int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
 {
 	struct task_struct *p = NULL;
+	struct pid *pid;
 	int retval, success;
 
 	if (pgrp <= 0)
@@ -1067,11 +1068,11 @@ int __kill_pg_info(int sig, struct sigin
 
 	success = 0;
 	retval = -ESRCH;
-	do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, pid) {
 		int err = group_send_sig_info(sig, info, p);
 		success |= !err;
 		retval = err;
-	} while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+	}
 	return success ? 0 : retval;
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 3d46f39..ad92bec 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -470,6 +470,7 @@ asmlinkage long sys_setpriority(int whic
 {
 	struct task_struct *g, *p;
 	struct user_struct *user;
+	struct pid *pid;
 	int error = -EINVAL;
 
 	if (which > 2 || which < 0)
@@ -494,9 +495,9 @@ asmlinkage long sys_setpriority(int whic
 		case PRIO_PGRP:
 			if (!who)
 				who = process_group(current);
-			do_each_task_pid(who, PIDTYPE_PGID, p) {
+			for_each_task_pid(who, PIDTYPE_PGID, p, pid) {
 				error = set_one_prio(p, niceval, error);
-			} while_each_task_pid(who, PIDTYPE_PGID, p);
+			}
 			break;
 		case PRIO_USER:
 			user = current->user;
@@ -530,6 +531,7 @@ asmlinkage long sys_getpriority(int whic
 {
 	struct task_struct *g, *p;
 	struct user_struct *user;
+	struct pid *pid;
 	long niceval, retval = -ESRCH;
 
 	if (which > 2 || which < 0)
@@ -550,11 +552,11 @@ asmlinkage long sys_getpriority(int whic
 		case PRIO_PGRP:
 			if (!who)
 				who = process_group(current);
-			do_each_task_pid(who, PIDTYPE_PGID, p) {
+			for_each_task_pid(who, PIDTYPE_PGID, p, pid) {
 				niceval = 20 - task_nice(p);
 				if (niceval > retval)
 					retval = niceval;
-			} while_each_task_pid(who, PIDTYPE_PGID, p);
+			}
 			break;
 		case PRIO_USER:
 			user = current->user;
@@ -1301,11 +1303,12 @@ asmlinkage long sys_setpgid(pid_t pid, p
 
 	if (pgid != pid) {
 		struct task_struct *p;
+		struct pid *pidp;
 
-		do_each_task_pid(pgid, PIDTYPE_PGID, p) {
+		for_each_task_pid(pgid, PIDTYPE_PGID, p, pidp) {
 			if (p->signal->session == group_leader->signal->session)
 				goto ok_pgid;
-		} while_each_task_pid(pgid, PIDTYPE_PGID, p);
+		}
 		goto out;
 	}
 
@@ -1315,7 +1318,7 @@ ok_pgid:
 		goto out;
 
 	if (process_group(p) != pgid) {
-		detach_pid(p, PIDTYPE_PGID);
+		detach_pid(p, PIDTYPE_PGID, p->signal->pgrp);
 		p->signal->pgrp = pgid;
 		attach_pid(p, PIDTYPE_PGID, pgid);
 	}
-- 
1.2.2.g709a-dirty



* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-07  1:39           ` Eric W. Biederman
@ 2006-03-07 20:38             ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-07 20:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

"Eric W. Biederman" wrote:
> 
> Oleg Nesterov <oleg@tv-sign.ru> writes:
> 
> > I think I have a really good idea.
> >
> > Forget about task ref for a moment. I think we can greatly
> > simplify the pids management. We don't need PIDTYPE_MAX hash tables,
> > we need only one.
> >
> > The plan:
> >
> >       kill PIDTYPE_TGID
> >       (copy_process/unhash_process need a simple fix)
> 
> Worth doing.  But I think it is an independent problem.

Almost independent, but still we have to do it before we introduce
pid_head. Otherwise next_thread() will be broken, because ->pids[].next
could actually point to pid_head->tasks[].
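
A minimal userspace sketch of that breakage, under simplifying
assumptions: struct task, struct pid_head, naive_next() and next_task()
are hypothetical stand-ins, not kernel code. Once the ring contains a
list head that lives outside any task, following ->next from the last
task lands on the head, so a next_thread()-style walker has to skip it:

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list, in the style of the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-ins for task_struct and the proposed pid_head. */
struct task { int id; struct list_head link; };
struct pid_head { int nr; struct list_head tasks; };

/* Recover the task that embeds a given link (container_of by hand). */
#define task_of(ptr) \
	((struct task *)((char *)(ptr) - offsetof(struct task, link)))

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Naive successor: assumes every node in the ring sits inside a task. */
static struct list_head *naive_next(struct task *t)
{
	return t->link.next;
}

/* Head-aware successor: step over the head, which is not a task. */
static struct task *next_task(struct pid_head *head, struct task *t)
{
	struct list_head *n;

	assert(head && t);
	n = t->link.next;
	if (n == &head->tasks)
		n = n->next;
	return task_of(n);
}
```

With two tasks on the list, naive_next() on the second one returns the
address of head->tasks; applying task_of() to that would yield a pointer
into the pid_head rather than into a task.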

> pid_head is decent but I am very tempted to call this struct pid.
> Especially if we start getting a lot of pointers to them a simple
> name that makes sense is useful.

Agreed, will rename.

> >       // alloc_pidmap() becomes static,
> >       // do_fork() calls this instead
> >       struct pid_head *alloc_pid(void)
> >       {
>
> Hmm.  I guess that works.  I'm tempted to still return just a pid_t.
> I guess I can't see how the struct pid_head will be used.

Probably you are right.
 
> There may be another problem here as well.  I don't think we have a lock
> at this point that makes us safe to update the hash table.

Yes,

> If we want to kill the tasklist_lock we also want to add a lock
> to struct pid_head.  Otherwise I don't see how we can safely bump
> the count, above zero.  But using the tasklist_lock for the first
> version shouldn't be a problem.

... and yes. I think we can use pidmap_lock.

Also, find_pid() needs to be rcu safe. I'll try to show the code
tomorrow.

Oleg.


* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-06 22:18           ` Eric W. Biederman
@ 2006-03-07 20:44             ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-07 20:44 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

"Eric W. Biederman" wrote:
> 
> Unless we can implement do_each_task_pid/while_each_task_pid in terms
> of for_each_task_pid, I am nervous about making the conversion.

Yes, of course. Currently I have:

#define for_each_task_pid(head, who, type, task)                             \
        if ((head = find_pid(who)))                                          \
                list_for_each_entry(task, ((head)->tasks + type), pids[type])

// OBSOLETE
#define do_each_task_pid(who, type, task)                               \
        do {                                                            \
                struct pid_head * __pid_head__;                         \
                for_each_task_pid(__pid_head__, who, type, task) {

#define while_each_task_pid(who, type, task)                            \
                }                                                       \
        } while (0)

It's better not to change the users of do_each_task_pid() for some
time at least.
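
For illustration only, here is a userspace sketch of how the wrappers
above compose. It is not the kernel code: the type argument is dropped,
find_pid() is a toy linear lookup, and struct task, register_head(),
attach(), count_tasks() and the hidden each_head_ variable are all
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct task { int id; struct list_head link; };
struct pid_head { int nr; struct list_head tasks; };

#define task_of(ptr) \
	((struct task *)((char *)(ptr) - offsetof(struct task, link)))

/* Toy registry standing in for the pid hash table. */
#define MAX_HEADS 8
static struct pid_head *heads[MAX_HEADS];
static int nheads;

static void register_head(struct pid_head *h)
{
	assert(nheads < MAX_HEADS);
	h->tasks.next = h->tasks.prev = &h->tasks;
	heads[nheads++] = h;
}

static struct pid_head *find_pid(int nr)
{
	for (int i = 0; i < nheads; i++)
		if (heads[i]->nr == nr)
			return heads[i];
	return NULL;
}

static void attach(struct task *t, struct pid_head *h)
{
	t->link.prev = h->tasks.prev;
	t->link.next = &h->tasks;
	h->tasks.prev->next = &t->link;
	h->tasks.prev = &t->link;
}

/* New-style iterator: look the head up once, then walk its task list. */
#define for_each_task_pid(head, who, task)				\
	if (((head) = find_pid(who)))					\
		for ((task) = task_of((head)->tasks.next);		\
		     &(task)->link != &(head)->tasks;			\
		     (task) = task_of((task)->link.next))

/* Old-style pair, expressed in terms of the new iterator. */
#define do_each_task_pid(who, task)					\
	do {								\
		struct pid_head *each_head_;				\
		for_each_task_pid(each_head_, who, task) {

#define while_each_task_pid(who, task)					\
		}							\
	} while (0)

/* An unconverted caller keeps compiling against the old spelling. */
static int count_tasks(int who)
{
	struct task *t;
	int n = 0;

	do_each_task_pid(who, t) {
		n++;
	} while_each_task_pid(who, t);
	return n;
}
```

A break inside the body leaves the inner for loop and then falls out of
the do/while (0), so existing callers that break out early should behave
as before. The bare if in for_each_task_pid does mean a caller must not
follow the loop with an unbraced else.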

Oleg.


* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-07 13:12           ` Eric W. Biederman
@ 2006-03-07 21:02             ` Oleg Nesterov
  2006-03-07 23:00               ` Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2006-03-07 21:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

"Eric W. Biederman" wrote:
> 
>  struct pid
>  {
> +       atomic_t count;
>         /* Try to keep pid_chain in the same cacheline as nr for find_pid */
>         int nr;
>         struct hlist_node pid_chain;
>         /* list of pids with the same nr, only one of them is in the hash */
> -       struct list_head pid_list;
> -       /* Does a weak reference of this type exist to the task struct? */
> -       struct task_ref *tref;
> +       struct list_head tasks[PIDTYPE_MAX];
> +       struct rcu_head rcu;
>  };
>
> ...
>
> +static void rcu_put_pid(struct rcu_head *rhp)
> +{
> +       struct pid *pid = container_of(rhp, struct pid, rcu);
> +       put_pid(pid);
> +}

I hope we can do it without pid->rcu and rcu_put_pid(). Hopefully
we can use SLAB_DESTROY_BY_RCU. To do so we need some changes in
find_task_by_pid_type().

I'll try to look closer at this patch tomorrow.

Oleg.


* Re: [PATCH 01/23] tref: Implement task references.
  2006-03-07 21:02             ` Oleg Nesterov
@ 2006-03-07 23:00               ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2006-03-07 23:00 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

Oleg Nesterov <oleg@tv-sign.ru> writes:

> "Eric W. Biederman" wrote:
>> 
>>  struct pid
>>  {
>> +       atomic_t count;
>>         /* Try to keep pid_chain in the same cacheline as nr for find_pid */
>>         int nr;
>>         struct hlist_node pid_chain;
>>         /* list of pids with the same nr, only one of them is in the hash */
>> -       struct list_head pid_list;
>> -       /* Does a weak reference of this type exist to the task struct? */
>> -       struct task_ref *tref;
>> +       struct list_head tasks[PIDTYPE_MAX];
>> +       struct rcu_head rcu;
>>  };
>>
>> ...
>>
>> +static void rcu_put_pid(struct rcu_head *rhp)
>> +{
>> +       struct pid *pid = container_of(rhp, struct pid, rcu);
>> +       put_pid(pid);
>> +}
>
> I hope we can do it without pid->rcu and rcu_put_pid(). Hopefully
> we can use SLAB_DESTROY_BY_RCU. To do so we need some changes in
> find_task_by_pid_type().

I am fairly certain that SLAB_DESTROY_BY_RCU is an entire slab
operation, not something that applies on an individual slab
entry basis.

Delaying the decrement of the struct pid with call_rcu does have the
advantage that we can safely do an atomic_inc(&pid->count) after
looking up the pid.
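
A single-threaded userspace sketch of that argument, with loud
assumptions: defer_put_pid() and run_deferred() are hypothetical
stand-ins for call_rcu() and the end of a grace period, the struct is
cut down to what the example needs, and C11 atomics replace atomic_t.
It only shows the ordering, not real RCU:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct pid {
	atomic_int count;
	int nr;
	struct pid *next_deferred;	/* stand-in for struct rcu_head */
};

static int freed;			/* number of pids actually freed */
static struct pid *deferred;		/* puts waiting for the grace period */

static struct pid *alloc_pid(int nr)
{
	struct pid *pid = malloc(sizeof(*pid));

	if (!pid)
		return NULL;
	atomic_init(&pid->count, 1);	/* the hash table's reference */
	pid->nr = nr;
	pid->next_deferred = NULL;
	return pid;
}

static struct pid *get_pid(struct pid *pid)
{
	if (pid)
		atomic_fetch_add(&pid->count, 1);
	return pid;
}

static void put_pid(struct pid *pid)
{
	if (!pid)
		return;
	/* fetch_sub returns the old value: 1 means we dropped the last ref */
	if (atomic_fetch_sub(&pid->count, 1) == 1) {
		free(pid);
		freed++;
	}
}

/* Stand-in for call_rcu(&pid->rcu, rcu_put_pid): queue, do not drop yet. */
static void defer_put_pid(struct pid *pid)
{
	assert(pid != NULL);
	pid->next_deferred = deferred;
	deferred = pid;
}

/* Stand-in for the grace period ending: now the queued puts really run. */
static void run_deferred(void)
{
	while (deferred) {
		struct pid *pid = deferred;

		deferred = pid->next_deferred;
		put_pid(pid);
	}
}
```

Because the hash table's reference is only dropped in run_deferred(), a
reader that found the pid earlier can still call get_pid() on a count
that is known to be at least one; the object is freed only when that
reader's own put_pid() brings the count to zero.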

> I'll try to look closer at this patch tomorrow.

Thanks.  One semi-painful thing that has occurred to me is that
we may want to change the definition to:

struct pid
{
       atomic_t count;
       /* Try to keep pid_chain in the same cacheline as nr for find_pid */
       int nr;
       struct hlist_node pid_chain;
       /* lists of tasks that use this pid, one list per pid type */
       struct hlist_head tasks[PIDTYPE_MAX];
       struct rcu_head rcu;
};

Using an hlist_head for tasks cuts the size of the structure almost in
half, which is a desirable property for a long-lived structure.
Unfortunately hlist_for_each_entry_rcu takes an additional argument
compared to list_for_each_entry_rcu.
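
A small userspace sketch for checking the size difference on a given
machine. The list and rcu types below are stand-ins that mirror the usual
kernel layouts, PIDTYPE_MAX is assumed to be 4, and the struct names
pid_list_version and pid_hlist_version plus the helper functions are
invented for this comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types (assumed layouts). */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };
struct rcu_head   { struct rcu_head *next; void (*func)(struct rcu_head *); };
typedef struct { int counter; } atomic_t;

#define PIDTYPE_MAX 4	/* assumption: PID, TGID, PGID, SID */

/* struct pid as in the quoted patch, with list_head tasks. */
struct pid_list_version {
	atomic_t count;
	int nr;
	struct hlist_node pid_chain;
	struct list_head tasks[PIDTYPE_MAX];
	struct rcu_head rcu;
};

/* struct pid as proposed above, with hlist_head tasks. */
struct pid_hlist_version {
	atomic_t count;
	int nr;
	struct hlist_node pid_chain;
	struct hlist_head tasks[PIDTYPE_MAX];
	struct rcu_head rcu;
};

/* Bytes used by the tasks array in each variant. */
static size_t tasks_bytes_list(void)
{
	return sizeof(((struct pid_list_version *)0)->tasks);
}

static size_t tasks_bytes_hlist(void)
{
	return sizeof(((struct pid_hlist_version *)0)->tasks);
}

/* Bytes saved per struct pid by switching to hlist_head. */
static size_t pid_bytes_saved(void)
{
	assert(sizeof(struct pid_list_version) > sizeof(struct pid_hlist_version));
	return sizeof(struct pid_list_version) - sizeof(struct pid_hlist_version);
}
```

The tasks array itself halves, since an hlist_head is one pointer where a
list_head is two. How large a share of the whole structure that is depends
on the pointer size and on PIDTYPE_MAX, so print the sizeof values on the
target architecture rather than trusting a single number.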

Eric



end of thread, other threads:[~2006-03-07 23:02 UTC | newest]

Thread overview: 49+ messages
2006-02-23 15:52 [PATCH 00/23] proc cleanup Eric W. Biederman
2006-02-23 15:54 ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
2006-02-23 15:56   ` [PATCH 02/23] proc: Fix the .. inode number on /proc/<pid>/fd Eric W. Biederman
2006-02-23 15:57     ` [PATCH 03/23] proc: Remove useless BKL in proc_pid_readlink Eric W. Biederman
2006-02-23 15:58       ` [PATCH 04/23] proc: Remove unnecessary and misleading assignments from proc_pid_make_inode Eric W. Biederman
2006-02-23 16:00         ` [PATCH 05/23] proc: Simplify the ownership rules for /proc Eric W. Biederman
2006-02-23 16:02           ` Eric W. Biederman
2006-02-23 16:04           ` [PATCH 06/23] proc: Replace proc_inode.type with proc_inode.fd Eric W. Biederman
2006-02-23 16:05             ` [PATCH 07/23] proc: Remove bogus proc_task_permission Eric W. Biederman
2006-02-23 16:06               ` [PATCH 08/23] proc: Kill proc_mem_inode_operations Eric W. Biederman
2006-02-23 16:08                 ` [PATCH 09/23] proc: Properly filter out files that are not visible to a process Eric W. Biederman
2006-02-23 16:10                   ` [PATCH 10/23] proc: Fix the link count for /proc/<pid>/task Eric W. Biederman
2006-02-23 16:12                     ` [PATCH 11/23] proc: Move proc_maps_operations into task_mmu.c Eric W. Biederman
2006-02-23 16:15                       ` [PATCH 12/23] proc: Rewrite the proc dentry flush on exit optimization Eric W. Biederman
2006-02-23 16:16                         ` [PATCH 13/23] proc: Close the race of a process dying durning lookup Eric W. Biederman
2006-02-23 16:18                           ` [PATCH 14/23] proc: Make PROC_NUMBUF the buffer size for holding a integers as strings Eric W. Biederman
2006-02-23 16:20                             ` [PATCH 15/23] proc: refactor reading directories of tasks Eric W. Biederman
2006-02-23 16:23                               ` [PATCH 16/23] proc: Don't lock task_structs indefinitely Eric W. Biederman
2006-02-23 16:24                                 ` [PATCH 17/23] proc: Give the root directory a task Eric W. Biederman
2006-02-23 16:25                                   ` [PATCH 18/23] proc: Reorder the functions in base.c Eric W. Biederman
2006-02-23 16:27                                     ` [PATCH 19/23] proc: Modify proc_pident_lookup to be completely table driven Eric W. Biederman
2006-02-23 16:28                                       ` [PATCH 20/23] proc: Make the generation of the self symlink " Eric W. Biederman
2006-02-23 16:30                                         ` [PATCH 21/23] proc: Factor out an instantiate method from every lookup method Eric W. Biederman
2006-02-23 16:32                                           ` [PATCH 22/23] proc: Remove the hard coded inode numbers Eric W. Biederman
2006-02-23 16:34                                             ` [PATCH 23/23] proc: Merge proc_tid_attr and proc_tgid_attr Eric W. Biederman
2006-02-23 16:49   ` [PATCH 01/23] tref: Implement task references Eric W. Biederman
2006-03-02 19:16     ` Oleg Nesterov
2006-03-02 20:37       ` Oleg Nesterov
2006-03-02 22:19       ` Eric W. Biederman
2006-03-03 16:56         ` Oleg Nesterov
2006-03-03 17:48           ` Eric W. Biederman
2006-03-04 11:16           ` Eric W. Biederman
2006-03-04 12:31             ` Oleg Nesterov
2006-03-04 17:30               ` Oleg Nesterov
2006-03-06 21:06         ` Oleg Nesterov
2006-03-06 22:18           ` Eric W. Biederman
2006-03-07 20:44             ` Oleg Nesterov
2006-03-07  1:39           ` Eric W. Biederman
2006-03-07 20:38             ` Oleg Nesterov
2006-03-07 13:12           ` Eric W. Biederman
2006-03-07 21:02             ` Oleg Nesterov
2006-03-07 23:00               ` Eric W. Biederman
2006-03-03 19:23     ` Oleg Nesterov
2006-03-04 10:51       ` Eric W. Biederman
2006-02-25 12:27 ` [PATCH 00/23] proc cleanup Andrew Morton
2006-02-25 13:34   ` Eric W. Biederman
2006-02-25 15:20   ` Eric W. Biederman
2006-02-27 15:26 ` Serge E. Hallyn
2006-02-27 15:56   ` Eric W. Biederman
