linux-kernel.vger.kernel.org archive mirror
* [patch] generic-pidhash-2.5.36-D4, BK-curr
       [not found] <Pine.LNX.4.44.0209182101150.27697-100000@localhost.localdomain>
@ 2002-09-19  2:54 ` Ingo Molnar
  2002-09-19  6:10   ` Linus Torvalds
  2002-09-20  8:27   ` [patch] generic-pidhash-2.5.36-D4, BK-curr Oleg Drokin
  0 siblings, 2 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19  2:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel


the attached patch is a significantly cleaned up version of the generic
pidhash patch, against BK-curr. Changes:

 - merged the pidhash and id-hash concepts into a single 'generic 
   pidhash' thing. 

 - unified kernel/pid.c and kernel/idtag.c into kernel/pid.c.

 - eliminated the pidhash lock.

 - simplified the use of the generic pidhash: in most cases we simply
   iterate over a given PID 'type' (or class):

       for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
               p->tty = NULL;

   there's no get_tag()/put_tag() anymore, just simple loops.

 - fixed a number of bugs that caused inconsistent pid hashing. Added two 
   bugchecks that catch all cases when the pidhash gets out of sync with
   the pidmap.

 - removed the per-user task-list changes; they will be done in a separate
   patch.

 - removed the PIDTYPE_TGID class, it's not necessary anymore with the
   latest threading changes in 2.5.35.

 - lots of other cleanups and simplifications.
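to illustrate the lookup-then-iterate pattern, here is a minimal user-space
model of the generic pidhash. Names mirror the patch, but the fixed table
size, the single table (instead of one per PID type) and the singly-linked
chains are simplifications of the kernel's list_head based code, so treat
this as a sketch of the concept, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the generic pidhash: tasks sharing an ID hang off a
 * struct pid, and struct pids are chained in a hash table. */

#define PIDHASH_SIZE 64
#define pid_hashfn(nr) ((((nr) >> 8) ^ (nr)) & (PIDHASH_SIZE - 1))

struct task {
	int tty;			/* stands in for p->tty (0 == NULL) */
	struct task *pid_chain;		/* next task sharing this ID */
};

struct pid {
	int nr;
	struct task *task_list;		/* tasks using this ID */
	struct pid *hash_chain;		/* next entry in the hash bucket */
};

static struct pid *pid_hash[PIDHASH_SIZE];

void hash_pid(struct pid *pid)
{
	struct pid **bucket = &pid_hash[pid_hashfn(pid->nr)];

	pid->hash_chain = *bucket;
	*bucket = pid;
}

struct pid *find_pid(int nr)
{
	struct pid *pid;

	for (pid = pid_hash[pid_hashfn(nr)]; pid; pid = pid->hash_chain)
		if (pid->nr == nr)
			return pid;
	return NULL;
}

/* The for_each_task_pid() equivalent: look the ID up once, then walk
 * the task list -- e.g. detaching every task in a session from a tty. */
void clear_tty_for_session(int session)
{
	struct pid *pid = find_pid(session);
	struct task *p;

	if (pid)
		for (p = pid->task_list; p; p = p->pid_chain)
			p->tty = 0;
}
```

the point is that the cost of the loop is proportional to the number of
tasks actually sharing the ID, instead of a walk over every task in the
system.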

the patch is stable in both UP and SMP tests.

performance and robustness measurements:

thread creation+destruction (full) latency is the metric most sensitive
to the PID allocation code. I've done testing on a 'low end server'
system with 1000 tasks running, and i've also done 5000, 10000 and 50000
(inactive) threads tests. In all cases pid_max is at a safely large
value, 1 million; the PID fill ratio is 1% for the 10K threads test and
0.1% in the 1000 threads test.

i created and destroyed 1 million threads serially:

    ./perf -s 1000000 -t 1 -r 0 -T --sync-join

4 such measurements were running at once, to load the dual-P4 SMP box.
All CPUs were 100% loaded during the test. Note that pid_max and the
number of threads created are both 1 million; this is to get a fair
average over passes of the whole PID range.

BK-stock gives the following results:

	# of tasks:		250	1000	5000 	10000	50000
        -------------------------------------------------------------
	4x 1m threads (sec):	21.2	23.0	47.4	[NMI]	[NMI]

the results prove that even this extremely low PID-space fill rate causes
noticeable PID allocation overhead. Things really escalate at 10000
threads: the NMI watchdog triggered very quickly (i.e. an irqs-off latency
of more than 5 seconds occurred). The results would have been well above
200 seconds. Even in the 5000 threads test the system was occasionally
very jerky, and probably missed interrupts as well.

Note that the inactive threads (the 1K, 5K, 10K and 50K cases) were using
a consecutive PID range, which favors the get_pid() algorithm, as all
cache misses will be concentrated into one big chunk, and ~0.95 million
PIDs can be allocated without any interference afterwards. With a more
random distribution of PIDs the overhead is larger.

BK+patch gives the following results:

	# of tasks:		250	1000	5000 	10000	50000
        -------------------------------------------------------------
	4x 1m threads (sec):	23.8	23.8	24.1	25.5	27.7

i.e. thread creation performance is stable; it increases slightly,
probably due to hash chains getting longer. The catastrophic breakdown in
performance is solved.
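the reason the breakdown disappears is that allocation cost is now bounded
by the size of the PID bitmap instead of the number of tasks. A
single-threaded toy version of the idea (the kernel code uses atomic
test_and_set_bit(), per-page maps and find_next_zero_bit(); the names and
sizes here are illustrative only) looks like:

```c
#include <assert.h>

/* Toy bitmap PID allocator: set bits are in-use PIDs. Allocation scans
 * forward from last_pid for a zero bit, so the worst case is bounded by
 * the bitmap size, not by how many tasks exist. Not SMP-safe. */

#define TOY_PID_MAX 1024
static unsigned char pidmap[TOY_PID_MAX / 8];
static int last_pid;

static int toy_test_and_set(int nr)
{
	int was = pidmap[nr / 8] & (1 << (nr % 8));

	pidmap[nr / 8] |= 1 << (nr % 8);
	return was != 0;
}

int toy_alloc_pidmap(void)
{
	int pid, scanned;

	for (scanned = 0, pid = last_pid + 1; scanned < TOY_PID_MAX;
	     scanned++, pid++) {
		if (pid >= TOY_PID_MAX)
			pid = 1;		/* wrap; PID 0 stays reserved */
		if (!toy_test_and_set(pid)) {
			last_pid = pid;
			return pid;
		}
	}
	return -1;				/* PID space exhausted */
}

void toy_free_pidmap(int pid)
{
	pidmap[pid / 8] &= ~(1 << (pid % 8));
}
```

the typical fastpath is a single test-and-set at last_pid + 1; only a
densely filled map forces a longer scan.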

but i'm still not happy about the 250 tasks performance of the generic
pidhash - it has some constant overhead. It's not noticeable in fork
latencies though. I think there are still a number of performance
optimizations we can do.

(i have not attempted to measure the improvement in those cases where the
for_each_task loop was eliminated - it's no doubt significant.)

	Ingo

--- linux/drivers/char/tty_io.c.orig	Thu Sep 19 00:40:44 2002
+++ linux/drivers/char/tty_io.c	Thu Sep 19 01:27:37 2002
@@ -432,6 +432,7 @@
 	struct file * cons_filp = NULL;
 	struct task_struct *p;
 	struct list_head *l;
+	struct pid *pid;
 	int    closecount = 0, n;
 
 	if (!tty)
@@ -496,17 +497,17 @@
 	}
 	
 	read_lock(&tasklist_lock);
- 	for_each_process(p) {
-		if ((tty->session > 0) && (p->session == tty->session) &&
-		    p->leader) {
-			send_sig(SIGHUP,p,1);
-			send_sig(SIGCONT,p,1);
+	if (tty->session > 0)
+		for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) {
+			if (p->tty == tty)
+				p->tty = NULL;
+			if (!p->leader)
+				continue;
+			send_sig(SIGHUP, p, 1);
+			send_sig(SIGCONT, p, 1);
 			if (tty->pgrp > 0)
 				p->tty_old_pgrp = tty->pgrp;
 		}
-		if (p->tty == tty)
-			p->tty = NULL;
-	}
 	read_unlock(&tasklist_lock);
 
 	tty->flags = 0;
@@ -571,6 +572,8 @@
 {
 	struct tty_struct *tty = current->tty;
 	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pid;
 	int tty_pgrp = -1;
 
 	lock_kernel();
@@ -598,9 +601,8 @@
 	tty->pgrp = -1;
 
 	read_lock(&tasklist_lock);
-	for_each_process(p)
-	  	if (p->session == current->session)
-			p->tty = NULL;
+	for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
+		p->tty = NULL;
 	read_unlock(&tasklist_lock);
 	unlock_kernel();
 }
@@ -1221,12 +1223,15 @@
 	 */
 	if (tty_closing || o_tty_closing) {
 		struct task_struct *p;
+		struct list_head *l;
+		struct pid *pid;
 
 		read_lock(&tasklist_lock);
-		for_each_process(p) {
-			if (p->tty == tty || (o_tty && p->tty == o_tty))
+		for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid)
+			p->tty = NULL;
+		if (o_tty)
+			for_each_task_pid(o_tty->session, PIDTYPE_SID, p,l, pid)
 				p->tty = NULL;
-		}
 		read_unlock(&tasklist_lock);
 
 		if (redirect == tty || (o_tty && redirect == o_tty))
@@ -1540,6 +1545,10 @@
 
 static int tiocsctty(struct tty_struct *tty, int arg)
 {
+	struct list_head *l;
+	struct pid *pid;
+	task_t *p;
+
 	if (current->leader &&
 	    (current->session == tty->session))
 		return 0;
@@ -1558,12 +1567,10 @@
 			/*
 			 * Steal it away
 			 */
-			struct task_struct *p;
 
 			read_lock(&tasklist_lock);
-			for_each_process(p)
-				if (p->tty == tty)
-					p->tty = NULL;
+			for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid)
+				p->tty = NULL;
 			read_unlock(&tasklist_lock);
 		} else
 			return -EPERM;
--- linux/include/linux/pid.h.orig	Thu Sep 19 00:41:21 2002
+++ linux/include/linux/pid.h	Thu Sep 19 01:18:29 2002
@@ -0,0 +1,61 @@
+#ifndef _LINUX_PID_H
+#define _LINUX_PID_H
+
+enum pid_type
+{
+	PIDTYPE_PID,
+	PIDTYPE_PGID,
+	PIDTYPE_SID,
+	PIDTYPE_MAX
+};
+
+struct pid
+{
+	int nr;
+	enum pid_type type;
+	atomic_t count;
+	struct list_head hash_chain;
+	struct list_head task_list;
+};
+
+struct pid_link
+{
+	int nr;
+	struct list_head pid_chain;
+	struct pid *pid;
+};
+
+#define pid_task(elem, type) \
+	list_entry(elem, struct task_struct, pids[(int)(type)].pid_chain)
+
+/*
+ * attach_pid() must be called with the tasklist_lock write-held.
+ * It might unlock the tasklist_lock for allocation, so this
+ * function must be called last after installing the links of
+ * a new task.
+ */
+extern int FASTCALL(attach_pid(struct task_struct *, enum pid_type, int));
+
+/*
+ * detach_pid() must be called with the tasklist_lock write-held.
+ */
+extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type));
+
+/*
+ * Quick & dirty hash table lookup.
+ */
+extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+
+extern int alloc_pidmap(void);
+extern void FASTCALL(free_pidmap(int));
+
+#define for_each_task_pid(who, type, task, elem, pid)		\
+	if ((pid = find_pid(type, who)))			\
+	        for (elem = pid->task_list.next,			\
+			prefetch(elem->next),				\
+			task = pid_task(elem, type);			\
+			elem != &pid->task_list;			\
+			elem = elem->next, prefetch(elem->next), 	\
+			task = pid_task(elem, type))
+
+#endif /* _LINUX_PID_H */
--- linux/include/linux/sched.h.orig	Thu Sep 19 00:40:52 2002
+++ linux/include/linux/sched.h	Thu Sep 19 02:56:18 2002
@@ -28,6 +28,7 @@
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
+#include <linux/pid.h>
 
 struct exec_domain;
 
@@ -266,6 +267,8 @@
 	atomic_inc(&__user->__count);			\
 	__user; })
 
+extern struct user_struct *find_user(uid_t);
+
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
 
@@ -326,9 +329,8 @@
 	struct task_struct *group_leader;
 	struct list_head thread_group;
 
-	/* PID hash table linkage. */
-	struct task_struct *pidhash_next;
-	struct task_struct **pidhash_pprev;
+	/* PID/PID hash table linkage. */
+	struct pid_link pids[PIDTYPE_MAX];
 
 	wait_queue_head_t wait_chldexit;	/* for wait4() */
 	struct completion *vfork_done;		/* for vfork() */
@@ -474,38 +476,7 @@
 
 extern struct   mm_struct init_mm;
 
-/* PID hashing. (shouldnt this be dynamic?) */
-#define PIDHASH_SZ 8192
-extern struct task_struct *pidhash[PIDHASH_SZ];
-
-#define pid_hashfn(x)	((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
-
-static inline void hash_pid(struct task_struct *p)
-{
-	struct task_struct **htable = &pidhash[pid_hashfn(p->pid)];
-
-	if((p->pidhash_next = *htable) != NULL)
-		(*htable)->pidhash_pprev = &p->pidhash_next;
-	*htable = p;
-	p->pidhash_pprev = htable;
-}
-
-static inline void unhash_pid(struct task_struct *p)
-{
-	if(p->pidhash_next)
-		p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
-	*p->pidhash_pprev = p->pidhash_next;
-}
-
-static inline struct task_struct *find_task_by_pid(int pid)
-{
-	struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
-
-	for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
-		;
-
-	return p;
-}
+extern struct task_struct *find_task_by_pid(int pid);
 
 /* per-UID process charging. */
 extern struct user_struct * alloc_uid(uid_t);
--- linux/include/linux/threads.h.orig	Thu Sep 19 00:40:38 2002
+++ linux/include/linux/threads.h	Thu Sep 19 00:41:21 2002
@@ -17,8 +17,13 @@
 #define MIN_THREADS_LEFT_FOR_ROOT 4
 
 /*
- * This controls the maximum pid allocated to a process
+ * This controls the default maximum pid allocated to a process
  */
-#define DEFAULT_PID_MAX 0x8000
+#define PID_MAX_DEFAULT 0x8000
+
+/*
+ * A maximum of 4 million PIDs should be enough for a while:
+ */
+#define PID_MAX_LIMIT (4*1024*1024)
 
 #endif
--- linux/include/asm-i386/types.h.orig	Sun Jun  9 07:26:52 2002
+++ linux/include/asm-i386/types.h	Thu Sep 19 00:41:21 2002
@@ -41,7 +41,8 @@
 typedef signed long long s64;
 typedef unsigned long long u64;
 
-#define BITS_PER_LONG 32
+#define BITS_PER_LONG_SHIFT	5
+#define BITS_PER_LONG		(1 << BITS_PER_LONG_SHIFT)
 
 /* DMA addresses come in generic and 64-bit flavours.  */
 
--- linux/fs/fcntl.c.orig	Thu Sep 19 00:40:45 2002
+++ linux/fs/fcntl.c	Thu Sep 19 01:32:06 2002
@@ -480,7 +480,9 @@
 
 void send_sigio(struct fown_struct *fown, int fd, int band)
 {
-	struct task_struct * p;
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pidptr;
 	int pid;
 	
 	read_lock(&fown->lock);
@@ -493,14 +495,8 @@
 		send_sigio_to_task(p, fown, fd, band);
 		goto out_unlock_task;
 	}
-	for_each_process(p) {
-		int match = p->pid;
-		if (pid < 0)
-			match = -p->pgrp;
-		if (pid != match)
-			continue;
-		send_sigio_to_task(p, fown, fd, band);
-	}
+	for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr)
+		send_sigio_to_task(p, fown,fd,band);
 out_unlock_task:
 	read_unlock(&tasklist_lock);
 out_unlock_fown:
--- linux/fs/exec.c.orig	Thu Sep 19 00:40:51 2002
+++ linux/fs/exec.c	Thu Sep 19 01:48:30 2002
@@ -609,8 +609,6 @@
 
 		ptrace_unlink(leader);
 		ptrace_unlink(current);
-		unhash_pid(current);
-		unhash_pid(leader);
 		remove_parent(current);
 		remove_parent(leader);
 		/*
@@ -631,8 +629,6 @@
 			current->ptrace = ptrace;
 			__ptrace_link(current, parent);
 		}
-		hash_pid(current);
-		hash_pid(leader);
 		
 		list_add_tail(&current->tasks, &init_task.tasks);
 		state = leader->state;
--- linux/kernel/Makefile.orig	Thu Sep 19 00:40:12 2002
+++ linux/kernel/Makefile	Thu Sep 19 00:41:21 2002
@@ -15,7 +15,7 @@
 obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o futex.o platform.o
+	    signal.o sys.o kmod.o context.o futex.o platform.o pid.o
 
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o
--- linux/kernel/exit.c.orig	Thu Sep 19 00:41:07 2002
+++ linux/kernel/exit.c	Thu Sep 19 02:45:34 2002
@@ -33,7 +33,12 @@
 {
 	struct dentry *proc_dentry;
 	nr_threads--;
-	unhash_pid(p);
+	detach_pid(p, PIDTYPE_PID);
+	if (thread_group_leader(p)) {
+		detach_pid(p, PIDTYPE_PGID);
+		detach_pid(p, PIDTYPE_SID);
+	}
+
 	REMOVE_LINKS(p);
 	p->pid = 0;
 	proc_dentry = p->proc_dentry;
@@ -109,22 +114,18 @@
 int session_of_pgrp(int pgrp)
 {
 	struct task_struct *p;
-	int fallback;
+	struct list_head *l;
+	struct pid *pid;
+	int sid = -1;
 
-	fallback = -1;
 	read_lock(&tasklist_lock);
-	for_each_process(p) {
- 		if (p->session <= 0)
- 			continue;
-		if (p->pgrp == pgrp) {
-			fallback = p->session;
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid)
+		if (p->session > 0) {
+			sid = p->session;
 			break;
 		}
-		if (p->pid == pgrp)
-			fallback = p->session;
-	}
 	read_unlock(&tasklist_lock);
-	return fallback;
+	return sid;
 }
 
 /*
@@ -135,21 +136,25 @@
  *
  * "I ask you, have you ever known what it is to be an orphan?"
  */
-static int __will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
+static int __will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
 {
 	struct task_struct *p;
-
-	for_each_process(p) {
-		if ((p == ignored_task) || (p->pgrp != pgrp) ||
-		    (p->state == TASK_ZOMBIE) ||
-		    (p->real_parent->pid == 1))
+	struct list_head *l;
+	struct pid *pid;
+	int ret = 1;
+
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+		if (p == ignored_task
+				|| p->state == TASK_ZOMBIE 
+				|| p->real_parent->pid == 1)
 			continue;
-		if ((p->real_parent->pgrp != pgrp) &&
-		    (p->real_parent->session == p->session)) {
- 			return 0;
+		if (p->real_parent->pgrp != pgrp
+			    && p->real_parent->session == p->session) {
+			ret = 0;
+			break;
 		}
 	}
-	return 1;	/* (sighing) "Often!" */
+	return ret;	/* (sighing) "Often!" */
 }
 
 static int will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
@@ -171,11 +176,11 @@
 static inline int __has_stopped_jobs(int pgrp)
 {
 	int retval = 0;
-	struct task_struct * p;
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pid;
 
-	for_each_process(p) {
-		if (p->pgrp != pgrp)
-			continue;
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
 		if (p->state != TASK_STOPPED)
 			continue;
 		retval = 1;
--- linux/kernel/user.c.orig	Thu Sep 19 02:53:48 2002
+++ linux/kernel/user.c	Thu Sep 19 02:54:02 2002
@@ -64,6 +64,11 @@
 	return NULL;
 }
 
+struct user_struct *find_user(uid_t uid)
+{
+	return uid_hash_find(uid, uidhashentry(uid));
+}
+
 void free_uid(struct user_struct *up)
 {
 	if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
--- linux/kernel/pid.c.orig	Thu Sep 19 00:41:21 2002
+++ linux/kernel/pid.c	Thu Sep 19 03:24:47 2002
@@ -0,0 +1,268 @@
+/*
+ * Generic pidhash and scalable, time-bounded PID allocator
+ *
+ * (C) 2002 William Irwin, IBM
+ * (C) 2002 Ingo Molnar, Red Hat
+ *
+ * pid-structures are backing objects for tasks sharing a given ID to chain
+ * against. There is very little to them aside from hashing them and
+ * parking tasks using given ID's on a list.
+ *
+ * The hash is always changed with the tasklist_lock write-acquired,
+ * and the hash is only accessed with the tasklist_lock at least
+ * read-acquired, so there's no additional SMP locking needed here.
+ *
+ * We have a list of bitmap pages, which bitmaps represent the PID space.
+ * Allocating and freeing PIDs is completely lockless. The worst-case
+ * allocation scenario when all but one out of 1 million PIDs possible are
+ * allocated already: the scanning of 32 list entries and at most PAGE_SIZE
+ * bytes. The typical fastpath is a single successful setbit. Freeing is O(1).
+ */
+
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+
+static kmem_cache_t *pid_cache;
+
+#define PIDHASH_SIZE 4096
+static struct list_head pid_hash[PIDTYPE_MAX][PIDHASH_SIZE];
+#define pid_hashfn(nr) ((((nr) >> 8) ^ (nr)) & (PIDHASH_SIZE - 1))
+
+int pid_max = PID_MAX_DEFAULT;
+int last_pid;
+
+#define RESERVED_PIDS		300
+
+#define PIDMAP_ENTRIES		(PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE		(PAGE_SIZE*8)
+#define BITS_PER_PAGE_MASK	(BITS_PER_PAGE-1)
+
+/*
+ * PID-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way a low pid_max
+ * value does not cause lots of bitmaps to be allocated, but
+ * the scheme scales to up to 4 million PIDs, runtime.
+ */
+typedef struct pidmap {
+	atomic_t nr_free;
+	void *page;
+} pidmap_t;
+
+static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+	 { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+
+static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
+
+static inline int check_pid(int pid)
+{
+	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+	int offset = pid & BITS_PER_PAGE_MASK;
+
+	return test_bit(offset, map->page);
+}
+
+void free_pidmap(int pid)
+{
+	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+	int offset = pid & BITS_PER_PAGE_MASK;
+
+	if (!test_and_clear_bit(offset, map->page))
+		BUG();
+	atomic_inc(&map->nr_free);
+}
+
+/*
+ * Here we search for the next map that has free bits left.
+ * Normally the next map has free PIDs.
+ */
+static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+{
+	while (--*max_steps) {
+		if (++map == map_limit)
+			map = pidmap_array;
+		if (unlikely(!map->page)) {
+			unsigned long page = get_zeroed_page(GFP_KERNEL);
+			/*
+			 * Free the page if someone raced with us
+			 * installing it:
+			 */
+			if (cmpxchg(&map->page, NULL, page))
+				free_page(page);
+			if (!map->page)
+				break;
+		}
+		if (atomic_read(&map->nr_free))
+			return map;
+	}
+	return NULL;
+}
+
+int alloc_pidmap(void)
+{
+	int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
+	pidmap_t *map;
+
+	pid = last_pid + 1;
+	if (pid >= pid_max)
+		pid = RESERVED_PIDS;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = pidmap_array + pid / BITS_PER_PAGE;
+
+	if (likely(map->page && !test_and_set_bit(offset, map->page))) {
+		/*
+		 * There is a small window for last_pid updates to race,
+		 * but in that case the next allocation will go into the
+		 * slowpath and that fixes things up.
+		 */
+return_pid:
+		atomic_dec(&map->nr_free);
+		last_pid = pid;
+		return pid;
+	}
+	
+	if (!offset || !atomic_read(&map->nr_free)) {
+next_map:
+		map = next_free_map(map, &max_steps);
+		if (!map)
+			goto failure;
+		offset = 0;
+	}
+	/*
+	 * Find the next zero bit:
+	 */
+scan_more:
+	offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
+	if (offset == BITS_PER_PAGE)
+		goto next_map;
+	if (test_and_set_bit(offset, map->page))
+		goto scan_more;
+
+	/* we got the PID: */
+	pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+	goto return_pid;
+
+failure:
+	return -1;
+}
+
+struct pid *find_pid(enum pid_type type, int nr)
+{
+	struct list_head *elem, *bucket = &pid_hash[type][pid_hashfn(nr)];
+	struct pid *pid;
+
+	list_for_each(elem, bucket) {
+		pid = list_entry(elem, struct pid, hash_chain);
+		if (pid->nr == nr) {
+			if (unlikely(!check_pid(nr)))
+				BUG();
+			return pid;
+		}
+	}
+	return NULL;
+}
+
+static inline struct pid *get_pid(enum pid_type type, int nr)
+{
+	struct pid *pid, *raced_pid;
+
+	pid = find_pid(type, nr);
+	if (pid)
+		goto out_inc;
+
+	while (!pid) {
+		write_unlock_irq(&tasklist_lock);
+		pid = kmem_cache_alloc(pid_cache, SLAB_KERNEL);
+		write_lock_irq(&tasklist_lock);
+	}
+
+	raced_pid = find_pid(type, nr);
+	if (likely(!raced_pid)) {
+		INIT_LIST_HEAD(&pid->task_list);
+		atomic_set(&pid->count, 1);
+		pid->type = type;
+		pid->nr   = nr;
+		list_add(&pid->hash_chain, &pid_hash[type][pid_hashfn(nr)]);
+		return pid;
+	}
+	kmem_cache_free(pid_cache, pid);
+
+	pid = raced_pid;
+out_inc:
+	atomic_inc(&pid->count);
+	return pid;
+}
+
+int attach_pid(task_t *task, enum pid_type type, int nr)
+{
+	struct pid *pid;
+
+	if (unlikely(!check_pid(nr)))
+		BUG();
+
+	pid = get_pid(type, nr);
+	if (unlikely(!pid))
+		return -ENOMEM;
+
+	list_add(&task->pids[type].pid_chain, &pid->task_list);
+	task->pids[type].nr = nr;
+	task->pids[type].pid = pid;
+
+	return 0;
+}
+
+void detach_pid(task_t *task, enum pid_type type)
+{
+	struct pid_link *link = task->pids + type;
+	struct pid *pid = link->pid;
+	int nr;
+
+	list_del(&link->pid_chain);
+
+	if (!atomic_dec_and_test(&pid->count))
+		return;
+
+	nr = pid->nr;
+	list_del(&pid->hash_chain);
+	kmem_cache_free(pid_cache, pid);
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		if (find_pid(type, nr))
+			return;
+	free_pidmap(nr);
+}
+
+struct task_struct *find_task_by_pid(int nr)
+{
+	struct pid *pid = find_pid(PIDTYPE_PID, nr);
+
+	if (!pid || list_empty(&pid->task_list))
+		return NULL;
+
+	return pid_task(pid->task_list.next, PIDTYPE_PID);
+}
+
+void __init pid_init(void)
+{
+	pidmap_t *map = pidmap_array;
+
+	/*
+	 * Allocate PID 0:
+	 */
+	map->page = alloc_bootmem(PAGE_SIZE);
+	set_bit(0, map->page);
+	atomic_dec(&map->nr_free);
+}
+
+void __init pidhash_init(void)
+{
+	int i, j;
+
+	pid_cache = kmem_cache_create("pid_cache", sizeof(struct pid),
+				0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+
+	for (i = 0; i < PIDTYPE_MAX; ++i)
+		for (j = 0; j < PIDHASH_SIZE; ++j)
+			INIT_LIST_HEAD(&pid_hash[i][j]);
+}
--- linux/kernel/sched.c.orig	Thu Sep 19 00:40:48 2002
+++ linux/kernel/sched.c	Thu Sep 19 00:41:21 2002
@@ -2099,6 +2099,7 @@
 {
 	runqueue_t *rq;
 	int i, j, k;
+	extern void pid_init(void);
 
 	for (i = 0; i < NR_CPUS; i++) {
 		prio_array_t *array;
@@ -2139,5 +2140,7 @@
 	 */
 	atomic_inc(&init_mm.mm_count);
 	enter_lazy_tlb(&init_mm, current, smp_processor_id());
+
+	pid_init();
 }
 
--- linux/kernel/signal.c.orig	Thu Sep 19 00:40:48 2002
+++ linux/kernel/signal.c	Thu Sep 19 01:41:03 2002
@@ -943,18 +943,18 @@
 
 int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
 {
-	int retval = -EINVAL;
-	if (pgrp > 0) {
-		struct task_struct *p;
-
-		retval = -ESRCH;
-		for_each_process(p) {
-			if (p->pgrp == pgrp) {
-				int err = send_sig_info(sig, info, p);
-				if (retval)
-					retval = err;
-			}
-		}
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pid;
+	int err, retval = -ESRCH;
+
+	if (pgrp <= 0)
+		return -EINVAL;
+
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+		err = send_sig_info(sig, info, p);
+		if (retval)
+			retval = err;
 	}
 	return retval;
 }
@@ -977,28 +977,33 @@
  * the connection is lost.
  */
 
+
 int
-kill_sl_info(int sig, struct siginfo *info, pid_t sess)
+kill_sl_info(int sig, struct siginfo *info, pid_t sid)
 {
-	int retval = -EINVAL;
-	if (sess > 0) {
-		struct task_struct *p;
-
-		retval = -ESRCH;
-		read_lock(&tasklist_lock);
-		for_each_process(p) {
-			if (p->leader && p->session == sess) {
-				int err = send_sig_info(sig, info, p);
-				if (retval)
-					retval = err;
-			}
-		}
-		read_unlock(&tasklist_lock);
+	int err, retval = -EINVAL;
+	struct pid *pid;
+	struct list_head *l;
+	struct task_struct *p;
+
+	if (sid <= 0)
+		goto out;
+
+	retval = -ESRCH;
+	read_lock(&tasklist_lock);
+	for_each_task_pid(sid, PIDTYPE_SID, p, l, pid) {
+		if (!p->leader)
+			continue;
+		err = send_sig_info(sig, info, p);
+		if (retval)
+			retval = err;
 	}
+	read_unlock(&tasklist_lock);
+out:
 	return retval;
 }
 
-inline int
+int
 kill_proc_info(int sig, struct siginfo *info, pid_t pid)
 {
 	int error;
--- linux/kernel/sys.c.orig	Thu Sep 19 02:51:05 2002
+++ linux/kernel/sys.c	Thu Sep 19 02:55:49 2002
@@ -203,35 +203,34 @@
 cond_syscall(sys_quotactl)
 cond_syscall(sys_acct)
 
-static int proc_sel(struct task_struct *p, int which, int who)
+static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
-	if(p->pid)
-	{
-		switch (which) {
-			case PRIO_PROCESS:
-				if (!who && p == current)
-					return 1;
-				return(p->pid == who);
-			case PRIO_PGRP:
-				if (!who)
-					who = current->pgrp;
-				return(p->pgrp == who);
-			case PRIO_USER:
-				if (!who)
-					who = current->uid;
-				return(p->uid == who);
-		}
+	if (p->uid != current->euid &&
+		p->uid != current->uid && !capable(CAP_SYS_NICE)) {
+		error = -EPERM;
+		goto out;
 	}
-	return 0;
+
+	if (error == -ESRCH)
+		error = 0;
+	if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+		error = -EACCES;
+	else
+		set_user_nice(p, niceval);
+out:
+	return error;
 }
 
 asmlinkage long sys_setpriority(int which, int who, int niceval)
 {
 	struct task_struct *g, *p;
-	int error;
+	struct user_struct *user;
+	struct pid *pid;
+	struct list_head *l;
+	int error = -EINVAL;
 
 	if (which > 2 || which < 0)
-		return -EINVAL;
+		goto out;
 
 	/* normalize: avoid signed division (rounding problems) */
 	error = -ESRCH;
@@ -241,31 +240,38 @@
 		niceval = 19;
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
-		int no_nice;
-		if (!proc_sel(p, which, who))
-			continue;
-		if (p->uid != current->euid &&
-			p->uid != current->uid && !capable(CAP_SYS_NICE)) {
-			error = -EPERM;
-			continue;
-		}
-		if (error == -ESRCH)
-			error = 0;
-		if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
-			error = -EACCES;
-			continue;
-		}
-		no_nice = security_ops->task_setnice(p, niceval);
-		if (no_nice) {
-			error = no_nice;
-			continue;
-		}
-		set_user_nice(p, niceval);
-	} while_each_thread(g, p);
-
+	switch (which) {
+		case PRIO_PROCESS:
+			if (!who)
+				who = current->pid;
+			p = find_task_by_pid(who);
+			if (p)
+				error = set_one_prio(p, niceval, error);
+			break;
+		case PRIO_PGRP:
+			if (!who)
+				who = current->pgrp;
+			for_each_task_pid(who, PIDTYPE_PGID, p, l, pid)
+				error = set_one_prio(p, niceval, error);
+			break;
+		case PRIO_USER:
+			if (!who)
+				user = current->user;
+			else
+				user = find_user(who);
+
+			if (!user)
+				goto out_unlock;
+
+			do_each_thread(g, p)
+				if (p->uid == who)
+					error = set_one_prio(p, niceval, error);
+			while_each_thread(g, p);
+			break;
+	}
+out_unlock:
 	read_unlock(&tasklist_lock);
-
+out:
 	return error;
 }
 
@@ -278,20 +284,54 @@
 asmlinkage long sys_getpriority(int which, int who)
 {
 	struct task_struct *g, *p;
-	long retval = -ESRCH;
+	struct list_head *l;
+	struct pid *pid;
+	struct user_struct *user;
+	long niceval, retval = -ESRCH;
 
 	if (which > 2 || which < 0)
 		return -EINVAL;
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
-		long niceval;
-		if (!proc_sel(p, which, who))
-			continue;
-		niceval = 20 - task_nice(p);
-		if (niceval > retval)
-			retval = niceval;
-	} while_each_thread(g, p);
+	switch (which) {
+		case PRIO_PROCESS:
+			if (!who)
+				who = current->pid;
+			p = find_task_by_pid(who);
+			if (p) {
+				niceval = 20 - task_nice(p);
+				if (niceval > retval)
+					retval = niceval;
+			}
+			break;
+		case PRIO_PGRP:
+			if (!who)
+				who = current->pgrp;
+			for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) {
+				niceval = 20 - task_nice(p);
+				if (niceval > retval)
+					retval = niceval;
+			}
+			break;
+		case PRIO_USER:
+			if (!who)
+				user = current->user;
+			else
+				user = find_user(who);
+
+			if (!user)
+				goto out_unlock;
+
+			do_each_thread(g, p)
+				if (p->uid == who) {
+					niceval = 20 - task_nice(p);
+					if (niceval > retval)
+						retval = niceval;
+				}
+			while_each_thread(g, p);
+			break;
+	}
+out_unlock:
 	read_unlock(&tasklist_lock);
 
 	return retval;
@@ -849,7 +889,7 @@
 
 asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
 {
-	struct task_struct * p;
+	struct task_struct *p;
 	int err = -EINVAL;
 
 	if (!pid)
@@ -862,12 +902,15 @@
 	/* From this point forward we keep holding onto the tasklist lock
 	 * so that our parent does not change from under us. -DaveM
 	 */
-	read_lock(&tasklist_lock);
+	write_lock_irq(&tasklist_lock);
 
 	err = -ESRCH;
 	p = find_task_by_pid(pid);
 	if (!p)
 		goto out;
+	err = -EINVAL;
+	if (!thread_group_leader(p))
+		goto out;
 
 	if (p->parent == current || p->real_parent == current) {
 		err = -EPERM;
@@ -882,25 +925,26 @@
 	if (p->leader)
 		goto out;
 	if (pgid != pid) {
-		struct task_struct *g, *tmp;
-		do_each_thread(g, tmp) {
-			if (tmp->pgrp == pgid &&
-			    tmp->session == current->session)
+		struct task_struct *p;
+		struct pid *pid;
+		struct list_head *l;
+
+		for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
+			if (p->session == current->session)
 				goto ok_pgid;
-		} while_each_thread(g, tmp);
 		goto out;
 	}
 
 ok_pgid:
-	err = security_ops->task_setpgid(p, pgid);
-	if (err)
-		goto out;
-
-	p->pgrp = pgid;
+	if (p->pgrp != pgid) {
+		detach_pid(p, PIDTYPE_PGID);
+		p->pgrp = pgid;
+		attach_pid(p, PIDTYPE_PGID, pgid);
+	}
 	err = 0;
 out:
 	/* All paths lead to here, thus we are safe. -DaveM */
-	read_unlock(&tasklist_lock);
+	write_unlock_irq(&tasklist_lock);
 	return err;
 }
 
@@ -956,22 +1000,34 @@
 
 asmlinkage long sys_setsid(void)
 {
-	struct task_struct *g, *p;
+	struct pid *pid;
 	int err = -EPERM;
 
-	read_lock(&tasklist_lock);
-	do_each_thread(g, p)
-		if (p->pgrp == current->pid)
-			goto out;
-	while_each_thread(g, p);
+	if (!thread_group_leader(current))
+		return -EINVAL;
+
+	write_lock_irq(&tasklist_lock);
+
+	pid = find_pid(PIDTYPE_PGID, current->pid);
+	if (pid)
+		goto out;
 
 	current->leader = 1;
-	current->session = current->pgrp = current->pid;
+	if (current->session != current->pid) {
+		detach_pid(current, PIDTYPE_SID);
+		current->session = current->pid;
+		attach_pid(current, PIDTYPE_SID, current->pid);
+	}
+	if (current->pgrp != current->pid) {
+		detach_pid(current, PIDTYPE_PGID);
+		current->pgrp = current->pid;
+		attach_pid(current, PIDTYPE_PGID, current->pid);
+	}
 	current->tty = NULL;
 	current->tty_old_pgrp = 0;
 	err = current->pgrp;
 out:
-	read_unlock(&tasklist_lock);
+	write_unlock_irq(&tasklist_lock);
 	return err;
 }
 
--- linux/kernel/fork.c.orig	Thu Sep 19 00:41:07 2002
+++ linux/kernel/fork.c	Thu Sep 19 02:52:17 2002
@@ -47,17 +47,6 @@
 int max_threads;
 unsigned long total_forks;	/* Handle normal Linux uptimes. */
 
-/*
- * Protects next_safe, last_pid and pid_max:
- */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
-static int next_safe = DEFAULT_PID_MAX;
-int pid_max = DEFAULT_PID_MAX;
-int last_pid;
-
-struct task_struct *pidhash[PIDHASH_SZ];
-
 rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED;  /* outer */
 
 /*
@@ -84,7 +73,6 @@
 	}
 }
 
-/* Protects next_safe and last_pid. */
 void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
 {
 	unsigned long flags;
@@ -159,54 +147,6 @@
 	return tsk;
 }
 
-static int get_pid(unsigned long flags)
-{
-	struct task_struct *g, *p;
-	int pid;
-
-	if (flags & CLONE_IDLETASK)
-		return 0;
-
-	spin_lock(&lastpid_lock);
-	if (++last_pid > pid_max) {
-		last_pid = 300;		/* Skip daemons etc. */
-		goto inside;
-	}
-
-	if (last_pid >= next_safe) {
-inside:
-		if (nr_threads > pid_max >> 4)
-			pid_max <<= 1;
-		next_safe = pid_max;
-		read_lock(&tasklist_lock);
-	repeat:
-		do_each_thread(g, p) {
-			if (p->pid == last_pid	||
-			   p->pgrp == last_pid	||
-			   p->session == last_pid) {
-				if (++last_pid >= next_safe) {
-					if (last_pid >= pid_max)
-						last_pid = 300;
-					next_safe = pid_max;
-				}
-				goto repeat;
-			}
-			if (p->pid > last_pid && next_safe > p->pid)
-				next_safe = p->pid;
-			if (p->pgrp > last_pid && next_safe > p->pgrp)
-				next_safe = p->pgrp;
-			if (p->session > last_pid && next_safe > p->session)
-				next_safe = p->session;
-		} while_each_thread(g, p);
-
-		read_unlock(&tasklist_lock);
-	}
-	pid = last_pid;
-	spin_unlock(&lastpid_lock);
-
-	return pid;
-}
-
 static inline int dup_mmap(struct mm_struct * mm)
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
@@ -726,7 +666,13 @@
 	p->state = TASK_UNINTERRUPTIBLE;
 
 	copy_flags(clone_flags, p);
-	p->pid = get_pid(clone_flags);
+	if (clone_flags & CLONE_IDLETASK)
+		p->pid = 0;
+	else {
+		p->pid = alloc_pidmap();
+		if (p->pid == -1)
+			goto bad_fork_cleanup;
+	}
 	p->proc_dentry = NULL;
 
 	INIT_LIST_HEAD(&p->run_list);
@@ -889,7 +835,13 @@
 	SET_LINKS(p);
 	if (p->ptrace & PT_PTRACED)
 		__ptrace_link(p, current->parent);
-	hash_pid(p);
+
+	attach_pid(p, PIDTYPE_PID, p->pid);
+	if (thread_group_leader(p)) {
+		attach_pid(p, PIDTYPE_PGID, p->pgrp);
+		attach_pid(p, PIDTYPE_SID, p->session);
+	}
+
 	nr_threads++;
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
@@ -914,6 +866,8 @@
 bad_fork_cleanup_security:
 	security_ops->task_free_security(p);
 bad_fork_cleanup:
+	if (p->pid > 0)
+		free_pidmap(p->pid);
 	put_exec_domain(p->thread_info->exec_domain);
 	if (p->binfmt && p->binfmt->module)
 		__MOD_DEC_USE_COUNT(p->binfmt->module);
--- linux/kernel/ksyms.c.orig	Thu Sep 19 00:40:52 2002
+++ linux/kernel/ksyms.c	Thu Sep 19 00:41:21 2002
@@ -602,7 +602,6 @@
 EXPORT_SYMBOL(init_thread_union);
 
 EXPORT_SYMBOL(tasklist_lock);
-EXPORT_SYMBOL(pidhash);
 #if defined(CONFIG_SMP) && defined(__GENERIC_PER_CPU)
 EXPORT_SYMBOL(__per_cpu_offset);
 #endif
--- linux/init/main.c.orig	Thu Sep 19 00:40:48 2002
+++ linux/init/main.c	Thu Sep 19 00:41:21 2002
@@ -66,6 +66,7 @@
 extern void sysctl_init(void);
 extern void signals_init(void);
 extern void buffer_init(void);
+extern void pidhash_init(void);
 extern void pte_chain_init(void);
 extern void radix_tree_init(void);
 extern void free_initmem(void);
@@ -432,6 +433,7 @@
 #endif
 	mem_init();
 	kmem_cache_sizes_init();
+	pidhash_init();
 	pgtable_cache_init();
 	pte_chain_init();
 	fork_init(num_physpages);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19  2:54 ` [patch] generic-pidhash-2.5.36-D4, BK-curr Ingo Molnar
@ 2002-09-19  6:10   ` Linus Torvalds
  2002-09-19  9:25     ` Ingo Molnar
  2002-09-20  8:27   ` [patch] generic-pidhash-2.5.36-D4, BK-curr Oleg Drokin
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2002-09-19  6:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: William Lee Irwin III, linux-kernel


On Thu, 19 Sep 2002, Ingo Molnar wrote:
> 
> the attached patch is a significantly cleaned up version of the generic
> pidhash patch, against BK-curr.

Hmm.. I think I like it. One big improvement is just the renaming of
"struct idtag" to "struct pid", which makes it look a lot less abstract to
me. Maybe that's just me.

However:

>  	read_lock(&tasklist_lock);
> - 	for_each_process(p) {
> -		if ((tty->session > 0) && (p->session == tty->session) &&
> -		    p->leader) {
> -			send_sig(SIGHUP,p,1);
> -			send_sig(SIGCONT,p,1);
> +	if (tty->session > 0)
> +		for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) {
> +			if (p->tty == tty)
> +				p->tty = NULL;
> +			if (!p->leader)
> +				continue;
> +			send_sig(SIGHUP, p, 1);
> +			send_sig(SIGCONT, p, 1);
>  			if (tty->pgrp > 0)
>  				p->tty_old_pgrp = tty->pgrp;
>  		}
> -		if (p->tty == tty)
> -			p->tty = NULL;
> -	}

This looks a bit wrong. In particular, the old code used to set "p->tty" 
to NULL if it matched any process, while the new code only does that for 
processes in the right session. Hmm?

Can you double-check some of these things?

		Linus



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19  6:10   ` Linus Torvalds
@ 2002-09-19  9:25     ` Ingo Molnar
  2002-09-19 10:59       ` William Lee Irwin III
  0 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19  9:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel


On Wed, 18 Sep 2002, Linus Torvalds wrote:

> However:
> 
> >  	read_lock(&tasklist_lock);
> > - 	for_each_process(p) {
> > -		if ((tty->session > 0) && (p->session == tty->session) &&
> > -		    p->leader) {
> > -			send_sig(SIGHUP,p,1);
> > -			send_sig(SIGCONT,p,1);
[...]

> This looks a bit wrong. In particular, the old code used to set "p->tty"  
> to NULL if it matched any process, while the new code only does that for
> processes in the right session. Hmm?

i had this code changed back and forth twice already, because i was unsure
whether it's correct. It's true that it's not an equivalent
transformation, but after some thinking it looked correct to me. Isn't
there a 1:1 relationship between the tty and the session ID in this case?
Ie. the previous code was pessimistic and just scanned for matching
p->tty's outside of the tty->session == p->session space, but it should
not have done so.

i'll add some debugging code to the old code to print out cases when
p->tty == tty but p->session != tty->session and start up X, that should
prove or disprove this theory, right?

(i can't remember any other place where the code transformation was not
identity, but will double-check this again.)

William, you did the original transformation, was this optimization done
intentionally?

	Ingo



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19  9:25     ` Ingo Molnar
@ 2002-09-19 10:59       ` William Lee Irwin III
  2002-09-19 15:12         ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: William Lee Irwin III @ 2002-09-19 10:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

On Thu, Sep 19, 2002 at 11:25:11AM +0200, Ingo Molnar wrote:
> i'll add some debugging code to the old code to print out cases when
> p->tty == tty but p->session != tty->session and start up X, that should
> prove or disprove this theory, right?
> (i cant remember any other place where the code transformation was not
> identity, but will double-check this again.)
> William, you did the original transformation, was this optimization done
> intentionally?
> 	Ingo

Sorry, I was debugging some boot-time issues I ran into.

I did this intentionally. Basically, sys_setsid() does the right thing,
but tty_ioctl() does not. There is already some inconsistency
about how task->tty is locked, and I'd not yet come to a conclusion.

On second thought, I really hope I'm imagining things here:

For instance, tty_open() does the following while not holding the
tasklist_lock for writing:

	if (IS_TTY_DEV(device)) {
		if (!current->tty)
			return -ENXIO;
		device = current->tty->device;


So tiocsctty() appears to have the following race against tty_open():

A                                       B
-----------------------------------------
tty->session > 0         if (!current->tty)
                           return -ENXIO; (test fails, don't return)

iterate tasklist,        delay...
  setting p->tty = NULL       current->tty is set to NULL

        (this looks like an attempt to obtain a unique
         reference by blowing away everyone else's)

task_lock(current)       device = current->tty->device
current->tty = tty       OOPS
task_unlock(task)

tiocsctty() holds the BKL (from sys_ioctl() itself) and the
tasklist_lock for reading, but tty_open() seems to hold only some
vfs references. release_dev() does the same style of iteration, but
doesn't even hold the BKL, only the tasklist_lock for reading. And
is very ominously called by tty_open() itself.

Also, most other accesses to tty->session and/or tty->pgrp seem to
be protected by the BKL.


Bill


* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 10:59       ` William Lee Irwin III
@ 2002-09-19 15:12         ` Linus Torvalds
  2002-09-19 15:21           ` Ingo Molnar
  2002-09-19 16:35           ` Andries Brouwer
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2002-09-19 15:12 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Ingo Molnar, linux-kernel


On Thu, 19 Sep 2002, William Lee Irwin III wrote:
> 
> I did this intentionally. Basically, sys_setsid() does the right thing,
> but tty_ioctl() does not. There is already some inconsistency
> about how task->tty is locked, and I'd not yet come to a conclusion.

I agree about the locking issue (although I do _not_ believe that the
tasklock should have anything to do with the tsk->tty locking - it should
most likely use some per-task lock for the actual tty accesses, together
with the optimization that a write lock on the tasklock is sufficient to
protect it because it means that nobody else can look up the task).

However, what I worry about is that there may not (will not) be a 1:1
session<->tty thing. In particular, when somebody creates a new session 
with "setsid()", that does not remove the tty from processes that used to 
hold it, I think (this is all from memory, so I might be wrong).

Which means that if the tty is going away, it has to be removed from _all_ 
tasks, not just from the one session that happened to be the most recent 
one.

		Linus



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 15:12         ` Linus Torvalds
@ 2002-09-19 15:21           ` Ingo Molnar
  2002-09-19 16:35           ` Andries Brouwer
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19 15:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel


On Thu, 19 Sep 2002, Linus Torvalds wrote:

> However, what I worry about is that there may not (will not) be a 1:1
> session<->tty thing. In particular, when somebody creates a new session
> with "setsid()", that does not remove the tty from processes that used
> to hold it, I think (this is all from memory, so I might be wrong).

i've done the testrun with the p->tty == tty && p->session != tty->session
check, and it never triggers. (Which means nothing, but at least the
assumption is not trivially wrong.)

	Ingo



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 15:12         ` Linus Torvalds
  2002-09-19 15:21           ` Ingo Molnar
@ 2002-09-19 16:35           ` Andries Brouwer
  2002-09-19 16:43             ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Andries Brouwer @ 2002-09-19 16:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, Ingo Molnar, linux-kernel

On Thu, Sep 19, 2002 at 08:12:31AM -0700, Linus Torvalds wrote:

> However, what I worry about is that there may not (will not) be a 1:1
> session<->tty thing. In particular, when somebody creates a new session 
> with "setsid()", that does not remove the tty from processes that used to 
> hold it, I think (this is all from memory, so I might be wrong).
> 
> Which means that if the tty is going away, it has to be removed from _all_ 
> tasks, not just from the one session that happened to be the most recent 
> one.

[POSIX 1003.1-2001]

Controlling Terminal

A terminal that is associated with a session. Each session may have
at most one controlling terminal associated with it, and a controlling
terminal is associated with exactly one session.
Certain input sequences from the controlling terminal cause signals
to be sent to all processes in the process group associated with
the controlling terminal. 

The Controlling Terminal

A terminal may belong to a process as its controlling terminal.
Each process of a session that has a controlling terminal has the same
controlling terminal. A terminal may be the controlling terminal
for at most one session. If a session leader has no controlling terminal,
and opens a terminal device file that is not already associated with a
session without using the O_NOCTTY option, it is implementation-defined
whether the terminal becomes the controlling terminal of the session leader.
If a process which is not a session leader opens a terminal file, or
the O_NOCTTY option is used on open(), then that terminal shall not become
the controlling terminal of the calling process. When a controlling terminal
becomes associated with a session, its foreground process group shall be set
to the process group of the session leader.

The controlling terminal is inherited by a child process during a fork()
function call. A process relinquishes its controlling terminal when it creates
a new session with the setsid() function; other processes remaining in the
old session that had this terminal as their controlling terminal continue
to have it. Upon the close of the last file descriptor in the system
(whether or not it is in the current session) associated with the controlling
terminal, it is unspecified whether all processes that had that terminal
as their controlling terminal cease to have any controlling terminal.
A process does not relinquish its controlling terminal simply by closing all
of its file descriptors associated with the controlling terminal if other
processes continue to have it open.

When a controlling process terminates, the controlling terminal is dissociated
from the current session, allowing it to be acquired by a new session leader.
Subsequent access to the terminal by other processes in the earlier session
may be denied, with attempts to access the terminal treated as if a modem
disconnect had been sensed.



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 16:35           ` Andries Brouwer
@ 2002-09-19 16:43             ` Linus Torvalds
  2002-09-19 18:54               ` Miquel van Smoorenburg
                                 ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Linus Torvalds @ 2002-09-19 16:43 UTC (permalink / raw)
  To: Andries Brouwer; +Cc: William Lee Irwin III, Ingo Molnar, linux-kernel


On Thu, 19 Sep 2002, Andries Brouwer wrote:
> > 
> > Which means that if the tty is going away, it has to be removed from _all_ 
> > tasks, not just from the one session that happened to be the most recent 
> > one.
> 
> [POSIX 1003.1-2001]

Gaah. Good that somebody else has the energy to actually read the 
standards instead of just trying to desperately remember some dusty 
details..

> The controlling terminal is inherited by a child process during a fork()
> function call. A process relinquishes its controlling terminal when it creates
> a new session with the setsid() function; other processes remaining in the
> old session that had this terminal as their controlling terminal continue
> to have it.

Well, that certainly clinches the fact that the controlling terminal _can_ 
and does continue to be held by processes outside the current session 
group.

I suspect that to handle controlling terminals efficiently (ie without
iterating over all tasks), the "current->tty" thing needs to be expanded
with a linked list of processes sharing the tty or something (probably
with the head of the list being in the tty structure itself).
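Such a per-tty list could look roughly like this: a userspace sketch with illustrative names (an intrusive list headed in the tty), not a kernel implementation:

```c
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

struct tty_s  { struct list_node tty_tasks; };		/* list head */
struct task_s { struct tty_s *tty; struct list_node tty_link; };

static void list_init(struct list_node *h) { h->next = h->prev = h; }

static void list_add(struct list_node *n, struct list_node *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

static void list_del(struct list_node *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
}

/* acquire tty as controlling terminal: link the task onto the tty */
static void task_set_tty(struct task_s *t, struct tty_s *tty)
{
	t->tty = tty;
	list_add(&t->tty_link, &tty->tty_tasks);
}

/* hangup: O(number of holders), not O(all tasks in the system) */
static void tty_drop_all(struct tty_s *tty)
{
	while (tty->tty_tasks.next != &tty->tty_tasks) {
		struct task_s *t = (struct task_s *)
			((char *)tty->tty_tasks.next -
			 offsetof(struct task_s, tty_link));
		list_del(&t->tty_link);
		t->tty = NULL;
	}
}
```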

On the other hand, I don't think it necessarily is a problem to walk all 
threads either - the controlling terminal changes should be rare, and this 
is O(n) rather than some quadratic or other behaviour.

Anyway, that seems to make Ingo's patch wrong for this case at least. 
Ingo?

		Linus



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 16:43             ` Linus Torvalds
@ 2002-09-19 18:54               ` Miquel van Smoorenburg
  2002-09-19 19:10               ` Kai Henningsen
  2002-09-19 19:38               ` [patch] generic-pidhash-2.5.36-J2, BK-curr Ingo Molnar
  2 siblings, 0 replies; 28+ messages in thread
From: Miquel van Smoorenburg @ 2002-09-19 18:54 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.44.0209190938340.1594-100000@home.transmeta.com>,
Linus Torvalds  <torvalds@transmeta.com> wrote:
>
>On Thu, 19 Sep 2002, Andries Brouwer wrote:
>> The controlling terminal is inherited by a child process during a fork()
>> function call. A process relinquishes its controlling terminal when it creates
>> a new session with the setsid() function; other processes remaining in the
>> old session that had this terminal as their controlling terminal continue
>> to have it.
>
>Well, that certainly clinches the fact that the controlling terminal _can_ 
>and does continue to be held by processes outside the current session 
>group.

No, not at all. A few lines above that it said:

>> A terminal that is associated with a session. Each session may have
>> at most one controlling terminal associated with it, and a controlling
>> terminal is associated with exactly one session.

A session has zero or one controlling terminal. A controlling
terminal is associated with one session only.

Mike.



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 16:43             ` Linus Torvalds
  2002-09-19 18:54               ` Miquel van Smoorenburg
@ 2002-09-19 19:10               ` Kai Henningsen
  2002-09-19 20:32                 ` Linus Torvalds
  2002-09-19 19:38               ` [patch] generic-pidhash-2.5.36-J2, BK-curr Ingo Molnar
  2 siblings, 1 reply; 28+ messages in thread
From: Kai Henningsen @ 2002-09-19 19:10 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 19.09.02 in <Pine.LNX.4.44.0209190938340.1594-100000@home.transmeta.com>:

> On Thu, 19 Sep 2002, Andries Brouwer wrote:

> > [POSIX 1003.1-2001]

> > The controlling terminal is inherited by a child process during a fork()
> > function call. A process relinquishes its controlling terminal when it
> > creates a new session with the setsid() function; other processes
> > remaining in the old session that had this terminal as their controlling
> > terminal continue to have it.
>
> Well, that certainly clinches the fact that the controlling terminal _can_
> and does continue to be held by processes outside the current session
> group.

On the contrary: it says that this can never happen - the new session has  
no controlling terminal, and can't get the old one unless the old session  
loses it first.

MfG Kai


* [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 16:43             ` Linus Torvalds
  2002-09-19 18:54               ` Miquel van Smoorenburg
  2002-09-19 19:10               ` Kai Henningsen
@ 2002-09-19 19:38               ` Ingo Molnar
  2002-09-19 20:21                 ` Christoph Hellwig
  2002-09-19 21:31                 ` Linus Torvalds
  2 siblings, 2 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19 19:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel


i've attached the latest version of the generic pidhash patch. The biggest
change is the removal of separately allocated pid structures: they are now
part of the task structure and the first task that uses a PID will provide
the pid structure. Task refcounting is used to avoid the freeing of the
task structure before every member of a process group or session has
exited.

this approach has a number of advantages beyond the performance gains.  
Besides simplifying the whole hashing code significantly, attach_pid() is
now fundamentally atomic and can be called during create_process() without
worrying about task-list side-effects. It does not have to re-search the
pidhash to find out about raced PID-adding either, and attach_pid()  
cannot fail due to OOM. detach_pid() can do a simple put_task_struct()
instead of the kmem_cache_free().

the only, minimal, downside is that task structures can linger after
session leaders or group leaders have exited - but the number of orphan
sessions and process groups is usually very low, and even if it's higher,
this can be regarded as a slow execution of the final deallocation of the
session leader, not as some additional burden.

[Benchmark results comparing stock kernel against patched kernel come
after the Changelog.]

Changes since -D4:

 - remove the tty_io.c changes.

 - remove stale change from include/asm-i386/types.h

 - add list_for_each_noprefetch() to list.h, for all those list users who 
   know that in the majority of cases the list is going to be short.

 - reorganize struct pid: remove the type field, add the task pointer.

 - reorganize pid_link: remove the 'nr' field, add the struct pid field.

 - thread-exit optimization: remove the tsk->real_timer only if it was 
   pending.

 - thread-create optimization: reuse the task_cache if existent. Use 
   atomic ops to manipulate it, so that thread-create can access it from
   outside of the tasklist lock.

 - remove the pid_cache SLAB cache.

 - simplify and speed up attach_pid(): no need to drop the lock, look up
   the pid only once, and no failure path.

 - speed up detach_pid(): put_task_struct() instead of kmem_cache_free().

 - merge pidhash_init() into pid_init(). Call it from main.c only.

 - allocate not only the PID 0 bitmap slot, but also hash it into all
   three hashes to make sure no-one tries to reuse it.

 - cleanups, code simplifications.

Performance:

single-thread create+exit latency (full latency, including the
release_task() overhead) on a UP 525 MHz PIII box:

	stock BK-curr
        -------------
        -> 3864 cycles [7.025455 usecs]

	BK-curr + generic-pidhash-J2
        ----------------------------
        -> 3430 cycles [6.236364 usecs]

ie. the patched kernel is 12% faster than the stock kernel. Most of the
improvement is due to the fork.c and exit.c optimizations.

Note that this is the best-possible benchmark scenario for the old PID
allocator: the new PID allocator's best-case latency is burdened by the
fact that it avoids worst-case latencies, and its average latency is
independent of layout of the PID space. Yesterday's test that proved the
old allocator's high sensitivity to the PID allocation rate and patterns
is still valid.

i've compiled, booted & tested the patch on UP and SMP x86 as well.

	Ingo

--- linux/include/linux/pid.h.orig	Thu Sep 19 21:30:31 2002
+++ linux/include/linux/pid.h	Thu Sep 19 21:30:31 2002
@@ -0,0 +1,63 @@
+#ifndef _LINUX_PID_H
+#define _LINUX_PID_H
+
+enum pid_type
+{
+	PIDTYPE_PID,
+	PIDTYPE_PGID,
+	PIDTYPE_SID,
+	PIDTYPE_MAX
+};
+
+struct pid
+{
+	int nr;
+	atomic_t count;
+	struct task_struct *task;
+	struct list_head task_list;
+	struct list_head hash_chain;
+};
+
+struct pid_link
+{
+	struct list_head pid_chain;
+	struct pid *pidptr;
+	struct pid pid;
+};
+
+#define pid_task(elem, type) \
+	list_entry(elem, struct task_struct, pids[type].pid_chain)
+
+/*
+ * attach_pid() must be called with the tasklist_lock write-held.
+ *
+ * It might unlock the tasklist_lock for allocation, so this
+ * function must be called after installing all other links of
+ * a new task.
+ */
+extern int FASTCALL(attach_pid(struct task_struct *, enum pid_type, int));
+
+/*
+ * detach_pid() must be called with the tasklist_lock write-held.
+ */
+extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type));
+
+/*
+ * look up a PID in the hash table. Must be called with the tasklist_lock
+ * held.
+ */
+extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+
+extern int alloc_pidmap(void);
+extern void FASTCALL(free_pidmap(int));
+
+#define for_each_task_pid(who, type, task, elem, pid)		\
+	if ((pid = find_pid(type, who)))			\
+	        for (elem = pid->task_list.next,			\
+			prefetch(elem->next),				\
+			task = pid_task(elem, type);			\
+			elem != &pid->task_list;			\
+			elem = elem->next, prefetch(elem->next), 	\
+			task = pid_task(elem, type))
+
+#endif /* _LINUX_PID_H */
--- linux/include/linux/sched.h.orig	Thu Sep 19 21:20:51 2002
+++ linux/include/linux/sched.h	Thu Sep 19 21:30:31 2002
@@ -28,6 +28,7 @@
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
+#include <linux/pid.h>
 
 struct exec_domain;
 
@@ -266,6 +267,8 @@
 	atomic_inc(&__user->__count);			\
 	__user; })
 
+extern struct user_struct *find_user(uid_t);
+
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
 
@@ -326,9 +329,8 @@
 	struct task_struct *group_leader;
 	struct list_head thread_group;
 
-	/* PID hash table linkage. */
-	struct task_struct *pidhash_next;
-	struct task_struct **pidhash_pprev;
+	/* PID/PID hash table linkage. */
+	struct pid_link pids[PIDTYPE_MAX];
 
 	wait_queue_head_t wait_chldexit;	/* for wait4() */
 	struct completion *vfork_done;		/* for vfork() */
@@ -474,38 +476,7 @@
 
 extern struct   mm_struct init_mm;
 
-/* PID hashing. (shouldnt this be dynamic?) */
-#define PIDHASH_SZ 8192
-extern struct task_struct *pidhash[PIDHASH_SZ];
-
-#define pid_hashfn(x)	((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
-
-static inline void hash_pid(struct task_struct *p)
-{
-	struct task_struct **htable = &pidhash[pid_hashfn(p->pid)];
-
-	if((p->pidhash_next = *htable) != NULL)
-		(*htable)->pidhash_pprev = &p->pidhash_next;
-	*htable = p;
-	p->pidhash_pprev = htable;
-}
-
-static inline void unhash_pid(struct task_struct *p)
-{
-	if(p->pidhash_next)
-		p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
-	*p->pidhash_pprev = p->pidhash_next;
-}
-
-static inline struct task_struct *find_task_by_pid(int pid)
-{
-	struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
-
-	for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
-		;
-
-	return p;
-}
+extern struct task_struct *find_task_by_pid(int pid);
 
 /* per-UID process charging. */
 extern struct user_struct * alloc_uid(uid_t);
--- linux/include/linux/threads.h.orig	Thu Sep 19 21:20:33 2002
+++ linux/include/linux/threads.h	Thu Sep 19 21:30:31 2002
@@ -17,8 +17,13 @@
 #define MIN_THREADS_LEFT_FOR_ROOT 4
 
 /*
- * This controls the maximum pid allocated to a process
+ * This controls the default maximum pid allocated to a process
  */
-#define DEFAULT_PID_MAX 0x8000
+#define PID_MAX_DEFAULT 0x8000
+
+/*
+ * A maximum of 4 million PIDs should be enough for a while:
+ */
+#define PID_MAX_LIMIT (4*1024*1024)
 
 #endif
--- linux/include/linux/list.h.orig	Thu Sep 19 21:20:33 2002
+++ linux/include/linux/list.h	Thu Sep 19 21:30:31 2002
@@ -195,6 +195,10 @@
 #define list_for_each(pos, head) \
 	for (pos = (head)->next, prefetch(pos->next); pos != (head); \
         	pos = pos->next, prefetch(pos->next))
+
+#define list_for_each_noprefetch(pos, head) \
+	for (pos = (head)->next; pos != (head); pos = pos->next)
+
 /**
  * list_for_each_prev	-	iterate over a list backwards
  * @pos:	the &struct list_head to use as a loop counter.
--- linux/fs/fcntl.c.orig	Thu Sep 19 21:20:41 2002
+++ linux/fs/fcntl.c	Thu Sep 19 21:30:31 2002
@@ -480,7 +480,9 @@
 
 void send_sigio(struct fown_struct *fown, int fd, int band)
 {
-	struct task_struct * p;
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pidptr;
 	int pid;
 	
 	read_lock(&fown->lock);
@@ -493,14 +495,8 @@
 		send_sigio_to_task(p, fown, fd, band);
 		goto out_unlock_task;
 	}
-	for_each_process(p) {
-		int match = p->pid;
-		if (pid < 0)
-			match = -p->pgrp;
-		if (pid != match)
-			continue;
-		send_sigio_to_task(p, fown, fd, band);
-	}
+	for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr)
+		send_sigio_to_task(p, fown,fd,band);
 out_unlock_task:
 	read_unlock(&tasklist_lock);
 out_unlock_fown:
--- linux/fs/exec.c.orig	Thu Sep 19 21:20:47 2002
+++ linux/fs/exec.c	Thu Sep 19 21:30:31 2002
@@ -609,8 +609,6 @@
 
 		ptrace_unlink(leader);
 		ptrace_unlink(current);
-		unhash_pid(current);
-		unhash_pid(leader);
 		remove_parent(current);
 		remove_parent(leader);
 		/*
@@ -631,8 +629,6 @@
 			current->ptrace = ptrace;
 			__ptrace_link(current, parent);
 		}
-		hash_pid(current);
-		hash_pid(leader);
 		
 		list_add_tail(&current->tasks, &init_task.tasks);
 		state = leader->state;
--- linux/kernel/Makefile.orig	Thu Sep 19 21:30:18 2002
+++ linux/kernel/Makefile	Thu Sep 19 21:30:31 2002
@@ -8,7 +8,7 @@
 obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o futex.o platform.o
+	    signal.o sys.o kmod.o context.o futex.o platform.o pid.o
 
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o
--- linux/kernel/exit.c.orig	Thu Sep 19 21:30:18 2002
+++ linux/kernel/exit.c	Thu Sep 19 21:30:31 2002
@@ -33,7 +33,12 @@
 {
 	struct dentry *proc_dentry;
 	nr_threads--;
-	unhash_pid(p);
+	detach_pid(p, PIDTYPE_PID);
+	if (thread_group_leader(p)) {
+		detach_pid(p, PIDTYPE_PGID);
+		detach_pid(p, PIDTYPE_SID);
+	}
+
 	REMOVE_LINKS(p);
 	p->pid = 0;
 	proc_dentry = p->proc_dentry;
@@ -109,22 +114,18 @@
 int session_of_pgrp(int pgrp)
 {
 	struct task_struct *p;
-	int fallback;
+	struct list_head *l;
+	struct pid *pid;
+	int sid = -1;
 
-	fallback = -1;
 	read_lock(&tasklist_lock);
-	for_each_process(p) {
- 		if (p->session <= 0)
- 			continue;
-		if (p->pgrp == pgrp) {
-			fallback = p->session;
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid)
+		if (p->session > 0) {
+			sid = p->session;
 			break;
 		}
-		if (p->pid == pgrp)
-			fallback = p->session;
-	}
 	read_unlock(&tasklist_lock);
-	return fallback;
+	return sid;
 }
 
 /*
@@ -135,21 +136,25 @@
  *
  * "I ask you, have you ever known what it is to be an orphan?"
  */
-static int __will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
+static int __will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
 {
 	struct task_struct *p;
-
-	for_each_process(p) {
-		if ((p == ignored_task) || (p->pgrp != pgrp) ||
-		    (p->state == TASK_ZOMBIE) ||
-		    (p->real_parent->pid == 1))
+	struct list_head *l;
+	struct pid *pid;
+	int ret = 1;
+
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+		if (p == ignored_task
+				|| p->state == TASK_ZOMBIE 
+				|| p->real_parent->pid == 1)
 			continue;
-		if ((p->real_parent->pgrp != pgrp) &&
-		    (p->real_parent->session == p->session)) {
- 			return 0;
+		if (p->real_parent->pgrp != pgrp
+			    && p->real_parent->session == p->session) {
+			ret = 0;
+			break;
 		}
 	}
-	return 1;	/* (sighing) "Often!" */
+	return ret;	/* (sighing) "Often!" */
 }
 
 static int will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
@@ -171,11 +176,11 @@
 static inline int __has_stopped_jobs(int pgrp)
 {
 	int retval = 0;
-	struct task_struct * p;
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pid;
 
-	for_each_process(p) {
-		if (p->pgrp != pgrp)
-			continue;
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
 		if (p->state != TASK_STOPPED)
 			continue;
 		retval = 1;
@@ -605,7 +610,8 @@
 	if (tsk->pid == 1)
 		panic("Attempted to kill init!");
 	tsk->flags |= PF_EXITING;
-	del_timer_sync(&tsk->real_timer);
+	if (timer_pending(&tsk->real_timer))
+		del_timer_sync(&tsk->real_timer);
 
 	if (unlikely(preempt_count()))
 		printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
--- linux/kernel/user.c.orig	Thu Sep 19 21:20:08 2002
+++ linux/kernel/user.c	Thu Sep 19 21:30:31 2002
@@ -64,6 +64,11 @@
 	return NULL;
 }
 
+struct user_struct *find_user(uid_t uid)
+{
+	return uid_hash_find(uid, uidhashentry(uid));
+}
+
 void free_uid(struct user_struct *up)
 {
 	if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
--- linux/kernel/pid.c.orig	Thu Sep 19 21:30:31 2002
+++ linux/kernel/pid.c	Thu Sep 19 21:30:31 2002
@@ -0,0 +1,219 @@
+/*
+ * Generic pidhash and scalable, time-bounded PID allocator
+ *
+ * (C) 2002 William Irwin, IBM
+ * (C) 2002 Ingo Molnar, Red Hat
+ *
+ * pid-structures are backing objects for tasks sharing a given ID to chain
+ * against. There is very little to them aside from hashing them and
+ * parking tasks using given ID's on a list.
+ *
+ * The hash is always changed with the tasklist_lock write-acquired,
+ * and the hash is only accessed with the tasklist_lock at least
+ * read-acquired, so there's no additional SMP locking needed here.
+ *
+ * We have a list of bitmap pages, which bitmaps represent the PID space.
+ * Allocating and freeing PIDs is completely lockless. The worst-case
+ * allocation scenario when all but one out of 1 million PIDs possible are
+ * allocated already: the scanning of 32 list entries and at most PAGE_SIZE
+ * bytes. The typical fastpath is a single successful setbit. Freeing is O(1).
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+
+#define PIDHASH_SIZE 4096
+#define pid_hashfn(nr) ((nr >> 8) ^ nr) & (PIDHASH_SIZE - 1)
+static struct list_head pid_hash[PIDTYPE_MAX][PIDHASH_SIZE];
+
+int pid_max = PID_MAX_DEFAULT;
+int last_pid;
+
+#define RESERVED_PIDS		300
+
+#define PIDMAP_ENTRIES		(PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE		(PAGE_SIZE*8)
+#define BITS_PER_PAGE_MASK	(BITS_PER_PAGE-1)
+
+/*
+ * PID-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way a low pid_max
+ * value does not cause lots of bitmaps to be allocated, but
+ * the scheme scales to up to 4 million PIDs, runtime.
+ */
+typedef struct pidmap {
+	atomic_t nr_free;
+	void *page;
+} pidmap_t;
+
+static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+	 { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+
+static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
+
+inline void free_pidmap(int pid)
+{
+	pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+	int offset = pid & BITS_PER_PAGE_MASK;
+
+	clear_bit(offset, map->page);
+	atomic_inc(&map->nr_free);
+}
+
+/*
+ * Here we search for the next map that has free bits left.
+ * Normally the next map has free PIDs.
+ */
+static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+{
+	while (--*max_steps) {
+		if (++map == map_limit)
+			map = pidmap_array;
+		if (unlikely(!map->page)) {
+			unsigned long page = get_zeroed_page(GFP_KERNEL);
+			/*
+			 * Free the page if someone raced with us
+			 * installing it:
+			 */
+			if (cmpxchg(&map->page, NULL, page))
+				free_page(page);
+			if (!map->page)
+				break;
+		}
+		if (atomic_read(&map->nr_free))
+			return map;
+	}
+	return NULL;
+}
+
+int alloc_pidmap(void)
+{
+	int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
+	pidmap_t *map;
+
+	pid = last_pid + 1;
+	if (pid >= pid_max)
+		pid = RESERVED_PIDS;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = pidmap_array + pid / BITS_PER_PAGE;
+
+	if (likely(map->page && !test_and_set_bit(offset, map->page))) {
+		/*
+		 * There is a small window for last_pid updates to race,
+		 * but in that case the next allocation will go into the
+		 * slowpath and that fixes things up.
+		 */
+return_pid:
+		atomic_dec(&map->nr_free);
+		last_pid = pid;
+		return pid;
+	}
+	
+	if (!offset || !atomic_read(&map->nr_free)) {
+next_map:
+		map = next_free_map(map, &max_steps);
+		if (!map)
+			goto failure;
+		offset = 0;
+	}
+	/*
+	 * Find the next zero bit:
+	 */
+scan_more:
+	offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
+	if (offset == BITS_PER_PAGE)
+		goto next_map;
+	if (test_and_set_bit(offset, map->page))
+		goto scan_more;
+
+	/* we got the PID: */
+	pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+	goto return_pid;
+
+failure:
+	return -1;
+}
+
+inline struct pid *find_pid(enum pid_type type, int nr)
+{
+	struct list_head *elem, *bucket = &pid_hash[type][pid_hashfn(nr)];
+	struct pid *pid;
+
+	list_for_each_noprefetch(elem, bucket) {
+		pid = list_entry(elem, struct pid, hash_chain);
+		if (pid->nr == nr)
+			return pid;
+	}
+	return NULL;
+}
+
+int attach_pid(task_t *task, enum pid_type type, int nr)
+{
+	struct pid *pid = find_pid(type, nr);
+
+	if (pid)
+		atomic_inc(&pid->count);
+	else {
+		pid = &task->pids[type].pid;
+		pid->nr = nr;
+		atomic_set(&pid->count, 1);
+		INIT_LIST_HEAD(&pid->task_list);
+		pid->task = current;
+		get_task_struct(current);
+		list_add(&pid->hash_chain, &pid_hash[type][pid_hashfn(nr)]);
+	}
+	list_add(&task->pids[type].pid_chain, &pid->task_list);
+	task->pids[type].pidptr = pid;
+
+	return 0;
+}
+
+void detach_pid(task_t *task, enum pid_type type)
+{
+	struct pid_link *link = task->pids + type;
+	struct pid *pid = link->pidptr;
+	int nr;
+
+	list_del(&link->pid_chain);
+	if (!atomic_dec_and_test(&pid->count))
+		return;
+
+	nr = pid->nr;
+	list_del(&pid->hash_chain);
+	put_task_struct(pid->task);
+
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		if (find_pid(type, nr))
+			return;
+	free_pidmap(nr);
+}
+
+extern task_t *find_task_by_pid(int nr)
+{
+	struct pid *pid = find_pid(PIDTYPE_PID, nr);
+
+	if (!pid)
+		return NULL;
+	return pid_task(pid->task_list.next, PIDTYPE_PID);
+}
+
+void __init pidhash_init(void)
+{
+	int i, j;
+
+	/*
+	 * Allocate PID 0, and hash it via all PID types:
+	 */
+	pidmap_array->page = (void *)get_zeroed_page(GFP_KERNEL);
+	set_bit(0, pidmap_array->page);
+	atomic_dec(&pidmap_array->nr_free);
+
+	for (i = 0; i < PIDTYPE_MAX; i++) {
+		for (j = 0; j < PIDHASH_SIZE; j++)
+			INIT_LIST_HEAD(&pid_hash[i][j]);
+		attach_pid(current, i, 0);
+	}
+}
--- linux/kernel/signal.c.orig	Thu Sep 19 21:20:41 2002
+++ linux/kernel/signal.c	Thu Sep 19 21:30:31 2002
@@ -943,18 +943,18 @@
 
 int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
 {
-	int retval = -EINVAL;
-	if (pgrp > 0) {
-		struct task_struct *p;
-
-		retval = -ESRCH;
-		for_each_process(p) {
-			if (p->pgrp == pgrp) {
-				int err = send_sig_info(sig, info, p);
-				if (retval)
-					retval = err;
-			}
-		}
+	struct task_struct *p;
+	struct list_head *l;
+	struct pid *pid;
+	int err, retval = -ESRCH;
+
+	if (pgrp <= 0)
+		return -EINVAL;
+
+	for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+		err = send_sig_info(sig, info, p);
+		if (retval)
+			retval = err;
 	}
 	return retval;
 }
@@ -977,28 +977,33 @@
  * the connection is lost.
  */
 
+
 int
-kill_sl_info(int sig, struct siginfo *info, pid_t sess)
+kill_sl_info(int sig, struct siginfo *info, pid_t sid)
 {
-	int retval = -EINVAL;
-	if (sess > 0) {
-		struct task_struct *p;
-
-		retval = -ESRCH;
-		read_lock(&tasklist_lock);
-		for_each_process(p) {
-			if (p->leader && p->session == sess) {
-				int err = send_sig_info(sig, info, p);
-				if (retval)
-					retval = err;
-			}
-		}
-		read_unlock(&tasklist_lock);
+	int err, retval = -EINVAL;
+	struct pid *pid;
+	struct list_head *l;
+	struct task_struct *p;
+
+	if (sid <= 0)
+		goto out;
+
+	retval = -ESRCH;
+	read_lock(&tasklist_lock);
+	for_each_task_pid(sid, PIDTYPE_SID, p, l, pid) {
+		if (!p->leader)
+			continue;
+		err = send_sig_info(sig, info, p);
+		if (retval)
+			retval = err;
 	}
+	read_unlock(&tasklist_lock);
+out:
 	return retval;
 }
 
-inline int
+int
 kill_proc_info(int sig, struct siginfo *info, pid_t pid)
 {
 	int error;
--- linux/kernel/sys.c.orig	Thu Sep 19 21:20:41 2002
+++ linux/kernel/sys.c	Thu Sep 19 21:30:31 2002
@@ -203,35 +203,34 @@
 cond_syscall(sys_quotactl)
 cond_syscall(sys_acct)
 
-static int proc_sel(struct task_struct *p, int which, int who)
+static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
-	if(p->pid)
-	{
-		switch (which) {
-			case PRIO_PROCESS:
-				if (!who && p == current)
-					return 1;
-				return(p->pid == who);
-			case PRIO_PGRP:
-				if (!who)
-					who = current->pgrp;
-				return(p->pgrp == who);
-			case PRIO_USER:
-				if (!who)
-					who = current->uid;
-				return(p->uid == who);
-		}
+	if (p->uid != current->euid &&
+		p->uid != current->uid && !capable(CAP_SYS_NICE)) {
+		error = -EPERM;
+		goto out;
 	}
-	return 0;
+
+	if (error == -ESRCH)
+		error = 0;
+	if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+		error = -EACCES;
+	else
+		set_user_nice(p, niceval);
+out:
+	return error;
 }
 
 asmlinkage long sys_setpriority(int which, int who, int niceval)
 {
 	struct task_struct *g, *p;
-	int error;
+	struct user_struct *user;
+	struct pid *pid;
+	struct list_head *l;
+	int error = -EINVAL;
 
 	if (which > 2 || which < 0)
-		return -EINVAL;
+		goto out;
 
 	/* normalize: avoid signed division (rounding problems) */
 	error = -ESRCH;
@@ -241,31 +240,38 @@
 		niceval = 19;
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
-		int no_nice;
-		if (!proc_sel(p, which, who))
-			continue;
-		if (p->uid != current->euid &&
-			p->uid != current->uid && !capable(CAP_SYS_NICE)) {
-			error = -EPERM;
-			continue;
-		}
-		if (error == -ESRCH)
-			error = 0;
-		if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
-			error = -EACCES;
-			continue;
-		}
-		no_nice = security_ops->task_setnice(p, niceval);
-		if (no_nice) {
-			error = no_nice;
-			continue;
-		}
-		set_user_nice(p, niceval);
-	} while_each_thread(g, p);
-
+	switch (which) {
+		case PRIO_PROCESS:
+			if (!who)
+				who = current->pid;
+			p = find_task_by_pid(who);
+			if (p)
+				error = set_one_prio(p, niceval, error);
+			break;
+		case PRIO_PGRP:
+			if (!who)
+				who = current->pgrp;
+			for_each_task_pid(who, PIDTYPE_PGID, p, l, pid)
+				error = set_one_prio(p, niceval, error);
+			break;
+		case PRIO_USER:
+			if (!who)
+				user = current->user;
+			else
+				user = find_user(who);
+
+			if (!user)
+				goto out_unlock;
+
+			do_each_thread(g, p)
+				if (p->uid == who)
+					error = set_one_prio(p, niceval, error);
+			while_each_thread(g, p);
+			break;
+	}
+out_unlock:
 	read_unlock(&tasklist_lock);
-
+out:
 	return error;
 }
 
@@ -278,20 +284,54 @@
 asmlinkage long sys_getpriority(int which, int who)
 {
 	struct task_struct *g, *p;
-	long retval = -ESRCH;
+	struct list_head *l;
+	struct pid *pid;
+	struct user_struct *user;
+	long niceval, retval = -ESRCH;
 
 	if (which > 2 || which < 0)
 		return -EINVAL;
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
-		long niceval;
-		if (!proc_sel(p, which, who))
-			continue;
-		niceval = 20 - task_nice(p);
-		if (niceval > retval)
-			retval = niceval;
-	} while_each_thread(g, p);
+	switch (which) {
+		case PRIO_PROCESS:
+			if (!who)
+				who = current->pid;
+			p = find_task_by_pid(who);
+			if (p) {
+				niceval = 20 - task_nice(p);
+				if (niceval > retval)
+					retval = niceval;
+			}
+			break;
+		case PRIO_PGRP:
+			if (!who)
+				who = current->pgrp;
+			for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) {
+				niceval = 20 - task_nice(p);
+				if (niceval > retval)
+					retval = niceval;
+			}
+			break;
+		case PRIO_USER:
+			if (!who)
+				user = current->user;
+			else
+				user = find_user(who);
+
+			if (!user)
+				goto out_unlock;
+
+			do_each_thread(g, p)
+				if (p->uid == who) {
+					niceval = 20 - task_nice(p);
+					if (niceval > retval)
+						retval = niceval;
+				}
+			while_each_thread(g, p);
+			break;
+	}
+out_unlock:
 	read_unlock(&tasklist_lock);
 
 	return retval;
@@ -849,7 +889,7 @@
 
 asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
 {
-	struct task_struct * p;
+	struct task_struct *p;
 	int err = -EINVAL;
 
 	if (!pid)
@@ -862,12 +902,15 @@
 	/* From this point forward we keep holding onto the tasklist lock
 	 * so that our parent does not change from under us. -DaveM
 	 */
-	read_lock(&tasklist_lock);
+	write_lock_irq(&tasklist_lock);
 
 	err = -ESRCH;
 	p = find_task_by_pid(pid);
 	if (!p)
 		goto out;
+	err = -EINVAL;
+	if (!thread_group_leader(p))
+		goto out;
 
 	if (p->parent == current || p->real_parent == current) {
 		err = -EPERM;
@@ -882,25 +925,26 @@
 	if (p->leader)
 		goto out;
 	if (pgid != pid) {
-		struct task_struct *g, *tmp;
-		do_each_thread(g, tmp) {
-			if (tmp->pgrp == pgid &&
-			    tmp->session == current->session)
+		struct task_struct *p;
+		struct pid *pid;
+		struct list_head *l;
+
+		for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
+			if (p->session == current->session)
 				goto ok_pgid;
-		} while_each_thread(g, tmp);
 		goto out;
 	}
 
 ok_pgid:
-	err = security_ops->task_setpgid(p, pgid);
-	if (err)
-		goto out;
-
-	p->pgrp = pgid;
+	if (p->pgrp != pgid) {
+		detach_pid(p, PIDTYPE_PGID);
+		p->pgrp = pgid;
+		attach_pid(p, PIDTYPE_PGID, pgid);
+	}
 	err = 0;
 out:
 	/* All paths lead to here, thus we are safe. -DaveM */
-	read_unlock(&tasklist_lock);
+	write_unlock_irq(&tasklist_lock);
 	return err;
 }
 
@@ -956,22 +1000,34 @@
 
 asmlinkage long sys_setsid(void)
 {
-	struct task_struct *g, *p;
+	struct pid *pid;
 	int err = -EPERM;
 
-	read_lock(&tasklist_lock);
-	do_each_thread(g, p)
-		if (p->pgrp == current->pid)
-			goto out;
-	while_each_thread(g, p);
+	if (!thread_group_leader(current))
+		return -EINVAL;
+
+	write_lock_irq(&tasklist_lock);
+
+	pid = find_pid(PIDTYPE_PGID, current->pid);
+	if (pid)
+		goto out;
 
 	current->leader = 1;
-	current->session = current->pgrp = current->pid;
+	if (current->session != current->pid) {
+		detach_pid(current, PIDTYPE_SID);
+		current->session = current->pid;
+		attach_pid(current, PIDTYPE_SID, current->pid);
+	}
+	if (current->pgrp != current->pid) {
+		detach_pid(current, PIDTYPE_PGID);
+		current->pgrp = current->pid;
+		attach_pid(current, PIDTYPE_PGID, current->pid);
+	}
 	current->tty = NULL;
 	current->tty_old_pgrp = 0;
 	err = current->pgrp;
 out:
-	read_unlock(&tasklist_lock);
+	write_unlock_irq(&tasklist_lock);
 	return err;
 }
 
--- linux/kernel/fork.c.orig	Thu Sep 19 21:30:18 2002
+++ linux/kernel/fork.c	Thu Sep 19 21:30:31 2002
@@ -47,17 +47,6 @@
 int max_threads;
 unsigned long total_forks;	/* Handle normal Linux uptimes. */
 
-/*
- * Protects next_safe, last_pid and pid_max:
- */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
-static int next_safe = DEFAULT_PID_MAX;
-int pid_max = DEFAULT_PID_MAX;
-int last_pid;
-
-struct task_struct *pidhash[PIDHASH_SZ];
-
 rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED;  /* outer */
 
 /*
@@ -75,16 +64,14 @@
 	} else {
 		int cpu = smp_processor_id();
 
-		tsk = task_cache[cpu];
+		tsk = xchg(task_cache + cpu, tsk);
 		if (tsk) {
 			free_thread_info(tsk->thread_info);
 			kmem_cache_free(task_struct_cachep,tsk);
 		}
-		task_cache[cpu] = current;
 	}
 }
 
-/* Protects next_safe and last_pid. */
 void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
 {
 	unsigned long flags;
@@ -140,73 +127,28 @@
 	struct task_struct *tsk;
 	struct thread_info *ti;
 
-	ti = alloc_thread_info();
-	if (!ti)
-		return NULL;
-
-	tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+	tsk = xchg(task_cache + smp_processor_id(), NULL);
 	if (!tsk) {
-		free_thread_info(ti);
-		return NULL;
-	}
+		ti = alloc_thread_info();
+		if (!ti)
+			return NULL;
+
+		tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+		if (!tsk) {
+			free_thread_info(ti);
+			return NULL;
+		}
+	} else
+		ti = tsk->thread_info;
 
 	*ti = *orig->thread_info;
 	*tsk = *orig;
 	tsk->thread_info = ti;
 	ti->task = tsk;
 	atomic_set(&tsk->usage,1);
-
 	return tsk;
 }
 
-static int get_pid(unsigned long flags)
-{
-	struct task_struct *g, *p;
-	int pid;
-
-	if (flags & CLONE_IDLETASK)
-		return 0;
-
-	spin_lock(&lastpid_lock);
-	if (++last_pid > pid_max) {
-		last_pid = 300;		/* Skip daemons etc. */
-		goto inside;
-	}
-
-	if (last_pid >= next_safe) {
-inside:
-		if (nr_threads > pid_max >> 4)
-			pid_max <<= 1;
-		next_safe = pid_max;
-		read_lock(&tasklist_lock);
-	repeat:
-		do_each_thread(g, p) {
-			if (p->pid == last_pid	||
-			   p->pgrp == last_pid	||
-			   p->session == last_pid) {
-				if (++last_pid >= next_safe) {
-					if (last_pid >= pid_max)
-						last_pid = 300;
-					next_safe = pid_max;
-				}
-				goto repeat;
-			}
-			if (p->pid > last_pid && next_safe > p->pid)
-				next_safe = p->pid;
-			if (p->pgrp > last_pid && next_safe > p->pgrp)
-				next_safe = p->pgrp;
-			if (p->session > last_pid && next_safe > p->session)
-				next_safe = p->session;
-		} while_each_thread(g, p);
-
-		read_unlock(&tasklist_lock);
-	}
-	pid = last_pid;
-	spin_unlock(&lastpid_lock);
-
-	return pid;
-}
-
 static inline int dup_mmap(struct mm_struct * mm)
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
@@ -726,7 +668,13 @@
 	p->state = TASK_UNINTERRUPTIBLE;
 
 	copy_flags(clone_flags, p);
-	p->pid = get_pid(clone_flags);
+	if (clone_flags & CLONE_IDLETASK)
+		p->pid = 0;
+	else {
+		p->pid = alloc_pidmap();
+		if (p->pid == -1)
+			goto bad_fork_cleanup;
+	}
 	p->proc_dentry = NULL;
 
 	INIT_LIST_HEAD(&p->run_list);
@@ -889,7 +837,13 @@
 	SET_LINKS(p);
 	if (p->ptrace & PT_PTRACED)
 		__ptrace_link(p, current->parent);
-	hash_pid(p);
+
+	attach_pid(p, PIDTYPE_PID, p->pid);
+	if (thread_group_leader(p)) {
+		attach_pid(p, PIDTYPE_PGID, p->pgrp);
+		attach_pid(p, PIDTYPE_SID, p->session);
+	}
+
 	nr_threads++;
 	write_unlock_irq(&tasklist_lock);
 	retval = 0;
@@ -914,6 +868,8 @@
 bad_fork_cleanup_security:
 	security_ops->task_free_security(p);
 bad_fork_cleanup:
+	if (p->pid > 0)
+		free_pidmap(p->pid);
 	put_exec_domain(p->thread_info->exec_domain);
 	if (p->binfmt && p->binfmt->module)
 		__MOD_DEC_USE_COUNT(p->binfmt->module);
--- linux/kernel/ksyms.c.orig	Thu Sep 19 21:20:51 2002
+++ linux/kernel/ksyms.c	Thu Sep 19 21:30:31 2002
@@ -602,7 +602,6 @@
 EXPORT_SYMBOL(init_thread_union);
 
 EXPORT_SYMBOL(tasklist_lock);
-EXPORT_SYMBOL(pidhash);
 #if defined(CONFIG_SMP) && defined(__GENERIC_PER_CPU)
 EXPORT_SYMBOL(__per_cpu_offset);
 #endif
--- linux/init/main.c.orig	Thu Sep 19 21:20:41 2002
+++ linux/init/main.c	Thu Sep 19 21:30:31 2002
@@ -66,6 +66,7 @@
 extern void sysctl_init(void);
 extern void signals_init(void);
 extern void buffer_init(void);
+extern void pidhash_init(void);
 extern void pte_chain_init(void);
 extern void radix_tree_init(void);
 extern void free_initmem(void);
@@ -432,6 +433,7 @@
 #endif
 	mem_init();
 	kmem_cache_sizes_init();
+	pidhash_init();
 	pgtable_cache_init();
 	pte_chain_init();
 	fork_init(num_physpages);


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 19:38               ` [patch] generic-pidhash-2.5.36-J2, BK-curr Ingo Molnar
@ 2002-09-19 20:21                 ` Christoph Hellwig
  2002-09-19 23:32                   ` Dave Jones
  2002-09-19 21:31                 ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2002-09-19 20:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, William Lee Irwin III, linux-kernel

On Thu, Sep 19, 2002 at 09:38:03PM +0200, Ingo Molnar wrote:
>  - add list_for_each_noprefetch() to list.h, for all those list users who 
>    know that in the majority of cases the list is going to be short.

That name is really ugly, as the lack of the prefetch is an implementation
detail.  What about list_for_each_short or __list_for_each?



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 19:10               ` Kai Henningsen
@ 2002-09-19 20:32                 ` Linus Torvalds
  2002-09-19 22:29                   ` Miquel van Smoorenburg
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2002-09-19 20:32 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: linux-kernel


On 19 Sep 2002, Kai Henningsen wrote:
> 
> On the contrary: it says that this can never happen - the new session has  
> no controlling terminal, and can't get the old one unless the old session  
> loses it first.

Hmm.. I read it as "the tty stays with the stale group", which is
problematic. But if all the places that set a new controlling terminal
check that it's not already used by some other non-session then I guess 
we're still ok..

		Linus



* Re: [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 19:38               ` [patch] generic-pidhash-2.5.36-J2, BK-curr Ingo Molnar
  2002-09-19 20:21                 ` Christoph Hellwig
@ 2002-09-19 21:31                 ` Linus Torvalds
  2002-09-19 21:39                   ` Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2002-09-19 21:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: William Lee Irwin III, linux-kernel


Ok, applied.

I'm also applying the session handling changes to tty_io.c as a separate
changeset, since the resulting code is certainly cleaner and reading
peoples areguments and looking at the code have made me think it _is_
correct after all.

And as a separate changeset it will be easier to test and perhaps revert 
on demand.

		Linus



* Re: [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 21:31                 ` Linus Torvalds
@ 2002-09-19 21:39                   ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19 21:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel


On Thu, 19 Sep 2002, Linus Torvalds wrote:

> I'm also applying the session handling changes to tty_io.c as a separate
> changeset, since the resulting code is certainly cleaner and reading
> people's arguments and looking at the code have made me think it _is_
> correct after all.

yes, i too think it's correct. In fact this is partly proven by one piece
of code in the old tty_io.c file:

void disassociate_ctty(int on_exit)
[...]
        read_lock(&tasklist_lock);
        for_each_process(p)
                if (p->session == current->session)
                        p->tty = NULL;
        read_unlock(&tasklist_lock);

ie. if the SID-based iteration is incorrect, then the above code is 
incorrect already - but we never saw any problems in this area.

	Ingo



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19 20:32                 ` Linus Torvalds
@ 2002-09-19 22:29                   ` Miquel van Smoorenburg
  0 siblings, 0 replies; 28+ messages in thread
From: Miquel van Smoorenburg @ 2002-09-19 22:29 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.44.0209191324310.1277-100000@home.transmeta.com>,
Linus Torvalds  <torvalds@transmeta.com> wrote:
>
>On 19 Sep 2002, Kai Henningsen wrote:
>> 
>> On the contrary: it says that this can never happen - the new session has  
>> no controlling terminal, and can't get the old one unless the old session  
>> loses it first.
>
>Hmm.. I read it as "the tty stays with the stale group", which is
>problematic. But if all the places that set a new controlling terminal
>check that it's not already used by some other non-session then I guess 
>we're still ok..

No, it said that if a new session is created the tty stays with
the old session group which is by no means stale.

A session leader cannot create a new session, since the session-id
is the pid of the leader. Only children can create a new session,
and at that point the tty is not the controlling tty of the
_newly created_ session. Of course the original session still
exists, and the tty is still the controlling tty of that session.

Only if the new session leader opens a tty that hasn't been
associated with a session yet will it become the controlling
tty of that session.

So to associate a tty with another session, you'll have to
disassociate it from the existing session by a) killing all
processes in the session or b) calling TIOCNOTTY. Then another
session can open it and it will become the controlling tty
of the other session (if it didn't have a controlling tty
yet, that is!)

A session can also steal away a controlling tty from another
session by using TIOCSCTTY (if it's root) but the tty will
first be disassociated from the old session and only then
will it become the controlling tty of the stealing session.

See drivers/char/tty_io.c::tiocsctty() for how it loops
over all tasks to do this ..

Mike.



* Re: [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 20:21                 ` Christoph Hellwig
@ 2002-09-19 23:32                   ` Dave Jones
  2002-09-19 23:46                     ` Ingo Molnar
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Jones @ 2002-09-19 23:32 UTC (permalink / raw)
  To: Christoph Hellwig, Ingo Molnar, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Thu, Sep 19, 2002 at 09:21:11PM +0100, Christoph Hellwig wrote:

 > >  - add list_for_each_noprefetch() to list.h, for all those list users who 
 > >    know that in the majority of cases the list is going to be short.
 > That name is really ugly, as the lack ofthe prefetch is an implementation
 > detail.  What about list_for_each_short or __list_for_each?

Wasn't this the reason we have list_for_each_safe() ?

        Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: [patch] generic-pidhash-2.5.36-J2, BK-curr
  2002-09-19 23:32                   ` Dave Jones
@ 2002-09-19 23:46                     ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-19 23:46 UTC (permalink / raw)
  To: Dave Jones
  Cc: Christoph Hellwig, Linus Torvalds, William Lee Irwin III, linux-kernel


On Fri, 20 Sep 2002, Dave Jones wrote:

>  > >  - add list_for_each_noprefetch() to list.h, for all those list users who 
>  > >    know that in the majority of cases the list is going to be short.
>  > That name is really ugly, as the lack of the prefetch is an implementation
>  > detail.  What about list_for_each_short or __list_for_each?
> 
> Wasn't this the reason we have list_for_each_safe() ?

no, list_for_each_safe() is a loop that is entry-removal safe.

	Ingo



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-19  2:54 ` [patch] generic-pidhash-2.5.36-D4, BK-curr Ingo Molnar
  2002-09-19  6:10   ` Linus Torvalds
@ 2002-09-20  8:27   ` Oleg Drokin
  2002-09-20  9:40     ` Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Oleg Drokin @ 2002-09-20  8:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, William Lee Irwin III, linux-kernel

Hello!

On Thu, Sep 19, 2002 at 04:54:46AM +0200, Ingo Molnar wrote:

>  - unified kernel/pid.c and kernel/idtag.c into kernel/pid.c.

> +			/*
> +			 * Free the page if someone raced with us
> +			 * installing it:
> +			 */
> +			if (cmpxchg(&map->page, NULL, page))
> +				free_page(page);

Note that this piece breaks compilation for every arch that does not
have a cmpxchg implementation.
This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).

Bye,
    Oleg


* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20  8:27   ` [patch] generic-pidhash-2.5.36-D4, BK-curr Oleg Drokin
@ 2002-09-20  9:40     ` Ingo Molnar
  2002-09-20 11:43       ` Oleg Drokin
  2002-09-20 17:11       ` Andrew Morton
  0 siblings, 2 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-20  9:40 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Linus Torvalds, William Lee Irwin III, linux-kernel


On Fri, 20 Sep 2002, Oleg Drokin wrote:

> > +			if (cmpxchg(&map->page, NULL, page))
> > +				free_page(page);
> 
> Note that this piece breaks compilation for every arch that does not
> have a cmpxchg implementation.
> This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).

we need a cmpxchg() function in the generic library, using a spinlock.  
Then every architecture can enhance the implementation if it wishes to.

	Ingo



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20  9:40     ` Ingo Molnar
@ 2002-09-20 11:43       ` Oleg Drokin
  2002-09-20 12:15         ` Russell King
  2002-09-20 16:34         ` Ingo Molnar
  2002-09-20 17:11       ` Andrew Morton
  1 sibling, 2 replies; 28+ messages in thread
From: Oleg Drokin @ 2002-09-20 11:43 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, William Lee Irwin III, linux-kernel

Hello!

On Fri, Sep 20, 2002 at 11:40:16AM +0200, Ingo Molnar wrote:
> > > +			if (cmpxchg(&map->page, NULL, page))
> > > +				free_page(page);
> > Note that this piece breaks compilation for every arch that does not
> > have a cmpxchg implementation.
> > This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> > ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
> we need a cmpxchg() function in the generic library, using a spinlock.  

But this is not safe for arches that provide SMP but do not provide
cmpxchg in hardware, right?
I mean it is only safe to use such spinlock-based function if
all other places read and write this value via special functions that are
also taking this spinlock.
Do you think we can count on this?

Bye,
    Oleg


* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20 11:43       ` Oleg Drokin
@ 2002-09-20 12:15         ` Russell King
  2002-09-20 16:34         ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Russell King @ 2002-09-20 12:15 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: Ingo Molnar, Linus Torvalds, William Lee Irwin III, linux-kernel

On Fri, Sep 20, 2002 at 03:43:47PM +0400, Oleg Drokin wrote:
> On Fri, Sep 20, 2002 at 11:40:16AM +0200, Ingo Molnar wrote:
> > we need a cmpxchg() function in the generic library, using a spinlock.  
> 
> But this is not safe for arches that provide SMP but do not provide
> cmpxchg in hardware, right?
> I mean it is only safe to use such spinlock-based function if
> all other places read and write this value via special functions that are
> also taking this spinlock.

Correct.

> Do you think we can count on this?

I don't think so.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20 11:43       ` Oleg Drokin
  2002-09-20 12:15         ` Russell King
@ 2002-09-20 16:34         ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-20 16:34 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Linus Torvalds, William Lee Irwin III, linux-kernel


On Fri, 20 Sep 2002, Oleg Drokin wrote:

> I mean it is only safe to use such spinlock-based function if all other
> places read and write this value via special functions that are also
> taking this spinlock.

yes - but this is true in this specific PID allocator case, we only access
it via cmpxchg().

	Ingo




* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20  9:40     ` Ingo Molnar
  2002-09-20 11:43       ` Oleg Drokin
@ 2002-09-20 17:11       ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2002-09-20 17:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Drokin, Linus Torvalds, William Lee Irwin III, linux-kernel

Ingo Molnar wrote:
> 
> On Fri, 20 Sep 2002, Oleg Drokin wrote:
> 
> > > +                   if (cmpxchg(&map->page, NULL, page))
> > > +                           free_page(page);
> >
> > Note that this piece breaks compilation for every arch that does not
> > have a cmpxchg implementation.
> > This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> > ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
> 
> we need a cmpxchg() function in the generic library, using a spinlock.
> Then every architecture can enhance the implementation if it wishes to.
> 

That would be good, but wouldn't we then need a special per-arch
"cmpxchgable" type, like atomic_t?

Seems that just doing compare-and-exchange on a bare page* might force
some architectures to use a global lock.


* RE:  [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20 13:12 Hanumanthu. H
  2002-09-20 13:09 ` Oleg Drokin
@ 2002-09-20 16:47 ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2002-09-20 16:47 UTC (permalink / raw)
  To: Hanumanthu. H; +Cc: linux-kernel


On Fri, 20 Sep 2002, Hanumanthu. H wrote:

> First of all, using a bitmap is not that good, as discussed on this
> mailing list (I think some good guys at IBM already implemented a
> bitmap for pids as part of the Linux scalability enhancements & [...]

i mentioned that patch in earlier emails, and it is a whole different
thing. That patch built a completely new bitmap *every* time we hit an
already allocated PID, to find the next 'safe range'. The new PID
allocator in 2.5.37 keeps one bitmap, does a single set-bit operation
on the map in the fastpath, and does a fast bitscan forward even in the
worst case.
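[A toy sketch of that scheme: one persistent bitmap, a set-bit in the
fastpath, and a forward bitscan on collision. This is a simplified
single-array illustration only - the real 2.5.37 allocator allocates map
pages on demand and synchronizes via cmpxchg(), as in the hunk quoted
earlier in the thread; the names and sizes here are made up.]

```c
#include <string.h>

#define PID_MAX 4096                       /* toy value; the kernel uses pid_max */
#define BITS_PER_LONG (8 * sizeof(long))

static unsigned long pidmap[PID_MAX / (8 * sizeof(long))];
static int last_pid;

/* Fastpath: try the bit after last_pid; on collision, scan forward
 * for the next zero bit instead of rebuilding any bitmap. */
int alloc_pid(void)
{
    int pid = last_pid + 1;

    if (pid >= PID_MAX)
        pid = 1;                           /* wrap; PID 0 stays reserved */

    for (int i = 0; i < PID_MAX; i++, pid = (pid + 1) % PID_MAX) {
        if (pid == 0)
            continue;
        unsigned long mask = 1UL << (pid % BITS_PER_LONG);
        if (!(pidmap[pid / BITS_PER_LONG] & mask)) {
            pidmap[pid / BITS_PER_LONG] |= mask;   /* the set-bit fastpath */
            last_pid = pid;
            return pid;
        }
    }
    return -1;                             /* pidmap exhausted */
}

void free_pid(int pid)
{
    pidmap[pid / BITS_PER_LONG] &= ~(1UL << (pid % BITS_PER_LONG));
}
```

[Because the scan always moves forward from last_pid, freed low PIDs are
not reused until the counter wraps, which keeps allocation sequential in
the common case.]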

> remember they discussed pros & cons too) & [...]

(that approach was pretty gross i have to say.)

> [...] atomic operations are not that cheap anyway. [...]

huh?

	Ingo



* RE:  [patch] generic-pidhash-2.5.36-D4, BK-curr
@ 2002-09-20 13:12 Hanumanthu. H
  2002-09-20 13:09 ` Oleg Drokin
  2002-09-20 16:47 ` Ingo Molnar
  0 siblings, 2 replies; 28+ messages in thread
From: Hanumanthu. H @ 2002-09-20 13:12 UTC (permalink / raw)
  To: linux-kernel



On Fri, Sep 20, 2002 at 11:40:16AM +0200, Ingo Molnar wrote:
> > > +			if (cmpxchg(&map->page, NULL, page))
> > > +				free_page(page);
> > Note that this piece breaks compilation for every arch that does not
> > have a cmpxchg implementation.
> > This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g.
> > i386), ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32,
> > UML (for x86).
> we need a cmpxchg() function in the generic library, using a spinlock.

And Oleg Drokin replied:
> But this is not safe for arches that provide SMP but do not provide
> cmpxchg in hardware, right?
> I mean it is only safe to use such a spinlock-based function if
> all other places read and write this value via special functions that
> are also taking this spinlock.
> Do you think we can count on this?
>
> Bye,
>     Oleg

First of all, using a bitmap is not that good, as discussed on this
mailing list (I think some good guys at IBM already implemented a
bitmap for pids as part of the Linux scalability enhancements & I
remember they discussed the pros & cons too) & atomic operations
are not that cheap anyway. However, it is good to have something
better than the previous one.

~Hanu




* Re: [patch] generic-pidhash-2.5.36-D4, BK-curr
  2002-09-20 13:12 Hanumanthu. H
@ 2002-09-20 13:09 ` Oleg Drokin
  2002-09-20 16:47 ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Oleg Drokin @ 2002-09-20 13:09 UTC (permalink / raw)
  To: Hanumanthu. H; +Cc: linux-kernel

Hello!

On Fri, Sep 20, 2002 at 06:42:46PM +0530, Hanumanthu. H wrote:

> are not that cheap anyway. However it is good to have something
> better than the previous one.

This is only true if it works for everyone.

Bye,
    Oleg


* RE:  [patch] generic-pidhash-2.5.36-D4, BK-curr
@ 2002-09-20 13:03 Hanumanthu. H
  0 siblings, 0 replies; 28+ messages in thread
From: Hanumanthu. H @ 2002-09-20 13:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: jw

Let me first say thanks to jw schultz <jw@pegasys.ws>, who already
suggested this idea (as a modification to one of my earlier
proposals). It is good that someone finally did it without waiting
for somebody to stand up and ask for it (well, I did not wait, I was
just busy with my paid work :-)).




end of thread, other threads:[~2002-09-20 17:07 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.44.0209182101150.27697-100000@localhost.localdomain>
2002-09-19  2:54 ` [patch] generic-pidhash-2.5.36-D4, BK-curr Ingo Molnar
2002-09-19  6:10   ` Linus Torvalds
2002-09-19  9:25     ` Ingo Molnar
2002-09-19 10:59       ` William Lee Irwin III
2002-09-19 15:12         ` Linus Torvalds
2002-09-19 15:21           ` Ingo Molnar
2002-09-19 16:35           ` Andries Brouwer
2002-09-19 16:43             ` Linus Torvalds
2002-09-19 18:54               ` Miquel van Smoorenburg
2002-09-19 19:10               ` Kai Henningsen
2002-09-19 20:32                 ` Linus Torvalds
2002-09-19 22:29                   ` Miquel van Smoorenburg
2002-09-19 19:38               ` [patch] generic-pidhash-2.5.36-J2, BK-curr Ingo Molnar
2002-09-19 20:21                 ` Christoph Hellwig
2002-09-19 23:32                   ` Dave Jones
2002-09-19 23:46                     ` Ingo Molnar
2002-09-19 21:31                 ` Linus Torvalds
2002-09-19 21:39                   ` Ingo Molnar
2002-09-20  8:27   ` [patch] generic-pidhash-2.5.36-D4, BK-curr Oleg Drokin
2002-09-20  9:40     ` Ingo Molnar
2002-09-20 11:43       ` Oleg Drokin
2002-09-20 12:15         ` Russell King
2002-09-20 16:34         ` Ingo Molnar
2002-09-20 17:11       ` Andrew Morton
2002-09-20 13:03 Hanumanthu. H
2002-09-20 13:12 Hanumanthu. H
2002-09-20 13:09 ` Oleg Drokin
2002-09-20 16:47 ` Ingo Molnar
