linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Fwd: [Lse-tech] get_pid() performance fix
@ 2002-03-05 16:43 Rajan Ravindran
  2002-03-05 17:57 ` OGAWA Hirofumi
  2002-03-14 23:18 ` [PATCH] " Hubertus Franke
  0 siblings, 2 replies; 28+ messages in thread
From: Rajan Ravindran @ 2002-03-05 16:43 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: linux-kernel, lse-tech



Yes, pid's are guaranteed to be unique.
Here the problem we focused is the time taken in finding the next
available free pid.
I really don't mean by your task->xxx.

-Rajan



Hubertus Franke <frankeh@watson.ibm.com> writes:

> @@ -153,13 +155,18 @@
>                               if(last_pid & 0xffff8000)
>                                     last_pid = 300;
>                               next_safe = PID_MAX;
> +                             goto repeat;
>                         }
> -                       goto repeat;
> +                       if(unlikely(last_pid == beginpid))
> +                             goto nomorepids;
> +                       continue;

It isn't guaranteed that pid is unique.

In the case:
             task->pid = 300, task->xxx = 301
             pid 301 is free

             This get_pid() returns 301.

Regards.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

_______________________________________________
Lse-tech mailing list
Lse-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lse-tech





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 16:43 Fwd: [Lse-tech] get_pid() performance fix Rajan Ravindran
@ 2002-03-05 17:57 ` OGAWA Hirofumi
  2002-03-05 19:53   ` Hubertus Franke
  2002-03-14 23:18 ` [PATCH] " Hubertus Franke
  1 sibling, 1 reply; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-05 17:57 UTC (permalink / raw)
  To: Rajan Ravindran; +Cc: linux-kernel, lse-tech

"Rajan Ravindran" <rajancr@us.ibm.com> writes:

> Yes, pid's are guaranteed to be unique.
> Here the problem we focused is the time taken in finding the next
> available free pid.
> I really don't mean by your task->xxx.

I'm confused.

I said:
	task { pid = 300, pgrp = 301, };
	301 is free;

	get_pid() returns 301.

"task 301" can't call setsid(). pid 301 is available?
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 17:57 ` OGAWA Hirofumi
@ 2002-03-05 19:53   ` Hubertus Franke
  2002-03-05 20:10     ` OGAWA Hirofumi
  0 siblings, 1 reply; 28+ messages in thread
From: Hubertus Franke @ 2002-03-05 19:53 UTC (permalink / raw)
  To: OGAWA Hirofumi, Rajan Ravindran; +Cc: linux-kernel, lse-tech

On Tuesday 05 March 2002 12:57 pm, OGAWA Hirofumi wrote:
> "Rajan Ravindran" <rajancr@us.ibm.com> writes:
> > Yes, pid's are guaranteed to be unique.
> > Here the problem we focused is the time taken in finding the next
> > available free pid.
> > I really don't mean by your task->xxx.
>
> I'm confused.
>

yes you are .....

> I said:
> 	task { pid = 300, pgrp = 301, };
> 	301 is free;
>
> 	get_pid() returns 301.
>
> "task 301" can't call setsid(). pid 301 is available?

The original code is/was:

                        if(p->pid == last_pid   ||
                           p->pgrp == last_pid  ||
                           p->tgid == last_pid  ||
                           p->session == last_pid) {
                                if(++last_pid >= next_safe) {
                                        if(last_pid & 0xffff8000)
                                                last_pid = 300;
                                        next_safe = PID_MAX;
                                }
                                goto repeat;
                        }

if any process holds the pgrp=301 as in your case, 301 won't be eligible due 
to (p->pgrp == last_pid) check.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 19:53   ` Hubertus Franke
@ 2002-03-05 20:10     ` OGAWA Hirofumi
  2002-03-05 21:59       ` Hubertus Franke
  0 siblings, 1 reply; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-05 20:10 UTC (permalink / raw)
  To: frankeh; +Cc: Rajan Ravindran, linux-kernel, lse-tech

Hubertus Franke <frankeh@watson.ibm.com> writes:

> > I said:
> > 	task { pid = 300, pgrp = 301, };
> > 	301 is free;
> >
> > 	get_pid() returns 301.
> >
> > "task 301" can't call setsid(). pid 301 is available?
> 
> The original code is/was:
> 
>                         if(p->pid == last_pid   ||
>                            p->pgrp == last_pid  ||
>                            p->tgid == last_pid  ||
>                            p->session == last_pid) {
>                                 if(++last_pid >= next_safe) {
>                                         if(last_pid & 0xffff8000)
>                                                 last_pid = 300;
>                                         next_safe = PID_MAX;
>                                 }
>                                 goto repeat;
>                         }
> 
> if any process holds the pgrp=301 as in your case, 301 won't be eligible due 
> to (p->pgrp == last_pid) check.

I know.

> @@ -153,13 +155,18 @@
>                               if(last_pid & 0xffff8000)
>                                     last_pid = 300;
>                               next_safe = PID_MAX;
> +                             goto repeat;
>                         }
> -                       goto repeat;
> +                       if(unlikely(last_pid == beginpid))
> +                             goto nomorepids;
> +                       continue;

You changed it. No?
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 20:10     ` OGAWA Hirofumi
@ 2002-03-05 21:59       ` Hubertus Franke
  2002-03-05 22:48         ` OGAWA Hirofumi
  0 siblings, 1 reply; 28+ messages in thread
From: Hubertus Franke @ 2002-03-05 21:59 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: Rajan Ravindran, linux-kernel, lse-tech

On Tuesday 05 March 2002 03:10 pm, OGAWA Hirofumi wrote:
> Hubertus Franke <frankeh@watson.ibm.com> writes:
> > > I said:
> > > 	task { pid = 300, pgrp = 301, };
> > > 	301 is free;
> > >
> > > 	get_pid() returns 301.
> > >
> > > "task 301" can't call setsid(). pid 301 is available?
> >
> > The original code is/was:
> >
> >                         if(p->pid == last_pid   ||
> >                            p->pgrp == last_pid  ||
> >                            p->tgid == last_pid  ||
> >                            p->session == last_pid) {
> >                                 if(++last_pid >= next_safe) {
> >                                         if(last_pid & 0xffff8000)
> >                                                 last_pid = 300;
> >                                         next_safe = PID_MAX;
> >                                 }
> >                                 goto repeat;
> >                         }
> >
> > if any process holds the pgrp=301 as in your case, 301 won't be eligible
> > due to (p->pgrp == last_pid) check.
>
> I know.
>
> > @@ -153,13 +155,18 @@
> >                               if(last_pid & 0xffff8000)
> >                                     last_pid = 300;
> >                               next_safe = PID_MAX;
> > +                             goto repeat;
> >                         }
> > -                       goto repeat;
> > +                       if(unlikely(last_pid == beginpid))
> > +                             goto nomorepids;
> > +                       continue;
>
> You changed it. No?

Yes, we changed but only the logic that once a pid is busy we start searching 
for every task again. This is exactly the O(n**2) problem.
Run the program and you'll see.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 21:59       ` Hubertus Franke
@ 2002-03-05 22:48         ` OGAWA Hirofumi
  2002-03-05 23:40           ` Hubertus Franke
  0 siblings, 1 reply; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-05 22:48 UTC (permalink / raw)
  To: frankeh; +Cc: Rajan Ravindran, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 941 bytes --]

Hubertus Franke <frankeh@watson.ibm.com> writes:

> > > @@ -153,13 +155,18 @@
> > >                               if(last_pid & 0xffff8000)
> > >                                     last_pid = 300;
> > >                               next_safe = PID_MAX;
> > > +                             goto repeat;
> > >                         }
> > > -                       goto repeat;
> > > +                       if(unlikely(last_pid == beginpid))
> > > +                             goto nomorepids;
> > > +                       continue;
> >
> > You changed it. No?
> 
> Yes, we changed but only the logic that once a pid is busy we start searching 
> for every task again. This is exactly the O(n**2) problem.
> Run the program and you'll see.

Run the attached file.

Result,
	new pid: 301, 300: pid 300, pgrp 301

new pid == task(300)->pgrp. This get_pid() has bug.
I'm missing something?
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: get_pid.c --]
[-- Type: text/x-csrc, Size: 3624 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>

#define spin_lock(x)
#define spin_unlock(x)
#define read_lock(x)
#define read_unlock(x)
#define unlikely(x)	(x)
#define SET_LINKS(p) do {			\
	(p)->next_task = &init_task;		\
	(p)->prev_task = init_task.prev_task;	\
	init_task.prev_task->next_task = (p);	\
	init_task.prev_task = (p);		\
} while (0)
#define REMOVE_LINKS(p) do {				\
	(p)->next_task->prev_task = (p)->prev_task;	\
	(p)->prev_task->next_task = (p)->next_task;	\
} while (0)
#define CLONE_PID	0x00001000	/* set if pid shared */
#define PID_MAX		0x8000

#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )

struct task_struct {
	struct task_struct *next_task, *prev_task;
	pid_t pid;
	pid_t pgrp;
	pid_t session;
	pid_t tgid;
	int did_exec;
};

struct task_struct init_task = {
	next_task: 	&init_task,
	prev_task: 	&init_task,
	pid:		0,
	pgrp:		0,
	session:	0,
	tgid:		0,
	did_exec:	0,
};
struct task_struct *current = &init_task;
static pid_t last_pid;

static int get_pid(unsigned long flags)
{
	static int next_safe = PID_MAX;
	struct task_struct *p;
	int pid, beginpid;

	if (flags & CLONE_PID)
		return current->pid;

	spin_lock(&lastpid_lock);
	beginpid = last_pid;
	if((++last_pid) & 0xffff8000) {
		last_pid = 300;		/* Skip daemons etc. */
		goto inside;
	}
	if(last_pid >= next_safe) {
inside:
		next_safe = PID_MAX;
		read_lock(&tasklist_lock);
	repeat:
		for_each_task(p) {
			if(p->pid == last_pid	||
			   p->pgrp == last_pid	||
			   p->tgid == last_pid	||
			   p->session == last_pid) {
				if(++last_pid >= next_safe) {
					if(last_pid & 0xffff8000)
						last_pid = 300;
					next_safe = PID_MAX;
					goto repeat;
				}
				if(unlikely(last_pid == beginpid))
					goto nomorepids;
				continue;
			}
			if(p->pid > last_pid && next_safe > p->pid)
				next_safe = p->pid;
			if(p->pgrp > last_pid && next_safe > p->pgrp)
				next_safe = p->pgrp;
			if(p->tgid > last_pid && next_safe > p->tgid)
				next_safe = p->tgid;
			if(p->session > last_pid && next_safe > p->session)
				next_safe = p->session;
		}
		read_unlock(&tasklist_lock);
	}
	pid = last_pid;
	spin_unlock(&lastpid_lock);

	return pid;

nomorepids:
	read_unlock(&tasklist_lock);
	spin_unlock(&lastpid_lock);
	return 0;
}

static void fatal(char *msg)
{
	fprintf(stderr, "%s: %s", msg, strerror(errno));
	exit(1);
}

static struct task_struct *find_task_by_pid(pid_t pid)
{
	struct task_struct *tsk;

	for_each_task(tsk)
		if (tsk->pid == pid)
			return tsk;
	return NULL;
}

static struct task_struct *add_task(pid_t pid, pid_t pgrp, pid_t session,
				    pid_t tgid)
{
	struct task_struct *tsk;

	tsk = malloc(sizeof(struct task_struct));
	if (tsk == NULL)
		fatal("malloc");
	tsk->pid = pid;
	tsk->pgrp = pgrp;
	tsk->session = session;
	tsk->tgid = tgid;
	SET_LINKS(tsk);

	return tsk;
}

static void del_task(struct task_struct *tsk)
{
	REMOVE_LINKS(tsk);
	free(tsk);
}

int main()
{
	int i;
	pid_t pid, pgrp;
	struct task_struct *tsk;

	for (i = 1; i < PID_MAX; i++) {
		pid = get_pid(0);
		add_task(pid, pid, pid, pid);
	}
	/* exit() */
	tsk = find_task_by_pid(301);
	del_task(tsk);
	/* fork() */
	pgrp = pid = get_pid(0);
	add_task(pid, pid, pid, pid);
	/* exit() */
	tsk = find_task_by_pid(300);
	del_task(tsk);
	/* fork() */
	pid = get_pid(0);
	add_task(pid, pgrp, pid, pid);
	/* exit() */
	tsk = find_task_by_pid(301);
	del_task(tsk);

	/* fork() */
	pid = get_pid(0);
	add_task(pid, pgrp, pid, pid);

	tsk = find_task_by_pid(300);
	printf("new pid: %d, 300: pid %d, pgrp %d\n",
	       pid, tsk->pid, tsk->pgrp);

	return 0;
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 22:48         ` OGAWA Hirofumi
@ 2002-03-05 23:40           ` Hubertus Franke
  0 siblings, 0 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-05 23:40 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: Rajan Ravindran, linux-kernel, lse-tech

On Tuesday 05 March 2002 05:48 pm, OGAWA Hirofumi wrote:
> Hubertus Franke <frankeh@watson.ibm.com> writes:
> > > > @@ -153,13 +155,18 @@
> > > >                               if(last_pid & 0xffff8000)
> > > >                                     last_pid = 300;
> > > >                               next_safe = PID_MAX;
> > > > +                             goto repeat;
> > > >                         }
> > > > -                       goto repeat;
> > > > +                       if(unlikely(last_pid == beginpid))
> > > > +                             goto nomorepids;
> > > > +                       continue;
> > >
> > > You changed it. No?
> >
> > Yes, we changed but only the logic that once a pid is busy we start
> > searching for every task again. This is exactly the O(n**2) problem.
> > Run the program and you'll see.
>
> Run the attached file.
>
> Result,
> 	new pid: 301, 300: pid 300, pgrp 301
>
> new pid == task(300)->pgrp. This get_pid() has bug.
> I'm missing something?

Yipp you are right. 
I stay corrected, we are missing something. Will work on it and see
whether we can correct it, either way we should be able to
get ride of the o(n**2) effect.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] get_pid() performance fix
  2002-03-05 16:43 Fwd: [Lse-tech] get_pid() performance fix Rajan Ravindran
  2002-03-05 17:57 ` OGAWA Hirofumi
@ 2002-03-14 23:18 ` Hubertus Franke
  2002-03-15 14:57   ` OGAWA Hirofumi
  2002-03-15 15:16   ` OGAWA Hirofumi
  1 sibling, 2 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-14 23:18 UTC (permalink / raw)
  To: Rajan Ravindran, OGAWA Hirofumi; +Cc: linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 18654 bytes --]

I implemented an alternative version of getpid, that for large thread counts
( > 210000), provides "significantly" better performance as shown in attached
performance analysis. This is particulary viable for PID_MAX=32768.

-- Hubertus Franke <frankeh@watson.ibm.com>

---------------------------------------------------------------------------------

Currently the getpid algorithm works as follows:
At any given time an interval of [ last_pid .. next_safe ) is known to hold   
  
unused pids. Initially the interval is set to [0 .. 32767]

Hence to allocate a new pid the following is sufficient:

       if (++last_pid < next_safe) return last_pid;

However, if we move out of the safe interval, the next safe interval needs 
to be established first.
This is currently done by a repetive search 

repeat:
     foralltasks(p) {
        if (p uses lastpid) { last_pid++; goto repeat; }
	/* narrow [ last_pid .. next_safe ) */
	if (p->pids in [ last_pid .. next_safe ) ) next_safe = p->pid
     }

Particulary for large number of tasks, this can lead to frequent exercise of
the repeat resulting in a O(N^2) algorithm. We call this : <algo-0>.

Instead I have provided an alternative mechanism that at the time
of determining the next interval marks a bitmask by walking the tasklist
once [ O(N) ] and then finding the proper bit offsets to mark the next free
interval starting from last_pid. The bitmap requires 4096 bytes.
This is <algo-1>.

An optimization to this to keep the last bitmap instead of clearing it
with every search. Only if we fail to obtain a free range, then we have
to go back and clear the bitmap and redo the search one more time.
This is <algo-2>.

I dragged the various algorithms into a userlevel test program to figure
out where the cut off points are with PID_MAX=32768. In this testprogram
I maintain A tasks, and for 10 rounds (delete D random tasks and 
reallocate D tasks again) resulting in T=10*D total measured allocations.

Si states how many interval searches where needed for algo-i.
Gi states the average overhead per get_pid for algo-i in usecs.

Based on that one should use the current algorithm until ~ 22K tasks and
beyond that use algo-2. Only the last 15 tasks are a bit faster under algo-1.
We can safely ignore that case.

Based on that providing an adaptive implementation seems the right choice.
The patch for 2.5.7-pre1 is attached.

testprogram is attached:  
executed program example: getpid -c 2 -s 10 -e 100 -D 10 -t <0,1> 
	0 is old 1 is new algo 2.

   A     D      T |      S0           G0 |    S1        G1 |    S2        G2 
----------------------------------------------------------------------------
   10    10    80 |       1         0.34 |     1      0.59 |    1       0.81 
   20    10   100 |       1         0.30 |     1      0.49 |    1       0.64 
   30    10   100 |       1         0.29 |     1      0.55 |    1       0.65 
   40    10   100 |       1         0.35 |     1      0.51 |    1       0.65 
   50    10   100 |       1         0.35 |     1      0.54 |    1       0.67 
   60    10   100 |       2         0.38 |    21      1.95 |    2       0.79 
   70    10   100 |       1         0.39 |     1      0.59 |    1       0.76 
   80    10   100 |       1         0.41 |     1      0.62 |    1       0.76 
  100    50   500 |       2         0.22 |    63      1.26 |    2       0.30 
  150    50   500 |       3         0.24 |    12      0.56 |    4       0.36 
  200    50   500 |       3         0.27 |    56      2.26 |    5       0.46 
  250    50   500 |       2         0.26 |   119      5.63 |    6       0.54 
  300    50   500 |       3         0.32 |   148      8.73 |    9       0.76 
  350    50   500 |       5         0.45 |   168     11.51 |    6       0.76 
  400    50   500 |       4         0.44 |    90      7.28 |   10       1.10 
  450    50   500 |       6         0.61 |   143     13.08 |    7       0.97 
  500    50   500 |       6         0.65 |   100     10.47 |    7       1.06 
  550    50   500 |       5         0.63 |    71      8.10 |    9       1.34 
  600    50   500 |       7         0.86 |   115     14.32 |   14       2.04 
  650    50   500 |       8         1.00 |   112     15.08 |   13       2.07 
  700    50   500 |       8         1.06 |   127     18.12 |   10       1.79 
  750    50   500 |       8         1.26 |    62      9.73 |   15       2.73 
  800    50   500 |      11         1.68 |    92     15.14 |   12       2.42 
  850    50   500 |      14         2.03 |    78     13.73 |   13       2.67 
  900    50   500 |      21         3.17 |   102     18.74 |   27       5.18 
 1000  1000  9980 |       1         0.18 |     4      0.19 |    1       0.18 
 2000  1000 10000 |      76         1.22 |  3604     53.03 |  322       4.81 
 3000  1000 10000 |     161         3.84 |  4502    112.24 |  606      15.49 
 4000  1000 10000 |     359        11.17 |  4912    183.37 |  901      33.76 
 5000  1000 10000 |     539        23.33 |  4949    257.35 | 1165      59.91 
 6000  1000 10000 |     724        43.42 |  4918    349.59 | 1498     104.36 
 7000  1000 10000 |    1026        85.38 |  4886    447.58 | 1835     165.08 
 8000  1000 10000 |    1228       126.45 |  4870    565.29 | 2084     234.73 
 9000  1000 10000 |    1516       193.62 |  4826    699.85 | 2489     354.27 
10000  1000 10000 |    1818       289.32 |  4910    843.32 | 2763     472.47 
11000  1000 10000 |    2093       389.33 |  5005   1023.08 | 3095     629.70 
12000  1000 10000 |    2305       506.23 |  5095   1194.71 | 3277     773.06 
13000  1000 10000 |    2680       683.66 |  5289   1424.81 | 3711    1003.67 
14000  1000 10000 |    2959       853.10 |  5358   1602.05 | 3878    1172.70 
15000  1000 10000 |    3167      1037.79 |  5539   1835.64 | 4301    1436.40 
16000  1000 10000 |    3466      1272.80 |  5669   2087.03 | 4485    1635.48 
17000  1000 10000 |    3743      1539.06 |  5932   2338.50 | 4844    1924.27 
18000  1000 10000 |    4069      1869.63 |  6097   2613.60 | 5218    2232.52 
19000  1000 10000 |    4293      2183.98 |  6242   2866.34 | 5501    2519.60 
20000  1000 10000 |    4616      2607.10 |  6508   3175.90 | 5770    2823.98 
21000  1000 10000 |    4974      3119.34 |  6700   3460.95 | 6161    3183.73 
22000  1000 10000 |    5177      3609.28 |  6944   3788.19 | 6389    3492.97 
23000  1000 10000 |    5483      4214.03 |  7183   4136.25 | 6665    3823.38 
24000  1000 10000 |    5838      4971.60 |  7404   4460.62 | 6982    4199.61 
25000  1000 10000 |    6183      5880.92 |  7736   4891.80 | 7209    4546.18 
26000  1000 10000 |    6413      6829.07 |  7890   5210.85 | 7533    4939.12 
27000  1000 10000 |    6748      8132.96 |  8148   5598.19 | 7959    5442.25 
28000  1000 10000 |    7139     10065.52 |  8445   6047.42 | 8140    5767.13 
29000  1000 10000 |    7638     12967.20 |  8736   6475.23 | 8501    6250.86 
30000  1000 10000 |    8178     16991.05 |  8994   6907.40 | 8911    6791.97 
32000    50   500 |     482     26446.69 |   488   7405.63 |  487    7494.39 
32050    50   500 |     488     34769.89 |   488   7463.11 |  486    7541.61 
32100    50   500 |     489     44564.86 |   493   7593.99 |  486    7589.02 
32150    50   500 |     486     58150.58 |   487   7549.96 |  492    7731.18 
32200    50   500 |     490     64875.38 |   495   7721.82 |  497    7854.59 
32250    50   500 |     491     81790.21 |   491   7697.57 |  490    7795.12 
32300    50   500 |     489     88975.62 |   493   7763.04 |  495    7909.77 
32350    50   500 |     489    115797.38 |   492   7782.34 |  495    7967.86 
32400    50   500 |     490    120958.50 |   497   7898.45 |  496    8018.98 
32450    50   500 |     492    147541.84 |   493   7874.27 |  492    7982.34 
32500    50   500 |     493    175498.39 |   495   7940.18 |  495    8060.97 
32550    50   500 |     492    207229.29 |   496   7973.88 |  498    8134.02 
32600    50   500 |     495    267057.05 |   498   8028.86 |  498    8171.97 
32650    50   500 |     492    375722.28 |   500   8088.30 |  498    8213.85 
32700    50   500 |     497    528321.07 |   500   8110.51 |  499    8267.67 
32751     1    10 |      10    259785.80 |    10   7851.50 |   10    8549.30 
32752     1    10 |      10   1121285.60 |    10   7846.30 |   10    8556.10 
32753     1    10 |      10    383729.50 |    10   7848.60 |   10    8562.20 
32754     1    10 |      10   1061467.50 |    10   7849.80 |   10    8564.40 
32755     1    10 |      10    612726.50 |    10   7853.00 |   10    8553.90 
32756     1    10 |      10   1725559.90 |    10   7851.90 |   10    8553.00 
32757     1    10 |      10   1679818.50 |    10   7847.80 |   10    8552.10 
32758     1    10 |      10   2991838.60 |    10   7865.70 |   10    8557.20 
32759     1    10 |      10   883388.90  |    10   7859.40 |   10    8562.00 
32760     1    10 |      10   4830405.90 |    10   7850.50 |   10    9336.60 
32761     1    10 |      10   7105809.20 |    10   7863.90 |   10    9337.20 
32762     1    10 |      10   7919703.40 |    10   7867.10 |   10    9340.70 
32763     1    10 |      10   1537522.50 |    10   7869.40 |   10    9340.70 
32764     1    10 |      10   6173019.20 |    10   7866.60 |   10    9340.00 
32765     1    10 |      10   8104105.00 |    10   7876.20 |   10   10112.80 
32766     1    10 |      10  16145415.40 |    10   7880.80 |   10   10893.50 
32767     1    10 |      10  16135267.10 |    10   7878.60 |   10   11674.40 


Other variants are possible, for instance if 4096 bytes is too much
(hell I don't know how that could be), one can break it up into smaller
search chunks (e.g. 256 bytes).

Another alternative is to allocate the page on the first occasion of
getting into get_pid_my_map....

In the following I give a comparative result between algo-2 and
algo-2 with a max interval size of 256. The times are very comparative.
Also the search count values are identical, but in 2 cases suggesting
that a interval size particular for large thread counts of 256 is certainly 
sufficient, but it brings some small overhead. Question to answer is 
wether setting aside an extra page is such a crime..... 

   A     D      T |    S2        G2 |  S2-256     G2-256
-------------------------------------------------------
   10    10    80 |    1       0.81 |      1       0.84
   20    10   100 |    1       0.64 |      1       0.67
   30    10   100 |    1       0.65 |      1       0.68
   40    10   100 |    1       0.65 |      1       0.69
   50    10   100 |    1       0.67 |      1       0.71
   60    10   100 |    2       0.79 |      2       0.82
   70    10   100 |    1       0.76 |      1       0.76
   80    10   100 |    1       0.76 |      1       0.79
  100    50   500 |    2       0.30 |      2       0.31
  150    50   500 |    4       0.36 |      5       0.39  <=
  200    50   500 |    5       0.46 |      5       0.46
  250    50   500 |    6       0.54 |      6       0.55
  300    50   500 |    9       0.76 |      9       0.76
  350    50   500 |    6       0.76 |      6       0.75
  400    50   500 |   10       1.10 |     10       1.10
  450    50   500 |    7       0.97 |      7       0.97
  500    50   500 |    7       1.06 |      7       1.06
  550    50   500 |    9       1.34 |      9       1.35
  600    50   500 |   14       2.04 |     14       2.06
  650    50   500 |   13       2.07 |     13       2.09
  700    50   500 |   10       1.79 |     10       1.82
  750    50   500 |   15       2.73 |     15       2.69
  800    50   500 |   12       2.42 |     12       2.38
  850    50   500 |   13       2.67 |     13       2.66
  900    50   500 |   27       5.18 |     27       5.25
 1000  1000  9980 |    1       0.18 |      3       0.19 <=
 2000  1000 10000 |  322       4.81 |    322       4.84
 3000  1000 10000 |  606      15.49 |    606      15.55
 4000  1000 10000 |  901      33.76 |    901      34.42
 5000  1000 10000 | 1165      59.91 |   1165      62.35
 6000  1000 10000 | 1498     104.36 |   1498     105.55
 7000  1000 10000 | 1835     165.08 |   1835     174.82
 8000  1000 10000 | 2084     234.73 |   2084     244.18
 9000  1000 10000 | 2489     354.27 |   2489     372.11
10000  1000 10000 | 2763     472.47 |   2763     486.73
11000  1000 10000 | 3095     629.70 |   3095     648.31
12000  1000 10000 | 3277     773.06 |   3277     784.75
13000  1000 10000 | 3711    1003.67 |   3711    1006.94
14000  1000 10000 | 3878    1172.70 |   3878    1168.71
15000  1000 10000 | 4301    1436.40 |   4301    1429.89
16000  1000 10000 | 4485    1635.48 |   4485    1620.90
17000  1000 10000 | 4844    1924.27 |   4844    1904.92
18000  1000 10000 | 5218    2232.52 |   5218    2218.80
19000  1000 10000 | 5501    2519.60 |   5501    2508.83
20000  1000 10000 | 5770    2823.98 |   5770    2895.66
21000  1000 10000 | 6161    3183.73 |   6161    3307.54
22000  1000 10000 | 6389    3492.97 |   6389    3620.53
23000  1000 10000 | 6665    3823.38 |   6665    3995.63
24000  1000 10000 | 6982    4199.61 |   6982    4347.95
25000  1000 10000 | 7209    4546.18 |   7209    4701.95
26000  1000 10000 | 7533    4939.12 |   7533    5088.13
27000  1000 10000 | 7959    5442.25 |   7959    5599.85
28000  1000 10000 | 8140    5767.13 |   8140    5817.86
29000  1000 10000 | 8501    6250.86 |   8501    6250.30
30000  1000 10000 | 8911    6791.97 |   8911    6788.51
32000    50   500 |  487    7494.39 |    487    7493.47
32050    50   500 |  486    7541.61 |    486    7541.05
32100    50   500 |  486    7589.02 |    486    7586.12
32150    50   500 |  492    7731.18 |    492    7728.76
32200    50   500 |  497    7854.59 |    497    7854.94
32250    50   500 |  490    7795.12 |    490    7783.10
32300    50   500 |  495    7909.77 |    495    7902.70
32350    50   500 |  495    7967.86 |    495    7946.20
32400    50   500 |  496    8018.98 |    496    7999.34
32450    50   500 |  492    7982.34 |    492    7962.93
32500    50   500 |  495    8060.97 |    495    8048.18
32550    50   500 |  498    8134.02 |    498    8122.08
32600    50   500 |  498    8171.97 |    498    8169.34
32650    50   500 |  498    8213.85 |    498    8209.95
32700    50   500 |  499    8267.67 |    499    8266.13
32751     1    10 |   10    8549.30 |     10    8629.00
32752     1    10 |   10    8556.10 |     10    8636.30
32753     1    10 |   10    8562.20 |     10    8632.00
32754     1    10 |   10    8564.40 |     10    8633.40
32755     1    10 |   10    8553.90 |     10    8635.40
32756     1    10 |   10    8553.00 |     10    8637.60
32757     1    10 |   10    8552.10 |     10    8640.90
32758     1    10 |   10    8557.20 |     10    8644.90
32759     1    10 |   10    8562.00 |     10    8644.10
32760     1    10 |   10    9336.60 |     10    9436.10
32761     1    10 |   10    9337.20 |     10    9435.60
32762     1    10 |   10    9340.70 |     10    9439.10
32763     1    10 |   10    9340.70 |     10    9433.60
32764     1    10 |   10    9340.00 |     10    9440.60
32765     1    10 |   10   10112.80 |     10   10228.40
32766     1    10 |   10   10893.50 |     10   11023.50
32767     1    10 |   10   11674.40 |     10   11813.70


-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)


diff -urbN linux-2.5.7-pre1/kernel/fork.c linux-2.5.7-pre1pid/kernel/fork.c
--- linux-2.5.7-pre1/kernel/fork.c	Thu Mar  7 21:18:12 2002
+++ linux-2.5.7-pre1pid/kernel/fork.c	Thu Mar 14 16:03:23 2002
@@ -125,24 +125,101 @@
 /* Protects next_safe and last_pid. */
 spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
 
+/* this should be provided in every architecture */
+#ifndef SHIFT_PER_LONG
+#if BITS_PER_LONG == 64
+#   define SHIFT_PER_LONG 6
+#elif BITS_PER_LONG == 32
+#   define SHIFT_PER_LONG 5
+#else
+#error "SHIFT_PER_LONG"
+#endif
+#endif
+
+#define RESERVED_PIDS       (300)
+#define GETPID_THRESHOLD    (21000)  /* when to switch to a mapped algo */
+#define PID_MAP_SIZE        (PID_MAX >> SHIFT_PER_LONG)
+static unsigned long pid_map[PID_MAP_SIZE];
+static int next_safe = PID_MAX;
+
+static inline void mark_pid(int pid)
+{
+	pid_map[(pid) >> SHIFT_PER_LONG] |= (1 << ((pid) & (BITS_PER_LONG-1)));
+}
+
+static int get_pid_by_map(int last_pid) 
+{
+        struct task_struct *p;
+        int i;
+	int again = 1;
+	unsigned long mask;
+	int fpos;
+
+repeat:
+	for_each_task(p) {
+		mark_pid(p->pid);
+		mark_pid(p->pgrp);
+		mark_pid(p->tgid);
+		mark_pid(p->session);
+	}
+
+	/* find next free pid */
+	i = (last_pid >> SHIFT_PER_LONG);
+	mask = pid_map[i] | (((last_pid & (BITS_PER_LONG-1)) << 1) - 1);
+	
+	while ((mask == ~0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+	
+	if (i == PID_MAP_SIZE) { 
+		if (again) {
+			/* we didn't find any pid , sweep and try again */
+			again = 0;
+			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
+			last_pid = RESERVED_PIDS;
+			goto repeat;
+		}
+		next_safe = RESERVED_PIDS;
+		return 0; 
+	}
+
+	fpos = ffz(mask);
+	i &= (PID_MAX-1);
+	last_pid = (i << SHIFT_PER_LONG) + fpos;
+
+	/* find next save pid */
+	mask &= ~((1 << fpos) - 1);
+
+	while ((mask == 0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+
+	if (i==PID_MAP_SIZE) 
+		next_safe = PID_MAX;
+	else 
+		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
+	return last_pid;
+}
+
 static int get_pid(unsigned long flags)
 {
-	static int next_safe = PID_MAX;
 	struct task_struct *p;
-	int pid;
+	int pid,beginpid;
 
 	if (flags & CLONE_PID)
 		return current->pid;
 
 	spin_lock(&lastpid_lock);
+	beginpid = last_pid;
 	if((++last_pid) & 0xffff8000) {
-		last_pid = 300;		/* Skip daemons etc. */
+		last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
 		goto inside;
 	}
 	if(last_pid >= next_safe) {
 inside:
 		next_safe = PID_MAX;
 		read_lock(&tasklist_lock);
+		if (nr_threads > GETPID_THRESHOLD) {
+			last_pid = get_pid_by_map(last_pid);
+		} else {
 	repeat:
 		for_each_task(p) {
 			if(p->pid == last_pid	||
@@ -151,9 +228,11 @@
 			   p->session == last_pid) {
 				if(++last_pid >= next_safe) {
 					if(last_pid & 0xffff8000)
-						last_pid = 300;
+							last_pid = RESERVED_PIDS;
 					next_safe = PID_MAX;
 				}
+					if(unlikely(last_pid == beginpid))
+						goto nomorepids;
 				goto repeat;
 			}
 			if(p->pid > last_pid && next_safe > p->pid)
@@ -162,6 +241,9 @@
 				next_safe = p->pgrp;
 			if(p->session > last_pid && next_safe > p->session)
 				next_safe = p->session;
+				if(p->tgid > last_pid && next_safe > p->tgid)
+					next_safe = p->tgid;
+			}
 		}
 		read_unlock(&tasklist_lock);
 	}
@@ -169,6 +251,10 @@
 	spin_unlock(&lastpid_lock);
 
 	return pid;
+nomorepids:
+	read_unlock(&tasklist_lock);
+	spin_unlock(&lastpid_lock);
+	return 0;
 }
 
 static inline int dup_mmap(struct mm_struct * mm)

[-- Attachment #2: Test Program --]
[-- Type: text/x-c, Size: 12690 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>   
#include <sys/sysinfo.h>
#include <asm/bitops.h>

int SEARCH;

#define spin_lock(x)
#define spin_unlock(x)
#define read_lock(x)
#define read_unlock(x)
#define unlikely(x)	(x)
#define SET_LINKS(p) do {			\
	(p)->next_task = &init_task;		\
	(p)->prev_task = init_task.prev_task;	\
	init_task.prev_task->next_task = (p);	\
	init_task.prev_task = (p);		\
} while (0)
#define REMOVE_LINKS(p) do {				\
	(p)->next_task->prev_task = (p)->prev_task;	\
	(p)->prev_task->next_task = (p)->next_task;	\
} while (0)
#define CLONE_PID	0x00001000	/* set if pid shared */
#define PID_MAX		0x8000
#define THRES_VALUE     threshold
#define MAX_TASKS       (PID_MAX-1)
#define RESERVED_PIDS   reserved

#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )

struct task_struct {
	struct task_struct *next_task, *prev_task;
	pid_t pid;
	pid_t pgrp;
	pid_t session;
	pid_t tgid;
	int did_exec;
};

struct task_struct init_task = {
	next_task: 	&init_task,
	prev_task: 	&init_task,
	pid:		0,
	pgrp:		0,
	session:	0,
	tgid:		0,
	did_exec:	0,
};
struct task_struct *current = &init_task;
static pid_t last_pid = 0;
static int next_safe = PID_MAX;
int nr_threads;
int threshold = 0;
int reserved = 1; /* need to keep 0 */
int verbose;

#define BITS_PER_LONG  32  /* no in the kernel !! */

#ifndef SHIFT_PER_LONG
#if BITS_PER_LONG == 64
#   define SHIFT_PER_LONG 6
#elif BITS_PER_LONG == 32
#   define SHIFT_PER_LONG 5
#else
#error "SHIFT_PER_LONG"
#endif
#endif

#define PID_MAP_SIZE (PID_MAX >> SHIFT_PER_LONG)
static unsigned long pid_map[PID_MAP_SIZE];

static inline void mark_pid(int pid)
{
	pid_map[(pid) >> SHIFT_PER_LONG] |= (1 << ((pid) & (BITS_PER_LONG-1)));
}

static int get_pid_by_map(int last_pid) 
{
        struct task_struct *p;
        int i, end;
	int again = 1;
	unsigned long mask, fpos;

#if 0
	/* testing version <algo-1> ... won't pay off */
	memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
	again = 0;
	last_pid = RESERVED_PIDS;
#endif

repeat:
	for_each_task(p) {
		mark_pid(p->pid);
		mark_pid(p->pgrp);
		mark_pid(p->tgid);
		mark_pid(p->session);
	}

	/* find next free pid */
	i = last_pid >> SHIFT_PER_LONG;
	mask = pid_map[i] | (((last_pid & (BITS_PER_LONG-1)) << 1) - 1);
	
	while ((mask == ~0) && (++i < PID_MAP_SIZE)) 
		mask = pid_map[i];
	
	if (i == PID_MAP_SIZE) { 
		if (again) {
			/* we didn't find any pid , sweep and try again */
			again = 0;
			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
			last_pid = RESERVED_PIDS;
			goto repeat;
		}
		next_safe = RESERVED_PIDS;
		return 0; 
	}

	fpos = ffz(mask);
	i &= (PID_MAX-1);
	last_pid = (i << SHIFT_PER_LONG) + fpos;

	/* find next save pid */
	mask &= ~((1 << fpos) - 1);

#if 0
	/* testing to see whether interval max size brings 
	 * benefits
	 */
	end = last_pid + 8;
	if (end > PID_MAP_SIZE) 
#endif
		end = PID_MAP_SIZE;
	while ((mask == 0) && (++i < end)) 
		mask = pid_map[i];

	if (i==PID_MAP_SIZE) 
		next_safe = PID_MAX;
	else 
		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
	return last_pid;
}

static int get_pid(unsigned long flags)
{
	struct task_struct *p;
	int pid, beginpid;

	if (flags & CLONE_PID)
		return current->pid;

	spin_lock(&lastpid_lock);
	beginpid = last_pid;
	if((++last_pid) & 0xffff8000) {
		beginpid = last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
		goto inside;
	}
	if(last_pid >= next_safe) {
inside:
		next_safe = PID_MAX;
		read_lock(&tasklist_lock);
		SEARCH++;
		if (nr_threads > THRES_VALUE) {
			last_pid = get_pid_by_map(last_pid);
		} else {
			repeat:
				for_each_task(p) {
					if(p->pid == last_pid	||
					   p->pgrp == last_pid	||
					   p->tgid == last_pid	||
					   p->session == last_pid) {
						if(++last_pid >= next_safe) {
							if(last_pid & 0xffff8000)
								last_pid = RESERVED_PIDS;
							next_safe = PID_MAX;
						}
						if(unlikely(last_pid == beginpid))
							goto nomorepids;

						goto repeat;
					}
					if(p->pid > last_pid && next_safe > p->pid)
						next_safe = p->pid;
					if(p->pgrp > last_pid && next_safe > p->pgrp)
						next_safe = p->pgrp;
					if(p->tgid > last_pid && next_safe > p->tgid)
						next_safe = p->tgid;
					if(p->session > last_pid && next_safe > p->session)
						next_safe = p->session;
				}
		}	
		read_unlock(&tasklist_lock);
	}
	pid = last_pid;
	spin_unlock(&lastpid_lock);
	return pid;

nomorepids:
	read_unlock(&tasklist_lock);
	spin_unlock(&lastpid_lock);
	return 0;
}


static void fatal(char *msg,...)
{
	char buf[256];
	va_list ap;

	va_start(ap,msg);
	vsprintf(buf,msg,ap);
	va_end(ap);
	fprintf(stderr,"%s: %s\n", buf, strerror(errno));
	exit(1);
}


static struct task_struct  tasks[PID_MAX];
static struct task_struct *tasks_by_pid[PID_MAX];
static struct task_struct *tasks_free;

static struct task_struct *find_task_by_pid(pid_t pid)
{
	if (pid & ~(PID_MAX-1)) {
		fatal("find_task_by_pid");
		return NULL;
	}
	return tasks_by_pid[pid];
}

static struct task_struct *add_task(pid_t pid, pid_t pgrp, pid_t session,
				    pid_t tgid)
{
	static int initialized = 0;
	struct task_struct *tsk;

	if (!initialized) {
		int i;
		for ( i=0 ; i<PID_MAX ; i++) {
			tasks[i].next_task = tasks_free;
			tasks_free = &tasks[i];
		}
		initialized = 1;
	}
		
	tsk = tasks_free;
	
	if (tsk == NULL)
		fatal("add_task:  pid=%d:%d",pid,nr_threads);

	tasks_free = tasks_free->next_task;
	nr_threads++;
	tasks_by_pid[pid] = tsk;

	tsk->pid = pid;
	tsk->pgrp = pgrp;
	tsk->session = session;
	tsk->tgid = tgid;
	SET_LINKS(tsk);

	return tsk;
}

static void del_task(struct task_struct *tsk)
{
	nr_threads--;
	tasks_by_pid[tsk->pid] = NULL;
	REMOVE_LINKS(tsk);
	tsk->next_task = tasks_free;
	tasks_free = tsk;
}


void populate_all(int count, int fast)
{
	int i,pid;
	int oldthreshold = 0;
	int mode;

	if (count == 0) {
		mode = 0;
		count = MAX_TASKS;
	} else {
		mode = 1;
	}
      
	if (fast) { 
		/* use the new map version */
		oldthreshold = threshold; 
		threshold = 0; 
	}

	/* first allocate all PIDs available */
	for (i = 0; i < count; i++) {
		pid = get_pid(0);
		if (pid == 0) {
			if (mode) {
				printf("failed %d\n",i);
				continue;
			} else 
				break;
		}
		add_task(pid, pid, pid, pid);
		if ((verbose >= 3) && ((i % 1000) == 0)) 
			printf("add task %d  %d %d\n",i,pid, nr_threads);
	}
	if (fast) { 
		/* return to global setting */
		threshold = oldthreshold; 
	}
}

int delete_range(int spid, int epid)
{
	int i;
	struct task_struct *tsk;
	
	for(i=spid; i<=epid; i++) {
		if ((tsk = find_task_by_pid(i))) 
			del_task(tsk);
	}
	return epid-spid+1;
}

int delete_random(int cnt)
{
	int delcnt = 0;
	int coll=0;
	while ((delcnt < cnt) && (nr_threads > RESERVED_PIDS)) {
		struct task_struct *tsk;
		int rnums = random() & (PID_MAX-1);
		if ((rnums < RESERVED_PIDS) || ((tsk = find_task_by_pid(rnums)) == NULL))
		{
			coll++;
			continue;
		}
		del_task(tsk); 
		delcnt++; 
		if ((verbose >= 3) && ((delcnt % 1000) == 0))
			printf("\t del %d\n",delcnt);
	}
	return delcnt;
}

static inline void difftime(struct timeval *dtime, struct timeval *stime, struct timeval *etime)
{
	dtime->tv_sec = etime->tv_sec - stime->tv_sec;
	if (etime->tv_usec < stime->tv_usec) {
		dtime->tv_usec = 1000000 - stime->tv_usec + etime->tv_usec;
		dtime->tv_sec--;
	} else
		dtime->tv_usec = etime->tv_usec - stime->tv_usec;
}


static inline void addtime(struct timeval *rtime, struct timeval *dtime)
{
	rtime->tv_sec += dtime->tv_sec;
	if ((rtime->tv_usec += dtime->tv_usec) >= 1000000) {
		rtime->tv_usec -= 1000000;
		rtime->tv_sec++;
	}
}

int runtest1(int rnums[], int cnt, struct timeval *response)
{
	int i,delcnt;
	struct timeval stime, etime, ttime;
	struct task_struct *tsk;

	/* first delete */
	delcnt = 0;
	for ( i=0 ; i<cnt ; i++ ) {
		if (rnums[i] < RESERVED_PIDS) continue;
		tsk = find_task_by_pid(rnums[i]);
		if (tsk) { del_task(tsk); delcnt++; }
	}

	gettimeofday(&stime, (struct timezone *) 0); 
	for ( i=0 ; i<delcnt ; i++ ) {
		int pid = get_pid(0);
		if (pid == 0) {
			printf("failed %d\n",i);
			continue;
		}
		add_task(pid, pid, pid, pid);
	}
	gettimeofday(&etime, (struct timezone *) 0); 
	difftime(&ttime,&etime,&stime);
	addtime(response,&ttime);
	return delcnt;
}

void runtest2(int start, int end, int incr, int dels)
{
	int i,j;
	struct timeval ttime;
	int cnt;
	int aveth;

	/* first create some randomness:
	 *   a few loops delete random and refill some 
	 */	
#if 1
	populate_all(0,1);
	for (i=0;i<2;i++) {
		int n;
		if (verbose >= 2) printf("P%d\n",i);
		n = delete_random(PID_MAX/4);
		populate_all(n-1000,1);
	}
#endif	
	
	/* now delete until we get to the aveth */
	for (aveth = start ; aveth < end ; aveth += incr) {
		struct timeval response = {0,0};
		int tsearch = 0;
		int talloc  = 0;
		double avtime;

		populate_all(0,1); /* fill all remaining */
		cnt = delete_random(PID_MAX-aveth);

		for ( j=0 ; j< 10 ; j++) {
			struct timeval stime, etime;
			int delcnt = delete_random(dels);
			
			SEARCH = 0;
			gettimeofday(&stime, (struct timezone *) 0); 
			for ( i=0 ; i<delcnt ; i++ ) {
				int pid = get_pid(0);
				if (pid == 0) {
					printf("failed %d\n",i);
					continue;
				}
				add_task(pid, pid, pid, pid);
				talloc++;
			}
			gettimeofday(&etime, (struct timezone *) 0); 
			difftime(&ttime,&stime,&etime);
			addtime(&response,&ttime);
			tsearch += SEARCH;
			if (verbose >= 1) 
				printf("next_phase %d %5d %5d %5d  %ld:%06ld\n",
				       j,SEARCH,delcnt, nr_threads,
				       ttime.tv_sec, ttime.tv_usec);
		}
		avtime = ((1.0E+6*response.tv_sec) + 
			   (double)response.tv_usec) / (double)talloc;
			       
		printf("%5d %5d %5d %5d %d %ld:%06ld %10.2lf\n",
		       aveth,dels,talloc,tsearch,(threshold>0),
		       response.tv_sec,response.tv_usec, avtime
		       );
	}
}

int main(int argc, char *argv[])
{
	int i,cnt,c;
	pid_t pid=0;
	pid_t pgrp=0;
	struct task_struct *tsk = NULL;
	int si,sv;
	int testcase = 2;
	struct timeval response,stime,etime,ttime;
	int startth, endth, incth, dels;

	startth = endth = incth = dels = 1000;
	
        while ((c = getopt(argc, argv, "s:e:i:d:t:c:v:r:")) != -1) {
		switch (c) {
		case 's': startth   = atoi(optarg); break;
		case 'e': endth     = atoi(optarg); break;
		case 'i': incth     = atoi(optarg); break;
		case 'd': dels      = atoi(optarg); break;
		case 't': threshold = atoi(optarg); break;
		case 'c': testcase  = atoi(optarg); break;
		case 'v': verbose   = atoi(optarg); break;
		case 'r': reserved  = atoi(optarg); break;
		}
	}
	if (threshold) threshold = (1<<20);
	if ((reserved < 1) || (reserved > 1000)) 
		fatal("bad reserved %d\n",reserved);

	response.tv_sec = response.tv_usec = 0;
	switch (testcase) {
	case 0:
		populate_all(MAX_TASKS,1);
		cnt = 0;
		cnt += delete_range(300,337);
		cnt += delete_range(1037,1097);
		cnt += delete_range(32723,32767);
		
		/* fork() */
		printf("allocate more %d\n",cnt);
		si = -1; sv=-1;
		for (i=0; i<cnt; i++) {
			pid = get_pid(0);
			if (pid == 0) {
				printf("failed %d\n",i);
				continue;
			}
			if (si == -1) {
				si = pid;
				sv = pid;
			} else {
				if (pid != sv+1) {
					printf("%d - %d\n",si,sv);
					si = pid;
				}
				sv = pid;
			}
			add_task(pid, pid, pid, pid);
		}
		printf("%d - %d\n",si,sv);
		break;

	case 1:
		threshold = 0;
		while (1) {
			const int NUM=100;
			int rnums[NUM];
			int num;
			for (i=0;i<NUM;i++)
				rnums[i] = random() & (PID_MAX-1);
			
			num = runtest1(rnums,NUM,&response);
			printf("%5d %ld:%06ld\n",num,
			       response.tv_sec,response.tv_usec);
		}
		break;

	case 2:
		SEARCH=0;
		runtest2(startth,endth,incth,dels);
		break;

	case 3:
		gettimeofday(&stime, (struct timezone *) 0); 
		pid = get_pid(0);
		add_task(pid, pid, pid, pid);
		printf("new pid: %d, 300: pid %d, pgrp %d\n",
		       pid, tsk->pid, tsk->pgrp);
		/* exit() */
		tsk = find_task_by_pid(300);
		del_task(tsk);
		/* fork() */
		pid = get_pid(0);
		add_task(pid, pgrp, pid, pid);
		printf("new pid: %d, 300: pid %d, pgrp %d\n",
		       pid, tsk->pid, tsk->pgrp);
		/* exit() */
		tsk = find_task_by_pid(301);
		del_task(tsk);
		
		/* fork() */
		pid = get_pid(0);
		add_task(pid, pgrp, pid, pid);
		printf("new pid: %d, 300: pid %d, pgrp %d\n",
		       pid, tsk->pid, tsk->pgrp);
		
		tsk = find_task_by_pid(300);
		gettimeofday(&etime, (struct timezone *) 0); 
		difftime(&ttime,&stime,&etime);
		addtime(&response,&ttime);
		printf("new pid: %d, 300: pid %d, pgrp %d, time:%ld:%ld sec\n",
		       pid, tsk->pid, tsk->pgrp, response.tv_sec,response.tv_usec);
		break;
		
	default: 
		printf("Unknown testcase %d\n",testcase);
		break;
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-14 23:18 ` [PATCH] " Hubertus Franke
@ 2002-03-15 14:57   ` OGAWA Hirofumi
  2002-03-15 15:16   ` OGAWA Hirofumi
  1 sibling, 0 replies; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-15 14:57 UTC (permalink / raw)
  To: frankeh; +Cc: Rajan Ravindran, linux-kernel, lse-tech

Hubertus Franke <frankeh@watson.ibm.com> writes:

> +	if (i == PID_MAP_SIZE) { 
> +		if (again) {
> +			/* we didn't find any pid , sweep and try again */
> +			again = 0;
> +			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
> +			last_pid = RESERVED_PIDS;
> +			goto repeat;
> +		}
> +		next_safe = RESERVED_PIDS;
> +		return 0;

I think the bug is here.

Probaly, the following is test case: ./getpid1 -r300 -c3

	case 3:
		populate_all(0, 1);
		pid = get_pid(0);
		printf("new pid: %d\n", pid);
		tsk = find_task_by_pid(400);
		del_task(tsk);
		pid = get_pid(0);
		printf("new pid: %d\n", pid);

		break;

result,
	new pid: 0
	new pid: 1

> +	}
> +
> +	fpos = ffz(mask);
> +	i &= (PID_MAX-1);
> +	last_pid = (i << SHIFT_PER_LONG) + fpos;
> +
> +	/* find next save pid */
> +	mask &= ~((1 << fpos) - 1);
> +
> +	while ((mask == 0) && (++i < PID_MAP_SIZE)) 
> +		mask = pid_map[i];
> +
> +	if (i==PID_MAP_SIZE) 
> +		next_safe = PID_MAX;
> +	else 
> +		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
> +	return last_pid;
> +}
> +
>  static int get_pid(unsigned long flags)
>  {
> -	static int next_safe = PID_MAX;
>  	struct task_struct *p;
> -	int pid;
> +	int pid,beginpid;
>  
>  	if (flags & CLONE_PID)
>  		return current->pid;
>  
>  	spin_lock(&lastpid_lock);
> +	beginpid = last_pid;
>  	if((++last_pid) & 0xffff8000) {
> -		last_pid = 300;		/* Skip daemons etc. */
> +		last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
>  		goto inside;
>  	}
>  	if(last_pid >= next_safe) {
>  inside:
>  		next_safe = PID_MAX;
>  		read_lock(&tasklist_lock);
> +		if (nr_threads > GETPID_THRESHOLD) {
> +			last_pid = get_pid_by_map(last_pid);
> +		} else {
>  	repeat:
>  		for_each_task(p) {
>  			if(p->pid == last_pid	||
> @@ -151,9 +228,11 @@
>  			   p->session == last_pid) {
>  				if(++last_pid >= next_safe) {
>  					if(last_pid & 0xffff8000)
> -						last_pid = 300;
> +							last_pid = RESERVED_PIDS;
>  					next_safe = PID_MAX;
>  				}
> +					if(unlikely(last_pid == beginpid))
> +						goto nomorepids;
>  				goto repeat;
>  			}
>  			if(p->pid > last_pid && next_safe > p->pid)
> @@ -162,6 +241,9 @@
>  				next_safe = p->pgrp;
>  			if(p->session > last_pid && next_safe > p->session)
>  				next_safe = p->session;
> +				if(p->tgid > last_pid && next_safe > p->tgid)
> +					next_safe = p->tgid;
> +			}
>  		}
>  		read_unlock(&tasklist_lock);
>  	}
> @@ -169,6 +251,10 @@
>  	spin_unlock(&lastpid_lock);
>  
>  	return pid;
> +nomorepids:

Probably, I think the following line are required, here.

  +	next_safe = RESERVED_PIDS;	or 0

> +	read_unlock(&tasklist_lock);
> +	spin_unlock(&lastpid_lock);
> +	return 0;
>  }
>  
>  static inline int dup_mmap(struct mm_struct * mm)
> 

-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-14 23:18 ` [PATCH] " Hubertus Franke
  2002-03-15 14:57   ` OGAWA Hirofumi
@ 2002-03-15 15:16   ` OGAWA Hirofumi
  2002-03-15 18:37     ` Hubertus Franke
  1 sibling, 1 reply; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-15 15:16 UTC (permalink / raw)
  To: frankeh; +Cc: Rajan Ravindran, linux-kernel, lse-tech

Whoops! I'm sorry. previous email was the middle of writing.

Hubertus Franke <frankeh@watson.ibm.com> writes:

> +	if (i == PID_MAP_SIZE) { 
> +		if (again) {
> +			/* we didn't find any pid , sweep and try again */
> +			again = 0;
> +			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
> +			last_pid = RESERVED_PIDS;
> +			goto repeat;
> +		}
> +		next_safe = RESERVED_PIDS;
> +		return 0; 

Probably, the bug is here.

the following is test case: ./getpid1 -r300 -c3

	case 3:
		populate_all(0, 1);
		pid = get_pid(0);
		printf("new pid: %d\n", pid);
		tsk = find_task_by_pid(400);
		del_task(tsk);
		pid = get_pid(0);
		printf("new pid: %d\n", pid);

		break;

result,
	new pid: 0
	new pid: 1

> +	}
> +
> +	fpos = ffz(mask);
> +	i &= (PID_MAX-1);
> +	last_pid = (i << SHIFT_PER_LONG) + fpos;
> +
> +	/* find next save pid */
> +	mask &= ~((1 << fpos) - 1);
> +
> +	while ((mask == 0) && (++i < PID_MAP_SIZE)) 
> +		mask = pid_map[i];
> +
> +	if (i==PID_MAP_SIZE) 
> +		next_safe = PID_MAX;
> +	else 
> +		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
> +	return last_pid;
> +}
> +
>  static int get_pid(unsigned long flags)
>  {
> -	static int next_safe = PID_MAX;
>  	struct task_struct *p;
> -	int pid;
> +	int pid,beginpid;
>  
>  	if (flags & CLONE_PID)
>  		return current->pid;
>  
>  	spin_lock(&lastpid_lock);
> +	beginpid = last_pid;
>  	if((++last_pid) & 0xffff8000) {
> -		last_pid = 300;		/* Skip daemons etc. */
> +		last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
>  		goto inside;
>  	}
>  	if(last_pid >= next_safe) {
>  inside:
>  		next_safe = PID_MAX;
>  		read_lock(&tasklist_lock);
> +		if (nr_threads > GETPID_THRESHOLD) {
> +			last_pid = get_pid_by_map(last_pid);
> +		} else {
>  	repeat:
>  		for_each_task(p) {
>  			if(p->pid == last_pid	||
> @@ -151,9 +228,11 @@
>  			   p->session == last_pid) {
>  				if(++last_pid >= next_safe) {
>  					if(last_pid & 0xffff8000)
> -						last_pid = 300;
> +							last_pid = RESERVED_PIDS;
>  					next_safe = PID_MAX;
>  				}
> +					if(unlikely(last_pid == beginpid))
> +						goto nomorepids;
>  				goto repeat;
>  			}
>  			if(p->pid > last_pid && next_safe > p->pid)
> @@ -162,6 +241,9 @@
>  				next_safe = p->pgrp;
>  			if(p->session > last_pid && next_safe > p->session)
>  				next_safe = p->session;
> +				if(p->tgid > last_pid && next_safe > p->tgid)
> +					next_safe = p->tgid;
> +			}
>  		}
>  		read_unlock(&tasklist_lock);
>  	}
> @@ -169,6 +251,10 @@
>  	spin_unlock(&lastpid_lock);
>  
>  	return pid;
> +nomorepids:

Probably, the following line are required, here.

  +	next_safe = RESERVED_PIDS;	/* or 0 */

> +	read_unlock(&tasklist_lock);
> +	spin_unlock(&lastpid_lock);
> +	return 0;
>  }


Basically nice, I think.

BTW, How about using the __set_bit(), find_next_zero_bit(), and
find_next_bit() in get_pid_by_map().

Thanks for nice work.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-15 15:16   ` OGAWA Hirofumi
@ 2002-03-15 18:37     ` Hubertus Franke
  2002-03-16  5:12       ` OGAWA Hirofumi
  0 siblings, 1 reply; 28+ messages in thread
From: Hubertus Franke @ 2002-03-15 18:37 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: Rajan Ravindran, linux-kernel, lse-tech

On Friday 15 March 2002 10:16 am, OGAWA Hirofumi wrote:
> Whoops! I'm sorry. previous email was the middle of writing.
>
> Hubertus Franke <frankeh@watson.ibm.com> writes:
> > +	if (i == PID_MAP_SIZE) {
> > +		if (again) {
> > +			/* we didn't find any pid , sweep and try again */
> > +			again = 0;
> > +			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
> > +			last_pid = RESERVED_PIDS;
> > +			goto repeat;
> > +		}
> > +		next_safe = RESERVED_PIDS;
> > +		return 0;
>
> Probably, the bug is here. No bug ....

>
>   +	next_safe = RESERVED_PIDS;	/* or 0 */
>
> > +	read_unlock(&tasklist_lock);
> > +	spin_unlock(&lastpid_lock);
> > +	return 0;
> >  }
>
> Basically nice, I think.
>
> BTW, How about using the __set_bit(), find_next_zero_bit(), and
> find_next_bit() in get_pid_by_map().
>
> Thanks for nice work.

OGAWA, honestly I only tried testcase 2.
But looking at your suggestion its not clear to me whether
there is a bug.
Remember we need to determine a valid interval [ last_pid .. next_safe ).
In the pid_map function, if no pid is available, then
[ PID_MAX .. PID_MAX ) will be returned.
The other path should also end up with this as well.
Could you  point where you see this not happening.

In the next release, I'll look at using your bitmap function suggestion.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-15 18:37     ` Hubertus Franke
@ 2002-03-16  5:12       ` OGAWA Hirofumi
  2002-03-18 21:44         ` Hubertus Franke
  2002-03-22 22:14         ` Hubertus Franke
  0 siblings, 2 replies; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-16  5:12 UTC (permalink / raw)
  To: frankeh; +Cc: Rajan Ravindran, linux-kernel, lse-tech

Hubertus Franke <frankeh@watson.ibm.com> writes:

> On Friday 15 March 2002 10:16 am, OGAWA Hirofumi wrote:
> > Whoops! I'm sorry. previous email was the middle of writing.
> >
> > Hubertus Franke <frankeh@watson.ibm.com> writes:
> > > +	if (i == PID_MAP_SIZE) {
> > > +		if (again) {
> > > +			/* we didn't find any pid , sweep and try again */
> > > +			again = 0;
> > > +			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
> > > +			last_pid = RESERVED_PIDS;
> > > +			goto repeat;
> > > +		}
> > > +		next_safe = RESERVED_PIDS;
> > > +		return 0;
> >
> > Probably, the bug is here. No bug ....
> 
> >
> >   +	next_safe = RESERVED_PIDS;	/* or 0 */
> >
> > > +	read_unlock(&tasklist_lock);
> > > +	spin_unlock(&lastpid_lock);
> > > +	return 0;
> > >  }
> >
> > Basically nice, I think.
> >
> > BTW, How about using the __set_bit(), find_next_zero_bit(), and
> > find_next_bit() in get_pid_by_map().
> >
> > Thanks for nice work.
> 
> OGAWA, honestly I only tried testcase 2.
> But looking at your suggestion its not clear to me whether
> there is a bug.
> Remember we need to determine a valid interval [ last_pid .. next_safe ).
> In the pid_map function, if no pid is available, then
> [ PID_MAX .. PID_MAX ) will be returned.
> The other path should also end up with this as well.
> Could you  point where you see this not happening.

Maybe my point was unclear. Sorry.

Please consider what happens after using up pid. Then,
get_pid_by_map() returns 0.

And last_pid = 0, next_safe = RESERVED_PIDS.  After it, get_pid()
returns the values between 0 and RESERVED_PIDS.

And the line which I added is also the same reason. 

Regards.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-16  5:12       ` OGAWA Hirofumi
@ 2002-03-18 21:44         ` Hubertus Franke
  2002-03-22 22:14         ` Hubertus Franke
  1 sibling, 0 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-18 21:44 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: Rajan Ravindran, linux-kernel, lse-tech

On Saturday 16 March 2002 12:12 am, OGAWA Hirofumi wrote:
> Hubertus Franke <frankeh@watson.ibm.com> writes:
> > On Friday 15 March 2002 10:16 am, OGAWA Hirofumi wrote:
> > > Whoops! I'm sorry. previous email was the middle of writing.
> > >
> > > Hubertus Franke <frankeh@watson.ibm.com> writes:
> > > > +	if (i == PID_MAP_SIZE) {
> > > > +		if (again) {
> > > > +			/* we didn't find any pid , sweep and try again */
> > > > +			again = 0;
> > > > +			memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
> > > > +			last_pid = RESERVED_PIDS;
> > > > +			goto repeat;
> > > > +		}
> > > > +		next_safe = RESERVED_PIDS;
> > > > +		return 0;
> > >
> > > Probably, the bug is here. No bug ....
> > >
> > >
> > >   +	next_safe = RESERVED_PIDS;	/* or 0 */
> > >
> > > > +	read_unlock(&tasklist_lock);
> > > > +	spin_unlock(&lastpid_lock);
> > > > +	return 0;
> > > >  }
> > >
> > > Basically nice, I think.
> > >
> > > BTW, How about using the __set_bit(), find_next_zero_bit(), and
> > > find_next_bit() in get_pid_by_map().
> > >
> > > Thanks for nice work.
> >
> > OGAWA, honestly I only tried testcase 2.
> > But looking at your suggestion its not clear to me whether
> > there is a bug.
> > Remember we need to determine a valid interval [ last_pid .. next_safe ).
> > In the pid_map function, if no pid is available, then
> > [ PID_MAX .. PID_MAX ) will be returned.
> > The other path should also end up with this as well.
> > Could you  point where you see this not happening.
>
> Maybe my point was unclear. Sorry.
>
> Please consider what happens after using up pid. Then,
> get_pid_by_map() returns 0.
>
> And last_pid = 0, next_safe = RESERVED_PIDS.  After it, get_pid()
> returns the values between 0 and RESERVED_PIDS.
>
> And the line which I added is also the same reason.
>
> Regards.

Hirofumi, yes you are right, "the thought was there, but the implementation 
was weak".
Proofs again, all these benchmarks are not enough. I'll make a new patch 
soon and resubmit.
-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] get_pid() performance fix
  2002-03-16  5:12       ` OGAWA Hirofumi
  2002-03-18 21:44         ` Hubertus Franke
@ 2002-03-22 22:14         ` Hubertus Franke
  2002-03-22 22:28           ` Davide Libenzi
  1 sibling, 1 reply; 28+ messages in thread
From: Hubertus Franke @ 2002-03-22 22:14 UTC (permalink / raw)
  To: OGAWA Hirofumi, davej, marcelo; +Cc: Rajan Ravindran, linux-kernel, lse-tech

I implemented an alternative version of getpid, that for large thread counts
( > 220000), provides "significantly" better performance as shown in attached
performance analysis. This is particulary viable for PID_MAX=32768.
Note under the current algorithm, allocating the last pid will take 32 
seconds (!!!) on a 500MHZ P-III.
I addressed the correctnes issue that Hirofumi brought up with the last 
submission (now test case "-c 4")
Attached patch is against 2.5.7.

-- Hubertus Franke <frankeh@watson.ibm.com>

------------------------------------------------------------------------------

Currently the getpid algorithm works as follows:
At any given time an interval of [ last_pid .. next_safe ) is known to hold  
unused pids. Initially the interval is set to [0 .. 32767]

Hence to allocate a new pid the following is sufficient:

       if (++last_pid < next_safe) return last_pid;

However, if we move out of the safe interval, the next safe interval needs 
to be established first.
This is currently done by a repetive search 

repeat:
     foralltasks(p) {
        if (p uses lastpid) { last_pid++; goto repeat; }
        /* narrow [ last_pid .. next_safe ) */
        if (p->pids in [ last_pid .. next_safe ) ) next_safe = p->pid
     }

Particulary for large number of tasks, this can lead to frequent exercise of
the repeat resulting in a O(N^2) algorithm. We call this : <algo-0>.

Instead I have provided an alternative mechanism that at the time
of determining the next interval marks a bitmask by walking the tasklist
once [ O(N) ] and then finding the proper bit offsets to mark the next free
interval starting from last_pid. The bitmap requires 4096 bytes.
This is <algo-1>.

An optimization to this to keep the last bitmap instead of clearing it
with every search. Only if we fail to obtain a free range, then we have
to go back and clear the bitmap and redo the search one more time.
This is <algo-2>. Here I omit the results, as they have only shown 
improvements for the very last few pids.

I dragged the various algorithms into a userlevel test program to figure
out where the cut off points are with PID_MAX=32768. In this testprogram
I maintain <TH> tasks, and for 10 rounds (delete <DEL> random tasks and 
reallocate <DEL> tasks again) resulting in T=10*D total measured allocations.

Si states how many interval searches where needed for algo-i.
Gi states the average overhead per get_pid for algo-i in usecs.

Based on that one should use the current algorithm until ~ 22-25K tasks and
beyond that use algo-2. Only the last 15 tasks are a bit faster under algo-1.
We can safely ignore that case.

Based on that providing an adaptive implementation seems the right choice.
The patch for 2.5.7 is attached.


executed program example: getpid -c 2 -s 10 -e 100 -D 10 -t <0,1> 
        0 is old 1 is new algo 2.

TH:  average thread count in system
DEL: number of randomly deleted threads and then reallocated
TAL: total number of getpid invocation
C-NEW: number of times search was invoked 
T-NEW: per getpid() overhead
C-OLD: number of times search was invoked 
T-OLD: per getpid() overhead


   TH   DEL   TAL |   C-NEW       T-NEW |   C-OLD       T-OLD
   10    10    80 |      1        0.79  |      1        0.41
   20    10   100 |      1        0.56  |      1        0.36
   30    10   100 |      1        0.56  |      1        0.38
   40    10   100 |      1        0.58  |      1        0.42
   50    10   100 |      1        0.59  |      1        0.43
   60    10   100 |      2        0.84  |      2        0.52
   70    10   100 |      1        0.72  |      1        0.43
   80    10   100 |      1        0.72  |      1        0.48
  100    50   500 |      2        0.38  |      2        0.27
  150    50   500 |      4        0.56  |      3        0.32
  200    50   500 |      5        0.79  |      4        0.40
  250    50   500 |      6        1.03  |      2        0.35
  300    50   500 |      9        1.68  |      2        0.38
  350    50   500 |      6        1.47  |      4        0.58
  400    50   500 |     10        2.54  |      8        1.05
  450    50   500 |      7        2.10  |      7        1.02
  500    50   500 |      7        2.32  |      4        0.75
  550    50   500 |      9        3.13  |      5        0.95
  600    50   500 |     14        5.14  |      6        1.18
  650    50   500 |     13        5.23  |      9        1.79
  700    50   500 |     10        4.35  |      9        1.87
  750    50   500 |     15        6.91  |      8        2.02
  800    50   500 |     12        5.98  |      8        1.95
  850    50   500 |     13        6.85  |      9        2.29
  900    50   500 |     27       14.55  |     18        4.46
 1000  1000  9980 |      1        0.21  |      1        0.21
 2000  1000 10000 |    322       14.36  |     76        2.03
 3000  1000 10000 |    606       46.10  |    161        6.63
 4000  1000 10000 |    901       97.58  |    359       18.87
 5000  1000 10000 |   1165      164.19  |    539       38.75
 6000  1000 10000 |   1498      266.58  |    724       66.96
 7000  1000 10000 |   1835      396.91  |   1026      122.35
 8000  1000 10000 |   2084      531.83  |   1228      179.70
 9000  1000 10000 |   2489      746.99  |   1516      264.16
10000  1000 10000 |   2763      946.99  |   1818      375.22
11000  1000 10000 |   3095     1199.73  |   2093      502.47
12000  1000 10000 |   3277     1422.02  |   2305      648.17
13000  1000 10000 |   3711     1776.28  |   2680      854.89
14000  1000 10000 |   3878     2033.30  |   2959     1061.61
15000  1000 10000 |   4301     2452.35  |   3167     1266.78
16000  1000 10000 |   4485     2757.00  |   3466     1554.22
17000  1000 10000 |   4844     3210.19  |   3743     1857.74
18000  1000 10000 |   5218     3681.90  |   4069     2247.13
19000  1000 10000 |   5501     4118.69  |   4293     2605.90
20000  1000 10000 |   5770     4594.99  |   4616     3095.39
21000  1000 10000 |   6161     5172.44  |   4974     3686.87
22000  1000 10000 |   6389     5637.80  |   5177     4258.81
23000  1000 10000 |   6665     6191.73  |   5483     4949.61
24000  1000 10000 |   6982     6777.02  |   5838     5831.25
25000  1000 10000 |   7209     7318.15  |   6183     6905.34
----------------> Break even range <------------------------
26000  1000 10000 |   7533     7954.33  |   6413     7955.66
27000  1000 10000 |   7959     8749.97  |   6748     9444.79
28000  1000 10000 |   8140     9302.36  |   7139    11617.75
29000  1000 10000 |   8501    10085.77  |   7638    14960.53
30000  1000 10000 |   8911    10946.80  |   8178    19584.24
32000    50   500 |    487    12265.89  |    486    94498.36
32050    50   500 |    486    12314.17  |    486    95832.03
32100    50   500 |    486    12389.38  |    488   108895.28
32150    50   500 |    492    12599.58  |    484   111742.39
32200    50   500 |    497    12792.36  |    491   124440.45
32250    50   500 |    490    12659.33  |    490   134499.09
32300    50   500 |    495    12843.78  |    489   145873.72
32350    50   500 |    495    12915.18  |    493   158940.66
32400    50   500 |    496    12988.64  |    494   183092.45
32450    50   500 |    492    12924.86  |    490   196135.28
32500    50   500 |    495    13043.85  |    488   223226.98
32550    50   500 |    498    13164.09  |    495   265431.10
32600    50   500 |    498    13222.56  |    495   326363.36
32650    50   500 |    498    13279.90  |    497   441002.09
32700    50   500 |    499    13368.66  |    499   656269.52
32751     1    10 |     10    12836.40  |     10  4031660.40
32752     1    10 |     10    12831.70  |     10  4620214.70
32753     1    10 |     10    12832.60  |     10  4188605.70
32754     1    10 |     10    12837.60  |     10  2972975.40
32755     1    10 |     10    12835.90  |     10  4434635.70
32756     1    10 |     10    12833.80  |     10  3892058.30
32757     1    10 |     10    12839.10  |     10  5002365.30
32758     1    10 |     10    12840.20  |     10  6332786.20
32759     1    10 |     10    12840.20  |     10  5269462.90
32760     1    10 |     10    14127.80  |     10  8234780.40
32761     1    10 |     10    14134.90  |     10  7558043.20
32762     1    10 |     10    14129.70  |     10  9117119.40
32763     1    10 |     10    14134.70  |     10 13895498.10
32764     1    10 |     10    14140.10  |     10 13608972.90
32765     1    10 |     10    15427.30  |     10 12930469.20
32766     1    10 |     10    16708.30  |     10 23576610.90
32767     1    10 |     10    17997.90  |     10 32603396.10

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

diff -urbN linux-2.5.7/kernel/fork.c linux-2.5.7-pid/kernel/fork.c
--- linux-2.5.7/kernel/fork.c	Mon Mar 18 15:37:05 2002
+++ linux-2.5.7-pid/kernel/fork.c	Fri Mar 22 16:38:29 2002
@@ -125,24 +125,111 @@
 /* Protects next_safe and last_pid. */
 spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
 
+/* this should be provided in every architecture */
+#ifndef SHIFT_PER_LONG
+#if BITS_PER_LONG == 64
+#   define SHIFT_PER_LONG 6
+#elif BITS_PER_LONG == 32
+#   define SHIFT_PER_LONG 5
+#else
+#error "SHIFT_PER_LONG"
+#endif
+#endif
+
+#define RESERVED_PIDS       (300)
+#define GETPID_THRESHOLD    (26000)  /* when to switch to a mapped algo */
+#define PID_MAP_SIZE        (PID_MAX >> SHIFT_PER_LONG)
+static unsigned long pid_map[PID_MAP_SIZE];
+static int next_safe = PID_MAX;
+
+static inline void mark_pid(int pid)
+{
+	__set_bit(pid,pid_map);
+}
+
+static int get_pid_by_map(int lastpid) 
+{
+	static int mark_and_sweep = 0;
+
+	int round = 0;
+        struct task_struct *p;
+        int i;
+	unsigned long mask;
+	int fpos;
+
+
+	if (mark_and_sweep) {
+repeat:
+		mark_and_sweep = 0;
+		memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
+		lastpid = RESERVED_PIDS;
+	}
+	for_each_task(p) {
+		mark_pid(p->pid);
+		mark_pid(p->pgrp);
+		mark_pid(p->tgid);
+		mark_pid(p->session);
+	}
+
+	/* find next free pid */
+	i = (lastpid >> SHIFT_PER_LONG);
+	mask = pid_map[i] | (((lastpid & (BITS_PER_LONG-1)) << 1) - 1);
+	
+	while ((mask == ~0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+	
+	if (i == PID_MAP_SIZE) { 
+		if (round == 0) {
+			round = 1;
+			goto repeat;
+		}
+		next_safe = RESERVED_PIDS;
+		mark_and_sweep = 1;  /* mark next time */
+		return 0; 
+	}
+
+	fpos = ffz(mask);
+	i &= (PID_MAX-1);
+	lastpid = (i << SHIFT_PER_LONG) + fpos;
+
+	/* find next save pid */
+	mask &= ~((1 << fpos) - 1);
+
+	while ((mask == 0) && (++i < PID_MAP_SIZE)) 
+		mask = pid_map[i];
+
+	if (i==PID_MAP_SIZE) 
+		next_safe = PID_MAX;
+	else 
+		next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
+	return lastpid;
+}
+
 static int get_pid(unsigned long flags)
 {
-	static int next_safe = PID_MAX;
 	struct task_struct *p;
-	int pid;
+	int pid,beginpid;
 
 	if (flags & CLONE_PID)
 		return current->pid;
 
 	spin_lock(&lastpid_lock);
+	beginpid = last_pid;
 	if((++last_pid) & 0xffff8000) {
-		last_pid = 300;		/* Skip daemons etc. */
+		last_pid = RESERVED_PIDS;		/* Skip daemons etc. */
 		goto inside;
 	}
 	if(last_pid >= next_safe) {
 inside:
 		next_safe = PID_MAX;
 		read_lock(&tasklist_lock);
+		if (nr_threads > GETPID_THRESHOLD) {
+			last_pid = get_pid_by_map(last_pid);
+			if (unlikely(last_pid == 0)) {
+				last_pid = PID_MAX;
+				goto nomorepids; 
+			}
+		} else {
 	repeat:
 		for_each_task(p) {
 			if(p->pid == last_pid	||
@@ -151,9 +238,11 @@
 			   p->session == last_pid) {
 				if(++last_pid >= next_safe) {
 					if(last_pid & 0xffff8000)
-						last_pid = 300;
+							last_pid = RESERVED_PIDS;
 					next_safe = PID_MAX;
 				}
+					if(unlikely(last_pid == beginpid))
+						goto nomorepids;
 				goto repeat;
 			}
 			if(p->pid > last_pid && next_safe > p->pid)
@@ -162,6 +251,9 @@
 				next_safe = p->pgrp;
 			if(p->session > last_pid && next_safe > p->session)
 				next_safe = p->session;
+				if(p->tgid > last_pid && next_safe > p->tgid)
+					next_safe = p->tgid;
+			}
 		}
 		read_unlock(&tasklist_lock);
 	}
@@ -169,6 +261,10 @@
 	spin_unlock(&lastpid_lock);
 
 	return pid;
+nomorepids:
+	read_unlock(&tasklist_lock);
+	spin_unlock(&lastpid_lock);
+	return 0;
 }
 
 static inline int dup_mmap(struct mm_struct * mm)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-22 22:28           ` Davide Libenzi
@ 2002-03-22 22:26             ` Hubertus Franke
  0 siblings, 0 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-22 22:26 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: OGAWA Hirofumi, davej, marcelo, Rajan Ravindran, linux-kernel, lse-tech

On Friday 22 March 2002 05:28 pm, Davide Libenzi wrote:
> On Fri, 22 Mar 2002, Hubertus Franke wrote:
> > I implemented an alternative version of getpid, that for large thread
> > counts ( > 220000), provides "significantly" better performance as shown
> > in attached
>
>       ^^^^^^
> You've a very nice system Hubertus because it's about 1.8Gb only for the
> stack :-)
>
>
>
> - Davide


Friday evening alert. Already thinking about the pub...
Meant 22000-25000.


-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] get_pid() performance fix
  2002-03-22 22:14         ` Hubertus Franke
@ 2002-03-22 22:28           ` Davide Libenzi
  2002-03-22 22:26             ` Hubertus Franke
  0 siblings, 1 reply; 28+ messages in thread
From: Davide Libenzi @ 2002-03-22 22:28 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: OGAWA Hirofumi, davej, marcelo, Rajan Ravindran, linux-kernel, lse-tech

On Fri, 22 Mar 2002, Hubertus Franke wrote:

> I implemented an alternative version of getpid, that for large thread counts
> ( > 220000), provides "significantly" better performance as shown in attached
      ^^^^^^
You've a very nice system Hubertus because it's about 1.8Gb only for the
stack :-)



- Davide



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07 19:46         ` Guest section DW
@ 2002-03-07 23:14           ` Hubertus Franke
  0 siblings, 0 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-07 23:14 UTC (permalink / raw)
  To: Guest section DW; +Cc: Rusty Russell, rajancr, linux-kernel, lse-tech

On Thursday 07 March 2002 02:46 pm, Guest section DW wrote:
> On Thu, Mar 07, 2002 at 02:07:38PM -0500, Hubertus Franke wrote:
> > On Thursday 07 March 2002 09:54 am, Guest section DW wrote:
> > > On Thu, Mar 07, 2002 at 09:35:09AM -0500, Hubertus Franke wrote:
> > > ...
> > >
> > > Long ago I submitted a patch that changed the max pid from 15 bits to
> > > 24 or 30 bits or so. (And of course removed the inefficiency noticed
> > > by some people in the current thread.)
> >
> > I don't think that will solve the N^2 problem
>
> Do you understand "inefficiency"? And "remove"?


I do, what's your point ?
I haven't seen your patch picked up....

I do understand that when you occasionally get into this scenario
of the loop that particularly for large thread counts this stalls the system
for seconds (say 30, with the tasklock held).


-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
@ 2002-03-07 22:37 Manfred Spraul
  0 siblings, 0 replies; 28+ messages in thread
From: Manfred Spraul @ 2002-03-07 22:37 UTC (permalink / raw)
  To: Guest section DW, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 512 bytes --]

> Maybe Alan will mutter something about sysvipc.
> Roughly speaking there are only advantages, especially since
> I think we'll have to do this sooner or later, and in such cases
> sooner is better.

The sysvipc situation isn't that bad. The new interfaces added for
32-bit uids also contain 32-bit pid values.
The semaphore code contains a stupid hack that assumes 16-bit uids,
appart from that apps linked against new glibc version should run fine.

[patch vs. 2.5.6-pre2 attached, but untested].

--
	Manfred

[-- Attachment #2: patch-sem --]
[-- Type: text/plain, Size: 1853 bytes --]

--- 2.5/ipc/sem.c	Sun Mar  3 12:04:00 2002
+++ build-2.5/ipc/sem.c	Thu Mar  7 23:33:02 2002
@@ -251,39 +251,39 @@
 	for (sop = sops; sop < sops + nsops; sop++) {
 		curr = sma->sem_base + sop->sem_num;
 		sem_op = sop->sem_op;
-
-		if (!sem_op && curr->semval)
+		result = curr->semval;
+  
+		if (!sem_op && result)
 			goto would_block;
 
-		curr->sempid = (curr->sempid << 16) | pid;
-		curr->semval += sem_op;
-		if (sop->sem_flg & SEM_UNDO)
-		{
+		result += sem_op;
+		if (result < 0)
+			goto would_block;
+		if (result > SEMVMX)
+			goto out_of_range;
+		if (sop->sem_flg & SEM_UNDO) {
 			int undo = un->semadj[sop->sem_num] - sem_op;
 			/*
 	 		 *	Exceeding the undo range is an error.
 			 */
 			if (undo < (-SEMAEM - 1) || undo > SEMAEM)
-			{
-				/* Don't undo the undo */
-				sop->sem_flg &= ~SEM_UNDO;
 				goto out_of_range;
-			}
-			un->semadj[sop->sem_num] = undo;
 		}
-		if (curr->semval < 0)
-			goto would_block;
-		if (curr->semval > SEMVMX)
-			goto out_of_range;
+		curr->semval = result;
 	}
 
-	if (do_undo)
-	{
-		sop--;
+	if (do_undo) {
 		result = 0;
 		goto undo;
 	}
-
+	sop--;
+	while (sop >= sops) {
+		sma->sem_base[sop->sem_num].sempid = pid;
+		if (sop->sem_flg & SEM_UNDO)
+			un->semadj[sop->sem_num] -= sop->sem_op;
+		sop--;
+	}
+	
 	sma->sem_otime = CURRENT_TIME;
 	return 0;
 
@@ -298,13 +298,9 @@
 		result = 1;
 
 undo:
+	sop--;
 	while (sop >= sops) {
-		curr = sma->sem_base + sop->sem_num;
-		curr->semval -= sop->sem_op;
-		curr->sempid >>= 16;
-
-		if (sop->sem_flg & SEM_UNDO)
-			un->semadj[sop->sem_num] += sop->sem_op;
+		sma->sem_base[sop->sem_num].semval -= sop->sem_op;
 		sop--;
 	}
 
@@ -624,7 +620,7 @@
 		err = curr->semval;
 		goto out_unlock;
 	case GETPID:
-		err = curr->sempid & 0xffff;
+		err = curr->sempid;
 		goto out_unlock;
 	case GETNCNT:
 		err = count_semncnt(sma,semnum);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07 19:07       ` Hubertus Franke
  2002-03-07 19:44         ` Richard B. Johnson
@ 2002-03-07 19:46         ` Guest section DW
  2002-03-07 23:14           ` Hubertus Franke
  1 sibling, 1 reply; 28+ messages in thread
From: Guest section DW @ 2002-03-07 19:46 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, rajancr, linux-kernel, lse-tech

On Thu, Mar 07, 2002 at 02:07:38PM -0500, Hubertus Franke wrote:
> On Thursday 07 March 2002 09:54 am, Guest section DW wrote:
> > On Thu, Mar 07, 2002 at 09:35:09AM -0500, Hubertus Franke wrote:
> > ...
> >
> > Long ago I submitted a patch that changed the max pid from 15 bits to
> > 24 or 30 bits or so. (And of course removed the inefficiency noticed
> > by some people in the current thread.)

> I don't think that will solve the N^2 problem

Do you understand "inefficiency"? And "remove"?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07 19:07       ` Hubertus Franke
@ 2002-03-07 19:44         ` Richard B. Johnson
  2002-03-07 19:46         ` Guest section DW
  1 sibling, 0 replies; 28+ messages in thread
From: Richard B. Johnson @ 2002-03-07 19:44 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Guest section DW, Rusty Russell, rajancr, linux-kernel, lse-tech

On Thu, 7 Mar 2002, Hubertus Franke wrote:

> On Thursday 07 March 2002 09:54 am, Guest section DW wrote:
> > On Thu, Mar 07, 2002 at 09:35:09AM -0500, Hubertus Franke wrote:
> > ...
> >
> > Long ago I submitted a patch that changed the max pid from 15 bits to
> > 24 or 30 bits or so. (And of course removed the inefficiency noticed
> > by some people in the current thread.)
> > Probably this is a good moment to try and see what Linus thinks
> > about this today.
> >
> > [Of course Albert Cahalan will object that this is bad for the columns
> > of ps output. Maybe Alan will mutter something about sysvipc.
> > Roughly speaking there are only advantages, especially since
> > I think we'll have to do this sooner or later, and in such cases
> > sooner is better.]
> 
> I don't think that will solve the N^2 problem you still have the algorithm
> do the following:
> 
>        if (++last_pid > next_safe) {
>  repeat:	
> 	last_pid++;
> 	foralltasks p:
> 		deal with wraparound;
> 		if (p uses last_pid) goto repeat
> 		determine next_safe
> 	}
> [ last_pid  ... next_safe ) is the range that can be saftely used
> 
> By extending it to 24 or larger bits all you do is handle the wraparound
> at some later point and less frequent It also becomes expensive if a large 
> number of threads is present. 
> Well, the problem can be fixed. Even in the current 16 bit approach.
> What we are experimenting around
> is a <mark-and-sweep> algorithm that would be invoked if the nr_threads is 
> above a threshold. 
> The algorithm would do something like this. Rajan will code it up and see
> its efficacy.
> 
> if (last_pid >= next_safe) {
> inside:
> 	if (nr_threads > threshold) {  // constant 
> 		last_pid = get_pid_map(last_pid,&next_safe);
> 	} else {
> 	 	.. <as now>
> 	}
> }
> 
> static unsigned long pid_map[1<<12];
> 
> /* determine a range of pids that is available for sure 
>  *   [ last_pid .. next_safe ) 
>  * pid_map has stale information. some pids might be marked
>  * as used even if they had been freed in the meantime
>  */
> 
> int get_pid_map(int last_pid, int *next_safe)
> {
> 	int again = 1;
> repeat:
> 	for_each_task(p)
> 		mark_pids_in_bitmap;
> 		last_pid = ffz(pid_map);   /* we will start from last_pid with wraparound */
> 		if ((last_pid == -1) && (again)) {
> 			again = 0;
> 			memset(pid_map,0);
> 			goto repeat
> 		}
> 	}
> 	next_safe = first_non_zero(pid_map,last_pid); /* starting from last_pid */
> 	return last_pid;
> }
> 
> 
> 
> Note, if the pid map is to large, it can be done in smaller sections or 
> windows. Also, note keeping stale information is OK actually desirable, as
> it avoids the sweep in almost all cases.
> 			
> -- 
> -- Hubertus Franke  (frankeh@watson.ibm.com)
> -

If security issues were not a concern, you save the last PID, freed at
_exit() and use that immediately. If it's already used, it's zeroed.
exit() just stuffs the exit() PID into the variable, overwriting any
previous, including zero. It needs to be handled under a lock, but
it gets a quick return on investment for the usual fork() exec() stuff
that interactive users use.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (799.53 BogoMips).

	Bill Gates? Who?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07 14:54     ` Guest section DW
@ 2002-03-07 19:07       ` Hubertus Franke
  2002-03-07 19:44         ` Richard B. Johnson
  2002-03-07 19:46         ` Guest section DW
  0 siblings, 2 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-07 19:07 UTC (permalink / raw)
  To: Guest section DW; +Cc: Rusty Russell, rajancr, linux-kernel, lse-tech

On Thursday 07 March 2002 09:54 am, Guest section DW wrote:
> On Thu, Mar 07, 2002 at 09:35:09AM -0500, Hubertus Franke wrote:
> ...
>
> Long ago I submitted a patch that changed the max pid from 15 bits to
> 24 or 30 bits or so. (And of course removed the inefficiency noticed
> by some people in the current thread.)
> Probably this is a good moment to try and see what Linus thinks
> about this today.
>
> [Of course Albert Cahalan will object that this is bad for the columns
> of ps output. Maybe Alan will mutter something about sysvipc.
> Roughly speaking there are only advantages, especially since
> I think we'll have to do this sooner or later, and in such cases
> sooner is better.]

I don't think that will solve the N^2 problem you still have the algorithm
do the following:

       if (++last_pid > next_safe) {
 repeat:	
	last_pid++;
	foralltasks p:
		deal with wraparound;
		if (p uses last_pid) goto repeat
		determine next_safe
	}
[ last_pid  ... next_safe ) is the range that can be saftely used

By extending it to 24 or larger bits all you do is handle the wraparound
at some later point and less frequent It also becomes expensive if a large 
number of threads is present. 
Well, the problem can be fixed. Even in the current 16 bit approach.
What we are experimenting around
is a <mark-and-sweep> algorithm that would be invoked if the nr_threads is 
above a threshold. 
The algorithm would do something like this. Rajan will code it up and see
its efficacy.

if (last_pid >= next_safe) {
inside:
	if (nr_threads > threshold) {  // constant 
		last_pid = get_pid_map(last_pid,&next_safe);
	} else {
	 	.. <as now>
	}
}

static unsigned long pid_map[1<<12];

/* determine a range of pids that is available for sure 
 *   [ last_pid .. next_safe ) 
 * pid_map has stale information. some pids might be marked
 * as used even if they had been freed in the meantime
 */

int get_pid_map(int last_pid, int *next_safe)
{
	int again = 1;
repeat:
	for_each_task(p)
		mark_pids_in_bitmap;
		last_pid = ffz(pid_map);   /* we will start from last_pid with wraparound */
		if ((last_pid == -1) && (again)) {
			again = 0;
			memset(pid_map,0);
			goto repeat
		}
	}
	next_safe = first_non_zero(pid_map,last_pid); /* starting from last_pid */
	return last_pid;
}



Note, if the pid map is to large, it can be done in smaller sections or 
windows. Also, note keeping stale information is OK actually desirable, as
it avoids the sweep in almost all cases.
			
-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07  3:56 ` Rusty Russell
  2002-03-07 14:35   ` Hubertus Franke
@ 2002-03-07 15:21   ` Dave McCracken
  1 sibling, 0 replies; 28+ messages in thread
From: Dave McCracken @ 2002-03-07 15:21 UTC (permalink / raw)
  To: frankeh, Rusty Russell, rajancr; +Cc: linux-kernel, lse-tech



--On Thursday, March 07, 2002 09:35:09 AM -0500 Hubertus Franke
<frankeh@watson.ibm.com> wrote:

>> At a cursory glance, this seems to be three patches:
>> 	1) Fix the get_pid() hang.
>> 	2) Speed up get_pid().
>> 
>> 	3) And this piece I'm not sure about:
>> > +                 if(p->tgid > last_pid && next_safe > p->tgid)
>> > +                       next_safe = p->tgid;

> 1) was done by Greg Larson and was already submitted
> 2) once properly done, we will circulate before bothering Linus again
> 3) this must have come in because of a wrong patch generation.

Actually that was Paul Larson, but close enough :)

3) is actually a separate bugfix.  I had a patch accepted some time back
that added the check for tgid, but missed adding it to the section below.
Paul caught it when he did his patch and added it for me.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07 14:35   ` Hubertus Franke
@ 2002-03-07 14:54     ` Guest section DW
  2002-03-07 19:07       ` Hubertus Franke
  0 siblings, 1 reply; 28+ messages in thread
From: Guest section DW @ 2002-03-07 14:54 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, rajancr, linux-kernel, lse-tech

On Thu, Mar 07, 2002 at 09:35:09AM -0500, Hubertus Franke wrote:
...

Long ago I submitted a patch that changed the max pid from 15 bits to
24 or 30 bits or so. (And of course removed the inefficiency noticed
by some people in the current thread.)
Probably this is a good moment to try and see what Linus thinks
about this today.

[Of course Albert Cahalan will object that this is bad for the columns
of ps output. Maybe Alan will mutter something about sysvipc.
Roughly speaking there are only advantages, especially since
I think we'll have to do this sooner or later, and in such cases
sooner is better.]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-07  3:56 ` Rusty Russell
@ 2002-03-07 14:35   ` Hubertus Franke
  2002-03-07 14:54     ` Guest section DW
  2002-03-07 15:21   ` Dave McCracken
  1 sibling, 1 reply; 28+ messages in thread
From: Hubertus Franke @ 2002-03-07 14:35 UTC (permalink / raw)
  To: Rusty Russell, rajancr; +Cc: linux-kernel, lse-tech

On Wednesday 06 March 2002 10:56 pm, Rusty Russell wrote:
> On Mon, 4 Mar 2002 20:57:49 -0500
>
> Hubertus Franke <frankeh@watson.ibm.com> wrote:
> > Can somebody post why this patch shouldn't be picked up ?
> > The attached program shows the problem in user space
> > and the patch is almost trivial ..
>
> At a cursory glance, this seems to be three patches:
> 	1) Fix the get_pid() hang.
> 	2) Speed up get_pid().
>
> 	3) And this piece I'm not sure about:
> > +                 if(p->tgid > last_pid && next_safe > p->tgid)
> > +                       next_safe = p->tgid;
>
> Please split, and send the fix get_pid() hang to trivial patch monkey,
> and push optimization to Linus.
>
> Cheers!
> Rusty.

Thanks, patch was bad and not properly functioning as pointed out to us.
We are rewriting right now (actually <rajancr@us.ibm.com> 
is doing the coding. I am just there for idea bouncing
easy if the office is 2 doors away.

1) was done by Greg Larson and was already submitted
2) once properly done, we will circulate before bothering Linus again
3) this must have come in because of a wrong patch generation.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05  1:57 Hubertus Franke
  2002-03-05 16:26 ` OGAWA Hirofumi
@ 2002-03-07  3:56 ` Rusty Russell
  2002-03-07 14:35   ` Hubertus Franke
  2002-03-07 15:21   ` Dave McCracken
  1 sibling, 2 replies; 28+ messages in thread
From: Rusty Russell @ 2002-03-07  3:56 UTC (permalink / raw)
  To: frankeh; +Cc: linux-kernel, lse-tech

On Mon, 4 Mar 2002 20:57:49 -0500
Hubertus Franke <frankeh@watson.ibm.com> wrote:

> 
> Can somebody post why this patch shouldn't be picked up ?
> The attached program shows the problem in user space 
> and the patch is almost trivial ..

At a cursory glance, this seems to be three patches:
	1) Fix the get_pid() hang.
	2) Speed up get_pid().
	3) And this piece I'm not sure about:
> +                 if(p->tgid > last_pid && next_safe > p->tgid)
> +                       next_safe = p->tgid;

Please split, and send the fix get_pid() hang to trivial patch monkey,
and push optimization to Linus.

Cheers!
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05 16:26 ` OGAWA Hirofumi
@ 2002-03-05 16:43   ` Hubertus Franke
  0 siblings, 0 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-05 16:43 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: linux-kernel, lse-tech

On Tuesday 05 March 2002 11:26 am, OGAWA Hirofumi wrote:
> Hubertus Franke <frankeh@watson.ibm.com> writes:
> > @@ -153,13 +155,18 @@
> >                               if(last_pid & 0xffff8000)
> >                                     last_pid = 300;
> >                               next_safe = PID_MAX;
> > +                             goto repeat;
> >                         }
> > -                       goto repeat;
> > +                       if(unlikely(last_pid == beginpid))
> > +                             goto nomorepids;
> > +                       continue;
>
> It isn't guaranteed that pid is unique.
>
> In the case:
> 	task->pid = 300, task->xxx = 301
> 	pid 301 is free
>
> 	This get_pid() returns 301.
>
> Regards.

No the point of this patch was to limit the search time for finding
the next available pid. We are not mocking around with the 
logic that declares a pid available or not. That stays the same.
However, one doesn't need to start every single time from the
beginning to find the next available pid.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fwd: [Lse-tech] get_pid() performance fix
  2002-03-05  1:57 Hubertus Franke
@ 2002-03-05 16:26 ` OGAWA Hirofumi
  2002-03-05 16:43   ` Hubertus Franke
  2002-03-07  3:56 ` Rusty Russell
  1 sibling, 1 reply; 28+ messages in thread
From: OGAWA Hirofumi @ 2002-03-05 16:26 UTC (permalink / raw)
  To: frankeh; +Cc: linux-kernel, lse-tech

Hubertus Franke <frankeh@watson.ibm.com> writes:

> @@ -153,13 +155,18 @@
>                               if(last_pid & 0xffff8000)
>                                     last_pid = 300;
>                               next_safe = PID_MAX;
> +                             goto repeat;
>                         }
> -                       goto repeat;
> +                       if(unlikely(last_pid == beginpid))
> +                             goto nomorepids;
> +                       continue;

It isn't guaranteed that pid is unique.

In the case:
	task->pid = 300, task->xxx = 301
	pid 301 is free

	This get_pid() returns 301.

Regards.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Fwd: [Lse-tech] get_pid() performance fix
@ 2002-03-05  1:57 Hubertus Franke
  2002-03-05 16:26 ` OGAWA Hirofumi
  2002-03-07  3:56 ` Rusty Russell
  0 siblings, 2 replies; 28+ messages in thread
From: Hubertus Franke @ 2002-03-05  1:57 UTC (permalink / raw)
  To: linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 4033 bytes --]


Can somebody post why this patch shouldn't be picked up ?
The attached program shows the problem in user space 
and the patch is almost trivial ..

-- Hubertus

----------  Forwarded Message  ----------

Subject: [Lse-tech] get_pid() performance fix
Date: Tue, 26 Feb 2002 18:58:51 -0500
From: "Rajan Ravindran" <rajancr@us.ibm.com>
To: lse-tech@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org, davej@suse.de

Paul Larson posted a patch which fixes the get_pid() hang,
if we run out of available pids. Nevertheless, we have
observed that it takes a long time to find the next available
pid once the last pid reaches the PID_MAX. This is due to
a constant rescan of the task list while progressing only one
pid at time, yielding an O(n**2) problem.
Here is a patch (together with Paul's fix) which eliminates the
time taken to search for an available pid, by not constantly
restarting the search through the entire task list.

Attached is also a user level test program getpid.c),
by which one can simulate creation and deletion of processes.
This application will measure the time taken for the get_pid()
routine to find the next available process.

example:
      getpid -p 32770 -n 3

which will try to create 32770 process, eventually get_pid can't
find any free pid after 32767, so it will delete 3 pids randomly
from the existing list and start calculating the time taken to
find the available pid (which we deleted).

In getpid.c the new fixes  are inside the #define NEW_METHOD.
Try compiling the getpid.c with and without -DNEW_METHOD compile
option to see the performance improvement.

here is an example output for the old and the new algorithm and
their respective execution time to find a new pid.

(See attached file: output)

This can/should be applied to 2.5 and 2.4 kernels.


(See attached file: getpid.c)

Thanks,
Rajan


diff -Naur linux-2.5.5/kernel/fork.c linux-2.5.5-new/kernel/fork.c
--- linux-2.5.5/kernel/fork.c Tue Feb 19 21:10:55 2002
+++ linux-2.5.5-new/kernel/fork.c   Tue Feb 26 15:46:36 2002
@@ -24,6 +24,7 @@
 #include <linux/file.h>
 #include <linux/binfmts.h>
 #include <linux/fs.h>
+#include <linux/compiler.h>

 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -129,12 +130,13 @@
 {
      static int next_safe = PID_MAX;
      struct task_struct *p;
-     int pid;
+     int pid, beginpid;

      if (flags & CLONE_PID)
            return current->pid;

      spin_lock(&lastpid_lock);
+     beginpid = last_pid;
      if((++last_pid) & 0xffff8000) {
            last_pid = 300;         /* Skip daemons etc. */
            goto inside;
@@ -153,13 +155,18 @@
                              if(last_pid & 0xffff8000)
                                    last_pid = 300;
                              next_safe = PID_MAX;
+                             goto repeat;
                        }
-                       goto repeat;
+                       if(unlikely(last_pid == beginpid))
+                             goto nomorepids;
+                       continue;
                  }
                  if(p->pid > last_pid && next_safe > p->pid)
                        next_safe = p->pid;
                  if(p->pgrp > last_pid && next_safe > p->pgrp)
                        next_safe = p->pgrp;
+                 if(p->tgid > last_pid && next_safe > p->tgid)
+                       next_safe = p->tgid;
                  if(p->session > last_pid && next_safe > p->session)
                        next_safe = p->session;
            }
@@ -169,6 +176,11 @@
      spin_unlock(&lastpid_lock);

      return pid;
+
+nomorepids:
+     read_unlock(&tasklist_lock);
+     spin_unlock(&lastpid_lock);
+     return 0;
 }

 static inline int dup_mmap(struct mm_struct * mm)
@@ -667,6 +679,8 @@

      copy_flags(clone_flags, p);
      p->pid = get_pid(clone_flags);
+     if (p->pid == 0 && current->pid != 0)
+           goto bad_fork_cleanup;

      INIT_LIST_HEAD(&p->run_list);

-------------------------------------------------------



-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

[-- Attachment #2: output --]
[-- Type: application/octet-stream, Size: 1014 bytes --]

./getpid -p 32771 -n 4 

Output when compiled without -DNEW_METHOD
-----------------------------------------
nprocs: 32771, ndelete:0, nitems:4
start creating 32771 processes
reached max pid <32768> reset last_pid to <300>
removed pid: <17767>
removed pid: <9158>
removed pid: <6249>
removed pid: <18547>
spent 1.172945 seconds in finding free pid <6249>
spent 2.519057 seconds in finding free pid <9158>
spent 9.857908 seconds in finding free pid <17767>
spent 10.924213 seconds in finding free pid <18547>


Output when compiled with -DNEW_METHOD
--------------------------------------
nprocs: 32771, ndelete:0, nitems:4
start creating 32771 processes
reached max pid <32768> reset last_pid to <300>
removed pid: <17767>
removed pid: <9158>
spent 0.18468 seconds in finding free pid <17767>
removed pid: <6249>
removed pid: <18547>
spent 0.30833 seconds in finding free pid <6249>
spent 0.34572 seconds in finding free pid <9158>
spent 0.41594 seconds in finding free pid <18547>


[-- Attachment #3: getpid.c --]
[-- Type: application/octet-stream, Size: 4613 bytes --]

/* 
 * Program to measure the get_pid() function performance
 *
 */

#include <stdio.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>         
#include <sys/time.h>         
#include <getopt.h>
#include <stdlib.h>

struct task_struct {
	int pid;
	struct task_struct *next_task;
};	

#define PID_MAX	0x8000
#define GETPID_MASK	0xffff8000
#define STACK_SIZE	8192

struct task_struct *tsk_head;
int last_pid;
unsigned long nprocs;
int verbose;
int start_del;

#define alloc_task_struct() \
	((struct task_struct *)malloc(sizeof(struct task_struct)))
#define for_each_task(p) \
        for (p = tsk_head ; (p = p->next_task) != tsk_head ; ) 

int create_task();
int (*ctask) (void *arg);

int __clone (int (*fn) (void *arg), void *thread_stack, int flags, void *arg); 

int get_pid()
{
	static int next_safe = PID_MAX;
	struct task_struct *p;
	static struct timeval stime, etime;
	static int start_search=0;
	int beginpid;
	long sec,usec;

	beginpid = last_pid;
	if((++last_pid) & GETPID_MASK) {
		printf("reached max pid <%d> reset last_pid to <300>\n", last_pid);
		last_pid = 300;		/* Skip daemons etc. */
		start_del = start_search =1;
        	gettimeofday(&stime, (struct timezone *) 0);   
		goto inside;
	}

#ifdef NEW_METHOD
	if(last_pid >= next_safe) {
inside:
		next_safe = PID_MAX;
	repeat:
       		for (p = tsk_head ; (p = p->next_task) != tsk_head ; ) {
			if(p->pid == last_pid) {
				if(++last_pid >= next_safe) {
					if(last_pid & GETPID_MASK)
					last_pid = 300;
					next_safe = PID_MAX;
					goto repeat;
				}
				continue;
			}
			if(p->pid > last_pid && next_safe > p->pid)
				next_safe = p->pid;
		}
	}
#else
	if(last_pid >= next_safe) {
inside:
		next_safe = PID_MAX;
	repeat:
       		for (p = tsk_head ; (p = p->next_task) != tsk_head ; ) {
			if(p->pid == last_pid) {
				if(++last_pid >= next_safe) {
					if(last_pid & GETPID_MASK)
						last_pid = 300;
					next_safe = PID_MAX;
				}
				goto repeat;
			}
		if(p->pid > last_pid && next_safe > p->pid)
				next_safe = p->pid;
		}
	}
#endif
	if (start_search) {
        	gettimeofday(&etime, (struct timezone *) 0);  
		if ((etime.tv_usec - stime.tv_usec) < 0) {
			sec = (etime.tv_sec - stime.tv_sec) -1;
			usec = stime.tv_usec - etime.tv_usec;
		}
		else {
			sec = (etime.tv_sec - stime.tv_sec);
			usec = etime.tv_usec - stime.tv_usec;
		}
		printf("spent %d.%02d seconds in finding free pid <%d>\n",
                                        sec, usec, last_pid);
	}
	return last_pid;
}

int create_task() {
	struct task_struct *p, *prev;
	unsigned int i;

	prev = tsk_head;
	tsk_head->pid = 0; tsk_head->next_task = tsk_head;
	printf("start creating %d processes\n", nprocs);
	for (i=0; i<nprocs; i++) {
		p = alloc_task_struct();
		p->next_task = tsk_head;
		p->pid = get_pid();
		if (verbose)
			printf("created <%d>\n", p->pid);
		prev->next_task = p;
		prev = prev->next_task;
	}
}

void delete_task(int nitems) {
	struct task_struct *p, *prev;
	unsigned int i;
	long int rand;

	while (!start_del);
	for (i=0; i < nitems; i++) {
		rand = random() % PID_MAX;
        	for (prev = p = tsk_head ; (p = p->next_task) != tsk_head ; ) { 
			if (p->pid == rand) {
				prev->next_task = p->next_task;
				printf("removed pid: <%d>\n", rand);
			}
			prev = p;
		}
	}
	start_del=0;	
}

void usage() {
	printf("Usage: getpid -p <nprocs> -n <npids> -s <seed> -v [verbose]\n");
	printf("       nprocs: number of process to create\n");
	printf("       npids: number of process to delete\n");
	printf("       verbose: print debug strings\n");
}

int main(int argc, char *argv[]) {
	int pid;
	char *stk;
	int wait_status, rc;
	int c, nitems=0;

        while ((c = getopt(argc, argv, "p:s:n:v:")) != -1) {
          switch(c) {
                case 'p': nprocs = atoi(optarg); break;
                case 'n': nitems = atoi(optarg); break;
		case 's': srand(atoi(optarg)); break;
                case 'v': verbose = 1; break;
                default: usage(); return 0;
          }
        }

	if ((nprocs == 0) || (nitems > PID_MAX)) {
		printf("Invalid parameter\n");
		usage();
		return 0;
	}
	tsk_head = alloc_task_struct();
	stk = (char *)malloc(2 * STACK_SIZE);

	ctask = create_task;
	pid = __clone(ctask, &stk[STACK_SIZE],  CLONE_VM | CLONE_FS | CLONE_SIGHAND, (void *)0);
	if (pid == -1)
		printf("errno=%d <%s>\n", errno,strerror(errno));
	
	delete_task(nitems);
	waitpid(-1, &wait_status, __WCLONE);
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2002-03-22 22:26 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-05 16:43 Fwd: [Lse-tech] get_pid() performance fix Rajan Ravindran
2002-03-05 17:57 ` OGAWA Hirofumi
2002-03-05 19:53   ` Hubertus Franke
2002-03-05 20:10     ` OGAWA Hirofumi
2002-03-05 21:59       ` Hubertus Franke
2002-03-05 22:48         ` OGAWA Hirofumi
2002-03-05 23:40           ` Hubertus Franke
2002-03-14 23:18 ` [PATCH] " Hubertus Franke
2002-03-15 14:57   ` OGAWA Hirofumi
2002-03-15 15:16   ` OGAWA Hirofumi
2002-03-15 18:37     ` Hubertus Franke
2002-03-16  5:12       ` OGAWA Hirofumi
2002-03-18 21:44         ` Hubertus Franke
2002-03-22 22:14         ` Hubertus Franke
2002-03-22 22:28           ` Davide Libenzi
2002-03-22 22:26             ` Hubertus Franke
  -- strict thread matches above, loose matches on Subject: below --
2002-03-07 22:37 Fwd: [Lse-tech] " Manfred Spraul
2002-03-05  1:57 Hubertus Franke
2002-03-05 16:26 ` OGAWA Hirofumi
2002-03-05 16:43   ` Hubertus Franke
2002-03-07  3:56 ` Rusty Russell
2002-03-07 14:35   ` Hubertus Franke
2002-03-07 14:54     ` Guest section DW
2002-03-07 19:07       ` Hubertus Franke
2002-03-07 19:44         ` Richard B. Johnson
2002-03-07 19:46         ` Guest section DW
2002-03-07 23:14           ` Hubertus Franke
2002-03-07 15:21   ` Dave McCracken

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).