linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
@ 2010-06-09  6:24 Salman
  2010-06-09  6:53 ` Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Salman @ 2010-06-09  6:24 UTC (permalink / raw)
  To: peterz, mingo, akpm, linux-kernel; +Cc: tytso

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore, many shell
scripts assume that pid numbers will not be used for some length of time.

Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..8cedeab 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
+					int prev;
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+
+					do {
+						prev = last;
+						last = cmpxchg(&pid_ns->last_pid,
+							       prev, pid);
+						if (last >= pid)
+							break;
+					} while (prev != last);
+
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09  6:24 [PATCH] Fix a race in pid generation that causes pids to be reused immediately Salman
@ 2010-06-09  6:53 ` Andi Kleen
  2010-06-09  9:48 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2010-06-09  6:53 UTC (permalink / raw)
  To: Salman; +Cc: peterz, mingo, akpm, linux-kernel, tytso

Salman <sqazi@google.com> writes:
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
> +					int prev;
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
> +
> +					do {
> +						prev = last;
> +						last = cmpxchg(&pid_ns->last_pid,
> +							       prev,
> +						pid);

At some point not all architectures in Linux supported cmpxchg,
so it was not allowed to use it unconditionally in portable code.

This might have changed now (at least the UP only architectures fall
back to a generic cmpxchg now I think), but I'm not sure you have full
coverage on SMP.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09  6:24 [PATCH] Fix a race in pid generation that causes pids to be reused immediately Salman
  2010-06-09  6:53 ` Andi Kleen
@ 2010-06-09  9:48 ` Ingo Molnar
  2010-06-09 15:39   ` Linus Torvalds
  2010-06-09 11:49 ` Michel Lespinasse
  2010-06-09 12:17 ` tytso
  3 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2010-06-09  9:48 UTC (permalink / raw)
  To: Salman, Linus Torvalds, Andrew Morton
  Cc: peterz, akpm, linux-kernel, tytso, Thomas Gleixner


* Salman <sqazi@google.com> wrote:

> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program.  This is really bad for bash implementation.  Furthermore, many shell
> scripts assume that pid numbers will not be used for some length of time.
> 
> Thanks to Ted Tso for the key ideas of this implementation.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  kernel/pid.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..8cedeab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
> +					int prev;
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
> +
> +					do {
> +						prev = last;
> +						last = cmpxchg(&pid_ns->last_pid,
> +							       prev, pid);
> +						if (last >= pid)
> +							break;
> +					} while (prev != last);
> +
>  					return pid;
>  				}
>  				offset = find_next_offset(map, offset);

Looks rather interesting. (Cleanliness-wise i'd suggest to split out the while 
loop into a helper function.)

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09  6:24 [PATCH] Fix a race in pid generation that causes pids to be reused immediately Salman
  2010-06-09  6:53 ` Andi Kleen
  2010-06-09  9:48 ` Ingo Molnar
@ 2010-06-09 11:49 ` Michel Lespinasse
  2010-06-09 12:37   ` tytso
  2010-06-09 12:17 ` tytso
  3 siblings, 1 reply; 45+ messages in thread
From: Michel Lespinasse @ 2010-06-09 11:49 UTC (permalink / raw)
  To: Salman; +Cc: peterz, mingo, akpm, linux-kernel, tytso

On Tue, Jun 08, 2010 at 11:24:38PM -0700, Salman wrote:
> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program.  This is really bad for bash implementation.  Furthermore, many shell
> scripts assume that pid numbers will not be used for some length of time.
> 
> Thanks to Ted Tso for the key ideas of this implementation.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
>  kernel/pid.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..8cedeab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -153,8 +153,17 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
> +					int prev;
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
> +
> +					do {
> +						prev = last;
> +						last = cmpxchg(&pid_ns->last_pid,
> +							       prev, pid);
> +						if (last >= pid)
> +							break;

You should make sure to handle pid wrap-around for this last/pid comparison.
I think proper way to do that would be:

	/* last is the pid we started scanning at
         * last_read is the last observed value of pid_ns->last_pid
         */
        last_read = last;
        do {
                prev = last_read;
                last_read = cmpxchg(&pid_ns->last_pid, prev, pid);
        /* Exit if one of these conditions is true:
         * - cmpxchg succeeded
         * - last <= pid <= last_read  (other thread already bumped last_pid)
         * - last_read <= last <= pid  (same with wraparound)
         * - pid <= last_read <= last  (same with different wraparound)
         */
        } while (last_read != prev &&
                 (last > pid || pid > last_read) &&
                 (last_read > last || last > pid) &&
                 (pid > last_read || last_read > last));

The last_read == pid case is also interesting - it means another thread found
the same pid, forked a child with that pid, and the child exited already
(since the bitmap was cleared). However we don't need to handle that case -
first, that race is much less likely to happen, and second, the duplicate
pid would be returned in two separate tasks - so this would not cause problems
in bash as in your example.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09  6:24 [PATCH] Fix a race in pid generation that causes pids to be reused immediately Salman
                   ` (2 preceding siblings ...)
  2010-06-09 11:49 ` Michel Lespinasse
@ 2010-06-09 12:17 ` tytso
  3 siblings, 0 replies; 45+ messages in thread
From: tytso @ 2010-06-09 12:17 UTC (permalink / raw)
  To: Salman; +Cc: peterz, mingo, akpm, linux-kernel, tytso

On Tue, Jun 08, 2010 at 11:24:38PM -0700, Salman wrote:

> A program that repeatedly forks and waits is susceptible to having
> the same pid repeated, especially when it competes with another
> instance of the same program.  This is really bad for bash
> implementation.  Furthermore, many shell scripts assume that pid
> numbers will not be used for some length of time.
> 
> Thanks to Ted Tso for the key ideas of this implementation.
> 
> Signed-off-by: Salman Qazi <sqazi@google.com>

Here's a slightly more succint way of expressing it.  Others will have
decide if it's easier to understand.  (It is for me, but I wrote it.  :-P)

       	       	      	 	      	  - Ted

 kernel/pid.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..c51f413 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -154,8 +154,13 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
-					return pid;
+					while (1) {
+						i = cmpxchg(&pid_ns->last_pid,
+							    last, pid);
+						if (i == last || i >= pid)
+							return pid;
+						last = i;
+					}
 				}
 				offset = find_next_offset(map, offset);
 				pid = mk_pid(pid_ns, map, offset);



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 11:49 ` Michel Lespinasse
@ 2010-06-09 12:37   ` tytso
  0 siblings, 0 replies; 45+ messages in thread
From: tytso @ 2010-06-09 12:37 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: Salman, peterz, mingo, akpm, linux-kernel, tytso

On Wed, Jun 09, 2010 at 04:49:03AM -0700, Michel Lespinasse wrote:
> You should make sure to handle pid wrap-around for this last/pid comparison.
> I think proper way to do that would be:

Good point!  I forgot about checking for pid wrap-around.

> The last_read == pid case is also interesting - it means another
> thread found the same pid, forked a child with that pid, and the
> child exited already (since the bitmap was cleared).  However we
> don't need to handle that case - first, that race is much less
> likely to happen, and second, the duplicate pid would be returned in
> two separate tasks - so this would not cause problems in bash as in
> your example.

In order for that to happen, all of this would have to happen between
the time that last_pid was initially sampled at the very beginning of
alloc_pidmap().  Could that happen?  I think it could, if kzalloc()
took a very long time to run, and the scheduler was _very_ unfair.  We
could try to guard against that by checking to see if last_pid has
changed after allocating map->page (in the unlikely case of !map->page
being NULL in the first place) and if so, restarting alloc_pidmap() by
jumping back to the beginning of the function.

Could that happen with bash?  I'm not as confident as you that it
could never happen.  The fact that we saw the race in the first place
in bash means that it could still happen in this case, I think.  In
any case, if we have two processes getting the same pid in the absence
of a fork, that would be a bad thing and could lead to all sorts of
confusion, so it's probably a good thing to guard against even if it
is a rare case.

BTW, Salman, I haven't seen it, but I vaguely remember in the mail
exchange with the person who reported this that we have a C/C++
reproduction case?  If it's small enough, it might be a good idea to
include it in the patch description.

						- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09  9:48 ` Ingo Molnar
@ 2010-06-09 15:39   ` Linus Torvalds
  2010-06-09 15:50     ` tytso
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 15:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Salman, Andrew Morton, peterz, akpm, linux-kernel, tytso,
	Thomas Gleixner



On Wed, 9 Jun 2010, Ingo Molnar wrote:

> 
> * Salman <sqazi@google.com> wrote:
> > 
> > A program that repeatedly forks and waits is susceptible to having the
> > same pid repeated, especially when it competes with another instance of the
> > same program.  This is really bad for bash implementation.  Furthermore, many shell
> > scripts assume that pid numbers will not be used for some length of time.
> > 
> > Thanks to Ted Tso for the key ideas of this implementation.
> 
> Looks rather interesting. (Cleanliness-wise i'd suggest to split out the while 
> loop into a helper function.)

I have to say that usually I can look at a patch and see what it does.

This time I had absolutely no clue.

There was a whole big context missing: that original load of "last_pid" 
into "last" at the top of the function, far outside the patch.  And in 
this case I don't think splitting out the trivial cmpxchg() loop would 
help that problem - that would just make the "load original" part of the 
big picture be even further away from the final "replace if same" part, 
and I think _that_ is a much more critical part of the subtleties there.

So I had to read the patch _and_ go read the code it patched, in order to 
at all understand what it did. I think the patch explanation should have 
done it, and I also think this would need a bit comment at the top.

[ In fact, I'd argue that the _old_ code would have needed a big comment 
  at the top about last_pid usage, but i somebody had done that, they'd 
  probably also have seen the race while explaning how the code worked ;]

So having looked at the patch and the code, I agree with the patch, but 
I'd like some more explanation to go with it.

[ Or Ted's version: as mentioned, I don't think the complexity is actually 
  in the final cmpxchg loop itself, but in the bigger picture, so I don't 
  think the differences between Ted's and Salman's versions are that big ]

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 15:39   ` Linus Torvalds
@ 2010-06-09 15:50     ` tytso
  2010-06-09 16:06       ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-09 15:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm, linux-kernel,
	tytso, Thomas Gleixner

On Wed, Jun 09, 2010 at 08:39:00AM -0700, Linus Torvalds wrote:
> 
> So I had to read the patch _and_ go read the code it patched, in order to 
> at all understand what it did. I think the patch explanation should have 
> done it, and I also think this would need a bit comment at the top.
> 
> [ In fact, I'd argue that the _old_ code would have needed a big comment 
>   at the top about last_pid usage, but i somebody had done that, they'd 
>   probably also have seen the race while explaning how the code worked ;]
>

Salman had created a very nice ASCII art diagram of the race in the
mail thread with the internal bug reporter who noticed the problem.
We could include that, if you don't mind the commit description
growing by 30-40 lines.  :-) I agree though that better documentation
n the source code about _how_ alloc_pidmap was supposed to avoid all
possible races would have probably been a good idea.

> [ Or Ted's version: as mentioned, I don't think the complexity is actually 
>   in the final cmpxchg loop itself, but in the bigger picture, so I don't 
>   think the differences between Ted's and Salman's versions are that big ]

Yah, I had been staring at the code for a while, so I had the feeling
that my intuition of which patch would be clearer was probably biased.

We do need to deal with pid wrap possibility just to be completely
correct, although the chance of hitting _that_ are pretty remote.

	 	      	     		- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 15:50     ` tytso
@ 2010-06-09 16:06       ` Linus Torvalds
  2010-06-09 17:10         ` tytso
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 16:06 UTC (permalink / raw)
  To: tytso
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm, linux-kernel,
	tytso, Thomas Gleixner



On Wed, 9 Jun 2010, tytso@mit.edu wrote:
> 
> Salman had created a very nice ASCII art diagram of the race in the
> mail thread with the internal bug reporter who noticed the problem.
> We could include that, if you don't mind the commit description
> growing by 30-40 lines.  :-)

I don't horribly mind, but I also don't think it's necessarily all that 
useful to go that far. For the commit message, there are two real reasons:

 - explaining the patch itself upstream to make people like me understand 
   _why_ it needs to be applied.

   But then a denser explanation than a 30-40 line ASCII art diagram would 
   probably suffice.

 - explaning the code after-the-fact to people who end up seeing it in 
   log/blame output

   And then we don't care so much about the old bug per se, as about how 
   the code is supposed to work - so a lot of verbiage about what used to 
   happen is much less important than describing just what's going on with 
   the cmpxchg (where the "loop" is all the way from the original value 
   load, especially in this case where we don't have to loop all the way 
   back).

In fact, conceptually the cmpxchg loop is pretty much the whole function, 
it's just that if we know that any racing elements will be idempotent wrt 
anything but the final assignment of 'last_pid'. So we can end up just 
looping over the final part. I think _that_ is the clever part, and I 
think that is the part that needs explanation.

(And that is also why the diff itself doesn't include the "early part of 
the loop", and why I couldn't understand it purely as a patch - because it 
in this case the patch is "artificially small" because of how we don't 
need to loop all the way up)

> We do need to deal with pid wrap possibility just to be completely
> correct, although the chance of hitting _that_ are pretty remote.

That seems to be purely an artifact of the cleverness of avoiding re-doing 
the whole loop, no? If we _had_ re-done the whole loop (the "normal" 
cmpxchg model), we would have re-started the whole pid search and handled 
the overflow in the existing overflow handling code.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 16:06       ` Linus Torvalds
@ 2010-06-09 17:10         ` tytso
  2010-06-09 17:23           ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-09 17:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm, linux-kernel,
	tytso, Thomas Gleixner

On Wed, Jun 09, 2010 at 09:06:37AM -0700, Linus Torvalds wrote:
> 
> That seems to be purely an artifact of the cleverness of avoiding re-doing 
> the whole loop, no? If we _had_ re-done the whole loop (the "normal" 
> cmpxchg model), we would have re-started the whole pid search and handled 
> the overflow in the existing overflow handling code.

This brings up a question.  If we're going to use a cmpxchg() loop, is
there any point to doing the test-and-set game with the bitmap?  It
should be just as fast to just use cmpxchg as it is to do the
test-and-set, and isn't it more likely that various CPU architectures
will have the cmpxchg than test-and-set-bit?

We might be able to radically simplify alloc_pidmap().  Would that
cause a huge problem on some non-Intel architectures, though?

					- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 17:10         ` tytso
@ 2010-06-09 17:23           ` Linus Torvalds
  2010-06-09 17:25             ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 17:23 UTC (permalink / raw)
  To: tytso
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm,
	Linux Kernel Mailing List, tytso, Thomas Gleixner



On Wed, 9 Jun 2010, tytso@mit.edu wrote:
> 
> This brings up a question.  If we're going to use a cmpxchg() loop, is
> there any point to doing the test-and-set game with the bitmap?

I think there very much is.

Otherwise you have three threads, two of which pick the same pid (because 
the test-and-set isn't atomic), and a third of which picks a new one. The 
cmpxchg succeeds (the third one wins, and everybody picks that winner), 
but you only expanded the map by two entries, and you're going to return 
the same pid nr to two people.

So the cmpxchg only protects "last_pid". It does _not_ in any way protect 
the pid we're actually going to return.

Or am I missing something?

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 17:23           ` Linus Torvalds
@ 2010-06-09 17:25             ` Linus Torvalds
  2010-06-09 17:34               ` tytso
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 17:25 UTC (permalink / raw)
  To: tytso
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm,
	Linux Kernel Mailing List, tytso, Thomas Gleixner



On Wed, 9 Jun 2010, Linus Torvalds wrote:
>
> Otherwise you have three threads, two of which pick the same pid (because 
> the test-and-set isn't atomic), and a third of which picks a new one.

In fact, I don't think you need three threads at all. It's perfectly ok to 
just have two threads, and they'd both end up picking the same 'pid' 
without the atomicity guarantees of that 'test_and_set()' bitmap access.

And they'd both be perfectly fine setting last_pid to that (shared) pid if 
I read that cmpxchg loop right. No?

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 17:25             ` Linus Torvalds
@ 2010-06-09 17:34               ` tytso
  2010-06-09 17:43                 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-09 17:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm,
	Linux Kernel Mailing List, tytso, Thomas Gleixner

On Wed, Jun 09, 2010 at 10:25:50AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 9 Jun 2010, Linus Torvalds wrote:
> >
> > Otherwise you have three threads, two of which pick the same pid (because 
> > the test-and-set isn't atomic), and a third of which picks a new one.
> 
> In fact, I don't think you need three threads at all. It's perfectly ok to 
> just have two threads, and they'd both end up picking the same 'pid' 
> without the atomicity guarantees of that 'test_and_set()' bitmap access.
> 
> And they'd both be perfectly fine setting last_pid to that (shared) pid if 
> I read that cmpxchg loop right. No?

Well, I was thinking about something like this:

	while (1) {
		last = pid_ns->last_pid;
		pid = last + 1;
		if (pid >= pid_max)
			pid = RESERVED_PIDS;
		if (cmpxchg(&pid_ns->last_pid, last, pid) == last)
			return pid;
	}

Which I don't think is racy, unless I'm missing something.  Both might
end up picking the same pid, but only one will successfully set
last_pid, and the other will just loop and try again.

There appears to be some interesting uses of the bitmap by
find_ge_pid() and next_pidmap() that I haven't completely grokked yet,
especially as to why they're needed, though.  Assuming they are
needed, we might end up needing the bitmap after all, though.

					- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 17:34               ` tytso
@ 2010-06-09 17:43                 ` Linus Torvalds
  2010-06-09 17:47                   ` tytso
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 17:43 UTC (permalink / raw)
  To: tytso
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm,
	Linux Kernel Mailing List, tytso, Thomas Gleixner



On Wed, 9 Jun 2010, tytso@mit.edu wrote:
> 
> Well, I was thinking about something like this:

No, this is wrong for the case where you end up having to allocate a new 
pidmap and/or overflow max_pid.

It also doesn't set the bit at all.

> There appears to be some interesting uses of the bitmap by
> find_ge_pid() and next_pidmap() that I haven't completely grokked yet,

We need that bitmap to handle the overflow max_pid case. We are _not_ 
returning just increasing pid numbers.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 17:43                 ` Linus Torvalds
@ 2010-06-09 17:47                   ` tytso
  2010-06-09 18:09                     ` Salman Qazi
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-09 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Salman, Andrew Morton, peterz, akpm,
	Linux Kernel Mailing List, tytso, Thomas Gleixner

On Wed, Jun 09, 2010 at 10:43:33AM -0700, Linus Torvalds wrote:
> 
> We need that bitmap to handle the overflow max_pid case. We are _not_ 
> returning just increasing pid numbers.

Doh!  I knew I was forgetting something obvious.  I was hoping we
could get rid of the bitmap entirely, but I guess not....

(Unless users would stand for 64-bit pid numbers... no?  Dang. :-)

      	      	     	    	      - Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-09 17:47                   ` tytso
@ 2010-06-09 18:09                     ` Salman Qazi
  0 siblings, 0 replies; 45+ messages in thread
From: Salman Qazi @ 2010-06-09 18:09 UTC (permalink / raw)
  To: tytso, Linus Torvalds, Ingo Molnar, Salman, Andrew Morton,
	peterz, akpm, Linux Kernel Mailing List, tytso, Thomas Gleixner

On Wed, Jun 9, 2010 at 10:47 AM,  <tytso@mit.edu> wrote:
> On Wed, Jun 09, 2010 at 10:43:33AM -0700, Linus Torvalds wrote:
>>
>> We need that bitmap to handle the overflow max_pid case. We are _not_
>> returning just increasing pid numbers.
>
> Doh!  I knew I was forgetting something obvious.  I was hoping we
> could get rid of the bitmap entirely, but I guess not....
>
> (Unless users would stand for 64-bit pid numbers... no?  Dang. :-)
>
>                                      - Ted
>

(sorry about the previous message, to those who got it... my mail
client silently switched to HTML mode)

I am working on a new version of the change taking into account
comments (both about substance and style) by Michel, Ted and Linus.  I
agree with Michel in that I am not sure that the rare case of same
last_pid being set by two threads is worth fixing.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15 14:35           ` Peter Zijlstra
@ 2010-06-15 19:37             ` Andrew Morton
  0 siblings, 0 replies; 45+ messages in thread
From: Andrew Morton @ 2010-06-15 19:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tytso, Salman, mingo, linux-kernel, tytso, torvalds, walken,
	Chen Liqin, Lennox Wu

On Tue, 15 Jun 2010 16:35:52 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > I suspect sched_clock.c might be generating fair amounts of code which
> > UP builds don't need. 
> 
> Only sched_clock_remote() and its caller, something like the below, not
> much code..
> 
> UP machines can still have utterly sucky TSC, although the
> inter-cpu-drift thing isn't much of an issue ;-)
> 
> ---
>  kernel/sched_clock.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
> index 52f1a14..7ff5b56 100644
> --- a/kernel/sched_clock.c
> +++ b/kernel/sched_clock.c
> @@ -170,6 +170,7 @@ again:
>  	return clock;
>  }
>  
> +#ifdef CONFIG_SMP
>  static u64 sched_clock_remote(struct sched_clock_data *scd)
>  {
>  	struct sched_clock_data *my_scd = this_scd();
> @@ -205,6 +206,7 @@ again:
>  
>  	return val;
>  }
> +#endif
>  
>  /*
>   * Similar to cpu_clock(), but requires local IRQs to be disabled.
> @@ -226,9 +228,11 @@ u64 sched_clock_cpu(int cpu)
>  
>  	scd = cpu_sdc(cpu);
>  
> +#ifdef CONFIG_SMP
>  	if (cpu != smp_processor_id())
>  		clock = sched_clock_remote(scd);
>  	else
> +#endif
>  		clock = sched_clock_local(scd);
>  
>  	return clock;

hm, OK, I was actually looking at sched_clock_local() at the time.  Can
clocks go backwards on UP hardware?  What breaks if we do #define
sched_clock_local sched_clock?

I've mentioned this before, but sched_clock.c is really opaque - it
would be a formidable task for anyone to get in there and work on the
code if they hadn't already been working on it for years.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  1:55         ` Andrew Morton
  2010-06-15  3:26           ` Paul Mackerras
  2010-06-15 12:56           ` tytso
@ 2010-06-15 14:35           ` Peter Zijlstra
  2010-06-15 19:37             ` Andrew Morton
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2010-06-15 14:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tytso, Salman, mingo, linux-kernel, tytso, torvalds, walken,
	Chen Liqin, Lennox Wu

On Mon, 2010-06-14 at 18:55 -0700, Andrew Morton wrote:
> > kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> 
> I guess that'll flush out any stragglers.

cmpxchg64() is special, at the time i386 didn't handle the 8 byte
cmpxchg(), although we could easily make it do today.

> I suspect sched_clock.c might be generating fair amounts of code which
> UP builds don't need. 

Only sched_clock_remote() and its caller, something like the below, not
much code..

UP machines can still have utterly sucky TSC, although the
inter-cpu-drift thing isn't much of an issue ;-)

---
 kernel/sched_clock.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 52f1a14..7ff5b56 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -170,6 +170,7 @@ again:
 	return clock;
 }
 
+#ifdef CONFIG_SMP
 static u64 sched_clock_remote(struct sched_clock_data *scd)
 {
 	struct sched_clock_data *my_scd = this_scd();
@@ -205,6 +206,7 @@ again:
 
 	return val;
 }
+#endif
 
 /*
  * Similar to cpu_clock(), but requires local IRQs to be disabled.
@@ -226,9 +228,11 @@ u64 sched_clock_cpu(int cpu)
 
 	scd = cpu_sdc(cpu);
 
+#ifdef CONFIG_SMP
 	if (cpu != smp_processor_id())
 		clock = sched_clock_remote(scd);
 	else
+#endif
 		clock = sched_clock_local(scd);
 
 	return clock;


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15 12:56           ` tytso
@ 2010-06-15 13:06             ` Kyle McMartin
  0 siblings, 0 replies; 45+ messages in thread
From: Kyle McMartin @ 2010-06-15 13:06 UTC (permalink / raw)
  To: tytso, Andrew Morton, Salman, mingo, linux-kernel, peterz, tytso,
	torvalds, walken, Chen Liqin, Lennox Wu

On Tue, Jun 15, 2010 at 08:56:50AM -0400, tytso@mit.edu wrote:
> > I think you're probably right, as long as one sticks with 4-byte
> > scalars.  The cmpxchg-is-now-generic change snuck in under the radar
> > (mine, at least).
> 
> Hmmm, what about unsigned longs?  (Which might be 8 bytes on some
> architectures....)
> 

Longs are fine, since Linux only supports LP64 (and would need major work
to support anything else.)

The problem documented above is that on 32-bit, a 64-bit read is
non-atomic, so even if you use a hashed spinlock to protect a u64
variable on 32-bit, reads will be non-atomic, and so must take the same
lock in order to be safe. Hence, you need accessor functions.

This is what the generic atomic code does, perhaps we could add a new
API that gives us hooks to do proper hashed spinlocks on crap
architectures but falls back to simple assignment and real cmpxchg on
real platforms.

--Kyle


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  1:55         ` Andrew Morton
  2010-06-15  3:26           ` Paul Mackerras
@ 2010-06-15 12:56           ` tytso
  2010-06-15 13:06             ` Kyle McMartin
  2010-06-15 14:35           ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-15 12:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Salman, mingo, linux-kernel, peterz, tytso, torvalds, walken,
	Chen Liqin, Lennox Wu

On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
> 
> I think you're probably right, as long as one sticks with 4-byte
> scalars.  The cmpxchg-is-now-generic change snuck in under the radar
> (mine, at least).

Hmmm, what about unsigned longs?  (Which might be 8 bytes on some
architectures....)

					- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  4:21             ` Andrew Morton
  2010-06-15  4:38               ` Eric Dumazet
  2010-06-15  6:57               ` Benjamin Herrenschmidt
@ 2010-06-15  7:25               ` Paul Mackerras
  2 siblings, 0 replies; 45+ messages in thread
From: Paul Mackerras @ 2010-06-15  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tytso, Salman, mingo, linux-kernel, peterz, tytso, torvalds,
	walken, Chen Liqin, Lennox Wu

On Mon, Jun 14, 2010 at 09:21:50PM -0700, Andrew Morton wrote:

> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64().  Unless for some reason it's simply
> unimplementable?  Worst case I guess one could use a global spinlock. 
> Second-worst-case: hashed spinlocks.

Spinlocks won't help as long as you can use cmpxchg64 on any old u64
that is also accessed directly.  UP can disable interrupts, of course,
but for SMP we would have to restrict cmpxchg64 to some type
(atomic64_t, maybe) for which you have to use an accessor function to
read it, so we can take the spinlock around the reads.

I suspect it isn't worth the trouble.  Note that I'm talking
specifically about cmpxchg64 here, not cmpxchg.

Paul.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  4:21             ` Andrew Morton
  2010-06-15  4:38               ` Eric Dumazet
@ 2010-06-15  6:57               ` Benjamin Herrenschmidt
  2010-06-15  7:25               ` Paul Mackerras
  2 siblings, 0 replies; 45+ messages in thread
From: Benjamin Herrenschmidt @ 2010-06-15  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, tytso, Salman, mingo, linux-kernel, peterz,
	tytso, torvalds, walken, Chen Liqin, Lennox Wu

On Mon, 2010-06-14 at 21:21 -0700, Andrew Morton wrote:
> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64().  Unless for some reason it's simply
> unimplementable?  Worst case I guess one could use a global spinlock. 
> Second-worst-case: hashed spinlocks. 

Right, ppc32 at least can't so that would be spinlocks...

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  4:21             ` Andrew Morton
@ 2010-06-15  4:38               ` Eric Dumazet
  2010-06-15  6:57               ` Benjamin Herrenschmidt
  2010-06-15  7:25               ` Paul Mackerras
  2 siblings, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2010-06-15  4:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, tytso, Salman, mingo, linux-kernel, peterz,
	tytso, torvalds, walken, Chen Liqin, Lennox Wu

Le lundi 14 juin 2010 à 21:21 -0700, Andrew Morton a écrit :
> On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <paulus@samba.org> wrote:
> 
> > On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
> > 
> > > > kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> > > 
> > > I guess that'll flush out any stragglers.
> > 
> > And break most non-x86 32-bit architectures, including 32-bit powerpc.
> 
> If CONFIG_SMP=y, yes.  On UP there's a generic implementation
> (include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h)
> 
> > Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> > is set, and it looks like only x86 and ia64 set it.
> > 
> 
> If that happens then the best fix is for those architectures to get
> themselves a cmpxchg64().  Unless for some reason it's simply
> unimplementable?  Worst case I guess one could use a global spinlock. 
> Second-worst-case: hashed spinlocks.

Hmm, this reminds me a patch I had somewhere, not yet sent to David :)

1) cmpxchg() was not available for all arches, maybe
atomic_long_cmpxchg() is ?

2) I am not sure atomic_long_t quarantee that all 32 or 64 bits of the
underlying long container are available.

Thanks !

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a291edb..335ca89 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1268,17 +1268,18 @@ skip_hashing:
 
 void rt_bind_peer(struct rtable *rt, int create)
 {
-	static DEFINE_SPINLOCK(rt_peer_lock);
 	struct inet_peer *peer;
 
 	peer = inet_getpeer(rt->rt_dst, create);
 
-	spin_lock_bh(&rt_peer_lock);
-	if (rt->peer == NULL) {
-		rt->peer = peer;
+#if defined(__HAVE_ARCH_CMPXCHG)
+	if (!cmpxchg(&rt->peer, NULL, peer))
 		peer = NULL;
-	}
-	spin_unlock_bh(&rt_peer_lock);
+#else
+	BUILD_BUG_ON(sizeof(rt->peer) != sizeof(atomic_long_t));
+	if (!atomic_long_cmpxchg((atomic_long_t *)&rt->peer, 0, (long)peer))
+		peer = NULL;
+#endif
 	if (peer)
 		inet_putpeer(peer);
 }



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  3:26           ` Paul Mackerras
@ 2010-06-15  4:21             ` Andrew Morton
  2010-06-15  4:38               ` Eric Dumazet
                                 ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Andrew Morton @ 2010-06-15  4:21 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: tytso, Salman, mingo, linux-kernel, peterz, tytso, torvalds,
	walken, Chen Liqin, Lennox Wu

On Tue, 15 Jun 2010 13:26:08 +1000 Paul Mackerras <paulus@samba.org> wrote:

> On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:
> 
> > > kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> > 
> > I guess that'll flush out any stragglers.
> 
> And break most non-x86 32-bit architectures, including 32-bit powerpc.

If CONFIG_SMP=y, yes.  On UP there's a generic implementation
(include/asm-generic/cmpxchg-local.h, include/asm-generic/cmpxchg.h)

> Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> is set, and it looks like only x86 and ia64 set it.
> 

If that happens then the best fix is for those architectures to get
themselves a cmpxchg64().  Unless for some reason it's simply
unimplementable?  Worst case I guess one could use a global spinlock. 
Second-worst-case: hashed spinlocks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  1:55         ` Andrew Morton
@ 2010-06-15  3:26           ` Paul Mackerras
  2010-06-15  4:21             ` Andrew Morton
  2010-06-15 12:56           ` tytso
  2010-06-15 14:35           ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Paul Mackerras @ 2010-06-15  3:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tytso, Salman, mingo, linux-kernel, peterz, tytso, torvalds,
	walken, Chen Liqin, Lennox Wu

On Mon, Jun 14, 2010 at 06:55:56PM -0700, Andrew Morton wrote:

> > kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
> 
> I guess that'll flush out any stragglers.

And break most non-x86 32-bit architectures, including 32-bit powerpc.
Fortunately that code is only used if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
is set, and it looks like only x86 and ia64 set it.

Paul.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-15  0:56       ` tytso
@ 2010-06-15  1:55         ` Andrew Morton
  2010-06-15  3:26           ` Paul Mackerras
                             ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Andrew Morton @ 2010-06-15  1:55 UTC (permalink / raw)
  To: tytso
  Cc: Salman, mingo, linux-kernel, peterz, tytso, torvalds, walken,
	Chen Liqin, Lennox Wu

On Mon, 14 Jun 2010 20:56:19 -0400 tytso@mit.edu wrote:

> On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> > Using
> > 
> > 	grep -r '[ 	]cmpxchg[^_]' . | grep -v /arch/
> > 
> > I can't see any cmpxchg() callers in truly generic code.  lockdep and
> > kernel/trace/ring_buffer.c aren't used on the more remote
> > architectures, I think.
> 
> What about:
> 
> drivers/gpu/drm/drm_lock.c:             prev = cmpxchg(lock, old, new);
> kernel/lockdep.c:               n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de

I put these in the not-used-on-weird-architectures bucket.

> kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc

I guess that'll flush out any stragglers.

I suspect sched_clock.c might be generating fair amounts of code which
UP builds don't need.

> fs/btrfs/inode.c:       if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
> fs/ext4/inode.c:        } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl
> 
> The last is quite new --- I had just recently done a similar set of
> research as you did before accepting the patch that added cmpxchg into
> ext4 (during the last merge window), and I thought cmpxchg() had
> entered the "supported by all architectures" category.  It looked like
> it had only recently reached state, but I had reached the conclusion
> that it was safe to use.

I think you're probably right, as long as one sticks with 4-byte
scalars.  The cmpxchg-is-now-generic change snuck in under the radar
(mine, at least).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-14 23:58     ` Andrew Morton
@ 2010-06-15  0:56       ` tytso
  2010-06-15  1:55         ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-15  0:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Salman, mingo, linux-kernel, peterz, tytso, torvalds, walken,
	Chen Liqin, Lennox Wu

On Mon, Jun 14, 2010 at 04:58:51PM -0700, Andrew Morton wrote:
> Using
> 
> 	grep -r '[ 	]cmpxchg[^_]' . | grep -v /arch/
> 
> I can't see any cmpxchg() callers in truly generic code.  lockdep and
> kernel/trace/ring_buffer.c aren't used on the more remote
> architectures, I think.

What about:

drivers/gpu/drm/drm_lock.c:             prev = cmpxchg(lock, old, new);
kernel/lockdep.c:               n = cmpxchg(&nr_chain_hlocks, cn, cn + chain->de
kernel/sched_clock.c:   if (cmpxchg64(&scd->clock, old_clock, clock) != old_cloc
fs/btrfs/inode.c:       if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEAN
fs/ext4/inode.c:        } while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl

The last is quite new --- I had just recently done a similar set of
research as you did before accepting the patch that added cmpxchg into
ext4 (during the last merge window), and I thought cmpxchg() had
entered the "supported by all architectures" category.  It looked like
it had only recently reached state, but I had reached the conclusion
that it was safe to use.

						- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-11 22:49   ` Salman
  2010-06-11 23:07     ` Linus Torvalds
@ 2010-06-14 23:58     ` Andrew Morton
  2010-06-15  0:56       ` tytso
  1 sibling, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2010-06-14 23:58 UTC (permalink / raw)
  To: Salman
  Cc: mingo, linux-kernel, peterz, tytso, torvalds, walken, Chen Liqin,
	Lennox Wu

On Fri, 11 Jun 2010 15:49:54 -0700
Salman <sqazi@google.com> wrote:

> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program.  This is really bad for bash implementation.  Furthermore,
> many shell scripts assume that pid numbers will not be used for some length
> of time.
> 
> Race Description:
>
> ...
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index e9fd8c1..fbbd5f6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
>  	atomic_inc(&map->nr_free);
>  }
>  
> +/*
> + * If we started walking pids at 'base', is 'a' seen before 'b'?
> + */
> +static int pid_before(int base, int a, int b)
> +{
> +	/*
> +	 * This is the same as saying
> +	 *
> +	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> +	 * and that mapping orders 'a' and 'b' with respect to 'base'.
> +	 */
> +	return (unsigned)(a - base) < (unsigned)(b - base);
> +}

pid.c uses an exotic mix of `int' and `pid_t' to represent pids.  `int'
seems to preponderate.

> +/*
> + * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We want the winner to have the "later" value, because if the
> + * "earlier" value prevails, then a pid may get reused immediately.
> + *
> + * Since pids rollover, it is not sufficient to just pick the bigger
> + * value.  We have to consider where we started counting from.
> + *
> + * 'base' is the value of pid_ns->last_pid that we observed when
> + * we started looking for a pid.
> + *
> + * 'pid' is the pid that we eventually found.
> + */
> +static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> +{
> +	int prev;
> +	int last_write = base;
> +	do {
> +		prev = last_write;
> +		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
> +	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
> +}

<gets distracted>

hm.  For a long time cmpxchg() wasn't available on all architectures. 
That _seems_ to have been fixed.

arch/score assumes that cmpxchg() operates on unsigned longs.

arch/powerpc plays the necessary games to make 4- and 8-byte scalars work.

ia64 handles 1, 2, 4 and 8-byte quantities.

arm handles 1, 2 and 4-byte scalars.

as does blackfin.

So from the few architectures I looked at, it seems that we do indeed
handle cmpxchg() on all architectures although not very consistently. 
arch/score will blow up if someone tries to use cmpxchg() on 1- or
2-byte scalars.

<looks at the consumers>

infiniband deos cmpxchg() on u64*'s, which will blow up on many
architectures.

Using

	grep -r '[ 	]cmpxchg[^_]' . | grep -v /arch/

I can't see any cmpxchg() callers in truly generic code.  lockdep and
kernel/trace/ring_buffer.c aren't used on the more remote
architectures, I think.

Traditionally, atomic_cmpxchg() was the safe and portable one to use.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-11 22:49   ` Salman
@ 2010-06-11 23:07     ` Linus Torvalds
  2010-06-14 23:58     ` Andrew Morton
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2010-06-11 23:07 UTC (permalink / raw)
  To: Salman; +Cc: akpm, mingo, linux-kernel, peterz, tytso, walken

Yup. Much prettier. Thanks.

Ingo, Andrew, you can now have a cage-match on who actually takes it.

                      Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-11 17:44 ` Linus Torvalds
@ 2010-06-11 22:49   ` Salman
  2010-06-11 23:07     ` Linus Torvalds
  2010-06-14 23:58     ` Andrew Morton
  0 siblings, 2 replies; 45+ messages in thread
From: Salman @ 2010-06-11 22:49 UTC (permalink / raw)
  To: akpm, mingo, linux-kernel; +Cc: peterz, tytso, torvalds, walken

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore,
many shell scripts assume that pid numbers will not be used for some length
of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..fbbd5f6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,43 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ */
+static int pid_before(int base, int a, int b)
+{
+	/*
+	 * This is the same as saying
+	 *
+	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+	 * and that mapping orders 'a' and 'b' with respect to 'base'.
+	 */
+	return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
+/*
+ * We might be racing with someone else trying to set pid_ns->last_pid.
+ * We want the winner to have the "later" value, because if the
+ * "earlier" value prevails, then a pid may get reused immediately.
+ *
+ * Since pids rollover, it is not sufficient to just pick the bigger
+ * value.  We have to consider where we started counting from.
+ *
+ * 'base' is the value of pid_ns->last_pid that we observed when
+ * we started looking for a pid.
+ *
+ * 'pid' is the pid that we eventually found.
+ */
+static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
+{
+	int prev;
+	int last_write = base;
+	do {
+		prev = last_write;
+		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
+	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -154,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					set_last_pid(pid_ns, last, pid);
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-11 17:17 Salman
@ 2010-06-11 17:44 ` Linus Torvalds
  2010-06-11 22:49   ` Salman
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-11 17:44 UTC (permalink / raw)
  To: Salman; +Cc: peterz, linux-kernel, tytso, akpm, walken, mingo

On Fri, Jun 11, 2010 at 10:17 AM, Salman <sqazi@google.com> wrote:
> Fixed the whitespace issue that Michel pointed out.
>
> Btw., who is responsible for ACKing this?

I don't know about "responsible", but I'll Ack it. Much of that code
is really old. And exactly since this is a really old issue, I think
I'll leave it unapplied for now, and let it simmer in some test-queue
(get it into next somehow?) until I get back.

Also, now that I look at that complex end-condition for the while-loop
and the big comment, I do start to think that Ingo was right, and it
would be better to make that thing a helper function of its own,
called something like "set_last_pid(pid_ns, last, pid);"

That would lessen the indentation a lot too, and with the commentary,
it would all look fairly pretty.

So Ack-with-cleanup-suggestion. And maybe Andrew can take it into -mm
while I'm gone? Or Ingo into some core-branch? You fight it out, but I
think  this is already acceptable, just perhaps still open to
improvement.

                       Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
@ 2010-06-11 17:17 Salman
  2010-06-11 17:44 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Salman @ 2010-06-11 17:17 UTC (permalink / raw)
  To: peterz, linux-kernel, tytso, akpm, walken, torvalds, mingo

Fixed the whitespace issue that Michel pointed out.

Btw., who is responsible for ACKing this?

--

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore,
many shell scripts assume that pid numbers will not be used for some length
of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   40 +++++++++++++++++++++++++++++++++++++++-
 1 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..8e9067c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,20 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ */
+static int pid_before(int base, int a, int b)
+{
+	/*
+	 * This is the same as saying
+	 *
+	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+	 * and that mapping orders 'a' and 'b' with respect to 'base'.
+	 */
+	return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -153,8 +167,32 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
+					int prev;
+					int last_write = last;
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+
+					/*
+					 * We might be racing with someone else
+					 * trying to set pid_ns->last_pid.
+					 * We want the winner to have the
+					 * "later" value, because if the
+					 * "earlier" value prevails, then
+					 * a pid may get reused immediately.
+					 *
+					 * Since pids rollover, it is not
+					 * sufficent to just pick the bigger
+					 * value.  We have to consider
+					 * where we started counting from.
+					 */
+					do {
+						prev = last_write;
+						last_write = cmpxchg(
+							&pid_ns->last_pid,
+							prev, pid);
+					} while ((prev != last_write) &&
+						 (pid_before(last, last_write,
+						  pid)));
+
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
@ 2010-06-10 21:24 Salman
  0 siblings, 0 replies; 45+ messages in thread
From: Salman @ 2010-06-10 21:24 UTC (permalink / raw)
  To: peterz, linux-kernel, tytso, akpm, walken, torvalds, mingo

[Fixed lines > 80-column]

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore,
many shell scripts assume that pid numbers will not be used for some length
of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..e8da445 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,22 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ *
+ */
+static int pid_before(int base, int a, int b)
+{
+	/*
+	 * This is the same as saying
+	 *
+	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+	 * and that mapping orders 'a' and 'b' with respect to 'base'.
+	 *
+	 */
+	return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -153,8 +169,32 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
+					int prev;
+					int last_write = last;
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+
+					/*
+					 * We might be racing with someone else
+					 * trying to set pid_ns->last_pid.
+					 * We want the winner to have the
+					 * "later" value, because if the
+					 * "earlier" value prevails, then
+					 * a pid may get reused immediately.
+					 *
+					 * Since pids rollover, it is not
+					 * sufficent to just pick the bigger
+					 * value.  We have to consider
+					 * where we started counting from.
+					 */
+					do {
+						prev = last_write;
+						last_write = cmpxchg(
+							&pid_ns->last_pid,
+							prev, pid);
+					} while ((prev != last_write) &&
+						 (pid_before(last, last_write,
+						  pid)));
+
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-10 20:38 ` tytso
@ 2010-06-10 21:04   ` Salman Qazi
  0 siblings, 0 replies; 45+ messages in thread
From: Salman Qazi @ 2010-06-10 21:04 UTC (permalink / raw)
  To: tytso, Salman, peterz, linux-kernel, tytso, akpm, walken,
	torvalds, mingo

On Thu, Jun 10, 2010 at 1:38 PM,  <tytso@mit.edu> wrote:
> On Thu, Jun 10, 2010 at 01:09:11PM -0700, Salman wrote:
>> A program that repeatedly forks and waits is susceptible to having the
>> same pid repeated, especially when it competes with another instance of the
>> same program.  This is really bad for bash implementation.  Furthermore, many shell
>> scripts assume that pid numbers will not be used for some length of time.
>
> This should probably get wrapped at column 74 or so....
>
>> +static int pid_before(int base, int a, int b)
>> +{
>> +     /*
>> +      * This is the same as saying
>> +      *
>> +      * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
>> +      * and that mapping orders 'a' and 'b' with respect to 'base'.
>> +      *
>> +      */
>> +     return (unsigned)(a - base) < (unsigned)(b - base);
>> +}
>
> Does this work though if /proc/sys/kernel/pid_max is not set to
> MAXUINT?

Yes it does.  It should work for all values of pid_max.
>
> I like the optimization, but it looks like pid_max defaults to 4096 if
> CONFIG_BASE_SMALL is set, and 32768 otherwise.
>
> Am I missing something?

Yes.

(a - base + pid_max) % pid_max < (b - base + max_pid) % pid_max iff
(a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
for all pid_max <= MAXUINT.

The values of 'a' (or 'b')  in the range [base, pid_max) gets mapped
to [0, pid_max - base) and the range [0, base) gets mapped to
[MAXUINT, MAXUINT - base).  So, the order is essentially maintained by
this mapping.

>
>                                        - Ted
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-10 20:09 Salman
@ 2010-06-10 20:38 ` tytso
  2010-06-10 21:04   ` Salman Qazi
  0 siblings, 1 reply; 45+ messages in thread
From: tytso @ 2010-06-10 20:38 UTC (permalink / raw)
  To: Salman; +Cc: peterz, linux-kernel, tytso, akpm, walken, torvalds, mingo

On Thu, Jun 10, 2010 at 01:09:11PM -0700, Salman wrote:
> A program that repeatedly forks and waits is susceptible to having the
> same pid repeated, especially when it competes with another instance of the
> same program.  This is really bad for bash implementation.  Furthermore, many shell
> scripts assume that pid numbers will not be used for some length of time.

This should probably get wrapped at column 74 or so....

> +static int pid_before(int base, int a, int b)
> +{
> +	/*
> +	 * This is the same as saying
> +	 *
> +	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> +	 * and that mapping orders 'a' and 'b' with respect to 'base'.
> +	 *
> +	 */
> +	return (unsigned)(a - base) < (unsigned)(b - base);
> +}

Does this work though if /proc/sys/kernel/pid_max is not set to
MAXUINT?

I like the optimization, but it looks like pid_max defaults to 4096 if
CONFIG_BASE_SMALL is set, and 32768 otherwise.

Am I missing something?

					- Ted

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
@ 2010-06-10 20:09 Salman
  2010-06-10 20:38 ` tytso
  0 siblings, 1 reply; 45+ messages in thread
From: Salman @ 2010-06-10 20:09 UTC (permalink / raw)
  To: peterz, linux-kernel, tytso, akpm, walken, torvalds, mingo

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore, many shell
scripts assume that pid numbers will not be used for some length of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..e8da445 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,22 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ *
+ */
+static int pid_before(int base, int a, int b)
+{
+	/*
+	 * This is the same as saying
+	 *
+	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
+	 * and that mapping orders 'a' and 'b' with respect to 'base'.
+	 *
+	 */
+	return (unsigned)(a - base) < (unsigned)(b - base);
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -153,8 +169,32 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
+					int prev;
+					int last_write = last;
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+
+					/*
+					 * We might be racing with someone else
+					 * trying to set pid_ns->last_pid.
+					 * We want the winner to have the
+					 * "later" value, because if the
+					 * "earlier" value prevails, then
+					 * a pid may get reused immediately.
+					 *
+					 * Since pids rollover, it is not
+					 * sufficent to just pick the bigger
+					 * value.  We have to consider
+					 * where we started counting from.
+					 */
+					do {
+						prev = last_write;
+						last_write = cmpxchg(
+							&pid_ns->last_pid,
+							prev, pid);
+					} while ((prev != last_write) &&
+						 (pid_before(last, last_write,
+						  pid)));
+
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-10  5:55               ` Salman Qazi
@ 2010-06-10 16:39                 ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2010-06-10 16:39 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Michel Lespinasse, Peter Zijlstra, akpm, mingo, linux-kernel, tytso

On Wed, Jun 9, 2010 at 10:55 PM, Salman Qazi <sqazi@google.com> wrote:
> On Wed, Jun 9, 2010 at 10:17 PM, Michel Lespinasse <walken@google.com> wrote:
>>
>> As shown by the three compares version, the question here does not depend on
>> the max_pid value. So, we can replace max_pid with any integer >= max_pid
>> without changing the result.

I like the logical argument, and cannot fault it.

>> This is nice because, replacing max_pid with
>> 2^32, the expression simplifies to:
>>
>> (unsigned)(a - base) < (unsigned)(b - base)
>>
>> I think I like this form after all :)

Sounds right to me.

> I don't mind doing this, if nobody disagrees.  Should be fine with a
> comment on top explaining the intention?

Yes. And lots of testing (including, very much, the whole max_pid
roll-over, of course). The logic of it all seems impeccable, but
nothing beats testing.

Thanks,
             Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
       [not found]             ` <AANLkTilXJ0X2qxD9cNTlLayKzySEZu1HEZUWu--Go8kw@mail.gmail.com>
@ 2010-06-10  5:55               ` Salman Qazi
  2010-06-10 16:39                 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Salman Qazi @ 2010-06-10  5:55 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, Peter Zijlstra, akpm, mingo, linux-kernel, tytso

On Wed, Jun 9, 2010 at 10:17 PM, Michel Lespinasse <walken@google.com> wrote:
> On Wed, Jun 9, 2010 at 5:20 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Wed, 9 Jun 2010, Salman Qazi wrote:
>> >
>> > I don't think this gives the right answer in the a < base < b  case.
>> > Here a - base < 0 and
>> > b - base > 0.  But we really want b to be before a, since a has rolled
>> > over further than b.
>>
>> Right you are.
>>
>> > I think the right solution is comparing (a - base + max_pid) % max_pid
>> > with (b - base + max_pid) % max_pid.  Am I correct or deluded? .
>>
>> That would work, but it would be horrible. Just use the three compares
>> version: doing a integer 'mod' operator is _way_ more expensive.
>
> As shown by the three compares version, the question here does not depend on
> the max_pid value. So, we can replace max_pid with any integer >= max_pid
> without changing the result. This is nice because, replacing max_pid with
> 2^32, the expression simplifies to:
>
> (unsigned)(a - base) < (unsigned)(b - base)
>
> I think I like this form after all :)

I don't mind doing this, if nobody disagrees.  Should be fine with a
comment on top explaining the intention?

>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-10  0:08         ` Salman Qazi
@ 2010-06-10  0:20           ` Linus Torvalds
       [not found]             ` <AANLkTilXJ0X2qxD9cNTlLayKzySEZu1HEZUWu--Go8kw@mail.gmail.com>
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-10  0:20 UTC (permalink / raw)
  To: Salman Qazi; +Cc: Peter Zijlstra, akpm, mingo, linux-kernel, tytso



On Wed, 9 Jun 2010, Salman Qazi wrote:
> 
> I don't think this gives the right answer in the a < base < b  case.
> Here a - base < 0 and
> b - base > 0.  But we really want b to be before a, since a has rolled
> over further than b.

Right you are.

> I think the right solution is comparing (a - base + max_pid) % max_pid 
> with (b - base + max_pid) % max_pid.  Am I correct or deluded? .

That would work, but it would be horrible. Just use the three compares 
version: doing a integer 'mod' operator is _way_ more expensive.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be  reused immediately.
  2010-06-09 22:27       ` Linus Torvalds
@ 2010-06-10  0:08         ` Salman Qazi
  2010-06-10  0:20           ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Salman Qazi @ 2010-06-10  0:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Zijlstra, akpm, mingo, linux-kernel, tytso

On Wed, Jun 9, 2010 at 3:27 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 9 Jun 2010, Linus Torvalds wrote:
>>
>> Quite possibly. I'd worry about the overflow case a bit, but it's
>> certainly going to get the right value when base << MAX_INT.
>
> Having given it a couple of seconds more thought, I don't think there is
> an overflow case either. All of a/b/base are guaranteed to be non-negative
> (or our pid code is in worse trouble anyway), so there is no overflow
> possible. So yes. Just comparing a-base < b-base should always be safe

I don't think this gives the right answer in the a < base < b  case.
Here a - base < 0 and
b - base > 0.  But we really want b to be before a, since a has rolled
over further than b.  I think
the right solution is comparing (a - base + max_pid) % max_pid with (b
- base + max_pid) % max_pid.  Am I correct or deluded?
.
>
>                        Linus
>
>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 22:20     ` Linus Torvalds
@ 2010-06-09 22:27       ` Linus Torvalds
  2010-06-10  0:08         ` Salman Qazi
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 22:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Salman, akpm, mingo, linux-kernel, tytso



On Wed, 9 Jun 2010, Linus Torvalds wrote:
> 
> Quite possibly. I'd worry about the overflow case a bit, but it's 
> certainly going to get the right value when base << MAX_INT.

Having given it a couple of seconds more thought, I don't think there is 
an overflow case either. All of a/b/base are guaranteed to be non-negative 
(or our pid code is in worse trouble anyway), so there is no overflow 
possible. So yes. Just comparing a-base < b-base should always be safe.

			Linus



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 21:33   ` Peter Zijlstra
@ 2010-06-09 22:20     ` Linus Torvalds
  2010-06-09 22:27       ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 22:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Salman, akpm, mingo, linux-kernel, tytso



On Wed, 9 Jun 2010, Peter Zijlstra wrote:
> 
> Isn't: return a - base < b - base, the natural way to express this?

Quite possibly. I'd worry about the overflow case a bit, but it's 
certainly going to get the right value when base << MAX_INT.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 21:21 ` Linus Torvalds
@ 2010-06-09 21:33   ` Peter Zijlstra
  2010-06-09 22:20     ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2010-06-09 21:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Salman, akpm, mingo, linux-kernel, tytso

On Wed, 2010-06-09 at 14:21 -0700, Linus Torvalds wrote:
> 
> On Wed, 9 Jun 2010, Salman wrote:
> > +/*
> > + * If we started walking pids at 'base', is 'a' seen before 'b'?
> > + *
> > + */
> > +static int pid_before(int base, int a, int b)
> > +{
> > +	int a_lt_b = (a < b);
> > +	int min_a_b = min(a, b);
> > +	int max_a_b = max(a, b);
> > +
> > +	if ((base <= min_a_b) || (base >= max_a_b))
> > +		return a_lt_b;
> > +
> > +	return !a_lt_b;
> > +}
> 
> Ok, so that's a very confusing expression. I'm sure it gets the right 
> value, but it's not exactly straightforward, is it?
> 
> Wouldn't it be nicer to write it out in a more straightforward way? 
> Something like
> 
> 	/* a and b in order? base must not be between them */
> 	if (a <= b)
> 		return (base <= a || base >= b);
> 	/* b < a? We reach 'a' first iff base is between them */
> 	return base >= b && base <= a;
> 
> would seem to be equivalent and easier to explain, no?
> 
> And when you write it that way, it looks like the compiler should be able 
> to trivially CSE the five comparisons down to just three (notice how the 
> "base <= a" and "base >= b" comparisons are repeated. Which I'm sure some 
> super-optimizing compiler can do from your version too, but mine seems 
> more straightforward.
> 
> But maybe I did that thing wrong, and I just confused myself. I have _not_ 
> checked the logic deeply, somebody else should definitely double-check me.

Isn't: return a - base < b - base, the natural way to express this?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
  2010-06-09 21:00 Salman
@ 2010-06-09 21:21 ` Linus Torvalds
  2010-06-09 21:33   ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2010-06-09 21:21 UTC (permalink / raw)
  To: Salman; +Cc: peterz, akpm, mingo, linux-kernel, tytso



On Wed, 9 Jun 2010, Salman wrote:
> +/*
> + * If we started walking pids at 'base', is 'a' seen before 'b'?
> + *
> + */
> +static int pid_before(int base, int a, int b)
> +{
> +	int a_lt_b = (a < b);
> +	int min_a_b = min(a, b);
> +	int max_a_b = max(a, b);
> +
> +	if ((base <= min_a_b) || (base >= max_a_b))
> +		return a_lt_b;
> +
> +	return !a_lt_b;
> +}

Ok, so that's a very confusing expression. I'm sure it gets the right 
value, but it's not exactly straightforward, is it?

Wouldn't it be nicer to write it out in a more straightforward way? 
Something like

	/* a and b in order? base must not be between them */
	if (a <= b)
		return (base <= a || base >= b);
	/* b < a? We reach 'a' first iff base is between them */
	return base >= b && base <= a;

would seem to be equivalent and easier to explain, no?

And when you write it that way, it looks like the compiler should be able 
to trivially CSE the five comparisons down to just three (notice how the 
"base <= a" and "base >= b" comparisons are repeated. Which I'm sure some 
super-optimizing compiler can do from your version too, but mine seems 
more straightforward.

But maybe I did that thing wrong, and I just confused myself. I have _not_ 
checked the logic deeply, somebody else should definitely double-check me.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH] Fix a race in pid generation that causes pids to be reused immediately.
@ 2010-06-09 21:00 Salman
  2010-06-09 21:21 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Salman @ 2010-06-09 21:00 UTC (permalink / raw)
  To: peterz, akpm, torvalds, mingo, linux-kernel; +Cc: tytso

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of the
same program.  This is really bad for bash implementation.  Furthermore, many shell
scripts assume that pid numbers will not be used for some length of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}


Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 kernel/pid.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..865a482 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,22 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+/*
+ * If we started walking pids at 'base', is 'a' seen before 'b'?
+ *
+ */
+static int pid_before(int base, int a, int b)
+{
+	int a_lt_b = (a < b);
+	int min_a_b = min(a, b);
+	int max_a_b = max(a, b);
+
+	if ((base <= min_a_b) || (base >= max_a_b))
+		return a_lt_b;
+
+	return !a_lt_b;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -153,8 +169,29 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
+					int prev;
+					int last_write = last;
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+
+					/*
+					 * We might be racing with someone else trying
+					 * to set pid_ns->last_pid.  We want the
+					 * the winner to have the "later" value,
+					 * because if the "earlier" value prevails, then
+					 * a pid may get reused immediately.
+					 *
+					 * Since pids rollover, it is not sufficent
+					 * to just pick the bigger value.  We
+					 * have to consider where we started counting
+					 * from.
+					 */
+					do {
+						prev = last_write;
+						last_write = cmpxchg(&pid_ns->last_pid,
+							       prev, pid);
+					} while ((prev != last_write) &&
+						 (pid_before(last, last_write, pid)));
+
 					return pid;
 				}
 				offset = find_next_offset(map, offset);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2010-06-15 19:37 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-09  6:24 [PATCH] Fix a race in pid generation that causes pids to be reused immediately Salman
2010-06-09  6:53 ` Andi Kleen
2010-06-09  9:48 ` Ingo Molnar
2010-06-09 15:39   ` Linus Torvalds
2010-06-09 15:50     ` tytso
2010-06-09 16:06       ` Linus Torvalds
2010-06-09 17:10         ` tytso
2010-06-09 17:23           ` Linus Torvalds
2010-06-09 17:25             ` Linus Torvalds
2010-06-09 17:34               ` tytso
2010-06-09 17:43                 ` Linus Torvalds
2010-06-09 17:47                   ` tytso
2010-06-09 18:09                     ` Salman Qazi
2010-06-09 11:49 ` Michel Lespinasse
2010-06-09 12:37   ` tytso
2010-06-09 12:17 ` tytso
2010-06-09 21:00 Salman
2010-06-09 21:21 ` Linus Torvalds
2010-06-09 21:33   ` Peter Zijlstra
2010-06-09 22:20     ` Linus Torvalds
2010-06-09 22:27       ` Linus Torvalds
2010-06-10  0:08         ` Salman Qazi
2010-06-10  0:20           ` Linus Torvalds
     [not found]             ` <AANLkTilXJ0X2qxD9cNTlLayKzySEZu1HEZUWu--Go8kw@mail.gmail.com>
2010-06-10  5:55               ` Salman Qazi
2010-06-10 16:39                 ` Linus Torvalds
2010-06-10 20:09 Salman
2010-06-10 20:38 ` tytso
2010-06-10 21:04   ` Salman Qazi
2010-06-10 21:24 Salman
2010-06-11 17:17 Salman
2010-06-11 17:44 ` Linus Torvalds
2010-06-11 22:49   ` Salman
2010-06-11 23:07     ` Linus Torvalds
2010-06-14 23:58     ` Andrew Morton
2010-06-15  0:56       ` tytso
2010-06-15  1:55         ` Andrew Morton
2010-06-15  3:26           ` Paul Mackerras
2010-06-15  4:21             ` Andrew Morton
2010-06-15  4:38               ` Eric Dumazet
2010-06-15  6:57               ` Benjamin Herrenschmidt
2010-06-15  7:25               ` Paul Mackerras
2010-06-15 12:56           ` tytso
2010-06-15 13:06             ` Kyle McMartin
2010-06-15 14:35           ` Peter Zijlstra
2010-06-15 19:37             ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).