linux-kernel.vger.kernel.org archive mirror
* [Linux] Linux PID algorithm is BRAINDEAD!
@ 2015-10-10  2:00 Dave Goel
  2015-10-10  3:20 ` yalin wang
  2015-10-10 21:58 ` Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Goel @ 2015-10-10  2:00 UTC (permalink / raw)
  To: linux-kernel, Dave Goel

Pardon the subject line!  I think the PID algo. is actually pretty
good and cheap.


I just think that a very minor tweak could make it *actually* do
what it always intended to do (that is, satisfy the PID-AIM listed below)!
No expanded PID renumbering, no incompatibility introduction, nothing; just
a relatively *minor tweak*.


*** QUICK SUMMARY:


PROBLEM:  PID gets re-used immediately as soon as it is freed.

EXAMPLE: My program with PID 1323 uses temporary files of the form
PID-<something>. It finishes in two days. By this time, Linux has cycled
PIDs back around five times, and the counter happens to be right at 1323
just as the program finishes. As soon as the PID is freed, Linux thus
re-uses PID 1323. A trash-cleaner program (I know, I know) gets
confused: 1323 exited uncleanly, the cleaner sees no 1323 running, so it
proceeds to delete the temp files. But, by that time, the kernel has started
another 1323, which happens to use those very files. The cleaner ends up
deleting files for the wrong, running program!

This is just one example. There are many more examples of race conditions.



A TINY TWEAK AS SOLUTION:

The kernel already tries to do the right thing. The only minor tweak needed
is that:

***When Linux Re-Uses A Pid, It Uses The Pid That Has Been Free For
The Longest.***

That's it!


RESULT:

All PID-related race conditions are eliminated, period. No "good enough"
hacks needed any more by utilities trying to avoid race conditions. A
freshly freed PID will never get re-used immediately any more. No more race
conditions. (The only time you will ever see an immediate re-use any more is
when your machine actually has 32767(!) or (2^22-1) processes!  And, by
that time, you have FAR bigger problems.)


----


**** DETAILS:


You don't have to google very much to see the 1000 algos and other bashisms
that exist to avoid this very race condition. For example, when you want to
read a PID list and delete temporary files based on a PID, the concern
is that Linux might have created a new process with the same PID by the
time you read the file-list.  We could argue endlessly that these bashisms
are stupid and that there are better ways. But, it seems to me that these
better ways are not foolproof either; they are themselves hacks. And, a very
simple tweak could alleviate 100% of these problems.

Now, 32768, or even 2^22, are actually very small numbers. OTOH, 2^200 is
not. In an ideal world, PIDs would just be sampled from the 2^200 space,
and we could declare that there will never be PID re-use or conflicts,
period. If there's a 32-bit limit, it could use multiple bytes and still
use a 2^200 space, etc.  But, all that would be a drastic change, which is
why we are stuck with the 2^15 space.

Is there a way to continue using the 2^15 (or 2^22) space, but more
reasonably?

I argue that the kernel already tries to satisfy a "PID-AIM" and already
tries to Do The Right Thing, but there's just a tiny thinko in its current
implementation that's easily corrected.


PID-AIM:
"No immediate re-use." The aim is to NOT immediately re-use a PID right
after it's been freed.  I argue that this aim can easily be satisfied, even
within the limited 2^15 space.


CURRENT IMPLEMENTATION:
The creators had the right idea. Linux indeed tries to satisfy the PID-AIM
condition. This is why it increments the counter even if the PID is actually
available.

But, looping happens (within a few hours for me). And, as soon as looping
happens, satisfying the PID-AIM goes out the window.

This tweak would ensure that immediate re-use never happens, period.

COMPLEXITY, ETC:

All that the entire system needs is one queue of free PIDs. Any time you
need a PID, take it from the head. Any time a PID is newly freed, push it at
the back of the queue.  That's it! The overhead seems minimal to me.

The queue is initially populated by 2-32768, of course.
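
To make the idea concrete, here is a rough user-space model of the queue
(a sketch only; the names and fixed size are made up, and real kernel code
would of course look different):

```c
/* Toy model of the proposed free-PID FIFO: allocation pops from the
 * head, and a freed PID is pushed at the tail, so the most recently
 * freed PID is always the last one to be re-used. */
#define PID_MAX 32768

static int queue[PID_MAX];
static int head, tail, count;

/* A PID was just freed: it goes to the back of the line. */
static void push_free(int pid)
{
    queue[tail] = pid;
    tail = (tail + 1) % PID_MAX;
    count++;
}

/* A new process needs a PID: take the one free for the longest. */
static int alloc_pid(void)
{
    int pid;

    if (count == 0)
        return -1;              /* table full: no PID available */
    pid = queue[head];
    head = (head + 1) % PID_MAX;
    count--;
    return pid;
}
```

In this model, a PID freed moments ago cannot come back until every other
free PID has been handed out first, which is exactly the PID-AIM above.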

In fact, we could even use a smaller queue and not even activate
queue-mode till it is actually necessary; we could use various optimizing
conditions and shortcuts, say, pushing PIDs into the queue in batches. After
all, it's not STRICTLY necessary to maintain exactly the correct order. The
ONLY aim we really want to satisfy is that the latest-freed PID NOT go near
the *head* of the queue; that's all.  Also, the queue is only even needed
in the first place once you've actually looped around.  So, tiny rescue-disk
bootups would not even need to go to queue-mode (unless they've been
running many days).


(Thanks to jd_1 and _mjg for humoring and discussing this idea when I
presented it on ##kernel.)


Dave Goel (Please CC responses.)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-10  2:00 [Linux] Linux PID algorithm is BRAINDEAD! Dave Goel
@ 2015-10-10  3:20 ` yalin wang
  2015-10-10 21:58 ` Theodore Ts'o
  1 sibling, 0 replies; 8+ messages in thread
From: yalin wang @ 2015-10-10  3:20 UTC (permalink / raw)
  To: Dave Goel; +Cc: linux-kernel


> On Oct 10, 2015, at 10:00, Dave Goel <deego3@gmail.com> wrote:
> 
> [...]
> 
> All that the entire system needs is one queue of free PIDs. Any time you
> need a PID, take it from the head. Any time a PID is newly freed, push it at
> the back of the queue.  That's it! The overhead seems minimal to me.
> 
> The queue is initially populated by 2-32768, of course.
> 
> [...]
I see there is pid_ns->last_pid; new PIDs are allocated starting just
after last_pid, which means ever-larger PID numbers are handed out until
the counter wraps. The same result as the queue method you describe,
I think.

Thanks

* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-10  2:00 [Linux] Linux PID algorithm is BRAINDEAD! Dave Goel
  2015-10-10  3:20 ` yalin wang
@ 2015-10-10 21:58 ` Theodore Ts'o
  2015-10-11  3:51   ` Dave Goel
  2015-10-11 17:44   ` Dave Goel
  1 sibling, 2 replies; 8+ messages in thread
From: Theodore Ts'o @ 2015-10-10 21:58 UTC (permalink / raw)
  To: Dave Goel; +Cc: linux-kernel

On Fri, Oct 09, 2015 at 10:00:34PM -0400, Dave Goel wrote:
> 
> All that the entire system needs is one queue of free PIDs. Any time you
> need a PID, take it from the head. Any time a PID is newly freed, push it at
> the back of the queue.  That's it! The overhead seems minimal to me.
> 
> The queue is initially populated by 2-32768, of course.

The worst-case overhead is 64k -- 2 bytes times 32k pid's.  You can
use a 64k circular buffer to store the list of pid's to use, sure.  So
the RAM utilization isn't _that_ bad, except that you need to keep one
of these for each pid namespace.  So for systems using a large number
of containers and lots of pid namespaces for isolation purposes, the
memory utilization overhead is not necessarily going to be considered
cheap.

But that's actually not the biggest problem.  The biggest problem is
that accessing this free pid queue is now a locking bottleneck ---
especially on a very large NUMA system, and especially on a system
where people are running tons of shell scripts that launch processes
all the time.  So in other words, those systems which are most likely
to suffer from pid wraparound will also be the systems that will be
punished the most by needing to take a lock each time you allocate a
new pid.

Given that there *are* people who use Linux on systems with hundreds
of CPU's, where global locks are exquisitely painful, the solution
you've outlined is not something that could be the only solution
available in the kernel.

In addition, most people don't find the workarounds to be terribly
onerous.  Using "trap" to catch signals and then having the shell
script clean up after itself (so you don't need to depend on a cleaner
program) is not terribly hard, and is considered best practice.

So adding something complex into the kernel just to work around sloppy
shell scripts doesn't seem like something that most people would
consider a great use of resources.  And if it causes significant
performance regressions on kernel scalability for large NUMA systems,
people will probably be even less interested in implementing something
just for the convenience of sloppy shell script programmers.  Telling
kernel developers that Linux's PID algorithm is braindead isn't going
to help.  :-)


So what to do instead?  I'm going to assume you are in an environment
where you have a huge number of legacy shell scripts and fixing them
all is too hard (tm).  What then?  Well, the fact that you are talking
about running some kind of task cleaner means that in all likelihood
it's running in some kind of structured environment where you know
that temp files will have a certain format.

So in that world, what I'd suggest is that you have all of the jobs be
started from a management daemon.  For the sake of argument, let's
call that management daemon a "borglet"[1].  The borglet starts each
of your jobs running on your machine, so it knows when the job exits,
and when it does, you can just have the borglet delete the job's
entire task directory.  For bonus points you could have the borglet
use container technology to control the amount of cpu, memory,
networking, and disk time used by a particular job, but if you need all
of that functionality, it might be simpler for you to grab Kubernetes and
just use that to control your jobs.  :-)

[1] http://thenewstack.io/google-lifts-the-veil-on-borg-revealing-apache-auroras-heritage/

The advantage of using Kubernetes is that when you're ready to take a
leap into cloud computing, it will be a lot less work since you will
have already structured your jobs in a way that makes this easy.  :-)

Cheers,

						- Ted


* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-10 21:58 ` Theodore Ts'o
@ 2015-10-11  3:51   ` Dave Goel
  2015-10-11 17:44   ` Dave Goel
  1 sibling, 0 replies; 8+ messages in thread
From: Dave Goel @ 2015-10-11  3:51 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Goel, linux-kernel

Hi Ted,

Thanks for responding.  Fair points, all of them.

I would like to take exception to one of them, the bottleneck part:

> The biggest problem is that accessing this free pid queue is now
> a locking bottleneck --- especially on a very large NUMA system

That was exactly what I was trying to say towards the end: the
queue idea or implementation need not be strict.  No one cares if,
instead of grabbing the very first element of
the queue, you grab, say, the third element.  The /only/ real
requirement is that newly entered elements not go near the head of the
queue.

I would argue that bottleneck or resource locking doesn't exist at all:

If you have n cpu's, you can have n queues, albeit each now smaller.
(Or, equivalently, the cpu's can divvy up the queue into a mod-n
space.)  And the populator pushes into them in a round-robin
fashion.


* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-10 21:58 ` Theodore Ts'o
  2015-10-11  3:51   ` Dave Goel
@ 2015-10-11 17:44   ` Dave Goel
  2015-10-12 15:04     ` Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Goel @ 2015-10-11 17:44 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Goel, linux-kernel

Ted,

Here's a method to achieve the same goal (no immediate pid re-use),
but without using any queues whatsoever:

All freshly available PIDs are entered into PoolA.

Every N seconds, a timer moves PoolA->PoolB, and PoolB->Free PIDs.

And, the current PID allocation algo continues its allocation just the
way it is.

Then, a PID will wait between N and 2N seconds before getting re-allocated.

N could be 5? This method seems to eliminate any queues and not
cause any additional bottlenecks or locks?
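
A rough user-space model of the two-pool idea (a sketch; the pools are
plain byte-maps here, the names are made up, and a real implementation
would differ):

```c
/* Toy model of the two-pool aging scheme: a freed PID enters pool A;
 * every N seconds a timer "tick" first drains pool B into the free
 * set, then moves pool A into pool B.  A freed PID therefore waits
 * between N and 2N seconds before it can be re-allocated. */
#define NPIDS 32768

static unsigned char pool_a[NPIDS], pool_b[NPIDS], free_set[NPIDS];

static void pid_freed(int pid)        { pool_a[pid] = 1; }

static int pid_allocatable(int pid)   { return free_set[pid]; }

/* Runs once every N seconds.  Order matters: B must drain before A
 * moves in, so nothing jumps from A straight to the free set. */
static void timer_tick(void)
{
    int i;

    for (i = 0; i < NPIDS; i++)
        if (pool_b[i]) { free_set[i] = 1; pool_b[i] = 0; }
    for (i = 0; i < NPIDS; i++)
        if (pool_a[i]) { pool_b[i] = 1; pool_a[i] = 0; }
}
```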

Dave


* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-11 17:44   ` Dave Goel
@ 2015-10-12 15:04     ` Theodore Ts'o
  2015-10-12 17:49       ` Dave Goel
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2015-10-12 15:04 UTC (permalink / raw)
  To: Dave Goel; +Cc: linux-kernel

On Sun, Oct 11, 2015 at 01:44:02PM -0400, Dave Goel wrote:
> 
> Here's a method to achieve the same goal (no immediate pid re-use),
> but without using any queues whatsoever:
> 
> All freshly available PIDs are entered into PoolA.
> 
> Every N seconds, a timer moves PoolA->PoolB, and PoolB->Free PIDs.
> 
> And, the current PID allocation algo continues its allocation just the
> way it is.
> 
> Then, a PID will wait between N and 2N seconds before getting re-allocated.
> 
> N could be 5? This method seems to eliminate any queues and not
> cause any additional bottlenecks or locks?

This won't help your temp cleaner.  Consider: the pid wraps after 4
hours; then, 10 seconds before the temp cleaner runs, a process exits,
the pid gets reused, and you still have a race between the temp
cleaner testing whether or not a process exists, and the temp cleaner
deciding whether it's OK to delete the file with the form
prefix.<pid>.

Now, there are a bunch of things you can do to avoid the race, such as
having the shell script periodically touch the file, and simply
having the temp cleaner avoid deleting a file that has a recent mtime.
But you can do that without making any changes to the kernel.  Or you
can have the bash script keep a file descriptor open on the file, and
then delete the file, and then pass around the filename
/proc/$$/fd/<fd>.  Then when the shell script dies, the file(s)
automatically go away, and you don't need the temp cleaner, or the
kernel changes, at all.
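
The open-then-delete trick can be sketched in C as well (illustrative
only; the filename is made up, and this assumes a Linux-style /proc):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the "keep an fd open, delete the name" trick: the data
 * stays reachable through /proc/self/fd/<fd> for as long as the fd is
 * open, and disappears automatically when the process exits; no temp
 * cleaner is needed.  Returns 1 if data written before unlink() is
 * still readable via /proc afterwards, 0 otherwise. */
int demo_unlinked_tempfile(void)
{
    char path[64], buf[6] = "";
    int fd2;
    ssize_t n;
    int fd = open("scratch.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return 0;
    unlink("scratch.tmp");           /* name is gone, data lives on */
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return 0;
    }
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    fd2 = open(path, O_RDONLY);      /* re-open via the /proc name */
    if (fd2 < 0) {
        close(fd);
        return 0;
    }
    n = read(fd2, buf, 5);
    close(fd2);
    close(fd);
    return n == 5 && memcmp(buf, "hello", 5) == 0;
}
```

A shell script gets the same effect by opening the file on a numbered fd,
deleting it, and passing /proc/$$/fd/<fd> around, as described above.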

I can imagine solutions such as hacks where, after a process's existence
is tested using a kill 0 signal, that pid is prevented from being
reused for N seconds.  But now you open up denial-of-service attacks
where someone tries sending a kill 0 signal to the entire pidspace.
Ultimately, the question is whether trying to solve the problem in the
kernel is worth the effort, especially when there are userspace
solutions which don't require you to have a temp cleaner scanning the
entire file system.

Cheers,

						- Ted


* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-12 15:04     ` Theodore Ts'o
@ 2015-10-12 17:49       ` Dave Goel
  2015-10-12 18:12         ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Goel @ 2015-10-12 17:49 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Goel, linux-kernel

Ted,

Thanks for your patience with me.

I thought I had this down: I thought that the only real
problem is *immediate* re-use and the race conditions it causes,
and that for all other cases, scripts can find a way to work
around things.

But truly, your example stumps me atm.

Completely understood your point that the "real" solution is to
properly use trap in all scripts.  What that also entails in practice
is working with every Debian package and program...  If A calls B,
which calls C, which then calls D, and they are all programs
written in different languages, some of which don't even allow
arguments, etc.  But, indeed, that's not the kernel's "problem."


In the meanwhile, if deep inside D I am generating files that A
needs to be aware of (again, A through D are in different
languages, with their own broken utilities), I can always add
additional information to make the names unique.  I can use random
strings to make them unique enough, or, if I need them deterministic,
I can add pid + start time (jiffies of the proc) + Linux boot time
(btime from /proc/stat) to make them unique enough.  If I
truly had a unique PID, I would never have to worry about any of this,
I guess.
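
As a sketch of that deterministic variant (a hypothetical helper; it
assumes the Linux /proc layout, where field 22 of /proc/<pid>/stat is the
process start time in clock ticks since boot), reading the start time so
it can be combined with btime and the PID into a tag that survives PID
recycling:

```c
#include <stdio.h>
#include <string.h>

/* A PID alone can be recycled, but (boot time, start time, PID)
 * together name exactly one process for the life of the system.
 * Returns the start time (field 22 of /proc/<pid>/stat) in clock
 * ticks, or -1 on error. */
long long proc_start_ticks(int pid)
{
    char path[64], line[1024];
    char *p;
    long long start = -1;
    int field = 2;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* comm (field 2) may itself contain spaces and parentheses, so
     * skip past its *last* closing ')' before counting fields. */
    p = strrchr(line, ')');
    if (!p)
        return -1;
    for (p++; *p && field < 22; p++)
        if (*p == ' ')
            field++;            /* a space precedes each next field */
    if (field == 22)
        sscanf(p, "%lld", &start);
    return start;
}
```

The unique tag would then be something like pid + "-" + start_ticks +
"-" + btime, where btime is read once from /proc/stat.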



OTOH, I guess if one has to write cleaner/other meta-scripts without
proper traps, the cleaner can simply check whether Linux's PID counter
is too close to the current PID and, if so, refrain from drastic
actions.  For the latter, I wonder if there's a way to
get the "current PID counter."

Thanks again,
Dave


* Re: [Linux] Linux PID algorithm is BRAINDEAD!
  2015-10-12 17:49       ` Dave Goel
@ 2015-10-12 18:12         ` Theodore Ts'o
  0 siblings, 0 replies; 8+ messages in thread
From: Theodore Ts'o @ 2015-10-12 18:12 UTC (permalink / raw)
  To: Dave Goel; +Cc: linux-kernel

On Mon, Oct 12, 2015 at 01:49:59PM -0400, Dave Goel wrote:
> 
> OTOH, I guess if one has to write cleaner/other meta-scripts without
> proper traps, the cleaner can simply check whether Linux's PID counter
> is too close to the current PID and, if so, refrain from drastic
> actions.  For the latter, I wonder if there's a way to
> get the "current PID counter."

How about:

	current_pid = fork();
	if (current_pid == 0)
		exit(0);
	(void) waitpid(current_pid, NULL, 0);

Translating this to perl or python shouldn't be that difficult.

						- Ted


end of thread, other threads:[~2015-10-12 18:12 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-10  2:00 [Linux] Linux PID algorithm is BRAINDEAD! Dave Goel
2015-10-10  3:20 ` yalin wang
2015-10-10 21:58 ` Theodore Ts'o
2015-10-11  3:51   ` Dave Goel
2015-10-11 17:44   ` Dave Goel
2015-10-12 15:04     ` Theodore Ts'o
2015-10-12 17:49       ` Dave Goel
2015-10-12 18:12         ` Theodore Ts'o
