linux-kernel.vger.kernel.org archive mirror
* top stack (l)users for 2.5.69
@ 2003-05-07 13:20 Jörn Engel
  2003-05-07 13:45 ` Richard B. Johnson
  0 siblings, 1 reply; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 13:20 UTC (permalink / raw)
  To: linux-kernel

41 functions for 2.5.69, 45 functions for 2.5.68, 44 for 2.5.67.
Things are improving again.

There are five more fixes remaining in -je, so it might be time for a
resend session.

P 0xc0229406 presto_get_fileid:                            sub    $0x1198,%esp
P 0xc0227bf6 presto_copy_kml_tail:                         sub    $0x1028,%esp
0xc08f1458 ide_unregister:                               sub    $0x9dc,%esp
0xc082b66b v4l_compat_translate_ioctl:                   sub    $0x8d4,%esp
0xc08b2d23 ia_ioctl:                                     sub    $0x84c,%esp
0xc0e48233 snd_emu10k1_fx8010_ioctl:                     sub    $0x830,%esp
0xc0845e86 w9966_v4l_read:                               sub    $0x828,%esp
0xc0dd895b snd_cmipci_ac3_copy:                          sub    $0x7c0,%esp
0xc0dd8f7b snd_cmipci_ac3_silence:                       sub    $0x7c0,%esp
P 0xc0a9f1a8 amd_flash_probe:                              sub    $0x72c,%esp
0xc0105650 huft_build:                                   sub    $0x59c,%esp
0xc01073d0 huft_build:                                   sub    $0x59c,%esp
0xc02e4a96 dohash:                                       sub    $0x594,%esp
0xc0108256 inflate_dynamic:                              sub    $0x554,%esp
P 0xc05d8733 ida_ioctl:                                    sub    $0x54c,%esp
0xc01064a6 inflate_dynamic:                              sub    $0x538,%esp
P 0xc0fbf8b3 device_new_if:                                sub    $0x520,%esp
0xc021ddd6 presto_ioctl:                                 sub    $0x508,%esp
0xc0e424b8 snd_emu10k1_add_controls:                     sub    $0x4dc,%esp
0xc0e6a066 snd_trident_mixer:                            sub    $0x4c0,%esp
0xc0106307 inflate_fixed:                                sub    $0x4ac,%esp
0xc01080b7 inflate_fixed:                                sub    $0x4ac,%esp
0xc0908ab1 ide_config:                                   sub    $0x4a8,%esp
0xc05bcc5c parport_config:                               sub    $0x490,%esp
0xc0c0e643 ixj_config:                                   sub    $0x484,%esp
0xc10ad9e6 sctp_hash_digest:                             sub    $0x45c,%esp
0xc104da33 gss_pipe_downcall:                            sub    $0x450,%esp
0xc03bc4c8 ciGetLeafPrefixKey:                           sub    $0x428,%esp
0xc045fae3 befs_error:                                   sub    $0x418,%esp
0xc045fb53 befs_warning:                                 sub    $0x418,%esp
0xc045fbc3 befs_debug:                                   sub    $0x418,%esp
0xc07a5c86 wv_hw_reset:                                  sub    $0x418,%esp
0xc0b4bea0 isd200_action:                                sub    $0x414,%esp
0xc1685145 root_nfs_name:                                sub    $0x414,%esp
0xc0c32172 bt3c_config:                                  sub    $0x410,%esp
0xc0c36282 btuart_config:                                sub    $0x410,%esp
0xc07642c1 hex_dump:                                     sub    $0x40c,%esp
0xc0331cf7 jffs2_rtime_compress:                         sub    $0x408,%esp
0xc0c3073f dtl1_config:                                  sub    $0x408,%esp
0xc0c34556 bluecard_config:                              sub    $0x408,%esp
0xc0331df5 jffs2_rtime_decompress:                       sub    $0x404,%esp
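
A listing in this shape can be post-processed mechanically. The following is a minimal sketch, not the tool that produced the list: it assumes each line has the form shown above (optional "P" marker, address, function name, then a `sub $0x...,%esp` instruction), and the names `LINE_RE` and `parse_stack_users` are made up for illustration. It converts the hex frame sizes to bytes and sorts them.

```python
import re

# Matches lines such as
#   "P 0xc0229406 presto_get_fileid:        sub    $0x1198,%esp"
# The leading "P" marker is optional; the sketch only records whether it
# is present and does not assume what it means.
LINE_RE = re.compile(
    r"^(?P<mark>P\s+)?(?P<addr>0x[0-9a-f]+)\s+(?P<name>\w+):\s+"
    r"sub\s+\$(?P<size>0x[0-9a-f]+),%esp\s*$"
)


def parse_stack_users(lines):
    """Return (frame_bytes, function_name, marked) tuples, largest first."""
    users = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a listing line
        frame_bytes = int(m.group("size"), 16)
        users.append((frame_bytes, m.group("name"), m.group("mark") is not None))
    return sorted(users, key=lambda u: u[0], reverse=True)
```

Read this way, the top entry (0x1198) is 4504 bytes, which is already more than a single 4k page.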

Jörn

-- 
A victorious army first wins and then seeks battle.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 13:20 top stack (l)users for 2.5.69 Jörn Engel
@ 2003-05-07 13:45 ` Richard B. Johnson
  2003-05-07 13:56   ` Jörn Engel
  2003-05-07 18:36   ` Linus Torvalds
  0 siblings, 2 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 13:45 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linux kernel

On Wed, 7 May 2003, Jörn Engel wrote:

> 41 functions for 2.5.69, 45 functions for 2.5.68, 44 for 2.5.67.
> Things are improving again.
>
> There are five more fixes remaining in -je, so it might be time for a
> resend session.
>
> P 0xc0229406 presto_get_fileid:                            sub    $0x1198,%esp
> P 0xc0227bf6 presto_copy_kml_tail:                         sub    $0x1028,%esp
> 0xc08f1458 ide_unregister:                               sub    $0x9dc,%esp
> 0xc082b66b v4l_compat_translate_ioctl:                   sub    $0x8d4,%esp
> 0xc08b2d23 ia_ioctl:                                     sub    $0x84c,%esp
> 0xc0e48233 snd_emu10k1_fx8010_ioctl:                     sub    $0x830,%esp
> 0xc0845e86 w9966_v4l_read:                               sub    $0x828,%esp
> 0xc0dd895b snd_cmipci_ac3_copy:                          sub    $0x7c0,%esp
> 0xc0dd8f7b snd_cmipci_ac3_silence:                       sub    $0x7c0,%esp
> P 0xc0a9f1a8 amd_flash_probe:                              sub    $0x72c,%esp
> 0xc0105650 huft_build:                                   sub    $0x59c,%esp
> 0xc01073d0 huft_build:                                   sub    $0x59c,%esp
> 0xc02e4a96 dohash:                                       sub    $0x594,%esp
> 0xc0108256 inflate_dynamic:                              sub    $0x554,%esp
> P 0xc05d8733 ida_ioctl:                                    sub    $0x54c,%esp
> 0xc01064a6 inflate_dynamic:                              sub    $0x538,%esp
> P 0xc0fbf8b3 device_new_if:                                sub    $0x520,%esp
> 0xc021ddd6 presto_ioctl:                                 sub    $0x508,%esp
> 0xc0e424b8 snd_emu10k1_add_controls:                     sub    $0x4dc,%esp
> 0xc0e6a066 snd_trident_mixer:                            sub    $0x4c0,%esp
> 0xc0106307 inflate_fixed:                                sub    $0x4ac,%esp
> 0xc01080b7 inflate_fixed:                                sub    $0x4ac,%esp
> 0xc0908ab1 ide_config:                                   sub    $0x4a8,%esp
> 0xc05bcc5c parport_config:                               sub    $0x490,%esp
> 0xc0c0e643 ixj_config:                                   sub    $0x484,%esp
> 0xc10ad9e6 sctp_hash_digest:                             sub    $0x45c,%esp
> 0xc104da33 gss_pipe_downcall:                            sub    $0x450,%esp
> 0xc03bc4c8 ciGetLeafPrefixKey:                           sub    $0x428,%esp
> 0xc045fae3 befs_error:                                   sub    $0x418,%esp
> 0xc045fb53 befs_warning:                                 sub    $0x418,%esp
> 0xc045fbc3 befs_debug:                                   sub    $0x418,%esp
> 0xc07a5c86 wv_hw_reset:                                  sub    $0x418,%esp
> 0xc0b4bea0 isd200_action:                                sub    $0x414,%esp
> 0xc1685145 root_nfs_name:                                sub    $0x414,%esp
> 0xc0c32172 bt3c_config:                                  sub    $0x410,%esp
> 0xc0c36282 btuart_config:                                sub    $0x410,%esp
> 0xc07642c1 hex_dump:                                     sub    $0x40c,%esp
> 0xc0331cf7 jffs2_rtime_compress:                         sub    $0x408,%esp
> 0xc0c3073f dtl1_config:                                  sub    $0x408,%esp
> 0xc0c34556 bluecard_config:                              sub    $0x408,%esp
> 0xc0331df5 jffs2_rtime_decompress:                       sub    $0x404,%esp
>
> Jörn
>

You know (I hope) that allocating stuff on the stack is not
"bad". In fact, it's the quickest way to allocate data that
will automatically go away when the function returns. One
just subtracts a value from the stack-pointer and you have
the data area. I sure hope that these temporary allocations
are not being replaced with kmalloc()/kfree(). If so, the
code is badly broken and you are eating my CPU cycles for
nothing.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: top stack (l)users for 2.5.69
  2003-05-07 13:45 ` Richard B. Johnson
@ 2003-05-07 13:56   ` Jörn Engel
  2003-05-07 14:16     ` Richard B. Johnson
  2003-05-07 14:33     ` Torsten Landschoff
  2003-05-07 18:36   ` Linus Torvalds
  1 sibling, 2 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 13:56 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Linux kernel

On Wed, 7 May 2003 09:45:13 -0400, Richard B. Johnson wrote:
> 
> You know (I hope) that allocating stuff on the stack is not
> "bad". In fact, it's the quickest way to allocate data that
> will automatically go away when the function returns. One
> just subtracts a value from the stack-pointer and you have
> the data area. I sure hope that these temporary allocations
> are not being replaced with kmalloc()/kfree(). If so, the
> code is badly broken and you are eating my CPU cycles for
> nothing.

Agreed, partially. There is the current issue of the kernel stack
being just 8k in size and no decent mechanism in place to detect a
stack overflow. And there is (arguably) the future issue of the kernel
stack shrinking to 4k.

Stuff like intermezzo will break with 4k, no discussion about that.
Other stuff may or may not work. What I'm trying to do is pave the way
to shrink the kernel stack during 2.7 sometime.

If there is large agreement that the kernel stack should not shrink,
I'll stop this effort any day. But so far, I am under the impression
that the agreement is to do the shrink. Am I wrong?

Jörn

-- 
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt


* Re: top stack (l)users for 2.5.69
  2003-05-07 13:56   ` Jörn Engel
@ 2003-05-07 14:16     ` Richard B. Johnson
  2003-05-07 17:13       ` Jonathan Lundell
  2003-05-07 14:33     ` Torsten Landschoff
  1 sibling, 1 reply; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 14:16 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linux kernel

On Wed, 7 May 2003, Jörn Engel wrote:

> On Wed, 7 May 2003 09:45:13 -0400, Richard B. Johnson wrote:
> >
> > You know (I hope) that allocating stuff on the stack is not
> > "bad". In fact, it's the quickest way to allocate data that
> > will automatically go away when the function returns. One
> > just subtracts a value from the stack-pointer and you have
> > the data area. I sure hope that these temporary allocations
> > are not being replaced with kmalloc()/kfree(). If so, the
> > code is badly broken and you are eating my CPU cycles for
> > nothing.
>
> Agreed, partially. There is the current issue of the kernel stack
> being just 8k in size and no decent mechanism in place to detect a
> stack overflow. And there is (arguably) the future issue of the kernel
> stack shrinking to 4k.
>
> Stuff like intermezzo will break with 4k, no discussion about that.
> Other stuff may or may not work. What I'm trying to do is pave the way
> to shrink the kernel stack during 2.7 sometime.
>
> If there is large agreement that the kernel stack should not shrink,
> I'll stop this effort any day. But so far, I am under the impression
> that the agreement is to do the shrink. Am I wrong?
>
> Jörn

Nope. Just don't steal thousands of CPU cycles to make something
"pretty". Obviously something called recursively with a 2k buffer
on the stack is going to break. However one has to actually
look at the code and determine the best (if any) way to reduce
stack usage. For instance, some persons may just "like" 0x400 for
the size of a temporary buffer when, in fact, 29 bytes are actually
used.

FYI, one can make a module that will show the maximum amount
of stack ever used IFF the stack gets zeroed before use upon
kernel startup. Would this be useful or has it already been
done?
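
The zeroed-stack idea can be illustrated outside the kernel. This is only a user-space simulation under stated assumptions, not a kernel module: the stack is modelled as a zero-filled `bytearray` that grows downward from its highest index, and `high_water_mark` and `simulate_use` are hypothetical names. The deepest use ever seen is found by scanning up from the low end for the first non-zero byte.

```python
STACK_SIZE = 8192  # 8k kernel stack, as discussed in this thread


def high_water_mark(stack) -> int:
    """Return the deepest stack use observed, in bytes.

    stack[0] is the lowest address; the stack grows downward from the
    end of the buffer. Bytes that were never written are still zero, so
    the first non-zero byte from the bottom marks the deepest use.
    Caveat: a frame that only ever stored zeros at its deepest point is
    undercounted.
    """
    for offset, byte in enumerate(stack):
        if byte != 0:
            return len(stack) - offset
    return 0


def simulate_use(stack: bytearray, depth: int) -> None:
    """Pretend a call chain used `depth` bytes by writing non-zero data."""
    start = len(stack) - depth
    stack[start:] = b"\xaa" * depth
```

Because the old contents stay in place after a function returns, a later shallow call does not lower the reported maximum.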


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: top stack (l)users for 2.5.69
  2003-05-07 13:56   ` Jörn Engel
  2003-05-07 14:16     ` Richard B. Johnson
@ 2003-05-07 14:33     ` Torsten Landschoff
  2003-05-07 14:47       ` William Lee Irwin III
  2003-05-07 14:49       ` Richard B. Johnson
  1 sibling, 2 replies; 68+ messages in thread
From: Torsten Landschoff @ 2003-05-07 14:33 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Richard B. Johnson, Linux kernel

On Wed, May 07, 2003 at 03:56:57PM +0200, Jörn Engel wrote:
> Agreed, partially. There is the current issue of the kernel stack
> being just 8k in size and no decent mechanism in place to detect a
> stack overflow. And there is (arguably) the future issue of the kernel
> stack shrinking to 4k.

Pardon my ignorance, but why is the kernel stack shrunk to just a few
kilobytes? With 256MB of RAM in a typical desktop system it shouldn't
be a problem to use 256KB from that as the stack, but I am sure there
are good reasons to shrink it. 

Just curious, thanks for any info

	Torsten

PS: Joern, you don't by chance know my sister (kirsten@wh.fh-wedel.de)??
:-))


* Re: top stack (l)users for 2.5.69
  2003-05-07 14:33     ` Torsten Landschoff
@ 2003-05-07 14:47       ` William Lee Irwin III
  2003-05-07 15:04         ` Torsten Landschoff
                           ` (2 more replies)
  2003-05-07 14:49       ` Richard B. Johnson
  1 sibling, 3 replies; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-07 14:47 UTC (permalink / raw)
  To: Torsten Landschoff; +Cc: Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 03:56:57PM +0200, Jörn Engel wrote:
>> Agreed, partially. There is the current issue of the kernel stack
>> being just 8k in size and no decent mechanism in place to detect a
>> stack overflow. And there is (arguably) the future issue of the kernel
>> stack shrinking to 4k.

On Wed, May 07, 2003 at 04:33:15PM +0200, Torsten Landschoff wrote:
> Pardon my ignorance, but why is the kernel stack shrunk to just a few
> kilobytes? With 256MB of RAM in a typical desktop system it shouldn't
> be a problem to use 256KB from that as the stack, but I am sure there
> are good reasons to shrink it. 
> Just curious, thanks for any info

The kernel stack is (in Linux) unswappable memory that persists
throughout the lifetime of a thread. It's basically how many threads
you want to be able to cram into a system, and it matters a lot for
32-bit.


-- wli


* Re: top stack (l)users for 2.5.69
  2003-05-07 14:33     ` Torsten Landschoff
  2003-05-07 14:47       ` William Lee Irwin III
@ 2003-05-07 14:49       ` Richard B. Johnson
  1 sibling, 0 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 14:49 UTC (permalink / raw)
  To: Torsten Landschoff; +Cc: Jörn Engel, Linux kernel

On Wed, 7 May 2003, Torsten Landschoff wrote:

> On Wed, May 07, 2003 at 03:56:57PM +0200, Jörn Engel wrote:
> > Agreed, partially. There is the current issue of the kernel stack
> > being just 8k in size and no decent mechanism in place to detect a
> > stack overflow. And there is (arguably) the future issue of the kernel
> > stack shrinking to 4k.
>
> Pardon my ignorance, but why is the kernel stack shrunk to just a few
> kilobytes? With 256MB of RAM in a typical desktop system it shouldn't
> be a problem to use 256KB from that as the stack, but I am sure there
> are good reasons to shrink it.
>
> Just curious, thanks for any info
>
> 	Torsten
>
> PS: Joern, you don't by chance know my sister (kirsten@wh.fh-wedel.de)??
> :-))
>

An x86 'page' is 0x1000. For a kernel stack, this must be always
resident. The next possible size would be 0x2000, etc. If it can
be kept under 0x1000, then the remaining fixed pages can be used for
those dynamically-allocated network-packet buffers, etc.
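
The page arithmetic here is easy to check. The sketch below assumes the 0x1000 page size stated above and, as a further assumption, that a stack is handed out as a power-of-two number of pages (the usual "order" of a page allocation); `pages_for_stack` and `allocation_order` are illustrative names only.

```python
PAGE_SIZE = 0x1000  # x86 page size, as stated above


def pages_for_stack(stack_bytes: int) -> int:
    """Number of whole pages needed to hold a stack of stack_bytes."""
    return -(-stack_bytes // PAGE_SIZE)  # ceiling division


def allocation_order(stack_bytes: int) -> int:
    """Smallest n such that 2**n pages hold the stack (assumed policy)."""
    return (pages_for_stack(stack_bytes) - 1).bit_length()
```

Under these assumptions a 0x1000 stack is a single page (order 0), a 0x2000 stack needs two contiguous pages (order 1), and the 256KB stack suggested earlier would need 64 contiguous pages per thread.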

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: top stack (l)users for 2.5.69
  2003-05-07 14:47       ` William Lee Irwin III
@ 2003-05-07 15:04         ` Torsten Landschoff
  2003-05-07 16:01           ` William Lee Irwin III
  2003-05-07 15:23         ` Timothy Miller
  2003-05-07 16:49         ` Jörn Engel
  2 siblings, 1 reply; 68+ messages in thread
From: Torsten Landschoff @ 2003-05-07 15:04 UTC (permalink / raw)
  To: William Lee Irwin III, Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 07:47:36AM -0700, William Lee Irwin III wrote:
> The kernel stack is (in Linux) unswappable memory that persists
> throughout the lifetime of a thread. It's basically how many threads
> you want to be able to cram into a system, and it matters a lot for
> 32-bit.

Okay, that makes sense. BTW: Why not go a step further and have just 
one kernel stack (probably better one per CPU)?

Greetings

	Torsten


* Re: top stack (l)users for 2.5.69
  2003-05-07 14:47       ` William Lee Irwin III
  2003-05-07 15:04         ` Torsten Landschoff
@ 2003-05-07 15:23         ` Timothy Miller
  2003-05-07 15:47           ` William Lee Irwin III
  2003-05-07 16:49         ` Jörn Engel
  2 siblings, 1 reply; 68+ messages in thread
From: Timothy Miller @ 2003-05-07 15:23 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Torsten Landschoff, Jörn Engel, Linux kernel



William Lee Irwin III wrote:

> 
> The kernel stack is (in Linux) unswappable memory that persists
> throughout the lifetime of a thread. It's basically how many threads
> you want to be able to cram into a system, and it matters a lot for
> 32-bit.
> 
> 

The point that may or may not have been obvious is that more than one 
kernel stack is hanging around.  One single 8k stack versus one single 
4k stack is a trivial difference, even for most embedded systems.  But 
this becomes a huge problem when you have numerous concurrent threads 
hanging around, none of which can be swapped out.  That eats memory fast.

Or am I getting it wrong?




* Re: top stack (l)users for 2.5.69
  2003-05-07 15:23         ` Timothy Miller
@ 2003-05-07 15:47           ` William Lee Irwin III
  0 siblings, 0 replies; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-07 15:47 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Torsten Landschoff, Jörn Engel, Linux kernel

William Lee Irwin III wrote:
>> The kernel stack is (in Linux) unswappable memory that persists
>> throughout the lifetime of a thread. It's basically how many threads
>> you want to be able to cram into a system, and it matters a lot for
>> 32-bit.

On Wed, May 07, 2003 at 11:23:54AM -0400, Timothy Miller wrote:
> The point that may or may not have been obvious is that more than one 
> kernel stack is hanging around.  One single 8k stack versus one single 
> 4k stack is a trivial difference, even for most embedded systems.  But 
> this becomes a huge problem when you have numerous concurrent threads 
> hanging around, none of which can be swapped out.  That eats memory fast.
> Or am I getting it wrong?

You've got it right. Thanks for pointing that out.


-- wli


* Re: top stack (l)users for 2.5.69
  2003-05-07 15:04         ` Torsten Landschoff
@ 2003-05-07 16:01           ` William Lee Irwin III
  2003-05-08 15:36             ` Ingo Oeser
  0 siblings, 1 reply; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-07 16:01 UTC (permalink / raw)
  To: Torsten Landschoff; +Cc: Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 07:47:36AM -0700, William Lee Irwin III wrote:
>> The kernel stack is (in Linux) unswappable memory that persists
>> throughout the lifetime of a thread. It's basically how many threads
>> you want to be able to cram into a system, and it matters a lot for
>> 32-bit.

On Wed, May 07, 2003 at 05:04:29PM +0200, Torsten Landschoff wrote:
> Okay, that makes sense. BTW: Why not go a step further and have just 
> one kernel stack (probably better one per CPU)?

Generally things are stopped in the middle of function calls when
scheduled out and the register state etc. saved nowhere but as register
spills to the stack in the task model of programming (commonly used in
UNIX implementations and for most kernels really). Each userspace thread
has a "mirror image" thread inside the kernel, and basically scheduling
happens as some thread inside the kernel deciding it's time to check
whether one should schedule, and when it does, dumping what registers
it hasn't already to the kernel stack to save state, and then switching
stacks. So the stack is implicitly used to save per-thread state and
can't really be shared on a per-cpu basis in a UNIX-like design. UNIX
IIRC dealt with the resource scalability problem that a decision to pin
the memory would cause by shoving kernel stacks in the u area, which
could be swapped when under sufficient duress, and there was a whole
layer of scheduling that decided when to dump whole processes to swap
when there were too many competing for memory, and when to swap them in.

Pure per-cpu stacks would require the interrupt model of programming to
be used, which is a design decision deep enough it's debatable whether
it's feasible to do conversions to or from at all, never mind desirable.
Basically every entry point into the kernel is treated as an interrupt,
and nothing can ever sleep or be scheduled in the kernel, but rather
only register callbacks to be run when the event waited for occurs.
Scheduling only happens as a decision of which userspace task to resume
when returning from the kernel to userspace, though one could envision
a priority queue discipline for processing the registered callbacks.
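
As a rough picture of the interrupt model described above, here is a toy user-space sketch, not kernel code. It assumes a single dispatcher standing in for one per-CPU stack; `CallbackKernel`, `wait_for`, `signal` and `run` are invented names. Nothing blocks: a request registers a callback for an event and returns, and the callback runs later from the dispatcher.

```python
from collections import deque


class CallbackKernel:
    """Toy dispatcher: work never sleeps, it only registers callbacks."""

    def __init__(self):
        self.waiting = {}     # event name -> callbacks waiting for it
        self.ready = deque()  # (callback, value) pairs ready to run

    def wait_for(self, event, callback):
        """Register a callback instead of sleeping on the event."""
        self.waiting.setdefault(event, []).append(callback)

    def signal(self, event, value=None):
        """The event occurred: move its callbacks to the ready queue."""
        for callback in self.waiting.pop(event, []):
            self.ready.append((callback, value))

    def run(self):
        """Run ready callbacks to completion, one at a time, FIFO."""
        while self.ready:
            callback, value = self.ready.popleft()
            callback(value)
```

Every callback runs to completion on the dispatcher's own stack, so no per-thread stack has to be kept around while waiting. The price is that all state that must survive the wait has to be carried explicitly, which is the extra state maintenance mentioned below.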

Many of the mechanics used for async io qualify as "partial conversions"
but in truth most of the truly difficult aspects are avoided by limiting
the usage of the style to io requests and not using its style for
memory allocations or other things. It's basically Not UNIX (TM). AFAIK
only a couple of research kernels from the late 80's, QuickSilver and V
(cited by Vahalia) ever used it, though Vahalia isn't likely to give an
exhaustive list of the things so there may be others. But it probably
makes for impressive resource scalability numbers wrt. threads, at the
cost of some runtime overhead for more complex state maintenance.


-- wli


* Re: top stack (l)users for 2.5.69
  2003-05-07 17:55         ` Jörn Engel
@ 2003-05-07 16:20           ` Martin J. Bligh
  0 siblings, 0 replies; 68+ messages in thread
From: Martin J. Bligh @ 2003-05-07 16:20 UTC (permalink / raw)
  To: Jörn Engel, Jonathan Lundell; +Cc: root, Linux kernel

>> Does 2.5 use a separate interrupt stack? (Excuse my ignorance; I 
>> haven't been paying attention.) Total stack-page usage in the 2.4 
>> model, at any rate, is the sum of the task struct, the usage of any 
>> task-level thread (system calls, pretty much), any softirq (including 
>> the network protocol & routing handlers, and any netfilter modules), 
>> and some number of possibly-nested hard interrupts.
> 
> Depends on the architecture. s390 does, ppc didn't as of 2.4.2, the
> rest I'm not sure about. But this is another requirement for stack
> reduction to 4k for most platforms, if not all.

There are patches to make i386 do this (and use 4K stacks as a config option) 
from Dave Hansen and Ben LaHaise in 2.5-mjb tree. 
 
>> One thing that would help (aside from separate interrupt stacks) 
>> would be a guard page below the stack. That wouldn't require any 
>> physical memory to be reserved, and would provide positive indication 
>> of stack overflow without significant runtime overhead.
> 
> Yes, that should work. It needs some additional code in the page fault
> handler to detect this case, but that shouldn't slow the system down
> too much.

There's stack overflow detection in there as well.

M.



* Re: top stack (l)users for 2.5.69
  2003-05-07 14:47       ` William Lee Irwin III
  2003-05-07 15:04         ` Torsten Landschoff
  2003-05-07 15:23         ` Timothy Miller
@ 2003-05-07 16:49         ` Jörn Engel
  2003-05-07 17:18           ` Davide Libenzi
  2003-05-07 17:38           ` William Lee Irwin III
  2 siblings, 2 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 16:49 UTC (permalink / raw)
  To: William Lee Irwin III, Torsten Landschoff, Linux kernel

On Wed, 7 May 2003 07:47:36 -0700, William Lee Irwin III wrote:
> On Wed, May 07, 2003 at 04:33:15PM +0200, Torsten Landschoff wrote:
> > Pardon my ignorance, but why is the kernel stack shrunk to just a few
> > kilobytes? With 256MB of RAM in a typical desktop system it shouldn't
> > be a problem to use 256KB from that as the stack, but I am sure there
> > are good reasons to shrink it. 
> > Just curious, thanks for any info
> 
> The kernel stack is (in Linux) unswappable memory that persists
> throughout the lifetime of a thread. It's basically how many threads
> you want to be able to cram into a system, and it matters a lot for
> 32-bit.

It also matters if people writing applications for embedded systems
have a fetish for many threads. 1000 threads, each eating 8k memory
for pure existence (no actual work done yet), do put some memory
pressure on small machines. Yes, it would be possible to educate those
people, but changing kernel code is more fun and less work.
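
To put a number on the 1000-thread case, a back-of-the-envelope sketch (the function name `pinned_stack_bytes` is made up for illustration; only the thread count and the 8k and 4k stack sizes come from this thread):

```python
def pinned_stack_bytes(threads: int, stack_bytes: int) -> int:
    """Unswappable memory held by kernel stacks alone."""
    return threads * stack_bytes


MIB = 1024 * 1024

with_8k = pinned_stack_bytes(1000, 8 * 1024)  # 8,192,000 bytes
with_4k = pinned_stack_bytes(1000, 4 * 1024)  # 4,096,000 bytes
saved_mib = (with_8k - with_4k) / MIB         # about 3.9 MiB
```

That is roughly 7.8 MiB of pinned memory with 8k stacks versus 3.9 MiB with 4k stacks, before any thread has done actual work.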

Jörn

-- 
With a PC, I always felt limited by the software available. On Unix, 
I am limited only by my knowledge.
-- Peter J. Schoenster


* Re: top stack (l)users for 2.5.69
  2003-05-07 14:16     ` Richard B. Johnson
@ 2003-05-07 17:13       ` Jonathan Lundell
  2003-05-07 17:40         ` Richard B. Johnson
                           ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: Jonathan Lundell @ 2003-05-07 17:13 UTC (permalink / raw)
  To: root, Jörn Engel; +Cc: Linux kernel

At 10:16am -0400 5/7/03, Richard B. Johnson wrote:
>Nope. Just don't steal thousands of CPU cycles to make something
>"pretty". Obviously something called recursively with a 2k buffer
>on the stack is going to break. However one has to actually
>look at the code and determine the best (if any) way to reduce
>stack usage. For instance, some persons may just "like" 0x400 for
>the size of a temporary buffer when, in fact, 29 bytes are actually
>used.
>
>FYI, one can make a module that will show the maximum amount
>of stack ever used IFF the stack gets zeroed before use upon
>kernel startup. Would this be useful or has it already been
>done?

There's at least one patch floating around to do that; we've used it 
to help track down some stack overflow problems.

Does 2.5 use a separate interrupt stack? (Excuse my ignorance; I 
haven't been paying attention.) Total stack-page usage in the 2.4 
model, at any rate, is the sum of the task struct, the usage of any 
task-level thread (system calls, pretty much), any softirq (including 
the network protocol & routing handlers, and any netfilter modules), 
and some number of possibly-nested hard interrupts.

That adds up.
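
A sketch of that sum, to make the budget concrete. The components are the ones listed above; every number in the usage note is hypothetical and chosen only for illustration, not measured on any kernel, and `worst_case_stack` and `fits` are invented names.

```python
def worst_case_stack(task_struct, syscall_path, softirq_path, hardirq_frames):
    """Sum the components that share one stack area in the 2.4 model.

    All arguments are byte counts; hardirq_frames is a list with one
    entry per nested hard interrupt.
    """
    return task_struct + syscall_path + softirq_path + sum(hardirq_frames)


def fits(total_bytes: int, stack_bytes: int = 8192) -> bool:
    """True if the worst case stays within the stack area."""
    return total_bytes <= stack_bytes
```

With made-up inputs such as `worst_case_stack(1000, 3000, 1500, [500, 500])` the total is 6500 bytes, which fits in 8192 but not in 4096. Moving the interrupt components to a separate stack would remove the last two terms from the per-task budget.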

One thing that would help (aside from separate interrupt stacks) 
would be a guard page below the stack. That wouldn't require any 
physical memory to be reserved, and would provide positive indication 
of stack overflow without significant runtime overhead.
-- 
/Jonathan Lundell.


* Re: top stack (l)users for 2.5.69
  2003-05-07 16:49         ` Jörn Engel
@ 2003-05-07 17:18           ` Davide Libenzi
  2003-05-07 17:40             ` Jörn Engel
  2003-05-07 18:23             ` William Lee Irwin III
  2003-05-07 17:38           ` William Lee Irwin III
  1 sibling, 2 replies; 68+ messages in thread
From: Davide Libenzi @ 2003-05-07 17:18 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linux kernel


On Wed, 7 May 2003, Jörn Engel wrote:

> It also matters if people writing applications for embedded systems
> have a fetish for many threads. 1000 threads, each eating 8k memory
> for pure existence (no actual work done yet), do put some memory
> pressure on small machines. Yes, it would be possible to educate those
> people, but changing kernel code is more fun and less work.

I'm afraid I do not agree with both your sentences. Changing a *working
kernel* code is definitely not much fun and not really less work if your
target is the per-cpu kernel stack. You'll completely lose kernel
preemption and this is really bad since many paths inside the kernel are
easily preemptable. The design and the code of the kernel will become more
complex (and slow) and even people that are correctly programming it are
going to pay the price. No thanks, I'd say screw you thread maniacs ...




- Davide



* Re: top stack (l)users for 2.5.69
  2003-05-07 16:49         ` Jörn Engel
  2003-05-07 17:18           ` Davide Libenzi
@ 2003-05-07 17:38           ` William Lee Irwin III
  2003-05-07 17:47             ` Jörn Engel
  1 sibling, 1 reply; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-07 17:38 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Torsten Landschoff, Linux kernel

On Wed, May 07, 2003 at 06:49:01PM +0200, Jörn Engel wrote:
> It also matters if people writing applications for embedded systems
> have a fetish for many threads. 1000 threads, each eating 8k memory
> for pure existence (no actual work done yet), do put some memory
> pressure on small machines. Yes, it would be possible to educate those
> people, but changing kernel code is more fun and less work.

If they're embedded and UP they can probably get by on a userspace
threading library that only creates one kernel thread.

It's highly unlikely anyone will get anywhere "fixing" this in the
kernel. The closest approximations to mitigating the pinned memory
overhead with UNIX-style kernel semantics are swappable stacks a la the
u area and M:N threading, neither of which are popular notions. If
you're trying the other approach I mentioned in this thread, good luck
ever getting it done and good luck ever surviving even a single merge.

$ grep -nr schedule . | wc -l
   3773

Basically monsterpatch Hell for a resource scalability problem no one's
taking very seriously at the moment. We're already into the territory of
trivially userspace-triggerable NMI oopsing on 32x with merely linear
algorithms on 10000 threads, so the VM side isn't even worth talking
about until answers for the issues triggerable with the thread counts
the VM can currently handle appear, and even then only if there is some
real demand from apps and/or systems for it.

Just consider forkbombing infeasible; most (if not all) forkbombing apps
do so out of stupidity or outright bugs. 32-bit machines have something
to do mostly because pagetables are enormous and many process address
spaces are needed to establish per-sub-pool mappings and are largely an
entirely separate issue from general thread scalability (it adds up to
a lot of processes but usually well under 50000 and mostly vastly less).


-- wli


* Re: top stack (l)users for 2.5.69
  2003-05-07 17:13       ` Jonathan Lundell
@ 2003-05-07 17:40         ` Richard B. Johnson
  2003-05-07 18:12           ` Roland Dreier
  2003-05-08 10:29           ` David Howells
  2003-05-07 17:55         ` Jörn Engel
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 17:40 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: Jörn Engel, Linux kernel

On Wed, 7 May 2003, Jonathan Lundell wrote:

> At 10:16am -0400 5/7/03, Richard B. Johnson wrote:
> >Nope. Just don't steal thousands of CPU cycles to make something
> >"pretty". Obviously something called recursively with a 2k buffer
> >on the stack is going to break. However one has to actually
> >look at the code and determine the best (if any) way to reduce
> >stack usage. For instance, some persons may just "like" 0x400 for
> >the size of a temporary buffer when, in fact, 29 bytes are actually
> >used.
> >
> >FYI, one can make a module that will show the maximum amount
> >of stack ever used IFF the stack gets zeroed before use upon
> >kernel startup. Would this be useful or has it already been
> >done?
>
> There's at least one patch floating around to do that; we've used it
> to help track down some stack overflow problems.
>
> Does 2.5 use a separate interrupt stack? (Excuse my ignorance; I
> haven't been paying attention.) Total stack-page usage in the 2.4
> model, at any rate, is the sum of the task struct, the usage of any
> task-level thread (system calls, pretty much), any softirq (including
> the network protocol & routing handlers, and any netfilter modules),
> and some number of possibly-nested hard interrupts.
>
> That adds up.
>
> One thing that would help (aside from separate interrupt stacks)
> would be a guard page below the stack. That wouldn't require any
> physical memory to be reserved, and would provide positive indication
> of stack overflow without significant runtime overhead.
> --
> /Jonathan Lundell.
>

The kernel stack, at least for ix86, is only one, set upon startup
at 8192 bytes above a label called init_task_unit. The kernel must
have a separate stack and, contrary to what I've been reading on
this list, it can't have more kernel stacks than CPUs and, I don't
see a separate stack allocated for different CPUs. The kernel stack
must be separate from the user stack or else a user could blow up
the kernel by setting ESP to 0 and waiting for an interrupt, etc.

The kernel can only have one stack or else you can't get back to
the task that was interrupted or preempted. If the kernel was
designed like VMS, it could have multiple stacks, but this
would require that every interrupt produce a context switch.
In that case, the interrupted task will simply be rescheduled
at some later time, as in VAX/VMS. Linux, and other Unixes I've
observed, don't do this. The interrupted user's state is
saved, the kernel segments are set, the interrupt procedure
is called, then the reverse to back-out and return control to
the user that was interrupted. Yes the interrupt steals time
from the user so it might not be "fair". However, it is faster
on handling interrupts.

I don't know where the idea that the kernel had multiple stacks
came from. Recently some of the kernel has been revamped so
the kernel might be preemptible. If this is correct, it's
definitely required to have a single stack so it can be
unwound all the way back to the first preemption or else
a waiting task will lose its place in the queue and never
get scheduled (lots of activity on an Nth-level stack, while
the loser is on an (N-1)th-level stack).


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:18           ` Davide Libenzi
@ 2003-05-07 17:40             ` Jörn Engel
  2003-05-07 18:35               ` Davide Libenzi
  2003-05-07 18:23             ` William Lee Irwin III
  1 sibling, 1 reply; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 17:40 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux kernel

On Wed, 7 May 2003 10:18:29 -0700, Davide Libenzi wrote:
> On Wed, 7 May 2003, [iso-8859-1] Jörn Engel wrote:
> 
> > It also matters if people writing applications for embedded systems
> > have a fetish for many threads. 1000 threads, each eating 8k memory
> > for pure existence (no actual work done yet), do put some memory
> > pressure on small machines. Yes, it would be possible to educate those
> > people, but changing kernel code is more fun and less work.
> 
> I'm afraid I do not agree with both your sentences. Changing a *working
> kernel* code is definitely not much fun and not really less work if your
> target is the per-cpu kernel stack. You'll completely lose kernel
> preemption and this is really bad since many paths inside the kernel are
> easily preemptable. The design and the code of the kernel will become more
> complex (and slow) and even people that are correctly programming it are
> going to pay the price. No thanks, I'd say screw you thread maniacs ...

I'm not sure if I got you wrong, or vice versa. Either way, some
definitions first.
Process Stack == the traditional per-process kernel stack
Interrupt Stack == a dedicated per-CPU stack for interrupts only
CPU Stack == all kernel data on a per-CPU stack

Not for anything would I want a CPU Stack. At first thought, this is
impossible, but in reality it is just ugly beyond anything I could
bear.

An Interrupt Stack is a very good thing. I know PPC machines with 125
Interrupt lines (3 for cascading) that could theoretically all happen
at once. That alone demands a stack size well above 8k, and having
this per process is just a bad design. But that is another issue.

The real Process Stack without the interrupt overhead should not need
to be bigger than 4k. It currently is for all platforms I know about,
s390 has even 16k. This is the point of my regular allyesconfig
compilations and postings.

Do you still disagree? Then I must have misread your mail.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:38           ` William Lee Irwin III
@ 2003-05-07 17:47             ` Jörn Engel
  0 siblings, 0 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 17:47 UTC (permalink / raw)
  To: William Lee Irwin III, Torsten Landschoff, Linux kernel

On Wed, 7 May 2003 10:38:25 -0700, William Lee Irwin III wrote:
> On Wed, May 07, 2003 at 06:49:01PM +0200, Jörn Engel wrote:
> > It also matters if people writing applications for embedded systems
> > have a fetish for many threads. 1000 threads, each eating 8k memory
> > for pure existence (no actual work done yet), do put some memory
> > pressure on small machines. Yes, it would be possible to educate those
> > people, but changing kernel code is more fun and less work.
> 
> If they're embedded and UP they can probably get by on a userspace
> threading library that only creates one kernel thread.
> 
> It's highly unlikely anyone will get anywhere "fixing" this in the
> kernel. The closest approximations to mitigating the pinned memory
> overhead with UNIX-style kernel semantics are swappable stacks a la the
> u area and M:N threading, neither of which are popular notions. If
> you're trying the other approach I mentioned in this thread, good luck
> ever getting it done and good luck ever surviving even a single merge.
> 
> $ grep -nr schedule . | wc -l
>    3773

Ah, now I see where the misunderstanding comes from. My bad.
I would merely like to save NO_THREADS * 4k, not the full 8k. People
here are partially migrating from realtime OSes to Linux, and I
wouldn't think about introducing hard priorities into the scheduler
either. "This is plain impossible." is a very good argument for
education. "This only works under certain conditions." is where people
always demand more, sometimes rightfully, sometimes not.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

^ permalink raw reply	[flat|nested] 68+ messages in thread
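
The figures in the message above (1000 threads, 8k per kernel stack, a
hoped-for saving of NO_THREADS * 4k) can be worked through with a small
sketch. The function names are made up for illustration, and the numbers
in the usage note come from the thread, not from any measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Pinned kernel-stack memory, in bytes, for a given number of threads. */
size_t pinned_stack_bytes(size_t threads, size_t stack_bytes)
{
    assert(stack_bytes > 0);
    return threads * stack_bytes;
}

/* Bytes saved by shrinking every stack from old_bytes to new_bytes. */
size_t stack_savings_bytes(size_t threads, size_t old_bytes, size_t new_bytes)
{
    assert(new_bytes <= old_bytes);
    return threads * (old_bytes - new_bytes);
}
```

With 1000 threads and 8192-byte stacks, pinned_stack_bytes() gives
8192000 bytes (roughly 7.8 MiB) before any thread does work. Going from
8192 to 4096 bytes per stack saves 4096000 bytes, i.e. half of that.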

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:13       ` Jonathan Lundell
  2003-05-07 17:40         ` Richard B. Johnson
@ 2003-05-07 17:55         ` Jörn Engel
  2003-05-07 16:20           ` Martin J. Bligh
  2003-05-07 19:01         ` Dave Hansen
  2003-05-07 21:30         ` Jesse Pollard
  3 siblings, 1 reply; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 17:55 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: root, Linux kernel

On Wed, 7 May 2003 10:13:55 -0700, Jonathan Lundell wrote:
> At 10:16am -0400 5/7/03, Richard B. Johnson wrote:
> >Nope. Just don't steal thousands of CPU cycles to make something
> >"pretty". Obviously something called recursively with a 2k buffer
> >on the stack is going to break. However one has to actually
> >look at the code and determine the best (if any) way to reduce
> >stack usage. For instance, some persons may just "like" 0x400 for
> >the size of a temporary buffer when, in fact, 29 bytes are actually
> >used.
> >
> >FYI, one can make a module that will show the maximum amount
> >of stack ever used IFF the stack gets zeroed before use upon
> >kernel startup. Would this be useful or has it already been
> >done?
> 
> There's at least one patch floating around to do that; we've used it 
> to help track down some stack overflow problems.

Do you have a URL or can you post that patch? Sounds very useful for
information gathering.

> Does 2.5 use a separate interrupt stack? (Excuse my ignorance; I 
> haven't been paying attention.) Total stack-page usage in the 2.4 
> model, at any rate, is the sum of the task struct, the usage of any 
> task-level thread (system calls, pretty much), any softirq (including 
> the network protocol & routing handlers, and any netfilter modules), 
> and some number of possibly-nested hard interrupts.

Depends on the architecture. s390 does, ppc didn't as of 2.4.2, the
rest I'm not sure about. But this is another requirement for stack
reduction to 4k for most platforms, if not all.

> One thing that would help (aside from separate interrupt stacks) 
> would be a guard page below the stack. That wouldn't require any 
> physical memory to be reserved, and would provide positive indication 
> of stack overflow without significant runtime overhead.

Yes, that should work. It needs some additional code in the page fault
handler to detect this case, but that shouldn't slow the system down
too much.

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

^ permalink raw reply	[flat|nested] 68+ messages in thread
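
The "maximum stack ever used" idea quoted above can be sketched in user
space. This is not the patch Jonathan mentions (which is not shown in the
thread); it only illustrates the principle, assuming the stack area was
zero-filled before use and grows downward from the high end of the buffer.
The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many bytes at the top of a downward-growing stack have
 * ever been written, assuming the whole area started out as zeros.
 * base points at the lowest address of the stack area.
 */
size_t stack_high_water(const unsigned char *base, size_t size)
{
    size_t untouched = 0;

    assert(base != NULL);
    /* Walk up from the low end until the first byte that is not zero. */
    while (untouched < size && base[untouched] == 0)
        untouched++;
    return size - untouched;
}
```

A frame that only ever stored zeros at its deepest point would go
unnoticed, so the result is a lower bound on the real maximum depth.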

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:40         ` Richard B. Johnson
@ 2003-05-07 18:12           ` Roland Dreier
  2003-05-07 18:28             ` Richard B. Johnson
  2003-05-08 10:29           ` David Howells
  1 sibling, 1 reply; 68+ messages in thread
From: Roland Dreier @ 2003-05-07 18:12 UTC (permalink / raw)
  To: root; +Cc: Jonathan Lundell, Jörn Engel, Linux kernel

    Richard> The kernel stack, at least for ix86, is only one, set
    Richard> upon startup at 8192 bytes above a label called
    Richard> init_task_unit. The kernel must have a separate stack
    Richard> and, contrary to what I've been reading on this list, it
    Richard> can't have more kernel stacks than CPUs and, I don't see
    Richard> a separate stack allocated for different CPUs.

This is total nonsense.  Please don't confuse matters by spreading
misinformation like this.  Every task has a separate (8K) kernel
stack.  Look at the implementation of do_fork() and in particular
alloc_task_struct().

If there were only one kernel stack, what do you think would happen if
a process went to sleep in kernel code?

 - Roland

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:18           ` Davide Libenzi
  2003-05-07 17:40             ` Jörn Engel
@ 2003-05-07 18:23             ` William Lee Irwin III
  1 sibling, 0 replies; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-07 18:23 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 10:18:29AM -0700, Davide Libenzi wrote:
> I'm afraid I do not agree with both your sentences. Changing a *working
> kernel* code is definitely not much fun and not really less work if your
> target is the per-cpu kernel stack. You'll completely lose kernel
> preemption and this is really bad since many paths inside the kernel are
> easily preemptable. The design and the code of the kernel will become more
> complex (and slow) and even people that are correctly programming it are
> going to pay the price. No thanks, I'd say screw you thread maniacs ...

Yes, the interrupt model of programming more or less requires
preemption be explicit in every case. Every scheduling point would have
to be explicitly registered as a splitup of the function into the code
run before the scheduling point and a continuation for the code after.
As preempt works now most points should be clearly delimited, since it
inserts an implicit schedule() at lock droppings, and things like
cond_resched() and yield(). The points where preempt_count() == 0 and
things could be preempted by scheduling off of returning from interrupts
would be lost, though. Yes, this is probably as inefficient as it sounds
from the bit about introducing an indirect function call and queueing
operation at all those points.

I did mention something about the overhead for the general case, which
is one reason why no one will ever seriously entertain the notion.

I don't see a threat of anything like this appearing in the near future,
since the implementation effort required is probably greater than that
of reimplementing significant chunks of the kernel from scratch if not
writing an entire kernel from scratch. In fact, the model is a poor fit
for the C language and is basically just too painful to program, which
is probably more important than even the performance considerations.
But some appropriate performance-relevant bits for aio shouldn't hurt,
especially since they fall back to normal methods when not async.


-- wli

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:12           ` Roland Dreier
@ 2003-05-07 18:28             ` Richard B. Johnson
  2003-05-07 18:44               ` Timothy Miller
                                 ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 18:28 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Jonathan Lundell, Jörn Engel, Linux kernel

On Wed, 7 May 2003, Roland Dreier wrote:

>     Richard> The kernel stack, at least for ix86, is only one, set
>     Richard> upon startup at 8192 bytes above a label called
>     Richard> init_task_unit. The kernel must have a separate stack
>     Richard> and, contrary to what I've been reading on this list, it
>     Richard> can't have more kernel stacks than CPUs and, I don't see
>     Richard> a separate stack allocated for different CPUs.
>
> This is total nonsense.  Please don't confuse matters by spreading
> misinformation like this.  Every task has a separate (8K) kernel
> stack.  Look at the implementation of do_fork() and in particular
> alloc_task_struct().
>
> If there were only one kernel stack, what do you think would happen if
> a process went to sleep in kernel code?
>
>  - Roland
>

No, No. That is a process stack. Every process has its own, entirely
separate stack. This stack is used only in user mode. The kernel has
its own stack. Every time you switch to kernel mode either by
calling the kernel or by a hardware interrupt, the kernel's stack
is used.

When a task sleeps, it sleeps in kernel mode. The kernel schedules
other tasks until the sleeper has been satisfied either by time or
by event.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:40             ` Jörn Engel
@ 2003-05-07 18:35               ` Davide Libenzi
  2003-05-07 19:45                 ` Jörn Engel
  0 siblings, 1 reply; 68+ messages in thread
From: Davide Libenzi @ 2003-05-07 18:35 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linux kernel

On Wed, 7 May 2003, [iso-8859-1] Jörn Engel wrote:

> I'm not sure if I got you wrong, or vice versa. Either way, some
> definitions first.
> Process Stack == the traditional per-process kernel stack
> Interrupt Stack == a dedicated per-CPU stack for interrupts only
> CPU Stack == all kernel data on a per-CPU stack
>
> Not for anything would I want a CPU Stack. At first thought, this is
> impossible, but in reality it is just ugly beyond anything I could
> bear.
>
> An Interrupt Stack is a very good thing. I know PPC machines with 125
> Interrupt lines (3 for cascading) that could theoretically all happen
> at once. That alone demands a stack size well above 8k, and having
> this per process is just a bad design. But that is another issue.
>
> The real Process Stack without the interrupt overhead should not need
> to be bigger than 4k. It currently is for all platforms I know about,
> s390 has even 16k. This is the point of my regular allyesconfig
> compilations and postings.
>
> Do you still disagree? Then I must have misread your mail.

It was not really clear that you were talking about interrupt stacks,
which are feasible. Even so, I would not feel confident going down to
4k, looking at the post that started this thread.



- Davide


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 13:45 ` Richard B. Johnson
  2003-05-07 13:56   ` Jörn Engel
@ 2003-05-07 18:36   ` Linus Torvalds
  2003-05-07 19:17     ` Jeff Garzik
  1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2003-05-07 18:36 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.53.0305070933450.11740@chaos>,
Richard B. Johnson <root@chaos.analogic.com> wrote:
>
>You know (I hope) that allocating stuff on the stack is not
>"bad". 

Allocating stuff on the stack _is_ bad if you allocate more than a few
hundred bytes. That's _especially_ true deep down in the call-sequence,
ie in device drivers, low-level filesystems etc.

The kernel stack is a very limited resource, with no protection from
overflow. Being lazy and using automatic variables is a BAD BAD thing,
even if it's syntactically easy and generates good code.

			Linus

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:28             ` Richard B. Johnson
@ 2003-05-07 18:44               ` Timothy Miller
  2003-05-07 18:46               ` Roland Dreier
  2003-05-07 18:51               ` Davide Libenzi
  2 siblings, 0 replies; 68+ messages in thread
From: Timothy Miller @ 2003-05-07 18:44 UTC (permalink / raw)
  To: root; +Cc: Roland Dreier, Jonathan Lundell, Jörn Engel, Linux kernel



Richard B. Johnson wrote:
> On Wed, 7 May 2003, Roland Dreier wrote:
> 
> 
>>    Richard> The kernel stack, at least for ix86, is only one, set
>>    Richard> upon startup at 8192 bytes above a label called
>>    Richard> init_task_unit. The kernel must have a separate stack
>>    Richard> and, contrary to what I've been reading on this list, it
>>    Richard> can't have more kernel stacks than CPUs and, I don't see
>>    Richard> a separate stack allocated for different CPUs.
>>
>>This is total nonsense.  Please don't confuse matters by spreading
>>misinformation like this.  Every task has a separate (8K) kernel
>>stack.  Look at the implementation of do_fork() and in particular
>>alloc_task_struct().
>>
>>If there were only one kernel stack, what do you think would happen if
>>a process went to sleep in kernel code?
>>
>> - Roland
>>
> 
> 
> No, No. That is a process stack. Every process has its own, entirely
> separate stack. This stack is used only in user mode. The kernel has
> its own stack. Every time you switch to kernel mode either by
> calling the kernel or by a hardware interrupt, the kernel's stack
> is used.
> 
> When a task sleeps, it sleeps in kernel mode. The kernel schedules
> other tasks until the sleeper has been satisfied either by time or
> by event.


I don't think this is quite accurate either.  I have been reading this 
thread, and putting that together with what makes sense to me, I gather 
that Linux uses the following stacks:

1) A variable sized user-space stack, one per process (maybe more for 
user-level threads?).  This is swappable.

2) One 8K kernel stack for each process.  This is used for situations 
such as when a user process makes a system call that then needs to use a 
stack.  This has to be separate from the user stack for two reasons:
    (a) If the user process borked the user stack pointer, the kernel 
still needs to have something valid.
    (b) The stack used by the kernel cannot be swappable.

3) One single interrupt stack for hardware interrupts.  I don't know how 
various CPUs deal with this, so either the CPU knows to use this for 
hardware interrupts, or the hardware interrupt starts using the current 
process's kernel stack then realizes this and switches over.


At the moment, I'm assuming that when the kernel is preempted, the stack 
switched to is one which is the kernel stack for some other process.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:28             ` Richard B. Johnson
  2003-05-07 18:44               ` Timothy Miller
@ 2003-05-07 18:46               ` Roland Dreier
  2003-05-07 19:30                 ` Richard B. Johnson
  2003-05-07 18:51               ` Davide Libenzi
  2 siblings, 1 reply; 68+ messages in thread
From: Roland Dreier @ 2003-05-07 18:46 UTC (permalink / raw)
  To: root; +Cc: Jonathan Lundell, Jörn Engel, Linux kernel

>>>>> "Richard" == Richard B Johnson <root@chaos.analogic.com> writes:

    Richard> On Wed, 7 May 2003, Roland Dreier wrote: The kernel
    Richard> stack, at least for ix86, is only one, set upon startup
    Richard> at 8192 bytes above a label called init_task_unit. The
    Richard> kernel must have a separate stack and, contrary to what
    Richard> I've been reading on this list, it can't have more kernel
    Richard> stacks than CPUs and, I don't see a separate stack
    Richard> allocated for different CPUs.

    Roland> This is total nonsense.  Please don't confuse matters by
    Roland> spreading misinformation like this.  Every task has a
    Roland> separate (8K) kernel stack.  Look at the implementation of
    Roland> do_fork() and in particular alloc_task_struct().

    Roland> If there were only one kernel stack, what do you think
    Roland> would happen if a process went to sleep in kernel code?

    Richard> No, No. That is a process stack. Every process has its
    Richard> own, entirely separate stack. This stack is used only in
    Richard> user mode. The kernel has its own stack. Every time you
    Richard> switch to kernel mode either by calling the kernel or by
    Richard> a hardware interrupt, the kernel's stack is used.

Again, this is nonsense and misinformation.  Look at do_fork() and
alloc_task_struct().  Do you see how alloc_task_struct() is just
defined to be __get_free_pages(GFP_KERNEL,1) for i386?  Do you
understand that that just allocates two pages (8K) of kernel memory?
Do you see that it is never mapped into userspace, and that anyway
a userspace process can use far more than 8K of stack?

That 8K of memory is used for the kernel stack for a particular
process.  When a process makes a system call, that specific stack is
used as the kernel stack.

    Richard> When a task sleeps, it sleeps in kernel mode. The kernel
    Richard> schedules other tasks until the sleeper has been
    Richard> satisfied either by time or by event.

Right.  Now think about where the kernel stack for the process that is
sleeping in the kernel is kept.

 - Roland

^ permalink raw reply	[flat|nested] 68+ messages in thread
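
The per-task layout Roland describes (one 8K block per task, never mapped
into userspace) can be modelled in user space. The sketch below is not
kernel code: toy_task, toy_alloc_task() and friends are hypothetical
names, aligned_alloc() stands in for __get_free_pages(GFP_KERNEL,1), and
the masking trick is how 2.4-era i386 kernels are generally understood to
find the current task from the stack pointer, stated here as an
assumption rather than quoted from the source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u  /* two 4K pages, as in the i386 case cited above */

/* Hypothetical stand-in for the real task structure. */
struct toy_task {
    int pid;
};

/* One 8K-aligned block per task: toy_task at the bottom, stack above it. */
struct toy_task *toy_alloc_task(int pid)
{
    struct toy_task *t = aligned_alloc(THREAD_SIZE, THREAD_SIZE);

    if (t != NULL)
        t->pid = pid;
    return t;
}

/* Initial stack pointer: one past the top of the block; it grows downward. */
uintptr_t toy_stack_top(const struct toy_task *t)
{
    assert(t != NULL);
    return (uintptr_t)t + THREAD_SIZE;
}

/* Recover the owning task from any address inside its kernel stack. */
struct toy_task *toy_current(uintptr_t sp)
{
    return (struct toy_task *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}
```

Each task sleeping in the kernel keeps its call chain in its own block,
which is why a single shared kernel stack would not work: the sleeper's
frames have to stay somewhere while other tasks run. The model also shows
the cost discussed in this thread, since every task pins THREAD_SIZE bytes
whether or not it ever goes deep into the kernel.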

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:28             ` Richard B. Johnson
  2003-05-07 18:44               ` Timothy Miller
  2003-05-07 18:46               ` Roland Dreier
@ 2003-05-07 18:51               ` Davide Libenzi
  2003-05-07 19:22                 ` Richard B. Johnson
  2003-05-07 21:47                 ` Martin J. Bligh
  2 siblings, 2 replies; 68+ messages in thread
From: Davide Libenzi @ 2003-05-07 18:51 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Linux kernel

On Wed, 7 May 2003, Richard B. Johnson wrote:

> No, No. That is a process stack. Every process has its own, entirely
> separate stack. This stack is used only in user mode. The kernel has
> its own stack. Every time you switch to kernel mode either by
> calling the kernel or by a hardware interrupt, the kernel's stack
> is used.

Is it your understanding that a per-task kernel stack does not exist?



- Davide


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:13       ` Jonathan Lundell
  2003-05-07 17:40         ` Richard B. Johnson
  2003-05-07 17:55         ` Jörn Engel
@ 2003-05-07 19:01         ` Dave Hansen
  2003-05-07 20:06           ` Jörn Engel
  2003-05-07 21:30         ` Jesse Pollard
  3 siblings, 1 reply; 68+ messages in thread
From: Dave Hansen @ 2003-05-07 19:01 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: root, Jörn Engel, Linux kernel

Jonathan Lundell wrote:
> One thing that would help (aside from separate interrupt stacks) 
> would be a guard page below the stack. That wouldn't require any 
> physical memory to be reserved, and would provide positive indication 
> of stack overflow without significant runtime overhead.

x86 doesn't really have big physical shortages right now.  But, the
_virtual_ shortages are significant.  The guard page just increases the
virtual cost by 50%.

The stack overflow checking in -mjb uses gcc's mcount mechanism to
detect overflows.  It should get called on every single function call.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 68+ messages in thread
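
The 50% figure above follows from simple page arithmetic, assuming 4K
pages, a two-page (8K) kernel stack and a single guard page. A minimal
sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Extra virtual address space, in percent, taken by the guard pages. */
unsigned guard_overhead_percent(unsigned stack_pages, unsigned guard_pages)
{
    assert(stack_pages > 0);
    return 100u * guard_pages / stack_pages;
}
```

guard_overhead_percent(2, 1) yields 50, matching the number in the
message. Under the same assumptions, a 4K stack with one guard page would
double its virtual footprint (100%), so the relative cost rises as the
stack shrinks.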

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:36   ` Linus Torvalds
@ 2003-05-07 19:17     ` Jeff Garzik
  2003-05-07 20:38       ` Randy.Dunlap
  0 siblings, 1 reply; 68+ messages in thread
From: Jeff Garzik @ 2003-05-07 19:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> In article <Pine.LNX.4.53.0305070933450.11740@chaos>,
> Richard B. Johnson <root@chaos.analogic.com> wrote:
> 
>>You know (I hope) that allocating stuff on the stack is not
>>"bad". 
> 
> 
> Allocating stuff on the stack _is_ bad if you allocate more than a few
> hundred bytes. That's _especially_ true deep down in the call-sequence,
> ie in device drivers, low-level filesystems etc.
> 
> The kernel stack is a very limited resource, with no protection from
> overflow. Being lazy and using automatic variables is a BAD BAD thing,
> even if it's syntactically easy and generates good code.


Note that the problem is exacerbated if you have a bunch of disjoint 
stack scopes.  For that case, gcc will take the _sum_ of the stacks and 
not the union.  rth was kind enough to file gcc PR 9997 on this problem.

It is turning out to be a fairly common problem in the various drivers' 
ioctl handlers.  Kernel hackers (myself included) often create automatic 
variables for each case in a C switch statement.  (and now I'm having to 
go back and fix that :))

	Jeff




^ permalink raw reply	[flat|nested] 68+ messages in thread
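
The ioctl pattern Jeff describes, and one common way around it, might
look like the sketch below. The command numbers, buffer sizes and function
names are invented for illustration. The workaround shown (moving each
case's buffer into its own non-inlined helper) is only one option and
relies on the compiler honouring the noinline request; allocating the
buffers dynamically is another.

```c
#include <assert.h>
#include <string.h>

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

enum { CMD_A = 1, CMD_B = 2 };  /* hypothetical ioctl command numbers */

/*
 * Problem pattern: one large automatic buffer per case. With the gcc
 * behaviour described above, the frame may hold both buffers at once
 * (about 2K) although only one is live on any given call.
 */
int handle_all_in_one(int cmd)
{
    switch (cmd) {
    case CMD_A: {
        char buf_a[1024];
        memset(buf_a, 'a', sizeof buf_a);
        return buf_a[0];
    }
    case CMD_B: {
        char buf_b[1024];
        memset(buf_b, 'b', sizeof buf_b);
        return buf_b[0];
    }
    default:
        return -1;
    }
}

/* Workaround: each buffer lives in its own helper's frame. */
static NOINLINE int handle_a(void)
{
    char buf[1024];
    memset(buf, 'a', sizeof buf);
    return buf[0];
}

static NOINLINE int handle_b(void)
{
    char buf[1024];
    memset(buf, 'b', sizeof buf);
    return buf[0];
}

/* The dispatcher itself now needs almost no stack. */
int handle_split(int cmd)
{
    switch (cmd) {
    case CMD_A:
        return handle_a();
    case CMD_B:
        return handle_b();
    default:
        return -1;
    }
}
```

Both versions return the same values; only the stack footprint of the
dispatching function differs, and only one helper frame exists at a time
in the split version.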

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:51               ` Davide Libenzi
@ 2003-05-07 19:22                 ` Richard B. Johnson
  2003-05-07 19:31                   ` Davide Libenzi
  2003-05-07 19:39                   ` Hua Zhong
  2003-05-07 21:47                 ` Martin J. Bligh
  1 sibling, 2 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 19:22 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux kernel

On Wed, 7 May 2003, Davide Libenzi wrote:

> On Wed, 7 May 2003, Richard B. Johnson wrote:
>
> > No, No. That is a process stack. Every process has its own, entirely
> > separate stack. This stack is used only in user mode. The kernel has
> > its own stack. Every time you switch to kernel mode either by
> > calling the kernel or by a hardware interrupt, the kernel's stack
> > is used.
>
> Is it your understanding that a per-task kernel stack does not exist?
>

It is my understanding that there is one kernel stack. If there
is a stack allocated for some "transition", and I guess there
may be, because of the mail I'm getting, then it has absolutely
no purpose whatsoever and is wasted valuable non-paged RAM.

The reason why system-call parameters are passed in registers
is so that we didn't have the overhead of copying stuff from a
user stack to a kernel stack.

Does anybody know (not guess) if this was stuff added for the
new non-interrupt 0x80 syscall code? I want to know how a
simple kernel got corrupted into this twisted thing.

Anybody who has a copy of any of the Intel manuals since '386
knows that there needs to be only one kernel stack.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:46               ` Roland Dreier
@ 2003-05-07 19:30                 ` Richard B. Johnson
  2003-05-07 19:42                   ` Roland Dreier
  0 siblings, 1 reply; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 19:30 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Jonathan Lundell, Jörn Engel, Linux kernel

On Wed, 7 May 2003, Roland Dreier wrote:

> >>>>> "Richard" == Richard B Johnson <root@chaos.analogic.com> writes:
>
>     Richard> On Wed, 7 May 2003, Roland Dreier wrote: The kernel
>     Richard> stack, at least for ix86, is only one, set upon startup
>     Richard> at 8192 bytes above a label called init_task_unit. The
>     Richard> kernel must have a separate stack and, contrary to what
>     Richard> I've been reading on this list, it can't have more kernel
>     Richard> stacks than CPUs and, I don't see a separate stack
>     Richard> allocated for different CPUs.
>
>     Roland> This is total nonsense.  Please don't confuse matters by
>     Roland> spreading misinformation like this.  Every task has a
>     Roland> separate (8K) kernel stack.  Look at the implementation of
>     Roland> do_fork() and in particular alloc_task_struct().
>
>     Roland> If there were only one kernel stack, what do you think
>     Roland> would happen if a process went to sleep in kernel code?
>
>     Richard> No, No. That is a process stack. Every process has its
>     Richard> own, entirely separate stack. This stack is used only in
>     Richard> user mode. The kernel has its own stack. Every time you
>     Richard> switch to kernel mode either by calling the kernel or by
>     Richard> a hardware interrupt, the kernel's stack is used.
>
> Again, this is nonsense and misinformation.  Look at do_fork() and
> alloc_task_struct().  Do you see how alloc_task_struct() is just
> defined to be __get_free_pages(GFP_KERNEL,1) for i386?  Do you
> understand that that just allocates two pages (8K) of kernel memory?
> Do you see that it is never mapped into userspace, and that anyway
> a userspace process can use far more than 8K of stack?
>
> That 8K of memory is used for the kernel stack for a particular
> process.  When a process makes a system call, that specific stack is
> used as the kernel stack.

I haven't got a clue why and when that code got added. It is
absolutely positively wasted and is not required for kernel system
calls nor interrupts since they all must operate in kernel mode
and, therefore, use the kernel stack.

>
>     Richard> When a task sleeps, it sleeps in kernel mode. The kernel
>     Richard> schedules other tasks until the sleeper has been
>     Richard> satisfied either by time or by event.
>
> Right.  Now think about where the kernel stack for the process that is
> sleeping in the kernel is kept.
>

It's the kernel, of course. The scheduler runs in the kernel under
the kernel stack, with the kernel data. It has nothing to do with
the original user once the user sleeps. The user's context was
saved, the kernel was set up, and the kernel will schedule other
tasks until the sleep time or the sleep_on event is complete.
At that time, (or thereafter), the kernel will schedule
the previously sleeping task, its context will be restored, and
it continues execution.

The context of a task (see entry.S) is completely defined by
its registers, including the hidden part of the segments
(selectors) that define privilege.

>  - Roland

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 19:22                 ` Richard B. Johnson
@ 2003-05-07 19:31                   ` Davide Libenzi
  2003-05-07 19:39                   ` Hua Zhong
  1 sibling, 0 replies; 68+ messages in thread
From: Davide Libenzi @ 2003-05-07 19:31 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Linux kernel

On Wed, 7 May 2003, Richard B. Johnson wrote:

> It is my understanding that there is one kernel stack. If there
> is a stack allocated for some "transition", and I guess there
> may be, because of the mail I'm getting, then it has absolutely
> no purpose whatsoever and is wasted valuable non-paged RAM.
>
> The reason why system-call parameters are passed in registers
> is so that we didn't have the overhead of copying stuff from a
> user stack to a kernel stack.
>
> Does anybody know (not guess) if this was stuff added for the
> new non-interrupt 0x80 syscall code? I want to know how a
> simple kernel got corrupted into this twisted thing.
>
> Anybody who has a copy of any of the Intel manuals since '386
> knows that there needs to be only one kernel stack.

I don't believe anyone is guessing here :)



- Davide


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: top stack (l)users for 2.5.69
  2003-05-07 19:22                 ` Richard B. Johnson
  2003-05-07 19:31                   ` Davide Libenzi
@ 2003-05-07 19:39                   ` Hua Zhong
  1 sibling, 0 replies; 68+ messages in thread
From: Hua Zhong @ 2003-05-07 19:39 UTC (permalink / raw)
  To: root, Davide Libenzi; +Cc: Linux kernel

> It is my understanding that there is one kernel stack. If there
> is a stack allocated for some "transition", and I guess there
> may be, because of the mail I'm getting, then it has absolutely
> no purpose whatsoever and is wasted valuable non-paged RAM.

I think your understanding is wrong. Each process has its own kernel stack
allocated together with the task_struct in an 8K chunk. At least for 2.4 it
is. I think the interrupt handler also uses the current kernel stack.

> The reason why system-call parameters are passed in registers
> is so that we didn't have the overhead of copying stuff from a
> user stack to a kernel stack.
>
> Does anybody know (not guess) if this was stuff added for the
> new non-interrupt 0x80 syscall code? I want to know how a
> simple kernel got corrupted into this twisted thing.
>
> Anybody who has a copy of any of the Intel manuals since '386
> knows that there needs to be only one kernel stack.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 19:30                 ` Richard B. Johnson
@ 2003-05-07 19:42                   ` Roland Dreier
  2003-05-07 20:04                     ` Richard B. Johnson
  0 siblings, 1 reply; 68+ messages in thread
From: Roland Dreier @ 2003-05-07 19:42 UTC (permalink / raw)
  To: root; +Cc: Linux kernel

    Roland> Right.  Now think about where the kernel stack for the
    Roland> process that is sleeping in the kernel is kept.

    Richard> It's the kernel, of course. The scheduler runs in the
    Richard> kernel under the kernel stack, with the kernel data. It
    Richard> has nothing to do with the original user once the user
    Richard> sleeps. The user's context was saved, the kernel was set
    Richard> up, and the kernel will schedule other tasks until the
    Richard> sleep time or the sleep_on event is complete.  At that
    Richard> time, (or thereafter), the kernel will schedule the
    Richard> previously sleeping task, its context will be restored,
    Richard> and it continues execution.

    Richard> The context of a task (see entry.S) is completely defined
    Richard> by its registers, including the hidden part of the
    Richard> segments (selectors) that define privilege.

I'll try one more time.  Let's say a user process makes a system call
and enters the kernel.  That system call goes through a few function
calls in the kernel (which each push something on the kernel stack for
that process).  Finally, the kernel has to sleep to service
the system call (let's say it's a blocking read() waiting for some
data to arrive on a socket).

OK, now the scheduler runs, and another user process starts and makes
its own system call, which also goes to sleep.

Now say the data the original process was waiting for arrives.  The
scheduler wakes up that process, which is in the kernel, and it
finishes servicing the read.  This means it now returns through the
chain of kernel function calls before returning to user space.  Each
return in kernel space has to pop some stuff off the stack, and it
better not get mixed up with the second process's kernel stack.

That's (one reason) why each process needs its own kernel stack.
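
As a toy illustration of the sequence above (a user-space sketch, not
kernel code): each task carries its own small stack plus a saved stack
pointer, and the switch only saves one stack pointer and loads another.
All names here (struct task, switch_to_task, kpush, kpop, cpu_sp) are
made up for the example, and cpu_sp merely stands in for %esp.

```c
#include <assert.h>
#include <stddef.h>

#define KSTACK_WORDS 64

/* Hypothetical per-task state: a private "kernel stack" and the stack
 * pointer that was live when the task was last switched out. */
struct task {
	unsigned long kstack[KSTACK_WORDS];
	size_t saved_sp;		/* index into kstack; stack grows down */
};

static struct task *current_task;	/* task currently "on the CPU" */
static size_t cpu_sp;			/* stands in for %esp */

static void task_init(struct task *t)
{
	t->saved_sp = KSTACK_WORDS;	/* empty stack: sp at the top */
}

/* Save the outgoing task's stack pointer, load the incoming one. */
static void switch_to_task(struct task *next)
{
	if (current_task)
		current_task->saved_sp = cpu_sp;
	cpu_sp = next->saved_sp;
	current_task = next;
}

/* Push one word onto the current task's stack (a call frame, say). */
static void kpush(unsigned long v)
{
	assert(cpu_sp > 0);
	current_task->kstack[--cpu_sp] = v;
}

/* Pop one word from the current task's stack (a return, say). */
static unsigned long kpop(void)
{
	assert(cpu_sp < KSTACK_WORDS);
	return current_task->kstack[cpu_sp++];
}
```

If task A pushes two frames and blocks, task B can push and pop its own
frames without disturbing them; when A is switched back in, it pops
exactly what it pushed.  With a single shared array, B's frames would
sit on top of A's.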

 - Roland

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:35               ` Davide Libenzi
@ 2003-05-07 19:45                 ` Jörn Engel
  0 siblings, 0 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 19:45 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux kernel

On Wed, 7 May 2003 11:35:25 -0700, Davide Libenzi wrote:
> 
> It was not really clear you were talking about interrupts stack, that are
> a feasible thing. Even though, I'd not feel confident going down to 4k,
> looking at the post that started this thread.

Neither would I - yet. And the single functions are just part of the
problem, the lowest hanging fruits maybe. In the next step, we have to
find cases like the below and fix them. 4608 bytes should still be ok,
especially since the interrupt stuff is still on the per-process
stack. 

The real bitch is that those cases depend on your .config and your
hardware, so finding them will take quite some time and cannot be done
by a couple of developers alone. Damn!

do_IRQ: stack overflow: 4608
de9b57a0 00001200 00000000 c14b2d4c c0289250 db6a1560 db6924e4 c010b298 
c14b2d4c c14b2d4c 00000000 c0289250 db6a1560 db6924e4 00000000 00000018 
00000018 ffffff00 c0134f3b 00000010 00000296 de9b5800 0000000c 1b564045 
Call Trace:    [call_do_IRQ+5/13] [try_to_swap_out+187/400]
[swap_out_pmd+260/288] [swap_out_mm+248/336] [swap_out+91/224]
[shrink_cache+317/784] [shrink_caches+99/160]
[call_console_drivers+101/288] [try_to_free_pages_zone+54/80]
[balance_classzone+87/496] [__alloc_pages+243/400]
[find_or_create_page+114/240] [grow_dev_page+46/208]
[grow_buffers+152/256] [getblk+70/112] [bread+32/144]
[ext3_get_branch+111/240] [ext3_get_block_handle+120/704]
[schedule_task+91/112] [create_buffers+107/224]
[ext3_get_block+74/144] [block_read_full_page+585/656]
[__alloc_pages+181/400] [__alloc_pages+297/400]
[page_cache_read+171/208] [ext3_get_block+0/144]
[read_cluster_nonblocking+57/80] [filemap_nopage+285/560]
[do_no_page+121/448] [ext3_read_inode+425/720]
[handle_mm_fault+119/272] [do_page_fault+364/1277]
[rb_insert_color+210/240] [do_page_fault+0/1277] [error_code+52/60]
[clear_user+51/80] [load_elf_interp+338/784] [do_page_fault+0/1277]
[error_code+52/60] [clear_user+51/80] [padzero+40/48]
[load_elf_binary+1356/2960] [ext3_dirty_inode+137/256]
[load_elf_binary+0/2960] [search_binary_handler+258/384]
[do_execve+379/544] [sys_execve+80/128] 

Jörn

-- 
"Translations are and will always be problematic. They inflict violence 
upon two languages." (translation from German)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 19:42                   ` Roland Dreier
@ 2003-05-07 20:04                     ` Richard B. Johnson
  2003-05-07 20:23                       ` Roland Dreier
  2003-05-07 20:42                       ` Timothy Miller
  0 siblings, 2 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-07 20:04 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux kernel

On Wed, 7 May 2003, Roland Dreier wrote:

>     Roland> Right.  Now think about where the kernel stack for the
>     Roland> process that is sleeping in the kernel is kept.
>
>     Richard> It's the kernel, of course. The scheduler runs in the
>     Richard> kernel under the kernel stack, with the kernel data. It
>     Richard> has nothing to do with the original user once the user
>     Richard> sleeps. The user's context was saved, the kernel was set
>     Richard> up, and the kernel will schedule other tasks until the
>     Richard> sleep time or the sleep_on event is complete.  At that
>     Richard> time, (or thereafter), the kernel will schedule the
>     Richard> previously sleeping task, its context will be restored,
>     Richard> and it continues execution.
>
>     Richard> The context of a task (see entry.S) is completely defined
>     Richard> by its registers, including the hidden part of the
>     Richard> segments (selectors) that define privilege.
>
> I'll try one more time.  Let's say a user process makes a system call
> and enters the kernel.  That system call goes through a few function
> calls in the kernel (which each push something on the kernel stack for
> that process).  Finally, the kernel has to sleep to service
> the system call (let's say it's a blocking read() waiting for some
> data to arrive on a socket).
>
> OK, now the scheduler runs, and another user process starts and makes
> its own system call, which also goes to sleep.
>
> Now say the data the original process was waiting for arrives.  The
> scheduler wakes up that process, which is in the kernel, and it
> finishes servicing the read.  This means it now returns through the
> chain of kernel function calls before returning to user space.  Each
> return in kernel space has to pop some stuff off the stack, and it
> better not get mixed up with the second process's kernel stack.
>
> That's (one reason) why each process needs its own kernel stack.
>


But no! Not at all. The context of a user does not need to be saved
on the stack, and in fact, isn't. It's saved in a task structure
that was created when the original task was born. The pointer to
that task structure is called 'current' in the kernel. It's in
the kernel's data space, and everything necessary to put that
task back together is in that structure.

Context switching is usually not done by pushing all the registers
onto a stack, then later popping them back. That's not the way
it works.

When a caller executes int 0x80, this is a software interrupt,
called a 'trap'. It enters the trap handler on the kernel stack,
with the segment selectors set up as defined for that trap-handler.
It happens because software told hardware what to do ahead of time.
Software doesn't do it during the trap event. In the trap handler,
no context switch normally occurs. This is so that the kernel can
perform privileged tasks upon behalf of the caller without the
overhead of a context switch. However, all the user's registers
are saved and the kernel's data selector(s) are set so that they
can access the kernel data and the user's data (the user-mode
pointers for file/IO work, etc.). This happens in the context of
the user, but the privilege of the kernel. If the kernel-mode
function needs to sleep, the user's registers have already been
saved in its "current" structure. The kernel is free to find
some other task to load from the run queue and switch to that
task (see switch_to()).

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 19:01         ` Dave Hansen
@ 2003-05-07 20:06           ` Jörn Engel
  2003-05-07 20:14             ` Dave Hansen
  0 siblings, 1 reply; 68+ messages in thread
From: Jörn Engel @ 2003-05-07 20:06 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jonathan Lundell, root, Linux kernel

On Wed, 7 May 2003 12:01:14 -0700, Dave Hansen wrote:
> Jonathan Lundell wrote:
> > One thing that would help (aside from separate interrupt stacks) 
> > would be a guard page below the stack. That wouldn't require any 
> > physical memory to be reserved, and would provide positive indication 
> > of stack overflow without significant runtime overhead.
> 
> x86 doesn't really have big physical shortages right now.  But, the
> _virtual_ shortages are significant.  The guard page just increases the
> virtual cost by 50%.

Different people have different constraints. :)

For me, physical memory is low while virtual memory is abundant.  But
even in your case, the guard page might be an acceptable evil during a
(hopefully short) transition time.

> The stack overflow checking in -mjb uses gcc's mcount mechanism to
> detect overflows.  It should get called on every single function call.

Nice trick.  Do you have better documentation on that mechanism than
man gcc?  The paragraph to -p is quite short and I cannot make the
connection to the rest of the patch immediately.

Jörn

-- 
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:06           ` Jörn Engel
@ 2003-05-07 20:14             ` Dave Hansen
  2003-05-08  8:41               ` Jörn Engel
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Hansen @ 2003-05-07 20:14 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Jonathan Lundell, root, Linux kernel

[-- Attachment #1: Type: text/plain, Size: 773 bytes --]

Jörn Engel wrote:
>>The stack overflow checking in -mjb uses gcc's mcount mechanism to
>>detect overflows.  It should get called on every single function call.
> 
> Nice trick.  Do you have better documentation on that mechanism than
> man gcc?  The paragraph to -p is quite short and I cannot make the
> connection to the rest of the patch immediately.

It is a nice trick, but I didn't write it :)  I stole the code from Ben
LaHaise, around 2.5.20.  All that I've needed to know to maintain the
patch is that a "jmp mcount" gets placed in the critical places.

I've attached a fairly recent version of the stack check patch.  If you
need some more examples, check out kernprof's use of it.  Its acg
functionality used mcount as well.
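
For orientation, here is a rough C rendering of the test that the
mcount stub and stack_overflow() in the attached patch perform.  It is
only a sketch: THREAD_SIZE is assumed to be 8192 (two pages, matching
THREAD_ORDER 1), and stack_bytes_left(), classify_stack() and the enum
are names invented for this example, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE	((uintptr_t)8192)	/* assumption: 2 pages on i386 */
#define STACK_PANIC	((uintptr_t)0x200)	/* value from the patch */
#define STACK_WARN	(THREAD_SIZE >> 1)	/* value from the patch */

enum stack_state { STACK_OK, STACK_EXCESSIVE, STACK_OVERFLOW };

/* Offset of esp inside its THREAD_SIZE-aligned stack area.  The stack
 * grows down toward the base of that area, so this is roughly the
 * number of bytes still unused (thread_info sits at the very base). */
static uintptr_t stack_bytes_left(uintptr_t esp)
{
	/* the mask trick only works for a power-of-two size */
	assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
	return esp & (THREAD_SIZE - 1);
}

/* Mirrors the two thresholds: "jle" against STACK_WARN in mcount, then
 * the "<= STACK_PANIC" test in stack_overflow(). */
static enum stack_state classify_stack(uintptr_t esp)
{
	uintptr_t left = stack_bytes_left(esp);

	if (left > STACK_WARN)
		return STACK_OK;		/* less than half used */
	if (left <= STACK_PANIC)
		return STACK_OVERFLOW;		/* patch would panic() here */
	return STACK_EXCESSIVE;			/* patch prints a warning */
}
```

Because the check runs on every instrumented function entry, it sees
the stack pointer at each call depth rather than only at interrupt
time.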
-- 
Dave Hansen
haveblue@us.ibm.com

[-- Attachment #2: C-stack_usage_check-2.5.59-8.patch --]
[-- Type: text/plain, Size: 6756 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	irqstack-2.5.59-1 -> 1.962  
#	arch/i386/kernel/process.c	1.32.1.4 -> 1.40   
#	arch/i386/kernel/irq.c	1.23.1.2 -> 1.26   
#	            Makefile	1.344.2.13 -> 1.349  
#	include/asm-i386/thread_info.h	1.10.1.4 -> 1.16   
#	   arch/i386/Kconfig	1.13.2.22 -> 1.19   
#	arch/i386/kernel/entry.S	1.38.1.9 -> 1.53   
#	  arch/i386/Makefile	1.24.2.17 -> 1.33   
#	arch/i386/boot/compressed/misc.c	1.9.1.1 -> 1.12   
#	arch/i386/kernel/init_task.c	1.6.1.1 -> 1.8    
#	arch/i386/kernel/i386_ksyms.c	1.36.2.6 -> 1.44   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/01/27	haveblue@elm3b96.(none)	1.958.1.2
# import new irqstack patch
# covers BUILD_INTERRUPT, as well as common_interrupt
# --------------------------------------------
# 03/01/27	haveblue@elm3b96.(none)	1.961
# Merge elm3b96.(none):/work/dave/bk/linux-2.5-irq-stack
# into elm3b96.(none):/work/dave/bk/linux-2.5-irq-stack+overflow-detect
# --------------------------------------------
# 03/01/27	haveblue@elm3b96.(none)	1.962
# Merge elm3b96.(none):/work/dave/bk/linux-2.5-overflow-detect
# into elm3b96.(none):/work/dave/bk/linux-2.5-irq-stack+overflow-detect
# --------------------------------------------
#
diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/Kconfig	Mon Jan 27 11:40:03 2003
@@ -1624,6 +1624,25 @@
 	  If you don't debug the kernel, you can say N, but we may not be able
 	  to solve problems without frame pointers.
 
+config X86_STACK_CHECK
+	bool "Detect stack overflows"
+	depends on FRAME_POINTER
+	help
+	  Say Y here to have the kernel attempt to detect when the per-task
+	  kernel stack overflows.  This is much more robust checking than
+	  the above overflow check, which will only occasionally detect
+	  an overflow.  The level of guarantee here is much greater.
+	
+	  Some older versions of gcc don't handle the -p option correctly.  
+	  Kernprof is affected by the same problem, which is described here:
+	  http://oss.sgi.com/projects/kernprof/faq.html#Q9
+	
+	  Basically, if you get oopses in __free_pages_ok during boot when
+	  you have this turned on, you need to fix gcc.  The Redhat 2.96 
+	  version and gcc-3.x seem to work.  
+	
+	  If not debugging a stack overflow problem, say N
+
 config X86_EXTRA_IRQS
 	bool
 	depends on X86_LOCAL_APIC || X86_VOYAGER
diff -Nru a/arch/i386/Makefile b/arch/i386/Makefile
--- a/arch/i386/Makefile	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/Makefile	Mon Jan 27 11:40:03 2003
@@ -76,6 +76,10 @@
 # default subarch .h files
 mflags-y += -Iinclude/asm-i386/mach-default
 
+ifdef CONFIG_X86_STACK_CHECK
+CFLAGS += -p
+endif
+
 HEAD := arch/i386/kernel/head.o arch/i386/kernel/init_task.o
 
 libs-y 					+= arch/i386/lib/
diff -Nru a/arch/i386/boot/compressed/misc.c b/arch/i386/boot/compressed/misc.c
--- a/arch/i386/boot/compressed/misc.c	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/boot/compressed/misc.c	Mon Jan 27 11:40:03 2003
@@ -377,3 +377,7 @@
 	if (high_loaded) close_output_buffer_if_we_run_high(mv);
 	return high_loaded;
 }
+
+/* We don't actually check for stack overflows this early. */
+__asm__(".globl mcount ; mcount: ret\n");
+
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/kernel/entry.S	Mon Jan 27 11:40:03 2003
@@ -597,6 +597,61 @@
 	pushl $do_spurious_interrupt_bug
 	jmp error_code
 
+
+#ifdef CONFIG_X86_STACK_CHECK
+.data
+	.globl	stack_overflowed
+stack_overflowed:
+	.long	0
+.text
+
+ENTRY(mcount)
+	push %eax
+	movl $(THREAD_SIZE - 1),%eax
+	andl %esp,%eax
+	cmpl $STACK_WARN,%eax	/* more than half the stack is used*/
+	jle 1f
+2:
+	popl %eax
+	ret
+1:	
+	lock;   btsl    $0,stack_overflowed
+	jc      2b
+	
+	# switch to overflow stack
+	movl	%esp,%eax
+	movl	$(stack_overflow_stack + THREAD_SIZE - 4),%esp
+
+	pushf
+	cli
+	pushl	%eax
+
+	# push eip then esp of error for stack_overflow_panic
+	pushl	4(%eax)
+	pushl	%eax
+
+	# update the task pointer and cpu in the overflow stack's thread_info.
+	GET_THREAD_INFO_WITH_ESP(%eax)
+	movl	TI_TASK(%eax),%ebx
+	movl	%ebx,stack_overflow_stack+TI_TASK
+	movl	TI_CPU(%eax),%ebx
+	movl	%ebx,stack_overflow_stack+TI_CPU
+
+	call	stack_overflow
+
+	# pop off call arguments
+	addl	$8,%esp 
+
+	popl	%eax
+	popf
+	movl	%eax,%esp
+	popl	%eax
+	movl	$0,stack_overflowed
+	ret
+
+#warning stack check enabled
+#endif
+
 .data
 ENTRY(sys_call_table)
 	.long sys_restart_syscall	/* 0 - old "setup()" system call, used for restarting */
diff -Nru a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
--- a/arch/i386/kernel/i386_ksyms.c	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/kernel/i386_ksyms.c	Mon Jan 27 11:40:03 2003
@@ -214,3 +214,8 @@
 EXPORT_SYMBOL(edd);
 EXPORT_SYMBOL(eddnr);
 #endif
+
+#ifdef CONFIG_X86_STACK_CHECK
+extern void mcount(void);
+EXPORT_SYMBOL(mcount);
+#endif
diff -Nru a/arch/i386/kernel/init_task.c b/arch/i386/kernel/init_task.c
--- a/arch/i386/kernel/init_task.c	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/kernel/init_task.c	Mon Jan 27 11:40:03 2003
@@ -16,6 +16,10 @@
 union thread_union init_irq_union
 	__attribute__((__section__(".data.init_task")));
 
+#ifdef CONFIG_X86_STACK_CHECK
+union thread_union stack_overflow_stack
+	__attribute__((__section__(".data.init_task")));
+#endif
 
 /*
  * Initial thread structure.
diff -Nru a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c	Mon Jan 27 11:40:03 2003
+++ b/arch/i386/kernel/process.c	Mon Jan 27 11:40:03 2003
@@ -159,7 +159,22 @@
 
 __setup("idle=", idle_setup);
 
-void show_regs(struct pt_regs * regs)
+void stack_overflow(unsigned long esp, unsigned long eip)
+{
+	int panicing = ((esp&(THREAD_SIZE-1)) <= STACK_PANIC);
+
+	if (panicing)
+		print_symbol("stack overflow from %s\n", eip);
+	else
+		print_symbol("excessive stack use from %s\n", eip);
+	printk("esp: %p\n", (void*)esp);
+	show_trace((void*)esp);
+	
+	if (panicing)
+		panic("stack overflow\n");
+}
+
+asmlinkage void show_regs(struct pt_regs * regs)
 {
 	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
 
diff -Nru a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h	Mon Jan 27 11:40:03 2003
+++ b/include/asm-i386/thread_info.h	Mon Jan 27 11:40:03 2003
@@ -63,6 +63,8 @@
  */
 #define THREAD_ORDER 1 
 #define INIT_THREAD_SIZE       THREAD_SIZE
+#define STACK_PANIC		0x200ul
+#define STACK_WARN		((THREAD_SIZE)>>1)
 
 #ifndef __ASSEMBLY__
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:04                     ` Richard B. Johnson
@ 2003-05-07 20:23                       ` Roland Dreier
  2003-05-07 20:42                       ` Timothy Miller
  1 sibling, 0 replies; 68+ messages in thread
From: Roland Dreier @ 2003-05-07 20:23 UTC (permalink / raw)
  To: root; +Cc: Linux kernel

    [ misinformation snipped ]

OK, my real last word on the subject.

When the kernel is running on behalf of a user process, there is more
context than just the struct task_struct part of current.  There is a
kernel stack, which must be per process since each process can have
its own call chain of kernel functions.

By all means, see switch_to().  Look at what it does to ESP.

 - Roland

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 19:17     ` Jeff Garzik
@ 2003-05-07 20:38       ` Randy.Dunlap
  2003-05-07 21:27         ` Marcus Alanen
  2003-05-08 15:10         ` Ingo Oeser
  0 siblings, 2 replies; 68+ messages in thread
From: Randy.Dunlap @ 2003-05-07 20:38 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

On Wed, 07 May 2003 15:17:43 -0400 Jeff Garzik <jgarzik@pobox.com> wrote:

| Linus Torvalds wrote:
| > In article <Pine.LNX.4.53.0305070933450.11740@chaos>,
| > Richard B. Johnson <root@chaos.analogic.com> wrote:
| > 
| >>You know (I hope) that allocating stuff on the stack is not
| >>"bad". 
| > 
| > 
| > Allocating stuff on the stack _is_ bad if you allocate more than a few
| > hundred bytes. That's _especially_ true deep down in the call-sequence,
| > ie in device drivers, low-level filesystems etc.
| > 
| > The kernel stack is a very limited resource, with no protection from
| > overflow. Being lazy and using automatic variables is a BAD BAD thing,
| > even if it's syntactically easy and generates good code.
| 
| 
| Note that the problem is exacerbated if you have a bunch of disjoint 
| stack scopes.  For that case, gcc will take the _sum_ of the stacks and 
| not the union.  rth was kind enough to file gcc PR 9997 on this problem.

Glad to hear that.
 
| It is turning out to be a fairly common problem in the various drivers' 
| ioctl handlers.  Kernel hackers (myself included) often create automatic 
| variables for each case in a C switch statement.  (and now I'm having to 
| go back and fix that :))

I've written a few of the stack reduction patches.  Lots of ioctl functions
need work, so gcc handling it better would be good to have.

I have mostly used kmalloc/kfree, but using automatic variables is certainly
cleaner to write (code).  One of the patches that I did just made each ioctl
cmd call a separate function, and then each separate function was able to use
automatic variables on the stack instead of kmalloc/kfree.  I prefer this
method when it's feasible (and until gcc can handle these cases).
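
A minimal sketch of that split, using a made-up handler rather than any
real driver: demo_ioctl(), CMD_A/CMD_B, struct cfg_a/cfg_b and the
helpers are all hypothetical.  Each large automatic variable lives in
its own non-inlined helper, so its stack space is only in use while
that one command runs, instead of every case's buffer being summed into
the dispatcher's frame.

```c
#include <string.h>

/* Hypothetical per-command scratch structures. */
struct cfg_a { char buf[1024]; };
struct cfg_b { char buf[2048]; };

enum { CMD_A = 1, CMD_B = 2 };

/* noinline (a gcc attribute) keeps the compiler from folding the
 * helper's frame back into the caller's. */
static __attribute__((noinline)) int do_cmd_a(int arg)
{
	struct cfg_a a;			/* on the stack only during CMD_A */

	memset(&a, 0, sizeof(a));
	a.buf[0] = (char)arg;
	return a.buf[0] + 1;
}

static __attribute__((noinline)) int do_cmd_b(int arg)
{
	struct cfg_b b;			/* on the stack only during CMD_B */

	memset(&b, 0, sizeof(b));
	b.buf[0] = (char)arg;
	return b.buf[0] * 2;
}

/* The dispatcher itself has no large locals. */
static int demo_ioctl(unsigned int cmd, int arg)
{
	switch (cmd) {
	case CMD_A:
		return do_cmd_a(arg);
	case CMD_B:
		return do_cmd_b(arg);
	default:
		return -1;		/* unknown command */
	}
}
```

The peak here is about the size of the largest helper frame rather than
the sum of all of them, and nothing needs to be kmalloc'ed or freed.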

--
~Randy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:04                     ` Richard B. Johnson
  2003-05-07 20:23                       ` Roland Dreier
@ 2003-05-07 20:42                       ` Timothy Miller
  2003-05-08  9:06                         ` Jörn Engel
  2003-05-08 11:33                         ` Richard B. Johnson
  1 sibling, 2 replies; 68+ messages in thread
From: Timothy Miller @ 2003-05-07 20:42 UTC (permalink / raw)
  To: root; +Cc: Roland Dreier, Linux kernel



Richard B. Johnson wrote:
> 
> When a caller executes int 0x80, this is a software interrupt,
> called a 'trap'. It enters the trap handler on the kernel stack,
> with the segment selectors set up as defined for that trap-handler.
> It happens because software told hardware what to do ahead of time.
> Software doesn't do it during the trap event. In the trap handler,
> no context switch normally occurs. 

On typical processors, when one gets an interrupt, the current program 
counter and processor state flags are pushed onto a stack.  Which stack 
gets used for this?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:38       ` Randy.Dunlap
@ 2003-05-07 21:27         ` Marcus Alanen
  2003-05-07 21:27           ` Randy.Dunlap
  2003-05-08 15:10         ` Ingo Oeser
  1 sibling, 1 reply; 68+ messages in thread
From: Marcus Alanen @ 2003-05-07 21:27 UTC (permalink / raw)
  To: rddunlap, Jeff Garzik; +Cc: linux-kernel

On Wed, 7 May 2003 13:38:56 -0700, Randy.Dunlap <rddunlap@osdl.org> wrote:
>I have mostly used kmalloc/kfree, but using automatic variables is certainly
>cleaner to write (code).  One of the patches that I did just made each ioctl
>cmd call a separate function, and then each separate function was able to use
>automatic variables on the stack instead of kmalloc/kfree.  I prefer this
>method when it's feasible (and until gcc can handle these cases).

I take it moving the automatic variables in the function to a static
data area would be possible, _if_ that function (or rather, the
variables) is protected by some unique lock (not some per-structure
lock, of course)? Although this is probably already done in the
majority of cases.

Marcus


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 21:27         ` Marcus Alanen
@ 2003-05-07 21:27           ` Randy.Dunlap
  0 siblings, 0 replies; 68+ messages in thread
From: Randy.Dunlap @ 2003-05-07 21:27 UTC (permalink / raw)
  To: Marcus Alanen; +Cc: jgarzik, linux-kernel

On Thu, 8 May 2003 00:27:01 +0300 Marcus Alanen <marcus@infa.abo.fi> wrote:

| On Wed, 7 May 2003 13:38:56 -0700, Randy.Dunlap <rddunlap@osdl.org> wrote:
| >I have mostly used kmalloc/kfree, but using automatic variables is certainly
| >cleaner to write (code).  One of the patches that I did just made each ioctl
| >cmd call a separate function, and then each separate function was able to use
| >automatic variables on the stack instead of kmalloc/kfree.  I prefer this
| >method when it's feasible (and until gcc can handle these cases).
| 
| I take it moving the automatic variables in the function to a static
| data area would be possible, _if_ that function (or rather, the
| variables) is protected by some unique lock (not some per-structure
| lock, of course)? Although this is probably already done in the
| majority of cases.

Sure, it just means that use of those areas has to be serialized,
whereas the other ways allow reentrant/concurrent uses.

--
~Randy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:13       ` Jonathan Lundell
                           ` (2 preceding siblings ...)
  2003-05-07 19:01         ` Dave Hansen
@ 2003-05-07 21:30         ` Jesse Pollard
  2003-05-07 21:54           ` Timothy Miller
  3 siblings, 1 reply; 68+ messages in thread
From: Jesse Pollard @ 2003-05-07 21:30 UTC (permalink / raw)
  To: Jonathan Lundell, root@chaos.analogic.com,Jörn Engel; +Cc: Linux kernel

On Wednesday 07 May 2003 12:13, Jonathan Lundell wrote:
[snip]
> One thing that would help (aside from separate interrupt stacks)
> would be a guard page below the stack. That wouldn't require any
> physical memory to be reserved, and would provide positive indication
> of stack overflow without significant runtime overhead.

It does take up a page table entry, which may also be in short supply

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 18:51               ` Davide Libenzi
  2003-05-07 19:22                 ` Richard B. Johnson
@ 2003-05-07 21:47                 ` Martin J. Bligh
  1 sibling, 0 replies; 68+ messages in thread
From: Martin J. Bligh @ 2003-05-07 21:47 UTC (permalink / raw)
  To: Davide Libenzi, Richard B. Johnson; +Cc: Linux kernel



--On Wednesday, May 07, 2003 11:51:19 -0700 Davide Libenzi <davidel@xmailserver.org> wrote:

> On Wed, 7 May 2003, Richard B. Johnson wrote:
> 
>> No, No. That is a process stack. Every process has its own, entirely
>> separate stack. This stack is used only in user mode. The kernel has
>> its own stack. Every time you switch to kernel mode either by
>> calling the kernel or by a hardware interrupt, the kernel's stack
>> is used.
> 
> Is it your understanding that a per-task kernel stack does not exist?

It seems to be his lack of understanding, more to the point.

M.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 21:30         ` Jesse Pollard
@ 2003-05-07 21:54           ` Timothy Miller
  2003-05-07 22:01             ` Jesse Pollard
  0 siblings, 1 reply; 68+ messages in thread
From: Timothy Miller @ 2003-05-07 21:54 UTC (permalink / raw)
  To: Jesse Pollard, Linux Kernel Mailing List



Jesse Pollard wrote:
> On Wednesday 07 May 2003 12:13, Jonathan Lundell wrote:
> [snip]
> 
>>One thing that would help (aside from separate interrupt stacks)
>>would be a guard page below the stack. That wouldn't require any
>>physical memory to be reserved, and would provide positive indication
>>of stack overflow without significant runtime overhead.
> 
> 
> It does take up a page table entry, which may also be in short supply

Now, I'm sure this has GOT to be a terribly ignorant question, but I'll 
try anyhow:

What happens if you simply neglect to provide a mapping for that page? 
I'm sure that will cause some sort of page fault.  Why would you have to 
do something different?


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 21:54           ` Timothy Miller
@ 2003-05-07 22:01             ` Jesse Pollard
  0 siblings, 0 replies; 68+ messages in thread
From: Jesse Pollard @ 2003-05-07 22:01 UTC (permalink / raw)
  To: Timothy Miller, Linux Kernel Mailing List

On Wednesday 07 May 2003 16:54, Timothy Miller wrote:
> Jesse Pollard wrote:
> > On Wednesday 07 May 2003 12:13, Jonathan Lundell wrote:
> > [snip]
> >
> >>One thing that would help (aside from separate interrupt stacks)
> >>would be a guard page below the stack. That wouldn't require any
> >>physical memory to be reserved, and would provide positive indication
> >>of stack overflow without significant runtime overhead.
> >
> > It does take up a page table entry, which may also be in short supply
>
> Now, I'm sure this has GOT to be a terribly ignorant question, but I'll
> try anyhow:
>
> What happens if you simply neglect to provide a mapping for that page?
> I'm sure that will cause some sort of page fault.  Why would you have to
> do something different?

I believe it shifts the entire virtual range up (or down, depending on your
point of view). Each page in the virtual address range (whether it physically
exists or not) has a descriptor. To reserve one requires that the descriptor
be set to "does not exist, no read, no write". Then any access to that page
can/will generate a trap.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:14             ` Dave Hansen
@ 2003-05-08  8:41               ` Jörn Engel
  2003-05-08 16:51                 ` Dave Hansen
  0 siblings, 1 reply; 68+ messages in thread
From: Jörn Engel @ 2003-05-08  8:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jonathan Lundell, root, Linux kernel

On Wed, 7 May 2003 13:14:14 -0700, Dave Hansen wrote:
> Jörn Engel wrote:
> >>The stack overflow checking in -mjb uses gcc's mcount mechanism to
> >>detect overflows.  It should get called on every single function call.
> > 
> > Nice trick.  Do you have better documentation on that mechanism than
> > man gcc?  The paragraph to -p is quite short and I cannot make the
> > connection to the rest of the patch immediately.
> 
> It is a nice trick, but I didn't write it :)  I stole the code from Ben
> LaHaise, around 2.5.20.  All that I've needed to know to maintain the
> patch is that a "jmp mcount" gets placed in the critical places.

Sure.  But exactly that information is not contained in the manpage (as
of Debian's 3.2.3).  I guess I'll have to dig deeper.

> I've attached a fairly recent version of the stack check patch.  If you
> need some more examples, check out kernprof's use of it.  Its acg
> functionality uses mcount as well.

Oh, kernprof was too advanced already.  It basically worked out of the
box for me, porting it to ppc took maybe one hour, not counting a
linker problem that was loosely related to that patch.  Never bothered
to really understand what it does. :(

> diff -Nru a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
> --- a/arch/i386/kernel/process.c	Mon Jan 27 11:40:03 2003
> +++ b/arch/i386/kernel/process.c	Mon Jan 27 11:40:03 2003
> @@ -159,7 +159,22 @@
>  
>  __setup("idle=", idle_setup);
>  
> -void show_regs(struct pt_regs * regs)
> +void stack_overflow(unsigned long esp, unsigned long eip)
> +{
> +	int panicing = ((esp&(THREAD_SIZE-1)) <= STACK_PANIC);
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +	if (panicing)
> +		print_symbol("stack overflow from %s\n", eip);
> +	else
> +		print_symbol("excessive stack use from %s\n", eip);
> +	printk("esp: %p\n", (void*)esp);
> +	show_trace((void*)esp);
> +	
> +	if (panicing)
> +		panic("stack overflow\n");
> +}
> +
> +asmlinkage void show_regs(struct pt_regs * regs)
>  {
>  	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
>  
> diff -Nru a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h
> --- a/include/asm-i386/thread_info.h	Mon Jan 27 11:40:03 2003
> +++ b/include/asm-i386/thread_info.h	Mon Jan 27 11:40:03 2003
> @@ -63,6 +63,8 @@
>   */
>  #define THREAD_ORDER 1 
>  #define INIT_THREAD_SIZE       THREAD_SIZE
> +#define STACK_PANIC		0x200ul
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +#define STACK_WARN		((THREAD_SIZE)>>1)
>  
>  #ifndef __ASSEMBLY__

If I read this correctly, your patch doesn't catch everything, if
there are functions remaining that use stack frames >0x200ul.  Ok,
tell me I'm wrong and should go through the assembler code first.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:42                       ` Timothy Miller
@ 2003-05-08  9:06                         ` Jörn Engel
  2003-05-08 11:33                         ` Richard B. Johnson
  1 sibling, 0 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-08  9:06 UTC (permalink / raw)
  To: Timothy Miller; +Cc: root, Roland Dreier, Linux kernel

On Wed, 7 May 2003 16:42:26 -0400, Timothy Miller wrote:
> Richard B. Johnson wrote:
> >
> >When a caller executes int 0x80, this is a software interrupt,
> >called a 'trap'. It enters the trap handler on the kernel stack,
> >with the segment selectors set up as defined for that trap-handler.
> >It happens because software told hardware what to do ahead of time.
> >Software doesn't do it during the trap event. In the trap handler,
> >no context switch normally occurs. 
> 
> On typical processors, when one gets an interrupt, the current program 
> counter and processor state flags are pushed onto a stack.  Which stack 
> gets used for this?

I have no idea what a 'typical processor' might look like. But the
thing most CPUs seem to have in common is that they save two registers
either on the stack or into other registers that only exist for this
purpose (SRR on PPC).

Once that has happened, the OS has the job of figuring out where its
stack (or equivalent) is located, *without* clobbering the registers.
Once that is done, it can save all the registers on the stack,
including SRR. It might also move what the CPU has pushed to the
"stack" somewhere else.

After the interrupt has been handled, the reverse path is executed,
restoring registers in the correct order, possibly switching from
kernel to user stack, etc.

And there is one kernel stack per process. Please don't argue about
that, unless you have read the code.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 17:40         ` Richard B. Johnson
  2003-05-07 18:12           ` Roland Dreier
@ 2003-05-08 10:29           ` David Howells
  1 sibling, 0 replies; 68+ messages in thread
From: David Howells @ 2003-05-08 10:29 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Linux kernel


> The kernel stack, at least for ix86, is only one, set upon startup
> at 8192 bytes above a label called init_task_unit. The kernel must
> have a separate stack and, contrary to what I've been reading on
> this list, it can't have more kernel stacks than CPUs and, I don't
> see a separate stack allocated for different CPUs.

Not so... Look into asm-i386/thread_info.h on 2.5 kernels. Each process/thread
has a chunk of memory of THREAD_SIZE size (8K on i386) with a small structure
(struct thread_info) lurking at the bottom and its own personal kernel stack
lurking at the top.

The number of CPUs doesn't have anything to do with it.

Unless you mean "kernel stack pointer"?

> The context of a task (see entry.S) is completely defined by
> its registers, including the hidden part of the segments
> (selectors) that define privilege.

No, it's not. It's also defined by, for instance, all the waitqueues it's on,
and these bits of information are frequently stored on the kernel stack, and
all the internal function information in the call chain leading to whoever
called schedule().

> But no! Not at all. The context of a user does not need to be saved
> on the stack, and in fact, isn't. It's saved in a task structure
> that was created when the original task was born. The pointer to
> that task structure is called 'current' in the kernel. It's in
> the kernel's data space, and everything necessary to put that
> task back together is in that structure.
>
> Context switching is usually not done by pushing all the registers
> onto a stack, then later popping them back. That's not the way
> it works.

Yes it is. The context is saved on the kernel stack belonging to the process that
was executing at the time. entry.S uses SAVE_ALL to build a stack frame
including all the values of all the data registers that were in use at the
time. The pt_regs structure defines the layout of this.

The pt_regs that holds the userspace context for any particular process is
pointed to by current->thread.esp0 or its equivalent on other archs (look at
arch/i386/kernel/ptrace.c).

What you get in the thread info for any process is:

	+------------------+	<--- highest addr
	| PT_REGS (userspace)
	+------------------+
	| function call chain
	+------------------+
	| PT_REGS (kernel)	<--- MMU exception maybe
	+------------------+
	| function call chain
	+------------------+
	| PT_REGS (kernel)	<--- timer interrupt maybe
	+------------------+
	| function call chain
	+------------------+	<--- addr in kernel stack pointer
	|
	:
	|
	+------------------+	<--- limit of stack
	| THREAD_INFO
	+------------------+	<--- lowest addr

The lowest address always resides on an 8KB boundary, and so the address of
THREAD_INFO can be found by ESP&~8191.

That said, some registers are saved in current->thread, but only the minimum
possible. This is done by switch_to() which is called from the scheduler.

Furthermore, interrupt handlers aren't allowed to call schedule (the scheduler
barfs on them if they do), so stacks will only be switched if the top stack
frame belongs to the process and not to an interrupt.

David

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:42                       ` Timothy Miller
  2003-05-08  9:06                         ` Jörn Engel
@ 2003-05-08 11:33                         ` Richard B. Johnson
  2003-05-08 12:00                           ` Helge Hafting
                                             ` (2 more replies)
  1 sibling, 3 replies; 68+ messages in thread
From: Richard B. Johnson @ 2003-05-08 11:33 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Roland Dreier, Linux kernel

On Wed, 7 May 2003, Timothy Miller wrote:

>
>
> Richard B. Johnson wrote:
> >
> > When a caller executes int 0x80, this is a software interrupt,
> > called a 'trap'. It enters the trap handler on the kernel stack,
> > with the segment selectors set up as defined for that trap-handler.
> > It happens because software told hardware what to do ahead of time.
> > Software doesn't do it during the trap event. In the trap handler,
> > no context switch normally occurs.
>
> On typical processors, when one gets an interrupt, the current program
> counter and processor state flags are pushed onto a stack.  Which stack
> gets used for this?
>

In protected mode, the kernel stack. And, regardless of implementation
details, there can only be one. It's the one whose stack-selector
is being used by the CPU. So, in the case of Linux, with multiple
kernel stacks (!?????), one for each process, whatever process is
running in kernel mode (current) has its SS active. It's the
one that gets hit with the interrupt.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 11:33                         ` Richard B. Johnson
@ 2003-05-08 12:00                           ` Helge Hafting
  2003-05-08 15:42                           ` Timothy Miller
  2003-05-08 16:47                           ` Davide Libenzi
  2 siblings, 0 replies; 68+ messages in thread
From: Helge Hafting @ 2003-05-08 12:00 UTC (permalink / raw)
  To: root; +Cc: Timothy Miller, Roland Dreier, Linux kernel

Richard B. Johnson wrote:
> On Wed, 7 May 2003, Timothy Miller wrote:

>>On typical processors, when one gets an interrupt, the current program
>>counter and processor state flags are pushed onto a stack.  Which stack
>>gets used for this?
>>
> 
> 
> In protected mode, the kernel stack. And, regardless of implementation
> details, there can only be one. It's the one whose stack-selector
> is being used by the CPU. So, in the case of Linux, with multiple

A little contradiction there.  "There can only be one" versus
"the one whose stack-selector is being used by the CPU"

Of course there can only be one stack _at a time_,
but the stack selector is switched as part of the context
switch - so there is one stack per process.

The same applies to kernel stacks. There can only be
one at a time, but the kernel stack pointer is
updated on task switches so there is one kernel
stack per process too.

> kernel stacks (!?????), one for each process, whatever process is
> running in kernel mode (current) has its SS active. It's the
> one that gets hit with the interrupt.

Helge Hafting



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 20:38       ` Randy.Dunlap
  2003-05-07 21:27         ` Marcus Alanen
@ 2003-05-08 15:10         ` Ingo Oeser
  2003-05-08 17:12           ` Randy.Dunlap
  1 sibling, 1 reply; 68+ messages in thread
From: Ingo Oeser @ 2003-05-08 15:10 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: Jeff Garzik, linux-kernel

On Wed, May 07, 2003 at 01:38:56PM -0700, Randy Dunlap wrote:
> I've written a few of the stack reduction patches.  Lots of ioctl functions
> need work, so gcc handling it better would be good to have.
> 
> I have mostly used kmalloc/kfree, but using automatic variables is certainly
> cleaner to write (code).  One of the patches that I did just made each ioctl
> cmd call a separate function, and then each separate function was able to use
> automatic variables on the stack instead of kmalloc/kfree.  I prefer this
> method when it's feasible (and until gcc can handle these cases).

Wouldn't an explicit union be a better solution for the
switch-statement issue?

That way you can still use the stack, use even less of it, and
still have all cases in place.

Regards

Ingo Oeser
-- 
Marketing is the art of selling people things they don't
need, with money they don't have, to impress people they
don't like.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-07 16:01           ` William Lee Irwin III
@ 2003-05-08 15:36             ` Ingo Oeser
  2003-05-08 18:04               ` William Lee Irwin III
  0 siblings, 1 reply; 68+ messages in thread
From: Ingo Oeser @ 2003-05-08 15:36 UTC (permalink / raw)
  To: William Lee Irwin III, Torsten Landschoff, Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 09:01:44AM -0700, William Lee Irwin III wrote:
> Pure per-cpu stacks would require the interrupt model of programming to
> be used, which is a design decision deep enough it's debatable whether
> it's feasible to do conversions to or from at all, never mind desirable.
> Basically every entry point into the kernel is treated as an interrupt,
> and nothing can ever sleep or be scheduled in the kernel, but rather
> only register callbacks to be run when the event waited for occurs.
> Scheduling only happens as a decision of which userspace task to resume
> when returning from the kernel to userspace, though one could envision
> a priority queue discipline for processing the registered callbacks.

To illustrate that: It's basically like the difference between
fork() and spawn(). Threads (of control) are completely decoupled
and re-coupled only by the event/callback mechanism.

This is introducing exactly the mechanisms Linus didn't like when
he decided that he doesn't want a micro kernel architecture.

So it is not going to happen RSN.


Regards

Ingo Oeser
-- 
Marketing is the art of selling people things they don't
need, with money they don't have, to impress people they
don't like.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 11:33                         ` Richard B. Johnson
  2003-05-08 12:00                           ` Helge Hafting
@ 2003-05-08 15:42                           ` Timothy Miller
  2003-05-09  8:57                             ` Miles Bader
  2003-05-08 16:47                           ` Davide Libenzi
  2 siblings, 1 reply; 68+ messages in thread
From: Timothy Miller @ 2003-05-08 15:42 UTC (permalink / raw)
  To: root; +Cc: Roland Dreier, Linux kernel



Richard B. Johnson wrote:
> 
> In protected mode, the kernel stack. And, regardless of implementation
> details, there can only be one. It's the one whose stack-selector
> is being used by the CPU. So, in the case of Linux, with multiple
> kernel stacks (!?????), one for each process, whatever process is
> running in kernel mode (current) has its SS active. It's the
> one that gets hit with the interrupt.

That's kinda what I figured.  I just didn't know if there was some 
(hardware) provision to do otherwise, or if there was some reason why 
the interrupt handler might immediately switch stacks, etc.

That is to say, some CPUs might have provision for a stack pointer to be 
associated with each interrupt vector.

Secondly, given so many unknowns about what might already be on the 
current kernel stack, it might be generally safer to move the processor 
state (saved by the CPU on interrupt) from the current stack to some 
"interrupt stack" which may have a more predictable amount of free 
space.  (Then again, if the CPU is currently executing in user space, 
the kernel stack is probably completely empty.)

I realize that, however small, that would be an undesirable amount of 
overhead, but it occurs to me that someone might do that anyhow for 
stability reasons.  I could imagine some interrupts needing more than a 
trivial amount of stack space.  I'm assuming, for instance, that things 
like the IDE block driver would need to do things like PIO a sector 
to/from an old CDROM drive, look up the next scheduled I/O operation to 
perform, etc.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 11:33                         ` Richard B. Johnson
  2003-05-08 12:00                           ` Helge Hafting
  2003-05-08 15:42                           ` Timothy Miller
@ 2003-05-08 16:47                           ` Davide Libenzi
  2 siblings, 0 replies; 68+ messages in thread
From: Davide Libenzi @ 2003-05-08 16:47 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Linux kernel

On Thu, 8 May 2003, Richard B. Johnson wrote:

> In protected mode, the kernel stack. And, regardless of implementation
> details, there can only be one. It's the one whose stack-selector
> is being used by the CPU. So, in the case of Linux, with multiple
> kernel stacks (!?????), one for each process, whatever process is
> running in kernel mode (current) has its SS active. It's the
> one that gets hit with the interrupt.

Why the multiple !?????, Richard? The fact that Intel shows something
different in their manuals does not automatically mean that it is the
right way to do it. The boundary line between tasks is inside switch_to(),
which is deep inside the kernel path. The stack space that goes from the
system call entry to the switch_to() call obviously *must* be preserved.
Can it be done in a different way? Sure it can. Try to think about it and
look at how much more complex things become in such a scenario. Intel also
suggests using one TSS per task, while we, for example, recycle the same
TSS (one per CPU) for all processes.




- Davide


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08  8:41               ` Jörn Engel
@ 2003-05-08 16:51                 ` Dave Hansen
  2003-05-08 22:12                   ` Jörn Engel
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Hansen @ 2003-05-08 16:51 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Jonathan Lundell, root, Linux kernel

Jörn Engel wrote:
> If I read this correctly, your patch doesn't catch everything, if
> there are functions remaining that use stack frames >0x200ul.  Ok,
> tell me I'm wrong and should go through the assembler code first.

If any function is ever called with < 0x200 bytes of space left on the
stack, it considers it an overflow.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 15:10         ` Ingo Oeser
@ 2003-05-08 17:12           ` Randy.Dunlap
  0 siblings, 0 replies; 68+ messages in thread
From: Randy.Dunlap @ 2003-05-08 17:12 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: jgarzik, linux-kernel

On Thu, 8 May 2003 17:10:42 +0200 Ingo Oeser <ingo.oeser@informatik.tu-chemnitz.de> wrote:

| On Wed, May 07, 2003 at 01:38:56PM -0700, Randy Dunlap wrote:
| > I've written a few of the stack reduction patches.  Lots of ioctl functions
| > need work, so gcc handling it better would be good to have.
| > 
| > I have mostly used kmalloc/kfree, but using automatic variables is certainly
| > cleaner to write (code).  One of the patches that I did just made each ioctl
| > cmd call a separate function, and then each separate function was able to use
| > automatic variables on the stack instead of kmalloc/kfree.  I prefer this
| > method when it's feasible (and until gcc can handle these cases).
| 
| Wouldn't an explicit union be a better solution for the
| switch-statement issue?
| 
| That way you can still use the stack, use even less of it, and
| still have all cases in place.

Sure, that's a good solution too.  Better one is the gcc solution.

--
~Randy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 15:36             ` Ingo Oeser
@ 2003-05-08 18:04               ` William Lee Irwin III
  0 siblings, 0 replies; 68+ messages in thread
From: William Lee Irwin III @ 2003-05-08 18:04 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Torsten Landschoff, Jörn Engel, Linux kernel

On Wed, May 07, 2003 at 09:01:44AM -0700, William Lee Irwin III wrote:
>> Pure per-cpu stacks would require the interrupt model of programming to
>> be used, which is a design decision deep enough it's debatable whether
>> it's feasible to do conversions to or from at all, never mind desirable.
>> Basically every entry point into the kernel is treated as an interrupt,
>> and nothing can ever sleep or be scheduled in the kernel, but rather
>> only register callbacks to be run when the event waited for occurs.
>> Scheduling only happens as a decision of which userspace task to resume
>> when returning from the kernel to userspace, though one could envision
>> a priority queue discipline for processing the registered callbacks.

On Thu, May 08, 2003 at 05:36:47PM +0200, Ingo Oeser wrote:
> To illustrate that: It's basically like the difference between
> fork() and spawn(). Threads (of control) are completely decoupled
> and re-coupled only by the event/callback mechanism.
> This is introducing exactly the mechanisms Linus didn't like when
> he decided that he doesn't want a micro kernel architecture.
> So it is not going to happen RSN.

Your analogy is poor and I vaguely doubt the mechanism has been
suggested by anyone for use in Linux ever. It has nothing whatsoever to
do with a microkernel and in most incarnations precludes microkernel
designs. I'm not suggesting it, I just thought that was what "per-cpu
stacks" was supposed to mean.

Not that elaboration is needed, but the threads of control are not
decoupled as you suggest, but rather connected with continuations at
what would in the UNIX model be scheduling points. spawn() is just
POSIX' API for optimizing out some of the overhead of a fork()/exec()
cycle, and has nothing to do with interrupt model programming, esp.
since it is the exact opposite of thread creation. i.e. the interrupt
model is the extreme incarnation of "state machines, not threads".


-- wli

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 16:51                 ` Dave Hansen
@ 2003-05-08 22:12                   ` Jörn Engel
  0 siblings, 0 replies; 68+ messages in thread
From: Jörn Engel @ 2003-05-08 22:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jonathan Lundell, root, Linux kernel

On Thu, 8 May 2003 09:51:16 -0700, Dave Hansen wrote:
> Jörn Engel wrote:
> > If I read this correctly, your patch doesn't catch everything, if
> > there are functions remaining that use stack frames >0x200ul.  Ok,
> > tell me I'm wrong and should go through the assembler code first.
> 
> If any function is ever called with < 0x200 bytes of space left on the
> stack, it considers it an overflow.

Is that number before or after the function placed its own stack frame
on the stack? If before, I'd rather increase that to 0x400 or even a
bit higher. But hopefully gcc is smarter than that.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 15:42                           ` Timothy Miller
@ 2003-05-09  8:57                             ` Miles Bader
  2003-05-09 16:50                               ` Timothy Miller
  0 siblings, 1 reply; 68+ messages in thread
From: Miles Bader @ 2003-05-09  8:57 UTC (permalink / raw)
  To: Timothy Miller; +Cc: root, Roland Dreier, Linux kernel

Timothy Miller <miller@techsource.com> writes:
> That is to say, some CPUs might have provision for a stack pointer to be 
> associated with each interrupt vector.

On my arch, the CPU doesn't use the stack for interrupts at all...
so any saving on the stack is what's done by entry.S.

-Miles
-- 
o The existentialist, not having a pillow, goes everywhere with the book by
  Sullivan, _I am going to spit on your graves_.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-09  8:57                             ` Miles Bader
@ 2003-05-09 16:50                               ` Timothy Miller
  0 siblings, 0 replies; 68+ messages in thread
From: Timothy Miller @ 2003-05-09 16:50 UTC (permalink / raw)
  To: Miles Bader; +Cc: root, Roland Dreier, Linux kernel



Miles Bader wrote:
> Timothy Miller <miller@techsource.com> writes:
> 
>>That is to say, some CPUs might have provision for a stack pointer to be 
>>associated with each interrupt vector.
> 
> 
> On my arch, the CPU doesn't use the stack for interrupts at all...
> so any saving on the stack is what's done by entry.S.
> 


Sure, but probably the majority are using x86 which would be affected by 
some of the ideas we've talked about in this thread.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 19:05   ` Timothy Miller
@ 2003-05-08 21:00     ` Jonathan Lundell
  0 siblings, 0 replies; 68+ messages in thread
From: Jonathan Lundell @ 2003-05-08 21:00 UTC (permalink / raw)
  To: Timothy Miller; +Cc: linux-kernel

At 3:05pm -0400 5/8/03, Timothy Miller wrote:
>My suggestion would be that if we do manage to get typical stack 
>usage down to the point where we can go to a 4K stack, then 
>interrupt handlers would have to be rewritten to recognize whether 
>or not the interrupt arrived on a user process kernel stack and then 
>move the context over to the "interrupt stack".  The overhead would 
>be low enough that it's worth doing so that we could reduce process 
>kernel stack size.  Whenever an interrupt service routine is itself 
>interrupted, the interrupt stack check code would realize that it is 
>already using the interrupt stack and not move the context.  Here, 
>then, we would need only one single interrupt stack which we would 
>size for worst case; so if we made it 8 or 12K, that's 8 or 12K once 
>for each CPU which is allowed to receive interrupts, not once per 
>process.
>
>You like?  :)

It makes sense to me. But this thread has gone in a circle, I think.

At 9:20am -0700 5/7/03, Martin J. Bligh wrote:
>There are patches to make i386 do this (and use 4K stacks as a config option)
>from Dave Hansen and Ben LaHaise in 2.5-mjb tree.

(the message context was a separate interrupt stack)
-- 
/Jonathan Lundell.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 18:04 ` Jonathan Lundell
@ 2003-05-08 19:05   ` Timothy Miller
  2003-05-08 21:00     ` Jonathan Lundell
  0 siblings, 1 reply; 68+ messages in thread
From: Timothy Miller @ 2003-05-08 19:05 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: linux-kernel



Jonathan Lundell wrote:
> 
> 
> In particular, the interrupt stack is the kernel stack of the current 
> task. This is (in part) what leads to stack overflows. If the current 
> task is running in the kernel, using a significant hunk of its stack, an 
> interrupt is limited to the balance of that stack. And if that interrupt 
> triggers a soft irq that runs, say, a network stack, and that softirq 
> handler in turn gets interrupted, we've got, effectively, three 
> processes sharing the stack. And of course hard interrupts can be 
> nested, so it's pretty damn difficult to specify a safe upper limit for 
> stack usage.

This is the sort of thing that would severely limit our ability to 
shrink the kernel stack.  While it's perhaps feasible to shrink kernel 
stack usage for typical syscalls, exactly the situation you describe is 
unpredictable and very difficult to avoid.

My suggestion would be that if we do manage to get typical stack usage 
down to the point where we can go to a 4K stack, then interrupt handlers 
would have to be rewritten to recognize whether or not the interrupt 
arrived on a user process kernel stack and then move the context over to 
the "interrupt stack".  The overhead would be low enough that it's worth 
doing so that we could reduce process kernel stack size.  Whenever an 
interrupt service routine is itself interrupted, the interrupt stack 
check code would realize that it is already using the interrupt stack 
and not move the context.  Here, then, we would need only one single 
interrupt stack which we would size for worst case; so if we made it 8 
or 12K, that's 8 or 12K once for each CPU which is allowed to receive 
interrupts, not once per process.

You like?  :)


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: top stack (l)users for 2.5.69
  2003-05-08 14:08 Chuck Ebbert
@ 2003-05-08 18:04 ` Jonathan Lundell
  2003-05-08 19:05   ` Timothy Miller
  0 siblings, 1 reply; 68+ messages in thread
From: Jonathan Lundell @ 2003-05-08 18:04 UTC (permalink / raw)
  To: linux-kernel

At 10:08am -0400 5/8/03, Chuck Ebbert wrote:
> > I have no idea what a 'typical processor' might look like. But the
> > thing most CPUs seem to have in common is that they save two registers
> > either on the stack or into other registers that only exist for this
> > purpose (SRR on PPC).
> >
> > Once that has happened, the OS has the job of figuring out where its
> > stack (or equivalent) is located, *without* clobbering the registers.
> > Once that is done, it can save all the registers on the stack,
> > including SRR.
>
> On i386 the CPU automatically switches to the stack corresponding to
> the privilege level (PL) of the interrupt handler, then pushes the
> instruction pointer and flags onto that stack.  It is theoretically
> possible to write unprivileged interrupt handlers by using conforming
> code segments, in which case a stack switch will not occur, but such a
> handler cannot touch anything but registers and stack so it's not very
> useful.

In particular, the interrupt stack is the kernel stack of the current 
task. This is (in part) what leads to stack overflows. If the current 
task is running in the kernel, using a significant hunk of its stack, 
an interrupt is limited to the balance of that stack. And if that 
interrupt triggers a soft irq that runs, say, a network stack, and 
that softirq handler in turn gets interrupted, we've got, 
effectively, three processes sharing the stack. And of course hard 
interrupts can be nested, so it's pretty damn difficult to specify a 
safe upper limit for stack usage.
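
To put rough numbers on that, here is a small worked sum.  Only the first 
entry comes from this thread (the 0x1198-byte frame of presto_get_fileid 
at the top of the list); the interrupt and softirq figures are invented 
for illustration, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One context that lives on the shared kernel stack. */
struct stack_user {
    const char *what;
    size_t bytes;
};

/* Task frame from the list in this thread; the rest are assumed values. */
static const struct stack_user example_chain[] = {
    { "task in kernel (presto_get_fileid frame)", 0x1198 },
    { "hard interrupt handler (assumed)",         1024   },
    { "softirq running network code (assumed)",   2048   },
    { "nested hard interrupt (assumed)",          1024   },
};
#define EXAMPLE_LEN (sizeof(example_chain) / sizeof(example_chain[0]))

/* Add up everything that ends up stacked on the same kernel stack. */
static size_t total_usage(const struct stack_user *u, size_t n)
{
    size_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += u[i].bytes;
    return sum;
}

/* Does the whole chain fit in a stack of the given size? */
static bool fits(const struct stack_user *u, size_t n, size_t stack_size)
{
    return total_usage(u, n) <= stack_size;
}
```

With these figures the chain adds up to 4504 + 1024 + 2048 + 1024 = 8600 
bytes, which does not fit in 8K even before counting anything else stored 
in the stack area.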
-- 
/Jonathan Lundell.


* Re: top stack (l)users for 2.5.69
@ 2003-05-08 14:08 Chuck Ebbert
  2003-05-08 18:04 ` Jonathan Lundell
  0 siblings, 1 reply; 68+ messages in thread
From: Chuck Ebbert @ 2003-05-08 14:08 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel

> I have no idea what a 'typical processor' might look like. But the
> thing most CPUs seem to have in common is that they save two registers
> either on the stack or into other registers that only exist for this
> purpose (SRR on PPC).
>
> Once that has happened, the OS has the job of figuring out where its
> stack (or equivalent) is located, *without* clobbering the registers.
> Once that is done, it can save all the registers on the stack,
> including SRR.

  On i386 the CPU automatically switches to the stack corresponding to
the privilege level (PL) of the interrupt handler, then pushes the
instruction pointer and flags onto that stack.  It is theoretically
possible to write unprivileged interrupt handlers by using conforming
code segments, in which case a stack switch will not occur, but such a
handler cannot touch anything but registers and stack so it's not very
useful.
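
As a rough model of that behaviour, the sketch below computes where the 
stack pointer ends up after an interrupt is delivered to a ring-0 handler. 
It is a simulation in ordinary C, not kernel code: tss_model and 
esp_after_interrupt are hypothetical names, error codes are ignored, and 
the handler is assumed to run at privilege level 0.

```c
#include <assert.h>
#include <stdint.h>

/* The only TSS field this model needs: the ring-0 stack pointer. */
struct tss_model {
    uint32_t esp0;
};

/*
 * Stack pointer after the CPU has pushed its interrupt frame.
 * cpl is the privilege level that was interrupted (0 = kernel, 3 = user).
 */
static uint32_t esp_after_interrupt(unsigned cpl, uint32_t esp,
                                    const struct tss_model *tss)
{
    const unsigned handler_pl = 0;  /* assumption: handler runs in ring 0 */

    assert(cpl <= 3);

    if (cpl > handler_pl) {
        /* Privilege change: load ESP from the TSS, then push
         * SS, ESP, EFLAGS, CS, EIP (five 32-bit words). */
        return tss->esp0 - 5 * 4;
    }
    /* Same privilege: stay on the current stack and push only
     * EFLAGS, CS, EIP (three 32-bit words). */
    return esp - 3 * 4;
}
```

The second branch is the case that matters for stack depth: an interrupt 
that arrives while already in the kernel keeps eating the same stack.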


* Re: top stack (l)users for 2.5.69
@ 2003-05-07 19:38 Chuck Ebbert
  0 siblings, 0 replies; 68+ messages in thread
From: Chuck Ebbert @ 2003-05-07 19:38 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

> Every time you switch to kernel mode either by
> calling the kernel or by a hardware interrupt, the kernel's stack
> is used.

  Almost correct: it's _a_ kernel stack, not _the_ kernel stack.

  The load_esp0() function changes this on every task switch on
i386.  If there were only one kernel stack then there would be no
need to ever do that...
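
  A stripped-down illustration of the idea, in user-space C.  The names 
with a _sketch suffix are hypothetical and only mirror the concept; the 
real load_esp0() and task switch code do more than this, and the 8K stack 
size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KSTACK_SIZE 8192  /* assumed per-task kernel stack size */

/* Each task carries its own kernel stack. */
struct task_sketch {
    unsigned char kstack[KSTACK_SIZE];
};

/* The CPU reads esp0 when it enters ring 0 from user mode. */
struct tss_sketch {
    uintptr_t esp0;
};

/* Stacks grow down, so an empty stack starts at the top address. */
static uintptr_t kstack_top(const struct task_sketch *t)
{
    return (uintptr_t)t->kstack + KSTACK_SIZE;
}

/* Point the TSS at a new ring-0 stack. */
static void load_esp0_sketch(struct tss_sketch *tss, uintptr_t esp0)
{
    tss->esp0 = esp0;
}

/* On every task switch the incoming task's stack becomes "a" kernel stack. */
static void switch_to_sketch(struct tss_sketch *tss,
                             const struct task_sketch *next)
{
    assert(tss != NULL && next != NULL);
    load_esp0_sketch(tss, kstack_top(next));
}
```

  If all tasks shared one kernel stack, switch_to_sketch() would store the 
same value every time and the update would be pointless.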


Thread overview: 68+ messages
2003-05-07 13:20 top stack (l)users for 2.5.69 Jörn Engel
2003-05-07 13:45 ` Richard B. Johnson
2003-05-07 13:56   ` Jörn Engel
2003-05-07 14:16     ` Richard B. Johnson
2003-05-07 17:13       ` Jonathan Lundell
2003-05-07 17:40         ` Richard B. Johnson
2003-05-07 18:12           ` Roland Dreier
2003-05-07 18:28             ` Richard B. Johnson
2003-05-07 18:44               ` Timothy Miller
2003-05-07 18:46               ` Roland Dreier
2003-05-07 19:30                 ` Richard B. Johnson
2003-05-07 19:42                   ` Roland Dreier
2003-05-07 20:04                     ` Richard B. Johnson
2003-05-07 20:23                       ` Roland Dreier
2003-05-07 20:42                       ` Timothy Miller
2003-05-08  9:06                         ` Jörn Engel
2003-05-08 11:33                         ` Richard B. Johnson
2003-05-08 12:00                           ` Helge Hafting
2003-05-08 15:42                           ` Timothy Miller
2003-05-09  8:57                             ` Miles Bader
2003-05-09 16:50                               ` Timothy Miller
2003-05-08 16:47                           ` Davide Libenzi
2003-05-07 18:51               ` Davide Libenzi
2003-05-07 19:22                 ` Richard B. Johnson
2003-05-07 19:31                   ` Davide Libenzi
2003-05-07 19:39                   ` Hua Zhong
2003-05-07 21:47                 ` Martin J. Bligh
2003-05-08 10:29           ` David Howells
2003-05-07 17:55         ` Jörn Engel
2003-05-07 16:20           ` Martin J. Bligh
2003-05-07 19:01         ` Dave Hansen
2003-05-07 20:06           ` Jörn Engel
2003-05-07 20:14             ` Dave Hansen
2003-05-08  8:41               ` Jörn Engel
2003-05-08 16:51                 ` Dave Hansen
2003-05-08 22:12                   ` Jörn Engel
2003-05-07 21:30         ` Jesse Pollard
2003-05-07 21:54           ` Timothy Miller
2003-05-07 22:01             ` Jesse Pollard
2003-05-07 14:33     ` Torsten Landschoff
2003-05-07 14:47       ` William Lee Irwin III
2003-05-07 15:04         ` Torsten Landschoff
2003-05-07 16:01           ` William Lee Irwin III
2003-05-08 15:36             ` Ingo Oeser
2003-05-08 18:04               ` William Lee Irwin III
2003-05-07 15:23         ` Timothy Miller
2003-05-07 15:47           ` William Lee Irwin III
2003-05-07 16:49         ` Jörn Engel
2003-05-07 17:18           ` Davide Libenzi
2003-05-07 17:40             ` Jörn Engel
2003-05-07 18:35               ` Davide Libenzi
2003-05-07 19:45                 ` Jörn Engel
2003-05-07 18:23             ` William Lee Irwin III
2003-05-07 17:38           ` William Lee Irwin III
2003-05-07 17:47             ` Jörn Engel
2003-05-07 14:49       ` Richard B. Johnson
2003-05-07 18:36   ` Linus Torvalds
2003-05-07 19:17     ` Jeff Garzik
2003-05-07 20:38       ` Randy.Dunlap
2003-05-07 21:27         ` Marcus Alanen
2003-05-07 21:27           ` Randy.Dunlap
2003-05-08 15:10         ` Ingo Oeser
2003-05-08 17:12           ` Randy.Dunlap
2003-05-07 19:38 Chuck Ebbert
2003-05-08 14:08 Chuck Ebbert
2003-05-08 18:04 ` Jonathan Lundell
2003-05-08 19:05   ` Timothy Miller
2003-05-08 21:00     ` Jonathan Lundell
