linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* force_successful_syscall_return() buggy?
@ 2003-06-15 18:36 Russell King
  2003-06-15 23:11 ` Richard Henderson
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Russell King @ 2003-06-15 18:36 UTC (permalink / raw)
  To: Linux Kernel List

While looking at the new bits'n'pieces which has appeared in 2.5.71, I
noticed the following in alpha and ia64:

#define alpha_task_regs(task) \
  ((struct pt_regs *) ((long) (task)->thread_info + 2*PAGE_SIZE) - 1)

#define force_successful_syscall_return() (alpha_task_regs(current)->r0 = 0)


# define ia64_task_regs(t)              (((struct pt_regs *) ((char *) (t) + IA64_STK_OFFSET)) - 1)

#define force_successful_syscall_return()               \
        do {                                            \
                ia64_task_regs(current)->r8 = 0;        \
        } while (0)

I don't know what happens on these architectures, but I have a suspicion
that there is a case which the above will fail, maybe with dramatic
consequences.

Consider what happens when a userspace program is started from kernel
space, eg the init(8) or hotplug programs.  In these, we call execve()
from within kernel space function.  This implies that we have some
frames already on the stack.

AFAIK, sys_execve() does not ensure that the kernel stack will be empty
before starting the user space thread, so these programs are running with
a slightly reduced kernel stack.

In turn, this means that the user registers are not stored at the top
of the kernel stack when the user space program subsequently calls a
kernel system call, which means the *_task_regs() macro doesn't point
at the saved user registers.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-15 18:36 force_successful_syscall_return() buggy? Russell King
@ 2003-06-15 23:11 ` Richard Henderson
  2003-06-16  2:30 ` Paul Mackerras
  2003-06-16 17:38 ` David Mosberger
  2 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2003-06-15 23:11 UTC (permalink / raw)
  To: Linux Kernel List

On Sun, Jun 15, 2003 at 07:36:04PM +0100, Russell King wrote:
> AFAIK, sys_execve() does not ensure that the kernel stack will be empty
> before starting the user space thread, so these programs are running with
> a slightly reduced kernel stack.

Indeed.  This is fixed in a rewrite I have of entry.S, but
that's not particularly stable at the moment...


r~

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-15 18:36 force_successful_syscall_return() buggy? Russell King
  2003-06-15 23:11 ` Richard Henderson
@ 2003-06-16  2:30 ` Paul Mackerras
  2003-06-16 17:38 ` David Mosberger
  2 siblings, 0 replies; 11+ messages in thread
From: Paul Mackerras @ 2003-06-16  2:30 UTC (permalink / raw)
  To: Russell King; +Cc: Linux Kernel List

Russell King writes:

> #define force_successful_syscall_return()               \
>         do {                                            \
>                 ia64_task_regs(current)->r8 = 0;        \
>         } while (0)
> 
> I don't know what happens on these architectures, but I have a suspicion
> that there is a case which the above will fail, maybe with dramatic
> consequences.

On PPC, I am going to use a bit in the thread_info flags field to
indicate that the current syscall should not return an error.  The
syscall entry and exit paths already look at the thread_info flags
(testing the syscall trace bit, among others) so it's convenient to
have the "no error" flag there too.  The flag bit gets cleared on
syscall entry, set by force_successful_syscall_return() and tested on
syscall exit.  The only restriction is that kernel code should not do
a system call between where it calls force_successful_syscall_return
and where it returns from the syscall.  But I don't believe we ever do
recursive system calls anyway, so that should be fine.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-15 18:36 force_successful_syscall_return() buggy? Russell King
  2003-06-15 23:11 ` Richard Henderson
  2003-06-16  2:30 ` Paul Mackerras
@ 2003-06-16 17:38 ` David Mosberger
  2003-06-16 17:55   ` Russell King
  2 siblings, 1 reply; 11+ messages in thread
From: David Mosberger @ 2003-06-16 17:38 UTC (permalink / raw)
  To: Russell King; +Cc: Linux Kernel List

>>>>> On Sun, 15 Jun 2003 19:36:04 +0100, Russell King <rmk@arm.linux.org.uk> said:

  Russell> Consider what happens when a userspace program is started
  Russell> from kernel space, eg the init(8) or hotplug programs.  In
  Russell> these, we call execve() from within kernel space function.
  Russell> This implies that we have some frames already on the stack.

  Russell> AFAIK, sys_execve() does not ensure that the kernel stack
  Russell> will be empty before starting the user space thread, so
  Russell> these programs are running with a slightly reduced kernel
  Russell> stack.

  Russell> In turn, this means that the user registers are not stored
  Russell> at the top of the kernel stack when the user space program
  Russell> subsequently calls a kernel system call, which means the
  Russell> *_task_regs() macro doesn't point at the saved user
  Russell> registers.

That's a limitation that was described in the change log entry:

	The only limitation of force_successful_syscall_return() is
	that it doesn't help with system calls performed by the
	kernel.  But the kernel does that so rarely and for such a
	limited set of syscalls that this is not a real problem.

Alpha and ia64 have used pt_regs for "force-success" purposes for a
long time, but if you want to add support to another platform, I'd
also recommend using the task_info instead.  Dave Miller proposed a
scheme that could work nicely: force_success flag is off by default
and turned on by force_successful_syscall_return().  When you
"consume" the flag in the syscall exit path, you turn it off
afterwards.  The advantage of this scheme is that you don't have any
extra stores on the performance-critical syscall entry path.  The
disadvantage is that you'll have to check the flag even when the
return-value is non-negative (probably not much of a disadvantage if
you store the flag along with the other flags).

Having said that, there is one real problem: Linus pointed out that
the current scheme may be in correct for certain platforms: if
sizeof(loff_t) > sizeof(off_t) then drivers/char/mem.c:memory_lseek()
may do force_successful_syscall_return() yet sys_lseek() would have to
return EOVERFLOW if it turns out that the offset doesn't fit in off_t.
I think this would only affect 32-bit platforms with some sort of
physical address-extensions.  So it would affect x86, but perhaps
nothing else (but of course, x86 doesn't use
force_successful_syscall() anyhow, so for now, that's OK).

The clean way to fix this would probably be to return the offset via a
pointer argument.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-16 17:38 ` David Mosberger
@ 2003-06-16 17:55   ` Russell King
  2003-06-16 18:02     ` David S. Miller
  2003-06-16 18:25     ` David Mosberger
  0 siblings, 2 replies; 11+ messages in thread
From: Russell King @ 2003-06-16 17:55 UTC (permalink / raw)
  To: davidm; +Cc: Linux Kernel List

On Mon, Jun 16, 2003 at 10:38:27AM -0700, David Mosberger wrote:
> >>>>> On Sun, 15 Jun 2003 19:36:04 +0100, Russell King <rmk@arm.linux.org.uk> said:
> 
>   Russell> Consider what happens when a userspace program is started
>   Russell> from kernel space, eg the init(8) or hotplug programs.  In
>   Russell> these, we call execve() from within kernel space function.
>   Russell> This implies that we have some frames already on the stack.
> 
>   Russell> AFAIK, sys_execve() does not ensure that the kernel stack
>   Russell> will be empty before starting the user space thread, so
>   Russell> these programs are running with a slightly reduced kernel
>   Russell> stack.
> 
>   Russell> In turn, this means that the user registers are not stored
>   Russell> at the top of the kernel stack when the user space program
>   Russell> subsequently calls a kernel system call, which means the
>   Russell> *_task_regs() macro doesn't point at the saved user
>   Russell> registers.
> 
> That's a limitation that was described in the change log entry:
> 
> 	The only limitation of force_successful_syscall_return() is
> 	that it doesn't help with system calls performed by the
> 	kernel.  But the kernel does that so rarely and for such a
> 	limited set of syscalls that this is not a real problem.

I'm not actually talking about subsequent syscalls issued by the kernel.
I'm talking about stuff like init, bash, and the module tools.  If
any of these call any of the affected syscalls which expect user
registers to be at the top of the kernel stack, they'll be accessing
the wrong data.  A corrected comment would be:

 	The only limitation of force_successful_syscall_return() is
 	that it doesn't help with system calls performed by the
 	kernel or user threads exec'd from the kernel.

I'd suggest such a comment is added to force_successful_syscall_return()
to ensure that anyone thinking it'll work for all user space processes
is sufficiently deterred.

> Alpha and ia64 have used pt_regs for "force-success" purposes for a
> long time, but if you want to add support to another platform, I'd
> also recommend using the task_info instead.

Oh, I'm currently not thinking about implementing this on ARM; this
touches a similar area which I investigated a number of months ago.
I through out the idea of accessing user registers for user space
programs at the top of the kernel stack because it does not work for
processes exec'd from kernel space.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-16 17:55   ` Russell King
@ 2003-06-16 18:02     ` David S. Miller
  2003-06-16 18:25     ` David Mosberger
  1 sibling, 0 replies; 11+ messages in thread
From: David S. Miller @ 2003-06-16 18:02 UTC (permalink / raw)
  To: Russell King; +Cc: davidm, Linux Kernel List

On Mon, 2003-06-16 at 10:55, Russell King wrote:
> I'm not actually talking about subsequent syscalls issued by the kernel.
> I'm talking about stuff like init, bash, and the module tools.

Wrong, after the go for the first time into user space, the
next trap into the kernel will put the pt_regs at the top at
the stack where we expect it to be.

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-16 17:55   ` Russell King
  2003-06-16 18:02     ` David S. Miller
@ 2003-06-16 18:25     ` David Mosberger
  1 sibling, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-16 18:25 UTC (permalink / raw)
  To: Russell King; +Cc: Linux Kernel List

>>>>> On Mon, 16 Jun 2003 18:55:49 +0100, Russell King <rmk@arm.linux.org.uk> said:

  Russell> I through out the idea of accessing user registers for user space
  Russell> programs at the top of the kernel stack because it does not work for
  Russell> processes exec'd from kernel space.

It doesn't work for the execve() itself, but it works for all
subsequent syscalls.  force_successful_syscall() not working for
execve() is of course not a problem, since it returns only on error.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-17 19:01     ` David Mosberger
  2003-06-17 18:58       ` David S. Miller
@ 2003-06-19 17:35       ` Richard Henderson
  1 sibling, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2003-06-19 17:35 UTC (permalink / raw)
  To: davidm; +Cc: Aneesh Kumar K.V, David S. Miller, Russell King, Linux Kernel List

On Tue, Jun 17, 2003 at 12:01:12PM -0700, David Mosberger wrote:
>   Aneesh> I was facing a simillar problem with ptrace on Alpha (ptrace
>   Aneesh> on alpha expect the pt_regs at current + 2*PAGE_SIZE for
>   Aneesh> 2.4. kernel ) w.r.t www.openssi.org project. What i found
>   Aneesh> was that even after we return to user space subsequent
>   Aneesh> syscalls are not putting pt_regs at that offset. I guess
>   Aneesh> while entering the kernel kernel stack pointer always point
>   Aneesh> to value stored in thread_struct.ksp ?
> 
> If a platform doesn't start with an empty kernel stack on entry from
> user-space, that platform will be wasting (precious) stack space and
> ptrace() most likely won't work reliably.  Personally, I'd consider
> such behavior a bug...

So would I.  It's now fixed.


r~

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-17  6:54   ` Aneesh Kumar K.V
@ 2003-06-17 19:01     ` David Mosberger
  2003-06-17 18:58       ` David S. Miller
  2003-06-19 17:35       ` Richard Henderson
  0 siblings, 2 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-17 19:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: David S. Miller, Russell King, davidm, Linux Kernel List

>>>>> On Tue, 17 Jun 2003 12:24:23 +0530, "Aneesh Kumar K.V" <aneesh.kumar@digital.com> said:

  Aneesh> I was facing a simillar problem with ptrace on Alpha (ptrace
  Aneesh> on alpha expect the pt_regs at current + 2*PAGE_SIZE for
  Aneesh> 2.4. kernel ) w.r.t www.openssi.org project. What i found
  Aneesh> was that even after we return to user space subsequent
  Aneesh> syscalls are not putting pt_regs at that offset. I guess
  Aneesh> while entering the kernel kernel stack pointer always point
  Aneesh> to value stored in thread_struct.ksp ?

If a platform doesn't start with an empty kernel stack on entry from
user-space, that platform will be wasting (precious) stack space and
ptrace() most likely won't work reliably.  Personally, I'd consider
such behavior a bug, but I suppose it is to some degree a
platform-choice.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
  2003-06-17 19:01     ` David Mosberger
@ 2003-06-17 18:58       ` David S. Miller
  2003-06-19 17:35       ` Richard Henderson
  1 sibling, 0 replies; 11+ messages in thread
From: David S. Miller @ 2003-06-17 18:58 UTC (permalink / raw)
  To: davidm, davidm; +Cc: aneesh.kumar, rmk, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Tue, 17 Jun 2003 12:01:12 -0700
   
   Personally, I'd consider such behavior a bug,

Me too.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: force_successful_syscall_return() buggy?
       [not found] ` <fa.gvpfoqi.ngk8p2@ifi.uio.no>
@ 2003-06-17  6:54   ` Aneesh Kumar K.V
  2003-06-17 19:01     ` David Mosberger
  0 siblings, 1 reply; 11+ messages in thread
From: Aneesh Kumar K.V @ 2003-06-17  6:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: Russell King, davidm, Linux Kernel List

David S. Miller wrote:
> On Mon, 2003-06-16 at 10:55, Russell King wrote:
> 
>>I'm not actually talking about subsequent syscalls issued by the kernel.
>>I'm talking about stuff like init, bash, and the module tools.
> 
> 
> Wrong, after the go for the first time into user space, the
> next trap into the kernel will put the pt_regs at the top at
> the stack where we expect it to be.
> 

I was facing a simillar problem with ptrace on Alpha (ptrace on alpha 
expect the pt_regs at current + 2*PAGE_SIZE for 2.4. kernel  ) w.r.t 
www.openssi.org project. What i found was that  even after we return to 
user space subsequent syscalls are not putting pt_regs at that offset. I 
guess while entering the kernel kernel stack pointer always point to 
value stored in thread_struct.ksp ?

-aneesh



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2003-06-19 17:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-06-15 18:36 force_successful_syscall_return() buggy? Russell King
2003-06-15 23:11 ` Richard Henderson
2003-06-16  2:30 ` Paul Mackerras
2003-06-16 17:38 ` David Mosberger
2003-06-16 17:55   ` Russell King
2003-06-16 18:02     ` David S. Miller
2003-06-16 18:25     ` David Mosberger
     [not found] <fa.it5uct2.s4s8om@ifi.uio.no>
     [not found] ` <fa.gvpfoqi.ngk8p2@ifi.uio.no>
2003-06-17  6:54   ` Aneesh Kumar K.V
2003-06-17 19:01     ` David Mosberger
2003-06-17 18:58       ` David S. Miller
2003-06-19 17:35       ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).