* timer + fpu stuff locks my console race
@ 2004-06-09 21:02 stian
  2004-06-10 21:00 ` Matias Hermanrud Fjeld
  2004-06-12  2:53 ` Rik van Riel
  0 siblings, 2 replies; 26+ messages in thread
From: stian @ 2004-06-09 21:02 UTC (permalink / raw)
  To: linux-kernel

Please keep me in CC, as I'm not on the mailing list. I'm currently on
vacation, so I can't hook my Linux box up to the Internet, but I came across
a race condition in the "old" 2.4.26-rc1 vanilla kernel.

I was doing some code tests when I came across a problem: my program
locks up my console (even X, if I'm using an xterm).

I think gcc is what triggers the problem in the first place, so the full report is here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15905

If you need more details about versions or any other information, please let
me know. It currently triggers on every attempt on my box (and that machine
has no Internet connection for the time being).



Stian Skjelstad

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: timer + fpu stuff locks my console race
  2004-06-09 21:02 timer + fpu stuff locks my console race stian
@ 2004-06-10 21:00 ` Matias Hermanrud Fjeld
  2004-06-11  6:08   ` Lars Age Kamfjord
  2004-06-12  2:53 ` Rik van Riel
  1 sibling, 1 reply; 26+ messages in thread
From: Matias Hermanrud Fjeld @ 2004-06-10 21:00 UTC (permalink / raw)
  To: linux-kernel


ACK

mhf@bilbo:~$ uname -a 
Linux bilbo 2.6.6-1-k7 #1 Wed May 12 18:19:40 EST 2004 i686 GNU/Linux

-- 
Matias Hermanrud Fjeld
http://www.hex.no/mhf




* Re: timer + fpu stuff locks my console race
  2004-06-10 21:00 ` Matias Hermanrud Fjeld
@ 2004-06-11  6:08   ` Lars Age Kamfjord
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Age Kamfjord @ 2004-06-11  6:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: stian

This bug seems VERY serious. Every machine I've tested so far has
locked up completely, and it happens with every 2.4 and 2.6 kernel
version I've tested. A 2.2.19 kernel did not crash, so it seems to be
a bug in 2.4 and later. Somebody really should look at this.


Lars Age Kamfjord


* Re: timer + fpu stuff locks my console race
  2004-06-09 21:02 timer + fpu stuff locks my console race stian
  2004-06-10 21:00 ` Matias Hermanrud Fjeld
@ 2004-06-12  2:53 ` Rik van Riel
  2004-06-12  3:50   ` Rik van Riel
  2004-06-12  4:35   ` timer + fpu stuff locks my console race Matt Mackall
  1 sibling, 2 replies; 26+ messages in thread
From: Rik van Riel @ 2004-06-12  2:53 UTC (permalink / raw)
  To: stian; +Cc: linux-kernel

On Wed, 9 Jun 2004 stian@nixia.no wrote:

> I was doing some code tests when I came across a problem: my program
> locks up my console (even X, if I'm using an xterm).

Reproduced here, on my test system running a 2.6 kernel.
I did get a kernel backtrace over serial console, though ;)

Pid: 19752, comm:      kernel-hang-bz1
EIP: 0060:[<ffff345c>] CPU: 0
EIP is at 0xffff345c
 EFLAGS: 00000202    Not tainted  (2.6.5-1.332)
EAX: 00000001 EBX: 12005870 ECX: fef32ea8 EDX: 1958f000
ESI: 1958f000 EDI: fef32ea8 EBP: fef32e48 DS: 007b ES: 007b
CR0: 80050033 CR2: 00c4b720 CR3: 003ab000 CR4: 000006d0
Call Trace:
 [<0210dcda>] restore_i387_fxsave+0x18/0x60
 [<0210dd38>] restore_i387+0x16/0x65
 [<021059e5>] restore_sigcontext+0xf2/0x10c
 [<0215b737>] get_user_size+0x30/0x57
 [<02105c13>] sys_sigreturn+0x214/0x23a


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: timer + fpu stuff locks my console race
  2004-06-12  2:53 ` Rik van Riel
@ 2004-06-12  3:50   ` Rik van Riel
  2004-06-12 13:44     ` Sergey Vlasov
  2004-06-12  4:35   ` timer + fpu stuff locks my console race Matt Mackall
  1 sibling, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2004-06-12  3:50 UTC (permalink / raw)
  To: stian; +Cc: linux-kernel

On Fri, 11 Jun 2004, Rik van Riel wrote:

> Reproduced here, on my test system running a 2.6 kernel.
> I did get a kernel backtrace over serial console, though ;)

With a 2.4 kernel I get a similar stack trace (also
from alt-sysrq-p output):

Pid/TGid: 3815/3815, comm:      kernel-hang-bz1
EIP: 0060:[<c03ec1cc>] CPU: 0
EIP is at coprocessor_error [kernel] 0x0 (2.4.21-15.5.ELsmp)
 ESP: 0060:c0113d14 EFLAGS: 00000206    Not tainted
EAX: 00100000 EBX: bfffc888 ECX: bfffc888 EDX: d9818000
ESI: bfffc888 EDI: d9819fb0 EBP: bfffc830 DS: 0068 ES: 0068 FS: 0000 GS: 0033
CR0: 80050033 CR2: b7566720 CR3: 02553380 CR4: 000006f0
Call Trace:   [<c0113d14>] restore_i387_fxsave [kernel] 0x24 (0xd9819ee4)
[<c0113de8>] restore_i387 [kernel] 0x78 (0xd9819f04)
[<c010b40e>] restore_sigcontext [kernel] 0x10e (0xd9819f18)
[<c010b51d>] sys_sigreturn [kernel] 0xed (0xd9819f94)

Now I'm not sure if the process is actually stuck in kernel
space or if it's looping tightly through both kernel and
user space...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: timer + fpu stuff locks my console race
  2004-06-12  2:53 ` Rik van Riel
  2004-06-12  3:50   ` Rik van Riel
@ 2004-06-12  4:35   ` Matt Mackall
  1 sibling, 0 replies; 26+ messages in thread
From: Matt Mackall @ 2004-06-12  4:35 UTC (permalink / raw)
  To: Rik van Riel; +Cc: stian, linux-kernel

On Fri, Jun 11, 2004 at 10:53:48PM -0400, Rik van Riel wrote:
> On Wed, 9 Jun 2004 stian@nixia.no wrote:
> 
> > I was doing some code tests when I came across a problem: my program
> > locks up my console (even X, if I'm using an xterm).
> 
> Reproduced here, on my test system running a 2.6 kernel.
> I did get a kernel backtrace over serial console, though ;)

I stuck some strategic printks in the kernel. The example code's bogus
asm generates an FPU fault at the frstor in its signal handler, which
bumps us into math_error -> force_sig_info ->
specific_send_sig_info. Then we hit:

        if (LEGACY_QUEUE(&t->pending, sig))

which decides we don't need to send the signal after all and we bail
all the way back out and recurse.
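
The check can be pictured with a small user-space model. This is only a
sketch, assuming that LEGACY_QUEUE() means "a non-real-time signal with this
number is already pending" (the 2.4 form of the same test is spelled out in
the 2.4 hotfix in this thread). The names model_pending,
model_legacy_queued and model_send_signal are made up for illustration, and
the value 32 for the first real-time signal is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SIGFPE   8   /* the thread's hotfix tests sig==8 */
#define MODEL_SIGRTMIN 32  /* assumed number of the first real-time signal */

/* Hypothetical stand-in for the pending-signal set: bit (sig - 1) is set
 * while signal number sig is pending. */
struct model_pending {
    uint64_t mask;
};

/* True if a classic (non-real-time) signal is already pending. */
static bool model_legacy_queued(const struct model_pending *p, int sig)
{
    assert(sig >= 1 && sig <= 64);
    return sig < MODEL_SIGRTMIN &&
           (p->mask & (UINT64_C(1) << (sig - 1))) != 0;
}

/* Returns true if the signal was newly queued, false if it was dropped
 * because the same classic signal was already pending. */
static bool model_send_signal(struct model_pending *p, int sig)
{
    if (model_legacy_queued(p, sig))
        return false;               /* the "goto out" path */
    p->mask |= UINT64_C(1) << (sig - 1);
    return true;
}
```

In this model the first SIGFPE is queued and every later one is dropped
until the pending bit is cleared, which never happens if the task does not
get back to user space to take the signal.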

-- 
Mathematics is the supreme nostalgia of our time.


* Re: timer + fpu stuff locks my console race
  2004-06-12  3:50   ` Rik van Riel
@ 2004-06-12 13:44     ` Sergey Vlasov
  2004-06-12 13:57       ` stian
  2004-06-12 14:25       ` timer + fpu stuff locks up computer Alexander Nyberg
  0 siblings, 2 replies; 26+ messages in thread
From: Sergey Vlasov @ 2004-06-12 13:44 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, stian


On Fri, 11 Jun 2004 23:50:25 -0400, Rik van Riel wrote:

> On Fri, 11 Jun 2004, Rik van Riel wrote:
> 
>> Reproduced here, on my test system running a 2.6 kernel.
>> I did get a kernel backtrace over serial console, though ;)
> 
> Now I'm not sure if the process is actually stuck in kernel
> space or if it's looping tightly through both kernel and
> user space...

Here is the culprit (include/asm-i386/i387.h):

#define __clear_fpu( tsk )					\
do {								\
	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
		asm volatile("fwait");				\
		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
		stts();						\
	}							\
} while (0)

This is called in flush_thread() (which is used in flush_old_exec()
and therefore in sys_execve() path) and in restore_i387_fsave(),
restore_i387_fxsave() (which are reached from sys_sigreturn() and
sys_rt_sigreturn()).

The buggy code in Stian's program corrupts the FPU state: in
particular, it leaves some exception bits set in the FPU
status word.  In this state the next FP instruction (except non-waiting
instructions, like fnsave and fninit) will raise the FP error exception
(trap 16).  The "fwait" above happens to be that next instruction.

The FP error handler do_coprocessor_error() calls math_error() for the
real work (both in arch/i386/kernel/traps.c).  math_error() calls
save_init_fpu(), which saves the FPU state in current->thread.i387 and
sets the TS flag; then math_error() queues a SIGFPE to the task and
returns.  If the fault comes from userspace, this is enough: on the
return path the pending signal will be noticed and delivered.
However, in this case the fault happens in kernel code, so
execution just resumes at the same point and re-executes that
fwait.

At this point, however, the TS flag is set, so we get another trap:
trap 7, device_not_available.  The trap handler calls
math_state_restore(), which clears the TS flag and reloads the FP
state from current->thread.i387.  Then it returns, and the faulting
instruction is restarted again.  But it gets the same FP error
exception as the first time...

So the CPU is stuck handling endless faults in kernel mode.
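
The cycle described above can be replayed with a toy state machine. This is
a sketch under stated assumptions, not kernel code: the FPU is reduced to
three booleans, saving the state is assumed to leave the live FPU without a
pending exception, and fwait is assumed to take trap 7 whenever TS is set.
All names (fpu_model, model_fwait and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fpu_model {
    bool live_pending;   /* exception bit pending in the live FPU */
    bool saved_pending;  /* same bit in the saved copy (thread.i387) */
    bool ts;             /* the TS flag */
};

enum trap { TRAP_NONE = 0, TRAP_NM = 7, TRAP_MF = 16 };

/* What one attempt to execute fwait does in this model. */
static enum trap model_fwait(const struct fpu_model *f)
{
    if (f->ts)
        return TRAP_NM;          /* device_not_available */
    if (f->live_pending)
        return TRAP_MF;          /* coprocessor error */
    return TRAP_NONE;
}

/* Trap 16: save the state (pending bit included), then set TS.  A SIGFPE
 * is queued too, but it is only delivered on return to user space. */
static void model_math_error(struct fpu_model *f)
{
    f->saved_pending = f->live_pending;
    f->live_pending = false;
    f->ts = true;
}

/* Trap 7: clear TS and reload the saved state, pending bit included. */
static void model_math_state_restore(struct fpu_model *f)
{
    f->ts = false;
    f->live_pending = f->saved_pending;
}

/* Retries fwait the way the CPU restarts a faulting kernel instruction.
 * Returns the number of traps taken before fwait completes, or -1 if it
 * is still faulting after max_traps traps. */
static int model_retry_fwait(struct fpu_model *f, int max_traps)
{
    assert(f != NULL && max_traps > 0);
    for (int traps = 0; ; traps++) {
        enum trap t = model_fwait(f);
        if (t == TRAP_NONE)
            return traps;
        if (traps == max_traps)
            return -1;
        if (t == TRAP_MF)
            model_math_error(f);
        else
            model_math_state_restore(f);
    }
}
```

Starting from a state with live_pending set, the model alternates between
trap 16 and trap 7 and never completes the fwait, however large max_traps
is. A state with only TS set takes a single trap 7 and then proceeds.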

How to fix this?  A quick and dirty fix is to remove the problematic
fwait from __clear_fpu(); 2.2.x kernels did not have it - probably it
was added in some 2.3.x.

--- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
+++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
@@ -51,7 +51,6 @@
 #define __clear_fpu( tsk )					\
 do {								\
 	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
-		asm volatile("fwait");				\
 		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
 		stts();						\
 	}							\

In this case we will ignore a pending FP exception at execve() or
sigreturn() instead of raising SIGFPE (which was probably intended by
whoever put an fwait there).

If we want to be pedantic and care about such pending exceptions, we
should add a check for kernel addresses to do_coprocessor_error() and
add fixup_exception there, like we do for protection faults, so that
the handler will not attempt to restart the failing instruction again.
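
The pedantic variant follows the exception-table pattern: when a fault
comes from kernel code, look the faulting address up in a table and resume
at a fixup address instead of restarting the instruction. A minimal
user-space sketch of that idea, assuming a table of (faulting instruction,
fixup) address pairs. Every name here is hypothetical, and this is an
illustration of the pattern rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry: if the instruction at insn faults, continue at fixup. */
struct model_extable_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* The part of the saved register state this sketch needs. */
struct model_regs {
    unsigned long ip;   /* address of the faulting instruction */
    bool user_mode;     /* true if the fault came from user space */
};

static const struct model_extable_entry *
model_search_extable(const struct model_extable_entry *tbl, size_t n,
                     unsigned long ip)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].insn == ip)
            return &tbl[i];
    return NULL;
}

/* Returns true if the fault was redirected to a fixup address.  User-mode
 * faults are left alone: there the queued signal is delivered on the way
 * back to user space. */
static bool model_fixup_kernel_fault(struct model_regs *regs,
                                     const struct model_extable_entry *tbl,
                                     size_t n)
{
    assert(regs != NULL);
    if (regs->user_mode)
        return false;
    const struct model_extable_entry *e =
        model_search_extable(tbl, n, regs->ip);
    if (e == NULL)
        return false;
    regs->ip = e->fixup;   /* resume after the fwait, not at it */
    return true;
}
```

With an entry covering the fwait, a kernel-mode fault at that address
resumes at the fixup and the instruction is not restarted.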



* Re: timer + fpu stuff locks my console race
  2004-06-12 13:44     ` Sergey Vlasov
@ 2004-06-12 13:57       ` stian
  2004-06-12 14:28         ` Sergey Vlasov
  2004-06-12 14:25       ` timer + fpu stuff locks up computer Alexander Nyberg
  1 sibling, 1 reply; 26+ messages in thread
From: stian @ 2004-06-12 13:57 UTC (permalink / raw)
  To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian

> --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
> @@ -51,7 +51,6 @@
>  #define __clear_fpu( tsk )					\
>  do {								\
>  	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
> -		asm volatile("fwait");				\
>  		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
>  		stts();						\
>  	}							\

But what about task switching and FPU exceptions that come in late? I
know that the kernel does not use the FPU in general, and in the places
where it does, it is all wrapped in fsave, fwait and frstor within kernel space.


Stian Skjelstad


* Re: timer + fpu stuff locks up computer
  2004-06-12 13:44     ` Sergey Vlasov
  2004-06-12 13:57       ` stian
@ 2004-06-12 14:25       ` Alexander Nyberg
  2004-06-12 14:42         ` stian
  2004-06-12 15:14         ` Sergey Vlasov
  1 sibling, 2 replies; 26+ messages in thread
From: Alexander Nyberg @ 2004-06-12 14:25 UTC (permalink / raw)
  To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian

On Sat, 2004-06-12 at 15:44, Sergey Vlasov wrote:
> On Fri, 11 Jun 2004 23:50:25 -0400, Rik van Riel wrote:
> 
> > On Fri, 11 Jun 2004, Rik van Riel wrote:
> > 
> >> Reproduced here, on my test system running a 2.6 kernel.
> >> I did get a kernel backtrace over serial console, though ;)
> > 
> > Now I'm not sure if the process is actually stuck in kernel
> > space or if it's looping tightly through both kernel and
> > user space...
> 
> --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
> @@ -51,7 +51,6 @@
>  #define __clear_fpu( tsk )					\
>  do {								\
>  	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
> -		asm volatile("fwait");				\
>  		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
>  		stts();						\
>  	}							\

Sorry for this extremely informative mail, but it doesn't work.

Looks like the problem is only being delayed:

Pid: 431, comm:                 sshd
EIP: 0060:[<c0119f98>] CPU: 0
EIP is at force_sig_info+0x48/0x80
 EFLAGS: 00000286    Not tainted  (2.6.7-rc3-mm1)
EAX: 00000000 EBX: de96d7d0 ECX: 00000007 EDX: 00000008
ESI: 00000008 EDI: 00000286 EBP: de9e3dd4 DS: 007b ES: 007b
CR0: 8005003b CR2: 080b2664 CR3: 1f48f000 CR4: 000002d0
 [<c0105560>] do_coprocessor_error+0x0/0x20
 [<c01054f2>] math_error+0xb2/0x120
 [<c01d2bb8>] fast_clear_page+0x8/0x50
 [<c0105de3>] do_IRQ+0x113/0x150
 [<c0105de3>] do_IRQ+0x113/0x150
 [<c0105de3>] do_IRQ+0x113/0x150
 [<c0104398>] common_interrupt+0x18/0x20
 [<c0109ed5>] restore_fpu+0x15/0x20
 [<c0104435>] error_code+0x2d/0x38
 [<c01d2bb8>] fast_clear_page+0x8/0x50
 [<c013286e>] do_anonymous_page+0x8e/0x140
 [<c0132979>] do_no_page+0x59/0x290
 [<c0132d5e>] handle_mm_fault+0xbe/0x120
 [<c010e5b4>] do_page_fault+0x134/0x506
 [<c010fd90>] default_wake_function+0x0/0x10
 [<c01f4f6a>] tty_read+0xaa/0xf0
 [<c014dd3d>] sys_select+0x22d/0x490
 [<c013e583>] vfs_read+0xc3/0x100
 [<c011b0ac>] sigprocmask+0x4c/0xb0
 [<c010e480>] do_page_fault+0x0/0x506
 [<c0104435>] error_code+0x2d/0x38



* Re: timer + fpu stuff locks my console race
  2004-06-12 13:57       ` stian
@ 2004-06-12 14:28         ` Sergey Vlasov
  0 siblings, 0 replies; 26+ messages in thread
From: Sergey Vlasov @ 2004-06-12 14:28 UTC (permalink / raw)
  To: stian; +Cc: Rik van Riel, linux-kernel


On Sat, Jun 12, 2004 at 03:57:42PM +0200, stian@nixia.no wrote:
> > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> > +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
> > @@ -51,7 +51,6 @@
> >  #define __clear_fpu( tsk )					\
> >  do {								\
> >  	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
> > -		asm volatile("fwait");				\
> >  		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
> >  		stts();						\
> >  	}							\
> 
> But what about task switching and FPU exceptions that come in late? I
> know that the kernel does not use the FPU in general, and in the places
> where it does, it is all wrapped in fsave, fwait and frstor within kernel space.

Kernel code which uses FPU should call kernel_fpu_begin() before it
and kernel_fpu_end() after.  kernel_fpu_begin() is safe - it uses
fnsave or fxsave, both of which don't raise pending FPU exceptions.
Also fnsave performs implicit fninit, and fxsave is followed by
fnclex, which clears pending exceptions.

However, raid6_before_mmx() [drivers/md/raid6x86.h] seems to be buggy:

static inline void raid6_before_mmx(raid6_mmx_save_t *s)
{
	s->cr0 = raid6_get_fpu();
	asm volatile("fsave %0 ; fwait" : "=m" (s->fsave[0]));
}

fsave will raise pending exceptions (unlike fnsave).



* Re: timer + fpu stuff locks up computer
  2004-06-12 14:25       ` timer + fpu stuff locks up computer Alexander Nyberg
@ 2004-06-12 14:42         ` stian
  2004-06-12 15:20           ` martin capitanio
  2004-06-12 15:14         ` Sergey Vlasov
  1 sibling, 1 reply; 26+ messages in thread
From: stian @ 2004-06-12 14:42 UTC (permalink / raw)
  To: Alexander Nyberg; +Cc: linux-kernel, Sergey Vlasov, Rik van Riel

> Sorry for this extremely informative mail, but it doesn't work.
>
> Looks like the problem is only being delayed:

Makes sense: fwait is executed in kernel mode, and since it is a slow
instruction it takes some time for the exception to be raised, so the
problem only gets delayed. What do you think, Sergey?

Does the other dirty nasty patch work for you?


Stian Skjelstad


* Re: timer + fpu stuff locks up computer
  2004-06-12 14:25       ` timer + fpu stuff locks up computer Alexander Nyberg
  2004-06-12 14:42         ` stian
@ 2004-06-12 15:14         ` Sergey Vlasov
  2004-06-12 18:45           ` Sergey Vlasov
  1 sibling, 1 reply; 26+ messages in thread
From: Sergey Vlasov @ 2004-06-12 15:14 UTC (permalink / raw)
  To: Alexander Nyberg; +Cc: Rik van Riel, linux-kernel, stian


On Sat, Jun 12, 2004 at 04:25:51PM +0200, Alexander Nyberg wrote:
> > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> > +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
> > @@ -51,7 +51,6 @@
> >  #define __clear_fpu( tsk )					\
> >  do {								\
> >  	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
> > -		asm volatile("fwait");				\
> >  		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
> >  		stts();						\
> >  	}							\
> 
> Sorry for this extremely informative mail, but it doesn't work.
> 
> Looks like the problem is only being delayed:
> 
> Pid: 431, comm:                 sshd
> EIP: 0060:[<c0119f98>] CPU: 0
> EIP is at force_sig_info+0x48/0x80
>  EFLAGS: 00000286    Not tainted  (2.6.7-rc3-mm1)
> EAX: 00000000 EBX: de96d7d0 ECX: 00000007 EDX: 00000008
> ESI: 00000008 EDI: 00000286 EBP: de9e3dd4 DS: 007b ES: 007b
> CR0: 8005003b CR2: 080b2664 CR3: 1f48f000 CR4: 000002d0
>  [<c0105560>] do_coprocessor_error+0x0/0x20
>  [<c01054f2>] math_error+0xb2/0x120
>  [<c01d2bb8>] fast_clear_page+0x8/0x50
...

Grrr.  I was testing on a fairly generic kernel configuration which
did not include fast_clear_page()...

If the FPU state belongs to the userspace process, kernel_fpu_begin()
is safe even if some exceptions are pending.  However, after
__clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does
nothing with it.

Replacing fwait with fnclex instead of removing it completely should
avoid the fault later.  However, it looks like we really need the proper
fix: teach do_coprocessor_error() to recognize kernel mode faults and
fix them up.
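
The three options for __clear_fpu() discussed so far (keep the fwait, drop
it, replace it with fnclex) can be compared with a toy model. This is a
sketch, assuming only what the thread describes: fwait reports a pending
exception, fnclex discards it, and once TS_USEDFPU is clear
kernel_fpu_begin() leaves the live FPU contents alone. The enum and
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three variants of __clear_fpu(). */
enum clear_variant { CLEAR_FWAIT, CLEAR_NOTHING, CLEAR_FNCLEX };

enum outcome {
    OUTCOME_OK,
    OUTCOME_FAULT_IN_CLEAR_FPU,
    OUTCOME_FAULT_IN_LATER_KERNEL_FPU_USE
};

/* What happens to a task that reaches __clear_fpu() with or without a
 * pending FP exception, followed later by kernel code using the FPU. */
static enum outcome model_clear_fpu(bool exception_pending,
                                    enum clear_variant v)
{
    assert(v == CLEAR_FWAIT || v == CLEAR_NOTHING || v == CLEAR_FNCLEX);

    /* fwait reports the pending exception, and does so in kernel code. */
    if (v == CLEAR_FWAIT && exception_pending)
        return OUTCOME_FAULT_IN_CLEAR_FPU;

    /* fnclex clears the exception flags without reporting them. */
    if (v == CLEAR_FNCLEX)
        exception_pending = false;

    /* The FPU is now "orphaned": nothing saves or reinitialises it, so
     * the next kernel FPU/MMX instruction that checks for pending
     * exceptions sees whatever was left behind. */
    return exception_pending ? OUTCOME_FAULT_IN_LATER_KERNEL_FPU_USE
                             : OUTCOME_OK;
}
```

Under these assumptions only the fnclex variant avoids a kernel-mode fault
when an exception is pending, at the cost of never reporting it.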



* Re: timer + fpu stuff locks up computer
  2004-06-12 14:42         ` stian
@ 2004-06-12 15:20           ` martin capitanio
  2004-06-12 16:15             ` stian
  0 siblings, 1 reply; 26+ messages in thread
From: martin capitanio @ 2004-06-12 15:20 UTC (permalink / raw)
  To: stian; +Cc: linux-kernel

On Saturday 12 June 2004 16:42, stian@nixia.no wrote:
> 
> Does the other dirty nasty patch work for you?

ACK for 2.6.7-rc4-mm1 (gcc version 3.3.3):
user$ ./evil
freezes completely

--- linux-2.6.6-rc3-mm1/kernel/signal.c 2004-06-09 18:36:12.000000000 +0200
+++ linux-2.6.6-rc3-mm1-fpuhotfix/kernel/signal.c       2004-06-12 18:10:31.573001808 +0200
@@ -799,7 +799,15 @@
           can get more detailed information about the cause of
           the signal. */
        if (LEGACY_QUEUE(&t->pending, sig))
+       {
+           if (sig==8)
+           {
+               printk("Attempt to exploit known bug, process=%s pid=%p uid=%d\n",
+                   t->comm, t->pid, t->uid);
+               do_exit(0);
+           }
            goto out;
+       }

        ret = send_signal(sig, info, t, &t->pending);
        if (!ret && !sigismember(&t->blocked, sig))

2.6.7-rc4-mm1-fpuhotfix:
user$ ./evil
........................*...............................................
......................*
Attempt to exploit known bug, process=evil pid=00000aa6 uid=1000
note: evil[2726] exited with preempt_count 2
bad: scheduling while atomic!
 [<c032a045>] schedule+0x4b5/0x4c0
 [<c01435cb>] zap_pmd_range+0x4b/0x70
 [<c014362d>] unmap_page_range+0x3d/0x70
 [<c014380b>] unmap_vmas+0x1ab/0x1c0
 [<c0147639>] exit_mmap+0x79/0x150
 [<c01184ee>] mmput+0x5e/0xa0
 [<c011c523>] do_exit+0x153/0x3e0
 [<c0122e6f>] specific_send_sig_info+0xff/0x100
 [<c0122eb2>] force_sig_info+0x42/0x90
 [<c0105be0>] do_coprocessor_error+0x0/0x20
 [<c0105b5e>] math_error+0xde/0x160
 [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
 [<c0222c8c>] write_chan+0x18c/0x250
 [<c01170e0>] default_wake_function+0x0/0x10
 [<c01170e0>] default_wake_function+0x0/0x10
 [<c0104a05>] error_code+0x2d/0x38
 [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
 [<c010b1fc>] restore_i387+0x8c/0x90
 [<c0103434>] restore_sigcontext+0x114/0x130
 [<c0103503>] sys_sigreturn+0xb3/0xd0
 [<c0103f6b>] syscall_call+0x7/0xb

but it keeps the kernel alive :-)

martin



* Re: timer + fpu stuff locks up computer
  2004-06-12 15:20           ` martin capitanio
@ 2004-06-12 16:15             ` stian
  0 siblings, 0 replies; 26+ messages in thread
From: stian @ 2004-06-12 16:15 UTC (permalink / raw)
  To: martin capitanio; +Cc: stian, linux-kernel

>> Does the other dirty nasty patch work for you?
> --- linux-2.6.6-rc3-mm1/kernel/signal.c 2004-06-09 18:36:12.000000000 +0200
> +++ linux-2.6.6-rc3-mm1-fpuhotfix/kernel/signal.c       2004-06-12 18:10:31.573001808 +0200
> @@ -799,7 +799,15 @@
>            can get more detailed information about the cause of
>            the signal. */
>         if (LEGACY_QUEUE(&t->pending, sig))
> +       {
> +           if (sig==8)
> +           {
> +               printk("Attempt to exploit known bug, process=%s pid=%p uid=%d\n",
> +                   t->comm, t->pid, t->uid);
> +               do_exit(0);
> +           }
>             goto out;
> +       }
>
>         ret = send_signal(sig, info, t, &t->pending);
>         if (!ret && !sigismember(&t->blocked, sig))
>
> 2.6.7-rc4-mm1-fpuhotfix:
> user$ ./evil
> ........................*...............................................
> ......................*
> Attempt to exploit known bug, process=evil pid=00000aa6 uid=1000
> note: evil[2726] exited with preempt_count 2
> bad: scheduling while atomic!
>  [<c032a045>] schedule+0x4b5/0x4c0
>  [<c01435cb>] zap_pmd_range+0x4b/0x70
>  [<c014362d>] unmap_page_range+0x3d/0x70
>  [<c014380b>] unmap_vmas+0x1ab/0x1c0
>  [<c0147639>] exit_mmap+0x79/0x150
>  [<c01184ee>] mmput+0x5e/0xa0
>  [<c011c523>] do_exit+0x153/0x3e0
>  [<c0122e6f>] specific_send_sig_info+0xff/0x100
>  [<c0122eb2>] force_sig_info+0x42/0x90
>  [<c0105be0>] do_coprocessor_error+0x0/0x20
>  [<c0105b5e>] math_error+0xde/0x160
>  [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
>  [<c0222c8c>] write_chan+0x18c/0x250
>  [<c01170e0>] default_wake_function+0x0/0x10
>  [<c01170e0>] default_wake_function+0x0/0x10
>  [<c0104a05>] error_code+0x2d/0x38
>  [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
>  [<c010b1fc>] restore_i387+0x8c/0x90
>  [<c0103434>] restore_sigcontext+0x114/0x130
>  [<c0103503>] sys_sigreturn+0xb3/0xd0
>  [<c0103f6b>] syscall_call+0x7/0xb
>
> but it keeps the kernel alive :-)

The hotfix should probably be moved to arch/i386/kernel/traps.c, before we
start taking atomic locks, since it is beyond dirty to kill the process here
while we hold locked resources. But the best thing would be to fix the
source of the problem, since this is just a workaround.


Stian Skjelstad


* Re: timer + fpu stuff locks up computer
  2004-06-12 15:14         ` Sergey Vlasov
@ 2004-06-12 18:45           ` Sergey Vlasov
  2004-06-12 20:27             ` Alexander Nyberg
  0 siblings, 1 reply; 26+ messages in thread
From: Sergey Vlasov @ 2004-06-12 18:45 UTC (permalink / raw)
  To: Alexander Nyberg; +Cc: Rik van Riel, linux-kernel, stian


On Sat, Jun 12, 2004 at 07:14:22PM +0400, Sergey Vlasov wrote:
> If the FPU state belongs to the userspace process, kernel_fpu_begin()
> is safe even if some exceptions are pending.  However, after
> __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does
> nothing with it.
> 
> Replacing fwait with fnclex instead of removing it completely should
> avoid the fault later.

Yes, it seems to be enough.  Another case where it looks like FPU
might be "orphaned" is exit(); however, it is handled as a normal task
switch, __switch_to() calls __unlazy_fpu(), which clears pending
exceptions.

I'm still not sure what to do about possibly lost FP exceptions.  This
can happen in two cases:

1) Program calls execve() while an FP exception is pending.

   In this case clear_fpu() is called when the original executable is
   already destroyed.  Even if we generate a SIGFPE in this case, it
   would be delivered to the new executable.

2) Program returns from a signal handler while an FP exception is
   pending.

   In this case at clear_fpu() time restore_sigcontext() has already
   wiped out all state of the signal handler, so the SIGFPE would
   appear to be raised from the program code at the point where it was
   interrupted by the handled signal.

Signed-Off-By: Sergey Vlasov <vsu@altlinux.ru>

--- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
+++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 22:02:58 +0400
@@ -48,10 +48,17 @@
 		save_init_fpu( tsk ); \
 } while (0)
 
+/*
+ * There might be some pending exceptions in the FP state at this point.
+ * However, it is too late to report them: this code is called during execve()
+ * (when the original executable is already gone) and during sigreturn() (when
+ * the signal handler context is already lost).  So just clear them to prevent
+ * problems later.
+ */
 #define __clear_fpu( tsk )					\
 do {								\
 	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
-		asm volatile("fwait");				\
+		asm volatile("fnclex");				\
 		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
 		stts();						\
 	}							\




* Re: timer + fpu stuff locks up computer
  2004-06-12 18:45           ` Sergey Vlasov
@ 2004-06-12 20:27             ` Alexander Nyberg
  0 siblings, 0 replies; 26+ messages in thread
From: Alexander Nyberg @ 2004-06-12 20:27 UTC (permalink / raw)
  To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian

On Sat, 2004-06-12 at 20:45, Sergey Vlasov wrote:
> On Sat, Jun 12, 2004 at 07:14:22PM +0400, Sergey Vlasov wrote:
> > If the FPU state belongs to the userspace process, kernel_fpu_begin()
> > is safe even if some exceptions are pending.  However, after
> > __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does
> > nothing with it.
> > 
> > Replacing fwait with fnclex instead of removing it completely should
> > avoid the fault later.
> 
> Yes, it seems to be enough.  Another case where it looks like FPU
> might be "orphaned" is exit(); however, it is handled as a normal task
> switch, __switch_to() calls __unlazy_fpu(), which clears pending
> exceptions.
> 
> I'm still not sure what to do about possibly lost FP exceptions.  This
> can happen in two cases:
> 
> 1) Program calls execve() while an FP exception is pending.
> 
>    In this case clear_fpu() is called when the original executable is
>    already destroyed.  Even if we generate a SIGFPE in this case, it
>    would be delivered to the new executable.
> 
> 2) Program returns from a signal handler while an FP exception is
>    pending.
> 
>    In this case at clear_fpu() time restore_sigcontext() has already
>    wiped out all state of the signal handler, so the SIGFPE would
>    appear to be raised from the program code at the point where it was
>    interrupted by the handled signal.
> 
> Signed-Off-By: Sergey Vlasov <vsu@altlinux.ru>
> 
> --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 22:02:58 +0400
> @@ -48,10 +48,17 @@
>  		save_init_fpu( tsk ); \
>  } while (0)
>  
> +/*
> + * There might be some pending exceptions in the FP state at this point.
> + * However, it is too late to report them: this code is called during execve()
> + * (when the original executable is already gone) and during sigreturn() (when
> + * the signal handler context is already lost).  So just clear them to prevent
> + * problems later.
> + */
>  #define __clear_fpu( tsk )					\
>  do {								\
>  	if ((tsk)->thread_info->status & TS_USEDFPU) {		\
> -		asm volatile("fwait");				\
> +		asm volatile("fnclex");				\
>  		(tsk)->thread_info->status &= ~TS_USEDFPU;	\
>  		stts();						\
>  	}							\
> 

This works; I also tested it on a box with md, and things looked fine.


Alex



* Re: timer + fpu stuff locks my console race
  2004-06-12 13:28 stian
  2004-06-12 13:45 ` Manuel Arostegui Ramirez
@ 2004-06-12 13:50 ` Kalin KOZHUHAROV
  1 sibling, 0 replies; 26+ messages in thread
From: Kalin KOZHUHAROV @ 2004-06-12 13:50 UTC (permalink / raw)
  To: linux-kernel

stian@nixia.no wrote:
> Forgot to update the diff file after I fixed some bogus stuff. This patch
> file compiles. Please report if it works or not for 2.4.26 (I'm lacking
> that damn Internet connection on my Linux box). So much for vacation.
> 
> Stian Skjelstad
> 
> diff -ur linux-2.4.26/kernel/signal.c linux-2.4.26-fpuhotfix/kernel/signal.c
> --- linux-2.4.26/kernel/signal.c        2004-02-18 14:36:32.000000000 +0100
> +++ linux-2.4.26-fpuhotfix/kernel/signal.c      2004-06-12 15:26:10.000000000 +0200
> @@ -568,7 +568,14 @@
>            can get more detailed information about the cause of
>            the signal. */
>         if (sig < SIGRTMIN && sigismember(&t->pending.signal, sig))
> +       {
> +               if (sig==8)
> +               {
> +                       printk("Attempt to exploit known bug, process=%s pid=%d uid=%d\n", t->comm, t->pid, t->uid);
> +                       do_exit(0);
> +               }
>                 goto out;
> +       }
> 
>         ret = deliver_signal(sig, info, t);
>  out:

Does this work for 2.6.{6,7} ?

Kalin.

-- 
||///_ o  *****************************
||//'_/>     WWW: http://ThinRope.net/
|||\/<" 
|||\\ ' 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


* Re: timer + fpu stuff locks my console race
  2004-06-12 13:28 stian
@ 2004-06-12 13:45 ` Manuel Arostegui Ramirez
  2004-06-12 13:50 ` Kalin KOZHUHAROV
  1 sibling, 0 replies; 26+ messages in thread
From: Manuel Arostegui Ramirez @ 2004-06-12 13:45 UTC (permalink / raw)
  To: stian, linux-kernel

On Saturday 12 June 2004 15:28, stian@nixia.no wrote:
> Forgot to update the diff file after I fixed some bogus stuff. This patch
> file compiles. Please report if it works or not for 2.4.26 (I'm lacking
> that damn Internet connection on my Linux box). So much for vacation.
>
> Stian Skjelstad
>
> diff -ur linux-2.4.26/kernel/signal.c linux-2.4.26-fpuhotfix/kernel/signal.c
> --- linux-2.4.26/kernel/signal.c        2004-02-18 14:36:32.000000000 +0100
> +++ linux-2.4.26-fpuhotfix/kernel/signal.c      2004-06-12 15:26:10.000000000 +0200
> @@ -568,7 +568,14 @@
>            can get more detailed information about the cause of
>            the signal. */
>         if (sig < SIGRTMIN && sigismember(&t->pending.signal, sig))
> +       {
> +               if (sig==8)
> +               {
> +                       printk("Attempt to exploit known bug, process=%s pid=%d uid=%d\n", t->comm, t->pid, t->uid);
> +                       do_exit(0);
> +               }
>                 goto out;
> +       }
>
>         ret = deliver_signal(sig, info, t);
>  out:

I'm going to try the patch on a 2.4.20-8 in about one hour.
Thanks

-- 
Manuel Arostegui Ramirez #Linux Registered User 200896



* Re: timer + fpu stuff locks my console race
@ 2004-06-12 13:28 stian
  2004-06-12 13:45 ` Manuel Arostegui Ramirez
  2004-06-12 13:50 ` Kalin KOZHUHAROV
  0 siblings, 2 replies; 26+ messages in thread
From: stian @ 2004-06-12 13:28 UTC (permalink / raw)
  To: linux-kernel

Forgot to update the diff file after I fixed some bogus stuff. This patch
file compiles. Please report whether it works for 2.4.26 (I'm lacking
that damn Internet connection on my Linux box). So much for vacation.

Stian Skjelstad

diff -ur linux-2.4.26/kernel/signal.c linux-2.4.26-fpuhotfix/kernel/signal.c
--- linux-2.4.26/kernel/signal.c        2004-02-18 14:36:32.000000000 +0100
+++ linux-2.4.26-fpuhotfix/kernel/signal.c      2004-06-12 15:26:10.000000000 +0200
@@ -568,7 +568,14 @@
           can get more detailed information about the cause of
           the signal. */
        if (sig < SIGRTMIN && sigismember(&t->pending.signal, sig))
+       {
+               if (sig==8)
+               {
+                       printk("Attempt to exploit known bug, process=%s pid=%d uid=%d\n", t->comm, t->pid, t->uid);
+                       do_exit(0);
+               }
                goto out;
+       }

        ret = deliver_signal(sig, info, t);
 out:


* Re: timer + fpu stuff locks my console race
@ 2004-06-12 13:14 stian
  0 siblings, 0 replies; 26+ messages in thread
From: stian @ 2004-06-12 13:14 UTC (permalink / raw)
  To: linux-kernel

Can somebody test whether this does the job, at least for the 2.4.x series?
Perhaps something similar is needed for 2.6.x as well. (The patch lacks
comments and i386-arch ifdefs, but I don't find that relevant for a hotfix.)

Stian Skjelstad

diff -ur linux-2.4.26/kernel/signal.c linux-2.4.26-fpuhotfix/kernel/signal.c
--- linux-2.4.26/kernel/signal.c        2004-02-18 14:36:32.000000000 +0100
+++ linux-2.4.26-fpuhotfix/kernel/signal.c      2004-06-12 15:11:07.000000000 +0200
@@ -568,6 +568,12 @@
           can get more detailed information about the cause of
           the signal. */
        if (sig < SIGRTMIN && sigismember(&t->pending.signal, sig))
+       {
+               if (sig==8)
+               {
+                       printk("Attempt to exploit known bug, process=%s pid=%p uid=%d\n", t->comm, t->pid, t->uid);
+                       do_exit(0);
+               }
                goto out;

        ret = deliver_signal(sig, info, t);





* Re: timer + fpu stuff locks my console race
@ 2004-06-12 12:26 stian
  0 siblings, 0 replies; 26+ messages in thread
From: stian @ 2004-06-12 12:26 UTC (permalink / raw)
  To: linux-kernel

So far I have found out this:

If you ptrace it, for instance with the strace program, it runs perfectly.
There is no sign at all of the FPU exception, and everything runs happily.

It also happens if you, for instance, trigger the exception inside a
SIGSEGV handler.

But I'm not able to trigger other FPU errors. For instance
float a=1.0;
float b=0.0;
float c;
c=a/b;
does not generate a signal, but gives inf (isn't this a configuration
option on the FPU?). So my question is: does the FPU exception handler
work at all, since it appears to be rarely used?
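
Editorial sketch (not from the thread): the snippet above behaves this way because floating-point exceptions are masked by default on x86, so dividing by zero quietly yields infinity instead of raising SIGFPE. The helper names below are hypothetical. On glibc the exception could be unmasked with the feenableexcept() extension, which this sketch deliberately does not exercise.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: divide two floats at run time.  The volatile
 * copies keep the compiler from folding 1.0f / 0.0f at compile time. */
float divide_quietly(float a, float b)
{
    volatile float va = a;
    volatile float vb = b;
    return va / vb;
}

/* Returns 1 if a / b produced an infinity, 0 otherwise.  Reaching the
 * return statement at all means no SIGFPE killed the process. */
int division_gives_inf(float a, float b)
{
    float c = divide_quietly(a, b);
    return isinf(c) ? 1 : 0;
}

/* Sanity check of the default (masked) behaviour described above. */
void check_default_fp_behaviour(void)
{
    assert(division_gives_inf(1.0f, 0.0f));   /* inf, no signal   */
    assert(!division_gives_inf(1.0f, 2.0f));  /* ordinary result  */
}
```

Calling check_default_fp_behaviour() from a test program should return normally as long as the divide-by-zero exception stays masked.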

A very _VERY_ nasty quick fix (for those who are scared) is to exit the
process if we want to send SIGFPE and it is already in the queue, and
perhaps do a printk() about the user trying to exploit a known kernel bug.
It works, at least for me, on my 2.4.26-rc1 box.


Stian Skjelstad


* Re: timer + fpu stuff locks my console race
@ 2004-06-11 12:20 Gard Spreemann
  0 siblings, 0 replies; 26+ messages in thread
From: Gard Spreemann @ 2004-06-11 12:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: stian

ACK on kernel 2.6.6, single CPU.
This seems scarily serious!

 -- Gard


* Re: timer + fpu stuff locks my console race
@ 2004-06-11 12:10 stian
  0 siblings, 0 replies; 26+ messages in thread
From: stian @ 2004-06-11 12:10 UTC (permalink / raw)
  To: linux-kernel

UML does not seem to be affected: it produces a Floating Point Exception
and kills the program. That is a better response than what happens when
running on the host (x86).

It seems the kernel is still alive, but it no longer context switches to
user-space programs, and I/O scheduling also stops.


Stian Skjelstad


* Re: timer + fpu stuff locks my console race
@ 2004-06-10 19:27 Bård Kalbakk
  0 siblings, 0 replies; 26+ messages in thread
From: Bård Kalbakk @ 2004-06-10 19:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: stian

ACK on 2.6.7-rc2, single CPU.

But, with 2.4.23 SMP it seems to be okay. I can't kill the process or attach to it with strace, but it doesn't lock the machine.

Bård Kalbakk


* Re: timer + fpu stuff locks my console race
  2004-06-10 18:59 Lars Age Kamfjord
@ 2004-06-10 19:21 ` Lars Age Kamfjord
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Age Kamfjord @ 2004-06-10 19:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: stian

Throwing in an ACK on 2.4.18-bf2.4 (Debian woody vanilla) from ssh.

The guy I crashed probably hates me now, but I warned him last week he
should give away shell accounts to everyone he knows.......

Lars Age Kamfjord
BOFH

Lars Age Kamfjord wrote:

> ACK on 2.6.5 (fedora core 2 vanilla)
>
> Totally locked my X window system.
>
> Lars Age Kamfjord
>
> > Please keep me in CC as I'm not on the mailing list. I'm currently on
> > vacation, so I can't hook my Linux box to the Internet, but I came
> > across a race condition in the "old" 2.4.26-rc1 vanilla kernel.
>
> > I was doing some code tests when I came across problems with my program
> > locking my console (even X if I'm using an xterm).
>
> > I think first of all gcc triggers the problem, so the full report is
> > here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15905
>
> > Stian Skjelstad




* Re: timer + fpu stuff locks my console race
@ 2004-06-10 18:59 Lars Age Kamfjord
  2004-06-10 19:21 ` Lars Age Kamfjord
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Age Kamfjord @ 2004-06-10 18:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: stian

ACK on 2.6.5 (fedora core 2 vanilla)

Totally locked my X window system.

Lars Age Kamfjord

 > Please keep me in CC as I'm not on the mailing list. I'm currently on
 > vacation, so I can't hook my Linux box to the Internet, but I came
 > across a race condition in the "old" 2.4.26-rc1 vanilla kernel.

 > I was doing some code tests when I came across problems with my program
 > locking my console (even X if I'm using an xterm).

 > I think first of all gcc triggers the problem, so the full report is
 > here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15905

 > Stian Skjelstad


Thread overview: 26+ messages
2004-06-09 21:02 timer + fpu stuff locks my console race stian
2004-06-10 21:00 ` Matias Hermanrud Fjeld
2004-06-11  6:08   ` Lars Age Kamfjord
2004-06-12  2:53 ` Rik van Riel
2004-06-12  3:50   ` Rik van Riel
2004-06-12 13:44     ` Sergey Vlasov
2004-06-12 13:57       ` stian
2004-06-12 14:28         ` Sergey Vlasov
2004-06-12 14:25       ` timer + fpu stuff locks up computer Alexander Nyberg
2004-06-12 14:42         ` stian
2004-06-12 15:20           ` martin capitanio
2004-06-12 16:15             ` stian
2004-06-12 15:14         ` Sergey Vlasov
2004-06-12 18:45           ` Sergey Vlasov
2004-06-12 20:27             ` Alexander Nyberg
2004-06-12  4:35   ` timer + fpu stuff locks my console race Matt Mackall
2004-06-10 18:59 Lars Age Kamfjord
2004-06-10 19:21 ` Lars Age Kamfjord
2004-06-10 19:27 Bård Kalbakk
2004-06-11 12:10 stian
2004-06-11 12:20 Gard Spreemann
2004-06-12 12:26 stian
2004-06-12 13:14 stian
2004-06-12 13:28 stian
2004-06-12 13:45 ` Manuel Arostegui Ramirez
2004-06-12 13:50 ` Kalin KOZHUHAROV
