linux-kernel.vger.kernel.org archive mirror
* Re: BUG: Global FPU corruption in 2.2
@ 2001-04-24 18:21 Victor Zandy
  2001-04-24 18:37 ` Alan Cox
  0 siblings, 1 reply; 33+ messages in thread
From: Victor Zandy @ 2001-04-24 18:21 UTC (permalink / raw)
  To: linux-kernel


Linus Torvalds writes:
> Ahh.. This actually _does_ look like a race on "current->flags": 
> PTRACE_ATTACH will do a 
> 
>         child->flags |= PF_PTRACED; 
> 
> without waiting for the child to have stopped. 

I can see how this could cause PF_USEDFPU to be cleared inadvertently,
but I do not have any ideas for testing this.  Is it clear that this
is the source of the problem?

What would be involved in backporting the split ptrace flags to 2.2?
Are there other solutions?

Vic

* Re: BUG: Global FPU corruption in 2.2
@ 2001-04-24 13:05 Victor Zandy
  2001-04-24 16:24 ` Linus Torvalds
  2001-04-24 16:47 ` Christian Ehrhardt
  0 siblings, 2 replies; 33+ messages in thread
From: Victor Zandy @ 2001-04-24 13:05 UTC (permalink / raw)
  To: linux-kernel


Someone else here traced the process flags of a FP-intensive program
on a machine before and after it is put in the faulty FPU state.  He
periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken.
Afterwards, he found that it was set about 70% of the time.

Vic




* Re: BUG: Global FPU corruption in 2.2
@ 2001-04-24  8:56 alad
  0 siblings, 0 replies; 33+ messages in thread
From: alad @ 2001-04-24  8:56 UTC (permalink / raw)
  To: linux-kernel






Hi,
     I want to look into this problem. It seems to be very interesting. But I
was not following the thread from the beginning (and I mistakenly deleted all
these mails :( ..). I hope you won't mind answering the following questions...

1) Are you doing this on an MP or a uniprocessor machine?
2) I want to know how you are calling sys_ptrace(attach) and
sys_ptrace(detach), i.e. is it something like the following?

     for (;;) {
          sys_ptrace(attach to process);
          sys_wait4();
          sys_ptrace(detach from process);
     }

In short, what is the sequence of system calls you use to attach to and detach
from the process?

3) Have you tried doing attach and detach only once? If not, can you please
try this and let me know whether a single attach and detach also
results in global FPU corruption. Please do not fork in the above process.

---------

Whenever process A calls sys_ptrace(attach) on process B, sys_ptrace sends
SIGSTOP to process B.
Process B, in do_signal, then checks that it is being traced and does the
following:
     current->state = TASK_STOPPED;
     notify_parent(current,SIGCHLD);
     schedule();

Then in schedule() --> __switch_to --> unlazy_fpu() we do the following:
     if (current->flags & PF_USEDFPU)
          save_fpu();

In save_fpu() we do the following:
     fnsave current->tss.i387
     fwait;

I want to ask a question: is it possible that 'somehow' we were not able to
save the complete floating point state with fnsave, i.e. that current->tss.i387
is 'invalid' after
     fnsave current->tss.i387
     fwait;

Thanks
Amol




David Konerding <dek_ml@konerding.com> on 04/23/2001 01:09:27 AM

To:   Ulrich Drepper <drepper@cygnus.com>
cc:   root@chaos.analogic.com, linux-kernel@vger.kernel.org (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2




Ulrich Drepper wrote:

> "Richard B. Johnson" <root@chaos.analogic.com> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first.  The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for user-space
programs, does anybody have any comments on the original bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior.  The first
>program, pi, repeatedly spawns 10 parallel computations of pi.  When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process.  Run pt against the root pi process until the output
>of pi begins to look wrong.  Then kill everything and run pi by itself
>again.  It will no longer produce good results.  We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
described. If it is a bug in the linux kernel (I can see nothing wrong with the
source code provided), I would suspect problems with SMP and ptrace, somehow
causing the wrong FP registers to be returned to a process after the scheduler
restarted it.  It's very interesting that the PI program works fine until you
run PT, but after you run PT, PI is screwed until reboot.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/







* Re: BUG: Global FPU corruption in 2.2
@ 2001-04-24  5:33 alad
  0 siblings, 0 replies; 33+ messages in thread
From: alad @ 2001-04-24  5:33 UTC (permalink / raw)
  To: Erik Paulson; +Cc: Christian Ehrhardt, linux-kernel, zandy








Erik Paulson <epaulson@cs.wisc.edu> on 04/24/2001 01:14:27 AM

To:   Christian Ehrhardt <ehrhardt@mathematik.uni-ulm.de>
cc:   linux-kernel@vger.kernel.org, zandy@cs.wisc.edu (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2




On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> >
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
>
<...>
>
> 3.) It might be interesting to know if the problem can be triggered:
> a) If pi doesn't fork, i.e. just one process calculating pi and
> another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and detach
to that process).

> b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the FPU:
running pt on a process that is just busy-looping over some integer adds,
while pi runs on the machine at the same time without being attached to,
does not seem to affect the floating point state.

>>>> Well, during context switching the call to unlazy_fpu() does the following:
        if (current->flags & PF_USEDFPU)
          save_fpu();

Somebody earlier pointed out a possible race: in sys_ptrace, at the time of
attach, we modify child->flags.
It still looks strange that software is causing the problem, as
the code to handle the FPU looks pretty clean.
Still, can we check current->flags when the problem occurs?


Amol


-Erik






* BUG: Global FPU corruption in 2.2
@ 2001-04-19 16:05 Victor Zandy
  2001-04-19 20:18 ` Michal Jaegermann
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Victor Zandy @ 2001-04-19 16:05 UTC (permalink / raw)
  To: linux-kernel


We have found that one of our programs can cause system-wide
corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
run this program, the FPU gives bad results to all subsequent
processes.

We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
these things, and we see the problem on every node we try (dozens).
We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
don't seem to be affected.

While we prepare to test for the problem on more recent 2.2 and 2.4
kernels, we would appreciate hearing from anyone who may have insight
into it.

Below are two programs we use to produce the behavior.  The first
program, pi, repeatedly spawns 10 parallel computations of pi.  When
all is well, each process prints pi as it completes.

The second program, pt, repeatedly attaches to and detaches from
another process.  Run pt against the root pi process until the output
of pi begins to look wrong.  Then kill everything and run pi by itself
again.  It will no longer produce good results.  We find that the FPU
persistently gives bad results until we reboot.

Here is the sort of thing we see:

BEFORE                  AFTER
--------------------------------------
c36% ./pi               c36% ./pi        
[3883]                  [4069]           
3.141593                6865157.146714   
3.141593                inf              
3.141593                81705.277947     
3.141593                4.742524         
3.141593                nan              
3.141593                585.810296       
3.141593                inf              
3.141593                4.578857         
3.141593                nan              
3.141593                4.578857         

I am not currently subscribed to linux-kernel.  I'll be checking the
web archives, but please CC replies to me.

Thanks!

Vic Zandy

/* pi.c: gcc -g -o pi pi.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <errno.h>

static double
do_pi(void)
{
	double sum=0.0;
	double x=1.0;
	double s=1.0;
	double pi;

	while (x <= 10000000.0)	{
		sum += (1.0/pow(x, 3.0))*s;
		s = -s;
		x += 2.0;
	}
	pi = pow(sum*32.0, 1.0/3.0);
	return pi;
}

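int
main(int argc, char *argv[])
{
	int i;
	pid_t pid;
	int m = 1000;   /* runs */
	int n = 10;     /* procs per run */

	pid = getpid();
	fprintf(stderr, "[%d]\n", (int)pid);
	while (m-- > 0) {
		/* Fork n-1 children; each child leaves the loop immediately. */
		for (i = 1; i < n; i++)
			if (!fork())
				break;
		fprintf(stderr, "%f\n", do_pi());
		/* Children exit after one computation; only the root loops. */
		if (getpid() != pid)
			return 0;
		/* Reap whichever children have already exited. */
		while (waitpid(0, 0, WNOHANG) > 0)
			;
	}
	return 0;
}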
/* end of pi.c */

/* pt.c: gcc -g -o pt pt.c */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <linux/ptrace.h>

long
dptrace(int req, pid_t pid, void *addr, void *data)
{
	char buf[64];
	long rv;	/* ptrace() returns long; an int could truncate PEEK results */
	rv = ptrace(req, pid, addr, data);
	if ((req != PTRACE_PEEKUSR && req != PTRACE_PEEKTEXT) && 0 > rv) {
		snprintf(buf, sizeof buf, "ptrace (req=%d)", req);
		perror(buf);
		exit(1);
	}
	return rv;
}

int
main(int argc, char *argv[])
{
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s PID\n", argv[0]);
		exit(1);
	}
	pid = atoi(argv[1]);
	while (1) {
		dptrace(PTRACE_ATTACH, pid, 0, 0);
		waitpid(pid, 0, 0);
		dptrace(PTRACE_DETACH, pid, 0, 0);
		fprintf(stderr, ".");
	}
	return 0;
}
/* end of pt.c */




end of thread, other threads:[~2001-04-24 20:15 UTC | newest]

Thread overview: 33+ messages
2001-04-24 18:21 BUG: Global FPU corruption in 2.2 Victor Zandy
2001-04-24 18:37 ` Alan Cox
2001-04-24 19:17   ` Victor Zandy
2001-04-24 19:51     ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2001-04-24 13:05 Victor Zandy
2001-04-24 16:24 ` Linus Torvalds
2001-04-24 16:47 ` Christian Ehrhardt
2001-04-24 18:09   ` Victor Zandy
2001-04-24  8:56 alad
2001-04-24  7:56 alad
2001-04-24  5:33 alad
2001-04-19 16:05 Victor Zandy
2001-04-19 20:18 ` Michal Jaegermann
2001-04-20 18:50 ` Victor Zandy
2001-04-20 19:07   ` Richard B. Johnson
2001-04-20 19:20     ` Victor Zandy
2001-04-20 19:44       ` Richard B. Johnson
2001-04-20 19:23     ` Ulrich Drepper
2001-04-20 19:37       ` Richard B. Johnson
2001-04-20 20:20         ` Victor Zandy
2001-04-20 21:44         ` Ulrich Drepper
2001-04-22  1:46           ` Richard B. Johnson
2001-04-22  2:18             ` Alan Cox
2001-04-22  2:30               ` Richard B. Johnson
2001-04-22 18:39           ` David Konerding
2001-04-22 18:59             ` Alan Cox
2001-04-22 20:59 ` kees
2001-04-23 16:11 ` Christian Ehrhardt
2001-04-24 16:10   ` Linus Torvalds
2001-04-24 16:25     ` Alan Cox
2001-04-24 16:56     ` Christian Ehrhardt
2001-04-24 20:15       ` Michal Jaegermann
2001-04-23 18:44 ` Erik Paulson
