On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote: > > We have found that one of our programs can cause system-wide > corruption of the x86 FPU under 2.2.16 and 2.2.17. .... > > We see this problem on dual 550MHz Xeons with 1GB RAM. Hm, I started to wonder if this is not somewhat related to a recent report I got. "The victim" was running 2.2.19 (basically) on an SMP Alpha UP2000+ with two 800 MHz processors. He managed to reduce the problem to a rather small test case and I attach sources, Makefile and a "loop.sh" driver as a shar archive if you want to have a closer look. This "loop.sh" simply fires triplets of "harry" process in a loop. The guy hit by this gets apparently random floating point exceptions starting with roughly sixth process and later intervals between bombs will vary. I have also 'strace' outputs from failing processes but they are not telling very much. 'gdb' is also not very illuminating: Program received signal SIGFPE, Arithmetic exception. 0x1200010a8 in vadd_ (a=0x11fff21e4, ia=0x120003294, b=0x11fff7004, ib=0x120003294, c=0x11fffbe20, ic=0x120003294, n=0x11ffffc70) at vadd.f:99 99 C(CI) = A(AI) + B(BI) Current language: auto; currently fortran (gdb) p *ia $10 = 1 (gdb) p *ib $11 = 1 (gdb) p *ic $12 = 1 (gdb) p *n Cannot access memory at address 0x4 (gdb) p *(0x11ffffc70) $13 = 1024 (gdb) info locals n = (PTR TO -> ( integer )) 0x4 __g77_expr_0 = 10 He tells me that he is getting that on two different machines he has around. The trouble is that I tried to repeat that with different hardware, kernels, compilers and libraries and I failed even on SMP; but I got an access to a box with only 667 MHz processors. OTOH he is running right now 2.4.3-ac9 plus Andrea Arcangeli patches for rw semaphores on Alpha and he reports that the problem went away (and, hopefuly, nothing else will crop out :-). Anybody can offer an insight what that may really be? It may be, of course, totally unrelated to this report from Victor Zandy. Michal michal@harddata.com