linux-kernel.vger.kernel.org archive mirror
* Re: Signal 11
@ 2000-12-14 12:42 Clayton Weaver
  2000-12-14 19:11 ` Linus Torvalds
  2000-12-14 22:46 ` Signal 11 Jakub Jelinek
  0 siblings, 2 replies; 57+ messages in thread
From: Clayton Weaver @ 2000-12-14 12:42 UTC (permalink / raw)
  To: linux-kernel

This is unrelated to the signal 11 problem, but something to consider
for "random crashes and segfaults", i.e. whether you are using this
compiler and glibc version combination.

There has been a thread on the teTeX mailing list the last few days
about a (RedHat, but probably more general than just their rpms)
gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

unsigned varname; /* "unsigned int varname;" is ok */

(no problem at -O or no optimization at all, and doesn't happen if teTeX
is compiled with kgcc).

Showed up in the kpathsea library (which began to split paths on
'-' as well as '/' after a user upgraded compiler and libc and
recompiled teTeX).
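
One way to check which compiler and C library combination is in play is to have a small program report it. The following is a minimal sketch, not something from the teTeX thread: `describe_toolchain` is a hypothetical helper name, `__VERSION__` is the version string predefined by GCC-compatible compilers at build time, and `gnu_get_libc_version()` is the glibc call declared in <gnu/libc-version.h>, which reports the library the program is running against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>   /* gnu_get_libc_version() */
#endif

/*
 * Hypothetical helper: formats the compiler version this file was built
 * with and the C library version it is running against into buf.
 * Returns the value of snprintf().
 */
int describe_toolchain(char *buf, size_t len)
{
#ifdef __VERSION__
    const char *cc = __VERSION__;               /* fixed at compile time */
#else
    const char *cc = "unknown";
#endif
#ifdef __GLIBC__
    const char *libc = gnu_get_libc_version();  /* queried at run time */
#else
    const char *libc = "not glibc";
#endif
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "compiler %s, C library %s", cc, libc);
}
```

Calling it with a 256-byte buffer and printing the result with puts() gives a one-line description that can be attached to a bug report.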

Regards,

Clayton Weaver
<mailto:cgweav@eskimo.com>
(Seattle)

"Everybody's ignorant, just in different subjects."  Will Rogers



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 12:42 Signal 11 Clayton Weaver
@ 2000-12-14 19:11 ` Linus Torvalds
  2000-12-14 22:35   ` Alan Cox
  2000-12-14 23:35   ` Jakub Jelinek
  2000-12-14 22:46 ` Signal 11 Jakub Jelinek
  1 sibling, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2000-12-14 19:11 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.SUN.3.96.1001214042948.15033A-100000@eskimo.com>,
Clayton Weaver  <cgweav@eskimo.com> wrote:
>
>There has been a thread on the teTeX mailing list the last few days
>about a (RedHat, but probably more general than just their rpms)
>gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

Quite frankly, anybody who uses RedHat 7.0 and their broken compiler for
_anything_ is going to have trouble.

I don't know why RH decided to do their idiotic gcc-2.96 release (it
certainly wasn't approved by any technical gcc people - the gcc people
were upset about it too), and I find it even more surprising that they
apparently KNEW that the compiler they were using was completely broken. 
They included another (non-broken) compiler, and called it "kgcc". 

"kgcc" stands for "kernel gcc", apparently because (a) they realised
that a miscompiled kernel is even worse than miscompiling some random
user applications and (b) gcc-2.96 is so broken that it requires special
libraries for C++ vtable chunks handling that is different, so the
_working_ gcc can only be used with programs that do not need such
library support.  Namely the kernel. 

In case it wasn't obvious yet, I consider RedHat-7.0 to be basically
unusable as a development platform, and I hope RH downgrades their
compiler to something that works better RSN.  It apparently has problems
compiling stuff like the CVS snapshots of X etc too (and obviously,
anything you compile under gcc-2.96 is not likely to work anywhere else
except with the broken libraries). 

		Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 19:11 ` Linus Torvalds
@ 2000-12-14 22:35   ` Alan Cox
  2000-12-14 22:45     ` Linus Torvalds
  2000-12-14 23:35   ` Jakub Jelinek
  1 sibling, 1 reply; 57+ messages in thread
From: Alan Cox @ 2000-12-14 22:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

> I don't know why RH decided to do their idiotic gcc-2.96 release (it
> certainly wasn't approved by any technical gcc people - the gcc people

Every single patch in that release barring I believe 2 was accepted into
the main tree. So they liked the code. The naming did upset people and was
unfortunate, but done talking to the compiler folks at Red Hat with the
best of intentions behind it. If we had called it 'Red Hat cc' I think people
would have been even more offended at the way they had been discredited.

I do understand why they got peeved, I do understand why they feel no urge
to support the 296 codebase (nor would I want them to). I hit 'd' when I 
see 'I have 2.2.18 patched with [reiserfs|ext3|bigmem|lfs]' for the same
reasons.

> They included another (non-broken) compiler, and called it "kgcc". 
> "kgcc" stands for "kernel gcc", apparently because (a) they realised

kgcc is a convention invented a long time ago by Conectiva. Debian also used
to have gcc272. It is done because

gcc272 is useless at C++, has lots of bugs
egcs is no better at C++ and has lots of bugs
gcc295 is a little better at C++ and is _Crawling_ with bugs
gcc296(redhat) is a lot better at C++ and doesn't appear to be any buggier.

In fact gcc296 is the first compiler that can compile 2.2.16 correctly. All
the previous compilers miscompile the strstr() inline in some cases. That's
why I had to hack the 2.2 kernel tree to make it work. (And the cases where
you got compile time errors gcc was right to moan about - like using (...)
in traditional

> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the

Wrong - the C++ vtable format change is part of the intended progression of the
compiler and needed to meet standards compliance. gcc 295 also changed the
internal formats. Unfortunately the gcc295 and 296 formats are both probably
not the final format. The compiler folks are not willing to guarantee anything
until gcc 3.0, which may actually be out by the time 2.4 is stable.

> unusable as a development platform, and I hope RH downgrades their
> compiler to something that works better RSN.  It apparently has problems

Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
buggier than before, but that the bugs are in different places. egcs and gcc295
both caused X compile problems too.

I still advise people: Use egcs-1.1.2 for Linux 2.2.x. You can build 2.2.18 with
gcc 2.96 but I personally wouldn't be running production systems on a kernel
built that way - but NOT because gcc296 is buggier but because the bugs are
going to be in different places and I firmly believe production system people
should let the loons find them ;)

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 22:35   ` Alan Cox
@ 2000-12-14 22:45     ` Linus Torvalds
  2000-12-14 22:58       ` Bernhard Rosenkraenzer
  2000-12-14 23:24       ` Alan Cox
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2000-12-14 22:45 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel



On Thu, 14 Dec 2000, Alan Cox wrote:
> 
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> 
> Wrong - the C++ vtable format change is part of the intended progression of the
> compiler and needed to meet standards compliance. gcc 295 also changed the
> internal formats. Unfortunately the gcc295 and 296 formats are both probably
> not the final format. The compiler folks are not willing to guarantee anything
> until gcc 3.0, which may actually be out by the time 2.4 is stable.

If you ask any gcc folks, the main reason they think this was a really
stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
with the 2.95.x release _and_ the upcoming 3.0 release.

Nobody asked the people who knew this, apparently.

> > unusable as a development platform, and I hope RH downgrades their
> > compiler to something that works better RSN.  It apparently has problems
> 
> Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> buggier than before, but that the bugs are in different places. egcs and gcc295
> both caused X compile problems too.

gcc-2.95.2 is at least a real release, from a branch that is actively
maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
many problems as possible _without_ being incompatible like the snapshots
are.

Or just stay at 2.91.66 (egcs).

As to X compile problems - neither egcs nor 2.95.2 appears to have any
trouble with the CVS tree. Possibly because they got fixed, because, after
all, at least those were real releases.

I'd applaud RedHat for making snapshots available, but they should be
marked as SNAPSHOTS, and not as the main compiler with no way to fix the
damn problems it causes.

As it is, anybody doing development is probably better off at RH-6.2.
That is doubly true if they intend to release binaries.

			Linus


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 12:42 Signal 11 Clayton Weaver
  2000-12-14 19:11 ` Linus Torvalds
@ 2000-12-14 22:46 ` Jakub Jelinek
  1 sibling, 0 replies; 57+ messages in thread
From: Jakub Jelinek @ 2000-12-14 22:46 UTC (permalink / raw)
  To: Clayton Weaver; +Cc: linux-kernel

On Thu, Dec 14, 2000 at 04:42:03AM -0800, Clayton Weaver wrote:
> There has been a thread on the teTeX mailing list the last few days
> about a (RedHat, but probably more general than just their rpms)
> gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile
> 
> unsigned varname; /* "unsigned int varname;" is ok */
> 
> (no problem at -O or no optimization at all, and doesn't happen if teTeX
> is compiled with kgcc).

That one is fixed already for some time, it was a bug in loop unrolling
(that patch is still pending review for the mainline CVS though).

	Jakub

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 22:45     ` Linus Torvalds
@ 2000-12-14 22:58       ` Bernhard Rosenkraenzer
  2000-12-14 23:11         ` Linus Torvalds
  2000-12-15  0:10         ` Miquel van Smoorenburg
  2000-12-14 23:24       ` Alan Cox
  1 sibling, 2 replies; 57+ messages in thread
From: Bernhard Rosenkraenzer @ 2000-12-14 22:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

The same thing is true of *any* gcc release.
For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
_and_ the upcoming 3.0 release.

> > Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> > buggier than before, but that the bugs are in different places. egcs and gcc295
> > both caused X compile problems too.
>
> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained

Not very actively.
Please take the time to compare the activity in gcc_2_95_branch with the
patches in the current "2.96" version in rawhide.

> - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

It will be incompatible with any non-2.95.x-version, and I don't think
2.96-68 is any more buggy than the current 2.95 branch.
The initial 2.96 "release" did have some odd bugs; all the known ones have
been fixed.

> Or just stay at 2.91.66 (egcs).

This may be good for the kernel, but it's not acceptable for C++.
Also, there's no support for some of the platforms we have to work with,
such as ia64 and S/390 - using different compilers for different
architectures isn't a real solution either.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree.

Neither does 2.96-68.

LLaP
bero



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 22:58       ` Bernhard Rosenkraenzer
@ 2000-12-14 23:11         ` Linus Torvalds
  2000-12-15  0:10         ` Miquel van Smoorenburg
  1 sibling, 0 replies; 57+ messages in thread
From: Linus Torvalds @ 2000-12-14 23:11 UTC (permalink / raw)
  To: Bernhard Rosenkraenzer; +Cc: Alan Cox, linux-kernel



On Thu, 14 Dec 2000, Bernhard Rosenkraenzer wrote:
> >
> > gcc-2.95.2 is at least a real release, from a branch that is actively
> > maintained
> 
> Not very actively.
> Please take the time to compare the activity in gcc_2_95_branch with the
> patches in the current "2.96" version in rawhide.

Take a look at the differences in linux-2.2.x and linux-2.3.x.

linux-2.3.x was a h*ll of a lot more "actively maintained".

But nobody really considers that to be an argument for RedHat (or anybody
else) to install a 2.3.x kernel by default. Sure, most distributions have a
"hacker kernel", but it's NOT installed by default, and it is clearly
marked as experimental.

Your arguments make no sense.

The compiler is often _more_ important to system stability than the
kernel. A "real release" implies that it at least had testing, and that
people know what the problem spots tend to be.

Note that the "know what the problem spots tend to be" is important.

> > As to X compile problems - neither egcs nor 2.95.2 appears to have any
> > trouble with the CVS tree.
> 
> Neither does 2.96-68.

Good. Maybe you'd make it clearer to everybody who installed from your
CD's that they had better upgrade. Pronto.

		Linus


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 22:45     ` Linus Torvalds
  2000-12-14 22:58       ` Bernhard Rosenkraenzer
@ 2000-12-14 23:24       ` Alan Cox
  1 sibling, 0 replies; 57+ messages in thread
From: Alan Cox @ 2000-12-14 23:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So 
	egcs is a different format to all others
	2.95 is a different format to all others
	2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it we sell packages and support based around
that and our preparedness to support it. 

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at, Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. That's the same
guideline the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 19:11 ` Linus Torvalds
  2000-12-14 22:35   ` Alan Cox
@ 2000-12-14 23:35   ` Jakub Jelinek
  2000-12-14 23:51     ` Linus Torvalds
  1 sibling, 1 reply; 57+ messages in thread
From: Jakub Jelinek @ 2000-12-14 23:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not have C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN.  It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries). 

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware about X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.
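
The kind of code affected by that default can be sketched as follows. This is an illustrative example, not taken from the XFree86 tree: `store_then_read` and `float_bits` are hypothetical names, and the memcpy variant assumes that float and uint32_t have the same size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * With -fstrict-aliasing the compiler may assume that an int * and a
 * float * never refer to the same object, so it is allowed to return
 * the constant 1 here without reloading *i after the store to *f.
 * Callers that pass the same address through both pointers (type
 * punning via casts) get undefined behaviour.
 */
int store_then_read(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;
    return *i;
}

/*
 * A well-defined way to look at the bits of a float: copy the bytes
 * instead of dereferencing a cast pointer.
 */
uint32_t float_bits(float f)
{
    uint32_t bits;

    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

When the two pointers refer to distinct objects, store_then_read() is well defined and returns 1 at any optimization level. Code that still relies on punning can be built with -fno-strict-aliasing until it is fixed.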

	Jakub

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 23:35   ` Jakub Jelinek
@ 2000-12-14 23:51     ` Linus Torvalds
  2000-12-15  0:11       ` Dan Egli
  0 siblings, 1 reply; 57+ messages in thread
From: Linus Torvalds @ 2000-12-14 23:51 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: linux-kernel



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
> 
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not have C++ binary compatible with
> other distributions).

Yes. 

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release. 

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN.  It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries). 
> 
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user that couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues. 

> So far I was aware about X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

			Linus


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 22:58       ` Bernhard Rosenkraenzer
  2000-12-14 23:11         ` Linus Torvalds
@ 2000-12-15  0:10         ` Miquel van Smoorenburg
  2000-12-15  0:32           ` Alan Cox
  1 sibling, 1 reply; 57+ messages in thread
From: Miquel van Smoorenburg @ 2000-12-15  0:10 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.30.0012142351520.19104-100000@bochum.redhat.de>,
Bernhard Rosenkraenzer  <bero@redhat.de> wrote:
>The same thing is true of *any* gcc release.
>For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
>_and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-14 23:51     ` Linus Torvalds
@ 2000-12-15  0:11       ` Dan Egli
  2000-12-16  1:28         ` Signal 11gy Alan Cox
  0 siblings, 1 reply; 57+ messages in thread
From: Dan Egli @ 2000-12-15  0:11 UTC (permalink / raw)
  To: linux-kernel

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> Yes. 
> 
> And I realize that somebody inside RedHat really wanted to use a snapshot
> in order to get some C++ code to compile right.
> 
> But it at the same time threw C stability out the window, by using a
> not-very-widely-tested snapshot for a major new release. 
> 
> Are you seriously saying that you think it was a good trade-off? Or are
> you just ashamed of admitting that RH did something stupid?
> 
Pardon the poking in here, but I must say I agree here. RH did a VERY dumb
thing. 

> I have a report from a Sony VAIO user that couldn't compile the CVS X at
> all on his picturebook (and you need to compile the CVS tree in order to
> get required fixes for the ATI Rage Mobility in that machine). I don't
> know the details, but they were apparently due to RH 7 issues. 

It's not in the X tree or anything, but here's a personal example.
Machine: Dual P3 550
HDD: Dual Ultra2Wide Seagate 18GB Hdd
OS: RedHat 7
Compile Target: Linux Kernel 2.2.17
Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
arch tree)
Result with compat-egcs-62: Success on the first try.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  0:10         ` Miquel van Smoorenburg
@ 2000-12-15  0:32           ` Alan Cox
  2000-12-15  0:42             ` Miquel van Smoorenburg
  2000-12-15  2:07             ` Michael Peddemors
  0 siblings, 2 replies; 57+ messages in thread
From: Alan Cox @ 2000-12-15  0:32 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o	Someone else moved to 2.95, not RH. In fact some of us felt 2.95 wasn't
	fit to ship at the time. 

o	We tell vendors to build RPMv3 , glibc 2.1.x

o	Vendors not being stupid understand that they have a bigger market
	share if they do that.

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  0:32           ` Alan Cox
@ 2000-12-15  0:42             ` Miquel van Smoorenburg
  2000-12-15  2:07             ` Michael Peddemors
  1 sibling, 0 replies; 57+ messages in thread
From: Miquel van Smoorenburg @ 2000-12-15  0:42 UTC (permalink / raw)
  To: linux-kernel

In article <E146inG-0000O0-00@the-village.bc.nu>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
>> And since redhat is _the_ distro that commercial entities use to
>> release software for, this was very arguably a bad move.
>
>Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with as comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  2:07             ` Michael Peddemors
@ 2000-12-15  1:09               ` Alan Cox
  2000-12-15 16:12                 ` Theodore Y. Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Alan Cox @ 2000-12-15  1:09 UTC (permalink / raw)
  To: Michael Peddemors; +Cc: Alan Cox, linux-kernel

> > o	We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals but it's a start and following it does
give you applications that with a bit of care run on everything.

> > o	Vendors not being stupid understand that they have a bigger market
> > 	share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  0:32           ` Alan Cox
  2000-12-15  0:42             ` Miquel van Smoorenburg
@ 2000-12-15  2:07             ` Michael Peddemors
  2000-12-15  1:09               ` Alan Cox
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Peddemors @ 2000-12-15  2:07 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o	We tell vendors to build RPMv3 , glibc 2.1.x

Curious HOW do you tell vendors??

> o	Vendors not being stupid understand that they have a bigger market
> 	share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

-- 
--------------------------------------------------------
Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604) 589-0037 Beautiful British Columbia, Canada
--------------------------------------------------------

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  1:09               ` Alan Cox
@ 2000-12-15 16:12                 ` Theodore Y. Ts'o
  0 siblings, 0 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2000-12-15 16:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: michael, alan, linux-kernel

   Date: 	Fri, 15 Dec 2000 01:09:29 +0000 (GMT)
   From: Alan Cox <alan@lxorguk.ukuu.org.uk>

   > > o	We tell vendors to build RPMv3 , glibc 2.1.x
   > Curious HOW do you tell vendors??

   When they ask. More usefully Dan Quinlan and most vendors put together a
   recommended set of things to build with and use. It warns about library
   pitfalls, kernel changes and what packaging is supported. It is far from
   perfect and nothing like the LSB goals but it's a start and following it does
   give you applications that with a bit of care run on everything.

In the interests of making sure everyone understands the history:

The Linux Development Platform Specification (LDPS) was started as a
result of an informal evening post-LSB-meeting gathering in June --- to
which by the way Red Hat didn't send any representatives(*) --- the
discussion at the restaurant started along the lines of "Oh, my *GOD*
RedHat is about to do something stupid --- they're releasing Red Hat 7.0
with beta/snapshots of just about every single critical system component
except the kernel --- and vendors who fall into the trap developing
against Red Hat 7.0 won't work with any other distribution.  This is
going to be *bad* for Linux."

So yes, the reason why LDPS was formed was to recommend to vendors what
they should build and use --- but while Alan gave comments about the
LDPS once it was announced that a group of people were working on the
LDPS , there is no way that the LDPS could even vaguely be considered a
Red Hat initiative.  (The LDPS is a separate work group which is part of
the FSG, so it is a sister group to the LSB effort.)

							- Ted

(*) Ever since Jim Kingdon left Red Hat (he was at VA Linux for a while,
and is now at SGI), as far as I know no one at Red Hat is actively
participating in the LSB activities --- they haven't sent anyone to the
physical LSB meetings, or participated in the bi-weekly phone
conferences, or taken work items to help finish the LSB.  Alan does
participate on the mailing lists, and makes quite helpful comments, but
as far as I know that's about the limit of Red Hat's participation in
either the LSB or the LDPS specification work.  Speaking as someone who
has been contributing time and effort to the LSB, it would be great if
Red Hat were to become more fully involved in the LSB; I (and I'm sure
all the other LSB volunteers) would welcome a greater level of
participation by Red Hat.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-15  0:11       ` Dan Egli
@ 2000-12-16  1:28         ` Alan Cox
  0 siblings, 0 replies; 57+ messages in thread
From: Alan Cox @ 2000-12-16  1:28 UTC (permalink / raw)
  To: Dan Egli; +Cc: linux-kernel

> It's not in the X tree or anything, but here's a personal example.
> Machine: Dual P3 550
> HDD: Dual Ultra2Wide Seagate 18GB Hdd
> OS: RedHat 7
> Compile Target: Linux Kernel 2.2.17
> Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
> arch tree)
> Result with compat-egcs-62: Success on the first try.

It isn't a bug in the compiler. It's a bug in the kernel tree.  It's a bug in
the old compiler that it didn't reject the code before.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  2:28       ` davej
  2000-12-08  3:13         ` Jeff V. Merkey
  2000-12-08 13:52         ` Alan Cox
@ 2000-12-15  0:11         ` lamont
  2 siblings, 0 replies; 57+ messages in thread
From: lamont @ 2000-12-15  0:11 UTC (permalink / raw)
  To: davej; +Cc: Linux Kernel Mailing List


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs.  There were multiple different symptoms,
including plain sig11s on compiles, corrupted input (leading to syntax
errors) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it).  Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks or months later.

I once saw a bizarre problem in a Tyan dual-processor PIII/500 box with
2x256MB ECC RAM: one of the ECC RAM sticks was bad, and repeated
kernel compiles would hang after about 24 hours.  Strange problem, but
in troubleshooting it I found that the problem followed this stick of RAM
around to different machines.  I blamed the RAM but don't understand what
the underlying problem was...
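
The kind of check behind "swap the RAM and see" can be roughed out in
userspace.  What follows is only a minimal sketch, under the assumption
that writing a known byte pattern and reading it back is enough to catch
a gross fault; the names fill_pattern and count_mismatches are made up
for illustration.  A process only ever sees the pages the kernel hands
it, and the CPU caches may satisfy the read-back, so this is no
substitute for a dedicated memory tester or for moving the suspect
stick between machines as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill the buffer with a known byte pattern. */
void fill_pattern(unsigned char *buf, size_t len, unsigned char pattern)
{
    assert(buf != NULL);
    memset(buf, pattern, len);
}

/* Count the bytes that no longer hold the pattern written earlier.
 * The volatile qualifier forces the compiler to really re-read each
 * byte instead of assuming it still has the value just stored. */
size_t count_mismatches(const volatile unsigned char *buf, size_t len,
                        unsigned char pattern)
{
    size_t bad = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != pattern)
            bad++;
    }
    return bad;
}
```

A caller would fill a large buffer, wait or generate load, and then
treat any non-zero count from count_mismatches as a reason to distrust
the hardware rather than the compiler.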

On Fri, 8 Dec 2000 davej@suse.de wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > affected other than X, SSH also get's spurious signal 11's now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> 
> <AOL>
> 
> I've begun to get a bit paranoid about my K6-2 500 box.
> 
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
> 
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
> 
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
> 
> *boggle*
> 
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
> 
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
> 
> regards,
> 
> Davej.
> 
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-11 13:33           ` Mike Galbraith
@ 2000-12-11 23:24             ` Rainer Mager
  0 siblings, 0 replies; 57+ messages in thread
From: Rainer Mager @ 2000-12-11 23:24 UTC (permalink / raw)
  To: linux-kernel

(This message contains a number of related replies.)

> From: Mike Galbraith [mailto:mikeg@wen-online.de]
> Is init permanently running after you see a couple of these?

No, that is, after 23 hours up time it has used only 6 seconds CPU time
(according to top).

That reminds me that I should repeat that my signal 11 problem has (so far)
only caused X to die. The OS remains up and stable.


> From: davej@suse.de [mailto:davej@suse.de]
> My troublesome box finally seems to be stable.[...]I disabled DRM
> & AGPGart. With them both disabled, I get no problems at all.
> No Sig11's, No Sig4's, No lockups.
>
> This box has a Voodoo3 3000 AGP..

I suppose I can try this too. My box has a Matrox G400. BTW, what is DRM?
Direct Rendering something?


> From: CMA [mailto:cma@mclink.it]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

I'll look into this as well. Anyone have any pointers on how to do this? I
have a Tyan Tiger 133 with Award BIOS if this helps/matters.

Even if this setting does make a difference, what does this tell me/us? I
don't consider running the box with disabled cache(s) a viable solution.



Thanks all and keep those suggestions coming.

--Rainer


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-11  9:05         ` Rainer Mager
  2000-12-11 13:33           ` Mike Galbraith
@ 2000-12-11 14:14           ` davej
  1 sibling, 0 replies; 57+ messages in thread
From: davej @ 2000-12-11 14:14 UTC (permalink / raw)
  To: Rainer Mager; +Cc: Alan Cox, Linux Kernel Mailing List, Linus Torvalds

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

My troublesome box finally seems to be stable. It's been up for the
last two days whilst under quite heavy loads without problems.
Previously, it would be lucky to last an hour.
The change? I disabled DRM & AGPGart.
With them both disabled, I get no problems at all. No Sig11's,
No Sig4's, No lockups.
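
For anyone unsure what "Sig11" and "Sig4" refer to: on Linux/x86 they
are SIGSEGV and SIGILL, and a parent process (the shell, or make) learns
about them from the child's wait status.  Here is a minimal sketch of
that mechanism, assuming a POSIX system; the helper name
fatal_signal_of_child is invented for this example, and the child
raises the signal on purpose rather than actually crashing.

```c
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that raises sig_to_raise with the default disposition.
 * Returns the number of the signal that killed the child, 0 if the
 * child exited normally, or -1 if fork() or waitpid() failed. */
int fatal_signal_of_child(int sig_to_raise)
{
    assert(sig_to_raise >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: suppress core files so the demo leaves no litter. */
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);

        if (sig_to_raise != 0) {
            signal(sig_to_raise, SIG_DFL);
            raise(sig_to_raise);
        }
        _exit(0);
    }

    /* Parent: this is the same information the shell prints. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

Calling fatal_signal_of_child(SIGSEGV) should hand back SIGSEGV, which
is what a shell or make reports as "signal 11" when gcc or X dies.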

This box has a Voodoo3 3000 AGP..

01:00.0 VGA compatible controller: 3Dfx Interactive, Inc. Voodoo 3 (rev 01)

And is running on an MVP3 chipset....

00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]

This box does display the same problem with IRQ routing that I've
got on my Athlon box...

PCI: Using IRQ router VIA [1106/0586] at 00:07.0
PCI: Assigned IRQ 11 for device 00:08.0
PCI: The same IRQ used for device 01:00.0
IRQ routing conflict in pirq table! Try 'pci=autoirq'

(00:08.0 is an SBLive)

A related problem ?
As I mentioned in an earlier mail `autoirq' is an unknown option.

The Athlon box has similar messages, but it happens with even
more devices..

They both do the same with the various PCI options 'nobios' etc,
and changing PnP OS in the BIOS makes no difference either.

regards,

Davej.

-- 
| Dave Jones <davej@suse.de>  http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-11  9:05         ` Rainer Mager
@ 2000-12-11 13:33           ` Mike Galbraith
  2000-12-11 23:24             ` Rainer Mager
  2000-12-11 14:14           ` davej
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2000-12-11 13:33 UTC (permalink / raw)
  To: Rainer Mager; +Cc: Alan Cox, linux-kernel

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

Is init permanently running after you see a couple of these?

	-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-11  0:58       ` Rainer Mager
@ 2000-12-11  9:05         ` Rainer Mager
  2000-12-11 13:33           ` Mike Galbraith
  2000-12-11 14:14           ` davej
  0 siblings, 2 replies; 57+ messages in thread
From: Rainer Mager @ 2000-12-11  9:05 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Well, I just had a Signal 11 even with the patch. What can I do to help
figure this out?


Thanks,

--Rainer

-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; linux-kernel@vger.kernel.org; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-08 14:06     ` Alan Cox
  2000-12-09 19:01       ` Matthew Vanecek
@ 2000-12-11  0:58       ` Rainer Mager
  2000-12-11  9:05         ` Rainer Mager
  1 sibling, 1 reply; 57+ messages in thread
From: Rainer Mager @ 2000-12-11  0:58 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

I just applied the said patch and will report my results. Note that I have
never been able to reliably, on-demand reproduce this so give me a few days
to see what happens.

--Rainer


-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; linux-kernel@vger.kernel.org; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-09 19:20         ` davej
@ 2000-12-09 23:31           ` Matthew Vanecek
  0 siblings, 0 replies; 57+ messages in thread
From: Matthew Vanecek @ 2000-12-09 23:31 UTC (permalink / raw)
  To: linux-kernel

davej@suse.de wrote:
> 
> On Sat, 9 Dec 2000, Matthew Vanecek wrote:
> 
> > > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > > table updating race help ?
> > > Alan
> >
> > Where are his fixes at?  I don't seem to see any of his posts in the
> > archives.
> 
> dwmw2 posted one such patch earlier this week :-
> 
> http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html
> 
> regards,
> 

I saw that.  I thought it was a patch to try to "reproduce it", as
opposed to fixing it.  Is it truly a fix, and is it applicable for UP
kernels?
-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-09 19:01       ` Matthew Vanecek
@ 2000-12-09 19:20         ` davej
  2000-12-09 23:31           ` Matthew Vanecek
  0 siblings, 1 reply; 57+ messages in thread
From: davej @ 2000-12-09 19:20 UTC (permalink / raw)
  To: Matthew Vanecek; +Cc: linux-kernel

On Sat, 9 Dec 2000, Matthew Vanecek wrote:

> > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > table updating race help ?
> > Alan
> 
> Where are his fixes at?  I don't seem to see any of his posts in the
> archives.

dwmw2 posted one such patch earlier this week :-

http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html

regards,

Davej.

-- 
| Dave Jones <davej@suse.de>  http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 14:06     ` Alan Cox
@ 2000-12-09 19:01       ` Matthew Vanecek
  2000-12-09 19:20         ` davej
  2000-12-11  0:58       ` Rainer Mager
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Vanecek @ 2000-12-09 19:01 UTC (permalink / raw)
  To: linux-kernel

Alan Cox wrote:
> 
> > > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > > would say that this is definitely a kernel problem.
> >
> > XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> > kernels - even on my BP6¹. The random crashes started to happen when I
> > upgraded my distribution² - and are only seen by people using 2.4. So I
> > suspect that it's the combination of glibc and kernel which is triggering
> > it.
> 
> Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> table updating race help ?
> 
> Alan

Where are his fixes at?  I don't seem to see any of his posts in the
archives.
-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
@ 2000-12-09  5:32 davej
  0 siblings, 0 replies; 57+ messages in thread
From: davej @ 2000-12-09  5:32 UTC (permalink / raw)
  To: Linux Kernel Mailing List


David Woodhouse (dwmw2@infradead.org) wrote...

> Can you reproduce it with bcrl's patch below: 

Did nothing for me. gcc still got a sig11 after a while.
Took three runs of 'make bzImage' before it completed.

I wondered if I'd been unlucky enough to have been sent a
replacement K6-2 which was also screwed, but as I mentioned
earlier, this box runs fine under 2.2

btw, I was unsubscribed from all lists at vger yesterday,
for reasons currently unknown to me. Did this happen to anyone
else, or did my mail setup break something?

regards,

Davej.

-- 
| Dave Jones <davej@suse.de>  http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 22:24         ` David Woodhouse
@ 2000-12-09  0:56           ` Jeff V. Merkey
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-09  0:56 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Mark Vojkovich, Andi Kleen, Rainer Mager, linux-kernel



I'll try.

Jeff


On Fri, Dec 08, 2000 at 10:24:55PM +0000, David Woodhouse wrote:
> On Fri, 8 Dec 2000, Jeff V. Merkey wrote:
> 
> > I have not seen it on UP systems either.  I only see it on SMP systems.
> > After trying very hard last night, I was able to get my 4 x PPro system to
> > do it with 2.4.0-12.  It seems related to loading in some way.  If you
> > have more than two processors, the loading is less since there's more
> > processors, and for whatever reason, it makes it harder to produce
> > whatever race condition is causing it.  I can get it to happen
> > pretty easily on a 2 x PII system.
> 
> Can you reproduce it with bcrl's patch below:
> 
> Index: mm/memory.c
> ===================================================================
> RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
> retrieving revision 1.2.2.40
> diff -u -r1.2.2.40 memory.c
> --- mm/memory.c	2000/12/05 13:33:39	1.2.2.40
> +++ mm/memory.c	2000/12/08 22:24:09
> @@ -860,6 +860,7 @@
>  	/*
>  	 * Ok, we need to copy. Oh, well..
>  	 */
> +	set_pte(page_table, pte);
>  	spin_unlock(&mm->page_table_lock);
>  	new_page = page_cache_alloc();
>  	if (!new_page)
> @@ -870,6 +871,12 @@
>  	 * Re-check the pte - we dropped the lock
>  	 */
>  	if (pte_same(*page_table, pte)) {
> +		/* We are changing the pte, so get rid of the old
> +		 * one to avoid races with the hardware, this really
> +		 * only affects the accessed bit here.
> +		 */
> +		pte = ptep_get_and_clear(page_table);
> +
>  		if (PageReserved(old_page))
>  			++mm->rss;
>  		break_cow(vma, old_page, new_page, address, page_table);
> @@ -1216,12 +1223,14 @@
>  		return do_swap_page(mm, vma, address, pte, pte_to_swp_entry(entry), write_access);
>  	}
> 
> +	entry = ptep_get_and_clear(pte);
>  	if (write_access) {
>  		if (!pte_write(entry))
>  			return do_wp_page(mm, vma, address, pte, entry);
> 
>  		entry = pte_mkdirty(entry);
>  	}
> +
>  	entry = pte_mkyoung(entry);
>  	establish_pte(vma, address, pte, entry);
>  	spin_unlock(&mm->page_table_lock);
> 
> 
> -- 
> dwmw2
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 19:34     ` Mark Vojkovich
@ 2000-12-08 23:16       ` Jeff V. Merkey
  2000-12-08 22:24         ` David Woodhouse
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08 23:16 UTC (permalink / raw)
  To: Mark Vojkovich; +Cc: David Woodhouse, Andi Kleen, Rainer Mager, linux-kernel

On Fri, Dec 08, 2000 at 11:34:51AM -0800, Mark Vojkovich wrote:
> 
> 
> On Fri, 8 Dec 2000, David Woodhouse wrote:
> 
>    Some additional data points.  It goes away on UP 2.4 kernels.
> Also, I can't recall seeing this problem on IA64.  Maybe it's still
> there on IA64 and I just haven't been trying hard enough to crash
> it, but my current impression is that the problem doesn't exist on IA64.
> 
>   Hmmm...  IA64 is a static server.  I don't hear of people having
> problems on 3.3.6 servers either.  I'm wondering if a non-loader
> 4.0 server would have problems on IA32 with a 2.4 kernel.  That's
> something for people to try.
> 
> 
> 				Mark.


I have not seen it on UP systems either.  I only see it on SMP systems.  
After trying very hard last night, I was able to get my 4 x PPro system to 
do it with 2.4.0-12.  It seems related to loading in some way.  If you 
have more than two processors, the loading is less since there's more 
processors, and for whatever reason, it makes it harder to produce
whatever race condition is causing it.  I can get it to happen 
pretty easily on a 2 x PII system.

:-)

Jeff



> 
> >
> > --
> > dwmw2
> >
> > ¹ And the BP6 still falls over less frequently than the dual P3 I use at
> > work.
> > ² RH7. Don't start.
> >
> >
> >
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 23:16       ` Jeff V. Merkey
@ 2000-12-08 22:24         ` David Woodhouse
  2000-12-09  0:56           ` Jeff V. Merkey
  0 siblings, 1 reply; 57+ messages in thread
From: David Woodhouse @ 2000-12-08 22:24 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Mark Vojkovich, Andi Kleen, Rainer Mager, linux-kernel

On Fri, 8 Dec 2000, Jeff V. Merkey wrote:

> I have not seen it on UP systems either.  I only see it on SMP systems.
> After trying very hard last night, I was able to get my 4 x PPro system to
> do it with 2.4.0-12.  It seems related to loading in some way.  If you
> have more than two processors, the loading is less since there's more
> processors, and for whatever reason, it makes it harder to produce
> whatever race condition is causing it.  I can get it to happen
> pretty easily on a 2 x PII system.

Can you reproduce it with bcrl's patch below:

Index: mm/memory.c
===================================================================
RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
retrieving revision 1.2.2.40
diff -u -r1.2.2.40 memory.c
--- mm/memory.c	2000/12/05 13:33:39	1.2.2.40
+++ mm/memory.c	2000/12/08 22:24:09
@@ -860,6 +860,7 @@
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
+	set_pte(page_table, pte);
 	spin_unlock(&mm->page_table_lock);
 	new_page = page_cache_alloc();
 	if (!new_page)
@@ -870,6 +871,12 @@
 	 * Re-check the pte - we dropped the lock
 	 */
 	if (pte_same(*page_table, pte)) {
+		/* We are changing the pte, so get rid of the old
+		 * one to avoid races with the hardware, this really
+		 * only affects the accessed bit here.
+		 */
+		pte = ptep_get_and_clear(page_table);
+
 		if (PageReserved(old_page))
 			++mm->rss;
 		break_cow(vma, old_page, new_page, address, page_table);
@@ -1216,12 +1223,14 @@
 		return do_swap_page(mm, vma, address, pte, pte_to_swp_entry(entry), write_access);
 	}

+	entry = ptep_get_and_clear(pte);
 	if (write_access) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address, pte, entry);

 		entry = pte_mkdirty(entry);
 	}
+
 	entry = pte_mkyoung(entry);
 	establish_pte(vma, address, pte, entry);
 	spin_unlock(&mm->page_table_lock);


-- 
dwmw2
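
For readers who don't live in mm/memory.c, the idea behind the
ptep_get_and_clear() calls in the patch can be shown with a userspace
toy.  This is only a sketch under stated assumptions, not the kernel
code: toy_pte_t, toy_hw_touch and both toy_update_* functions are
invented names, the flag values merely resemble the i386 PTE bits, and
the "hardware" is simulated in the same thread so that the interleaving
is deterministic.  The point it tries to illustrate is that writing back
a stale snapshot can drop a bit another CPU set in the meantime, while
an atomic get-and-clear leaves the entry not present during the window,
so there is nothing for the hardware to update behind your back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy flag values, picked to resemble i386 PTE bits (an assumption). */
#define TOY_PRESENT  0x01u
#define TOY_ACCESSED 0x20u
#define TOY_DIRTY    0x40u

typedef _Atomic uint32_t toy_pte_t;

/* Stand-in for another CPU's MMU: it sets bits only while the entry is
 * present.  Returns 1 if it changed the entry, 0 if it was not present
 * (real hardware would take a page fault instead). */
static int toy_hw_touch(toy_pte_t *pte, uint32_t bits)
{
    uint32_t cur = atomic_load(pte);

    while (cur & TOY_PRESENT) {
        if (atomic_compare_exchange_weak(pte, &cur, cur | bits))
            return 1;
    }
    return 0;
}

/* Racy update: take a snapshot, let the "hardware" set the dirty bit,
 * then write back the stale snapshot.  The dirty bit is lost. */
uint32_t toy_update_racy(toy_pte_t *pte)
{
    uint32_t snapshot = atomic_load(pte);

    toy_hw_touch(pte, TOY_DIRTY);               /* lands in the window */
    atomic_store(pte, snapshot | TOY_ACCESSED); /* overwrites it */
    return atomic_load(pte);
}

/* Get-and-clear update: atomically swap the entry for 0, so whatever
 * the hardware set beforehand is captured and nothing can be set while
 * we work on our private copy. */
uint32_t toy_update_get_and_clear(toy_pte_t *pte)
{
    uint32_t old = atomic_exchange(pte, 0);
    int touched = toy_hw_touch(pte, TOY_DIRTY); /* entry is not present */

    assert(!touched);
    (void)touched;
    atomic_store(pte, old | TOY_ACCESSED);
    return atomic_load(pte);
}
```

Starting from an entry holding only TOY_PRESENT, toy_update_racy ends
up without TOY_DIRTY even though the simulated hardware set it, whereas
toy_update_get_and_clear keeps every bit that was set before the
exchange.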



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  3:25           ` davej
  2000-12-08 16:44             ` Matthew Vanecek
@ 2000-12-08 19:43             ` Dr. Kelsey Hudson
  1 sibling, 0 replies; 57+ messages in thread
From: Dr. Kelsey Hudson @ 2000-12-08 19:43 UTC (permalink / raw)
  To: davej; +Cc: Jeff V. Merkey, Rainer Mager, Linux Kernel Mailing List

On Fri, 8 Dec 2000 davej@suse.de wrote:

> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
> > BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs.  Looks like some
> > type of mapping problem.  I reported it three months ago, but it was
> > pretty much ignored.
> > 
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
> 
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.

Just to add some input and insight on here, I loaded the system down with
some FFT algorithms, and then ran an 8-way kernel compile. The machine in
question is a dual P3/600 with 512MB RAM, 2.4.0-test11. The load
skyrocketed to a mere 13.6. xmms was still running, didn't skip even
once. The FFT algorithms didn't bitch at all. Neither did the kernel
compile. In fact, it compiled without a hitch...

I dunno what to say about these boxes that segfault all the
time... Probably just bad hardware somewhere along the lines.

 Kelsey Hudson                                           khudson@ctica.com 
 Software Engineer
 Compendium Technologies, Inc                               (619) 725-0771
---------------------------------------------------------------------------     


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  2:04     ` Peter Samuelson
  2000-12-08 16:36       ` Matthew Vanecek
@ 2000-12-08 19:36       ` Dr. Kelsey Hudson
  1 sibling, 0 replies; 57+ messages in thread
From: Dr. Kelsey Hudson @ 2000-12-08 19:36 UTC (permalink / raw)
  To: Peter Samuelson; +Cc: Richard B. Johnson, Rainer Mager, linux-kernel

On Thu, 7 Dec 2000, Peter Samuelson wrote:

> 
> [Dick Johnson]
> > Do:
> > 
> > char main[]={0xff,0xff,0xff,0xff};
> 
> Oh come on, at least pick an *interesting* invalid opcode:
> 
>   char main[]={0xf0,0x0f,0xc0,0xc8};	/* try also on NT (: */

What's funny is that this actually executes on SPARC hardware, but
immediately segfaults. On Intel hardware, though, you get a message similar
to:

zsh: illegal hardware instruction (core dumped)  a.out

I wrote much the same program in college. It exploits the F0 0F bug
found in early Pentium hardware.

 Kelsey Hudson                                           khudson@ctica.com 
 Software Engineer
 Compendium Technologies, Inc                               (619) 725-0771
---------------------------------------------------------------------------     


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  9:46   ` David Woodhouse
  2000-12-08 14:06     ` Alan Cox
  2000-12-08 16:21     ` Horst von Brand
@ 2000-12-08 19:34     ` Mark Vojkovich
  2000-12-08 23:16       ` Jeff V. Merkey
  2 siblings, 1 reply; 57+ messages in thread
From: Mark Vojkovich @ 2000-12-08 19:34 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Andi Kleen, Rainer Mager, linux-kernel



On Fri, 8 Dec 2000, David Woodhouse wrote:

>
> ak@suse.de said:
> >  Sounds like a X Server bug. You should probably contact XFree86, not
> > linux-kernel
>
> I quote from the X devel list, which perhaps I shouldn't do but this is hardly
> NDA'd stuff:
>
> On Mon 20 Nov 2000, mvojkovich@valinux.com said:
> >   I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> > GX boards (Intel).  XFree86 core dumps indicate that it happens in
> > random places, in old as dirt software rendering code that has nothing
> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

   Some additional data points.  It goes away on UP 2.4 kernels.
Also, I can't recall seeing this problem on IA64.  Maybe it's still
there on IA64 and I just haven't been trying hard enough to crash
it, but my current impression is that the problem doesn't exist on IA64.

  Hmmm...  IA64 is a static server.  I don't hear of people having
problems on 3.3.6 servers either.  I'm wondering if a non-loader
4.0 server would have problems on IA32 with a 2.4 kernel.  That's
something for people to try.


				Mark.

>
> --
> dwmw2
>
> ¹ And the BP6 still falls over less frequently than the dual P3 I use at
> work.
> ² RH7. Don't start.
>
>
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:43         ` Jeff V. Merkey
  2000-12-08  1:55           ` Jeff V. Merkey
@ 2000-12-08 19:20           ` Dr. Kelsey Hudson
  1 sibling, 0 replies; 57+ messages in thread
From: Dr. Kelsey Hudson @ 2000-12-08 19:20 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Andi Kleen, Rainer Mager, linux-kernel

Don't post the core file... It's system-dependent and really won't do
anyone but yourself a shred of good.

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

> 
> 
> Andi Kleen wrote:
> > 
> > On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
> > >
> > > Andi,
> > >
> > > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > > affected other than X, SSH also get's spurious signal 11's now and again
> > > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> > 
> > So have you enabled core dumps and actually looked at the core dumps
> > of the programs using gdb to see where they crashed ?
> 
> Yes.  I can only get the SSH crash when I am running remotely from the
> house over the internet, and it only shows then when running a build in
> jobserver mode (parallel build).  The X problem seems related as well,
> since it's related to (usually) NetScape spawning off a forked process. 
> I will attempt to recreate tonight, and post the core dump file.  
> 
> Jeff 
> 
> 
> 
> 
> 
> > 
> > -Andi
> 

-- 
 Kelsey Hudson                                           khudson@ctica.com 
 Software Engineer
 Compendium Technologies, Inc                               (619) 725-0771
---------------------------------------------------------------------------     


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 16:49         ` Richard B. Johnson
@ 2000-12-08 17:40           ` Peter Samuelson
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Samuelson @ 2000-12-08 17:40 UTC (permalink / raw)
  To: root; +Cc: Matthew Vanecek, Rainer Mager, linux-kernel


[Dick Johnson]
> > >   char main[]={0xf0,0x0f,0xc0,0xc8};    /* try also on NT (: */
> > me2v@reliant DRFDecoder $ ./op
> > Illegal instruction (core dumped)
> 
> Yep. And on early Pentiums, the ones with the "f00f" bug, it would
> lock the machine tighter than a witches crotch. Ooops, not
> politically correct.... It would allow user-mode code to halt the
> machine.

...Until Linux 2.0.34 or so (can't remember the exact version number)
which had the workaround for this bug, about a week after the bug was
discovered.

And I was reminded in private mail that the correct lockup sequence is
actually

  char main[]={0xf0,0x0f,0xc7,0xc8};

where the 0xc8 can be anything from 0xc8 to 0xcf.
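
For reference, a minimal C sketch (not part of the original message) of
why the last byte may range from 0xc8 to 0xcf. It assumes the standard
x86 encoding, in which 0x0f 0xc7 with a ModRM "reg" field of 1 is
CMPXCHG8B and a ModRM "mod" field of 3 selects a register operand. The
function names are made up for illustration, and the code only inspects
the bytes; it never executes them.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of an x86 ModRM byte: mod (bits 7-6), reg (bits 5-3), rm (bits 2-0). */
struct modrm {
    unsigned mod;
    unsigned reg;
    unsigned rm;
};

struct modrm decode_modrm(unsigned char byte)
{
    struct modrm m;

    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm  = byte & 0x7;
    return m;
}

/* Returns 1 if the four bytes have the shape described above: LOCK
 * prefix (0xf0), two-byte opcode 0x0f 0xc7, and a ModRM byte with
 * mod == 3 and reg == 1, which is exactly the range 0xc8..0xcf. */
int is_f00f_sequence(const unsigned char b[4])
{
    struct modrm m;

    assert(b != NULL);
    m = decode_modrm(b[3]);
    return b[0] == 0xf0 && b[1] == 0x0f && b[2] == 0xc7 &&
           m.mod == 3 && m.reg == 1;
}

/* Print the field breakdown of one ModRM byte (illustration only). */
void print_modrm(unsigned char byte)
{
    struct modrm m = decode_modrm(byte);

    printf("0x%02x: mod=%u reg=%u rm=%u\n", byte, m.mod, m.reg, m.rm);
}
```

Every byte from 0xc8 to 0xcf decodes to mod=3, reg=1, with only the rm
field (the register number) changing, which matches the range given
above. The earlier 0x0f 0xc0 variant fails the opcode check.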

Peter

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08 16:36       ` Matthew Vanecek
@ 2000-12-08 16:49         ` Richard B. Johnson
  2000-12-08 17:40           ` Peter Samuelson
  0 siblings, 1 reply; 57+ messages in thread
From: Richard B. Johnson @ 2000-12-08 16:49 UTC (permalink / raw)
  To: Matthew Vanecek; +Cc: Peter Samuelson, Rainer Mager, linux-kernel

On Fri, 8 Dec 2000, Matthew Vanecek wrote:

> Peter Samuelson wrote:
> > 
> > [Dick Johnson]
> > > Do:
> > >
> > > char main[]={0xff,0xff,0xff,0xff};
> > 
> > Oh come on, at least pick an *interesting* invalid opcode:
> > 
> >   char main[]={0xf0,0x0f,0xc0,0xc8};    /* try also on NT (: */
> > 
> 
> me2v@reliant DRFDecoder $ ./op
> Illegal instruction (core dumped)
> 
> Is that the expected behavior?

Yep. And on early Pentiums, the ones with the "f00f" bug, it
would lock the machine tighter than a witches crotch. Ooops,
not politically correct.... It would allow user-mode code
to halt the machine.

Here is code that just quietly returns to the runtime code
that called it:

char main[]={0x90, 0x90, 0xc3};

FYI, if the .data section was not executable, you couldn't do
this. You would have to use some __asm__ stuff to put it in
the .text section. But, this is an interesting example of
how you can create code that the compiler refuses to generate.

It's easier to use assembly, though.....
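
On systems where data pages are mapped non-executable, the same three
bytes can still be run by placing them in memory that is explicitly
marked executable. The following is a rough sketch, not part of the
original message. It assumes an x86 or x86-64 CPU and a POSIX system
that permits mprotect() with PROT_EXEC on an anonymous mapping; the
function names are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copies `len` bytes of machine code into a fresh page, makes the page
 * executable, and calls it as a function taking and returning nothing.
 * Returns 0 if the call returned, 1 if skipped (not x86), -1 on error. */
int run_code_bytes(const unsigned char *code, size_t len)
{
#if defined(__i386__) || defined(__x86_64__)
    long page = sysconf(_SC_PAGESIZE);
    void *mem;
    void (*fn)(void);

    assert(code != NULL && len > 0);
    if (page <= 0 || len > (size_t)page)
        return -1;

    mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memcpy(mem, code, len);

    /* Drop write permission and add execute permission. */
    if (mprotect(mem, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, (size_t)page);
        return -1;
    }

    /* Object-to-function pointer conversion: not ISO C, but the usual
     * idiom on POSIX systems. */
    fn = (void (*)(void))mem;
    fn();

    munmap(mem, (size_t)page);
    return 0;
#else
    (void)code;
    (void)len;
    return 1;
#endif
}

/* The three bytes from the message above: nop, nop, ret. */
int run_nop_nop_ret(void)
{
    static const unsigned char code[] = { 0x90, 0x90, 0xc3 };

    return run_code_bytes(code, sizeof code);
}
```

Only pass bytes that end in a return; anything else will run off the
end of the copied code.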

Cheers,
Dick Johnson

Penguin : Linux version 2.4.0 on an i686 machine (799.54 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  3:25           ` davej
@ 2000-12-08 16:44             ` Matthew Vanecek
  2000-12-08 19:43             ` Dr. Kelsey Hudson
  1 sibling, 0 replies; 57+ messages in thread
From: Matthew Vanecek @ 2000-12-08 16:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List

davej@suse.de wrote:
> 
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
> > BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs.  Looks like some
> > type of mapping problem.  I reported it three months ago, but it was
> > pretty much ignored.
> >
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
> 
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.
> 
> regards,
> 
> Dave.
> 

I've noticed the same problem, and it occasionally happens with XFree86
4.0.1, as well.  Hopefully we've established that this is not the
hardware issue which gcc people are so fond of pushing sig 11s on (even
in the face of overwhelming evidence to the contrary).  It would be good
to have this put on a current to-do list and looked into.

-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  2:04     ` Peter Samuelson
@ 2000-12-08 16:36       ` Matthew Vanecek
  2000-12-08 16:49         ` Richard B. Johnson
  2000-12-08 19:36       ` Dr. Kelsey Hudson
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Vanecek @ 2000-12-08 16:36 UTC (permalink / raw)
  To: Peter Samuelson; +Cc: Richard B. Johnson, Rainer Mager, linux-kernel

Peter Samuelson wrote:
> 
> [Dick Johnson]
> > Do:
> >
> > char main[]={0xff,0xff,0xff,0xff};
> 
> Oh come on, at least pick an *interesting* invalid opcode:
> 
>   char main[]={0xf0,0x0f,0xc0,0xc8};    /* try also on NT (: */
> 

me2v@reliant DRFDecoder $ ./op
Illegal instruction (core dumped)

Is that the expected behavior?

-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
********************************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  9:46   ` David Woodhouse
  2000-12-08 14:06     ` Alan Cox
@ 2000-12-08 16:21     ` Horst von Brand
  2000-12-08 19:34     ` Mark Vojkovich
  2 siblings, 0 replies; 57+ messages in thread
From: Horst von Brand @ 2000-12-08 16:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Andi Kleen, Rainer Mager, linux-kernel, Mark Vojkovich


David Woodhouse <dwmw2@infradead.org> said:

[...]

> I quote from the X devel list, which perhaps I shouldn't do but this is
> hardly NDA'd stuff:

> On Mon 20 Nov 2000, mvojkovich@valinux.com said:
> >   I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> > GX boards (Intel).  XFree86 core dumps indicate that it happens in
> > random places, in old as dirt software rendering code that has nothing
> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem. 

> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

I get regular segfaults and random lockups trying to build CVS GCCs and
kernels since I updated RH 7 to glibc-2.2-5. P3, sr440bx mobo (UP),
2.2.18preX kernels; previously rock solid. Might be that the mains voltage
here tends to be out of whack, but I doubt it.
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Vin~a del Mar, Chile                               +56 32 672616


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  9:46   ` David Woodhouse
@ 2000-12-08 14:06     ` Alan Cox
  2000-12-09 19:01       ` Matthew Vanecek
  2000-12-11  0:58       ` Rainer Mager
  2000-12-08 16:21     ` Horst von Brand
  2000-12-08 19:34     ` Mark Vojkovich
  2 siblings, 2 replies; 57+ messages in thread
From: Alan Cox @ 2000-12-08 14:06 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Andi Kleen, Rainer Mager, linux-kernel, Mark Vojkovich

> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
> 
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  2:28       ` davej
  2000-12-08  3:13         ` Jeff V. Merkey
@ 2000-12-08 13:52         ` Alan Cox
  2000-12-15  0:11         ` lamont
  2 siblings, 0 replies; 57+ messages in thread
From: Alan Cox @ 2000-12-08 13:52 UTC (permalink / raw)
  To: davej; +Cc: Jeff V. Merkey, Rainer Mager, Linux Kernel Mailing List

> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.

This is consistent with page cache corruption in memory. We definitely had
that in older 2.4test kernels. I saw this building stuff on Linux parisc
and it was because some page of gcc had randomly decided to become something
different. Since that was test6 I didn't figure it important 8)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  0:44 ` Signal 11 Rainer Mager
                     ` (3 preceding siblings ...)
  2000-12-08  1:58   ` Richard B. Johnson
@ 2000-12-08  9:46   ` David Woodhouse
  2000-12-08 14:06     ` Alan Cox
                       ` (2 more replies)
  4 siblings, 3 replies; 57+ messages in thread
From: David Woodhouse @ 2000-12-08  9:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rainer Mager, linux-kernel, Mark Vojkovich


ak@suse.de said:
>  Sounds like a X Server bug. You should probably contact XFree86, not
> linux-kernel

I quote from the X devel list, which perhaps I shouldn't do but this is hardly 
NDA'd stuff:

On Mon 20 Nov 2000, mvojkovich@valinux.com said:
>   I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> GX boards (Intel).  XFree86 core dumps indicate that it happens in
> random places, in old as dirt software rendering code that has nothing
> wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> would say that this is definitely a kernel problem. 

XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
kernels - even on my BP6¹. The random crashes started to happen when I
upgraded my distribution² - and are only seen by people using 2.4. So I
suspect that it's the combination of glibc and kernel which is triggering
it.

--
dwmw2

¹ And the BP6 still falls over less frequently than the dual P3 I use at 
work.
² RH7. Don't start.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  3:13         ` Jeff V. Merkey
@ 2000-12-08  3:25           ` davej
  2000-12-08 16:44             ` Matthew Vanecek
  2000-12-08 19:43             ` Dr. Kelsey Hudson
  0 siblings, 2 replies; 57+ messages in thread
From: davej @ 2000-12-08  3:25 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Rainer Mager, Linux Kernel Mailing List

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

> I think there may be a case when a process forks, that the MMU or some
> other subsystem is either not setting the page bits correctly, or
> mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
> BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
> after executing fork() or one of the exec functions and at some places
> in the code where there's not any obvious coding bugs.  Looks like some
> type of mapping problem.  I reported it three months ago, but it was
> pretty much ignored.
> 
> Linus needs to add this one to the pre-12 list -- looks like some type
> of mapping bug.

Now that you mention it, every app that has bombed has been the type
that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
the CPU load up quite a lot, which was why I initially suspected
overheating. I don't see it on my other 2.4 boxes though which is
suspicious. But they don't get as much of a beating as this, which was
up until last week my main workstation.

regards,

Dave.

-- 
| Dave Jones <davej@suse.de>  http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  2:28       ` davej
@ 2000-12-08  3:13         ` Jeff V. Merkey
  2000-12-08  3:25           ` davej
  2000-12-08 13:52         ` Alan Cox
  2000-12-15  0:11         ` lamont
  2 siblings, 1 reply; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08  3:13 UTC (permalink / raw)
  To: davej; +Cc: Rainer Mager, Linux Kernel Mailing List


Dave,

I think there may be a case when a process forks, that the MMU or some
other subsystem is either not setting the page bits correctly, or
mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
after executing fork() or one of the exec functions and at some places
in the code where there's not any obvious coding bugs.  Looks like some
type of mapping problem.  I reported it three months ago, but it was
pretty much ignored.

Linus needs to add this one to the pre-12 list -- looks like some type
of mapping bug.

Jeff

davej@suse.de wrote:
> 
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > > affected other than X, SSH also gets spurious signal 11s now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> 
> <AOL>
> 
> I've begun to get a bit paranoid about my K6-2 500 box.
> 
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
> 
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
> 
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
> 
> *boggle*
> 
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
> 
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
> 
> regards,
> 
> Davej.
> 
> --
> | Dave Jones <davej@suse.de>  http://www.suse.de/~davej
> | SuSE Labs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:24     ` Jeff V. Merkey
  2000-12-08  1:40       ` Andi Kleen
@ 2000-12-08  2:28       ` davej
  2000-12-08  3:13         ` Jeff V. Merkey
                           ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: davej @ 2000-12-08  2:28 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Rainer Mager, Linux Kernel Mailing List

On Thu, 7 Dec 2000, Jeff V. Merkey wrote:

> It's related to some change in 2.4 vs. 2.2.  There are other programs
> affected other than X, SSH also gets spurious signal 11s now and again
> with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

<AOL>

I've begun to get a bit paranoid about my K6-2 500 box.

Various processes have been getting random signals after heavy CPU usage.
Playing an MPEG movie, kernel compile, or even just some small apps
compiling sometimes. Just for the record, this isn't an OOM situation,
I've watched this box with half its memory free or in buffers left
unattended, and suddenly a compile will just die.

I replaced the CPU with a brand new K6-2. Problem remained.
Next suspect was faulty RAM. Despite having passed a memtest, I
swapped out the DIMMs for some known good ones.
Suspecting cooling problems, I added some case fans.
Next came a bigger power supply. Still the problems.
The latest last ditch attempt to make this box stable has been
to attach the biggest fan I could find that would fit a socket 7 CPU.

And still the problems are there.
The only remaining suspect would be a flaky motherboard.
But then comes the real killer : This box is rock solid under 2.2

*boggle*

I'm not sure exactly when this started, but I think I first noticed
it around test5 or so, but didn't suspect the kernel at the time.

I've tried kernels compiled with everything from 2.91.66 when this
was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
debian on it.  If this is a compiler bug, it's one that no compiler
I've tried seems to be immune from.

regards,

Davej.

-- 
| Dave Jones <davej@suse.de>  http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: Signal 11
  2000-12-08  1:09   ` Michel LESPINASSE
@ 2000-12-08  2:14     ` Rainer Mager
  0 siblings, 0 replies; 57+ messages in thread
From: Rainer Mager @ 2000-12-08  2:14 UTC (permalink / raw)
  To: linux-kernel

Hi all,

	Thanks for all the input so far. Regarding this...

> (I'm not sure exactly what cerberos does, do you have a link for it ?).

The official name is "Cerberus Test Control System" aka CTCS. I don't know
the official site but a search for this should reveal something. Anyway it
is a pretty comprehensive test that includes multiple kernel compiles,
memory tests, disk test, etc, etc. Like I said, I ran this for more than 15
hours with no problems.

Well, actually, I did notice that if I run CTCS from within X then it
freezes up after a few minutes. This appears to happen when/because of
extreme swapping.


Aside from the above I've also run repeated kernel compiles (more than 50
times) with 'make -j bzImage' and had no problems; all outputs were
identical.

So given these tests, I'm reasonably confident the core hardware is ok. I
suppose it is possible there's some iffy bits in the G400's VRAM (but
wouldn't that just result in screen artifacts?). I will admit that I haven't
yet tried swapping RAM or any other system components.


Any other ideas?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:58   ` Richard B. Johnson
@ 2000-12-08  2:04     ` Peter Samuelson
  2000-12-08 16:36       ` Matthew Vanecek
  2000-12-08 19:36       ` Dr. Kelsey Hudson
  0 siblings, 2 replies; 57+ messages in thread
From: Peter Samuelson @ 2000-12-08  2:04 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Rainer Mager, linux-kernel


[Dick Johnson]
> Do:
> 
> char main[]={0xff,0xff,0xff,0xff};

Oh come on, at least pick an *interesting* invalid opcode:

  char main[]={0xf0,0x0f,0xc0,0xc8};	/* try also on NT (: */

Peter

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  0:44 ` Signal 11 Rainer Mager
                     ` (2 preceding siblings ...)
  2000-12-08  1:20   ` Andi Kleen
@ 2000-12-08  1:58   ` Richard B. Johnson
  2000-12-08  2:04     ` Peter Samuelson
  2000-12-08  9:46   ` David Woodhouse
  4 siblings, 1 reply; 57+ messages in thread
From: Richard B. Johnson @ 2000-12-08  1:58 UTC (permalink / raw)
  To: Rainer Mager; +Cc: linux-kernel

On Fri, 8 Dec 2000, Rainer Mager wrote:

> Hi all,
> 
> 	I've searched around for an answer to this with no real luck yet. If anyone
> has some ideas I'd be very grateful.

Signal 11 just means that you "seg-faulted". This is usually caused
by a coding error. However, if you have tools (like the C compiler)
that have been running fine but start to seg-fault, this points to
the very real possibility of a hardware error.

Modern RAM (with no error correction), running outside of its
timing specifications, is often the culprit. Even power supplies can
cause this problem. All you need is a single-bit error in a pointer's
value and -- signal 11.

Also, a bad opcode fetched from RAM with an error, also traps to
the same handler.

Do:

char main[]={0xff,0xff,0xff,0xff};


Compile and run this (it will compile!). You will see what
bad opcodes will do.
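
To make the single-bit pointer error mentioned above concrete, here is
a small sketch that is not part of the original message. It treats a
pointer as a plain 32-bit number; the address used in the note below is
made up, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns `value` with bit number `bit` (0 = least significant) inverted. */
uint32_t flip_bit(uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return value ^ ((uint32_t)1 << bit);
}

/* Number of bit positions in which a and b differ. */
unsigned bits_differing(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned n = 0;

    while (x != 0) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Print an address before and after one bit is flipped. */
void show_flip(uint32_t address, unsigned bit)
{
    printf("0x%08lx -> 0x%08lx (bit %u)\n",
           (unsigned long)address,
           (unsigned long)flip_bit(address, bit), bit);
}
```

For example, flipping bit 28 of the made-up address 0x0804a010 gives
0x1804a010, which is 256 MB away from the intended location. If nothing
is mapped there, dereferencing it raises signal 11.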



Cheers,
Dick Johnson

Penguin : Linux version 2.4.0 on an i686 machine (799.54 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:43         ` Jeff V. Merkey
@ 2000-12-08  1:55           ` Jeff V. Merkey
  2000-12-08 19:20           ` Dr. Kelsey Hudson
  1 sibling, 0 replies; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08  1:55 UTC (permalink / raw)
  To: Andi Kleen, Rainer Mager, linux-kernel



"Jeff V. Merkey" wrote:
> 
> > So have you enabled core dumps and actually looked at the core dumps
> > of the programs using gdb to see where they crashed ?
> 
> Yes.  I can only get the SSH crash when I am running remotely from the
> house over the internet, and it only shows then when running a build in
> jobserver mode (parallel build).  The X problem seems related as well,
> since it's related to (usually) NetScape spawning off a forked process.
> I will attempt to recreate tonight, and post the core dump file.

BTW.  Were I to wager a guess, I would guess it's related to the paging
problems in 2.4 when a process gets cloned, since every time I have seen
it, it happens when a child process gets forked then accesses the cloned
data from the parent.  In the previous core dumps, it always puked right
after a call to fork() when the child process attempted to WRITE (not
read) data in the program.
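
The access pattern described above (fork, then the child writes to data
inherited from the parent) can be exercised with a few lines of C. This
is only a sketch, not part of the original message and not a reproducer
for the reported crash; the buffer size and the function name are
arbitrary.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char shared_data[4096];

/* Returns 0 if the child saw the parent's data, could overwrite its own
 * copy, and the parent's copy was left untouched; 1 if any check failed
 * or the child was killed by a signal; -1 if fork() or waitpid() failed. */
int fork_write_check(void)
{
    pid_t pid;
    int status;
    size_t i;

    memset(shared_data, 'P', sizeof shared_data);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: read the inherited data first, then write to it. The
         * write is the point at which the child needs its own private
         * copy of the page. */
        for (i = 0; i < sizeof shared_data; i++)
            if (shared_data[i] != 'P')
                _exit(1);
        memset(shared_data, 'C', sizeof shared_data);
        for (i = 0; i < sizeof shared_data; i++)
            if (shared_data[i] != 'C')
                _exit(1);
        _exit(0);
    }

    assert(pid > 0);    /* only the parent gets here */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return 1;

    /* Parent: its copy must still hold the original contents. */
    for (i = 0; i < sizeof shared_data; i++)
        if (shared_data[i] != 'P')
            return 1;
    return 0;
}
```

A child dying with signal 11 at the memset() would show up here as a
return value of 1.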

Jeff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:40       ` Andi Kleen
@ 2000-12-08  1:43         ` Jeff V. Merkey
  2000-12-08  1:55           ` Jeff V. Merkey
  2000-12-08 19:20           ` Dr. Kelsey Hudson
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08  1:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rainer Mager, linux-kernel



Andi Kleen wrote:
> 
> On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
> >
> > Andi,
> >
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > > affected other than X, SSH also gets spurious signal 11s now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> 
> So have you enabled core dumps and actually looked at the core dumps
> of the programs using gdb to see where they crashed ?

Yes.  I can only get the SSH crash when I am running remotely from the
house over the internet, and it only shows then when running a build in
jobserver mode (parallel build).  The X problem seems related as well,
since it's related to (usually) NetScape spawning off a forked process. 
I will attempt to recreate tonight, and post the core dump file.  

Jeff 





> 
> -Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:24     ` Jeff V. Merkey
@ 2000-12-08  1:40       ` Andi Kleen
  2000-12-08  1:43         ` Jeff V. Merkey
  2000-12-08  2:28       ` davej
  1 sibling, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2000-12-08  1:40 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: Andi Kleen, Rainer Mager, linux-kernel

On Thu, Dec 07, 2000 at 06:24:34PM -0700, Jeff V. Merkey wrote:
> 
> Andi,
> 
> It's related to some change in 2.4 vs. 2.2.  There are other programs
> affected other than X, SSH also gets spurious signal 11s now and again
> with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

So have you enabled core dumps and actually looked at the core dumps 
of the programs using gdb to see where they crashed ? 
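
For a program you can rebuild, core dumps can also be enabled from
inside the process. The sketch below is not part of the original
message; it uses the POSIX getrlimit()/setrlimit() calls with
RLIMIT_CORE, and the function names are made up for illustration.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raises the soft core-file size limit to the hard limit for this
 * process and any children it starts afterwards.
 * Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    return 0;
}

/* Returns 1 if a core file of nonzero size is currently permitted. */
int core_dumps_allowed(void)
{
    struct rlimit lim;
    int rc = getrlimit(RLIMIT_CORE, &lim);

    assert(rc == 0);
    (void)rc;
    return lim.rlim_cur != 0;
}
```

If the hard limit is itself zero, the soft limit cannot be raised this
way. Once a core file exists, loading it with gdb (program name first,
core file second) and typing "bt" shows where the crash happened.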



-Andi


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  1:20   ` Andi Kleen
@ 2000-12-08  1:24     ` Jeff V. Merkey
  2000-12-08  1:40       ` Andi Kleen
  2000-12-08  2:28       ` davej
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08  1:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rainer Mager, linux-kernel


Andi,

It's related to some change in 2.4 vs. 2.2.  There are other programs
affected other than X, SSH also gets spurious signal 11s now and again
with 2.4 and glibc <= 2.1 and it does not occur on 2.2.

Jeff

Andi Kleen wrote:
> 
> On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:
> >       I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> > a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> > Anyway, about once every 2-3 days X will spontaneously die and the only info
> > I get back is that it was because of signal 11.
> >       I've heard that signal 11 can be related to bad hardware, most often
> > memory, but I've done a good bit of testing on this and the system seems ok.
> > What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> > errors. Actually this only worked when running from the console. When
> > running from X the machine locked up (although no signal 11).
> >       The only info I've gotten back from the XFree86 mailing lists so far is
> > that there are known and wide spread problems with SMP and these types of
> > problems. Can anyone comment on this? Are there known SMP problems? What is
> > the current resolution plan?
> 
> signal 11 just means that the program crashed with a segmentation fault.
> 
> Sounds like a X Server bug. You should probably contact XFree86, not
> linux-kernel
> 
> -Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  0:44 ` Signal 11 Rainer Mager
  2000-12-08  1:05   ` Jeff V. Merkey
  2000-12-08  1:09   ` Michel LESPINASSE
@ 2000-12-08  1:20   ` Andi Kleen
  2000-12-08  1:24     ` Jeff V. Merkey
  2000-12-08  1:58   ` Richard B. Johnson
  2000-12-08  9:46   ` David Woodhouse
  4 siblings, 1 reply; 57+ messages in thread
From: Andi Kleen @ 2000-12-08  1:20 UTC (permalink / raw)
  To: Rainer Mager; +Cc: linux-kernel

On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:
> 	I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> Anyway, about once every 2-3 days X will spontaneously die and the only info
> I get back is that it was because of signal 11.
> 	I've heard that signal 11 can be related to bad hardware, most often
> memory, but I've done a good bit of testing on this and the system seems ok.
> What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> errors. Actually this only worked when running from the console. When
> running from X the machine locked up (although no signal 11).
> 	The only info I've gotten back from the XFree86 mailing lists so far is
> that there are known and wide spread problems with SMP and these types of
> problems. Can anyone comment on this? Are there known SMP problems? What is
> the current resolution plan?

signal 11 just means that the program crashed with a segmentation fault. 

Sounds like a X Server bug. You should probably contact XFree86, not
linux-kernel


-Andi

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  0:44 ` Signal 11 Rainer Mager
  2000-12-08  1:05   ` Jeff V. Merkey
@ 2000-12-08  1:09   ` Michel LESPINASSE
  2000-12-08  2:14     ` Rainer Mager
  2000-12-08  1:20   ` Andi Kleen
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 57+ messages in thread
From: Michel LESPINASSE @ 2000-12-08  1:09 UTC (permalink / raw)
  To: Rainer Mager; +Cc: linux-kernel

On Fri, Dec 08, 2000 at 09:44:29AM +0900, Rainer Mager wrote:

> 	I've heard that signal 11 can be related to bad hardware, most
> often memory, but I've done a good bit of testing on this and the
> system seems ok.  What I did was to run the VA Linux Cerberos(sp?)
> test for 15 hours+ with no errors. Actually this only worked when
> running from the console. When running from X the machine locked up
> (although no signal 11).

Don't be so quick to dismiss the "bad hardware" possibility. It is
really quite common these days, and some cases of bad hardware are
not detected by simple tests like memtest86. (I'm not sure exactly
what cerberos does. Do you have a link for it?)

My recommendation would be to take a big source tree (say, a bit
bigger than the amount of RAM you have), and run repeated
tar + untar + diff -ru cycles on it for 48 hours or so. If your hardware
is OK, diff should not report any inconsistencies. I have found this test
to be quite reliable at detecting hardware problems. If you have several
disk controllers, run one instance of the test on each of
them. Additionally, you could run a background task to keep the CPU at
100%; a simple while(1) loop would do.
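
The pack, unpack, and compare cycle described above could be sketched roughly as follows. This is an illustration in Python, not the procedure the author actually ran: the function names (trees_differ, stress_once, run_stress) and the temporary work directory layout are assumptions, and the background CPU load is left out.

```python
import filecmp
import os
import tarfile
import tempfile


def trees_differ(left, right):
    """Return True if the two directory trees differ in names or file contents."""
    cmp = filecmp.dircmp(left, right)
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return True
    # shallow=False forces a byte-by-byte comparison instead of trusting stat().
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return True
    return any(
        trees_differ(os.path.join(left, d), os.path.join(right, d))
        for d in cmp.common_dirs
    )


def stress_once(source_dir, work_dir):
    """Pack source_dir, unpack the archive, and compare the copy to the original."""
    archive = os.path.join(work_dir, "tree.tar")
    unpacked = os.path.join(work_dir, "unpacked")
    with tarfile.open(archive, "w") as tar:
        tar.add(source_dir, arcname="tree")
    with tarfile.open(archive, "r") as tar:
        tar.extractall(unpacked)
    return not trees_differ(source_dir, os.path.join(unpacked, "tree"))


def run_stress(source_dir, rounds):
    """Repeat the cycle; return the index of the first bad round, or None."""
    for i in range(rounds):
        # Each round gets a fresh scratch directory that is deleted afterwards.
        with tempfile.TemporaryDirectory() as work_dir:
            if not stress_once(source_dir, work_dir):
                return i
    return None
```

Calling run_stress on a large tree with a high round count approximates the 48-hour run. As the text says, the tree should be bigger than RAM, otherwise most reads are served from the page cache and the disk path is barely exercised. Point the temporary directory at the disk you want to test, for example through the TMPDIR environment variable.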

-- 
Michel "Walken" LESPINASSE
Of course I think I'm right. If I thought I was wrong, I'd change my mind.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Signal 11
  2000-12-08  0:44 ` Signal 11 Rainer Mager
@ 2000-12-08  1:05   ` Jeff V. Merkey
  2000-12-08  1:09   ` Michel LESPINASSE
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 57+ messages in thread
From: Jeff V. Merkey @ 2000-12-08  1:05 UTC (permalink / raw)
  To: Rainer Mager; +Cc: linux-kernel


I have previously reported this error (about three months ago) on 2.4
with XFree86 3.3.6.  If you are running RedHat 6.2, then you are running
this X Server.  It also shows up on Caldera's 2.4 eDesktop.  It appears
to be something with glib 2.1 < versions on 2.4.  I also see it with
secure shell 1.2.27 on 2.4.  I've seen it on RH 7.0 on 2.4 kernels
as well, but only with SSH.

Jeff

Rainer Mager wrote:
> 
> Hi all,
> 
>         I've searched around for an answer to this with no real luck yet. If anyone
> has some ideas I'd be very grateful.
> 
>         I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
> a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
> Anyway, about once every 2-3 days X will spontaneously die and the only info
> I get back is that it was because of signal 11.
>         I've heard that signal 11 can be related to bad hardware, most often
> memory, but I've done a good bit of testing on this and the system seems ok.
> What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
> errors. Actually this only worked when running from the console. When
> running from X the machine locked up (although no signal 11).
>         The only info I've gotten back from the XFree86 mailing lists so far is
> that there are known and widespread problems with SMP and these types of
> problems. Can anyone comment on this? Are there known SMP problems? What is
> the current resolution plan?
> 
> Thanks,
> 
> --Rainer
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Signal 11
  2000-12-08  0:27 Linux 2.2.18pre25 Alan Cox
@ 2000-12-08  0:44 ` Rainer Mager
  2000-12-08  1:05   ` Jeff V. Merkey
                     ` (4 more replies)
  0 siblings, 5 replies; 57+ messages in thread
From: Rainer Mager @ 2000-12-08  0:44 UTC (permalink / raw)
  To: linux-kernel

Hi all,

	I've searched around for an answer to this with no real luck yet. If anyone
has some ideas I'd be very grateful.

	I recently upgraded to a new machine. It is running RedHat 6.2 Linux (with
a SMP 2.4.0test[8-11] kernel) and has a Matrox G400 in it. X is 4.0.1.
Anyway, about once every 2-3 days X will spontaneously die and the only info
I get back is that it was because of signal 11.
	I've heard that signal 11 can be related to bad hardware, most often
memory, but I've done a good bit of testing on this and the system seems ok.
What I did was to run the VA Linux Cerberos(sp?) test for 15 hours+ with no
errors. Actually this only worked when running from the console. When
running from X the machine locked up (although no signal 11).
	The only info I've gotten back from the XFree86 mailing lists so far is
that there are known and widespread problems with SMP and these types of
problems. Can anyone comment on this? Are there known SMP problems? What is
the current resolution plan?


Thanks,

--Rainer


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2000-12-16  1:57 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-14 12:42 Signal 11 Clayton Weaver
2000-12-14 19:11 ` Linus Torvalds
2000-12-14 22:35   ` Alan Cox
2000-12-14 22:45     ` Linus Torvalds
2000-12-14 22:58       ` Bernhard Rosenkraenzer
2000-12-14 23:11         ` Linus Torvalds
2000-12-15  0:10         ` Miquel van Smoorenburg
2000-12-15  0:32           ` Alan Cox
2000-12-15  0:42             ` Miquel van Smoorenburg
2000-12-15  2:07             ` Michael Peddemors
2000-12-15  1:09               ` Alan Cox
2000-12-15 16:12                 ` Theodore Y. Ts'o
2000-12-14 23:24       ` Alan Cox
2000-12-14 23:35   ` Jakub Jelinek
2000-12-14 23:51     ` Linus Torvalds
2000-12-15  0:11       ` Dan Egli
2000-12-16  1:28         ` Signal 11gy Alan Cox
2000-12-14 22:46 ` Signal 11 Jakub Jelinek
  -- strict thread matches above, loose matches on Subject: below --
2000-12-09  5:32 davej
2000-12-08  0:27 Linux 2.2.18pre25 Alan Cox
2000-12-08  0:44 ` Signal 11 Rainer Mager
2000-12-08  1:05   ` Jeff V. Merkey
2000-12-08  1:09   ` Michel LESPINASSE
2000-12-08  2:14     ` Rainer Mager
2000-12-08  1:20   ` Andi Kleen
2000-12-08  1:24     ` Jeff V. Merkey
2000-12-08  1:40       ` Andi Kleen
2000-12-08  1:43         ` Jeff V. Merkey
2000-12-08  1:55           ` Jeff V. Merkey
2000-12-08 19:20           ` Dr. Kelsey Hudson
2000-12-08  2:28       ` davej
2000-12-08  3:13         ` Jeff V. Merkey
2000-12-08  3:25           ` davej
2000-12-08 16:44             ` Matthew Vanecek
2000-12-08 19:43             ` Dr. Kelsey Hudson
2000-12-08 13:52         ` Alan Cox
2000-12-15  0:11         ` lamont
2000-12-08  1:58   ` Richard B. Johnson
2000-12-08  2:04     ` Peter Samuelson
2000-12-08 16:36       ` Matthew Vanecek
2000-12-08 16:49         ` Richard B. Johnson
2000-12-08 17:40           ` Peter Samuelson
2000-12-08 19:36       ` Dr. Kelsey Hudson
2000-12-08  9:46   ` David Woodhouse
2000-12-08 14:06     ` Alan Cox
2000-12-09 19:01       ` Matthew Vanecek
2000-12-09 19:20         ` davej
2000-12-09 23:31           ` Matthew Vanecek
2000-12-11  0:58       ` Rainer Mager
2000-12-11  9:05         ` Rainer Mager
2000-12-11 13:33           ` Mike Galbraith
2000-12-11 23:24             ` Rainer Mager
2000-12-11 14:14           ` davej
2000-12-08 16:21     ` Horst von Brand
2000-12-08 19:34     ` Mark Vojkovich
2000-12-08 23:16       ` Jeff V. Merkey
2000-12-08 22:24         ` David Woodhouse
2000-12-09  0:56           ` Jeff V. Merkey
