linux-kernel.vger.kernel.org archive mirror
* [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
@ 2006-06-19 19:15 Andreas Mohr
  2006-06-19 19:39 ` John Richard Moser
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Andreas Mohr @ 2006-06-19 19:15 UTC
  To: linux-kernel

Hello all,

while looking for loop places to apply cpu_relax() to, I found the
following gems:

arch/i386/kernel/crash.c/crash_nmi_callback():

        /* Assume hlt works */
        halt();
        for(;;);

        return 1;
}

arch/i386/kernel/doublefault.c/doublefault_fn():

        for (;;) /* nothing */;
}

Let's assume that we have a less than moderate fan failure that causes
the CPU to heat up beyond the critical limit...
That might result in - you guessed it - crashes or doublefaults.
In which case we enter the corresponding handler and do... what?
Exactly, we accelerate the CPU's happy march into bit heaven by letting it
execute a busy-loop under a non-working fan.
Thanks, your users will be very happy, I think ;)
(especially since it was "just" a simple fan failure that could have been
entirely remedied by buying another fan for $3)


The same thing applies to
arch/i386/kernel/smp.c/stop_this_cpu(), albeit there it's less catastrophic
due to most likely normal working conditions there.

IMHO on any critical CPU failure we should:
- try to log it (might be difficult with a broken CPU, though)
- optionally somehow directly alert the user
- STOP the system, COMPLETELY (that way people WILL take notice, hopefully
  before it's too late and actual damage will have occurred)
- make DAMN SURE that the (possibly already broken) CPU won't have a
  less than nice time once the system is stopped

Am I completely missing something here?

If this is an issue, then maybe we should consolidate those places into
one function that safely(!) halts a CPU, optionally disabling APIC etc.
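
A rough sketch of what such a consolidated helper could look like, written
so that it can be exercised in userspace. Everything here is an assumption
for illustration: struct cpu_stop_ops, stop_cpu_safely() and the demo_*
stand-ins are hypothetical names, not existing kernel interfaces. In the
kernel the hooks would presumably map to cli, an optional APIC shutdown
and hlt.

```c
#include <stddef.h>

/* Hypothetical per-architecture hooks; none of these are kernel APIs. */
struct cpu_stop_ops {
	void (*log)(const char *reason); /* best-effort logging, may be NULL */
	void (*irq_off)(void);           /* e.g. cli on i386 */
	void (*quiesce)(void);           /* e.g. local APIC off, may be NULL */
	void (*wait)(void);              /* e.g. hlt, or rep;nop as fallback */
	int (*keep_waiting)(void);       /* NULL in real use: never return */
};

void stop_cpu_safely(const struct cpu_stop_ops *ops, const char *reason)
{
	if (ops->log)
		ops->log(reason);
	ops->irq_off();
	if (ops->quiesce)
		ops->quiesce();
	/* Re-issue the low-power wait whenever the CPU wakes up,
	 * instead of falling through into a bare for(;;); */
	while (!ops->keep_waiting || ops->keep_waiting())
		ops->wait();
}

/* Userspace stand-ins so the sketch can run outside the kernel. */
static int demo_irq_off_calls, demo_wait_calls, demo_budget;

static void demo_irq_off(void) { demo_irq_off_calls++; }
static void demo_wait(void) { demo_wait_calls++; }
static int demo_keep_waiting(void) { return demo_budget-- > 0; }

/* Runs the helper with a bounded number of simulated wake-ups and
 * returns how often the wait hook was invoked. */
int demo_stop_cpu(int wakeups)
{
	const struct cpu_stop_ops ops = {
		.log = NULL,
		.irq_off = demo_irq_off,
		.quiesce = NULL,
		.wait = demo_wait,
		.keep_waiting = demo_keep_waiting,
	};

	demo_irq_off_calls = 0;
	demo_wait_calls = 0;
	demo_budget = wakeups;
	stop_cpu_safely(&ops, "demo");
	return demo_wait_calls;
}
```

With keep_waiting left NULL the function never returns, which is what the
halt sites above want; the bounded variant exists only for testing.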

Oh, and once you finished processing my mail here, you could optionally
also look at my report about almost unusably broken USB:
http://lkml.org/lkml/2006/6/19/54
(no replies yet despite advanced breakage)

Thanks!

Andreas Mohr

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 19:15 [RFC/SERIOUS] grilling troubled CPUs for fun and profit? Andreas Mohr
@ 2006-06-19 19:39 ` John Richard Moser
  2006-06-19 20:00 ` linux-os (Dick Johnson)
  2006-06-19 22:16 ` Pavel Machek
  2 siblings, 0 replies; 26+ messages in thread
From: John Richard Moser @ 2006-06-19 19:39 UTC
  To: Andreas Mohr; +Cc: linux-kernel

Andreas Mohr wrote:
> Hello all,
> 
> while looking for loop places to apply cpu_relax() to, I found the
> following gems:
> 
> arch/i386/kernel/crash.c/crash_nmi_callback():
> 
>         /* Assume hlt works */
>         halt();
>         for(;;);
> 
>         return 1;
> }
> 
> arch/i386/kernel/doublefault.c/doublefault_fn():
> 
>         for (;;) /* nothing */;
> }
> 
> Let's assume that we have a less than moderate fan failure that causes
> the CPU to heat up beyond the critical limit...
> That might result in - you guessed it - crashes or doublefaults.
> In which case we enter the corresponding handler and do... what?

Looks like it calls halt() to put the CPU into idle mode, and then
performs a nop?  (I think the null condition evaluates false.... not
sure, haven't tried this before!)
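
For reference, an omitted controlling expression in a C for statement does
not evaluate false: the language treats it as a nonzero constant, so
for(;;); spins forever once it is reached (here, only if halt() ever
returns). A minimal userspace sketch; spins_until_break() is a
hypothetical name, and the break exists only so the demonstration
terminates:

```c
/* Counts iterations of a for(;;) loop. The omitted condition counts
 * as "true", so only the explicit break ends the loop. */
unsigned int spins_until_break(unsigned int limit)
{
	unsigned int n = 0;

	for (;;) {
		if (n == limit)
			break;
		n++;
	}
	return n;
}
```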

> Exactly, we accelerate the CPU's happy march into bit heaven by letting it
> execute a busy-loop under a non-working fan.
> Thanks, your users will be very happy, I think ;)
> (especially since it was "just" a simple fan failure that could have been
> entirely remedied by buying another fan for $3)
> 
> 
> The same thing applies to
> arch/i386/kernel/smp.c/stop_this_cpu(), albeit there it's less catastrophic
> due to most likely normal working conditions there.
> 
> IMHO on any critical CPU failure we should:
> - try to log it (might be difficult with a broken CPU, though)
> - optionally somehow directly alert the user
> - STOP the system, COMPLETELY (that way people WILL take notice, hopefully
>   before it's too late and actual damage will have occurred)
> - make DAMN SURE that the (possibly already broken) CPU won't have a
>   less than nice time once the system is stopped
> 
> Am I completely missing something here?
> 
> If this is an issue, then maybe we should consolidate those places into
> one function that safely(!) halts a CPU, optionally disabling APIC etc.
> 
> Oh, and once you finished processing my mail here, you could optionally
> also look at my report about almost unusably broken USB:
> http://lkml.org/lkml/2006/6/19/54
> (no replies yet despite advanced breakage)
> 
> Thanks!
> 
> Andreas Mohr
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond

    We will enslave their women, eat their children and rape their
    cattle!
                  -- Bosc, Evil alien overlord from the fifth dimension

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 19:15 [RFC/SERIOUS] grilling troubled CPUs for fun and profit? Andreas Mohr
  2006-06-19 19:39 ` John Richard Moser
@ 2006-06-19 20:00 ` linux-os (Dick Johnson)
  2006-06-19 20:23   ` Dave Jones
  2006-06-19 21:16   ` Claudio Martins
  2006-06-19 22:16 ` Pavel Machek
  2 siblings, 2 replies; 26+ messages in thread
From: linux-os (Dick Johnson) @ 2006-06-19 20:00 UTC
  To: Andreas Mohr; +Cc: linux-kernel


On Mon, 19 Jun 2006, Andreas Mohr wrote:

> Hello all,
>
> while looking for loop places to apply cpu_relax() to, I found the
> following gems:
>
> arch/i386/kernel/crash.c/crash_nmi_callback():
>
>        /* Assume hlt works */
>        halt();
>        for(;;);
>
>        return 1;
> }
>
> arch/i386/kernel/doublefault.c/doublefault_fn():
>
>        for (;;) /* nothing */;
> }
>
> Let's assume that we have a less than moderate fan failure that causes
> the CPU to heat up beyond the critical limit...
> That might result in - you guessed it - crashes or doublefaults.
> In which case we enter the corresponding handler and do... what?

The double-fault is just a place-holder. The CPU will actually
reset without even executing this (try it).

> Exactly, we accelerate the CPU's happy march into bit heaven by letting it
> execute a busy-loop under a non-working fan.

You accelerate nothing. Bit heaven? A CPU without a fan will go into
a cold, cold, shutdown, requiring a hardware reset to get it out of
that latched, no internal clock running, mode. Try it. I have had
broken plastic heat-sink hold-downs let the entire heat-sink fall off
the CPU. The machine just stops. I didn't know why it was stopping
since each time I reset it, it ran long enough to boot. When
I took the side panel off, I was surprised to see the heat-sink
and fan assembly hanging by the fan wires. Also, the CPU was only
warm to the touch, having been completely shut down for the
several minutes it took to locate tools to remove the cover, even
though I deliberately left the power ON.

> Thanks, your users will be very happy, I think ;)
> (especially since it was "just" a simple fan failure that could have been
> entirely remedied by buying another fan for $3)
>
>
> The same thing applies to
> arch/i386/kernel/smp.c/stop_this_cpu(), albeit there it's less catastrophic
> due to most likely normal working conditions there.
>
> IMHO on any critical CPU failure we should:
> - try to log it (might be difficult with a broken CPU, though)
> - optionally somehow directly alert the user
> - STOP the system, COMPLETELY (that way people WILL take notice, hopefully
>  before it's too late and actual damage will have occurred)
> - make DAMN SURE that the (possibly already broken) CPU won't have a
>  less than nice time once the system is stopped


In the first place, when the Intel and AMD CPUs overheat, they
shut down. To prevent them from cycling ON/OFF, they need a
hardware reset to take them out of shutdown mode. There is no
need to fret about busy-waiting.

For sure, it might be nicer to have some call-and-never-return
function for waiting with the rep-nop code, but it isn't necessary
for CPU protection.

>
> Am I completely missing something here?
>

See above.

> If this is an issue, then maybe we should consolidate those places into
> one function that safely(!) halts a CPU, optionally disabling APIC etc.
>
> Oh, and once you finished processing my mail here, you could optionally
> also look at my report about almost unusably broken USB:
> http://lkml.org/lkml/2006/6/19/54
> (no replies yet despite advanced breakage)
>
> Thanks!
>
> Andreas Mohr

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.72 BogoMips).
New book: http://www.AbominableFirebug.com/

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:00 ` linux-os (Dick Johnson)
@ 2006-06-19 20:23   ` Dave Jones
  2006-06-19 20:47     ` linux-os (Dick Johnson)
                       ` (2 more replies)
  2006-06-19 21:16   ` Claudio Martins
  1 sibling, 3 replies; 26+ messages in thread
From: Dave Jones @ 2006-06-19 20:23 UTC
  To: linux-os (Dick Johnson); +Cc: Andreas Mohr, linux-kernel

On Mon, Jun 19, 2006 at 04:00:06PM -0400, linux-os (Dick Johnson) wrote:

 > > arch/i386/kernel/doublefault.c/doublefault_fn():
 > >
 > >        for (;;) /* nothing */;
 > > }
 > >
 > > Let's assume that we have a less than moderate fan failure that causes
 > > the CPU to heat up beyond the critical limit...
 > > That might result in - you guessed it - crashes or doublefaults.
 > > In which case we enter the corresponding handler and do... what?
 > 
 > The double-fault is just a place-holder. The CPU will actually
 > reset without even executing this (try it).

Wrong.

Why do you think we go to the bother of installing a double fault handler if
we're going to reset? Why would we go to the bother of printk'ing
information about the double fault if we're about to reset faster than
it would get to a serial console ?

The box intentionally locks up, so we have a chance to know wtf happened.

 > A CPU without a fan will go into
 > a cold, cold, shutdown, requiring a hardware reset to get it out of
 > that latched, no internal clock running, mode.

Wrong.

 > Try it. I have had
 > broken plastic heat-sink hold-downs let the entire heat-sink fall off
 > the CPU. The machine just stops.

Your single datapoint is just that, a single datapoint.
There are a number of reported cases of CPUs frying themselves.
Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
Google no doubt has more.

Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
(as in broke in two pieces) under extreme heat.

This _does_ happen.

 > Also, the CPU was only warm to the touch, having been completely shut down for the
 > several minutes it took to locate tools to remove the cover, even
 > though I deliberately left the power ON.

So you got lucky. I've blistered a thumb on hot CPUs before now
after fan failure.

 > In the first place, when the Intel and AMD CPUs overheat, they
 > shut down. 

Reality disagrees with you.

 > For sure, it might be nicer to have some call-and-never-return
 > function for waiting with the rep-nop code, but it isn't necessary
 > for CPU protection.

cpu_relax() and friends aren't going to save a box in light of
a fan failure in my experience.  
However for a box which has locked up (intentionally)
running instructions that do save power in a loop has obvious advantages.

		Dave

-- 
http://www.codemonkey.org.uk

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:23   ` Dave Jones
@ 2006-06-19 20:47     ` linux-os (Dick Johnson)
  2006-06-19 20:59       ` Dave Jones
  2006-06-19 22:25     ` Pavel Machek
  2006-06-20  9:54     ` Jan Engelhardt
  2 siblings, 1 reply; 26+ messages in thread
From: linux-os (Dick Johnson) @ 2006-06-19 20:47 UTC
  To: Dave Jones; +Cc: Andreas Mohr, Linux kernel


On Mon, 19 Jun 2006, Dave Jones wrote:

> On Mon, Jun 19, 2006 at 04:00:06PM -0400, linux-os (Dick Johnson) wrote:
>
> > > arch/i386/kernel/doublefault.c/doublefault_fn():
> > >
> > >        for (;;) /* nothing */;
> > > }
> > >
> > > Let's assume that we have a less than moderate fan failure that causes
> > > the CPU to heat up beyond the critical limit...
> > > That might result in - you guessed it - crashes or doublefaults.
> > > In which case we enter the corresponding handler and do... what?
> >
> > The double-fault is just a place-holder. The CPU will actually
> > reset without even executing this (try it).
>
> Wrong.
>
> Why do you think we go to the bother of installing a double fault handler if
> we're going to reset? Why would we go to the bother of printk'ing
> information about the double fault if we're about to reset faster than
> it would get to a serial console ?
>

I don't know why you go to the bother of installing such a handler.
Have you actually gotten it to print something? All my experience
with double-faults (and many with your RH Linux, BTW) result in
the screen going blank, the POST starting, and the machine re-booting.

Don't think so? Make a real double-fault, reset the paging bit
while executing somewhere that's not 1:1 mapped. Been there, done
that.

> The box intentionally locks up, so we have a chance to know wtf happened.
>
> > A CPU without a fan will go into
> > a cold, cold, shutdown, requiring a hardware reset to get it out of
> > that latched, no internal clock running, mode.
>
> Wrong.
>
> > Try it. I have had
> > broken plastic heat-sink hold-downs let the entire heat-sink fall off
> > the CPU. The machine just stops.
>
> Your single datapoint is just that, a single datapoint.
> There are a number of reported cases of CPUs frying themselves.
> Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
> Google no doubt has more.
>
> Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
> (as in broke in two pieces) under extreme heat.
>
> This _does_ happen.
>


Maybe it wasn't a FAN failure? The ceramic won't shatter even if
you play a burnz-o-matic blowtorch on it (don't try this at home).
It's a refractory material! To break the material, something
inside (the chip) probably exploded causing an overpressure that
cracked the device. This explosion could be caused by any number
of defects, including a bond failure.

> > Also, the CPU was only warm to the touch, having been completely shut down for the
> > several minutes it took to locate tools to remove the cover, even
> > though I deliberately left the power ON.
>
> So you got lucky. I've blistered a thumb on hot CPUs before now
> after fan failure.
>
> > In the first place, when the Intel and AMD CPUs overheat, they
> > shut down.
>
> Reality disagrees with you.
>

Then send the CPU back. It's in the specification. It's supposed
to shut down.

Further, if you `google` for intel cpu overheat shutdown, you
will find 144,000 results of complaints of the CPUs shutting
down because of over temperature.

> > For sure, it might be nicer to have some call-and-never-return
> > function for waiting with the rep-nop code, but it isn't necessary
> > for CPU protection.
>
> cpu_relax() and friends aren't going to save a box in light of
> a fan failure in my experience.
> However for a box which has locked up (intentionally)
> running instructions that do save power in a loop has obvious advantages.
>
> 		Dave
>
> --
> http://www.codemonkey.org.uk
>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.72 BogoMips).
New book: http://www.AbominableFirebug.com/

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:47     ` linux-os (Dick Johnson)
@ 2006-06-19 20:59       ` Dave Jones
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Jones @ 2006-06-19 20:59 UTC
  To: linux-os (Dick Johnson); +Cc: Andreas Mohr, Linux kernel

On Mon, Jun 19, 2006 at 04:47:37PM -0400, linux-os (Dick Johnson) wrote:

 > > Why do you think we go to the bother of installing a double fault handler if
 > > we're going to reset? Why would we go to the bother of printk'ing
 > > information about the double fault if we're about to reset faster than
 > > it would get to a serial console ?
 > 
 > I don't know why you go to the bother of installing such a handler.

I even explained why in my mail.

 "The box intentionally locks up, so we have a chance to know wtf happened."

What's easier to debug: A box that just spontaneously reboots, or a box
that locks up with doublefault information on screen ?

 > Have you actually gotten it to print something? All my experience
 > with double-faults (and many with your RH Linux, BTW) result in
 > the screen going blank, the POST starting, and the machine re-booting.

No release of Red Hat Linux ever shipped with a double fault handler.
RHEL4 is the only product we sell that has it. Fedora also has it since
FC2 iirc.

 > > Your single datapoint is just that, a single datapoint.
 > > There are a number of reported cases of CPUs frying themselves.
 > > Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
 > > Google no doubt has more.
 > >
 > > Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
 > > (as in broke in two pieces) under extreme heat.
 > >
 > > This _does_ happen.
 > 
 > Maybe it wasn't a FAN failure?

Same fan was transplanted to another box after that death, where it
was found to be as dead as the CPU & board it was previously plugged into.

Given the heat of the dead CPU, it's safe to say it wasn't operational
at time of death.

		Dave

-- 
http://www.codemonkey.org.uk

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:00 ` linux-os (Dick Johnson)
  2006-06-19 20:23   ` Dave Jones
@ 2006-06-19 21:16   ` Claudio Martins
  1 sibling, 0 replies; 26+ messages in thread
From: Claudio Martins @ 2006-06-19 21:16 UTC
  To: linux-os (Dick Johnson); +Cc: Andreas Mohr, linux-kernel


On Monday 19 June 2006 21:00, linux-os (Dick Johnson) wrote:
> You accelerate nothing. Bit heaven? A CPU without a fan will go into
> a cold, cold, shutdown, requiring a hardware reset to get it out of
> that latched, no internal clock running, mode. Try it. I have had
> broken plastic heat-sink hold-downs let the entire heat-sink fall off
> the CPU. The machine just stops. I didn't know why it was stopping

 That may well be true for Intel Pentium 4 and AMD Athlon64/Opteron CPUs but 
it is definitely *not* what happens on older AMD CPUs. Actually we have had a 
case of smoke coming out of a box when the heatsink fan on an Athlon XP CPU 
stopped. You could still smell the burnt plastic from the socket the next day 
in the room ;-)
 And I personally know of at least one other case with an Athlon XP.

 I think Intel used to be better than AMD concerning CPU casing and the size 
of the area in thermal contact with the heatsink, but not anymore. 
Opteron/Athlon64 case seems to be at least as good as Intel's, and the chips 
should now have thermal throttling, though I haven't tried it!

Regards

Cláudio


* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 19:15 [RFC/SERIOUS] grilling troubled CPUs for fun and profit? Andreas Mohr
  2006-06-19 19:39 ` John Richard Moser
  2006-06-19 20:00 ` linux-os (Dick Johnson)
@ 2006-06-19 22:16 ` Pavel Machek
  2006-06-19 22:43   ` Dave Jones
  2 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2006-06-19 22:16 UTC
  To: Andreas Mohr; +Cc: linux-kernel

Hi!

> while looking for loop places to apply cpu_relax() to, I found the
> following gems:
> 
> arch/i386/kernel/crash.c/crash_nmi_callback():
> 
>         /* Assume hlt works */
>         halt();
>         for(;;);
> 
>         return 1;
> }
> 
> arch/i386/kernel/doublefault.c/doublefault_fn():
> 
>         for (;;) /* nothing */;
> }
> 
> Let's assume that we have a less than moderate fan failure that causes
> the CPU to heat up beyond the critical limit...
> That might result in - you guessed it - crashes or doublefaults.
> In which case we enter the corresponding handler and do... what?
> Exactly, we accelerate the CPU's happy march into bit heaven by letting it

...
> Am I completely missing something here?

Yes. You are missing that modern hw already protects itself. See my  blog on planet.kernel.org.

-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:23   ` Dave Jones
  2006-06-19 20:47     ` linux-os (Dick Johnson)
@ 2006-06-19 22:25     ` Pavel Machek
  2006-06-19 22:41       ` Dave Jones
  2006-06-20  9:58       ` Jan Engelhardt
  2006-06-20  9:54     ` Jan Engelhardt
  2 siblings, 2 replies; 26+ messages in thread
From: Pavel Machek @ 2006-06-19 22:25 UTC
  To: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

Hi!

>  > Try it. I have had
>  > broken plastic heat-sink hold-downs let the entire heat-sink fall off
>  > the CPU. The machine just stops.
> 
> Your single datapoint is just that, a single datapoint.
> There are a number of reported cases of CPUs frying themselves.
> Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
> Google no doubt has more.
> 
> Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
> (as in broke in two pieces) under extreme heat.
> 
> This _does_ happen.

If it happens to you... you needed a new cpu anyway. Anything non-historical
*has* thermal protection.

BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing to try if old
athlon can run cli; hlt code w/o heatsink?).

And no, we probably do not want to enter C2 or C3 from doublefault handler.
				Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:25     ` Pavel Machek
@ 2006-06-19 22:41       ` Dave Jones
  2006-06-20 11:39         ` linux-os (Dick Johnson)
  2006-06-22 17:47         ` Pavel Machek
  2006-06-20  9:58       ` Jan Engelhardt
  1 sibling, 2 replies; 26+ messages in thread
From: Dave Jones @ 2006-06-19 22:41 UTC
  To: Pavel Machek; +Cc: linux-os (Dick Johnson), Andreas Mohr, linux-kernel

On Tue, Jun 20, 2006 at 12:25:29AM +0200, Pavel Machek wrote:
 
 > > Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
 > > (as in broke in two pieces) under extreme heat.
 > > 
 > > This _does_ happen.
 > 
 > If it happens to you... you needed a new cpu anyway. Anything non-historical
 > *has* thermal protection.

That's the single dumbest thing I've read today.

newsflash: you don't get to dictate when I (or anyone else) buys new hardware.
Before its accident, that box happily was my home firewall for 3 years, and
its replacement is actually an /older/ box.  I didn't "need a new cpu" at all.

(incidentally, it was replaced with a VIA C3 that doesn't need a fan :)

 > BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing to try if old
 > athlon can run cli; hlt code w/o heatsink?).

you snipped the important part of my mail.

"cpu_relax() and friends aren't going to save a box"

We have two completely different things being discussed in this thread.

1. Fan failure, and the possibility to keep running.
IMO, there's nothing we can do here, and nor should we try.

2. Situations where we forcibly lock up and spin the CPU in a tight loop,
producing heat.  Given there are CPUs that benefit from cpu_relax()
in such places, adding them so that they don't unnecessarily sit there
sucking power until someone gets to the datacenter to investigate
can only be a good thing.
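
To illustrate point 2: on i386, cpu_relax() boils down to "rep; nop" (the
PAUSE instruction), which is unprivileged and can therefore be tried from
userspace. A minimal sketch, assuming GCC-style inline asm; relax_hint()
and spin_relaxed() are hypothetical names, and the iteration bound exists
only so that the demonstration returns:

```c
/* Stand-in for the kernel's cpu_relax(): "rep; nop" on x86, and
 * only a compiler barrier on other architectures in this sketch. */
static inline void relax_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("rep; nop" ::: "memory");
#else
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Spins with the hint on every iteration and returns the number of
 * iterations actually performed. */
unsigned long spin_relaxed(unsigned long iterations)
{
	unsigned long done = 0;

	while (done < iterations) {
		relax_hint();
		done++;
	}
	return done;
}
```

The lock-up sites would presumably use the unbounded form,
for (;;) cpu_relax();, in place of the bare for (;;);.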

 > And no, we probably do not want to enter C2 or C3 from doublefault handler.

I didn't see that being proposed.

		Dave

-- 
http://www.codemonkey.org.uk

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:16 ` Pavel Machek
@ 2006-06-19 22:43   ` Dave Jones
  2006-06-20  7:29     ` Andreas Mohr
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jones @ 2006-06-19 22:43 UTC
  To: Pavel Machek; +Cc: Andreas Mohr, linux-kernel

On Tue, Jun 20, 2006 at 12:16:55AM +0200, Pavel Machek wrote:

 > > Am I completely missing something here?
 > 
 > Yes. You are missing that modern hw already protects itself. See my  blog on planet.kernel.org.

And you are missing that not everyone is running linux on the latest CPUs.

		Dave

-- 
http://www.codemonkey.org.uk

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:43   ` Dave Jones
@ 2006-06-20  7:29     ` Andreas Mohr
  0 siblings, 0 replies; 26+ messages in thread
From: Andreas Mohr @ 2006-06-20  7:29 UTC
  To: Dave Jones, Pavel Machek, linux-kernel

Hi,

[whoa, maybe I shouldn't have used such an inflammatory subject,
the mail volume would suggest that ;-)]

On Mon, Jun 19, 2006 at 06:43:12PM -0400, Dave Jones wrote:
> On Tue, Jun 20, 2006 at 12:16:55AM +0200, Pavel Machek wrote:
> 
>  > > Am I completely missing something here?
>  > 
>  > Yes. You are missing that modern hw already protects itself. See my  blog on planet.kernel.org.
> 
> And you are missing that not everyone is running linux on the latest CPUs.

Darn right. Intel has had thermal protection for somewhat longer,
but Athlon had serious issues with missing or incompletely functioning
thermal sensors on many not too outdated (let's say it was 4 years ago, ok?)
motherboard/CPU combos.

And I don't really want to know about thermal protection status of various
Cyrix, VIA or even Winchip CPUs (you've been a nic^H^Hïve believer of
competition in a healthy marketplace and bought some of those, right?
I know I did... ;).

But since Pavel's blog mentions that thermal protection is an ACPI
specification, there's hope that it may actually work half-decently
after all.

Andreas Mohr

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 20:23   ` Dave Jones
  2006-06-19 20:47     ` linux-os (Dick Johnson)
  2006-06-19 22:25     ` Pavel Machek
@ 2006-06-20  9:54     ` Jan Engelhardt
  2 siblings, 0 replies; 26+ messages in thread
From: Jan Engelhardt @ 2006-06-20  9:54 UTC
  To: Dave Jones; +Cc: linux-os (Dick Johnson), Andreas Mohr, linux-kernel

> > > Let's assume that we have a less than moderate fan failure that causes
> > > the CPU to heat up beyond the critical limit...
> > > That might result in - you guessed it - crashes or doublefaults.
> > > In which case we enter the corresponding handler and do... what?
> > 
> > A CPU without a fan will go into
> > a cold, cold, shutdown, requiring a hardware reset to get it out of
> > that latched, no internal clock running, mode.
>
>Wrong.
>

Fact: The fan of a Sony Vaio U3 (TM5800 CPU) increases its speed when:
- I'm in the BIOS (seems programmed using busy-wait)
- kernel panic
- or worse

IOW, whenever it is not executing HLT.
What do we learn? No automatic shutdown, at least not into a cool state.

> > In the first place, when the Intel and AMD CPUs overheat, they
> > shut down. 

Can't confirm this either. The same behavior as with the TM5800 (list
above) can be observed on AMD K6 systems, also with autonomous fans
(preferably on VIA Apollo boards).

>cpu_relax() and friends aren't going to save a box in light of
>a fan failure in my experience.  
>However for a box which has locked up (intentionally)
>running instructions that do save power in a loop has obvious advantages.


Jan Engelhardt
-- 

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:25     ` Pavel Machek
  2006-06-19 22:41       ` Dave Jones
@ 2006-06-20  9:58       ` Jan Engelhardt
  2006-06-22 18:16         ` Pavel Machek
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Engelhardt @ 2006-06-20  9:58 UTC
  To: Pavel Machek
  Cc: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

>
>If it happens to you... you needed a new cpu anyway. Anything non-historical
>*has* thermal protection.
>
>BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing 
>to try if old
>athlon can run cli; hlt code w/o heatsink?).

K6 run cooler even with the regular kernel HLT (sti/hlt I presume). 
Difference to full load can be up to 10 deg, depending on ambient (room) 
temperature. In winter (read 2005-12-31) it ran between 28 celsius and 34 
celsius. The fan even stopped and I thought it was a fan failure, but 
luckily it was just hw-controlled :)



Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:41       ` Dave Jones
@ 2006-06-20 11:39         ` linux-os (Dick Johnson)
  2006-06-21 17:16           ` Ian Romanick
  2006-06-22 17:47         ` Pavel Machek
  1 sibling, 1 reply; 26+ messages in thread
From: linux-os (Dick Johnson) @ 2006-06-20 11:39 UTC (permalink / raw)
  To: Dave Jones; +Cc: Pavel Machek, Andreas Mohr, linux-kernel


On Mon, 19 Jun 2006, Dave Jones wrote:

> On Tue, Jun 20, 2006 at 12:25:29AM +0200, Pavel Machek wrote:
>
> > > Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
> > > (as in broke in two pieces) under extreme heat.
> > >
> > > This _does_ happen.
> >
> > If it happens to you... you needed a new cpu anyway. Anything non-historical
> > *has* thermal protection.
>
> That's the single dumbest thing I've read today.
>
> newsflash: you don't get to dictate when I (or anyone else) buys new hardware.
> Before its accident, that box happily was my home firewall for 3 years, and
> its replacement is actually an /older/ box.  I didn't "need a new cpu" at all.
>
> (incidentally, it was replaced with a VIA C3 that doesn't need a fan :)
>
> > BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing to try if old
> > athlon can run cli; hlt code w/o heatsink?).
>
> you snipped the important part of my mail.
>
> "cpu_relax() and friends aren't going to save a box"
>
> We have two completely different things being discussed in this thread.
>
> 1. Fan failure, and the possibility to keep running.
> IMO, there's nothing we can do here, and nor should we try.
>
> 2. Situations where we forcibly lock up and spin the CPU in a tight loop,
> producing heat.

The CPU produces heat. Its efficiency as a heater is nearly 100%
because it doesn't produce much noise or anything else to transfer
its 50+ watts into anything but heat. Spinning doesn't make friction.
It doesn't make more heat. The total box dissipation might even
be reduced because there is little memory activity and no seeks
of hard disks.

Some CPUs will go to a low-power 'sleep' mode if halted. Some require
more than that, they must fetch 'pause', i.e., rep nop to stay in
a low power mode. Other CPUs will throttle back their power
consumption when the instruction cache is inactive, read spinning.
These CPUs are normally used in lap-tops to maximize battery life.

So relax. The CPU isn't going to get dizzy because it's spinning!

[SNIPPED]


Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.72 BogoMips).
New book: http://www.AbominableFirebug.com/

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-20 11:39         ` linux-os (Dick Johnson)
@ 2006-06-21 17:16           ` Ian Romanick
  2006-06-21 17:57             ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Romanick @ 2006-06-21 17:16 UTC (permalink / raw)
  To: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

linux-os (Dick Johnson) wrote:

> The CPU produces heat. Its efficiency as a heater is nearly 100%
> because it doesn't produce much noise or anything else to transfer
> its 50+ watts into anything but heat. Spinning doesn't make friction.
> It doesn't make more heat. The total box dissipation might even
> be reduced because there is little memory activity and no seeks
> of hard disks.
> 
> Some CPUs will go to a low-power 'sleep' mode if halted. Some require
> more than that, they must fetch 'pause', i.e., rep nop to stay in
> a low power mode. Other CPUs will throttle back their power
> consumption when the instruction cache is inactive, read spinning.
> These CPUs are normally used in lap-tops to maximize battery life.

What creates a fair amount of the heat in a CPU is the period of time
when the transistors switch from nearly zero resistance to infinite
resistance.  That brief period where the resistance very high but not
yet infinite generates heat.  That's why running a CPU at a higher clock
speed generates more heat:  there are more of those high resistance
transitions in a given period of time.

I think every processor since at least 1998 or so has had a mode where
executing the HLT instruction puts the bulk of the chip in a steady
state.  When it's in that steady state, the transistors don't switch, so
there are no high resistance periods.

This is, of course, completely different than the sleep or reduced clock
modes that modern processors support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFEmX7wX1gOwKyEAw8RAu+MAJwOgq1rwrcr+dQHnM9ucT6R1+ZlOACgkLR7
J0CmccUthcHknCaXYTL6ON0=
=lkld
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-21 17:16           ` Ian Romanick
@ 2006-06-21 17:57             ` linux-os (Dick Johnson)
  0 siblings, 0 replies; 26+ messages in thread
From: linux-os (Dick Johnson) @ 2006-06-21 17:57 UTC (permalink / raw)
  To: Ian Romanick; +Cc: linux-kernel


On Wed, 21 Jun 2006, Ian Romanick wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> linux-os (Dick Johnson) wrote:
>
>> The CPU produces heat. Its efficiency as a heater is nearly 100%
>> because it doesn't produce much noise or anything else to transfer
>> its 50+ watts into anything but heat. Spinning doesn't make friction.
>> It doesn't make more heat. The total box dissipation might even
>> be reduced because there is little memory activity and no seeks
>> of hard disks.
>>
>> Some CPUs will go to a low-power 'sleep' mode if halted. Some require
>> more than that, they must fetch 'pause', i.e., rep nop to stay in
>> a low power mode. Other CPUs will throttle back their power
>> consumption when the instruction cache is inactive, read spinning.
>> These CPUs are normally used in lap-tops to maximize battery life.
>
> What creates a fair amount of the heat in a CPU is the period of time
> when the transistors switch from nearly zero resistance to infinite
> resistance.  That brief period where the resistance is very high but not
> yet infinite generates heat.  That's why running a CPU at a higher clock
> speed generates more heat:  there are more of those high resistance
> transitions in a given period of time.
>
> I think every processor since at least 1998 or so has had a mode where
> executing the HLT instruction puts the bulk of the chip in a steady
> state.  When it's in that steady state, the transistors don't switch, so
> there are no high resistance periods.
>
> This is, of course, completely different than the sleep or reduced clock
> modes that modern processors support.

P = I^2 * R, where R is resistance, I is current, and P is power. So
naively, one might assume that a high resistance state would cause
high power dissipation. This is not what's happening when the clock
is running.

What happens is that logic levels form voltage levels within devices.
The elements that are charged and discharged due to logic level changes
have associated capacitances. Changing the voltage on a capacitor
requires current to flow. It's this current, often called displacement
current, flowing through the non-zero output resistances of the devices
that charge these capacities, that cause the dominant power loss. There
are other resistances as well, through which this displacement current
must flow.

This displacement current varies directly with the charging rate
(frequency), but the power loss (from I^2) varies as the square of
the frequency. That's why reducing the clock rate during idle periods
even a small amount, or shutting down some clocked portions of the
chip during such periods can be effective in reducing the power
dissipation and, therefore, heat generation. If you reduce the
current a small amount, the power reduction can be substantial.
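
As a rough cross-check on the scaling argument above, here is a minimal
sketch of the usual first-order CMOS switching-power model,
P = alpha * C * V^2 * f. The model is a textbook approximation and is not
something stated in this thread; the function name and the example values
are made up for illustration. In this model power is linear in clock
frequency at a fixed supply voltage, and the larger savings come from
lowering the voltage together with the clock.

```c
#include <assert.h>

/*
 * First-order CMOS dynamic (switching) power: P = alpha * C * V^2 * f.
 *   alpha    - activity factor, fraction of capacitance switched per cycle
 *   c_farads - total switched capacitance
 *   v_volts  - supply voltage
 *   f_hz     - clock frequency
 * Illustrative helper only; leakage and short-circuit power are ignored.
 */
double dynamic_power_watts(double alpha, double c_farads,
                           double v_volts, double f_hz)
{
    assert(alpha >= 0.0 && alpha <= 1.0);
    return alpha * c_farads * v_volts * v_volts * f_hz;
}
```

With alpha = 0.2, C = 100 nF, V = 1.5 V and f = 1 GHz this gives 45 W;
halving the clock alone gives 22.5 W, and dropping the voltage as well
reduces it further.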

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.88 BogoMips).
New book: http://www.AbominableFirebug.com/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 22:41       ` Dave Jones
  2006-06-20 11:39         ` linux-os (Dick Johnson)
@ 2006-06-22 17:47         ` Pavel Machek
  1 sibling, 0 replies; 26+ messages in thread
From: Pavel Machek @ 2006-06-22 17:47 UTC (permalink / raw)
  To: Dave Jones, Pavel Machek, linux-os (Dick Johnson),
	Andreas Mohr, linux-kernel

Hi!

>  > > Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
>  > > (as in broke in two pieces) under extreme heat.
>  > > 
>  > > This _does_ happen.
>  > 
>  > If it happens to you... you needed a new cpu anyway. Anything non-historical
>  > *has* thermal protection.
> 
> That's the single dumbest thing I've read today.

Sorry to upset you.

> newsflash: you don't get to dictate when I (or anyone else) buys new hardware.
> Before its accident, that box happily was my home firewall for 3 years, and
> its replacement is actually an /older/ box.  I didn't "need a new cpu" at all.
> 

Yep, it is bad... broken machines die. And while proposed patches
probably will not hurt, I do not think they will help, and I do not think
anyone will ever *test* them.

>  > BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing to try if old
>  > athlon can run cli; hlt code w/o heatsink?).
> 
> you snipped the important part of my mail.
> 
> "cpu_relax() and friends aren't going to save a box"
> 
> We have two completely different things being discussed in this thread.
> 
> 1. Fan failure, and the possibility to keep running.
> IMO, there's nothing we can do here, and nor should we try.

Agreed... so you  know that proposed patch would probably not prevent your old
box from self-destructing?

> 2. Situations where we forcibly lock up and spin the CPU in a tight loop,
> producing heat.  Given there are CPUs that benefit from cpu_relax()
> in such places, adding them so that they don't unnecessarily sit there
> sucking power until someone gets to the datacenter to investigate
> can only be a good thing.

Are you sure that cpu_relax actually does something on pre-pentium-4 machines?
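
For context, a minimal user-space sketch of the pattern in question. As
far as I know, cpu_relax() on i386 kernels of this period boils down to
"rep; nop", which is the PAUSE encoding and executes as a plain NOP on
CPUs that predate PAUSE. The names cpu_relax_sketch() and spin_wait() and
the iteration bound are made up for illustration, and the inline asm
assumes a GCC-compatible compiler.

```c
#include <assert.h>

/*
 * User-space stand-in for the kernel's cpu_relax(). On x86 it emits
 * "rep; nop" (the PAUSE encoding); elsewhere it is only a compiler barrier.
 */
static inline void cpu_relax_sketch(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Spin until *flag becomes nonzero or max_spins iterations have passed.
 * Returns the number of iterations spent spinning. The bound exists only
 * so that the demonstration terminates.
 */
unsigned long spin_wait(volatile int *flag, unsigned long max_spins)
{
    unsigned long spins = 0;

    assert(flag != 0);
    while (!*flag && spins < max_spins) {
        cpu_relax_sketch();
        spins++;
    }
    return spins;
}
```

Whether the hint saves any measurable power depends on the CPU; the
sketch only shows where the instruction sits in the loop.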

>  > And no, we probably do not want to enter C2 or C3 from doublefault handler.
> 
> I didn't see that being proposed.

Good. 
			Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-20  9:58       ` Jan Engelhardt
@ 2006-06-22 18:16         ` Pavel Machek
  2006-06-23 17:32           ` Jan Engelhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2006-06-22 18:16 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

Hi!

> >If it happens to you... you needed a new cpu anyway. Anything non-historical
> >*has* thermal protection.
> >
> >BTW I doubt those old athlons can be saved by cli; hlt . (Someone willing 
> >to try if old
> >athlon can run cli; hlt code w/o heatsink?).
> 
> K6 run cooler even with the regular kernel HLT (sti/hlt I presume). 
> Difference to full load can be up to 10 deg, depending on ambient (room) 
> temperature. In winter (read 2005-12-31) it ran between 28 celsius and 34 
> celsius. The fan even stopped and I thought it was a fan failure, but 
> luckily it was just hw-controlled :)

Okay, so you've got a point. The patch is useful on k6 in the winter
:-). (Actually, to show you've got a point, you'd have to stop the fan
and show that cpu badly overheats under for(;;) conditions).

Yes, we probably want to consolidate various for(;;) loops... and
maybe it will help someone. If overheating causes reboot instead of
panic, you probably still lose, as BIOS is close to for(;;)...

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-22 18:16         ` Pavel Machek
@ 2006-06-23 17:32           ` Jan Engelhardt
  2006-06-24 19:54             ` Pavel Machek
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Engelhardt @ 2006-06-23 17:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

>> K6 run cooler even with the regular kernel HLT (sti/hlt I presume). 
>> Difference to full load can be up to 10 deg, depending on ambient (room) 
>> temperature. In winter (read 2005-12-31) it ran between 28 celsius and 34 
>> celsius. The fan even stopped and I thought it was a fan failure, but 
>> luckily it was just hw-controlled :)
>
>Okay, so you've got a point. The patch is useful on k6 in the winter
>:-). (Actually, to show you've got a point, you'd have to stop the fan
>and show that cpu badly overheats under for(;;) conditions).

If you buy me a new CPU, no problem :p
Maybe someone from AMD with spare K6s can try.

>Yes, we probably want to consolidate various for(;;) loops... and
>maybe it will help someone. If overheating causes reboot instead of
>panic, you probably still lose, as BIOS is close to for(;;)...

The best would be to turn the machine off after, say, 60 seconds. (So you 
can grab the panic, if there is one.)


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-23 17:32           ` Jan Engelhardt
@ 2006-06-24 19:54             ` Pavel Machek
  2006-06-25 11:01               ` Jan Engelhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2006-06-24 19:54 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

On Fri 2006-06-23 19:32:59, Jan Engelhardt wrote:
> >> K6 run cooler even with the regular kernel HLT (sti/hlt I presume). 
> >> Difference to full load can be up to 10 deg, depending on ambient (room) 
> >> temperature. In winter (read 2005-12-31) it ran between 28 celsius and 34 
> >> celsius. The fan even stopped and I thought it was a fan failure, but 
> >> luckily it was just hw-controlled :)
> >
> >Okay, so you've got a point. The patch is useful on k6 in the winter
> >:-). (Actually, to show you've got a point, you'd have to stop the fan
> >and show that cpu badly overheats under for(;;) conditions).
> 
> If you buy me a new CPU, no problem :p
> Maybe someone from AMD with spare K6s can try.
> 
> >Yes, we probably want to consolidate various for(;;) loops... and
> >maybe it will help someone. If overheating causes reboot instead of
> >panic, you probably still lose, as BIOS is close to for(;;)...
> 
> The best would be to turn the machine off after, say, 60 seconds. (So you 
> can grab the panic, if there is one.)

I'd rather not overdo it. Shutting the machine down is a pretty hard
operation.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-24 19:54             ` Pavel Machek
@ 2006-06-25 11:01               ` Jan Engelhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Engelhardt @ 2006-06-25 11:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

>> The best would be to turn the machine off after, say, 60 seconds. (So you 
>> can grab the panic, if there is one.)
>
>I'd rather not overdo it. Shutting the machine down is a pretty hard
>operation.

Hm. Too bad a CPU does not have a simple low-level trigger for poweroff, 
like there is for reboot when a triplefault occurs.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
@ 2006-06-20  3:30 Ken Ryan
  0 siblings, 0 replies; 26+ messages in thread
From: Ken Ryan @ 2006-06-20  3:30 UTC (permalink / raw)
  To: linux-kernel

 > > You accelerate nothing. Bit heaven? A CPU without a fan will go into
 > > a cold, cold, shutdown, requiring a hardware reset to get it out of
 > > that latched, no internal clock running, mode.
 >
 > Some CPU may do this, others will go via the random-generator mode
 > into the self-deformation-mode instead.

A few years ago Tom's Hardware Guide made a cool video as part of an 
article on thermal emergencies.  The article is here:

http://www.tomshardware.com/2001/09/17/hot_spot/index.html

The test was pulling off the CPU fan and heatsink while playing Quake. 
Granted it's not entirely realistic; I don't imagine the heatsink would 
come off during heavy gameplay (a more reasonable scenario THG mentions
is the fan/heatsink coming off during shipping) however considering the 
preposterous little tabs AMD specs for their sockets I think sudden 
breakage is not out of the question.

The video shows a PIII coping (halting), a P4 gracefully slowing down, 
and two variants of Athlon self-destructing (smoke and running solder).

Evidently this set of tests convinced AMD to alter how they handled
overtemp on their processors.  The mobos in the test were built 
according to spec in terms of the thermal sensors and protection code in 
the BIOS.  It didn't help; the exposed die of the Athlon ramped up its 
temperature way faster than the sensor could react.

As for the ceramic package cracking, it is certainly possible.  The
ceramic is indeed designed for very high temperatures, but only if 
heated evenly.  Give the package a 200C temperature differential within 
a second or two and thermal expansion is going to do some damage...

I can certainly believe modern processors deal with sudden thermal rise 
better than the ones in the THG video.  However not all of us can afford 
to always have the latest 'n' greatest... :-(

		ken

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
       [not found] ` <fa.so5wrYE6MzA2swzlOE1Xjw9iqvk@ifi.uio.no>
@ 2006-06-19 23:32   ` Robert Hancock
  0 siblings, 0 replies; 26+ messages in thread
From: Robert Hancock @ 2006-06-19 23:32 UTC (permalink / raw)
  To: Dave Jones, linux-os (Dick Johnson), Andreas Mohr, linux-kernel

Dave Jones wrote:
> Wrong.
> 
>  > Try it. I have had
>  > broken plastic heat-sink hold-downs let the entire heat-sink fall off
>  > the CPU. The machine just stops.
> 
> Your single datapoint is just that, a single datapoint.
> There are a number of reported cases of CPUs frying themselves.
> Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
> Google no doubt has more.
> 
> Another anecdote: Upon fan failure, I once had an athlon MP *completely shatter*
> (as in broke in two pieces) under extreme heat.
> 
> This _does_ happen.

This is entirely dependent on the CPU type. Intel CPUs for years would
shut down on over-temperature (the THERMTRIP signal). Pentium 4s will
additionally throttle the CPU clock before reaching that temperature.
Older AMD CPUs like that Athlon MP indeed had no internal temperature
limiting - if there was any it was part of the motherboard, which often
responded too slowly such that it was indeed possible to smoke
(literally) CPUs if the heatsink fell off. Athlon 64/Opteron CPUs will
shut down on exceeding a critical temperature.

> cpu_relax() and friends aren't going to save a box in light of
> a fan failure in my experience.  
> However for a box which has locked up (intentionally)
> running instructions that do save power in a loop has obvious advantages.

cpu_relax likely doesn't save a whole lot of power, I wouldn't think..
However, at least in the first case that was pointed out, halt() is
called first, the loop only is there in case the halt() somehow exits
for some reason.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
  2006-06-19 21:40   ` Bodo Eggert
@ 2006-06-19 21:44     ` Dave Jones
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Jones @ 2006-06-19 21:44 UTC (permalink / raw)
  To: 7eggert; +Cc: linux-os (Dick Johnson), Andreas Mohr, linux-kernel

On Mon, Jun 19, 2006 at 11:40:46PM +0200, Bodo Eggert wrote:
 > linux-os (Dick Johnson) <linux-os@analogic.com> wrote:
 > > On Mon, 19 Jun 2006, Andreas Mohr wrote:
 > 
 > >> while looking for loop places to apply cpu_relax() to, I found the
 > >> following gems:
 > >>
 > >> arch/i386/kernel/crash.c/crash_nmi_callback():
 > >>
 > >>        /* Assume hlt works */
 > >>        halt();
 > >>        for(;;);
 > >>
 > >>        return 1;
 > >> }
 > 
 > This will result in the processor burning energy again after the first
 > interrupt, won't it?

Interrupts are disabled here.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?
       [not found] ` <6pyer-2Pt-1@gated-at.bofh.it>
@ 2006-06-19 21:40   ` Bodo Eggert
  2006-06-19 21:44     ` Dave Jones
  0 siblings, 1 reply; 26+ messages in thread
From: Bodo Eggert @ 2006-06-19 21:40 UTC (permalink / raw)
  To: linux-os (Dick Johnson), Andreas Mohr, linux-kernel

linux-os (Dick Johnson) <linux-os@analogic.com> wrote:
> On Mon, 19 Jun 2006, Andreas Mohr wrote:

>> while looking for loop places to apply cpu_relax() to, I found the
>> following gems:
>>
>> arch/i386/kernel/crash.c/crash_nmi_callback():
>>
>>        /* Assume hlt works */
>>        halt();
>>        for(;;);
>>
>>        return 1;
>> }

This will result in the processor burning energy again after the first
interrupt, won't it?

>> arch/i386/kernel/doublefault.c/doublefault_fn():
>>
>>        for (;;) /* nothing */;
>> }
>>
>> Let's assume that we have a less than moderate fan failure that causes
>> the CPU to heat up beyond the critical limit...
>> That might result in - you guessed it - crashes or doublefaults.
>> In which case we enter the corresponding handler and do... what?
> 
> The double-fault is just a place-holder. The CPU will actually
> reset without even executing this (try it).

If it does reset, you don't need that function (you can use anything),
and if not, doing while(1) halt(); in both cases should DTRT even if the
system is FUBAR, and it should be even smaller.
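
A minimal user-space model of that "while(1) halt();" idea, since the
real hlt instruction is privileged and cannot be run from a test program.
The halt_fn hook, halt_loop_model() and count_halt() are hypothetical
names introduced here, and the max_halts bound exists only so the sketch
terminates; kernel code would loop forever. The point is that whenever
halt returns (for example after an NMI or SMI wakes the CPU), the loop
halts again instead of falling through into a busy for(;;).

```c
#include <assert.h>

/* Hypothetical hook type: stands in for the privileged halt() (hlt). */
typedef void (*halt_fn)(void *ctx);

/*
 * Model of "while (1) halt();": each time halt returns, halt again
 * rather than spinning. Returns how many times halt was invoked.
 * max_halts bounds the loop for demonstration purposes only.
 */
unsigned long halt_loop_model(halt_fn halt, void *ctx,
                              unsigned long max_halts)
{
    unsigned long halts = 0;

    assert(halt != 0);
    while (halts < max_halts) {
        halt(ctx);
        halts++;
    }
    return halts;
}

/* Demo stub: counts how often the CPU would have been halted. */
void count_halt(void *ctx)
{
    (*(unsigned long *)ctx)++;
}
```

Calling halt_loop_model(count_halt, &n, 3) with n starting at 0 invokes
the stub three times, which models one initial halt plus two wakeups
that are each answered by halting again.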

>> Exactly, we accelerate the CPUs happy march into bit heaven by letting it
>> execute a busy-loop under a non-working fan.
> 
> You accelerate nothing. Bit heaven? A CPU without a fan will go into
> a cold, cold, shutdown, requiring a hardware reset to get it out of
> that latched, no internal clock running, mode.

Some CPU may do this, others will go via the random-generator mode into the
self-deformation-mode instead.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

http://david.woodhou.se/why-not-spf.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2006-06-25 11:01 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-19 19:15 [RFC/SERIOUS] grilling troubled CPUs for fun and profit? Andreas Mohr
2006-06-19 19:39 ` John Richard Moser
2006-06-19 20:00 ` linux-os (Dick Johnson)
2006-06-19 20:23   ` Dave Jones
2006-06-19 20:47     ` linux-os (Dick Johnson)
2006-06-19 20:59       ` Dave Jones
2006-06-19 22:25     ` Pavel Machek
2006-06-19 22:41       ` Dave Jones
2006-06-20 11:39         ` linux-os (Dick Johnson)
2006-06-21 17:16           ` Ian Romanick
2006-06-21 17:57             ` linux-os (Dick Johnson)
2006-06-22 17:47         ` Pavel Machek
2006-06-20  9:58       ` Jan Engelhardt
2006-06-22 18:16         ` Pavel Machek
2006-06-23 17:32           ` Jan Engelhardt
2006-06-24 19:54             ` Pavel Machek
2006-06-25 11:01               ` Jan Engelhardt
2006-06-20  9:54     ` Jan Engelhardt
2006-06-19 21:16   ` Claudio Martins
2006-06-19 22:16 ` Pavel Machek
2006-06-19 22:43   ` Dave Jones
2006-06-20  7:29     ` Andreas Mohr
     [not found] <6pxs2-1AR-5@gated-at.bofh.it>
     [not found] ` <6pyer-2Pt-1@gated-at.bofh.it>
2006-06-19 21:40   ` Bodo Eggert
2006-06-19 21:44     ` Dave Jones
     [not found] <fa.pC0NfRl4O1eOCqPOBXy8f+7gbqU@ifi.uio.no>
     [not found] ` <fa.so5wrYE6MzA2swzlOE1Xjw9iqvk@ifi.uio.no>
2006-06-19 23:32   ` Robert Hancock
2006-06-20  3:30 Ken Ryan
