linux-kernel.vger.kernel.org archive mirror
* [PATCH] Readd BUG for SMP TLB IPI
@ 2003-07-09 10:49 Andi Kleen
  2003-07-09 11:27 ` Alan Cox
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2003-07-09 10:49 UTC (permalink / raw)
  To: torvalds, linux-kernel


Re-add the BUG for a spurious SMP TLB IPI.

Rationale: 

The condition is fatal, and it's better to have a BUG than a hang.
The goto out code forgets to ack the IPI in the APIC. If the IPI really did arrive
at the wrong CPU, it would deadlock immediately, because the unacknowledged IPI is retriggered.
Adding an ACK for this path is also no good, because then the SMP flusher would
need to detect this case and "retransmit" the IPI, otherwise it would hang too
in the loop waiting for other CPUs. But nobody has ever seen such a hang, so it's safe
to assume that all hardware guarantees it cannot happen.

-Andi

--- linux-2.5-amd64/arch/i386/kernel/smp.c~	2003-07-09 12:42:36.000000000 +0200
+++ linux-2.5-amd64/arch/i386/kernel/smp.c	2003-07-09 12:42:36.000000000 +0200
@@ -312,15 +312,7 @@
 	cpu = get_cpu();
 
 	if (!test_bit(cpu, &flush_cpumask))
-		goto out;
-		/* 
-		 * This was a BUG() but until someone can quote me the
-		 * line from the intel manual that guarantees an IPI to
-		 * multiple CPUs is retried _only_ on the erroring CPUs
-		 * its staying as a return
-		 *
-		 * BUG();
-		 */
+		BUG();
 		 
 	if (flush_mm == cpu_tlbstate[cpu].active_mm) {
 		if (cpu_tlbstate[cpu].state == TLBSTATE_OK) {
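
To make the two paths in the hunk above concrete, here is a minimal userspace
sketch in C. It is only a model, not kernel code: `tlb_model`, `deliver_flush_ipi`,
`flush_complete`, the `policy` and `outcome` enums and `NCPUS` are all hypothetical
names invented for illustration, and the `unacked` flag merely stands in for the
APIC ack that the rationale says the `goto out` path skips.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 8   /* arbitrary size for the model */

/* What the receiving CPU does when its bit is not set in the mask. */
enum policy  { POLICY_GOTO_OUT, POLICY_BUG };
enum outcome { FLUSHED, LEFT_UNACKED, HIT_BUG };

struct tlb_model {
	unsigned long flush_cpumask;  /* bit n set: CPU n still owes a flush */
	bool unacked[NCPUS];          /* IPI delivered to CPU n, not yet acked */
};

/* Deliver one flush IPI to `cpu` and run the modelled handler once. */
static enum outcome deliver_flush_ipi(struct tlb_model *m, int cpu,
				      enum policy p)
{
	assert(cpu >= 0 && cpu < NCPUS);
	m->unacked[cpu] = true;

	if (!(m->flush_cpumask & (1UL << cpu))) {
		/* Spurious delivery: stop loudly, or return without acking. */
		return p == POLICY_BUG ? HIT_BUG : LEFT_UNACKED;
	}

	/* The real handler would flush the TLB here. */
	m->flush_cpumask &= ~(1UL << cpu);  /* report completion to the sender */
	m->unacked[cpu] = false;            /* stands in for the APIC ack */
	return FLUSHED;
}

/* The sender spins until every targeted CPU has cleared its bit. */
static bool flush_complete(const struct tlb_model *m)
{
	return m->flush_cpumask == 0;
}
```

In this model a delivery to a CPU outside the mask either reports `HIT_BUG` or
leaves `unacked` set, which corresponds to the retriggering IPI described in the
rationale. Only a targeted CPU clears its bit, so `flush_complete` becomes true
only once every targeted CPU has run the normal path.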


* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 10:49 [PATCH] Readd BUG for SMP TLB IPI Andi Kleen
@ 2003-07-09 11:27 ` Alan Cox
  2003-07-09 11:41   ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Cox @ 2003-07-09 11:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, Linux Kernel Mailing List

On Wed, 2003-07-09 at 11:49, Andi Kleen wrote:
> Adding an ACK for this path is also no good, because then the SMP flusher would
> need to detect this case and "retransmit" the IPI, otherwise it would hang too
> in the loop waiting for other CPUs. But nobody has ever seen such a hang, so it's safe
> to assume that all hardware guarantees it cannot happen.

We have recorded retransmitted IPIs on some boards (notably the
infamous BP6). They do happen. Early 2.4 had a fix for handling the
replay case too, but someone lost it and BP6 boards no longer work as
reliably.

An IPI can be retried and the retry for dual PII at least seems to hit
all the CPUs. Even on a 2 CPU box this has been observed - I assume
because the error was raised by the IO APIC when it got a garbled IPI.

Go ask Intel if you doubt it.



* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 11:27 ` Alan Cox
@ 2003-07-09 11:41   ` Andi Kleen
  2003-07-09 16:53     ` Alan Cox
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2003-07-09 11:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: torvalds, linux-kernel

On 09 Jul 2003 12:27:02 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> We have recorded retransmitted IPIs on some boards (notably the
> infamous BP6). They do happen. Early 2.4 had a fix for handling the
> replay case too, but someone lost it and BP6 boards no longer work as
> reliably.

Ok, all non broken hardware.

> 
> An IPI can be retried and the retry for dual PII at least seems to hit
> all the CPUs. Even on a 2 CPU box this has been observed - I assume
> because the error was raised by the IO APIC when it got a garbled IPI.

I don't believe it. It would cause an immediate hang because the goto out
does not ack the IPI.  Surely such hangs would have been reported?

-Andi


* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 11:41   ` Andi Kleen
@ 2003-07-09 16:53     ` Alan Cox
  2003-07-09 16:58       ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Cox @ 2003-07-09 16:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, Linux Kernel Mailing List

On Wed, 2003-07-09 at 12:41, Andi Kleen wrote:
> > We have recorded retransmitted IPIs on some boards (notably the
> > infamous BP6). They do happen. Early 2.4 had a fix for handling the
> > replay case too, but someone lost it and BP6 boards no longer work as
> > reliably.
> 
> Ok, all non broken hardware.

It can happen to any PII/PIII box; it's just very, very rare on others, so
rare I guess such crashes are in the noise.

> I don't believe it. It would cause an immediate hang because the goto out
> does not ack the IPI.  Surely such hangs would have been reported?

See - trawling bug data is *useful* 8)

For 2.4.x, crashes relating to IPI replay bugs are reported very occasionally;
for 2.5 I've no idea at all.




* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 16:53     ` Alan Cox
@ 2003-07-09 16:58       ` Andi Kleen
  2003-07-09 17:04         ` Alan Cox
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2003-07-09 16:58 UTC (permalink / raw)
  To: Alan Cox; +Cc: torvalds, linux-kernel

On 09 Jul 2003 17:53:28 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Wed, 2003-07-09 at 12:41, Andi Kleen wrote:
> > > We have recorded retransmitted IPIs on some boards (notably the
> > > infamous BP6). They do happen. Early 2.4 had a fix for handling the
> > > replay case too, but someone lost it and BP6 boards no longer work as
> > > reliably.
> > 
> > Ok, all non broken hardware.
> 
> It can happen to any PII/PIII box; it's just very, very rare on others, so
> rare I guess such crashes are in the noise.

How do you know it can happen on them? Do you have backtraces?

If the BUG was there they wouldn't be in the noise.  With the incomplete/broken
handling it's just a silent hang.

-Andi


* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 16:58       ` Andi Kleen
@ 2003-07-09 17:04         ` Alan Cox
  2003-07-09 17:32           ` Andi Kleen
  2003-07-09 17:37           ` Compile failure 2.4.22-pre3-ac1 Midian
  0 siblings, 2 replies; 10+ messages in thread
From: Alan Cox @ 2003-07-09 17:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, Linux Kernel Mailing List

On Wed, 2003-07-09 at 17:58, Andi Kleen wrote:
> > It can happen to any PII/PIII box; it's just very, very rare on others, so
> > rare I guess such crashes are in the noise.
> 
> How do you know it can happen on them? Do you have backtraces?

I sat down with a BP6 owner and did lots of debugging.

> If the BUG was there they wouldn't be in the noise.  With the incomplete/broken
> handling it's just a silent hang.

I'm not arguing BUG is better than hang, but working right is better than BUG 8)



* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 17:04         ` Alan Cox
@ 2003-07-09 17:32           ` Andi Kleen
  2003-07-09 23:38             ` Alan Cox
  2003-07-09 17:37           ` Compile failure 2.4.22-pre3-ac1 Midian
  1 sibling, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2003-07-09 17:32 UTC (permalink / raw)
  To: Alan Cox; +Cc: torvalds, Linux Kernel Mailing List

On 09 Jul 2003 18:04:15 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Wed, 2003-07-09 at 17:58, Andi Kleen wrote:
> > > It can happen to any PII/PIII box; it's just very, very rare on others, so
> > > rare I guess such crashes are in the noise.
> > 
> > How do you know it can happen on them? Do you have backtraces?
> 
> I sat down with a BP6 owner and did lots of debugging.

I meant on the not known-to-be-nearly-unusable boards. Or are you saying that
on the other boards the APIC bus could be lossy too, but it's very unlikely? 

[my personal feeling would be to consider the lossy APIC bus to be a hardware
problem, like an MCE that cannot be really handled] 

> > If the BUG was there they wouldn't be in the noise.  With the incomplete/broken
> > handling it's just a silent hang.
> 
> I'm not arguing BUG is better than hang, but working right is better than BUG 8)

I suspect when you have a lossy APIC bus you will run into problems with other IPIs too,
it's really an uphill fight which you are likely to lose.

Anyways, having a clear BUG will make it easier to evaluate if there is really
a problem.

-Andi


* Compile failure 2.4.22-pre3-ac1
  2003-07-09 17:04         ` Alan Cox
  2003-07-09 17:32           ` Andi Kleen
@ 2003-07-09 17:37           ` Midian
  2003-07-09 18:10             ` Steven Cole
  1 sibling, 1 reply; 10+ messages in thread
From: Midian @ 2003-07-09 17:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Hello Alan,
I've tried to compile 2.4.22-pre3-ac1, but every time I get this error:

arch/i386/kernel/kernel.o(.text.init+0x7803): In function
`setup_ioapic_ids_from_mpc':
: undefined reference to `xapic_support'
arch/i386/kernel/kernel.o(.text.init+0x7a16): In function
`setup_ioapic_ids_from_mpc':
: undefined reference to `xapic_support'
make: *** [vmlinux] Error 1

I've tried to search for patches on the mailing list with no luck. Are
there any patches for this?

Regards
-- 
Markus Hästbacka <midian@ihme.org>



* Re: Compile failure 2.4.22-pre3-ac1
  2003-07-09 17:37           ` Compile failure 2.4.22-pre3-ac1 Midian
@ 2003-07-09 18:10             ` Steven Cole
  0 siblings, 0 replies; 10+ messages in thread
From: Steven Cole @ 2003-07-09 18:10 UTC (permalink / raw)
  To: Midian; +Cc: Alan Cox, linux-kernel

On Wed, 2003-07-09 at 11:37, Midian wrote:
> Hello Alan,
> I've tried to compile 2.4.22-pre3-ac1, but every time I get this error:
> 
> arch/i386/kernel/kernel.o(.text.init+0x7803): In function
> `setup_ioapic_ids_from_mpc':
> : undefined reference to `xapic_support'
> arch/i386/kernel/kernel.o(.text.init+0x7a16): In function
> `setup_ioapic_ids_from_mpc':
> : undefined reference to `xapic_support'
> make: *** [vmlinux] Error 1
> 
> I've tried to search for patches on the mailing list with no luck. Are
> there any patches for this?
> 
> Regards

I posted this workaround here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=105760102522650&w=2

but as Adrian Bunk pointed out in a response, the problem is that
"changes to arch/i386/kernel/mpparse.c got lost at the update of -ac to -pre3".

If you want to be slightly more adventurous than using my
workaround patch, you could copy the mpparse.c file from 2.4.21-ac4.
That was compile tested but not run tested.

Steven



* Re: [PATCH] Readd BUG for SMP TLB IPI
  2003-07-09 17:32           ` Andi Kleen
@ 2003-07-09 23:38             ` Alan Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Cox @ 2003-07-09 23:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, Linux Kernel Mailing List

On Wed, 2003-07-09 at 18:32, Andi Kleen wrote:
> > > How do you know it can happen on them? Do you have backtraces?
> > 
> > I sat down with a BP6 owner and did lots of debugging.
> 
> I meant on the not known-to-be-nearly-unusable boards. Or are you saying that
> on the other boards the APIC bus could be lossy too, but it's very unlikely? 
> [my personal feeling would be to consider the lossy APIC bus to be a hardware
> problem, like an MCE that cannot be really handled] 

APIC errors can occur on any box. The checksum ensures they get
retransmitted, but it does mean you can get replay of events.

> I suspect when you have a lossy APIC bus you will run into problems with other IPIs too,
> it's really an uphill fight which you are likely to lose.

If you follow Intel's recommendations it ought to just work. It's basically about
checking whether the message is one you already processed and not repeating the execution.
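
The "check whether you already processed it" idea can be sketched roughly as
follows. This is a userspace model in C under stated assumptions, not the Intel
recommendation itself and not kernel code: `replay_model`, `handle_flush_ipi` and
`MODEL_CPUS` are hypothetical names, and using the CPU's bit in the flush mask as
the "already processed" marker is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CPUS 8   /* arbitrary size for the model */

struct replay_model {
	unsigned long flush_cpumask;   /* bit n set: CPU n still owes a flush */
	bool unacked[MODEL_CPUS];      /* IPI delivered to CPU n, not yet acked */
	int flushes[MODEL_CPUS];       /* how often CPU n actually did the work */
};

/*
 * Handle one delivery of the flush IPI on `cpu`.
 * Returns true if the flush was performed, false if the delivery was
 * recognised as a replay of a message this CPU had already processed.
 */
static bool handle_flush_ipi(struct replay_model *s, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_CPUS);

	unsigned long bit = 1UL << cpu;
	bool fresh = (s->flush_cpumask & bit) != 0;

	if (fresh) {
		s->flushes[cpu]++;          /* the TLB flush would happen here */
		s->flush_cpumask &= ~bit;   /* report completion to the sender */
	}

	/* Ack in both cases so a replayed IPI cannot stay pending. */
	s->unacked[cpu] = false;
	return fresh;
}
```

With this shape a replayed delivery is acknowledged and otherwise ignored, so the
flush runs once per request and the sender's wait on the mask still terminates.
It does not cover the case where a message is lost outright rather than replayed.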



end of thread, other threads:[~2003-07-09 23:27 UTC | newest]

Thread overview: 10+ messages
2003-07-09 10:49 [PATCH] Readd BUG for SMP TLB IPI Andi Kleen
2003-07-09 11:27 ` Alan Cox
2003-07-09 11:41   ` Andi Kleen
2003-07-09 16:53     ` Alan Cox
2003-07-09 16:58       ` Andi Kleen
2003-07-09 17:04         ` Alan Cox
2003-07-09 17:32           ` Andi Kleen
2003-07-09 23:38             ` Alan Cox
2003-07-09 17:37           ` Compile failure 2.4.22-pre3-ac1 Midian
2003-07-09 18:10             ` Steven Cole
