* Re: smp_call_function_single lockups
@ 2015-02-22  8:59 Daniel J Blueman
  2015-02-22 10:37 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel J Blueman @ 2015-02-22  8:59 UTC
To: Linus Torvalds, Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
    Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
    the arch/x86 maintainers

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > So if my memory serves me right, I think it was for
> > > local APICs, and even there mostly it was a performance
> > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > > to a local APIC then the IO-APIC might be forced to
> > > resend those IRQs, leading to excessive message traffic
> > > on the relevant hardware bus.
> >
> > Hmm. I have a distinct memory of interrupts actually
> > being lost, but I really can't find anything to support
> > that memory, so it's probably some drug-induced confusion
> > of mine. I don't find *anything* about interrupt "levels"
> > any more in modern Intel documentation on the APIC, but
> > maybe I missed something. But it might all have been an
> > IO-APIC thing.
>
> So I just found an older discussion of it:
>
>   http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815
>
> while it's not a comprehensive description, it matches what
> I remember from it: with 3 vectors within a level of 16
> vectors we'd get excessive "retries" sent by the IO-APIC
> through the (then rather slow) APIC bus.
>
> ( It was possible for the same phenomenon to occur with
>   IPIs as well, when a CPU sent an APIC message to another
>   CPU, if the affected vectors were equal modulo 16 - but
>   this was rare IIRC because most systems were dual CPU so
>   only two IPIs could have occurred.
> )
>
> > Well, the attached patch for that seems pretty trivial.
> > And seems to work for me (my machine also defaults to
> > x2apic clustered mode), and allows the APIC code to start
> > doing a "send to specific cpu" thing one by one, since it
> > falls back to the send_IPI_mask() function if no
> > individual CPU IPI function exists.
> >
> > NOTE! There's a few cases in
> > arch/x86/kernel/apic/vector.c that also do that
> > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> > aren't that important, so I didn't bother with them.
> >
> > NOTE2! I've tested this, and it seems to work, but maybe
> > there is something seriously wrong. I skipped the
> > "disable interrupts" part when doing the "send_IPI", for
> > example, because I think it's entirely unnecessary for
> > that case. But this has certainly *not* gotten any real
> > stress-testing.
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:
>
> we typically do a wait_icr_idle() call before sending an
> IPI - and if IRQs are not off then the idleness of the APIC
> might be gone. (Because a hardirq that arrives after a
> wait_icr_idle() but before the actual IPI sending sent out
> an IPI and the queue is full.)

The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs are
queued, so the wait_icr_idle() polling is only necessary on PPro
and older, and maybe then to avoid delivery retry. This
unnecessarily ties up the IPI caller, so we bypass the polling in
the Numachip APIC driver IPI-to-self path.

On Linus's earlier point, with the large core counts on Numascale
systems, I previously implemented a shortcut to allow single IPIs
to bypass all the cpumask generation and walking; it's way down on
my list, but I'll see if I can generalise and present a patch
series at some point if interested?
Dan

--
[1] Intel SDM 3, p10-30
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

If more than one interrupt is generated with the same vector number,
the local APIC can set the bit for the vector both in the IRR and the
ISR. This means that for the Pentium 4 and Intel Xeon processors, the
IRR and ISR can queue two interrupts for each interrupt vector: one in
the IRR and one in the ISR. Any additional interrupts issued for the
same interrupt vector are collapsed into the single bit in the IRR.
For the P6 family and Pentium processors, the IRR and ISR registers
can queue no more than two interrupts per interrupt vector and will
reject other interrupts that are received within the same vector.

--
[2] AMD Fam15h BKDG p470
http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this
bit is set to indicate that the interrupt has not yet been accepted by
the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC
mode. Software may repeatedly write ICRL without polling the DS bit;
all requested IPIs will be delivered.
* Re: smp_call_function_single lockups
  2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
@ 2015-02-22 10:37 ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-02-22 10:37 UTC
To: Daniel J Blueman
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    Christopher Arges, the arch/x86 maintainers

* Daniel J Blueman <daniel@numascale.com> wrote:

> The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs
> are queued, so the wait_icr_idle() polling is only
> necessary on PPro and older, and maybe then to avoid
> delivery retry. This unnecessarily ties up the IPI
> caller, so we bypass the polling in the Numachip APIC
> driver IPI-to-self path.

It would be nice to propagate this back to the generic x86 code.

> On Linus's earlier point, with the large core counts on
> Numascale systems, I previously implemented a shortcut to
> allow single IPIs to bypass all the cpumask generation
> and walking; it's way down on my list, but I'll see if I
> can generalise and present a patch series at some point
> if interested?

I am definitely interested!

Thanks,

	Ingo
* smp_call_function_single lockups
@ 2015-02-11 13:19 Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-11 13:19 UTC (permalink / raw)
To: LKML; +Cc: torvalds, Thomas Gleixner, Jens Axboe
Linus, Thomas, Jens..
During the 3.18 - 3.19 "frequent lockups" discussion, at some point
you observed possible csd_lock() and csd_unlock() synchronization
problems. I think we have managed to reproduce that issue on a
consistent basis with 3.13 (Ubuntu) and 3.19 (latest vanilla).
- When running "open-stack tempest" in a nested-KVM environment we are
able to cause a lockup in a matter of hours (usually from 2 to 20).
Trace from the nested hypervisor (Ubuntu 3.13):
crash> bt
PID: 29130 TASK: ffff8804288ac800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
#7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
#8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
#9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbe62 RSP: ffff8804284a3c70 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8804284a3c40 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff8804284a3ca0 R8: ffffffff8180ad48 R9: 0000000000000001
R10: ffffffff81185cac R11: ffffea00109b4a00 R12: ffff88042829f400
R13: 0000000000000000 R14: ffffea001017d640 R15:
0000000000000005 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
RIP: 00007fa3ed16c587 RSP: 00007fa3536f5c10 RFLAGS: 00000246
RAX: 000000000000000b RBX: ffffffff8173196d RCX: 0000001d00000007
RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007fa315ff4000
RBP: 00007fa3167f49c0 R8: 0000000000000000 R9: 00007fa3f5396738
R10: 00007fa3536f5a60 R11: 0000000000000202 R12: 00007fa3ed6562a0
R13: 00007fa350ef19c0 R14: ffffffff81181482 R15: ffff8804284a3f78
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
- After applying the patch provided by Thomas we were able to cause
the lockup only after 6 days (again locked inside
smp_call_function_single). Test performance (even for a nested KVM)
was reduced substantially with 3.19 + this patch. Trace from the
nested hypervisor (3.19 + patch):
crash> bt
PID: 10467 TASK: ffff880817b3b1c0 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
#1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
#2 [ffff88083fd03de0] panic at ffffffff8176f713
#3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
#4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
#5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
#6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
#7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
#8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
#9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
[exception RIP: generic_exec_single+218]
RIP: ffffffff810ee0ca RSP: ffff880817973b18 RFLAGS: 00000202
RAX: 0000000000000002 RBX: 0000000000000292 RCX: 0000000000000001
RDX: ffffffff8180e6e0 RSI: 0000000000000000 RDI: 0000000000000292
RBP: ffff880817973b58 R8: ffffffff8180e6c8 R9: 0000000000000001
R10: 000000000000b6e0 R11: 0000000000000001 R12: ffffffff811f6626
R13: ffff880817973ab8 R14: ffffffff8109cfd2 R15: ffff880817973a78
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
RIP: 00007f42f987fec7 RSP: 00007f42ef1bebd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8177cc2d RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
RBP: 00007f430047b040 R8: 0000000000000000 R9: 00000000000000ff
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f42ff920240
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
Not sure if you are still pursuing this; anyway, let me know if you
think of any other change, and I'll keep the environment available.
-Tinoco
On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> > llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > - csd->func(csd->info);
> > + smp_call_func_t func = csd->func;
> > + void *info = csd->info;
> > csd_unlock(csd);
> > +
> > + func(info);
>
> No, that won't work for synchronous calls:
>
> CPU 0 CPU 1
>
> csd_lock(csd);
> queue_csd();
> ipi();
> func = csd->func;
> info = csd->info;
> csd_unlock(csd);
> csd_lock_wait();
> func(info);
>
> The csd_lock_wait() side will succeed and therefore assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.
So a combo of both (Jens and yours) might do the trick. Patch below.
I think what Jens was trying to solve is:
CPU 0 CPU 1
csd_lock(csd);
queue_csd();
ipi();
csd->func(csd->info);
wait_for_completion(csd);
complete(csd);
reuse_csd(csd);
csd_unlock(csd);
Thanks,
tglx
Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_
static void csd_unlock(struct call_single_data *csd)
{
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
/*
* ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
}
llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+
+ /*
+ * For synchronous calls we are not allowed to unlock
+ * before the callback returned. For the async case
+ * its the responsibility of the caller to keep
+ * csd->info consistent while the callback runs.
+ */
+ if (csd->flags & CSD_FLAG_WAIT) {
+ csd->func(csd->info);
+ csd_unlock(csd);
+ } else {
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ csd_unlock(csd);
+ func(info);
+ }
}
/*
* Re: smp_call_function_single lockups
  2015-02-11 13:19 Rafael David Tinoco
@ 2015-02-11 18:18 ` Linus Torvalds
  2015-02-11 19:59   ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 18:18 UTC
To: Rafael David Tinoco; +Cc: LKML, Thomas Gleixner, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 1566 bytes --]

On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote:
>
> - After applying patch provided by Thomas we were able to cause the
> lockup only after 6 days (also locked inside
> smp_call_function_single). Test performance (even for a nested kvm)
> was reduced substantially with 3.19 + this patch.

I think that just means that the patch from Thomas doesn't change
anything - the reason it takes longer to lock up is just that
performance reduction, so whatever race it is that causes the problem
was just harder to hit, but not fundamentally affected.

I think a more interesting thing to get is the traces from the other
CPUs when this happens. In a virtualized environment, that might be
easier to get than on real hardware, and if you are able to reproduce
this at will - especially with something recent like 3.19 - and could
get that, that would be really good.

I'll think about this all, but we couldn't figure anything out last
time we looked at it, so without more clues, don't hold your breath.

That said, it *would* be good if we could get rid of the synchronous
behavior entirely, and make it a rule that if somebody wants to wait
for it, they'll have to do their own waiting. Because I still think
that that CSD_FLAG_WAIT is pure and utter garbage. And I think that
Jens said that it is probably bogus to begin with. I also don't even
see where the CSD_FLAG_WAIT bit would ever be cleared, so it all
looks completely buggy anyway.

Does this (COMPLETELY UNTESTED!) attached patch change anything?
                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1453 bytes --]

 kernel/smp.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..13a8e75e1379 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,6 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
 };
 
 struct call_function_data {
@@ -114,26 +113,24 @@ static void csd_lock_wait(struct call_single_data *csd)
 
 static void csd_lock(struct call_single_data *csd)
 {
 	csd_lock_wait(csd);
-	csd->flags |= CSD_FLAG_LOCK;
+	csd->flags = CSD_FLAG_LOCK;
 
 	/*
 	 * prevent CPU from reordering the above assignment
 	 * to ->flags with any subsequent assignments to other
 	 * fields of the specified call_single_data structure:
 	 */
-	smp_mb();
+	smp_wmb();
 }
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
 	 */
-	smp_mb();
-
-	csd->flags &= ~CSD_FLAG_LOCK;
+	smp_store_release(&csd->flags, 0);
 }
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -173,9 +170,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	csd->func = func;
 	csd->info = info;
 
-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
 	/*
 	 * The list addition should be visible before sending the IPI
 	 * handler locks the list to pull the entry off it because of
* Re: smp_call_function_single lockups
  2015-02-11 18:18 ` Linus Torvalds
@ 2015-02-11 19:59   ` Linus Torvalds
  2015-02-11 20:42     ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 19:59 UTC
To: Rafael David Tinoco; +Cc: LKML, Thomas Gleixner, Jens Axboe

On Wed, Feb 11, 2015 at 10:18 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll think about this all, but we couldn't figure anything out last
> time we looked at it, so without more clues, don't hold your breath.

So having looked at it once more, one thing struck me:

Look at smp_call_function_single_async(). The comment says

 * Like smp_call_function_single(), but the call is asynchonous and
 * can thus be done from contexts with disabled interrupts.

but that is *only* true if we don't have to wait for the csd lock.

The comments even clarify that:

 * The caller passes his own pre-allocated data structure
 * (ie: embedded in an object) and is responsible for synchronizing it
 * such that the IPIs performed on the @csd are strictly serialized.

but it's not at all clear that the caller *can* do that. Since the
"csd_unlock()" is done *after* the call to the callback function, any
serialization done by the caller is fundamentally not trustworthy,
since it cannot serialize with the csd lock - if it releases things
in the callback, the csd lock will still be set after releasing
things.

So the caller has a really hard time guaranteeing that CSD_LOCK isn't
set. And if the call is done in interrupt context, for all we know it
is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
will never be cleared at all, and csd_lock() will wait forever.

So I actually think that for the async case, we really *should* unlock
before doing the callback (which is what Thomas' old patch did).

And we might well be better off doing something like

    WARN_ON_ONCE(csd->flags & CSD_LOCK);

in smp_call_function_single_async(), because that really is a hard
requirement.

And it strikes me that hrtick_csd is one of these cases that do this
with interrupts disabled, and use the callback for serialization. So I
really wonder if this is part of the problem..

Thomas? Am I missing something?

                  Linus
* Re: smp_call_function_single lockups
  2015-02-11 19:59 ` Linus Torvalds
@ 2015-02-11 20:42   ` Linus Torvalds
  2015-02-12 16:38     ` Rafael David Tinoco
                        ` (3 more replies)
  0 siblings, 4 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 20:42 UTC
To: Rafael David Tinoco
Cc: LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

[ Added Frederic to the cc, since he's touched this file/area most ]

On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> set. And if the call is done in interrupt context, for all we know it
> is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> will never be cleared at all, and csd_lock() will wait forever.
>
> So I actually think that for the async case, we really *should* unlock
> before doing the callback (which is what Thomas' old patch did).
>
> And we might well be better off doing something like
>
>     WARN_ON_ONCE(csd->flags & CSD_LOCK);
>
> in smp_call_function_single_async(), because that really is a hard
> requirement.
>
> And it strikes me that hrtick_csd is one of these cases that do this
> with interrupts disabled, and use the callback for serialization. So I
> really wonder if this is part of the problem..
>
> Thomas? Am I missing something?

Ok, this is a more involved patch than I'd like, but making the
*caller* do all the CSD maintenance actually cleans things up.

And this is still completely untested, and may be entirely buggy. What
do you guys think?
                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4698 bytes --]

 kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..2aaac2c47683 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,7 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
+	CSD_FLAG_SYNCHRONOUS	= 0x02,
 };
 
 struct call_function_data {
@@ -107,7 +107,7 @@ void __init call_function_init(void)
  */
 static void csd_lock_wait(struct call_single_data *csd)
 {
-	while (csd->flags & CSD_FLAG_LOCK)
+	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
 		cpu_relax();
 }
 
@@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
 	 * to ->flags with any subsequent assignments to other
 	 * fields of the specified call_single_data structure:
 	 */
-	smp_mb();
+	smp_wmb();
 }
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
 	 */
-	smp_mb();
-
-	csd->flags &= ~CSD_FLAG_LOCK;
+	smp_store_release(&csd->flags, 0);
 }
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
  * ->func, ->info, and ->flags set.
  */
 static int generic_exec_single(int cpu, struct call_single_data *csd,
-			       smp_call_func_t func, void *info, int wait)
+			       smp_call_func_t func, void *info)
 {
-	struct call_single_data csd_stack = { .flags = 0 };
-	unsigned long flags;
-
 	if (cpu == smp_processor_id()) {
+		unsigned long flags;
+
+		/*
+		 * We can unlock early even for the synchronous on-stack case,
+		 * since we're doing this from the same CPU..
+		 */
+		csd_unlock(csd);
 		local_irq_save(flags);
 		func(info);
 		local_irq_restore(flags);
@@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
 		return -ENXIO;
 
-	if (!csd) {
-		csd = &csd_stack;
-		if (!wait)
-			csd = this_cpu_ptr(&csd_data);
-	}
-
-	csd_lock(csd);
-
 	csd->func = func;
 	csd->info = info;
 
-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
 	/*
 	 * The list addition should be visible before sending the IPI
 	 * handler locks the list to pull the entry off it because of
@@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
-	if (wait)
-		csd_lock_wait(csd);
-
 	return 0;
 }
 
@@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		csd->func(csd->info);
-		csd_unlock(csd);
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
+
+		/* Do we wait until *after* callback? */
+		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+			func(info);
+			csd_unlock(csd);
+		} else {
+			csd_unlock(csd);
+			func(info);
+		}
 	}
 
 	/*
@@ -274,6 +269,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 			     int wait)
 {
+	struct call_single_data *csd;
+	struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
 	int this_cpu;
 	int err;
 
@@ -292,7 +289,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
 		     && !oops_in_progress);
 
-	err = generic_exec_single(cpu, NULL, func, info, wait);
+	csd = &csd_stack;
+	if (!wait) {
+		csd = this_cpu_ptr(&csd_data);
+		csd_lock(csd);
+	}
+
+	err = generic_exec_single(cpu, csd, func, info);
+
+	if (wait)
+		csd_lock_wait(csd);
 
 	put_cpu();
 
@@ -321,7 +327,15 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
 	int err = 0;
 
 	preempt_disable();
-	err = generic_exec_single(cpu, csd, csd->func, csd->info, 0);
+
+	/* We could deadlock if we have to wait here with interrupts disabled! */
+	if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK))
+		csd_lock_wait(csd);
+
+	csd->flags = CSD_FLAG_LOCK;
+	smp_wmb();
+
+	err = generic_exec_single(cpu, csd, csd->func, csd->info);
 	preempt_enable();
 
 	return err;
@@ -433,6 +447,8 @@ void smp_call_function_many(const struct cpumask *mask,
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
 
 		csd_lock(csd);
+		if (wait)
+			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
 		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
@ 2015-02-12 16:38   ` Rafael David Tinoco
  2015-02-18 22:25     ` Peter Zijlstra
                        ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-12 16:38 UTC
To: Linus Torvalds
Cc: LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker,
    chris.j.arges, gema.gomez-solano

Meanwhile we'll take the opportunity to run the same tests with the
"smp_load_acquire/smp_store_release + outside sync/async" approach
from your latest patch on top of 3.19. If anything comes up I'll
provide full backtraces (2 vcpus).

Here I can only reproduce this inside nested KVM on top of ProLiant
DL360 Gen8 machines with:

- no opt-out from x2apic (the Gen8 firmware asks for opting out, but
  HP says x2apic should be used for >= Gen8)
- no intel_idle, since the ProLiant firmware is causing NMIs during
  MWAIT instructions

As observed before, reducing performance caused the problem to be
triggered only after some days, so if nothing goes wrong with
performance this time, I expect to have results within 10 to 30 hours.

Thank you

Tinoco

On Wed, Feb 11, 2015 at 6:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> >     WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard
> > requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
>                   Linus
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
  2015-02-12 16:38   ` Rafael David Tinoco
@ 2015-02-18 22:25   ` Peter Zijlstra
  2015-02-19 15:42     ` Rafael David Tinoco
  2015-03-20 10:15     ` Peter Zijlstra
  2015-04-01 14:22     ` Frederic Weisbecker
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-02-18 22:25 UTC
To: Linus Torvalds
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
    Frederic Weisbecker

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

I think it makes perfect sense.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: smp_call_function_single lockups
  2015-02-18 22:25 ` Peter Zijlstra
@ 2015-02-19 15:42   ` Rafael David Tinoco
  2015-02-19 16:14     ` Linus Torvalds
                        ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-19 15:42 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe,
    Frederic Weisbecker, Gema Gomez, chris.j.arges

Linus, Peter, Thomas,

Just some quick feedback: we were able to reproduce the lockup with
this proposed patch (3.19 + patch). Unfortunately we had problems with
the core file, so I have only the stack trace for now, but I think we
are able to reproduce it again and provide more details (sorry for the
delay... after a reboot it took some days for us to reproduce this
again). It looks like RIP is still smp_call_function_single.

Same environment as before: nested KVM (2 vcpus) on top of a ProLiant
DL380 Gen8 with acpi_idle and no x2apic opt-out.

[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G            E  3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[47708.068013] task: ffff88081b9beca0 ti: ffff88081a7a0000 task.ti: ffff88081a7a0000
[47708.068013] RIP: 0010:[<ffffffff810f537a>]  [<ffffffff810f537a>] smp_call_function_single+0xca/0x120
[47708.068013] RSP: 0018:ffff88081a7a3b38  EFLAGS: 00000202
[47708.068013] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
[47708.068013] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000296
[47708.068013] RBP: ffff88081a7a3b78 R08: ffffffff81815168 R09: ffff880818192000
[47708.068013] R10: 000000000000bdf6 R11: 000000000001bf90 R12: 00080000810b66f8
[47708.068013] R13: 00000000000000fb R14: 0000000000000296 R15: 0000000000000000
[47708.068013] FS:  00007fa143fff700(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[47708.068013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47708.068013] CR2: 00007f5d76f5d050 CR3: 00000008190cc000 CR4: 00000000000426f0
[47708.068013] Stack:
[47708.068013]  ffff88083fd151b8 0000000000000001 0000000000000000 ffffffffc0589320
[47708.068013]  ffff88081a547a80 0000000000000003 ffff88081a543f80 0000000000000000
[47708.068013]  ffff88081a7a3b88 ffffffffc0586097 ffff88081a7a3bc8 ffffffffc058aefe
[47708.068013] Call Trace:
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013]  [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013]  [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013]  [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013]  [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013]  [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013]  [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013]  [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013]  [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013]  [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b
[47708.068013] Code: 30 5b 41 5c 5d c3 0f 1f 00 48 8d 75 d0 48 89 d1 89 df 4c 89 e2 e8 57 fe ff ff 0f b7 55 e8 83 e2 01 74 da 66 0f 1f 44 00 00 f3 90 <0f> b7 55 e8 83 e2 01 75 f5 eb c7 0f 1f 00 8b 05 ca e6 dd 00 85
[47708.068013] Kernel panic - not syncing: softlockup: hung tasks
[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G            EL 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[47708.068013]  ffff88081b9beca0 ffff88083fc03de8 ffffffff817a6bf6 0000000000000000
[47708.068013]  ffffffff81ab30d4 ffff88083fc03e68 ffffffff817a1aec 0000000000000e92
[47708.068013]  0000000000000008 ffff88083fc03e78 ffff88083fc03e18 ffff88083fc03e68
[47708.068013] Call Trace:
[47708.068013]  <IRQ>  [<ffffffff817a6bf6>] dump_stack+0x45/0x57
[47708.068013]  [<ffffffff817a1aec>] panic+0xc1/0x1f5
[47708.068013]  [<ffffffff8112ba0b>] watchdog_timer_fn+0x1db/0x1f0
[47708.068013]  [<ffffffff810e0e37>] __run_hrtimer+0x77/0x1d0
[47708.068013]  [<ffffffff8112b830>] ? watchdog+0x30/0x30
[47708.068013]  [<ffffffff810e1203>] hrtimer_interrupt+0xf3/0x220
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffff8104b0a9>] local_apic_timer_interrupt+0x39/0x60
[47708.068013]  [<ffffffff817b1fb5>] smp_apic_timer_interrupt+0x45/0x60
[47708.068013]  [<ffffffff817b002d>] apic_timer_interrupt+0x6d/0x80
[47708.068013]  <EOI>  [<ffffffff810f537a>] ? smp_call_function_single+0xca/0x120
[47708.068013]  [<ffffffff810f5369>] ? smp_call_function_single+0xb9/0x120
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013]  [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013]  [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013]  [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013]  [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013]  [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013]  [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013]  [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013]  [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013]  [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b

Tks

Rafael Tinoco

On Wed, Feb 18, 2015 at 8:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
>> Ok, this is a more involved patch than I'd like, but making the
>> *caller* do all the CSD maintenance actually cleans things up.
>>
>> And this is still completely untested, and may be entirely buggy. What
>> do you guys think?
>
> I think it makes perfect sense.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco @ 2015-02-19 16:14 ` Linus Torvalds 2015-02-23 14:01 ` Rafael David Tinoco 2015-02-19 16:16 ` Peter Zijlstra 2015-02-19 16:26 ` Linus Torvalds 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:14 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > Just a quick feedback, We were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I think we are > able to reproduce it again and provide more details (sorry for the > delay... after a reboot it took some days for us to reproduce this > again). > > It looks like RIP is still smp_call_function_single. Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if you can get the core-file to work and figure out where the other CPU is, that would be good. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:14 ` Linus Torvalds @ 2015-02-23 14:01 ` Rafael David Tinoco 2015-02-23 19:32 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-23 14:01 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if > you can get the core-file to work and figure out where the other CPU > is, that would be good. > This is v3.19 + your patch (smp acquire/release) - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode and acpi_idle) * http://people.canonical.com/~inaddy/lp1413540/STACK_TRACE.txt http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt http://people.canonical.com/~inaddy/lp1413540/FOREACH_BT.txt * It looks like we got locked because of reentrant flush_tlb_* through smp_call_* but I'll leave it to you. let me know if you need any more info.. - Rafael Tinoco ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 14:01 ` Rafael David Tinoco @ 2015-02-23 19:32 ` Linus Torvalds 2015-02-23 20:50 ` Peter Zijlstra 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-23 19:32 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges [-- Attachment #1: Type: text/plain, Size: 2187 bytes --] On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > This is v3.19 + your patch (smp acquire/release) > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > and acpi_idle) Hmm. There is absolutely nothing else going on on that machine, except for the single call to smp_call_function_single() that is waiting for the CSD to be released. > * It looks like we got locked because of reentrant flush_tlb_* through > smp_call_* > but I'll leave it to you. No, that is all a perfectly regular callchain: .. native_flush_tlb_others -> smp_call_function_many -> smp_call_function_single but the stack contains some stale addresses (one is probably just from smp_call_function_single() calling into "generic_exec_single()", and thus the stack contains the return address inside smp_call_function_single() in _addition_ to the actual place where the watchdog timer then interrupted it). It all really looks very regular and sane, and looks like smp_call_function_single() is happily just waiting for the IPI to finish in the (inlined) csd_lock_wait(). I see nothing wrong at all. However, here's a patch to the actual x86 smp_call_function_single_interrupt() handler, which probably doesn't make any difference at all, but does: - gets rid of the incredibly crazy and ugly smp_entering_irq() that inexplicably pairs with irq_exit() - makes the SMP IPI functions do the APIC ACK cycle at the *end* rather than at the beginning.
The exact placement of the ACK should make absolutely no difference, since interrupts should be disabled for the whole period anyway. But I'm desperate and am flailing around worrying about x2apic bugs and whatnot in addition to whatever bugs we might have in this area in sw. So maybe there is some timing thing. This makes us consistently ack the IPI _after_ we have cleared the list of smp call actions, and executed the list, rather than ack it before, and then go on to muck with the list. Plus that old code really annoyed me anyway. So I don't see why this should matter, but since I don't see how the bug could happen in the first place, might as well try it.. Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 2511 bytes --] arch/x86/kernel/smp.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index be8e1bde07aa..ed43259111d8 100644 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -265,12 +265,23 @@ __visible void smp_reschedule_interrupt(struct pt_regs *regs) */ } -static inline void smp_entering_irq(void) +/* + * This is the same as "entering_irq()", except it doesn't + * do the 'exit_idle()'. + * + * It pairs with smp_irq_exit(). + */ +static inline void smp_irq_enter(void) { - ack_APIC_irq(); irq_enter(); } +static inline void smp_irq_exit(void) +{ + irq_exit(); + ack_APIC_irq(); +} + __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs) { /* @@ -279,11 +290,11 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs) * scheduler_ipi(). This is OK, since those functions are allowed * to nest.
*/ - smp_entering_irq(); + smp_irq_enter(); trace_reschedule_entry(RESCHEDULE_VECTOR); __smp_reschedule_interrupt(); trace_reschedule_exit(RESCHEDULE_VECTOR); - exiting_irq(); + smp_irq_exit(); /* * KVM uses this interrupt to force a cpu out of guest mode */ @@ -297,18 +308,18 @@ static inline void __smp_call_function_interrupt(void) __visible void smp_call_function_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); __smp_call_function_interrupt(); - exiting_irq(); + smp_irq_exit(); } __visible void smp_trace_call_function_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); trace_call_function_entry(CALL_FUNCTION_VECTOR); __smp_call_function_interrupt(); trace_call_function_exit(CALL_FUNCTION_VECTOR); - exiting_irq(); + smp_irq_exit(); } static inline void __smp_call_function_single_interrupt(void) @@ -319,18 +330,18 @@ static inline void __smp_call_function_single_interrupt(void) __visible void smp_call_function_single_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); __smp_call_function_single_interrupt(); - exiting_irq(); + smp_irq_exit(); } __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR); __smp_call_function_single_interrupt(); trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR); - exiting_irq(); + smp_irq_exit(); } static int __init nonmi_ipi_setup(char *str) ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 19:32 ` Linus Torvalds @ 2015-02-23 20:50 ` Peter Zijlstra 2015-02-23 21:02 ` Rafael David Tinoco 0 siblings, 1 reply; 54+ messages in thread From: Peter Zijlstra @ 2015-02-23 20:50 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote: > On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > > > This is v3.19 + your patch (smp acquire/release) > > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > > and acpi_idle) > > Hmm. There is absolutely nothing else going on on that machine, except > for the single call to smp_call_function_single() that is waiting for > the CSD to be released. > > > * It looks like we got locked because of reentrant flush_tlb_* through > > smp_call_* > > but I'll leave it to you. > > No, that is all a perfectly regular callchain: > > .. native_flush_tlb_others -> smp_call_function_many -> > smp_call_function_single > > but the stack contains some stale addresses (one is probably just from > smp_call_function_single() calling into "generic_exec_single()", and > thus the stack contains the return address inside > smp_call_function_single() in _addition_ to the actual place where the > watchdog timer then interrupted it). > > It all really looks very regular and sane, and looks like > smp_call_function_single() is happily just waiting for the IPI to > finish in the (inlined) csd_lock_wait(). > > I see nothing wrong at all. [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 But it's a virtual machine, right? It's not running bare metal, it's running a !virt kernel on a virt machine, so maybe some of the virt muck is borked? A very subtly broken APIC emulation would be heaps of 'fun'. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 20:50 ` Peter Zijlstra @ 2015-02-23 21:02 ` Rafael David Tinoco 0 siblings, 0 replies; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-23 21:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, Dave Jones > > [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 > > But its a virtual machine right? Its not running bare metal, its running > a !virt kernel on a virt machine, so maybe some of the virt muck is > borked? > > A very subtly broken APIC emulation would be heaps of 'fun'. Indeed, I'll dig hypervisor changes... Anyway, this initial stack trace was first seen during the "frequent lockups in 3.18rc4" discussion: https://lkml.org/lkml/2014/11/14/656 RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0 At that time seen by Dave Jones, and running bare metal (if i remember correctly). -- Rafael Tinoco ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco 2015-02-19 16:14 ` Linus Torvalds @ 2015-02-19 16:16 ` Peter Zijlstra 2015-02-19 16:26 ` Linus Torvalds 2 siblings, 0 replies; 54+ messages in thread From: Peter Zijlstra @ 2015-02-19 16:16 UTC (permalink / raw) To: Rafael David Tinoco Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote: > Linus, Peter, Thomas > > Just a quick feedback, We were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I think we are > able to reproduce it again and provide more details (sorry for the > delay... after a reboot it took some days for us to reproduce this > again). > > It looks like RIP is still smp_call_function_single. So Linus' patch mostly fixes smp_call_function_single_async() which is not what you're using. It would be very good to see traces of other CPUs; if for some reason the target CPU doesn't get around to running your callback, then we'll forever wait on it. loaded_vmcs_clear() uses smp_call_function_single(.wait = 1), that should work as before. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco 2015-02-19 16:14 ` Linus Torvalds 2015-02-19 16:16 ` Peter Zijlstra @ 2015-02-19 16:26 ` Linus Torvalds 2015-02-19 16:32 ` Rafael David Tinoco 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:26 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > Same environment as before: Nested KVM (2 vcpus) on top of Proliant > DL380G8 with acpi_idle and no x2apic optout. Btw, which apic model does that end up using? Does "no x2apic optout" mean you're using the x2apic? What does "dmesg | grep apic" report? Something like Switched APIC routing to cluster x2apic. or what? Side note to the apic guys: I think the "single CPU" case ends up being one of the most important ones, but the stupid APIC model doesn't allow that, so sending an IPI to a single CPU ends up being "send a mask with a single bit set", and then we have that horrible "for_each_cpu(cpu, mask)" crap. Would it make sense to perhaps add a "send_IPI_single()" function call, and then for the APIC models that always are based on masks, use a wrapper that just does that "cpumask_of(cpu)" thing.. Hmm? Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:26 ` Linus Torvalds @ 2015-02-19 16:32 ` Rafael David Tinoco 2015-02-19 16:59 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-19 16:32 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers For the host, we are using "intremap=no_x2apic_optout intel_idle.max_cstate=0" for cmdline. It looks like that DL360/DL380 Gen8 firmware still asks to optout from x2apic but HP engineering team said that using x2apic for Gen8 would be ok (intel_idle causes these servers to generate NMIs when idling, probably related to packed c-states and this server's dependency on acpi tables for c-state). Feb 19 08:21:28 derain kernel: [ 3.504676] Enabled IRQ remapping in x2apic mode Feb 19 08:21:28 derain kernel: [ 3.565451] Enabling x2apic Feb 19 08:21:28 derain kernel: [ 3.602134] Enabled x2apic Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to cluster x2apic. On Thu, Feb 19, 2015 at 2:26 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: >> >> Same environment as before: Nested KVM (2 vcpus) on top of Proliant >> DL380G8 with acpi_idle and no x2apic optout. > > Btw, which apic model does that end up using? Does "no x2apic optout" > mean you're using the x2apic? > > What does "dmesg | grep apic" report? Something like > > Switched APIC routing to cluster x2apic. > > or what? > > Side note to the apic guys: I think the "single CPU" case ends up > being one of the most important ones, but the stupid APIC model > doesn't allow that, so sending an IPI to a single CPU ends up being > "send a mask with a single bit set", and then we have that horrible > "for_each_cpu(cpu, mask)" crap. 
> > Would it make sense to perhaps add a "send_IPI_single()" function > call, and then for the APIC models that always are based on masks, use > a wrapper that just does that "cpumask_of(cpu)" thing.. > > Hmm? > > Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:32 ` Rafael David Tinoco @ 2015-02-19 16:59 ` Linus Torvalds 2015-02-19 17:30 ` Rafael David Tinoco 2015-02-19 17:39 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:59 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to > cluster x2apic. Ok. That "cluster x2apic" mode is just about the nastiest mode when it comes to sending a single ipi. We do that insane dance where we - turn single cpu number into cpumask - copy the cpumask to a percpu temporary storage - walk each cpu in the cpumask - for each cpu, look up the cluster siblings - for each cluster sibling that is also in the cpumask, look up the logical apic mask and add it to the actual ipi destination mask - send an ipi to that final mask. which is just insane. It's complicated, it's fragile, and it's unnecessary. If we had a simple "send_IPI()" function, we could do this all with something much saner, and it would look something like static void x2apic_send_IPI(int cpu, int vector) { u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu); x2apic_wrmsr_fence(); __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL); } and then 'void native_send_call_func_single_ipi()' would just look like void native_send_call_func_single_ipi(int cpu) { apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR); } but I might have missed something (and we might want to have a wrapper that says "if the apic doesn't have a 'send_IPI' function, use "send_IPI_mask(cpumask_of(cpu), vector)" instead) The fact that you need that no_x2apic_optout (which in turn means that your ACPI tables seem to say "don't use x2apic") also makes me worry. Are there known errata for the x2apic?
Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:59 ` Linus Torvalds @ 2015-02-19 17:30 ` Rafael David Tinoco 2015-02-19 17:39 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-19 17:30 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers I could only find an advisory (regarding sr-iov and irq remaps) from HP to RHEL6.2 users stating that Gen8 firmware does not enable it by default. http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796 """ The interrupt remapping capability depends on x2apic enabled in the BIOS and HP ProLiant Gen8 systems do not enable x2apic; therefore, the following workaround is required for device assignment: Edit the /etc/grub.conf and add intremap=no_x2apic_optout option to the kernel command line options. """ Probably for backwards compatibility... not sure if there is an option to enable/disable it in firmware (like DL390 seems to have). I don't think so... but I was told by HP team that I should use x2apic for >= Gen8. On Thu, Feb 19, 2015 at 2:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: >> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to >> cluster x2apic. > > Ok. That "cluster x2apic" mode is just about the nastiest mode when it > comes to sending a single ipi. We do that insane dance where we > > - turn single cpu number into cpumask > - copy the cpumask to a percpu temporary storage > - walk each cpu in the cpumask > - for each cpu, look up the cluster siblings > - for each cluster sibling that is also in the cpumask, look up the > logical apic mask and add it to the actual ipi destination mask > - send an ipi to that final mask. > > which is just insane. It's complicated, it's fragile, and it's unnecessary. 
> > If we had a simple "send_IPI()" function, we could do this all with > something much saner, and it would look something like > > static void x2apic_send_IPI(int cpu, int vector) > { > u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu); > x2apic_wrmsr_fence(); > __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL); > } > > and then 'void native_send_call_func_single_ipi()' would just look like > > void native_send_call_func_single_ipi(int cpu) > { > apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR); > } > > but I might have missed something (and we might want to have a wrapper > that says "if the apic doesn't have a 'send_IPI' function, use > "send_IPI_mask(cpumask_of(cpu), vector)" instead) > > The fact that you need that no_x2apic_optout (which in turn means that > your ACPI tables seem to say "don't use x2apic") also makes me worry. > > Are there known errata for the x2apic? > > Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:59 ` Linus Torvalds 2015-02-19 17:30 ` Rafael David Tinoco @ 2015-02-19 17:39 ` Linus Torvalds 2015-02-19 20:29 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 17:39 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Are there known errata for the x2apic? .. and in particular, do we still have to worry about the traditional local apic "if there are more than two pending interrupts per priority level, things get lost" problem? I forget the exact details. Hopefully somebody remembers. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 17:39 ` Linus Torvalds @ 2015-02-19 20:29 ` Linus Torvalds 2015-02-19 21:59 ` Linus Torvalds 2015-02-20 9:30 ` Ingo Molnar 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 20:29 UTC (permalink / raw) To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Are there known errata for the x2apic? > > .. and in particular, do we still have to worry about the traditional > local apic "if there are more than two pending interrupts per priority > level, things get lost" problem? > > I forget the exact details. Hopefully somebody remembers. I can't find it in the docs. I find the "two-entries per vector", but not anything that is per priority level (group of 16 vectors). Maybe that was the IO-APIC, in which case it's immaterial for IPI's. However, having now mostly re-acquainted myself with the APIC details, it strikes me that we do have some oddities here. In particular, a few interrupt types are very special: NMI, SMI, INIT, ExtINT, or SIPI are handled early in the interrupt acceptance logic, and are sent directly to the CPU core, without going through the usual intermediate IRR/ISR dance. And why might this matter? It's important because it means that those kinds of interrupts must *not* do the apic EOI that ack_APIC_irq() does. And we correctly don't do ack_APIC_irq() for NMI etc, but it strikes me that ExtINT is odd and special. I think we still use ExtINT for some odd cases. We used to have some magic with the legacy timer interrupt, for example. And I think they all go through the normal "do_IRQ()" logic regardless of whether they are ExtINT or not. 
Now, what happens if we send an EOI for an ExtINT interrupt? It basically ends up being a spurious IPI. And I *think* that what normally happens is absolutely nothing at all. But if in addition to the ExtINT, there was a pending IPI (or other pending ISR bit set), maybe we lose interrupts.. .. and it's entirely possible that I'm just completely full of shit. Who is the poor bastard who has worked most with things like ExtINT, and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary contacts.. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 20:29 ` Linus Torvalds @ 2015-02-19 21:59 ` Linus Torvalds 2015-02-19 22:45 ` Linus Torvalds 2015-02-20 9:30 ` Ingo Molnar 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 21:59 UTC (permalink / raw) To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 3236 bytes --] On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Now, what happens if we send an EOI for an ExtINT interrupt? It > basically ends up being a spurious IPI. And I *think* that what > normally happens is absolutely nothing at all. But if in addition to > the ExtINT, there was a pending IPI (or other pending ISR bit set), > maybe we lose interrupts.. > > .. and it's entirely possible that I'm just completely full of shit. > Who is the poor bastard who has worked most with things like ExtINT, > and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary > contacts.. So quite frankly, trying to follow all the logic from do_IRQ() through handle_irq() to the actual low-level handler, I just couldn't do it. So instead, I wrote a patch to verify that the ISR bit is actually set when we do ack_APIC_irq(). This was complicated by the fact that we don't actually pass in the vector number at all to the acking, so 99% of the patch is just doing that. A couple of places we don't really have a good vector number, so I said "screw it, a negative value means that we won't check the ISR). The attached patch is quite possibly garbage, but it gives an interesting warning for me during i8042 probing, so who knows. Maybe it actually shows a real problem - or maybe I just screwed up the patch. .. and maybe even if the patch is fine, it's actually never really a problem to have spurious APIC ACK cycles. 
Maybe it cannot make interrupts be ignored. Anyway, the back-trace for the warning I get is during boot: ... PNP: No PS/2 controller found. Probing ports directly. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436 ir_ack_apic_edge+0x74/0x80() Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-08857-g89d3fa45b4ad-dirty #2 Call Trace: <IRQ> dump_stack+0x45/0x57 warn_slowpath_common+0x80/0xc0 warn_slowpath_null+0x15/0x20 ir_ack_apic_edge+0x74/0x80 handle_edge_irq+0x51/0x110 handle_irq+0x74/0x140 do_IRQ+0x4a/0x140 common_interrupt+0x6a/0x6a <EOI> ? _raw_spin_unlock_irqrestore+0x9/0x10 __setup_irq+0x239/0x5a0 request_threaded_irq+0xc2/0x180 i8042_probe+0x5b8/0x680 platform_drv_probe+0x2f/0xa0 driver_probe_device+0x8b/0x3e0 __driver_attach+0x93/0xa0 bus_for_each_dev+0x63/0xa0 driver_attach+0x19/0x20 bus_add_driver+0x178/0x250 driver_register+0x5f/0xf0 __platform_driver_register+0x45/0x50 __platform_driver_probe+0x26/0xa0 __platform_create_bundle+0xad/0xe0 i8042_init+0x3d0/0x3f6 do_one_initcall+0xb8/0x1d0 kernel_init_freeable+0x16d/0x1fa kernel_init+0x9/0xf0 ret_from_fork+0x7c/0xb0 ---[ end trace 1de82c4457c6a0f0 ]--- serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 ... and it looks not entirely insane. Is this worth looking at? Or is it something spurious? I might have gotten the vectors wrong, and maybe the warning is not because the ISR bit isn't set, but because I test the wrong bit. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 13041 bytes --]

 arch/x86/include/asm/apic.h              | 22 ++++++++++++++--------
 arch/x86/kernel/apic/apic.c              | 10 +++++-----
 arch/x86/kernel/apic/io_apic.c           |  5 +++--
 arch/x86/kernel/apic/vector.c            |  8 +++++---
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  4 ++--
 arch/x86/kernel/cpu/mcheck/threshold.c   |  4 ++--
 arch/x86/kernel/irq.c                    | 11 ++++++-----
 arch/x86/kernel/irq_work.c               |  8 ++++----
 arch/x86/kernel/smp.c                    | 18 +++++++++---------
 drivers/iommu/irq_remapping.c            |  6 ++++--
 10 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..912a969fd747 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -387,7 +387,7 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
 	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -418,7 +418,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -427,13 +427,19 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+static inline void ack_APIC_irq(int vector)
 {
+	/* Is the ISR bit actually set for this vector? */
+	if (vector >= 16) {
+		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+		v >>= vector & 0x1f;
+		WARN_ON_ONCE(!(v & 1));
+	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-	apic_eoi();
+	apic_eoi(vector);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
@@ -631,9 +637,9 @@ static inline void entering_irq(void)
 	exit_idle();
 }
 
-static inline void entering_ack_irq(void)
+static inline void entering_ack_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	entering_irq();
 }
 
@@ -642,11 +648,11 @@ static inline void exiting_irq(void)
 	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
+static inline void exiting_ack_irq(int vector)
 {
 	irq_exit();
 	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 extern void ioapic_zap_locks(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639ae1b9b..20ef1236d057 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -910,7 +910,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	exiting_irq();
 
@@ -929,7 +929,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1342,7 +1342,7 @@ void setup_local_APIC(void)
 		value = apic_read(APIC_ISR + i*0x10);
 		for (j = 31; j >= 0; j--) {
 			if (value & (1<<j)) {
-				ack_APIC_irq();
+				ack_APIC_irq(-1);
 				acked++;
 			}
 		}
@@ -1868,7 +1868,7 @@ static inline void __smp_spurious_interrupt(u8 vector)
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f)))
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 	inc_irq_stat(irq_spurious_count);
 
@@ -1917,7 +1917,7 @@ static inline void __smp_error_interrupt(struct pt_regs *regs)
 	if (lapic_get_maxlvt() > 3)	/* Due to the Pentium erratum 3AP. */
 		apic_write(APIC_ESR, 0);
 	v = apic_read(APIC_ESR);
-	ack_APIC_irq();
+	ack_APIC_irq(ERROR_APIC_VECTOR);
 	atomic_inc(&irq_err_count);
 
 	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x",
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f4dc2462a1ac..b74db5a2660b 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static void ack_ioapic_level(struct irq_data *data)
 	 * We must acknowledge the irq before we move it or the acknowledge will
 	 * not propagate properly.
 	 */
-	ack_APIC_irq();
+	ack_APIC_irq(i);
 
 	/*
 	 * Tail end of clearing remote IRR bit (either by delivering the EOI
@@ -2067,7 +2067,8 @@ static void unmask_lapic_irq(struct irq_data *data)
 
 static void ack_lapic_irq(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static struct irq_chip lapic_chip __read_mostly = {
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..30643ffe061c 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +335,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
-	irq_complete_move(irqd_cfg(data));
+	struct irq_cfg *cfg = irqd_cfg(data);
+
+	irq_complete_move(cfg);
 	irq_move_irq(data);
-	ack_APIC_irq();
+	ack_APIC_irq(cfg->vector);
 }
 
 /*
@@ -397,7 +399,7 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 {
 	unsigned vector, me;
 
-	ack_APIC_irq();
+	ack_APIC_irq(IRQ_MOVE_CLEANUP_VECTOR);
 
 	irq_enter();
 	exit_idle();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1586d7..89bac1f9ed78 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -433,7 +433,7 @@ asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
 	__smp_thermal_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
@@ -442,7 +442,7 @@ asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	__smp_thermal_interrupt();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980186ee..03068d1844c2 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -28,7 +28,7 @@ asmlinkage __visible void smp_threshold_interrupt(void)
 {
 	entering_irq();
 	__smp_threshold_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_threshold_interrupt(void)
@@ -37,5 +37,5 @@ asmlinkage __visible void smp_trace_threshold_interrupt(void)
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	__smp_threshold_interrupt();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 705ef8d48e2d..801e10d3d125 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -45,7 +45,8 @@ void ack_bad_irq(unsigned int irq)
 	 * completely.
 	 * But only ack when the APIC is enabled -AK
 	 */
-	ack_APIC_irq();
+	/* This doesn't know the vector. It needs the irq descriptor */
+	ack_APIC_irq(-1);
 }
 
 #define irq_stats(x)		(&per_cpu(irq_stat, x))
@@ -198,7 +199,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	irq = __this_cpu_read(vector_irq[vector]);
 
 	if (!handle_irq(irq, regs)) {
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 		if (irq != VECTOR_RETRIGGERED) {
 			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector (irq %d)\n",
@@ -230,7 +231,7 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -244,7 +245,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	ack_APIC_irq();
+	ack_APIC_irq(POSTED_INTR_VECTOR);
 
 	irq_enter();
 
@@ -262,7 +263,7 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 15d741ddfeeb..dfe92635307c 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -10,10 +10,10 @@
 #include <asm/apic.h>
 #include <asm/trace/irq_vectors.h>
 
-static inline void irq_work_entering_irq(void)
+static inline void irq_work_entering_irq(int vector)
 {
 	irq_enter();
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 static inline void __smp_irq_work_interrupt(void)
@@ -24,14 +24,14 @@ static inline void __smp_irq_work_interrupt(void)
 
 __visible void smp_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..44b74b24c5e4 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -170,7 +170,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 
 asmlinkage __visible void smp_reboot_interrupt(void)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(REBOOT_VECTOR);
 	irq_enter();
 	stop_this_cpu(NULL);
 	irq_exit();
@@ -258,16 +258,16 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
 }
 
-static inline void smp_entering_irq(void)
+static inline void smp_entering_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	irq_enter();
 }
 
@@ -279,7 +279,7 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 	 * scheduler_ipi(). This is OK, since those functions are allowed
 	 * to nest.
 	 */
-	smp_entering_irq();
+	smp_entering_irq(RESCHEDULE_VECTOR);
 	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -297,14 +297,14 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -319,14 +319,14 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 390079ee1350..896c4b35422d 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -334,12 +334,14 @@ void panic_if_irq_remap(const char *msg)
 
 static void ir_ack_apic_edge(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static void ir_ack_apic_level(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 	eoi_ioapic_irq(data->irq, irqd_cfg(data));
 }
* Re: smp_call_function_single lockups
  2015-02-19 21:59 ` Linus Torvalds
@ 2015-02-19 22:45 ` Linus Torvalds
  2015-03-31  3:15   ` Chris J Arges
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-19 22:45 UTC
To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu
Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  Christopher Arges, the arch/x86 maintainers

On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than "once". And tested it some more. It seems to work correctly. The irq case during 8042 probing is not repeatable, and I suspect it happens because the interrupt source goes away (some probe-time thing that first triggers an interrupt, but then clears it itself), so it doesn't happen every boot, and I've gotten it with slightly different backtraces.

But it's the only warning that happens for me, so I think my code is right (at least for the cases that trigger on this machine). It's definitely not a "every interrupt causes the warning because the code was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases too. In particular, the people seeing the IPI lockup. Because a lot of the lockups we've seen have *looked* like the IPI interrupt just never happened, and so we're waiting forever for the target CPU to react to it. And just maybe the spurious EOI could cause the wrong bit to be cleared in the ISR, and then the interrupt never shows up. Something like that would certainly explain why it only happens on some machines and under certain timing circumstances.

Linus
* Re: smp_call_function_single lockups
  2015-02-19 22:45 ` Linus Torvalds
@ 2015-03-31  3:15 ` Chris J Arges
  2015-03-31  4:28   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Chris J Arges @ 2015-03-31 3:15 UTC
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Thu, Feb 19, 2015 at 02:45:54PM -0800, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Is this worth looking at? Or is it something spurious? I might have
> > gotten the vectors wrong, and maybe the warning is not because the ISR
> > bit isn't set, but because I test the wrong bit.
>
> I edited the patch to do ratelimiting (one per 10s max) rather than
> "once". And tested it some more. It seems to work correctly. The irq
> case during 8042 probing is not repeatable, and I suspect it happens
> because the interrupt source goes away (some probe-time thing that
> first triggers an interrupt, but then clears it itself), so it doesn't
> happen every boot, and I've gotten it with slightly different
> backtraces.
>
> But it's the only warning that happens for me, so I think my code is
> right (at least for the cases that trigger on this machine). It's
> definitely not a "every interrupt causes the warning because the code
> was buggy, and the WARN_ONCE() just printed the first one".
>
> It would be interesting to hear if others see spurious APIC EOI cases
> too. In particular, the people seeing the IPI lockup. Because a lot of
> the lockups we've seen have *looked* like the IPI interrupt just never
> happened, and so we're waiting forever for the target CPU to react to
> it. And just maybe the spurious EOI could cause the wrong bit to be
> cleared in the ISR, and then the interrupt never shows up. Something
> like that would certainly explain why it only happens on some machines
> and under certain timing circumstances.
>
> Linus

Linus,

I'm able to reproduce this IPI lockup easily now when using specific hardware and nested KVM VMs. However, it seems to only occur when using certain host hardware. For example:

- Xeon E5620 (Westmere-EP) (fam: 06, model: 2c)
- Xeon E312xx (Sandy Bridge) (fam: 06, model: 2a)

Now I'm not sure if this indicates hardware LAPIC issues or if the timing in these processors makes it more likely to hit this issue. So far I haven't seen the issue on other CPUs.

To set this up, I've done the following (L0 being the Xeon box):
1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue. I've observed that running KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others). If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset) between L1 vCPUs until the hang occurs.

Pinning L1 vCPUs to L0 initially (i.e. 0->0 1->1) causes the hangs to not occur in my testing. (Which makes sense because we're unlikely to do real IPIs.)
I've been able to repro with your patch and observed the WARN_ON when booting a VM on affected and non-affected hardware:

[   13.613531] ------------[ cut here ]------------
[   13.613531] WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
[   13.613531] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) serio_raw(E) i2c_piix4(E) parport_pc(E) pvpanic(E) parport(E) mac_hid(E) nls_utf8(E) isofs(E) psmouse(E) floppy(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   13.613531] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E   4.0.0-rc6-460f8calinus1 #4
[   13.613531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   13.613531]  ffffffff81a911a6 ffff88013fc03eb8 ffffffff817999a7 0000000000000007
[   13.613531]  0000000000000000 ffff88013fc03ef8 ffffffff810734ca 0000000000000092
[   13.613531]  0000000000000011 ffff8800b8bbac00 000000000000001f 00000000000000b1
[   13.613531] Call Trace:
[   13.613531]  <IRQ>  [<ffffffff817999a7>] dump_stack+0x45/0x57
[   13.613531]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   13.613531]  [<ffffffff810735ba>] warn_slowpath_null+0x1a/0x20
[   13.613531]  [<ffffffff8104d3f4>] apic_ack_edge+0x84/0x90
[   13.613531]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   13.613531]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   13.613531]  [<ffffffff817a3b9f>] do_IRQ+0x4f/0x140
[   13.613531]  [<ffffffff817a196d>] common_interrupt+0x6d/0x6d
[   13.613531]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   13.613531]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   13.613531]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   13.613531]  [<ffffffff8101ecde>] default_idle+0x1e/0xc0
[   13.613531]  [<ffffffff8101f6cf>] arch_cpu_idle+0xf/0x20
[   13.613531]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   13.613531]  [<ffffffff8178e9b7>] rest_init+0x77/0x80
[   13.613531]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   13.613531]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   13.613531]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   13.613531]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   13.613531] ---[ end trace 4512c19aad733ce8 ]---

I modified the posted patch with the following:

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index bf32309..dc3e192 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
 	if (vector >= 16) {
 		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 		v >>= vector & 0x1f;
-		WARN_ON_ONCE(!(v & 1));
+		WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
 	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction

And it showed vector = 1b when booting.
However, when I run the reproducer on an affected machine I get the following WARNs before the hang:

[   36.301282] ------------[ cut here ]------------
[   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[   36.301301] ack_APIC_irq: vector = e1
[   36.301303] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E) parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   36.301344] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W   E   4.0.0-rc6-655d7f2linus2 #5
[   36.301346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   36.301348]  ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[   36.301352]  ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca ffffffff81c0c0f8
[   36.301355]  0000000000000001 00000000000000e1 0000000000000019 0000000000000051
[   36.301359] Call Trace:
[   36.301361]  <IRQ>  [<ffffffff817998e7>] dump_stack+0x45/0x57
[   36.301373]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   36.301377]  [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[   36.301383]  [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[   36.301387]  [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[   36.301392]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   36.301399]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   36.301403]  [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[   36.301407]  [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[   36.301409]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   36.301417]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   36.301422]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   36.301426]  [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[   36.301430]  [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[   36.301433]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   36.301438]  [<ffffffff8178e8f7>] rest_init+0x77/0x80
[   36.301443]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   36.301446]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   36.301450]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   36.301453]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   36.301455] ---[ end trace 8f10e066e6d6eddf ]---
[   40.430515] ------------[ cut here ]------------
[   40.430531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[   40.430533] ack_APIC_irq: vector = 22
[   40.430535] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E) parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   40.430576] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W   E   4.0.0-rc6-655d7f2linus2 #5
[   40.430578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   40.430581]  ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[   40.430584]  ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca 0000000000000202
[   40.430587]  0000000000000002 0000000000000022 000000000000001d 0000000000000091
[   40.430591] Call Trace:
[   40.430593]  <IRQ>  [<ffffffff817998e7>] dump_stack+0x45/0x57
[   40.430605]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   40.430609]  [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[   40.430615]  [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[   40.430631]  [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[   40.430637]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   40.430643]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   40.430647]  [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[   40.430651]  [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[   40.430653]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   40.430661]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   40.430666]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   40.430670]  [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[   40.430674]  [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[   40.430678]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   40.430684]  [<ffffffff8178e8f7>] rest_init+0x77/0x80
[   40.430689]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   40.430693]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   40.430696]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   40.430699]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   40.430702] ---[ end trace 8f10e066e6d6ede0 ]---

So vector = e1 then 22 before the hang.

[   45.728212] kvm [1602]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[   53.391974] random: nonblocking pool is initialized
[   80.076005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1602]

Some things I've already done are:
- run with nox2apic on L0,L1 (so apic->name == "flat"), this still reproduces
- run with apic=off on L0,L1, this still reproduces

Anyway, maybe this sheds some more light on this issue. I can reproduce this at will, so let me know of other experiments to do.

Thanks,
--chris j arges
* Re: smp_call_function_single lockups
  2015-03-31  3:15 ` Chris J Arges
@ 2015-03-31  4:28 ` Linus Torvalds
  2015-03-31  4:46 ` Linus Torvalds
  2015-03-31 15:08 ` Linus Torvalds
  2 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 4:28 UTC
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I've been able to repro with your patch and observed the WARN_ON when booting a
> VM on affected hardware and non affected hardware:

Ok, interesting. So the whole "we try to do an APIC ACK with the ISR bit clear" seems to be a real issue.

> I modified the posted patch with the following:
> -		WARN_ON_ONCE(!(v & 1));
> +		WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);

Yes, makes sense, although I'm not sure what the vector translations end up being. See below.

> And it showed vector = 1b when booting. However, when I run the reproducer on
> an affected machine I get the following WARNs before the hang:

Ok, so the boot-time thing I think happens because a device irq happens but goes away immediately because the CPU that triggers it also clears it immediately in the device initialization code, so it's a level-triggered interrupt that goes away "on its own".

But vector 0x1b seems odd. I thought we mapped external interrupts to 0x20+ (FIRST_EXTERNAL_VECTOR). Ingo/Peter? Is there any sane interface to look up the percpu apic vector data?

Chris, since this is repeatable for you, can you do

	int irq;

	irq = __this_cpu_read(vector_irq[vector]);

and print that out too? That *should* show the actual hardware irq, although there are a few magic cases too (-1/-2 mean special things)

But the fact that you get the warning before the hang is much more interesting.

> [   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
> [   36.301301] ack_APIC_irq: vector = e1

Is this repeatable? Does it happen before *every* hang, or at least often enough to be a good pattern?

> [   40.430533] ack_APIC_irq: vector = 22
>
> So vector = e1 then 22 before the hang.

Is it always the same ones? I assume that on different machines the vector allocations would be different, but is it consistent on any particular machine? That's assuming the whole warning is consistent at all before the hang, of course.

> Anyway, maybe this sheds some more light on this issue. I can reproduce this at
> will, so let me know of other experiments to do.

Somebody else who knows the apic needs to also take a look, but I'd love to hear what the actual hardware irq is (from that "vector_irq[vector]" thing above. I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they are 0xf0-0xff), so it sounds like an external one. But that then requires the whole mapping table thing.

Ingo/Peter/Jiang - is there anything else useful we could print out? I worry about the irq movement code. Can we add printk's to when an irq is chasing from one CPU to another and doing that "move_in_progress" thing? I've always been scared of that code.

Linus
* Re: smp_call_function_single lockups
  2015-03-31  3:15 ` Chris J Arges
  2015-03-31  4:28 ` Linus Torvalds
@ 2015-03-31  4:46 ` Linus Torvalds
  2015-03-31 15:08 ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 4:46 UTC
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
> [   13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
> [   13.613531]  [<ffffffff8104d3f4>] apic_ack_edge+0x84/0x90
> [   13.613531]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
> [   13.613531]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
> [   13.613531]  [<ffffffff817a3b9f>] do_IRQ+0x4f/0x140
> [   13.613531]  [<ffffffff817a196d>] common_interrupt+0x6d/0x6d
> [   13.613531]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
> [   13.613531]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
> [   13.613531]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
> [   13.613531]  [<ffffffff8101ecde>] default_idle+0x1e/0xc0

Hmm. I didn't notice that "hrtimer_start" was always there as a stale entry on the stack when this happened.

That may well be immaterial - the CPU being idle means that the last thing it did before going to sleep was likely that "start timer" thing, but it's interesting even so. Some issue with reprogramming the hrtimer as it is triggering, kind of similar to the bootup case I saw where the keyboard init sequence raises an interrupt that was already cleared by the time the interrupt happened.

So maybe something like this happens:

 - local timer is about to go off and raises the interrupt line

 - in the meantime, we're reprogramming the timer into the future

 - the CPU takes the interrupt, but now the timer has been reprogammed, so the irq line is no longer active, and ISR is zero even though we took the interrupt (which is why the new warning triggers)

 - we're running the local timer interrupt (which happened due to the *old* programmed value), but we do something wrong because when we read the timer state, we see the *new* programmed value and so we think that it's the new timer that triggered.

I dunno. I don't see why we'd lock up, but DaveJ's old lockup had several signs that it seemed to be timer-related.

It would be interesting to see the actual irq number. Maybe this has nothing what-so-ever to do with the hrtimer.

Linus
* Re: smp_call_function_single lockups 2015-03-31 3:15 ` Chris J Arges 2015-03-31 4:28 ` Linus Torvalds 2015-03-31 4:46 ` Linus Torvalds @ 2015-03-31 15:08 ` Linus Torvalds 2015-03-31 22:23 ` Chris J Arges 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-03-31 15:08 UTC (permalink / raw) To: Chris J Arges Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges <chris.j.arges@canonical.com> wrote: > > I modified the posted patch with the following: Actually, in addition to Ingo's patches (and the irq printout), which you should try first, if none of that really gives any different behavior, can modify that ack_APIC_irq() debugging code a bit more: > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h > index bf32309..dc3e192 100644 > --- a/arch/x86/include/asm/apic.h > +++ b/arch/x86/include/asm/apic.h > @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector) > if (vector >= 16) { > unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1)); > v >>= vector & 0x1f; > - WARN_ON_ONCE(!(v & 1)); > + WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector); > } > /* > * ack_APIC_irq() actually gets compiled as a single instruction So what I'd suggest doing is: - change the test of "vector >= 16" to just "vector >= 0". We still have "-1" as the "unknown vector" thing, but I think only the ack_bad_irq() thing calls it, and that should print out its own message if it ever triggers, so it isn't an issue. The reason for the ">= 16" was kind of bogus - the first 16 vectors are system vectors, but we definitely shouldn't ack the apic for such vectors anyway, so giving a warning for them is very much appropriate. In particular, vector 2 is NMI, and maybe we do ACk it incorrectly. 
 - add a "return" if the warning triggers, and simply don't do the
actual ACK cycle at all if the ISR bit is clear.

   IOW, make it do "if (WARN(..)) return;"

Now, we might get the vector number wrong for some reason, and in that
case not ACK'ing at all might cause problems too, but it would be
interesting to see if it changes behavior wrt the lockup.

I don't have any other ideas at the moment, but hopefully the
suggested changes by me and Ingo will give some more data to go on and
clarify what might be going on.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 15:08 ` Linus Torvalds
@ 2015-03-31 22:23 ` Chris J Arges
  2015-03-31 23:07 ` Linus Torvalds
  2015-04-01 12:43 ` Ingo Molnar
  0 siblings, 2 replies; 54+ messages in thread
From: Chris J Arges @ 2015-03-31 22:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 08:08:40AM -0700, Linus Torvalds wrote:
> On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I modified the posted patch with the following:
>
> Actually, in addition to Ingo's patches (and the irq printout), which
> you should try first, if none of that really gives any different
> behavior, can modify that ack_APIC_irq() debugging code a bit more:
>
> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> > index bf32309..dc3e192 100644
> > --- a/arch/x86/include/asm/apic.h
> > +++ b/arch/x86/include/asm/apic.h
> > @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
> >         if (vector >= 16) {
> >                 unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
> >                 v >>= vector & 0x1f;
> > -               WARN_ON_ONCE(!(v & 1));
> > +               WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
> >         }
> >         /*
> >          * ack_APIC_irq() actually gets compiled as a single instruction
>
> So what I'd suggest doing is:
>
>  - change the test of "vector >= 16" to just "vector >= 0".
>
>    We still have "-1" as the "unknown vector" thing, but I think only
> the ack_bad_irq() thing calls it, and that should print out its own
> message if it ever triggers, so it isn't an issue.
>
>    The reason for the ">= 16" was kind of bogus - the first 16 vectors
> are system vectors, but we definitely shouldn't ack the apic for such
> vectors anyway, so giving a warning for them is very much appropriate.
> In particular, vector 2 is NMI, and maybe we do ACk it incorrectly.
>
>  - add a "return" if the warning triggers, and simply don't do the
> actual ACK cycle at all if the ISR bit is clear.
>
>    IOW, make it do "if (WARN(..)) return;"
>
> Now, we might get the vector number wrong for some reason, and in that
> case not ACK'ing at all might cause problems too, but it would be
> interesting to see if it changes behavior wrt the lockup.
>
> I don't have any other ideas at the moment, but hopefully the
> suggested changes by me and Ingo will give some more data to go on and
> clarify what might be going on.
>
>                Linus
>

Linus,

I had a few runs with your patch plus modifications, and got the following
results (modified patch inlined below):

[   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
[  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]

[   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
[  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
[  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
[  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]

[   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
[  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
[  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
[  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

[   15.026077] ack_APIC_irq: vector = b1, irq = ffffffff
[  374.828531] ack_APIC_irq: vector = c1, irq = ffffffff
[  402.965942] ack_APIC_irq: vector = d1, irq = ffffffff
[  434.540814] ack_APIC_irq: vector = e1, irq = ffffffff
[  461.820768] ack_APIC_irq: vector = 22, irq = ffffffff
[  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]

[   17.889334] ack_APIC_irq: vector = d1, irq = ffffffff
[  291.888784] ack_APIC_irq: vector = e1, irq = ffffffff
[  297.824627] ack_APIC_irq: vector = 22, irq = ffffffff
[  336.960594] ack_APIC_irq: vector = 42, irq = ffffffff
[  367.012706] ack_APIC_irq: vector = 52, irq = ffffffff
[  377.025090] ack_APIC_irq: vector = 62, irq = ffffffff
[  417.088773] ack_APIC_irq: vector = 72, irq = ffffffff
[  447.136788] ack_APIC_irq: vector = 82, irq = ffffffff
-- stopped it since it wasn't reproducing / I was impatient --

So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
(vector >= 0) didn't seem to expose any additional vectors.

This was only instrumented at the L1 level, I'm also planning on
instrumenting the L0 kernel and seeing if we get these as well.

--chris j arges

--

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -395,7 +395,7 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
 	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -426,7 +426,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -435,13 +435,26 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+#include <asm/irq_vectors.h>
+typedef int vector_irq_t[NR_VECTORS];
+DECLARE_PER_CPU(vector_irq_t, vector_irq);
+
+static inline void ack_APIC_irq(int vector)
 {
+	/* Is the ISR bit actually set for this vector? */
+	if (vector >= 0) {
+		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+		int irq;
+		v >>= vector & 0x1f;
+		irq = __this_cpu_read(vector_irq[vector]);
+
+		WARN(!(v & 1), "ack_APIC_irq: vector = %0x, irq = %0x\n", vector, irq);
+	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-	apic_eoi();
+	apic_eoi(vector);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
@@ -639,9 +652,9 @@ static inline void entering_irq(void)
 	exit_idle();
 }
 
-static inline void entering_ack_irq(void)
+static inline void entering_ack_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	entering_irq();
 }
 
@@ -650,11 +663,11 @@ static inline void exiting_irq(void)
 	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
+static inline void exiting_ack_irq(int vector)
 {
 	irq_exit();
 	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 extern void ioapic_zap_locks(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..20ef123 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -910,7 +910,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	exiting_irq();
 
@@ -929,7 +929,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1342,7 +1342,7 @@ void setup_local_APIC(void)
 		value = apic_read(APIC_ISR + i*0x10);
 		for (j = 31; j >= 0; j--) {
 			if (value & (1<<j)) {
-				ack_APIC_irq();
+				ack_APIC_irq(-1);
 				acked++;
 			}
 		}
@@ -1868,7 +1868,7 @@ static inline void __smp_spurious_interrupt(u8 vector)
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f)))
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 	inc_irq_stat(irq_spurious_count);
 
@@ -1917,7 +1917,7 @@ static inline void __smp_error_interrupt(struct pt_regs *regs)
 	if (lapic_get_maxlvt() > 3)	/* Due to the Pentium erratum 3AP. */
 		apic_write(APIC_ESR, 0);
 	v = apic_read(APIC_ESR);
-	ack_APIC_irq();
+	ack_APIC_irq(ERROR_APIC_VECTOR);
 	atomic_inc(&irq_err_count);
 
 	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x",
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f4dc246..b74db5a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static void ack_ioapic_level(struct irq_data *data)
 	 * We must acknowledge the irq before we move it or the acknowledge will
 	 * not propagate properly.
 	 */
-	ack_APIC_irq();
+	ack_APIC_irq(i);
 
 	/*
 	 * Tail end of clearing remote IRR bit (either by delivering the EOI
@@ -2067,7 +2067,8 @@ static void unmask_lapic_irq(struct irq_data *data)
 
 static void ack_lapic_irq(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static struct irq_chip lapic_chip __read_mostly = {
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd79..30643ff 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +335,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
-	irq_complete_move(irqd_cfg(data));
+	struct irq_cfg *cfg = irqd_cfg(data);
+
+	irq_complete_move(cfg);
 	irq_move_irq(data);
-	ack_APIC_irq();
+	ack_APIC_irq(cfg->vector);
 }
 
 /*
@@ -397,7 +399,7 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 {
 	unsigned vector, me;
 
-	ack_APIC_irq();
+	ack_APIC_irq(IRQ_MOVE_CLEANUP_VECTOR);
 
 	irq_enter();
 	exit_idle();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1..89bac1f 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -433,7 +433,7 @@ asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
 	__smp_thermal_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
@@ -442,7 +442,7 @@ asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	__smp_thermal_interrupt();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980..03068d1 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -28,7 +28,7 @@ asmlinkage __visible void smp_threshold_interrupt(void)
 {
 	entering_irq();
 	__smp_threshold_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_threshold_interrupt(void)
@@ -37,5 +37,5 @@ asmlinkage __visible void smp_trace_threshold_interrupt(void)
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	__smp_threshold_interrupt();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 67b1cbe..fc2b9a0 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -45,7 +45,8 @@ void ack_bad_irq(unsigned int irq)
 	 * completely.
 	 * But only ack when the APIC is enabled -AK
 	 */
-	ack_APIC_irq();
+	/* This doesn't know the vector. It needs the irq descriptor */
+	ack_APIC_irq(-1);
 }
 
 #define irq_stats(x)		(&per_cpu(irq_stat, x))
@@ -198,7 +199,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	irq = __this_cpu_read(vector_irq[vector]);
 
 	if (!handle_irq(irq, regs)) {
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 		if (irq != VECTOR_RETRIGGERED) {
 			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector (irq %d)\n",
@@ -230,7 +231,7 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -244,7 +245,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	ack_APIC_irq();
+	ack_APIC_irq(POSTED_INTR_VECTOR);
 
 	irq_enter();
 
@@ -262,7 +263,7 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 15d741d..dfe9263 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -10,10 +10,10 @@
 #include <asm/apic.h>
 #include <asm/trace/irq_vectors.h>
 
-static inline void irq_work_entering_irq(void)
+static inline void irq_work_entering_irq(int vector)
 {
 	irq_enter();
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 static inline void __smp_irq_work_interrupt(void)
@@ -24,14 +24,14 @@ static inline void __smp_irq_work_interrupt(void)
 
 __visible void smp_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bd..44b74b2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -170,7 +170,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 
 asmlinkage __visible void smp_reboot_interrupt(void)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(REBOOT_VECTOR);
 	irq_enter();
 	stop_this_cpu(NULL);
 	irq_exit();
@@ -258,16 +258,16 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
 }
 
-static inline void smp_entering_irq(void)
+static inline void smp_entering_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	irq_enter();
 }
 
@@ -279,7 +279,7 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 	 * scheduler_ipi(). This is OK, since those functions are allowed
 	 * to nest.
 	 */
-	smp_entering_irq();
+	smp_entering_irq(RESCHEDULE_VECTOR);
 	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -297,14 +297,14 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -319,14 +319,14 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 390079e..896c4b3 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -334,12 +334,14 @@ void panic_if_irq_remap(const char *msg)
 
 static void ir_ack_apic_edge(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static void ir_ack_apic_level(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 	eoi_ioapic_irq(data->irq, irqd_cfg(data));
 }

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 22:23 ` Chris J Arges
@ 2015-03-31 23:07 ` Linus Torvalds
  2015-04-01 14:32 ` Chris J Arges
  2015-04-01 12:43 ` Ingo Molnar
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 23:07 UTC (permalink / raw)
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):

Ok, thanks.

> [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
>
> [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
>
> [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

.. snip snip ..

So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
irq setup, I'm not seeing that being something that your load should
trigger.

It could also obviously just be the vector being somehow corrupted,
either due to crazy hardware or software.

But quite frankly, the most likely reason is that whole irq vector movement.

Especially since it sounds from your other email that when you apply
Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
did you just grep for "move" in the messages?
If you do get both movement messages (from Ingo's patch) _and_ the
ack_APIC_irq warnings (from mine), it would be interesting to see if
the vectors line up somehow..

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 23:07 ` Linus Torvalds
@ 2015-04-01 14:32 ` Chris J Arges
  2015-04-01 15:36 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 14:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I had a few runs with your patch plus modifications, and got the following
> > results (modified patch inlined below):
>
> Ok, thanks.
>
> > [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
> >
> > [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> > [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> > [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
> >
> > [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> > [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> > [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
> .. snip snip ..
>
> So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
> irq setup, I'm not seeing that being something that your load should
> trigger.
>
> It could also obviously just be the vector being somehow corrupted,
> either due to crazy hardware or software.
>
> But quite frankly, the most likely reason is that whole irq vector movement.
>
> Especially since it sounds from your other email that when you apply
> Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
> did you just grep for "move" in the messages?
>
> If you do get both movement messages (from Ingo's patch) _and_ the
> ack_APIC_irq warnings (from mine), it would be interesting to see if
> the vectors line up somehow..
>
>                Linus
>

Linus,

I included the full patch in reply to Ingo's email, and when running with that
I no longer get the ack_APIC_irq WARNs.

My next homework assignments are:
- Testing with irqbalance disabled
- Testing w/ the appropriate dump_stack() in Ingo's patch
- L0 testing

Thanks,
--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 14:32 ` Chris J Arges
@ 2015-04-01 15:36 ` Linus Torvalds
  2015-04-02  9:55 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-04-01 15:36 UTC (permalink / raw)
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I included the full patch in reply to Ingo's email, and when running with that
> I no longer get the ack_APIC_irq WARNs.

Ok. That means that the printk's themselves just change timing enough,
or change the compiler instruction scheduling so that it hides the
apic problem.

Which very much indicates that these things are interconnected.

For example, Ingo's printk patch does

        cfg->move_in_progress =
           cpumask_intersects(cfg->old_domain, cpu_online_mask);
+       if (cfg->move_in_progress)
+               pr_info("apic: vector %02x, same-domain move in progress\n",
+                       cfg->vector);
        cpumask_and(cfg->domain, cfg->domain, tmp_mask);

and that means that now the setting of move_in_progress is serialized
with the cpumask_and() in a way that it wasn't before.

And while the code takes the "vector_lock" and disables interrupts,
the interrupts themselves can happily continue on other cpu's, and
they don't take the vector_lock. Neither does send_cleanup_vector(),
which clears that bit, afaik.

I don't know. The locking there is odd.

> My next homework assignments are:
> - Testing with irqbalance disabled

Definitely.

> - Testing w/ the appropriate dump_stack() in Ingo's patch
> - L0 testing

Thanks,
               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 15:36 ` Linus Torvalds
@ 2015-04-02  9:55 ` Ingo Molnar
  2015-04-02 17:35 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-04-02 9:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I included the full patch in reply to Ingo's email, and when
> > running with that I no longer get the ack_APIC_irq WARNs.
>
> Ok. That means that the printk's themselves just change timing
> enough, or change the compiler instruction scheduling so that it
> hides the apic problem.

So another possibility would be that it's the third change causing
this change in behavior:

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
+	ack_APIC_irq();
+
+	/* Might generate IPIs, so do this after having ACKed the APIC: */
 	irq_complete_move(irqd_cfg(data));
 	irq_move_irq(data);
-	ack_APIC_irq();
 }
 
 /*

... since with this we won't send IPIs in a semi-nested fashion with
an unacked APIC, which is a good idea to do in general. It's also a
weird enough hardware pattern that virtualization's APIC emulation
might get it slightly wrong or slightly different.

> Which very much indicates that these things are interconnected.
>
> For example, Ingo's printk patch does
>
>         cfg->move_in_progress =
>            cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +       if (cfg->move_in_progress)
> +               pr_info("apic: vector %02x, same-domain move in progress\n",
> +                       cfg->vector);
>         cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>
> and that means that now the setting of move_in_progress is
> serialized with the cpumask_and() in a way that it wasn't before.

Yeah, that's a possibility too. It all looks very fragile.

Thanks,

	Ingo

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-02  9:55 ` Ingo Molnar
@ 2015-04-02 17:35 ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-04-02 17:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Thu, Apr 2, 2015 at 2:55 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So another possibility would be that it's the third change causing
> this change in behavior:

Oh, yes, that looks much more likely. I overlooked that small change entirely.

> ... since with this we won't send IPIs in a semi-nested fashion with
> an unacked APIC, which is a good idea to do in general. It's also a
> weird enough hardware pattern that virtualization's APIC emulation
> might get it slightly wrong or slightly different.

Yup, that's more likely than the subtle timing/reordering differences
of just the added debug printouts. I agree.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 22:23 ` Chris J Arges
  2015-03-31 23:07 ` Linus Torvalds
@ 2015-04-01 12:43 ` Ingo Molnar
  2015-04-01 16:10 ` Chris J Arges
  1 sibling, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-04-01 12:43 UTC (permalink / raw)
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Linus,
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):
>
> [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
>
> [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
>
> [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
>
> [   15.026077] ack_APIC_irq: vector = b1, irq = ffffffff
> [  374.828531] ack_APIC_irq: vector = c1, irq = ffffffff
> [  402.965942] ack_APIC_irq: vector = d1, irq = ffffffff
> [  434.540814] ack_APIC_irq: vector = e1, irq = ffffffff
> [  461.820768] ack_APIC_irq: vector = 22, irq = ffffffff
> [  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]
>
> [   17.889334] ack_APIC_irq: vector = d1, irq = ffffffff
> [  291.888784] ack_APIC_irq: vector = e1, irq = ffffffff
> [  297.824627] ack_APIC_irq: vector = 22, irq = ffffffff
> [  336.960594] ack_APIC_irq: vector = 42, irq = ffffffff
> [  367.012706] ack_APIC_irq: vector = 52, irq = ffffffff
> [  377.025090] ack_APIC_irq: vector = 62, irq = ffffffff
> [  417.088773] ack_APIC_irq: vector = 72, irq = ffffffff
> [  447.136788] ack_APIC_irq: vector = 82, irq = ffffffff
> -- stopped it since it wasn't reproducing / I was impatient --
>
> So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
> (vector >= 0) didn't seem to expose any additional vectors.

So, these vectors do seem to be lining up with the pattern of how new
irq vectors get assigned and how we slowly rotate through all
available ones.

The VECTOR_UNDEFINED might correspond to the fact that we already
'freed' that vector, as part of the irq-move mechanism - but it was
likely in use shortly before. So the irq-move code is not off the
hook, to the contrary.

Have you already tested whether the hang goes away if you remove
irq-affinity fiddling daemons from the system? Do you have irqbalance
installed or similar mechanisms?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 12:43 ` Ingo Molnar
@ 2015-04-01 16:10 ` Chris J Arges
  2015-04-01 16:14 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 16:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote:
<snip>
> Have you already tested whether the hang goes away if you remove
> irq-affinity fiddling daemons from the system? Do you have irqbalance
> installed or similar mechanisms?
>
> Thanks,
>
>	Ingo
>

Even with irqbalance removed from the L0/L1 machines the hang still occurs.

Using the debug patch as described here:
https://lkml.org/lkml/2015/4/1/338

This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.

--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 16:10 ` Chris J Arges
@ 2015-04-01 16:14 ` Linus Torvalds
  2015-04-01 21:59 ` Chris J Arges
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-04-01 16:14 UTC (permalink / raw)
To: Chris J Arges
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>
> This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.

Ok. So the ack_APIC debug patch found *something*, but it seems to be
unrelated to the hang.

Dang. Oh well. Back to square one.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 16:14 ` Linus Torvalds
@ 2015-04-01 21:59 ` Chris J Arges
  2015-04-02 17:31 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On 04/01/2015 11:14 AM, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
>>
>> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>>
>> This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.
>
> Ok. So the ack_APIC debug patch found *something*, but it seems to be
> unrelated to the hang.
>
> Dang. Oh well. Back to square one.
>
>                Linus
>

With my L0 testing I've normally used a 3.13 series kernel since it tends to
reproduce the hang very quickly with the testcase. Note we have reproduced an
identical hang with newer kernels (3.19+patches) using the openstack tempest
on openstack reproducer, but the timing can vary between hours and days.
Installing a v4.0-rc6+patch kernel on L0 makes the problem very slow to
reproduce, so I am running these tests now which may take day(s).

Is it worthwhile to do a 'bisect' to see where on average it takes longer to
reproduce? Perhaps it will point to a relevant change, or it may be
completely useless.

--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-01 21:59 ` Chris J Arges @ 2015-04-02 17:31 ` Linus Torvalds 2015-04-02 18:26 ` Ingo Molnar 2015-04-06 17:23 ` Chris J Arges 0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-04-02 17:31 UTC (permalink / raw)
To: Chris J Arges
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers

On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> Is it worthwhile to do a 'bisect' to see where on average it takes
> longer to reproduce? Perhaps it will point to a relevant change, or it
> may be completely useless.

It's likely to be an exercise in futility. "git bisect" is really bad
at "gray area" things, and when it's a question of "it takes hours or
days to reproduce", it's almost certainly not worth it. Not unless
there is some really clear cut-off that we can believably say "this
causes it to get much slower". And in this case, I don't think it's
that clear-cut. Judging by DaveJ's attempts at bisecting things, the
timing just changes. And the differences might be due to entirely
unrelated changes like cacheline alignment etc.

So unless we find a real clear signature of the bug (I was hoping that
the ISR bit would be that sign), I don't think trying to bisect it
based on how quickly you can reproduce things is worthwhile.

                 Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 17:31 ` Linus Torvalds @ 2015-04-02 18:26 ` Ingo Molnar 2015-04-02 18:51 ` Chris J Arges 2015-04-06 17:23 ` Chris J Arges 1 sibling, 1 reply; 54+ messages in thread From: Ingo Molnar @ 2015-04-02 18:26 UTC (permalink / raw) To: Linus Torvalds Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers * Linus Torvalds <torvalds@linux-foundation.org> wrote: > So unless we find a real clear signature of the bug (I was hoping > that the ISR bit would be that sign), I don't think trying to bisect > it based on how quickly you can reproduce things is worthwhile. So I'm wondering (and I might have missed some earlier report that outlines just that), now that the possible location of the bug is again sadly up to 15+ million lines of code, I have no better idea than to debug by symptoms again: what kind of effort was made to examine the locked up state itself? Softlockups always have some direct cause, which task exactly causes scheduling to stop altogether, why does it lock up - or is it not a clear lockup, just a very slow system? Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 18:26 ` Ingo Molnar @ 2015-04-02 18:51 ` Chris J Arges 2015-04-02 19:07 ` Ingo Molnar 0 siblings, 1 reply; 54+ messages in thread From: Chris J Arges @ 2015-04-02 18:51 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On 04/02/2015 01:26 PM, Ingo Molnar wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> So unless we find a real clear signature of the bug (I was hoping >> that the ISR bit would be that sign), I don't think trying to bisect >> it based on how quickly you can reproduce things is worthwhile. > > So I'm wondering (and I might have missed some earlier report that > outlines just that), now that the possible location of the bug is > again sadly up to 15+ million lines of code, I have no better idea > than to debug by symptoms again: what kind of effort was made to > examine the locked up state itself? > Ingo, Rafael did some analysis when I was out earlier here: https://lkml.org/lkml/2015/2/23/234 My reproducer setup is as follows: L0 - 8-way CPU, 48 GB memory L1 - 2-way vCPU, 4 GB memory L2 - 1-way vCPU, 1 GB memory Stress is only run in the L2 VM, and running top on L0/L1 doesn't show excessive load. > Softlockups always have some direct cause, which task exactly causes > scheduling to stop altogether, why does it lock up - or is it not a > clear lockup, just a very slow system? > > Thanks, > > Ingo > Whenever we look through the crashdump we see csd_lock_wait waiting for CSD_FLAG_LOCK bit to be cleared. 
Usually the signature leading up to that looks like the following (in the openstack tempest on openstack and nested VM stress case) (qemu-system-x86 task) kvm_sched_in -> kvm_arch_vcpu_load -> vmx_vcpu_load -> loaded_vmcs_clear -> smp_call_function_single (ksmd task) pmdp_clear_flush -> flush_tlb_mm_range -> native_flush_tlb_others -> smp_call_function_many --chris ^ permalink raw reply [flat|nested] 54+ messages in thread
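The csd_lock_wait spin Chris describes comes from the CSD locking protocol in kernel/smp.c: the sender sets CSD_FLAG_LOCK before queueing the call, and the target CPU's IPI handler clears it after running the function. A minimal user-space sketch of that protocol (a simplified illustration with kernel-style names, not the kernel code itself):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch of the csd locking protocol from kernel/smp.c.
 * CSD_FLAG_LOCK is set by the sender before queueing the call and
 * cleared by the target CPU once the function has run; the sender's
 * csd_lock_wait() spins until that happens. If the IPI never arrives,
 * this spin is exactly the lockup signature seen in the crashdumps.
 * Names mirror the kernel but this is a simplified model.
 */
#define CSD_FLAG_LOCK 0x01u

struct call_single_data_sketch {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

static void csd_lock(struct call_single_data_sketch *csd)
{
	/* csd_lock_wait(): wait for any previous user of this slot */
	while (atomic_load(&csd->flags) & CSD_FLAG_LOCK)
		;	/* spin */
	atomic_fetch_or(&csd->flags, CSD_FLAG_LOCK);
}

static void csd_unlock(struct call_single_data_sketch *csd)
{
	/* done by the target CPU; releases the spinning sender */
	atomic_fetch_and(&csd->flags, ~CSD_FLAG_LOCK);
}

/* what the target CPU does when the IPI actually arrives */
static void csd_ipi_handler(struct call_single_data_sketch *csd)
{
	csd->func(csd->info);
	csd_unlock(csd);
}
```

If the CALL_FUNCTION_SINGLE IPI is lost, csd_unlock() never runs and the sender spins in csd_lock() / csd_lock_wait() forever, which is what the soft lockup watchdog then reports.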
* Re: smp_call_function_single lockups 2015-04-02 18:51 ` Chris J Arges @ 2015-04-02 19:07 ` Ingo Molnar 2015-04-02 20:57 ` Linus Torvalds 2015-04-02 21:13 ` Chris J Arges 0 siblings, 2 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-04-02 19:07 UTC (permalink / raw)
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers

* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Whenever we look through the crashdump we see csd_lock_wait waiting
> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading
> up to that looks like the following (in the openstack tempest on
> openstack and nested VM stress case)
>
> (qemu-system-x86 task)
> kvm_sched_in
>  -> kvm_arch_vcpu_load
>   -> vmx_vcpu_load
>    -> loaded_vmcs_clear
>     -> smp_call_function_single
>
> (ksmd task)
> pmdp_clear_flush
>  -> flush_tlb_mm_range
>   -> native_flush_tlb_others
>    -> smp_call_function_many

So is this two separate smp_call_function instances, crossing each
other, and neither makes any progress, indefinitely - as if the two
IPIs got lost?

The traces Rafael linked to show a simpler scenario with two CPUs
apparently locked up, doing this:

CPU0:

 #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
 #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
 #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
 #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
 #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
#10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
#11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
#12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c

This CPU is idle.

CPU1:

#10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
#11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
#12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
#13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
#14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
#15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
#16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
#17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
#18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
#19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad

This CPU is busy-waiting for the TLB flush IPI to finish.

There's no unexpected pattern here (other than it not finishing),
AFAICS: the smp_call_function_single() is just the usual way we invoke
the TLB flushing methods.

So one possibility would be that an 'IPI was sent but lost'.

We could try the following trick: poll for completion for a couple of
seconds (since an IPI is not held up by anything but irqs-off
sections, it should arrive within microseconds typically - seconds of
polling should be more than enough), and if the IPI does not arrive,
print a warning message and re-send the IPI.

If the IPI was lost due to some race and there's no other failure mode
that we don't understand, then this would work around the bug and
would make the tests pass indefinitely - with occasional hiccups and a
handful of messages produced along the way whenever it would have
locked up with a previous kernel.

If testing indeed confirms that kind of behavior we could drill down
more closely to figure out why the IPI did not get to its destination.

Or if the behavior is different, we'd have some new behavior to look
at. (for example the IPI sending mechanism might be wedged
indefinitely for some reason, so that even a resend won't work.)

Agreed?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
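The poll-and-resend trick Ingo proposes can be sketched in user space: instead of spinning on CSD_FLAG_LOCK forever, give up after a bounded number of iterations, warn, and re-send the IPI. This is a hypothetical single-threaded model, not kernel code - the `resend` callback stands in for re-issuing the real IPI:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sketch of the poll-and-resend idea: csd_lock_wait_timeout() spins on
 * CSD_FLAG_LOCK as usual, but after max_spins iterations it assumes
 * the IPI was lost, warns, and re-sends it. In this toy model a
 * "resend that arrives" simply results in the flag being cleared, as
 * the real handler would do. All names here are hypothetical.
 */
#define CSD_FLAG_LOCK 0x01u

struct csd { unsigned int flags; };

/* returns the number of times the IPI had to be re-sent */
static int csd_lock_wait_timeout(struct csd *csd,
				 unsigned long max_spins,
				 void (*resend)(struct csd *))
{
	unsigned long spins = 0;
	int resends = 0;

	while (csd->flags & CSD_FLAG_LOCK) {
		if (++spins >= max_spins) {
			fprintf(stderr, "csd: IPI lost? re-sending\n");
			resend(csd);
			resends++;
			spins = 0;
		}
	}
	return resends;
}

/* model of a resent IPI that does arrive: the handler runs and unlocks */
static void resend_that_arrives(struct csd *csd)
{
	csd->flags &= ~CSD_FLAG_LOCK;
}
```

If the lost-IPI theory holds, a kernel version of this would turn a hard lockup into an occasional warning message plus a resend; if the sending path itself is wedged, the resend would fail too and that would also be informative.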
* Re: smp_call_function_single lockups 2015-04-02 19:07 ` Ingo Molnar @ 2015-04-02 20:57 ` Linus Torvalds 2015-04-02 21:13 ` Chris J Arges 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2015-04-02 20:57 UTC (permalink / raw) To: Ingo Molnar Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Thu, Apr 2, 2015 at 12:07 PM, Ingo Molnar <mingo@kernel.org> wrote: > > So one possibility would be that an 'IPI was sent but lost'. Yes, the "sent but lost" thing would certainly explain the lockups. At the same time, that sounds like a huge hardware bug, and that's somewhat surprising/unlikely. That said. > We could try the following trick: poll for completion for a couple of > seconds (since an IPI is not held up by anything but irqs-off > sections, it should arrive within microseconds typically - seconds of > polling should be more than enough), and if the IPI does not arrive, > print a warning message and re-send the IPI. Sounds like a reasonable approach. At worst it doesn't fix anything, and we never see any messages, and that tells us something too. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 19:07 ` Ingo Molnar 2015-04-02 20:57 ` Linus Torvalds @ 2015-04-02 21:13 ` Chris J Arges 2015-04-03 5:45 ` Ingo Molnar 1 sibling, 1 reply; 54+ messages in thread From: Chris J Arges @ 2015-04-02 21:13 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > * Chris J Arges <chris.j.arges@canonical.com> wrote: > >> Whenever we look through the crashdump we see csd_lock_wait waiting >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading >> up to that looks like the following (in the openstack tempest on >> openstack and nested VM stress case) >> >> (qemu-system-x86 task) >> kvm_sched_in >> -> kvm_arch_vcpu_load >> -> vmx_vcpu_load >> -> loaded_vmcs_clear >> -> smp_call_function_single >> >> (ksmd task) >> pmdp_clear_flush >> -> flush_tlb_mm_range >> -> native_flush_tlb_others >> -> smp_call_function_many > > So is this two separate smp_call_function instances, crossing each > other, and none makes any progress, indefinitely - as if the two IPIs > got lost? > This is two different crash signatures. Sorry for the confusion. > The traces Rafael he linked to show a simpler scenario with two CPUs > apparently locked up, doing this: > > CPU0: > > #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386 > #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee > #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f > #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563 > #9 [ffffffff81c03f30] rest_init at ffffffff817a6067 > #10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce > #11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7 > #12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c > > This CPU is idle. 
> > CPU1: > > #10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69 > #11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae > #12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4 > #13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d > #14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39 > #15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65 > #16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03 > #17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08 > #18 [ffff88081993fe80] sys_madvise at ffffffff811b9775 > #19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad > > This CPU is busy-waiting for the TLB flush IPI to finish. > > There's no unexpected pattern here (other than it not finishing) > AFAICS, the smp_call_function_single() is just the usual way we invoke > the TLB flushing methods AFAICS. > > So one possibility would be that an 'IPI was sent but lost'. > > We could try the following trick: poll for completion for a couple of > seconds (since an IPI is not held up by anything but irqs-off > sections, it should arrive within microseconds typically - seconds of > polling should be more than enough), and if the IPI does not arrive, > print a warning message and re-send the IPI. > > If the IPI was lost due to some race and there's no other failure mode > that we don't understand, then this would work around the bug and > would make the tests pass indefinitely - with occasional hickups and a > handful of messages produced along the way whenever it would have > locked up with a previous kernel. > > If testing indeed confirms that kind of behavior we could drill down > more closely to figure out why the IPI did not get to its destination. > > Or if the behavior is different, we'd have some new behavior to look > at. (for example the IPI sending mechanism might be wedged > indefinitely for some reason, so that even a resend won't work.) > > Agreed? 
> > Thanks, > > Ingo > Ingo, I think tracking IPI calls from 'generic_exec_single' would make a lot of sense. When you say poll for completion do you mean a loop after 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern would be to not alter the timings too much so we can still reproduce the original problem. Another approach: If we want to check for non-ACKed IPIs a possibility would be to add a timestamp field to 'struct call_single_data' and just record jiffies when the IPI gets called. Then have a per-cpu kthread check the 'call_single_queue' percpu list periodically if (jiffies - timestamp) > THRESHOLD. When we reach that condition print the stale entry in call_single_queue, backtrace, then re-send the IPI. Let me know what makes the most sense to hack on. Thanks, --chris ^ permalink raw reply [flat|nested] 54+ messages in thread
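Chris's second approach - timestamping each queued csd and periodically scanning for stale entries - might look like this in outline. This is a user-space sketch: the array stands in for the per-cpu call_single_queue llist, and the field and threshold names are made up:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the stale-IPI detector Chris proposes: record jiffies in
 * each queued call_single_data at enqueue time, and have a periodic
 * checker scan for entries that have been pending longer than a
 * threshold - candidates for a backtrace and an IPI resend. The
 * enqueue_jiffies field and CSD_STALE_JIFFIES value are hypothetical.
 */
#define CSD_FLAG_LOCK		0x01u
#define CSD_STALE_JIFFIES	1000	/* made-up threshold */

struct csd_stamped {
	unsigned int flags;
	unsigned long enqueue_jiffies;	/* proposed new field */
};

/*
 * Return the index of the first entry still locked past the
 * threshold, or -1 if nothing looks stuck.
 */
static int find_stale_csd(const struct csd_stamped *q, size_t n,
			  unsigned long now)
{
	for (size_t i = 0; i < n; i++) {
		if ((q[i].flags & CSD_FLAG_LOCK) &&
		    now - q[i].enqueue_jiffies > CSD_STALE_JIFFIES)
			return (int)i;
	}
	return -1;
}
```

Compared with polling inside the sender, this keeps the hot path untouched (one extra store at enqueue time), which addresses Chris's concern about not perturbing the timings that make the bug reproducible.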
* Re: smp_call_function_single lockups 2015-04-02 21:13 ` Chris J Arges @ 2015-04-03 5:45 ` Ingo Molnar 0 siblings, 0 replies; 54+ messages in thread From: Ingo Molnar @ 2015-04-03 5:45 UTC (permalink / raw) To: Chris J Arges Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers * Chris J Arges <chris.j.arges@canonical.com> wrote: > > > On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > > > * Chris J Arges <chris.j.arges@canonical.com> wrote: > > > >> Whenever we look through the crashdump we see csd_lock_wait waiting > >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading > >> up to that looks like the following (in the openstack tempest on > >> openstack and nested VM stress case) > >> > >> (qemu-system-x86 task) > >> kvm_sched_in > >> -> kvm_arch_vcpu_load > >> -> vmx_vcpu_load > >> -> loaded_vmcs_clear > >> -> smp_call_function_single > >> > >> (ksmd task) > >> pmdp_clear_flush > >> -> flush_tlb_mm_range > >> -> native_flush_tlb_others > >> -> smp_call_function_many > > > > So is this two separate smp_call_function instances, crossing each > > other, and none makes any progress, indefinitely - as if the two IPIs > > got lost? > > > > This is two different crash signatures. Sorry for the confusion. So just in case, both crash signatures ought to be detected by the patch I just sent. Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 17:31 ` Linus Torvalds 2015-04-02 18:26 ` Ingo Molnar @ 2015-04-06 17:23 ` Chris J Arges 1 sibling, 0 replies; 54+ messages in thread From: Chris J Arges @ 2015-04-06 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Thu, Apr 02, 2015 at 10:31:50AM -0700, Linus Torvalds wrote: > On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges > <chris.j.arges@canonical.com> wrote: > > > > It is worthwhile to do a 'bisect' to see where on average it takes > > longer to reproduce? Perhaps it will point to a relevant change, or it > > may be completely useless. > > It's likely to be an exercise in futility. "git bisect" is realyl bad > at "gray area" things, and when it's a question of "it takes hours or > days to reproduce", it's almost certainly not worth it. Not unless > there is some really clear cut-off that we can believably say "this > causes it to get much slower". And in this case, I don't think it's > that clear-cut. Judging by DaveJ's attempts at bisecting things, the > timing just changes. And the differences might be due to entirely > unrelated changes like cacheline alignment etc. > > So unless we find a real clear signature of the bug (I was hoping that > the ISR bit would be that sign), I don't think trying to bisect it > based on how quickly you can reproduce things is worthwhile. > > Linus > Linus, Ingo, I did some testing and found that at the following patch level, the issue was much, much more likely to reproduce within < 15m. commit b6b8a1451fc40412c57d10c94b62e22acab28f94 Author: Jan Kiszka <jan.kiszka@siemens.com> Date: Fri Mar 7 20:03:12 2014 +0100 KVM: nVMX: Rework interception of IRQs and NMIs Move the check for leaving L2 on pending and intercepted IRQs or NMIs from the *_allowed handler into a dedicated callback. 
Invoke this callback at the relevant points before KVM checks if IRQs/NMIs
can be injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

However, when the following patch was applied, the average time to
reproduction went up greatly (the stress reproducer ran for hours without
issue):

commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3
Author: Bandan Das <bsd@redhat.com>
Date:   Tue Jul 8 00:30:23 2014 -0400

    KVM: x86: Check for nested events if there is an injectable interrupt

    With commit b6b8a1451fc40412c57d1 that introduced
    vmx_check_nested_events, checks for injectable interrupts happen
    at different points in time for L1 and L2 that could potentially
    cause a race. The regression occurs because KVM_REQ_EVENT is always
    set when nested_run_pending is set even if there's no pending interrupt.
    Consequently, there could be a small window when check_nested_events
    returns without exiting to L1, but an interrupt comes through soon
    after and it incorrectly, gets injected to L2 by inject_pending_event
    Fix this by adding a call to check for nested events too when a check
    for injectable interrupt returns true

However, we reproduced with v3.19 (containing these two patches) which did
eventually softlockup with a similar backtrace. So far, this agrees with the
current understanding that we may not be ACK'ing certain interrupts (IPIs
from the L1 guest), causing csd_lock_wait to spin and causing the soft
lockup.

Hopefully this helps shed more light on this issue.

Thanks,
--chris j arges

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 20:29 ` Linus Torvalds 2015-02-19 21:59 ` Linus Torvalds @ 2015-02-20 9:30 ` Ingo Molnar 2015-02-20 16:49 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Ingo Molnar @ 2015-02-20 9:30 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> Are there known errata for the x2apic? > > > > .. and in particular, do we still have to worry about > > the traditional local apic "if there are more than two > > pending interrupts per priority level, things get lost" > > problem? > > > > I forget the exact details. Hopefully somebody > > remembers. > > I can't find it in the docs. I find the "two-entries per > vector", but not anything that is per priority level > (group of 16 vectors). Maybe that was the IO-APIC, in > which case it's immaterial for IPI's. So if my memory serves me right, I think it was for local APICs, and even there mostly it was a performance issue: if an IO-APIC sent more than 2 IRQs per 'level' to a local APIC then the IO-APIC might be forced to resend those IRQs, leading to excessive message traffic on the relevant hardware bus. ( I think the 'resend' was automatic in this case, i.e. a hardware fallback for a CPU side resource shortage, and it could not result in actually lost IRQs. I never saw this documented properly, so people inside Intel or AMD would be in a better position to comment on this ... I might be mis-remembering this or confusing different bugs. ) > However, having now mostly re-acquainted myself with the > APIC details, it strikes me that we do have some oddities > here. 
> > In particular, a few interrupt types are very special: > NMI, SMI, INIT, ExtINT, or SIPI are handled early in the > interrupt acceptance logic, and are sent directly to the > CPU core, without going through the usual intermediate > IRR/ISR dance. > > And why might this matter? It's important because it > means that those kinds of interrupts must *not* do the > apic EOI that ack_APIC_irq() does. > > And we correctly don't do ack_APIC_irq() for NMI etc, but > it strikes me that ExtINT is odd and special. > > I think we still use ExtINT for some odd cases. We used > to have some magic with the legacy timer interrupt, for > example. And I think they all go through the normal > "do_IRQ()" logic regardless of whether they are ExtINT or > not. > > Now, what happens if we send an EOI for an ExtINT > interrupt? It basically ends up being a spurious IPI. And > I *think* that what normally happens is absolutely > nothing at all. But if in addition to the ExtINT, there > was a pending IPI (or other pending ISR bit set), maybe > we lose interrupts.. 1) I think you got it right. So the principle of EOI acknowledgement from the OS to the local APIC is specific to the IRQ that raised the interrupt and caused the vector to be executed, so it's not possible to ack the 'wrong' IRQ. But technically the EOI is state-less, i.e. (as you know) we write a constant value to a local APIC register without indicating which vector or external IRQ we meant. The OS wants to ack 'the IRQ that we are executing currently', but this leaves the situation a bit confused in cases where for example an IRQ handler enables IRQs, another IRQ comes in and stays unacked. So I _think_ it's not possible to accidentally acknowledge a pending IRQ that has not been issued to the CPU yet (unless we have hardirqs enabled), just by writing stray EOIs to the local APIC. So in that sense the ExtInt irq0 case should be mostly harmless. 
But I could be wrong :-/ 2) So my suggestion for this bug would be: The 'does a stray EOI matter' question could also be tested by deliberately writing two EOIs instead of just one - does this trigger the bug faster? Then perhaps try to make sure that no hardirqs get ever enabled in an irq handler, and figure out whether any of the IRQs in question are edge triggered - but AFAICS it could be 'any' IRQ handler or flow causing the problem, right? 3) I also fully share your frustration about the level of obfuscation the various APIC drivers display today. The lack of a simple single-IPI implementation is annoying as well - when that injury was first inflicted with clustered APICs I tried to resist, but AFAICR there were some good hardware arguments why it cannot be kept and I gave up. If you agree then I can declare a feature stop for new hardware support (that isn't a stop-ship issue for users) until it's all cleaned up for real, and Thomas started some of that work already. > .. and it's entirely possible that I'm just completely > full of shit. Who is the poor bastard who has worked most > with things like ExtINT, and can educate me? I'm adding > Ingo, hpa and Jiang Liu as primary contacts.. So the buck stops at my desk, but any help is welcome! Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
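Ingo's point that the EOI write is stateless can be modelled directly: the write to the EOI register carries no vector at all, and (per the SDM) the local APIC responds by clearing the highest-priority bit currently set in the 256-bit In-Service Register. A toy model of just that mechanism:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of local APIC EOI semantics: the In-Service Register is a
 * bitmap over vectors 0..255; an EOI write names no vector, so the
 * APIC clears the highest-numbered (highest-priority) bit that is
 * set. This is what makes a stray EOI "stateless" - it acks whatever
 * happens to be in service, and is a no-op if nothing is.
 */
struct lapic_isr {
	uint64_t bits[4];	/* 256 vectors */
};

static void isr_set(struct lapic_isr *isr, unsigned int vec)
{
	isr->bits[vec / 64] |= 1ULL << (vec % 64);
}

/* Returns the vector that was EOI'd, or -1 if nothing was in service. */
static int lapic_eoi(struct lapic_isr *isr)
{
	for (int vec = 255; vec >= 0; vec--) {
		if (isr->bits[vec / 64] & (1ULL << (vec % 64))) {
			isr->bits[vec / 64] &= ~(1ULL << (vec % 64));
			return vec;
		}
	}
	return -1;	/* stray EOI: no effect on the ISR */
}
```

In this model, Ingo's suggested experiment of writing two EOIs instead of one only matters if a second ISR bit was set underneath - which is exactly the question about when the hardware sets the ISR bit relative to interrupt acceptance.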
* Re: smp_call_function_single lockups 2015-02-20 9:30 ` Ingo Molnar @ 2015-02-20 16:49 ` Linus Torvalds 2015-02-20 19:41 ` Ingo Molnar 0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-20 16:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers

[-- Attachment #1: Type: text/plain, Size: 3815 bytes --]

On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
thing.

> I think you got it right.
>
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actually should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.
It must clearly be mostly harmless even if I'm right, since we've always done the ACK for it, as far as I can tell. So you're probably right, and it doesn't matter. In particular, it would depend on exactly when the ISR bit actually gets set in the APIC. If it gets set only after the CPU has actually *accepted* the interrupt, then it should not be possible for a spurious EOI write to screw things up, because it's still happening in the context of the previously executing interrupt, so a new ISR bit hasn't been set yet, and any old ISR bits must be lower-priority than the currently executing interrupt routine (that were interrupted by a higher-priority one). That sounds like the likely scenario. It just worries me. And my test patch did actually trigger, even if it only seemed to trigger during device probing (when the probing itself probably caused and then cleared an interrupt). > I also fully share your frustration about the level of > obfuscation the various APIC drivers display today. > > The lack of a simple single-IPI implementation is annoying > as well - when that injury was first inflicted with > clustered APICs I tried to resist, but AFAICR there were > some good hardware arguments why it cannot be kept and I > gave up. > > If you agree then I can declare a feature stop for new > hardware support (that isn't a stop-ship issue for users) > until it's all cleaned up for real, and Thomas started some > of that work already. Well, the attached patch for that seems pretty trivial. And seems to work for me (my machine also defaults to x2apic clustered mode), and allows the APIC code to start doing a "send to specific cpu" thing one by one, since it falls back to the send_IPI_mask() function if no individual CPU IPI function exists. NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't that important, so I didn't bother with them. NOTE2! 
I've tested this, and it seems to work, but maybe there is something
seriously wrong. I skipped the "disable interrupts" part when doing
the "send_IPI", for example, because I think it's entirely unnecessary
for that case. But this has certainly *not* gotten any real
stress-testing.

But /proc/interrupts shows thousands of rescheduling IPI's, so this
should all have triggered.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2793 bytes --]

 arch/x86/include/asm/apic.h           |  1 +
 arch/x86/kernel/apic/x2apic_cluster.c |  9 +++++++++
 arch/x86/kernel/smp.c                 | 16 ++++++++++++++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..5d47ffcfa65d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -296,6 +296,7 @@ struct apic {
 					      unsigned int *apicid);
 
 	/* ipi */
+	void (*send_IPI)(int cpu, int vector);
 	void (*send_IPI_mask)(const struct cpumask *mask, int vector);
 	void (*send_IPI_mask_allbutself)(const struct cpumask *mask,
 					 int vector);
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index e658f21681c8..177c4fb027a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -23,6 +23,14 @@ static inline u32 x2apic_cluster(int cpu)
 	return per_cpu(x86_cpu_to_logical_apicid, cpu) >> 16;
 }
 
+static void x2apic_send_IPI(int cpu, int vector)
+{
+	u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
+
+	x2apic_wrmsr_fence();
+	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+}
+
 static void
 __x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
 {
@@ -266,6 +274,7 @@ static struct apic apic_x2apic_cluster = {
 
 	.cpu_mask_to_apicid_and		= x2apic_cpu_mask_to_apicid_and,
 
+	.send_IPI			= x2apic_send_IPI,
 	.send_IPI_mask			= x2apic_send_IPI_mask,
 	.send_IPI_mask_allbutself	= x2apic_send_IPI_mask_allbutself,
 	.send_IPI_allbutself		= x2apic_send_IPI_allbutself,
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..78c9b12d892a 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -114,6 +114,18 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Helper wrapper: not all apic definitions support sending to
+ * a single CPU, so we fall back to sending to a mask.
+ */
+static void send_IPI_cpu(int cpu, int vector)
+{
+	if (apic->send_IPI)
+		apic->send_IPI(cpu, vector);
+	else
+		apic->send_IPI_mask(cpumask_of(cpu), vector);
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...
@@ -124,12 +136,12 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
-	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
+	send_IPI_cpu(cpu, RESCHEDULE_VECTOR);
 }
 
 void native_send_call_func_single_ipi(int cpu)
 {
-	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
+	send_IPI_cpu(cpu, CALL_FUNCTION_SINGLE_VECTOR);
 }
 
 void native_send_call_func_ipi(const struct cpumask *mask)

^ permalink raw reply related	[flat|nested] 54+ messages in thread
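The core of the patch is the optional-method-with-fallback dispatch in send_IPI_cpu(): an apic driver may or may not provide a single-CPU .send_IPI method, and callers fall back to .send_IPI_mask with a one-CPU mask when it is absent. The shape of that dispatch can be exercised in isolation with a user-space analogue (stub ops and counters stand in for the real register writes; everything here is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space analogue of the patch's send_IPI_cpu() wrapper. A driver
 * that implements the new single-CPU hook gets called directly; a
 * driver that only has the mask-based method gets a one-bit mask.
 */
struct apic_ops {
	void (*send_IPI)(int cpu, int vector);		/* optional */
	void (*send_IPI_mask)(unsigned long mask, int vector);
};

static int single_sent, mask_sent;

static void stub_send_IPI(int cpu, int vector)
{
	(void)cpu; (void)vector;
	single_sent++;		/* counts direct single-CPU sends */
}

static void stub_send_IPI_mask(unsigned long mask, int vector)
{
	(void)mask; (void)vector;
	mask_sent++;		/* counts mask-based fallback sends */
}

static void send_IPI_cpu(const struct apic_ops *apic, int cpu, int vector)
{
	if (apic->send_IPI)
		apic->send_IPI(cpu, vector);
	else
		apic->send_IPI_mask(1UL << cpu, vector);
}
```

The benefit on the kernel side is that the single-CPU path skips all the cpumask generation and walking that the mask-based path drags in - the shortcut Daniel Blueman mentions for large core counts.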
* Re: smp_call_function_single lockups
  2015-02-20 16:49       ` Linus Torvalds
@ 2015-02-20 19:41        ` Ingo Molnar
  2015-02-20 20:03          ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-02-20 19:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So if my memory serves me right, I think it was for
> > local APICs, and even there mostly it was a performance
> > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > to a local APIC then the IO-APIC might be forced to
> > resend those IRQs, leading to excessive message traffic
> > on the relevant hardware bus.
>
> Hmm. I have a distinct memory of interrupts actually
> being lost, but I really can't find anything to support
> that memory, so it's probably some drug-induced confusion
> of mine. I don't find *anything* about interrupt "levels"
> any more in modern Intel documentation on the APIC, but
> maybe I missed something. But it might all have been an
> IO-APIC thing.

So I just found an older discussion of it:

  http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815

While it's not a comprehensive description, it matches what
I remember from it: with 3 vectors within a level of 16
vectors we'd get excessive "retries" sent by the IO-APIC
through the (then rather slow) APIC bus.

( It was possible for the same phenomenon to occur with
  IPIs as well, when a CPU sent an APIC message to another
  CPU, if the affected vectors were equal modulo 16 - but
  this was rare IIRC because most systems were dual CPU so
  only two IPIs could have occurred. )

> Well, the attached patch for that seems pretty trivial.
> And seems to work for me (my machine also defaults to
> x2apic clustered mode), and allows the APIC code to start
> doing a "send to specific cpu" thing one by one, since it
> falls back to the send_IPI_mask() function if no
> individual CPU IPI function exists.
>
> NOTE! There's a few cases in
> arch/x86/kernel/apic/vector.c that also do that
> "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> aren't that important, so I didn't bother with them.
>
> NOTE2! I've tested this, and it seems to work, but maybe
> there is something seriously wrong. I skipped the
> "disable interrupts" part when doing the "send_IPI", for
> example, because I think it's entirely unnecessary for
> that case. But this has certainly *not* gotten any real
> stress-testing.

I'm not so sure about that aspect: I think disabling IRQs
might be necessary with some APICs (if lower levels don't
disable IRQs), to make sure the 'local APIC busy' bit isn't
set:

we typically do a wait_icr_idle() call before sending an
IPI - and if IRQs are not off then the idleness of the APIC
might be gone. (Because a hardirq that arrives after a
wait_icr_idle() but before the actual IPI sending sent out
an IPI and the queue is full.)

So the IPI sending should be atomic in that sense.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
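As an aside on the "level of 16 vectors" mentioned earlier in the thread: the local APIC groups vectors into priority classes of 16, where the class is simply the upper four bits of the 8-bit vector number (per the Intel SDM's interrupt-priority description). A trivial helper makes that explicit; this is just illustrative arithmetic, not a kernel function:

```c
#include <assert.h>

/*
 * The local APIC's interrupt priority class ("level") is the upper
 * nibble of the 8-bit vector: 16 vectors share each class, and only a
 * limited number of interrupts per class can be pending at once.
 */
static inline unsigned int apic_priority_class(unsigned int vector)
{
	return (vector >> 4) & 0xf;
}
```

So three in-flight vectors falling inside the same class of 16 is exactly the "3 vectors within a level" condition that could trigger the old IO-APIC retry traffic.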
* Re: smp_call_function_single lockups
  2015-02-20 19:41        ` Ingo Molnar
@ 2015-02-20 20:03          ` Linus Torvalds
  2015-02-20 20:11            ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-20 20:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:

Right. But afaik not for the x2apic case, which this is. The x2apic
doesn't even have a busy bit, and sending the ipi is a single write.

I agree that when doing other apic implementations, we may need to
guarantee atomicity for things like "wait for apic idle, then send
the ipi".

	Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
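The distinction Linus and Ingo are drawing can be sketched in user space: a legacy xAPIC send is a poll-then-write sequence against a delivery-status ("busy") bit, so an interrupt landing between the poll and the ICR write can find the register busy again, while an x2APIC send is one MSR write with no busy bit to poll. The "hardware" below is simulated with plain variables; only the control flow is meaningful:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated ICR state; the real thing is an MMIO register or an MSR. */
static bool icr_busy;
static int ipis_sent;

/* Legacy xAPIC: poll delivery status, then write the ICR - two steps. */
static void xapic_send_ipi(void)
{
	while (icr_busy)	/* wait_icr_idle() */
		;
	/*
	 * A hardirq arriving *here* could send its own IPI and set the
	 * busy bit again - which is why this sequence wants IRQs off.
	 */
	icr_busy = true;	/* ICR write starts delivery */
	ipis_sent++;
	icr_busy = false;	/* delivery completes */
}

/* x2APIC: a single WRMSR, no busy bit, nothing to poll. */
static void x2apic_send_ipi(void)
{
	ipis_sent++;
}
```

The second function is trivially atomic with respect to interrupts, which is why Linus could skip the IRQ-disable in the x2apic path of his patch.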
* Re: smp_call_function_single lockups
  2015-02-20 20:03          ` Linus Torvalds
@ 2015-02-20 20:11            ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-02-20 20:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > I'm not so sure about that aspect: I think disabling
> > IRQs might be necessary with some APICs (if lower
> > levels don't disable IRQs), to make sure the 'local
> > APIC busy' bit isn't set:
>
> Right. But afaik not for the x2apic case, which this is.
> The x2apic doesn't even have a busy bit, and sending the
> ipi is a single write,

Ah, ok! Then the patch looks good to me.

( Originally we didn't wait for the ICR bit either, but then
  it was added due to later errata and was eventually made
  an architectural requirement. )

> I agree that when doing other apic implementations, we
> may need to guarantee atomicity for things like "wait for
> apic idle, then send the ipi".

Yeah.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
  2015-02-12 16:38   ` Rafael David Tinoco
  2015-02-18 22:25   ` Peter Zijlstra
@ 2015-03-20 10:15   ` Peter Zijlstra
  2015-03-20 16:26     ` Linus Torvalds
  2015-04-01 14:22   ` Frederic Weisbecker
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-03-20 10:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
	Frederic Weisbecker

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

Linus, any plans for this patch? I think it does solve a fair few issues
the current code has.

Do you want me to route it through tip (and write a changelog while I'm
at it) or will you be taking it yourself?

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-20 10:15   ` Peter Zijlstra
@ 2015-03-20 16:26     ` Linus Torvalds
  2015-03-20 17:14       ` Mike Galbraith
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-03-20 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
	Frederic Weisbecker

On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Linus, any plans for this patch? I think it does solve a fair few issues
> the current code has.

So I didn't really have any plans. I think it's a good patch, and it
might even fix a bug and improve code generation, but it didn't fix
the lockup problem, so..

> do you want me to route it through tip (and write a changelog while I'm
> at it) or will you be taking it yourself?

If you think it's worth committing, then go ahead. Add my signed-off,
although I have to note that I didn't do a lot of testing of that patch
(I think it was compile-tested when I sent it out, and then I ended up
running it for a while, and obviously it got *some* testing by the
person who saw the lockup - which just showed that it didn't fix the
problem ;(

	Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-20 16:26     ` Linus Torvalds
@ 2015-03-20 17:14       ` Mike Galbraith
  0 siblings, 0 replies; 54+ messages in thread
From: Mike Galbraith @ 2015-03-20 17:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Rafael David Tinoco, LKML, Thomas Gleixner,
	Jens Axboe, Frederic Weisbecker

On Fri, 2015-03-20 at 09:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Linus, any plans for this patch? I think it does solve a fair few issues
> > the current code has.
>
> So I didn't really have any plans. I think it's a good patch, and it
> might even fix a bug and improve code generation, but it didn't fix
> the lockup problem, so..
>
> > do you want me to route it through tip (and write a changelog while I'm
> > at it) or will you be taking it yourself?
>
> If you think it's worth committing, then go ahead. Add my signed-off,
> although I have to note that I didn't do a lot of testing of that
> patch (I think it was compile-tested when I sent it out, and then I
> ended up running it for a while, and obviously it got *some* testing
> by the person who saw the lockup that just showed that it didn't fix
> the problem ;(

I carried it for a while on boxen big and small, so it got some burn-in.

	-Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
    ` (2 preceding siblings ...)
  2015-03-20 10:15   ` Peter Zijlstra
@ 2015-04-01 14:22   ` Frederic Weisbecker
  3 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2015-04-01 14:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> >     WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
> 		Linus

>  kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 47 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..2aaac2c47683 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,7 @@
>  
>  enum {
>  	CSD_FLAG_LOCK		= 0x01,
> -	CSD_FLAG_WAIT		= 0x02,
> +	CSD_FLAG_SYNCHRONOUS	= 0x02,
>  };
>  
>  struct call_function_data {
> @@ -107,7 +107,7 @@ void __init call_function_init(void)
>   */
>  static void csd_lock_wait(struct call_single_data *csd)
>  {
> -	while (csd->flags & CSD_FLAG_LOCK)
> +	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
>  		cpu_relax();
>  }
>  
> @@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
>  	 * to ->flags with any subsequent assignments to other
>  	 * fields of the specified call_single_data structure:
>  	 */
> -	smp_mb();
> +	smp_wmb();
>  }
>  
>  static void csd_unlock(struct call_single_data *csd)
>  {
> -	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> +	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>  
>  	/*
>  	 * ensure we're all done before releasing data:
>  	 */
> -	smp_mb();
> -
> -	csd->flags &= ~CSD_FLAG_LOCK;
> +	smp_store_release(&csd->flags, 0);
>  }
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> @@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
>   * ->func, ->info, and ->flags set.
>   */
>  static int generic_exec_single(int cpu, struct call_single_data *csd,
> -			       smp_call_func_t func, void *info, int wait)
> +			       smp_call_func_t func, void *info)
>  {
> -	struct call_single_data csd_stack = { .flags = 0 };
> -	unsigned long flags;
> -
> -
>  	if (cpu == smp_processor_id()) {
> +		unsigned long flags;
> +
> +		/*
> +		 * We can unlock early even for the synchronous on-stack case,
> +		 * since we're doing this from the same CPU..
> +		 */
> +		csd_unlock(csd);
>  		local_irq_save(flags);
>  		func(info);
>  		local_irq_restore(flags);
> @@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
>  		return -ENXIO;
>  
> -
> -	if (!csd) {
> -		csd = &csd_stack;
> -		if (!wait)
> -			csd = this_cpu_ptr(&csd_data);
> -	}
> -
> -	csd_lock(csd);
> -
>  	csd->func = func;
>  	csd->info = info;
>  
> -	if (wait)
> -		csd->flags |= CSD_FLAG_WAIT;
> -
>  	/*
>  	 * The list addition should be visible before sending the IPI
>  	 * handler locks the list to pull the entry off it because of
> @@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
>  		arch_send_call_function_single_ipi(cpu);
>  
> -	if (wait)
> -		csd_lock_wait(csd);
> -
>  	return 0;
>  }
>  
> @@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
>  	}
>  
>  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> -		csd->func(csd->info);
> -		csd_unlock(csd);
> +		smp_call_func_t func = csd->func;
> +		void *info = csd->info;
> +
> +		/* Do we wait until *after* callback? */
> +		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
> +			func(info);
> +			csd_unlock(csd);
> +		} else {
> +			csd_unlock(csd);
> +			func(info);

Just to clarify things, the kind of lockup this is expected to fix is the
case where the IPI is resent from the IPI itself, right?

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread
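The sync-versus-async unlock ordering the patch introduces can be modelled in a self-contained user-space sketch: C11 atomics stand in for smp_load_acquire()/smp_store_release(), and struct csd and flush_one() below are simplified stand-ins for the kernel's call_single_data and flush loop, not the real code.

```c
#include <assert.h>
#include <stdatomic.h>

enum { CSD_FLAG_LOCK = 0x01, CSD_FLAG_SYNCHRONOUS = 0x02 };

/* Simplified stand-in for the kernel's call_single_data. */
struct csd {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

/* smp_store_release() maps onto a C11 release store in this model. */
static void csd_unlock(struct csd *csd)
{
	atomic_store_explicit(&csd->flags, 0, memory_order_release);
}

/*
 * Core of the patched flush loop: func/info are read out first, because
 * in the async case the csd may be reused (or freed, or used to resend
 * an IPI) the instant it is unlocked.
 */
static void flush_one(struct csd *csd)
{
	void (*func)(void *) = csd->func;
	void *info = csd->info;

	if (atomic_load_explicit(&csd->flags, memory_order_relaxed)
	    & CSD_FLAG_SYNCHRONOUS) {
		func(info);		/* sync: caller spins until after  */
		csd_unlock(csd);	/* the callback has finished       */
	} else {
		csd_unlock(csd);	/* async: owner may reuse the csd  */
		func(info);		/* immediately, even inside func() */
	}
}

/* Probe callback: records whether the csd was still locked when run. */
static unsigned int flags_seen_in_callback;

static void probe(void *info)
{
	struct csd *csd = info;

	flags_seen_in_callback =
		atomic_load_explicit(&csd->flags, memory_order_relaxed);
}
```

Unlocking before the callback in the async case is what lets a callback legitimately re-queue the same csd - the scenario Frederic asks about - without deadlocking against its own CSD_FLAG_LOCK.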
end of thread, other threads:[~2015-04-06 17:24 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
2015-02-22 10:37 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2015-02-11 13:19 Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59 ` Linus Torvalds
2015-02-11 20:42 ` Linus Torvalds
2015-02-12 16:38 ` Rafael David Tinoco
2015-02-18 22:25 ` Peter Zijlstra
2015-02-19 15:42 ` Rafael David Tinoco
2015-02-19 16:14 ` Linus Torvalds
2015-02-23 14:01 ` Rafael David Tinoco
2015-02-23 19:32 ` Linus Torvalds
2015-02-23 20:50 ` Peter Zijlstra
2015-02-23 21:02 ` Rafael David Tinoco
2015-02-19 16:16 ` Peter Zijlstra
2015-02-19 16:26 ` Linus Torvalds
2015-02-19 16:32 ` Rafael David Tinoco
2015-02-19 16:59 ` Linus Torvalds
2015-02-19 17:30 ` Rafael David Tinoco
2015-02-19 17:39 ` Linus Torvalds
2015-02-19 20:29 ` Linus Torvalds
2015-02-19 21:59 ` Linus Torvalds
2015-02-19 22:45 ` Linus Torvalds
2015-03-31  3:15 ` Chris J Arges
2015-03-31  4:28 ` Linus Torvalds
2015-03-31  4:46 ` Linus Torvalds
2015-03-31 15:08 ` Linus Torvalds
2015-03-31 22:23 ` Chris J Arges
2015-03-31 23:07 ` Linus Torvalds
2015-04-01 14:32 ` Chris J Arges
2015-04-01 15:36 ` Linus Torvalds
2015-04-02  9:55 ` Ingo Molnar
2015-04-02 17:35 ` Linus Torvalds
2015-04-01 12:43 ` Ingo Molnar
2015-04-01 16:10 ` Chris J Arges
2015-04-01 16:14 ` Linus Torvalds
2015-04-01 21:59 ` Chris J Arges
2015-04-02 17:31 ` Linus Torvalds
2015-04-02 18:26 ` Ingo Molnar
2015-04-02 18:51 ` Chris J Arges
2015-04-02 19:07 ` Ingo Molnar
2015-04-02 20:57 ` Linus Torvalds
2015-04-02 21:13 ` Chris J Arges
2015-04-03  5:45 ` Ingo Molnar
2015-04-06 17:23 ` Chris J Arges
2015-02-20  9:30 ` Ingo Molnar
2015-02-20 16:49 ` Linus Torvalds
2015-02-20 19:41 ` Ingo Molnar
2015-02-20 20:03 ` Linus Torvalds
2015-02-20 20:11 ` Ingo Molnar
2015-03-20 10:15 ` Peter Zijlstra
2015-03-20 16:26 ` Linus Torvalds
2015-03-20 17:14 ` Mike Galbraith
2015-04-01 14:22 ` Frederic Weisbecker