* Re: smp_call_function_single lockups
@ 2015-02-22  8:59 Daniel J Blueman
  2015-02-22 10:37 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel J Blueman @ 2015-02-22  8:59 UTC
To: Linus Torvalds, Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
    Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
    the arch/x86 maintainers

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > So if my memory serves me right, I think it was for
> > > local APICs, and even there mostly it was a performance
> > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > > to a local APIC then the IO-APIC might be forced to
> > > resend those IRQs, leading to excessive message traffic
> > > on the relevant hardware bus.
> >
> > Hmm. I have a distinct memory of interrupts actually
> > being lost, but I really can't find anything to support
> > that memory, so it's probably some drug-induced confusion
> > of mine. I don't find *anything* about interrupt "levels"
> > any more in modern Intel documentation on the APIC, but
> > maybe I missed something. But it might all have been an
> > IO-APIC thing.
>
> So I just found an older discussion of it:
>
>   http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815
>
> while it's not a comprehensive description, it matches what
> I remember from it: with 3 vectors within a level of 16
> vectors we'd get excessive "retries" sent by the IO-APIC
> through the (then rather slow) APIC bus.
>
> ( It was possible for the same phenomenon to occur with
>   IPIs as well, when a CPU sent an APIC message to another
>   CPU, if the affected vectors were equal modulo 16 - but
>   this was rare IIRC because most systems were dual CPU so
>   only two IPIs could have occurred.
> )
>
> > Well, the attached patch for that seems pretty trivial.
> > And seems to work for me (my machine also defaults to
> > x2apic clustered mode), and allows the APIC code to start
> > doing a "send to specific cpu" thing one by one, since it
> > falls back to the send_IPI_mask() function if no
> > individual CPU IPI function exists.
> >
> > NOTE! There's a few cases in
> > arch/x86/kernel/apic/vector.c that also do that
> > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> > aren't that important, so I didn't bother with them.
> >
> > NOTE2! I've tested this, and it seems to work, but maybe
> > there is something seriously wrong. I skipped the
> > "disable interrupts" part when doing the "send_IPI", for
> > example, because I think it's entirely unnecessary for
> > that case. But this has certainly *not* gotten any real
> > stress-testing.
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:
>
> we typically do a wait_icr_idle() call before sending an
> IPI - and if IRQs are not off then the idleness of the APIC
> might be gone. (Because a hardirq that arrives after a
> wait_icr_idle() but before the actual IPI sending sent out
> an IPI and the queue is full.)

The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs are
queued, so the wait_icr_idle() polling is only necessary on PPro
and older, and maybe then to avoid delivery retry. This
unnecessarily ties up the IPI caller, so we bypass the polling in
the Numachip APIC driver IPI-to-self path.

On Linus's earlier point, with the large core counts on Numascale
systems, I previously implemented a shortcut to allow single IPIs
to bypass all the cpumask generation and walking; it's way down on
my list, but I'll see if I can generalise and present a patch
series at some point if interested?
Dan

--
[1] Intel SDM 3, p10-30
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

If more than one interrupt is generated with the same vector number,
the local APIC can set the bit for the vector both in the IRR and the
ISR. This means that for the Pentium 4 and Intel Xeon processors, the
IRR and ISR can queue two interrupts for each interrupt vector: one in
the IRR and one in the ISR. Any additional interrupts issued for the
same interrupt vector are collapsed into the single bit in the IRR.
For the P6 family and Pentium processors, the IRR and ISR registers
can queue no more than two interrupts per interrupt vector and will
reject other interrupts that are received within the same vector.

--
[2] AMD Fam15h BKDG p470
http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this
bit is set to indicate that the interrupt has not yet been accepted by
the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC
mode. Software may repeatedly write ICRL without polling the DS bit;
all requested IPIs will be delivered.
* Re: smp_call_function_single lockups
  2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
@ 2015-02-22 10:37 ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-02-22 10:37 UTC
To: Daniel J Blueman
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    Christopher Arges, the arch/x86 maintainers

* Daniel J Blueman <daniel@numascale.com> wrote:

> The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs
> are queued, so the wait_icr_idle() polling is only
> necessary on PPro and older, and maybe then to avoid
> delivery retry. This unnecessarily ties up the IPI
> caller, so we bypass the polling in the Numachip APIC
> driver IPI-to-self path.

It would be nice to propagate this back to the generic x86 code.

> On Linus's earlier point, with the large core counts on
> Numascale systems, I previously implemented a shortcut to
> allow single IPIs to bypass all the cpumask generation
> and walking; it's way down on my list, but I'll see if I
> can generalise and present a patch series at some point
> if interested?

I am definitely interested!

Thanks,

	Ingo
* smp_call_function_single lockups
@ 2015-02-11 13:19 Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-11 13:19 UTC (permalink / raw)
To: LKML; +Cc: torvalds, Thomas Gleixner, Jens Axboe
Linus, Thomas, Jens..
During the 3.18 - 3.19 "frequent lockups" discussion, at some point
you observed possible csd_lock() and csd_unlock() synchronization
problems. I think we have managed to reproduce that issue on a
consistent basis with 3.13 (Ubuntu) and 3.19 (latest vanilla).
- When running "open-stack tempest" in a nested-KVM environment we are
able to cause a lockup in a matter of hours (usually from 2 to 20).
Trace from the nested hypervisor (Ubuntu 3.13):
crash> bt
PID: 29130 TASK: ffff8804288ac800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
#7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
#8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
#9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbe62 RSP: ffff8804284a3c70 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8804284a3c40 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff8804284a3ca0 R8: ffffffff8180ad48 R9: 0000000000000001
R10: ffffffff81185cac R11: ffffea00109b4a00 R12: ffff88042829f400
R13: 0000000000000000 R14: ffffea001017d640 R15:
0000000000000005 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
RIP: 00007fa3ed16c587 RSP: 00007fa3536f5c10 RFLAGS: 00000246
RAX: 000000000000000b RBX: ffffffff8173196d RCX: 0000001d00000007
RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007fa315ff4000
RBP: 00007fa3167f49c0 R8: 0000000000000000 R9: 00007fa3f5396738
R10: 00007fa3536f5a60 R11: 0000000000000202 R12: 00007fa3ed6562a0
R13: 00007fa350ef19c0 R14: ffffffff81181482 R15: ffff8804284a3f78
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
- After applying the patch provided by Thomas we were able to cause
the lockup only after 6 days (again locked inside
smp_call_function_single). Test performance (even for a nested KVM)
was reduced substantially with 3.19 + this patch. Trace from the
nested hypervisor (3.19 + patch):
crash> bt
PID: 10467 TASK: ffff880817b3b1c0 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
#1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
#2 [ffff88083fd03de0] panic at ffffffff8176f713
#3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
#4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
#5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
#6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
#7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
#8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
#9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
[exception RIP: generic_exec_single+218]
RIP: ffffffff810ee0ca RSP: ffff880817973b18 RFLAGS: 00000202
RAX: 0000000000000002 RBX: 0000000000000292 RCX: 0000000000000001
RDX: ffffffff8180e6e0 RSI: 0000000000000000 RDI: 0000000000000292
RBP: ffff880817973b58 R8: ffffffff8180e6c8 R9: 0000000000000001
R10: 000000000000b6e0 R11: 0000000000000001 R12: ffffffff811f6626
R13: ffff880817973ab8 R14: ffffffff8109cfd2 R15: ffff880817973a78
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
RIP: 00007f42f987fec7 RSP: 00007f42ef1bebd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8177cc2d RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
RBP: 00007f430047b040 R8: 0000000000000000 R9: 00000000000000ff
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f42ff920240
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
Not sure if you are still pursuing this; anyway, let me know if you
think of any other change, and I'll keep the environment available.
-Tinoco
On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> > llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > - csd->func(csd->info);
> > + smp_call_func_t func = csd->func;
> > + void *info = csd->info;
> > csd_unlock(csd);
> > +
> > + func(info);
>
> No, that won't work for synchronous calls:
>
> CPU 0 CPU 1
>
> csd_lock(csd);
> queue_csd();
> ipi();
> func = csd->func;
> info = csd->info;
> csd_unlock(csd);
> csd_lock_wait();
> func(info);
>
> The csd_lock_wait() side will succeed and therefore assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.
So a combo of both (Jens and yours) might do the trick. Patch below.
I think what Jens was trying to solve is:
CPU 0 CPU 1
csd_lock(csd);
queue_csd();
ipi();
csd->func(csd->info);
wait_for_completion(csd);
complete(csd);
reuse_csd(csd);
csd_unlock(csd);
Thanks,
tglx
Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_
static void csd_unlock(struct call_single_data *csd)
{
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
/*
* ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
}
llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+
+ /*
+ * For synchronous calls we are not allowed to unlock
+ * before the callback returned. For the async case
+ * its the responsibility of the caller to keep
+ * csd->info consistent while the callback runs.
+ */
+ if (csd->flags & CSD_FLAG_WAIT) {
+ csd->func(csd->info);
+ csd_unlock(csd);
+ } else {
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ csd_unlock(csd);
+ func(info);
+ }
}
/*
* Re: smp_call_function_single lockups
  2015-02-11 13:19 Rafael David Tinoco
@ 2015-02-11 18:18 ` Linus Torvalds
  2015-02-11 19:59   ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 18:18 UTC
To: Rafael David Tinoco; +Cc: LKML, Thomas Gleixner, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 1566 bytes --]

On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote:
>
> - After applying patch provided by Thomas we were able to cause the
> lockup only after 6 days (also locked inside
> smp_call_function_single). Test performance (even for a nested kvm)
> was reduced substantially with 3.19 + this patch.

I think that just means that the patch from Thomas doesn't change
anything - the reason it takes longer to lock up is just that
performance reduction, so whatever race it is that causes the problem
was just harder to hit, but not fundamentally affected.

I think a more interesting thing to get is the traces from the other
CPUs when this happens. In a virtualized environment, that might be
easier to get than on real hardware, and if you are able to reproduce
this at will - especially with something recent like 3.19 - and could
get that, that would be really good.

I'll think about this all, but we couldn't figure anything out last
time we looked at it, so without more clues, don't hold your breath.

That said, it *would* be good if we could get rid of the synchronous
behavior entirely, and make it a rule that if somebody wants to wait
for it, they'll have to do their own waiting. Because I still think
that that CSD_FLAG_WAIT is pure and utter garbage. And I think that
Jens said that it is probably bogus to begin with. I also don't even
see where the CSD_FLAG_WAIT bit would ever be cleared, so it all
looks completely buggy anyway.

Does this (COMPLETELY UNTESTED!) attached patch change anything?
                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1453 bytes --]

 kernel/smp.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..13a8e75e1379 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,6 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
 };
 
 struct call_function_data {
@@ -114,26 +113,24 @@ static void csd_lock_wait(struct call_single_data *csd)
 
 static void csd_lock(struct call_single_data *csd)
 {
 	csd_lock_wait(csd);
-	csd->flags |= CSD_FLAG_LOCK;
+	csd->flags = CSD_FLAG_LOCK;
 
 	/*
 	 * prevent CPU from reordering the above assignment
 	 * to ->flags with any subsequent assignments to other
 	 * fields of the specified call_single_data structure:
 	 */
-	smp_mb();
+	smp_wmb();
 }
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
 	 */
-	smp_mb();
-
-	csd->flags &= ~CSD_FLAG_LOCK;
+	smp_store_release(&csd->flags, 0);
 }
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -173,9 +170,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	csd->func = func;
 	csd->info = info;
 
-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
 	/*
 	 * The list addition should be visible before sending the IPI
 	 * handler locks the list to pull the entry off it because of
* Re: smp_call_function_single lockups
  2015-02-11 18:18 ` Linus Torvalds
@ 2015-02-11 19:59   ` Linus Torvalds
  2015-02-11 20:42     ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 19:59 UTC
To: Rafael David Tinoco; +Cc: LKML, Thomas Gleixner, Jens Axboe

On Wed, Feb 11, 2015 at 10:18 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll think about this all, but we couldn't figure anything out last
> time we looked at it, so without more clues, don't hold your breath.

So having looked at it once more, one thing struck me:

Look at smp_call_function_single_async(). The comment says

 * Like smp_call_function_single(), but the call is asynchonous and
 * can thus be done from contexts with disabled interrupts.

but that is *only* true if we don't have to wait for the csd lock.

The comments even clarify that:

 * The caller passes his own pre-allocated data structure
 * (ie: embedded in an object) and is responsible for synchronizing it
 * such that the IPIs performed on the @csd are strictly serialized.

but it's not at all clear that the caller *can* do that. Since the
"csd_unlock()" is done *after* the call to the callback function, any
serialization done by the caller is fundamentally not trustworthy,
since it cannot serialize with the csd lock - if it releases things
in the callback, the csd lock will still be set after releasing
things.

So the caller has a really hard time guaranteeing that CSD_LOCK isn't
set. And if the call is done in interrupt context, for all we know it
is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
will never be cleared at all, and csd_lock() will wait forever.

So I actually think that for the async case, we really *should* unlock
before doing the callback (which is what Thomas' old patch did).

And we might well be better off doing something like

    WARN_ON_ONCE(csd->flags & CSD_LOCK);

in smp_call_function_single_async(), because that really is a hard
requirement.

And it strikes me that hrtick_csd is one of these cases that do this
with interrupts disabled, and use the callback for serialization. So I
really wonder if this is part of the problem..

Thomas? Am I missing something?

                  Linus
* Re: smp_call_function_single lockups
  2015-02-11 19:59 ` Linus Torvalds
@ 2015-02-11 20:42   ` Linus Torvalds
  2015-02-12 16:38     ` Rafael David Tinoco
                        ` (3 more replies)
  0 siblings, 4 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-02-11 20:42 UTC
To: Rafael David Tinoco
Cc: LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

[ Added Frederic to the cc, since he's touched this file/area most ]

On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> set. And if the call is done in interrupt context, for all we know it
> is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> will never be cleared at all, and csd_lock() will wait forever.
>
> So I actually think that for the async case, we really *should* unlock
> before doing the callback (which is what Thomas' old patch did).
>
> And we might well be better off doing something like
>
>     WARN_ON_ONCE(csd->flags & CSD_LOCK);
>
> in smp_call_function_single_async(), because that really is a hard
> requirement.
>
> And it strikes me that hrtick_csd is one of these cases that do this
> with interrupts disabled, and use the callback for serialization. So I
> really wonder if this is part of the problem..
>
> Thomas? Am I missing something?

Ok, this is a more involved patch than I'd like, but making the
*caller* do all the CSD maintenance actually cleans things up.

And this is still completely untested, and may be entirely buggy. What
do you guys think?
                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4698 bytes --]

 kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..2aaac2c47683 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,7 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
+	CSD_FLAG_SYNCHRONOUS	= 0x02,
 };
 
 struct call_function_data {
@@ -107,7 +107,7 @@ void __init call_function_init(void)
  */
 static void csd_lock_wait(struct call_single_data *csd)
 {
-	while (csd->flags & CSD_FLAG_LOCK)
+	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
 		cpu_relax();
 }
 
@@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
 	 * to ->flags with any subsequent assignments to other
 	 * fields of the specified call_single_data structure:
 	 */
-	smp_mb();
+	smp_wmb();
 }
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
 	 */
-	smp_mb();
-
-	csd->flags &= ~CSD_FLAG_LOCK;
+	smp_store_release(&csd->flags, 0);
 }
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
  * ->func, ->info, and ->flags set.
  */
 static int generic_exec_single(int cpu, struct call_single_data *csd,
-			       smp_call_func_t func, void *info, int wait)
+			       smp_call_func_t func, void *info)
 {
-	struct call_single_data csd_stack = { .flags = 0 };
-	unsigned long flags;
-
 	if (cpu == smp_processor_id()) {
+		unsigned long flags;
+
+		/*
+		 * We can unlock early even for the synchronous on-stack case,
+		 * since we're doing this from the same CPU..
+		 */
+		csd_unlock(csd);
 		local_irq_save(flags);
 		func(info);
 		local_irq_restore(flags);
@@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
 		return -ENXIO;
 
-	if (!csd) {
-		csd = &csd_stack;
-		if (!wait)
-			csd = this_cpu_ptr(&csd_data);
-	}
-
-	csd_lock(csd);
-
 	csd->func = func;
 	csd->info = info;
 
-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
 	/*
 	 * The list addition should be visible before sending the IPI
 	 * handler locks the list to pull the entry off it because of
@@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
-	if (wait)
-		csd_lock_wait(csd);
-
 	return 0;
 }
 
@@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		csd->func(csd->info);
-		csd_unlock(csd);
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
+
+		/* Do we wait until *after* callback? */
+		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+			func(info);
+			csd_unlock(csd);
+		} else {
+			csd_unlock(csd);
+			func(info);
+		}
 	}
 
 	/*
@@ -274,6 +269,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 			     int wait)
 {
+	struct call_single_data *csd;
+	struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
 	int this_cpu;
 	int err;
 
@@ -292,7 +289,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
 		     && !oops_in_progress);
 
-	err = generic_exec_single(cpu, NULL, func, info, wait);
+	csd = &csd_stack;
+	if (!wait) {
+		csd = this_cpu_ptr(&csd_data);
+		csd_lock(csd);
+	}
+
+	err = generic_exec_single(cpu, csd, func, info);
+
+	if (wait)
+		csd_lock_wait(csd);
 
 	put_cpu();
 
@@ -321,7 +327,15 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
 	int err = 0;
 
 	preempt_disable();
-	err = generic_exec_single(cpu, csd, csd->func, csd->info, 0);
+
+	/* We could deadlock if we have to wait here with interrupts disabled! */
+	if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK))
+		csd_lock_wait(csd);
+
+	csd->flags = CSD_FLAG_LOCK;
+	smp_wmb();
+
+	err = generic_exec_single(cpu, csd, csd->func, csd->info);
 	preempt_enable();
 
 	return err;
@@ -433,6 +447,8 @@ void smp_call_function_many(const struct cpumask *mask,
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
 
 		csd_lock(csd);
+		if (wait)
+			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
 		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
@ 2015-02-12 16:38   ` Rafael David Tinoco
  2015-02-18 22:25     ` Peter Zijlstra
                        ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-12 16:38 UTC
To: Linus Torvalds
Cc: LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker,
    chris.j.arges, gema.gomez-solano

Meanwhile we'll take the opportunity to run the same tests with the
"smp_load_acquire/smp_store_release + outside sync/async" approach
from your latest patch on top of 3.19. If anything comes up I'll
provide full backtraces (2 vcpus).

Here I can only reproduce this inside nested KVM on top of ProLiant
DL360 Gen8 machines with:

- no opt-out from x2apic (the Gen8 firmware asks for opting out, but
  HP says x2apic should be used for >= Gen8)
- no intel_idle, since the ProLiant firmware is causing NMIs during
  MWAIT instructions

As observed before, reducing performance caused the problem to be
triggered only after some days, so if nothing goes wrong with
performance this time, I expect to have results within 10 to 30 hours.

Thank you

Tinoco

On Wed, Feb 11, 2015 at 6:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> >     WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard
> > requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
>                   Linus
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
  2015-02-12 16:38   ` Rafael David Tinoco
@ 2015-02-18 22:25   ` Peter Zijlstra
  2015-02-19 15:42     ` Rafael David Tinoco
  2015-03-20 10:15     ` Peter Zijlstra
  2015-04-01 14:22     ` Frederic Weisbecker
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-02-18 22:25 UTC
To: Linus Torvalds
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
    Frederic Weisbecker

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

I think it makes perfect sense.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: smp_call_function_single lockups
  2015-02-18 22:25 ` Peter Zijlstra
@ 2015-02-19 15:42   ` Rafael David Tinoco
  2015-02-19 16:14     ` Linus Torvalds
                        ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-19 15:42 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe,
    Frederic Weisbecker, Gema Gomez, chris.j.arges

Linus, Peter, Thomas,

Just some quick feedback: we were able to reproduce the lockup with
this proposed patch (3.19 + patch). Unfortunately we had problems with
the core file, so I have only the stack trace for now, but I think we
are able to reproduce it again and provide more details (sorry for the
delay... after a reboot it took some days for us to reproduce this
again). It looks like RIP is still smp_call_function_single.

Same environment as before: nested KVM (2 vcpus) on top of a ProLiant
DL380 Gen8 with acpi_idle and no x2apic opt-out.

[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G            E  3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[47708.068013] task: ffff88081b9beca0 ti: ffff88081a7a0000 task.ti: ffff88081a7a0000
[47708.068013] RIP: 0010:[<ffffffff810f537a>]  [<ffffffff810f537a>] smp_call_function_single+0xca/0x120
[47708.068013] RSP: 0018:ffff88081a7a3b38  EFLAGS: 00000202
[47708.068013] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
[47708.068013] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000296
[47708.068013] RBP: ffff88081a7a3b78 R08: ffffffff81815168 R09: ffff880818192000
[47708.068013] R10: 000000000000bdf6 R11: 000000000001bf90 R12: 00080000810b66f8
[47708.068013] R13: 00000000000000fb R14: 0000000000000296 R15: 0000000000000000
[47708.068013] FS:  00007fa143fff700(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[47708.068013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47708.068013] CR2: 00007f5d76f5d050 CR3: 00000008190cc000 CR4: 00000000000426f0
[47708.068013] Stack:
[47708.068013]  ffff88083fd151b8 0000000000000001 0000000000000000 ffffffffc0589320
[47708.068013]  ffff88081a547a80 0000000000000003 ffff88081a543f80 0000000000000000
[47708.068013]  ffff88081a7a3b88 ffffffffc0586097 ffff88081a7a3bc8 ffffffffc058aefe
[47708.068013] Call Trace:
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013]  [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013]  [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013]  [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013]  [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013]  [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013]  [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013]  [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013]  [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013]  [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b
[47708.068013] Code: 30 5b 41 5c 5d c3 0f 1f 00 48 8d 75 d0 48 89 d1 89 df 4c 89 e2 e8 57 fe ff ff 0f b7 55 e8 83 e2 01 74 da 66 0f 1f 44 00 00 f3 90 <0f> b7 55 e8 83 e2 01 75 f5 eb c7 0f 1f 00 8b 05 ca e6 dd 00 85
[47708.068013] Kernel panic - not syncing: softlockup: hung tasks
[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G            EL 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[47708.068013]  ffff88081b9beca0 ffff88083fc03de8 ffffffff817a6bf6 0000000000000000
[47708.068013]  ffffffff81ab30d4 ffff88083fc03e68 ffffffff817a1aec 0000000000000e92
[47708.068013]  0000000000000008 ffff88083fc03e78 ffff88083fc03e18 ffff88083fc03e68
[47708.068013] Call Trace:
[47708.068013]  <IRQ>  [<ffffffff817a6bf6>] dump_stack+0x45/0x57
[47708.068013]  [<ffffffff817a1aec>] panic+0xc1/0x1f5
[47708.068013]  [<ffffffff8112ba0b>] watchdog_timer_fn+0x1db/0x1f0
[47708.068013]  [<ffffffff810e0e37>] __run_hrtimer+0x77/0x1d0
[47708.068013]  [<ffffffff8112b830>] ? watchdog+0x30/0x30
[47708.068013]  [<ffffffff810e1203>] hrtimer_interrupt+0xf3/0x220
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffff8104b0a9>] local_apic_timer_interrupt+0x39/0x60
[47708.068013]  [<ffffffff817b1fb5>] smp_apic_timer_interrupt+0x45/0x60
[47708.068013]  [<ffffffff817b002d>] apic_timer_interrupt+0x6d/0x80
[47708.068013]  <EOI>  [<ffffffff810f537a>] ? smp_call_function_single+0xca/0x120
[47708.068013]  [<ffffffff810f5369>] ? smp_call_function_single+0xb9/0x120
[47708.068013]  [<ffffffffc0589320>] ? copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013]  [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013]  [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013]  [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013]  [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013]  [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013]  [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013]  [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013]  [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013]  [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013]  [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013]  [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013]  [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013]  [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013]  [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013]  [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013]  [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013]  [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013]  [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b

Tks

Rafael Tinoco

On Wed, Feb 18, 2015 at 8:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
>> Ok, this is a more involved patch than I'd like, but making the
>> *caller* do all the CSD maintenance actually cleans things up.
>>
>> And this is still completely untested, and may be entirely buggy. What
>> do you guys think?
>
> I think it makes perfect sense.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco @ 2015-02-19 16:14 ` Linus Torvalds 2015-02-23 14:01 ` Rafael David Tinoco 2015-02-19 16:16 ` Peter Zijlstra 2015-02-19 16:26 ` Linus Torvalds 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:14 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > Just a quick feedback, We were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I think we are > able to reproduce it again and provide more details (sorry for the > delay... after a reboot it took some days for us to reproduce this > again). > > It looks like RIP is still smp_call_function_single. Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if you can get the core-file to work and figure out where the other CPU is, that would be good. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:14 ` Linus Torvalds @ 2015-02-23 14:01 ` Rafael David Tinoco 2015-02-23 19:32 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-23 14:01 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if > you can get the core-file to work and figure out where the other CPU > is, that would be good. > This is v3.19 + your patch (smp acquire/release) - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode and acpi_idle) * http://people.canonical.com/~inaddy/lp1413540/STACK_TRACE.txt http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt http://people.canonical.com/~inaddy/lp1413540/FOREACH_BT.txt * It looks like we got locked because of reentrant flush_tlb_* through smp_call_* but I'll leave it to you. let me know if you need any more info.. - Rafael Tinoco ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 14:01 ` Rafael David Tinoco @ 2015-02-23 19:32 ` Linus Torvalds 2015-02-23 20:50 ` Peter Zijlstra 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-23 19:32 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges [-- Attachment #1: Type: text/plain, Size: 2187 bytes --] On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > This is v3.19 + your patch (smp acquire/release) > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > and acpi_idle) Hmm. There is absolutely nothing else going on on that machine, except for the single call to smp_call_function_single() that is waiting for the CSD to be released. > * It looks like we got locked because of reentrant flush_tlb_* through > smp_call_* > but I'll leave it to you. No, that is all a perfectly regular callchain: .. native_flush_tlb_others -> smp_call_function_many -> smp_call_function_single but the stack contains some stale addresses (one is probably just from smp_call_function_single() calling into "generic_exec_single()", and thus the stack contains the return address inside smp_call_function_single() in _addition_ to the actual place where the watchdog timer then interrupted it). It all really looks very regular and sane, and looks like smp_call_function_single() is happily just waiting for the IPI to finish in the (inlined) csd_lock_wait(). I see nothing wrong at all. However, here's a patch to the actual x86 smp_call_function_single_interrupt() handler, which probably doesn't make any difference at all, but does: - gets rid of the incredibly crazy and ugly smp_entering_irq() that inexplicably pairs with irq_exit() - makes the SMP IPI functions do the APIC ACK cycle at the *end* rather than at the beginning.
The exact placement of the ACK should make absolutely no difference, since interrupts should be disabled for the whole period anyway. But I'm desperate and am flailing around worrying about x2apic bugs and whatnot in addition to whatever bugs we might have in this area in sw. So maybe there is some timing thing. This makes us consistently ack the IPI _after_ we have cleared the list of smp call actions, and executed the list, rather than ack it before, and then go on to muck with the list. Plus that old code really annoyed me anyway. So I don't see why this should matter, but since I don't see how the bug could happen in the first place, might as well try it.. Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 2511 bytes --] arch/x86/kernel/smp.c | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index be8e1bde07aa..ed43259111d8 100644 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -265,12 +265,23 @@ __visible void smp_reschedule_interrupt(struct pt_regs *regs) */ } -static inline void smp_entering_irq(void) +/* + * This is the same as "entering_irq()", except it doesn't + * do the 'exit_idle()'. + * + * It pairs with smp_irq_exit(). + */ +static inline void smp_irq_enter(void) { - ack_APIC_irq(); irq_enter(); } +static inline void smp_irq_exit(void) +{ + irq_exit(); + ack_APIC_irq(); +} + __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs) { /* @@ -279,11 +290,11 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs) * scheduler_ipi(). This is OK, since those functions are allowed * to nest.
*/ - smp_entering_irq(); + smp_irq_enter(); trace_reschedule_entry(RESCHEDULE_VECTOR); __smp_reschedule_interrupt(); trace_reschedule_exit(RESCHEDULE_VECTOR); - exiting_irq(); + smp_irq_exit(); /* * KVM uses this interrupt to force a cpu out of guest mode */ @@ -297,18 +308,18 @@ static inline void __smp_call_function_interrupt(void) __visible void smp_call_function_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); __smp_call_function_interrupt(); - exiting_irq(); + smp_irq_exit(); } __visible void smp_trace_call_function_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); trace_call_function_entry(CALL_FUNCTION_VECTOR); __smp_call_function_interrupt(); trace_call_function_exit(CALL_FUNCTION_VECTOR); - exiting_irq(); + smp_irq_exit(); } static inline void __smp_call_function_single_interrupt(void) @@ -319,18 +330,18 @@ static inline void __smp_call_function_single_interrupt(void) __visible void smp_call_function_single_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); __smp_call_function_single_interrupt(); - exiting_irq(); + smp_irq_exit(); } __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs) { - smp_entering_irq(); + smp_irq_enter(); trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR); __smp_call_function_single_interrupt(); trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR); - exiting_irq(); + smp_irq_exit(); } static int __init nonmi_ipi_setup(char *str) ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 19:32 ` Linus Torvalds @ 2015-02-23 20:50 ` Peter Zijlstra 2015-02-23 21:02 ` Rafael David Tinoco 0 siblings, 1 reply; 54+ messages in thread From: Peter Zijlstra @ 2015-02-23 20:50 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote: > On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > > > This is v3.19 + your patch (smp acquire/release) > > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode > > and acpi_idle) > > Hmm. There is absolutely nothing else going on on that machine, except > for the single call to smp_call_function_single() that is waiting for > the CSD to be released. > > > * It looks like we got locked because of reentrant flush_tlb_* through > > smp_call_* > > but I'll leave it to you. > > No, that is all a perfectly regular callchain: > > .. native_flush_tlb_others -> smp_call_function_many -> > smp_call_function_single > > but the stack contains some stale addresses (one is probably just from > smp_call_function_single() calling into "generic_exec_single()", and > thus the stack contains the return address inside > smp_call_function_single() in _addition_ to the actual place where the > watchdog timer then interrupted it). > > It all really looks very regular and sane, and looks like > smp_call_function_single() is happily just waiting for the IPI to > finish in the (inlined) csd_lock_wait(). > > I see nothing wrong at all. [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 But it's a virtual machine, right? It's not running bare metal, it's running a !virt kernel on a virt machine, so maybe some of the virt muck is borked? A very subtly broken APIC emulation would be heaps of 'fun'. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-23 20:50 ` Peter Zijlstra @ 2015-02-23 21:02 ` Rafael David Tinoco 0 siblings, 0 replies; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-23 21:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, Dave Jones > > [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 > > But its a virtual machine right? Its not running bare metal, its running > a !virt kernel on a virt machine, so maybe some of the virt muck is > borked? > > A very subtly broken APIC emulation would be heaps of 'fun'. Indeed, I'll dig hypervisor changes... Anyway, this initial stack trace was first seen during the "frequent lockups in 3.18rc4" discussion: https://lkml.org/lkml/2014/11/14/656 RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0 At that time seen by Dave Jones, and running bare metal (if i remember correctly). -- Rafael Tinoco ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco 2015-02-19 16:14 ` Linus Torvalds @ 2015-02-19 16:16 ` Peter Zijlstra 2015-02-19 16:26 ` Linus Torvalds 2 siblings, 0 replies; 54+ messages in thread From: Peter Zijlstra @ 2015-02-19 16:16 UTC (permalink / raw) To: Rafael David Tinoco Cc: Linus Torvalds, LKML, Thomas Gleixner, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote: > Linus, Peter, Thomas > > Just a quick feedback, We were able to reproduce the lockup with this > proposed patch (3.19 + patch). Unfortunately we had problems with the > core file and I have only the stack trace for now but I think we are > able to reproduce it again and provide more details (sorry for the > delay... after a reboot it took some days for us to reproduce this > again). > > It looks like RIP is still smp_call_function_single. So Linus' patch mostly fixes smp_call_function_single_async() which is not what you're using. It would be very good to see traces of other CPUs; if for some reason the target CPU doesn't get around to running your callback, then we'll forever wait on it. loaded_vmcs_clear() uses smp_call_function_single(.wait = 1), that should work as before. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 15:42 ` Rafael David Tinoco 2015-02-19 16:14 ` Linus Torvalds 2015-02-19 16:16 ` Peter Zijlstra @ 2015-02-19 16:26 ` Linus Torvalds 2015-02-19 16:32 ` Rafael David Tinoco 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:26 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > > Same environment as before: Nested KVM (2 vcpus) on top of Proliant > DL380G8 with acpi_idle and no x2apic optout. Btw, which apic model does that end up using? Does "no x2apic optout" mean you're using the x2apic? What does "dmesg | grep apic" report? Something like Switched APIC routing to cluster x2apic. or what? Side note to the apic guys: I think the "single CPU" case ends up being one of the most important ones, but the stupid APIC model doesn't allow that, so sending an IPI to a single CPU ends up being "send a mask with a single bit set", and then we have that horrible "for_each_cpu(cpu, mask)" crap. Would it make sense to perhaps add a "send_IPI_single()" function call, and then for the APIC models that always are based on masks, use a wrapper that just does that "cpumask_of(cpu)" thing.. Hmm? Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:26 ` Linus Torvalds @ 2015-02-19 16:32 ` Rafael David Tinoco 2015-02-19 16:59 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-19 16:32 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers For the host, we are using "intremap=no_x2apic_optout intel_idle.max_cstate=0" for cmdline. It looks like that DL360/DL380 Gen8 firmware still asks to optout from x2apic but HP engineering team said that using x2apic for Gen8 would be ok (intel_idle causes these servers to generate NMIs when idling, probably related to packed c-states and this server's dependency on acpi tables for c-state). Feb 19 08:21:28 derain kernel: [ 3.504676] Enabled IRQ remapping in x2apic mode Feb 19 08:21:28 derain kernel: [ 3.565451] Enabling x2apic Feb 19 08:21:28 derain kernel: [ 3.602134] Enabled x2apic Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to cluster x2apic. On Thu, Feb 19, 2015 at 2:26 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: >> >> Same environment as before: Nested KVM (2 vcpus) on top of Proliant >> DL380G8 with acpi_idle and no x2apic optout. > > Btw, which apic model does that end up using? Does "no x2apic optout" > mean you're using the x2apic? > > What does "dmesg | grep apic" report? Something like > > Switched APIC routing to cluster x2apic. > > or what? > > Side note to the apic guys: I think the "single CPU" case ends up > being one of the most important ones, but the stupid APIC model > doesn't allow that, so sending an IPI to a single CPU ends up being > "send a mask with a single bit set", and then we have that horrible > "for_each_cpu(cpu, mask)" crap. 
> > Would it make sense to perhaps add a "send_IPI_single()" function > call, and then for the APIC models that always are based on masks, use > a wrapper that just does that "cpumask_of(cpu)" thing.. > > Hmm? > > Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:32 ` Rafael David Tinoco @ 2015-02-19 16:59 ` Linus Torvalds 2015-02-19 17:30 ` Rafael David Tinoco 2015-02-19 17:39 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 16:59 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: > Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to > cluster x2apic. Ok. That "cluster x2apic" mode is just about the nastiest mode when it comes to sending a single ipi. We do that insane dance where we - turn single cpu number into cpumask - copy the cpumask to a percpu temporary storage - walk each cpu in the cpumask - for each cpu, look up the cluster siblings - for each cluster sibling that is also in the cpumask, look up the logical apic mask and add it to the actual ipi destination mask - send an ipi to that final mask. which is just insane. It's complicated, it's fragile, and it's unnecessary. If we had a simple "send_IPI()" function, we could do this all with something much saner, and it would look something like static void x2apic_send_IPI(int cpu, int vector) { u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu); x2apic_wrmsr_fence(); __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL); } and then 'void native_send_call_func_single_ipi()' would just look like void native_send_call_func_single_ipi(int cpu) { apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR); } but I might have missed something (and we might want to have a wrapper that says "if the apic doesn't have a 'send_IPI' function, use "send_IPI_mask(cpumask_of(cpu), vector)" instead) The fact that you need that no_x2apic_optout (which in turn means that your ACPI tables seem to say "don't use x2apic") also makes me worry. Are there known errata for the x2apic?
Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:59 ` Linus Torvalds @ 2015-02-19 17:30 ` Rafael David Tinoco 2015-02-19 17:39 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Rafael David Tinoco @ 2015-02-19 17:30 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, chris.j.arges, the arch/x86 maintainers I could only find an advisory (regarding sr-iov and irq remaps) from HP to RHEL6.2 users stating that Gen8 firmware does not enable it by default. http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796 """ The interrupt remapping capability depends on x2apic enabled in the BIOS and HP ProLiant Gen8 systems do not enable x2apic; therefore, the following workaround is required for device assignment: Edit the /etc/grub.conf and add intremap=no_x2apic_optout option to the kernel command line options. """ Probably for backwards compatibility... not sure if there is an option to enable/disable it in firmware (like DL390 seems to have). I don't think so... but I was told by HP team that I should use x2apic for >= Gen8. On Thu, Feb 19, 2015 at 2:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <inaddy@ubuntu.com> wrote: >> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to >> cluster x2apic. > > Ok. That "cluster x2apic" mode is just about the nastiest mode when it > comes to sending a single ipi. We do that insane dance where we > > - turn single cpu number into cpumask > - copy the cpumask to a percpu temporary storage > - walk each cpu in the cpumask > - for each cpu, look up the cluster siblings > - for each cluster sibling that is also in the cpumask, look up the > logical apic mask and add it to the actual ipi destination mask > - send an ipi to that final mask. > > which is just insane. It's complicated, it's fragile, and it's unnecessary. 
> > If we had a simple "send_IPI()" function, we could do this all with > something much saner, and it would look something like > > static void x2apic_send_IPI(int cpu, int vector) > { > u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu); > x2apic_wrmsr_fence(); > __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL); > } > > and then 'void native_send_call_func_single_ipi()' would just look like > > void native_send_call_func_single_ipi(int cpu) > { > apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR); > } > > but I might have missed something (and we might want to have a wrapper > that says "if the apic doesn't have a 'send_IPI' function, use > "send_IPI_mask(cpumask_of(cpu), vector)" instead) > > The fact that you need that no_x2apic_optout (which in turn means that > your ACPI tables seem to say "don't use x2apic") also makes me worry. > > Are there known errata for the x2apic? > > Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 16:59 ` Linus Torvalds 2015-02-19 17:30 ` Rafael David Tinoco @ 2015-02-19 17:39 ` Linus Torvalds 2015-02-19 20:29 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 17:39 UTC (permalink / raw) To: Rafael David Tinoco Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Are there known errata for the x2apic? .. and in particular, do we still have to worry about the traditional local apic "if there are more than two pending interrupts per priority level, things get lost" problem? I forget the exact details. Hopefully somebody remembers. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 17:39 ` Linus Torvalds @ 2015-02-19 20:29 ` Linus Torvalds 2015-02-19 21:59 ` Linus Torvalds 2015-02-20 9:30 ` Ingo Molnar 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 20:29 UTC (permalink / raw) To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Are there known errata for the x2apic? > > .. and in particular, do we still have to worry about the traditional > local apic "if there are more than two pending interrupts per priority > level, things get lost" problem? > > I forget the exact details. Hopefully somebody remembers. I can't find it in the docs. I find the "two-entries per vector", but not anything that is per priority level (group of 16 vectors). Maybe that was the IO-APIC, in which case it's immaterial for IPI's. However, having now mostly re-acquainted myself with the APIC details, it strikes me that we do have some oddities here. In particular, a few interrupt types are very special: NMI, SMI, INIT, ExtINT, or SIPI are handled early in the interrupt acceptance logic, and are sent directly to the CPU core, without going through the usual intermediate IRR/ISR dance. And why might this matter? It's important because it means that those kinds of interrupts must *not* do the apic EOI that ack_APIC_irq() does. And we correctly don't do ack_APIC_irq() for NMI etc, but it strikes me that ExtINT is odd and special. I think we still use ExtINT for some odd cases. We used to have some magic with the legacy timer interrupt, for example. And I think they all go through the normal "do_IRQ()" logic regardless of whether they are ExtINT or not. 
Now, what happens if we send an EOI for an ExtINT interrupt? It basically ends up being a spurious IPI. And I *think* that what normally happens is absolutely nothing at all. But if in addition to the ExtINT, there was a pending IPI (or other pending ISR bit set), maybe we lose interrupts.. .. and it's entirely possible that I'm just completely full of shit. Who is the poor bastard who has worked most with things like ExtINT, and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary contacts.. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 20:29 ` Linus Torvalds @ 2015-02-19 21:59 ` Linus Torvalds 2015-02-19 22:45 ` Linus Torvalds 2015-02-20 9:30 ` Ingo Molnar 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-02-19 21:59 UTC (permalink / raw) To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 3236 bytes --] On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Now, what happens if we send an EOI for an ExtINT interrupt? It > basically ends up being a spurious IPI. And I *think* that what > normally happens is absolutely nothing at all. But if in addition to > the ExtINT, there was a pending IPI (or other pending ISR bit set), > maybe we lose interrupts.. > > .. and it's entirely possible that I'm just completely full of shit. > Who is the poor bastard who has worked most with things like ExtINT, > and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary > contacts.. So quite frankly, trying to follow all the logic from do_IRQ() through handle_irq() to the actual low-level handler, I just couldn't do it. So instead, I wrote a patch to verify that the ISR bit is actually set when we do ack_APIC_irq(). This was complicated by the fact that we don't actually pass in the vector number at all to the acking, so 99% of the patch is just doing that. A couple of places we don't really have a good vector number, so I said "screw it, a negative value means that we won't check the ISR). The attached patch is quite possibly garbage, but it gives an interesting warning for me during i8042 probing, so who knows. Maybe it actually shows a real problem - or maybe I just screwed up the patch. .. and maybe even if the patch is fine, it's actually never really a problem to have spurious APIC ACK cycles. 
Maybe it cannot make interrupts be ignored. Anyway, the back-trace for the warning I get is during boot: ... PNP: No PS/2 controller found. Probing ports directly. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436 ir_ack_apic_edge+0x74/0x80() Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-08857-g89d3fa45b4ad-dirty #2 Call Trace: <IRQ> dump_stack+0x45/0x57 warn_slowpath_common+0x80/0xc0 warn_slowpath_null+0x15/0x20 ir_ack_apic_edge+0x74/0x80 handle_edge_irq+0x51/0x110 handle_irq+0x74/0x140 do_IRQ+0x4a/0x140 common_interrupt+0x6a/0x6a <EOI> ? _raw_spin_unlock_irqrestore+0x9/0x10 __setup_irq+0x239/0x5a0 request_threaded_irq+0xc2/0x180 i8042_probe+0x5b8/0x680 platform_drv_probe+0x2f/0xa0 driver_probe_device+0x8b/0x3e0 __driver_attach+0x93/0xa0 bus_for_each_dev+0x63/0xa0 driver_attach+0x19/0x20 bus_add_driver+0x178/0x250 driver_register+0x5f/0xf0 __platform_driver_register+0x45/0x50 __platform_driver_probe+0x26/0xa0 __platform_create_bundle+0xad/0xe0 i8042_init+0x3d0/0x3f6 do_one_initcall+0xb8/0x1d0 kernel_init_freeable+0x16d/0x1fa kernel_init+0x9/0xf0 ret_from_fork+0x7c/0xb0 ---[ end trace 1de82c4457c6a0f0 ]--- serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 ... and it looks not entirely insane. Is this worth looking at? Or is it something spurious? I might have gotten the vectors wrong, and maybe the warning is not because the ISR bit isn't set, but because I test the wrong bit. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 13041 bytes --]

 arch/x86/include/asm/apic.h              | 22 ++++++++++++++--------
 arch/x86/kernel/apic/apic.c              | 10 +++++-----
 arch/x86/kernel/apic/io_apic.c           |  5 +++--
 arch/x86/kernel/apic/vector.c            |  8 +++++---
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  4 ++--
 arch/x86/kernel/cpu/mcheck/threshold.c   |  4 ++--
 arch/x86/kernel/irq.c                    | 11 ++++++-----
 arch/x86/kernel/irq_work.c               |  8 ++++----
 arch/x86/kernel/smp.c                    | 18 +++++++++---------
 drivers/iommu/irq_remapping.c            |  6 ++++--
 10 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..912a969fd747 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -387,7 +387,7 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
 	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -418,7 +418,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -427,13 +427,19 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+static inline void ack_APIC_irq(int vector)
 {
+	/* Is the ISR bit actually set for this vector? */
+	if (vector >= 16) {
+		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+		v >>= vector & 0x1f;
+		WARN_ON_ONCE(!(v & 1));
+	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-	apic_eoi();
+	apic_eoi(vector);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
@@ -631,9 +637,9 @@ static inline void entering_irq(void)
 	exit_idle();
 }
 
-static inline void entering_ack_irq(void)
+static inline void entering_ack_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	entering_irq();
 }
 
@@ -642,11 +648,11 @@ static inline void exiting_irq(void)
 	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
+static inline void exiting_ack_irq(int vector)
 {
 	irq_exit();
 	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 extern void ioapic_zap_locks(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639ae1b9b..20ef1236d057 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -910,7 +910,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	exiting_irq();
 
@@ -929,7 +929,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1342,7 +1342,7 @@ void setup_local_APIC(void)
 		value = apic_read(APIC_ISR + i*0x10);
 		for (j = 31; j >= 0; j--) {
 			if (value & (1<<j)) {
-				ack_APIC_irq();
+				ack_APIC_irq(-1);
 				acked++;
 			}
 		}
@@ -1868,7 +1868,7 @@ static inline void __smp_spurious_interrupt(u8 vector)
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f)))
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 	inc_irq_stat(irq_spurious_count);
 
@@ -1917,7 +1917,7 @@ static inline void __smp_error_interrupt(struct pt_regs *regs)
 	if (lapic_get_maxlvt() > 3)	/* Due to the Pentium erratum 3AP. */
 		apic_write(APIC_ESR, 0);
 	v = apic_read(APIC_ESR);
-	ack_APIC_irq();
+	ack_APIC_irq(ERROR_APIC_VECTOR);
 	atomic_inc(&irq_err_count);
 
 	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x",
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f4dc2462a1ac..b74db5a2660b 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static void ack_ioapic_level(struct irq_data *data)
 	 * We must acknowledge the irq before we move it or the acknowledge will
 	 * not propagate properly.
 	 */
-	ack_APIC_irq();
+	ack_APIC_irq(i);
 
 	/*
 	 * Tail end of clearing remote IRR bit (either by delivering the EOI
@@ -2067,7 +2067,8 @@ static void unmask_lapic_irq(struct irq_data *data)
 
 static void ack_lapic_irq(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static struct irq_chip lapic_chip __read_mostly = {
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..30643ffe061c 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +335,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
-	irq_complete_move(irqd_cfg(data));
+	struct irq_cfg *cfg = irqd_cfg(data);
+
+	irq_complete_move(cfg);
 	irq_move_irq(data);
-	ack_APIC_irq();
+	ack_APIC_irq(cfg->vector);
 }
 
 /*
@@ -397,7 +399,7 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 {
 	unsigned vector, me;
 
-	ack_APIC_irq();
+	ack_APIC_irq(IRQ_MOVE_CLEANUP_VECTOR);
 
 	irq_enter();
 	exit_idle();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1586d7..89bac1f9ed78 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -433,7 +433,7 @@ asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
 	__smp_thermal_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
@@ -442,7 +442,7 @@ asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	__smp_thermal_interrupt();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980186ee..03068d1844c2 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -28,7 +28,7 @@ asmlinkage __visible void smp_threshold_interrupt(void)
 {
 	entering_irq();
 	__smp_threshold_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_threshold_interrupt(void)
@@ -37,5 +37,5 @@ asmlinkage __visible void smp_trace_threshold_interrupt(void)
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	__smp_threshold_interrupt();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 705ef8d48e2d..801e10d3d125 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -45,7 +45,8 @@ void ack_bad_irq(unsigned int irq)
 	 * completely.
 	 * But only ack when the APIC is enabled -AK
 	 */
-	ack_APIC_irq();
+	/* This doesn't know the vector. It needs the irq descriptor */
+	ack_APIC_irq(-1);
 }
 
 #define irq_stats(x)		(&per_cpu(irq_stat, x))
@@ -198,7 +199,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	irq = __this_cpu_read(vector_irq[vector]);
 
 	if (!handle_irq(irq, regs)) {
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 		if (irq != VECTOR_RETRIGGERED) {
 			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector (irq %d)\n",
@@ -230,7 +231,7 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -244,7 +245,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	ack_APIC_irq();
+	ack_APIC_irq(POSTED_INTR_VECTOR);
 
 	irq_enter();
 
@@ -262,7 +263,7 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 15d741ddfeeb..dfe92635307c 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -10,10 +10,10 @@
 #include <asm/apic.h>
 #include <asm/trace/irq_vectors.h>
 
-static inline void irq_work_entering_irq(void)
+static inline void irq_work_entering_irq(int vector)
 {
 	irq_enter();
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 static inline void __smp_irq_work_interrupt(void)
@@ -24,14 +24,14 @@ static inline void __smp_irq_work_interrupt(void)
 
 __visible void smp_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..44b74b24c5e4 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -170,7 +170,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 
 asmlinkage __visible void smp_reboot_interrupt(void)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(REBOOT_VECTOR);
 	irq_enter();
 	stop_this_cpu(NULL);
 	irq_exit();
@@ -258,16 +258,16 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
 }
 
-static inline void smp_entering_irq(void)
+static inline void smp_entering_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	irq_enter();
 }
 
@@ -279,7 +279,7 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 	 * scheduler_ipi(). This is OK, since those functions are allowed
 	 * to nest.
 	 */
-	smp_entering_irq();
+	smp_entering_irq(RESCHEDULE_VECTOR);
 	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -297,14 +297,14 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -319,14 +319,14 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 390079ee1350..896c4b35422d 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -334,12 +334,14 @@ void panic_if_irq_remap(const char *msg)
 
 static void ir_ack_apic_edge(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static void ir_ack_apic_level(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 	eoi_ioapic_irq(data->irq, irqd_cfg(data));
 }
* Re: smp_call_function_single lockups
  2015-02-19 21:59 ` Linus Torvalds
@ 2015-02-19 22:45 ` Linus Torvalds
  2015-03-31  3:15   ` Chris J Arges
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-19 22:45 UTC
To: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu
Cc: Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  Christopher Arges, the arch/x86 maintainers

On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than "once". And tested it some more. It seems to work correctly. The irq case during 8042 probing is not repeatable, and I suspect it happens because the interrupt source goes away (some probe-time thing that first triggers an interrupt, but then clears it itself), so it doesn't happen every boot, and I've gotten it with slightly different backtraces.

But it's the only warning that happens for me, so I think my code is right (at least for the cases that trigger on this machine). It's definitely not a "every interrupt causes the warning because the code was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases too. In particular, the people seeing the IPI lockup. Because a lot of the lockups we've seen have *looked* like the IPI interrupt just never happened, and so we're waiting forever for the target CPU to react to it. And just maybe the spurious EOI could cause the wrong bit to be cleared in the ISR, and then the interrupt never shows up. Something like that would certainly explain why it only happens on some machines and under certain timing circumstances.

Linus
* Re: smp_call_function_single lockups
  2015-02-19 22:45 ` Linus Torvalds
@ 2015-03-31  3:15 ` Chris J Arges
  2015-03-31  4:28   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Chris J Arges @ 2015-03-31 3:15 UTC
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Thu, Feb 19, 2015 at 02:45:54PM -0800, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Is this worth looking at? Or is it something spurious? I might have
> > gotten the vectors wrong, and maybe the warning is not because the ISR
> > bit isn't set, but because I test the wrong bit.
>
> I edited the patch to do ratelimiting (one per 10s max) rather than
> "once". And tested it some more. It seems to work correctly. The irq
> case during 8042 probing is not repeatable, and I suspect it happens
> because the interrupt source goes away (some probe-time thing that
> first triggers an interrupt, but then clears it itself), so it doesn't
> happen every boot, and I've gotten it with slightly different
> backtraces.
>
> But it's the only warning that happens for me, so I think my code is
> right (at least for the cases that trigger on this machine). It's
> definitely not a "every interrupt causes the warning because the code
> was buggy, and the WARN_ONCE() just printed the first one".
>
> It would be interesting to hear if others see spurious APIC EOI cases
> too. In particular, the people seeing the IPI lockup. Because a lot of
> the lockups we've seen have *looked* like the IPI interrupt just never
> happened, and so we're waiting forever for the target CPU to react to
> it. And just maybe the spurious EOI could cause the wrong bit to be
> cleared in the ISR, and then the interrupt never shows up. Something
> like that would certainly explain why it only happens on some machines
> and under certain timing circumstances.
>
> Linus

Linus,

I'm able to reproduce this IPI lockup easily now when using specific hardware and nested KVM VMs. However, it seems to only occur when using certain host hardware. For example:

- Xeon E5620 (Westmere-EP) (fam: 06, model: 2c)
- Xeon E312xx (Sandy Bridge) (fam: 06, model: 2a)

Now I'm not sure if this indicates hardware LAPIC issues or if the timing in these processors makes it more likely to hit this issue. So far I haven't seen the issue on other CPUs.

To set this up, I've done the following (L0 being the Xeon box):
1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue. I've observed that running KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others). If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset) between L1 vCPUs until the hang occurs.

Pinning L1 vCPUs to L0 initially (i.e. 0->0 1->1) causes the hangs to not occur in my testing. (Which makes sense because we're unlikely to do real IPIs.)
I've been able to repro with your patch and observed the WARN_ON when booting a VM on affected and non-affected hardware:

[   13.613531] ------------[ cut here ]------------
[   13.613531] WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
[   13.613531] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) serio_raw(E) i2c_piix4(E) parport_pc(E) pvpanic(E) parport(E) mac_hid(E) nls_utf8(E) isofs(E) psmouse(E) floppy(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   13.613531] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E   4.0.0-rc6-460f8calinus1 #4
[   13.613531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   13.613531]  ffffffff81a911a6 ffff88013fc03eb8 ffffffff817999a7 0000000000000007
[   13.613531]  0000000000000000 ffff88013fc03ef8 ffffffff810734ca 0000000000000092
[   13.613531]  0000000000000011 ffff8800b8bbac00 000000000000001f 00000000000000b1
[   13.613531] Call Trace:
[   13.613531]  <IRQ>  [<ffffffff817999a7>] dump_stack+0x45/0x57
[   13.613531]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   13.613531]  [<ffffffff810735ba>] warn_slowpath_null+0x1a/0x20
[   13.613531]  [<ffffffff8104d3f4>] apic_ack_edge+0x84/0x90
[   13.613531]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   13.613531]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   13.613531]  [<ffffffff817a3b9f>] do_IRQ+0x4f/0x140
[   13.613531]  [<ffffffff817a196d>] common_interrupt+0x6d/0x6d
[   13.613531]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   13.613531]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   13.613531]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   13.613531]  [<ffffffff8101ecde>] default_idle+0x1e/0xc0
[   13.613531]  [<ffffffff8101f6cf>] arch_cpu_idle+0xf/0x20
[   13.613531]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   13.613531]  [<ffffffff8178e9b7>] rest_init+0x77/0x80
[   13.613531]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   13.613531]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   13.613531]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   13.613531]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   13.613531] ---[ end trace 4512c19aad733ce8 ]---

I modified the posted patch with the following:

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index bf32309..dc3e192 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
 	if (vector >= 16) {
 		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 		v >>= vector & 0x1f;
-		WARN_ON_ONCE(!(v & 1));
+		WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
 	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction

And it showed vector = 1b when booting.
However, when I run the reproducer on an affected machine I get the following WARNs before the hang:

[   36.301282] ------------[ cut here ]------------
[   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[   36.301301] ack_APIC_irq: vector = e1
[   36.301303] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E) parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   36.301344] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W   E   4.0.0-rc6-655d7f2linus2 #5
[   36.301346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   36.301348]  ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[   36.301352]  ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca ffffffff81c0c0f8
[   36.301355]  0000000000000001 00000000000000e1 0000000000000019 0000000000000051
[   36.301359] Call Trace:
[   36.301361]  <IRQ>  [<ffffffff817998e7>] dump_stack+0x45/0x57
[   36.301373]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   36.301377]  [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[   36.301383]  [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[   36.301387]  [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[   36.301392]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   36.301399]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   36.301403]  [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[   36.301407]  [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[   36.301409]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   36.301417]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   36.301422]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   36.301426]  [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[   36.301430]  [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[   36.301433]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   36.301438]  [<ffffffff8178e8f7>] rest_init+0x77/0x80
[   36.301443]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   36.301446]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   36.301450]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   36.301453]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   36.301455] ---[ end trace 8f10e066e6d6eddf ]---
[   40.430515] ------------[ cut here ]------------
[   40.430531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[   40.430533] ack_APIC_irq: vector = 22
[   40.430535] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E) parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[   40.430576] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W   E   4.0.0-rc6-655d7f2linus2 #5
[   40.430578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   40.430581]  ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[   40.430584]  ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca 0000000000000202
[   40.430587]  0000000000000002 0000000000000022 000000000000001d 0000000000000091
[   40.430591] Call Trace:
[   40.430593]  <IRQ>  [<ffffffff817998e7>] dump_stack+0x45/0x57
[   40.430605]  [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[   40.430609]  [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[   40.430615]  [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[   40.430631]  [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[   40.430637]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[   40.430643]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
[   40.430647]  [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[   40.430651]  [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[   40.430653]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[   40.430661]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[   40.430666]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[   40.430670]  [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[   40.430674]  [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[   40.430678]  [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[   40.430684]  [<ffffffff8178e8f7>] rest_init+0x77/0x80
[   40.430689]  [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[   40.430693]  [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[   40.430696]  [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[   40.430699]  [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[   40.430702] ---[ end trace 8f10e066e6d6ede0 ]---

So vector = e1 then 22 before the hang.

[   45.728212] kvm [1602]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[   53.391974] random: nonblocking pool is initialized
[   80.076005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1602]

Some things I've already done are:
- run with nox2apic on L0,L1 (so apic->name == "flat"), this still reproduces
- run with apic=off on L0,L1, this still reproduces

Anyway, maybe this sheds some more light on this issue. I can reproduce this at will, so let me know of other experiments to do.

Thanks,
--chris j arges
* Re: smp_call_function_single lockups
  2015-03-31  3:15 ` Chris J Arges
@ 2015-03-31  4:28 ` Linus Torvalds
  2015-03-31  4:46 ` Linus Torvalds
  2015-03-31 15:08 ` Linus Torvalds
  2 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 4:28 UTC
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I've been able to repro with your patch and observed the WARN_ON when booting a
> VM on affected hardware and non affected hardware:

Ok, interesting. So the whole "we try to do an APIC ACK with the ISR bit clear" seems to be a real issue.

> I modified the posted patch with the following:
> -		WARN_ON_ONCE(!(v & 1));
> +		WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);

Yes, makes sense, although I'm not sure what the vector translations end up being. See below.

> And it showed vector = 1b when booting. However, when I run the reproducer on
> an affected machine I get the following WARNs before the hang:

Ok, so the boot-time thing I think happens because a device irq happens but goes away immediately because the CPU that triggers it also clears it immediately in the device initialization code, so it's a level-triggered interrupt that goes away "on its own".

But vector 0x1b seems odd. I thought we mapped external interrupts to 0x20+ (FIRST_EXTERNAL_VECTOR). Ingo/Peter? Is there any sane interface to look up the percpu apic vector data?

Chris, since this is repeatable for you, can you do

	int irq;

	irq = __this_cpu_read(vector_irq[vector]);

and print that out too? That *should* show the actual hardware irq, although there are a few magic cases too (-1/-2 mean special things)

But the fact that you get the warning before the hang is much more interesting.

> [   36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
> [   36.301301] ack_APIC_irq: vector = e1

Is this repeatable? Does it happen before *every* hang, or at least often enough to be a good pattern?

> [   40.430533] ack_APIC_irq: vector = 22
>
> So vector = e1 then 22 before the hang.

Is it always the same ones? I assume that on different machines the vector allocations would be different, but is it consistent on any particular machine? That's assuming the whole warning is consistent at all before the hang, of course.

> Anyway, maybe this sheds some more light on this issue. I can reproduce this at
> will, so let me know of other experiments to do.

Somebody else who knows the apic needs to also take a look, but I'd love to hear what the actual hardware irq is (from that "vector_irq[vector]" thing above. I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they are 0xf0-0xff), so it sounds like an external one. But that then requires the whole mapping table thing.

Ingo/Peter/Jiang - is there anything else useful we could print out? I worry about the irq movement code. Can we add printk's to when an irq is chasing from one CPU to another and doing that "move_in_progress" thing? I've always been scared of that code.

Linus
* Re: smp_call_function_single lockups
  2015-03-31  3:15 ` Chris J Arges
  2015-03-31  4:28 ` Linus Torvalds
@ 2015-03-31  4:46 ` Linus Torvalds
  2015-03-31 15:08 ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 4:46 UTC
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
  Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
  the arch/x86 maintainers

On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
> [   13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
> [   13.613531]  [<ffffffff8104d3f4>] apic_ack_edge+0x84/0x90
> [   13.613531]  [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
> [   13.613531]  [<ffffffff81016aa2>] handle_irq+0x22/0x40
> [   13.613531]  [<ffffffff817a3b9f>] do_IRQ+0x4f/0x140
> [   13.613531]  [<ffffffff817a196d>] common_interrupt+0x6d/0x6d
> [   13.613531]  <EOI>  [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
> [   13.613531]  [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
> [   13.613531]  [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
> [   13.613531]  [<ffffffff8101ecde>] default_idle+0x1e/0xc0

Hmm. I didn't notice that "hrtimer_start" was always there as a stale entry on the stack when this happened.

That may well be immaterial - the CPU being idle means that the last thing it did before going to sleep was likely that "start timer" thing, but it's interesting even so. Some issue with reprogramming the hrtimer as it is triggering, kind of similar to the bootup case I saw where the keyboard init sequence raises an interrupt that was already cleared by the time the interrupt happened.

So maybe something like this happens:

 - local timer is about to go off and raises the interrupt line

 - in the meantime, we're reprogramming the timer into the future

 - the CPU takes the interrupt, but now the timer has been reprogammed, so the irq line is no longer active, and ISR is zero even though we took the interrupt (which is why the new warning triggers)

 - we're running the local timer interrupt (which happened due to the *old* programmed value), but we do something wrong because when we read the timer state, we see the *new* programmed value and so we think that it's the new timer that triggered.

I dunno. I don't see why we'd lock up, but DaveJ's old lockup had several signs that it seemed to be timer-related.

It would be interesting to see the actual irq number. Maybe this has nothing what-so-ever to do with the hrtimer.

Linus
* Re: smp_call_function_single lockups 2015-03-31 3:15 ` Chris J Arges 2015-03-31 4:28 ` Linus Torvalds 2015-03-31 4:46 ` Linus Torvalds @ 2015-03-31 15:08 ` Linus Torvalds 2015-03-31 22:23 ` Chris J Arges 2 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2015-03-31 15:08 UTC (permalink / raw) To: Chris J Arges Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges <chris.j.arges@canonical.com> wrote: > > I modified the posted patch with the following: Actually, in addition to Ingo's patches (and the irq printout), which you should try first, if none of that really gives any different behavior, can modify that ack_APIC_irq() debugging code a bit more: > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h > index bf32309..dc3e192 100644 > --- a/arch/x86/include/asm/apic.h > +++ b/arch/x86/include/asm/apic.h > @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector) > if (vector >= 16) { > unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1)); > v >>= vector & 0x1f; > - WARN_ON_ONCE(!(v & 1)); > + WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector); > } > /* > * ack_APIC_irq() actually gets compiled as a single instruction So what I'd suggest doing is: - change the test of "vector >= 16" to just "vector >= 0". We still have "-1" as the "unknown vector" thing, but I think only the ack_bad_irq() thing calls it, and that should print out its own message if it ever triggers, so it isn't an issue. The reason for the ">= 16" was kind of bogus - the first 16 vectors are system vectors, but we definitely shouldn't ack the apic for such vectors anyway, so giving a warning for them is very much appropriate. In particular, vector 2 is NMI, and maybe we do ACk it incorrectly. 
 - add a "return" if the warning triggers, and simply don't do the
actual ACK cycle at all if the ISR bit is clear.

   IOW, make it do "if (WARN(..)) return;"

Now, we might get the vector number wrong for some reason, and in that
case not ACK'ing at all might cause problems too, but it would be
interesting to see if it changes behavior wrt the lockup.

I don't have any other ideas at the moment, but hopefully the
suggested changes by me and Ingo will give some more data to go on and
clarify what might be going on.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 15:08 ` Linus Torvalds
@ 2015-03-31 22:23 ` Chris J Arges
  2015-03-31 23:07 ` Linus Torvalds
  2015-04-01 12:43 ` Ingo Molnar
  0 siblings, 2 replies; 54+ messages in thread
From: Chris J Arges @ 2015-03-31 22:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 08:08:40AM -0700, Linus Torvalds wrote:
> On Mon, Mar 30, 2015 at 8:15 PM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I modified the posted patch with the following:
>
> Actually, in addition to Ingo's patches (and the irq printout), which
> you should try first, if none of that really gives any different
> behavior, can modify that ack_APIC_irq() debugging code a bit more:
>
> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> > index bf32309..dc3e192 100644
> > --- a/arch/x86/include/asm/apic.h
> > +++ b/arch/x86/include/asm/apic.h
> > @@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
> >         if (vector >= 16) {
> >                 unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
> >                 v >>= vector & 0x1f;
> > -               WARN_ON_ONCE(!(v & 1));
> > +               WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
> >         }
> >         /*
> >          * ack_APIC_irq() actually gets compiled as a single instruction
>
> So what I'd suggest doing is:
>
>  - change the test of "vector >= 16" to just "vector >= 0".
>
>    We still have "-1" as the "unknown vector" thing, but I think only
> the ack_bad_irq() thing calls it, and that should print out its own
> message if it ever triggers, so it isn't an issue.
>
>    The reason for the ">= 16" was kind of bogus - the first 16 vectors
> are system vectors, but we definitely shouldn't ack the apic for such
> vectors anyway, so giving a warning for them is very much appropriate.
> In particular, vector 2 is NMI, and maybe we do ACk it incorrectly.
>
>  - add a "return" if the warning triggers, and simply don't do the
> actual ACK cycle at all if the ISR bit is clear.
>
>    IOW, make it do "if (WARN(..)) return;"
>
> Now, we might get the vector number wrong for some reason, and in that
> case not ACK'ing at all might cause problems too, but it would be
> interesting to see if it changes behavior wrt the lockup.
>
> I don't have any other ideas at the moment, but hopefully the
> suggested changes by me and Ingo will give some more data to go on and
> clarify what might be going on.
>
>                Linus
>

Linus,

I had a few runs with your patch plus modifications, and got the following
results (modified patch inlined below):

[   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
[  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]

[   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
[  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
[  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
[  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]

[   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
[  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
[  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
[  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

[   15.026077] ack_APIC_irq: vector = b1, irq = ffffffff
[  374.828531] ack_APIC_irq: vector = c1, irq = ffffffff
[  402.965942] ack_APIC_irq: vector = d1, irq = ffffffff
[  434.540814] ack_APIC_irq: vector = e1, irq = ffffffff
[  461.820768] ack_APIC_irq: vector = 22, irq = ffffffff
[  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]

[   17.889334] ack_APIC_irq: vector = d1, irq = ffffffff
[  291.888784] ack_APIC_irq: vector = e1, irq = ffffffff
[  297.824627] ack_APIC_irq: vector = 22, irq = ffffffff
[  336.960594] ack_APIC_irq: vector = 42, irq = ffffffff
[  367.012706] ack_APIC_irq: vector = 52, irq = ffffffff
[  377.025090] ack_APIC_irq: vector = 62, irq = ffffffff
[  417.088773] ack_APIC_irq: vector = 72, irq = ffffffff
[  447.136788] ack_APIC_irq: vector = 82, irq = ffffffff
-- stopped it since it wasn't reproducing / I was impatient --

So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
(vector >= 0) didn't seem to expose any additional vectors.

This was only instrumented at the L1 level, I'm also planning on
instrumenting the L0 kernel and seeing if we get these as well.

--chris j arges

--

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -395,7 +395,7 @@ static inline void apic_write(u32 reg, u32 val)
 	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
 	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -426,7 +426,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -435,13 +435,26 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+#include <asm/irq_vectors.h>
+typedef int vector_irq_t[NR_VECTORS];
+DECLARE_PER_CPU(vector_irq_t, vector_irq);
+
+static inline void ack_APIC_irq(int vector)
 {
+	/* Is the ISR bit actually set for this vector? */
+	if (vector >= 0) {
+		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+		int irq;
+		v >>= vector & 0x1f;
+		irq = __this_cpu_read(vector_irq[vector]);
+
+		WARN(!(v & 1), "ack_APIC_irq: vector = %0x, irq = %0x\n", vector, irq);
+	}
 	/*
 	 * ack_APIC_irq() actually gets compiled as a single instruction
 	 * ... yummie.
 	 */
-	apic_eoi();
+	apic_eoi(vector);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
@@ -639,9 +652,9 @@ static inline void entering_irq(void)
 	exit_idle();
 }
 
-static inline void entering_ack_irq(void)
+static inline void entering_ack_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	entering_irq();
 }
 
@@ -650,11 +663,11 @@ static inline void exiting_irq(void)
 	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
+static inline void exiting_ack_irq(int vector)
 {
 	irq_exit();
 	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 extern void ioapic_zap_locks(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..20ef123 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -910,7 +910,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	exiting_irq();
 
@@ -929,7 +929,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1342,7 +1342,7 @@ void setup_local_APIC(void)
 		value = apic_read(APIC_ISR + i*0x10);
 		for (j = 31; j >= 0; j--) {
 			if (value & (1<<j)) {
-				ack_APIC_irq();
+				ack_APIC_irq(-1);
 				acked++;
 			}
 		}
@@ -1868,7 +1868,7 @@ static inline void __smp_spurious_interrupt(u8 vector)
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f)))
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 	inc_irq_stat(irq_spurious_count);
 
@@ -1917,7 +1917,7 @@ static inline void __smp_error_interrupt(struct pt_regs *regs)
 	if (lapic_get_maxlvt() > 3)	/* Due to the Pentium erratum 3AP. */
 		apic_write(APIC_ESR, 0);
 	v = apic_read(APIC_ESR);
-	ack_APIC_irq();
+	ack_APIC_irq(ERROR_APIC_VECTOR);
 	atomic_inc(&irq_err_count);
 
 	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x",
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f4dc246..b74db5a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static void ack_ioapic_level(struct irq_data *data)
 	 * We must acknowledge the irq before we move it or the acknowledge will
 	 * not propagate properly.
 	 */
-	ack_APIC_irq();
+	ack_APIC_irq(i);
 
 	/*
 	 * Tail end of clearing remote IRR bit (either by delivering the EOI
@@ -2067,7 +2067,8 @@ static void unmask_lapic_irq(struct irq_data *data)
 
 static void ack_lapic_irq(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static struct irq_chip lapic_chip __read_mostly = {
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd79..30643ff 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +335,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
-	irq_complete_move(irqd_cfg(data));
+	struct irq_cfg *cfg = irqd_cfg(data);
+
+	irq_complete_move(cfg);
 	irq_move_irq(data);
-	ack_APIC_irq();
+	ack_APIC_irq(cfg->vector);
 }
 
 /*
@@ -397,7 +399,7 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 {
 	unsigned vector, me;
 
-	ack_APIC_irq();
+	ack_APIC_irq(IRQ_MOVE_CLEANUP_VECTOR);
 
 	irq_enter();
 	exit_idle();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1..89bac1f 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -433,7 +433,7 @@ asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
 	__smp_thermal_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
@@ -442,7 +442,7 @@ asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	__smp_thermal_interrupt();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980..03068d1 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -28,7 +28,7 @@ asmlinkage __visible void smp_threshold_interrupt(void)
 {
 	entering_irq();
 	__smp_threshold_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_threshold_interrupt(void)
@@ -37,5 +37,5 @@ asmlinkage __visible void smp_trace_threshold_interrupt(void)
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	__smp_threshold_interrupt();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 67b1cbe..fc2b9a0 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -45,7 +45,8 @@ void ack_bad_irq(unsigned int irq)
 	 * completely.
 	 * But only ack when the APIC is enabled -AK
 	 */
-	ack_APIC_irq();
+	/* This doesn't know the vector. It needs the irq descriptor */
+	ack_APIC_irq(-1);
 }
 
 #define irq_stats(x)		(&per_cpu(irq_stat, x))
@@ -198,7 +199,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	irq = __this_cpu_read(vector_irq[vector]);
 
 	if (!handle_irq(irq, regs)) {
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
 		if (irq != VECTOR_RETRIGGERED) {
 			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector (irq %d)\n",
@@ -230,7 +231,7 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -244,7 +245,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	ack_APIC_irq();
+	ack_APIC_irq(POSTED_INTR_VECTOR);
 
 	irq_enter();
 
@@ -262,7 +263,7 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 15d741d..dfe9263 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -10,10 +10,10 @@
 #include <asm/apic.h>
 #include <asm/trace/irq_vectors.h>
 
-static inline void irq_work_entering_irq(void)
+static inline void irq_work_entering_irq(int vector)
 {
 	irq_enter();
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 static inline void __smp_irq_work_interrupt(void)
@@ -24,14 +24,14 @@ static inline void __smp_irq_work_interrupt(void)
 
 __visible void smp_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bd..44b74b2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -170,7 +170,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 
 asmlinkage __visible void smp_reboot_interrupt(void)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(REBOOT_VECTOR);
 	irq_enter();
 	stop_this_cpu(NULL);
 	irq_exit();
@@ -258,16 +258,16 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
 }
 
-static inline void smp_entering_irq(void)
+static inline void smp_entering_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 	irq_enter();
 }
 
@@ -279,7 +279,7 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
 	 * scheduler_ipi(). This is OK, since those functions are allowed
 	 * to nest.
 	 */
-	smp_entering_irq();
+	smp_entering_irq(RESCHEDULE_VECTOR);
 	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -297,14 +297,14 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -319,14 +319,14 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	exiting_irq();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 390079e..896c4b3 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -334,12 +334,14 @@ void panic_if_irq_remap(const char *msg)
 
 static void ir_ack_apic_edge(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static void ir_ack_apic_level(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 	eoi_ioapic_irq(data->irq, irqd_cfg(data));
 }

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 22:23 ` Chris J Arges
@ 2015-03-31 23:07 ` Linus Torvalds
  2015-04-01 14:32 ` Chris J Arges
  2015-04-01 12:43 ` Ingo Molnar
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-03-31 23:07 UTC (permalink / raw)
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):

Ok, thanks.

> [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
>
> [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
>
> [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]

.. snip snip ..

So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
irq setup, I'm not seeing that being something that your load should
trigger.

It could also obviously just be the vector being somehow corrupted,
either due to crazy hardware or software.

But quite frankly, the most likely reason is that whole irq vector movement.

Especially since it sounds from your other email that when you apply
Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
did you just grep for "move" in the messages?
If you do get both movement messages (from Ingo's patch) _and_ the
ack_APIC_irq warnings (from mine), it would be interesting to see if
the vectors line up somehow..

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 23:07 ` Linus Torvalds
@ 2015-04-01 14:32 ` Chris J Arges
  2015-04-01 15:36 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 14:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I had a few runs with your patch plus modifications, and got the following
> > results (modified patch inlined below):
>
> Ok, thanks.
>
> > [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
> >
> > [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> > [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> > [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
> >
> > [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> > [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> > [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> > [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
> .. snip snip ..
>
> So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
> irq setup, I'm not seeing that being something that your load should
> trigger.
>
> It could also obviously just be the vector being somehow corrupted,
> either due to crazy hardware or software.
>
> But quite frankly, the most likely reason is that whole irq vector movement.
>
> Especially since it sounds from your other email that when you apply
> Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
> did you just grep for "move" in the messages?
>
> If you do get both movement messages (from Ingo's patch) _and_ the
> ack_APIC_irq warnings (from mine), it would be interesting to see if
> the vectors line up somehow..
>
>                Linus
>

Linus,

I included the full patch in reply to Ingo's email, and when running with that
I no longer get the ack_APIC_irq WARNs.

My next homework assignments are:
- Testing with irqbalance disabled
- Testing w/ the appropriate dump_stack() in Ingo's patch
- L0 testing

Thanks,
--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 14:32 ` Chris J Arges
@ 2015-04-01 15:36 ` Linus Torvalds
  2015-04-02  9:55 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-04-01 15:36 UTC (permalink / raw)
To: Chris J Arges
Cc: Rafael David Tinoco, Ingo Molnar, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> I included the full patch in reply to Ingo's email, and when running with that
> I no longer get the ack_APIC_irq WARNs.

Ok. That means that the printk's themselves just change timing enough,
or change the compiler instruction scheduling so that it hides the
apic problem.

Which very much indicates that these things are interconnected.

For example, Ingo's printk patch does

        cfg->move_in_progress =
           cpumask_intersects(cfg->old_domain, cpu_online_mask);
+       if (cfg->move_in_progress)
+               pr_info("apic: vector %02x, same-domain move in progress\n",
+                       cfg->vector);
        cpumask_and(cfg->domain, cfg->domain, tmp_mask);

and that means that now the setting of move_in_progress is serialized
with the cpumask_and() in a way that it wasn't before.

And while the code takes the "vector_lock" and disables interrupts,
the interrupts themselves can happily continue on other cpu's, and
they don't take the vector_lock. Neither does send_cleanup_vector(),
which clears that bit, afaik.

I don't know. The locking there is odd.

> My next homework assignments are:
> - Testing with irqbalance disabled

Definitely.

> - Testing w/ the appropriate dump_stack() in Ingo's patch
> - L0 testing

Thanks,
               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 15:36 ` Linus Torvalds
@ 2015-04-02  9:55 ` Ingo Molnar
  2015-04-02 17:35 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-04-02 9:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
> >
> > I included the full patch in reply to Ingo's email, and when
> > running with that I no longer get the ack_APIC_irq WARNs.
>
> Ok. That means that the printk's themselves just change timing
> enough, or change the compiler instruction scheduling so that it
> hides the apic problem.

So another possibility would be that it's the third change causing
this change in behavior:

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
+	ack_APIC_irq();
+
+	/* Might generate IPIs, so do this after having ACKed the APIC: */
 	irq_complete_move(irqd_cfg(data));
 	irq_move_irq(data);
-	ack_APIC_irq();
 }
 
 /*

... since with this we won't send IPIs in a semi-nested fashion with
an unacked APIC, which is a good idea to do in general. It's also a
weird enough hardware pattern that virtualization's APIC emulation
might get it slightly wrong or slightly different.

> Which very much indicates that these things are interconnected.
>
> For example, Ingo's printk patch does
>
>         cfg->move_in_progress =
>            cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +       if (cfg->move_in_progress)
> +               pr_info("apic: vector %02x, same-domain move in progress\n",
> +                       cfg->vector);
>         cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>
> and that means that now the setting of move_in_progress is
> serialized with the cpumask_and() in a way that it wasn't before.

Yeah, that's a possibility too. It all looks very fragile.

Thanks,

	Ingo

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-02  9:55 ` Ingo Molnar
@ 2015-04-02 17:35 ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-04-02 17:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Thu, Apr 2, 2015 at 2:55 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So another possibility would be that it's the third change causing
> this change in behavior:

Oh, yes, that looks much more likely. I overlooked that small change entirely.

> ... since with this we won't send IPIs in a semi-nested fashion with
> an unacked APIC, which is a good idea to do in general. It's also a
> weird enough hardware pattern that virtualization's APIC emulation
> might get it slightly wrong or slightly different.

Yup, that's more likely than the subtle timing/reordering differences
of just the added debug printouts. I agree.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-31 22:23 ` Chris J Arges
  2015-03-31 23:07 ` Linus Torvalds
@ 2015-04-01 12:43 ` Ingo Molnar
  2015-04-01 16:10 ` Chris J Arges
  1 sibling, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-04-01 12:43 UTC (permalink / raw)
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Linus,
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):
>
> [   14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> [  176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
>
> [   17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> [  182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> [  202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> [  484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
>
> [   15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> [  304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> [  315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> [  360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
>
> [   15.026077] ack_APIC_irq: vector = b1, irq = ffffffff
> [  374.828531] ack_APIC_irq: vector = c1, irq = ffffffff
> [  402.965942] ack_APIC_irq: vector = d1, irq = ffffffff
> [  434.540814] ack_APIC_irq: vector = e1, irq = ffffffff
> [  461.820768] ack_APIC_irq: vector = 22, irq = ffffffff
> [  536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]
>
> [   17.889334] ack_APIC_irq: vector = d1, irq = ffffffff
> [  291.888784] ack_APIC_irq: vector = e1, irq = ffffffff
> [  297.824627] ack_APIC_irq: vector = 22, irq = ffffffff
> [  336.960594] ack_APIC_irq: vector = 42, irq = ffffffff
> [  367.012706] ack_APIC_irq: vector = 52, irq = ffffffff
> [  377.025090] ack_APIC_irq: vector = 62, irq = ffffffff
> [  417.088773] ack_APIC_irq: vector = 72, irq = ffffffff
> [  447.136788] ack_APIC_irq: vector = 82, irq = ffffffff
> -- stopped it since it wasn't reproducing / I was impatient --
>
> So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
> (vector >= 0) didn't seem to expose any additional vectors.

So, these vectors do seem to be lining up with the pattern of how new
irq vectors get assigned and how we slowly rotate through all
available ones.

The VECTOR_UNDEFINED might correspond to the fact that we already
'freed' that vector, as part of the irq-move mechanism - but it was
likely in use shortly before. So the irq-move code is not off the
hook, to the contrary.

Have you already tested whether the hang goes away if you remove
irq-affinity fiddling daemons from the system? Do you have irqbalance
installed or similar mechanisms?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 12:43 ` Ingo Molnar
@ 2015-04-01 16:10 ` Chris J Arges
  2015-04-01 16:14 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 16:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote:
<snip>
> Have you already tested whether the hang goes away if you remove
> irq-affinity fiddling daemons from the system? Do you have irqbalance
> installed or similar mechanisms?
>
> Thanks,
>
>	Ingo
>

Even with irqbalance removed from the L0/L1 machines the hang still occurs.

Using the debug patch as described here:
https://lkml.org/lkml/2015/4/1/338

This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.

--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 16:10 ` Chris J Arges
@ 2015-04-01 16:14 ` Linus Torvalds
  2015-04-01 21:59 ` Chris J Arges
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-04-01 16:14 UTC (permalink / raw)
To: Chris J Arges
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>
> This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.

Ok. So the ack_APIC debug patch found *something*, but it seems to be
unrelated to the hang.

Dang. Oh well. Back to square one.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-04-01 16:14 ` Linus Torvalds
@ 2015-04-01 21:59 ` Chris J Arges
  2015-04-02 17:31 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris J Arges @ 2015-04-01 21:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers

On 04/01/2015 11:14 AM, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
> <chris.j.arges@canonical.com> wrote:
>>
>> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>>
>> This results in no 'apic: vector* or 'ack_APIC*' messages being displayed.
>
> Ok. So the ack_APIC debug patch found *something*, but it seems to be
> unrelated to the hang.
>
> Dang. Oh well. Back to square one.
>
>                Linus
>

With my L0 testing I've normally used a 3.13 series kernel since it tends to
reproduce the hang very quickly with the testcase. Note we have reproduced an
identical hang with newer kernels (3.19+patches) using the openstack tempest
on openstack reproducer, but the timing can vary between hours and days.
Installing a v4.0-rc6+patch kernel on L0 makes the problem very slow to
reproduce, so I am running these tests now which may take day(s).

Is it worthwhile to do a 'bisect' to see where on average it takes longer to
reproduce? Perhaps it will point to a relevant change, or it may be
completely useless.

--chris

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-01 21:59 ` Chris J Arges @ 2015-04-02 17:31 ` Linus Torvalds 2015-04-02 18:26 ` Ingo Molnar 2015-04-06 17:23 ` Chris J Arges 0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-04-02 17:31 UTC (permalink / raw)
To: Chris J Arges
Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers

On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
<chris.j.arges@canonical.com> wrote:
>
> Is it worthwhile to do a 'bisect' to see where on average it takes
> longer to reproduce? Perhaps it will point to a relevant change, or it
> may be completely useless.

It's likely to be an exercise in futility. "git bisect" is really bad
at "gray area" things, and when it's a question of "it takes hours or
days to reproduce", it's almost certainly not worth it. Not unless
there is some really clear cut-off that we can believably say "this
causes it to get much slower". And in this case, I don't think it's
that clear-cut. Judging by DaveJ's attempts at bisecting things, the
timing just changes. And the differences might be due to entirely
unrelated changes like cacheline alignment etc.

So unless we find a real clear signature of the bug (I was hoping that
the ISR bit would be that sign), I don't think trying to bisect it
based on how quickly you can reproduce things is worthwhile.

                 Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 17:31 ` Linus Torvalds @ 2015-04-02 18:26 ` Ingo Molnar 2015-04-02 18:51 ` Chris J Arges 2015-04-06 17:23 ` Chris J Arges 1 sibling, 1 reply; 54+ messages in thread From: Ingo Molnar @ 2015-04-02 18:26 UTC (permalink / raw) To: Linus Torvalds Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers * Linus Torvalds <torvalds@linux-foundation.org> wrote: > So unless we find a real clear signature of the bug (I was hoping > that the ISR bit would be that sign), I don't think trying to bisect > it based on how quickly you can reproduce things is worthwhile. So I'm wondering (and I might have missed some earlier report that outlines just that), now that the possible location of the bug is again sadly up to 15+ million lines of code, I have no better idea than to debug by symptoms again: what kind of effort was made to examine the locked up state itself? Softlockups always have some direct cause, which task exactly causes scheduling to stop altogether, why does it lock up - or is it not a clear lockup, just a very slow system? Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 18:26 ` Ingo Molnar @ 2015-04-02 18:51 ` Chris J Arges 2015-04-02 19:07 ` Ingo Molnar 0 siblings, 1 reply; 54+ messages in thread From: Chris J Arges @ 2015-04-02 18:51 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On 04/02/2015 01:26 PM, Ingo Molnar wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> So unless we find a real clear signature of the bug (I was hoping >> that the ISR bit would be that sign), I don't think trying to bisect >> it based on how quickly you can reproduce things is worthwhile. > > So I'm wondering (and I might have missed some earlier report that > outlines just that), now that the possible location of the bug is > again sadly up to 15+ million lines of code, I have no better idea > than to debug by symptoms again: what kind of effort was made to > examine the locked up state itself? > Ingo, Rafael did some analysis when I was out earlier here: https://lkml.org/lkml/2015/2/23/234 My reproducer setup is as follows: L0 - 8-way CPU, 48 GB memory L1 - 2-way vCPU, 4 GB memory L2 - 1-way vCPU, 1 GB memory Stress is only run in the L2 VM, and running top on L0/L1 doesn't show excessive load. > Softlockups always have some direct cause, which task exactly causes > scheduling to stop altogether, why does it lock up - or is it not a > clear lockup, just a very slow system? > > Thanks, > > Ingo > Whenever we look through the crashdump we see csd_lock_wait waiting for CSD_FLAG_LOCK bit to be cleared. 
Usually the signature leading up to that looks like the following (in the openstack tempest on openstack and nested VM stress case) (qemu-system-x86 task) kvm_sched_in -> kvm_arch_vcpu_load -> vmx_vcpu_load -> loaded_vmcs_clear -> smp_call_function_single (ksmd task) pmdp_clear_flush -> flush_tlb_mm_range -> native_flush_tlb_others -> smp_call_function_many --chris ^ permalink raw reply [flat|nested] 54+ messages in thread
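The csd_lock_wait spin Chris describes comes from the CSD locking protocol in kernel/smp.c: the sender sets CSD_FLAG_LOCK before queueing the call, and the target CPU's IPI handler clears it after running the function. A minimal user-space sketch of that protocol (a simplified illustration with kernel-style names, not the kernel code itself):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch of the csd locking protocol from kernel/smp.c.
 * CSD_FLAG_LOCK is set by the sender before queueing the call and
 * cleared by the target CPU once the function has run; the sender's
 * csd_lock_wait() spins until that happens. If the IPI never arrives,
 * this spin is exactly the lockup signature seen in the crashdumps.
 * Names mirror the kernel but this is a simplified model.
 */
#define CSD_FLAG_LOCK 0x01u

struct call_single_data_sketch {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

static void csd_lock(struct call_single_data_sketch *csd)
{
	/* csd_lock_wait(): wait for any previous user of this slot */
	while (atomic_load(&csd->flags) & CSD_FLAG_LOCK)
		;	/* spin */
	atomic_fetch_or(&csd->flags, CSD_FLAG_LOCK);
}

static void csd_unlock(struct call_single_data_sketch *csd)
{
	/* done by the target CPU; releases the spinning sender */
	atomic_fetch_and(&csd->flags, ~CSD_FLAG_LOCK);
}

/* what the target CPU does when the IPI actually arrives */
static void csd_ipi_handler(struct call_single_data_sketch *csd)
{
	csd->func(csd->info);
	csd_unlock(csd);
}
```

If the CALL_FUNCTION_SINGLE IPI is lost, csd_unlock() never runs and the sender spins in csd_lock() / csd_lock_wait() forever, which is what the soft lockup watchdog then reports.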
* Re: smp_call_function_single lockups 2015-04-02 18:51 ` Chris J Arges @ 2015-04-02 19:07 ` Ingo Molnar 2015-04-02 20:57 ` Linus Torvalds 2015-04-02 21:13 ` Chris J Arges 0 siblings, 2 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-04-02 19:07 UTC (permalink / raw)
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers

* Chris J Arges <chris.j.arges@canonical.com> wrote:

> Whenever we look through the crashdump we see csd_lock_wait waiting
> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading
> up to that looks like the following (in the openstack tempest on
> openstack and nested VM stress case)
>
> (qemu-system-x86 task)
> kvm_sched_in
>  -> kvm_arch_vcpu_load
>   -> vmx_vcpu_load
>    -> loaded_vmcs_clear
>     -> smp_call_function_single
>
> (ksmd task)
> pmdp_clear_flush
>  -> flush_tlb_mm_range
>   -> native_flush_tlb_others
>    -> smp_call_function_many

So is this two separate smp_call_function instances, crossing each
other, and neither makes any progress, indefinitely - as if the two
IPIs got lost?

The traces Rafael linked to show a simpler scenario with two CPUs
apparently locked up, doing this:

CPU0:

 #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
 #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
 #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
 #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
 #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
#10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
#11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
#12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c

This CPU is idle.

CPU1:

#10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
#11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
#12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
#13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
#14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
#15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
#16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
#17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
#18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
#19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad

This CPU is busy-waiting for the TLB flush IPI to finish.

There's no unexpected pattern here (other than it not finishing),
AFAICS: the smp_call_function_single() is just the usual way we invoke
the TLB flushing methods.

So one possibility would be that an 'IPI was sent but lost'.

We could try the following trick: poll for completion for a couple of
seconds (since an IPI is not held up by anything but irqs-off
sections, it should arrive within microseconds typically - seconds of
polling should be more than enough), and if the IPI does not arrive,
print a warning message and re-send the IPI.

If the IPI was lost due to some race and there's no other failure mode
that we don't understand, then this would work around the bug and
would make the tests pass indefinitely - with occasional hiccups and a
handful of messages produced along the way whenever it would have
locked up with a previous kernel.

If testing indeed confirms that kind of behavior we could drill down
more closely to figure out why the IPI did not get to its destination.

Or if the behavior is different, we'd have some new behavior to look
at. (for example the IPI sending mechanism might be wedged
indefinitely for some reason, so that even a resend won't work.)

Agreed?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
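The poll-and-resend trick Ingo proposes can be sketched in user space: instead of spinning on CSD_FLAG_LOCK forever, give up after a bounded number of iterations, warn, and re-send the IPI. This is a hypothetical single-threaded model, not kernel code - the `resend` callback stands in for re-issuing the real IPI:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sketch of the poll-and-resend idea: csd_lock_wait_timeout() spins on
 * CSD_FLAG_LOCK as usual, but after max_spins iterations it assumes
 * the IPI was lost, warns, and re-sends it. In this toy model a
 * "resend that arrives" simply results in the flag being cleared, as
 * the real handler would do. All names here are hypothetical.
 */
#define CSD_FLAG_LOCK 0x01u

struct csd { unsigned int flags; };

/* returns the number of times the IPI had to be re-sent */
static int csd_lock_wait_timeout(struct csd *csd,
				 unsigned long max_spins,
				 void (*resend)(struct csd *))
{
	unsigned long spins = 0;
	int resends = 0;

	while (csd->flags & CSD_FLAG_LOCK) {
		if (++spins >= max_spins) {
			fprintf(stderr, "csd: IPI lost? re-sending\n");
			resend(csd);
			resends++;
			spins = 0;
		}
	}
	return resends;
}

/* model of a resent IPI that does arrive: the handler runs and unlocks */
static void resend_that_arrives(struct csd *csd)
{
	csd->flags &= ~CSD_FLAG_LOCK;
}
```

If the lost-IPI theory holds, a kernel version of this would turn a hard lockup into an occasional warning message plus a resend; if the sending path itself is wedged, the resend would fail too and that would also be informative.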
* Re: smp_call_function_single lockups 2015-04-02 19:07 ` Ingo Molnar @ 2015-04-02 20:57 ` Linus Torvalds 2015-04-02 21:13 ` Chris J Arges 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2015-04-02 20:57 UTC (permalink / raw) To: Ingo Molnar Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Thu, Apr 2, 2015 at 12:07 PM, Ingo Molnar <mingo@kernel.org> wrote: > > So one possibility would be that an 'IPI was sent but lost'. Yes, the "sent but lost" thing would certainly explain the lockups. At the same time, that sounds like a huge hardware bug, and that's somewhat surprising/unlikely. That said. > We could try the following trick: poll for completion for a couple of > seconds (since an IPI is not held up by anything but irqs-off > sections, it should arrive within microseconds typically - seconds of > polling should be more than enough), and if the IPI does not arrive, > print a warning message and re-send the IPI. Sounds like a reasonable approach. At worst it doesn't fix anything, and we never see any messages, and that tells us something too. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 19:07 ` Ingo Molnar 2015-04-02 20:57 ` Linus Torvalds @ 2015-04-02 21:13 ` Chris J Arges 2015-04-03 5:45 ` Ingo Molnar 1 sibling, 1 reply; 54+ messages in thread From: Chris J Arges @ 2015-04-02 21:13 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > * Chris J Arges <chris.j.arges@canonical.com> wrote: > >> Whenever we look through the crashdump we see csd_lock_wait waiting >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading >> up to that looks like the following (in the openstack tempest on >> openstack and nested VM stress case) >> >> (qemu-system-x86 task) >> kvm_sched_in >> -> kvm_arch_vcpu_load >> -> vmx_vcpu_load >> -> loaded_vmcs_clear >> -> smp_call_function_single >> >> (ksmd task) >> pmdp_clear_flush >> -> flush_tlb_mm_range >> -> native_flush_tlb_others >> -> smp_call_function_many > > So is this two separate smp_call_function instances, crossing each > other, and none makes any progress, indefinitely - as if the two IPIs > got lost? > This is two different crash signatures. Sorry for the confusion. > The traces Rafael he linked to show a simpler scenario with two CPUs > apparently locked up, doing this: > > CPU0: > > #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386 > #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee > #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f > #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563 > #9 [ffffffff81c03f30] rest_init at ffffffff817a6067 > #10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce > #11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7 > #12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c > > This CPU is idle. 
> > CPU1: > > #10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69 > #11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae > #12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4 > #13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d > #14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39 > #15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65 > #16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03 > #17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08 > #18 [ffff88081993fe80] sys_madvise at ffffffff811b9775 > #19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad > > This CPU is busy-waiting for the TLB flush IPI to finish. > > There's no unexpected pattern here (other than it not finishing) > AFAICS, the smp_call_function_single() is just the usual way we invoke > the TLB flushing methods AFAICS. > > So one possibility would be that an 'IPI was sent but lost'. > > We could try the following trick: poll for completion for a couple of > seconds (since an IPI is not held up by anything but irqs-off > sections, it should arrive within microseconds typically - seconds of > polling should be more than enough), and if the IPI does not arrive, > print a warning message and re-send the IPI. > > If the IPI was lost due to some race and there's no other failure mode > that we don't understand, then this would work around the bug and > would make the tests pass indefinitely - with occasional hickups and a > handful of messages produced along the way whenever it would have > locked up with a previous kernel. > > If testing indeed confirms that kind of behavior we could drill down > more closely to figure out why the IPI did not get to its destination. > > Or if the behavior is different, we'd have some new behavior to look > at. (for example the IPI sending mechanism might be wedged > indefinitely for some reason, so that even a resend won't work.) > > Agreed? 
> > Thanks, > > Ingo > Ingo, I think tracking IPI calls from 'generic_exec_single' would make a lot of sense. When you say poll for completion do you mean a loop after 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern would be to not alter the timings too much so we can still reproduce the original problem. Another approach: If we want to check for non-ACKed IPIs a possibility would be to add a timestamp field to 'struct call_single_data' and just record jiffies when the IPI gets called. Then have a per-cpu kthread check the 'call_single_queue' percpu list periodically if (jiffies - timestamp) > THRESHOLD. When we reach that condition print the stale entry in call_single_queue, backtrace, then re-send the IPI. Let me know what makes the most sense to hack on. Thanks, --chris ^ permalink raw reply [flat|nested] 54+ messages in thread
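Chris's second approach - timestamping each queued csd and periodically scanning for stale entries - might look like this in outline. This is a user-space sketch: the array stands in for the per-cpu call_single_queue llist, and the field and threshold names are made up:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the stale-IPI detector Chris proposes: record jiffies in
 * each queued call_single_data at enqueue time, and have a periodic
 * checker scan for entries that have been pending longer than a
 * threshold - candidates for a backtrace and an IPI resend. The
 * enqueue_jiffies field and CSD_STALE_JIFFIES value are hypothetical.
 */
#define CSD_FLAG_LOCK		0x01u
#define CSD_STALE_JIFFIES	1000	/* made-up threshold */

struct csd_stamped {
	unsigned int flags;
	unsigned long enqueue_jiffies;	/* proposed new field */
};

/*
 * Return the index of the first entry still locked past the
 * threshold, or -1 if nothing looks stuck.
 */
static int find_stale_csd(const struct csd_stamped *q, size_t n,
			  unsigned long now)
{
	for (size_t i = 0; i < n; i++) {
		if ((q[i].flags & CSD_FLAG_LOCK) &&
		    now - q[i].enqueue_jiffies > CSD_STALE_JIFFIES)
			return (int)i;
	}
	return -1;
}
```

Compared with polling inside the sender, this keeps the hot path untouched (one extra store at enqueue time), which addresses Chris's concern about not perturbing the timings that make the bug reproducible.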
* Re: smp_call_function_single lockups 2015-04-02 21:13 ` Chris J Arges @ 2015-04-03 5:45 ` Ingo Molnar 0 siblings, 0 replies; 54+ messages in thread From: Ingo Molnar @ 2015-04-03 5:45 UTC (permalink / raw) To: Chris J Arges Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers * Chris J Arges <chris.j.arges@canonical.com> wrote: > > > On 04/02/2015 02:07 PM, Ingo Molnar wrote: > > > > * Chris J Arges <chris.j.arges@canonical.com> wrote: > > > >> Whenever we look through the crashdump we see csd_lock_wait waiting > >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading > >> up to that looks like the following (in the openstack tempest on > >> openstack and nested VM stress case) > >> > >> (qemu-system-x86 task) > >> kvm_sched_in > >> -> kvm_arch_vcpu_load > >> -> vmx_vcpu_load > >> -> loaded_vmcs_clear > >> -> smp_call_function_single > >> > >> (ksmd task) > >> pmdp_clear_flush > >> -> flush_tlb_mm_range > >> -> native_flush_tlb_others > >> -> smp_call_function_many > > > > So is this two separate smp_call_function instances, crossing each > > other, and none makes any progress, indefinitely - as if the two IPIs > > got lost? > > > > This is two different crash signatures. Sorry for the confusion. So just in case, both crash signatures ought to be detected by the patch I just sent. Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-04-02 17:31 ` Linus Torvalds 2015-04-02 18:26 ` Ingo Molnar @ 2015-04-06 17:23 ` Chris J Arges 1 sibling, 0 replies; 54+ messages in thread From: Chris J Arges @ 2015-04-06 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers On Thu, Apr 02, 2015 at 10:31:50AM -0700, Linus Torvalds wrote: > On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges > <chris.j.arges@canonical.com> wrote: > > > > It is worthwhile to do a 'bisect' to see where on average it takes > > longer to reproduce? Perhaps it will point to a relevant change, or it > > may be completely useless. > > It's likely to be an exercise in futility. "git bisect" is realyl bad > at "gray area" things, and when it's a question of "it takes hours or > days to reproduce", it's almost certainly not worth it. Not unless > there is some really clear cut-off that we can believably say "this > causes it to get much slower". And in this case, I don't think it's > that clear-cut. Judging by DaveJ's attempts at bisecting things, the > timing just changes. And the differences might be due to entirely > unrelated changes like cacheline alignment etc. > > So unless we find a real clear signature of the bug (I was hoping that > the ISR bit would be that sign), I don't think trying to bisect it > based on how quickly you can reproduce things is worthwhile. > > Linus > Linus, Ingo, I did some testing and found that at the following patch level, the issue was much, much more likely to reproduce within < 15m. commit b6b8a1451fc40412c57d10c94b62e22acab28f94 Author: Jan Kiszka <jan.kiszka@siemens.com> Date: Fri Mar 7 20:03:12 2014 +0100 KVM: nVMX: Rework interception of IRQs and NMIs Move the check for leaving L2 on pending and intercepted IRQs or NMIs from the *_allowed handler into a dedicated callback. 
Invoke this callback at the relevant points before KVM checks if IRQs/NMIs
can be injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

However, when the following patch was applied, the average time to
reproduction went up greatly (the stress reproducer ran for hours without
issue):

commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3
Author: Bandan Das <bsd@redhat.com>
Date:   Tue Jul 8 00:30:23 2014 -0400

    KVM: x86: Check for nested events if there is an injectable interrupt

    With commit b6b8a1451fc40412c57d1 that introduced
    vmx_check_nested_events, checks for injectable interrupts happen
    at different points in time for L1 and L2 that could potentially
    cause a race. The regression occurs because KVM_REQ_EVENT is always
    set when nested_run_pending is set even if there's no pending interrupt.
    Consequently, there could be a small window when check_nested_events
    returns without exiting to L1, but an interrupt comes through soon
    after and it incorrectly, gets injected to L2 by inject_pending_event
    Fix this by adding a call to check for nested events too when a check
    for injectable interrupt returns true

However, we reproduced with v3.19 (containing these two patches) which did
eventually softlockup with a similar backtrace. So far, this agrees with the
current understanding that we may not be ACK'ing certain interrupts (IPIs
from the L1 guest), causing csd_lock_wait to spin and causing the soft
lockup.

Hopefully this helps shed more light on this issue.

Thanks,
--chris j arges

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups 2015-02-19 20:29 ` Linus Torvalds 2015-02-19 21:59 ` Linus Torvalds @ 2015-02-20 9:30 ` Ingo Molnar 2015-02-20 16:49 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Ingo Molnar @ 2015-02-20 9:30 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> Are there known errata for the x2apic? > > > > .. and in particular, do we still have to worry about > > the traditional local apic "if there are more than two > > pending interrupts per priority level, things get lost" > > problem? > > > > I forget the exact details. Hopefully somebody > > remembers. > > I can't find it in the docs. I find the "two-entries per > vector", but not anything that is per priority level > (group of 16 vectors). Maybe that was the IO-APIC, in > which case it's immaterial for IPI's. So if my memory serves me right, I think it was for local APICs, and even there mostly it was a performance issue: if an IO-APIC sent more than 2 IRQs per 'level' to a local APIC then the IO-APIC might be forced to resend those IRQs, leading to excessive message traffic on the relevant hardware bus. ( I think the 'resend' was automatic in this case, i.e. a hardware fallback for a CPU side resource shortage, and it could not result in actually lost IRQs. I never saw this documented properly, so people inside Intel or AMD would be in a better position to comment on this ... I might be mis-remembering this or confusing different bugs. ) > However, having now mostly re-acquainted myself with the > APIC details, it strikes me that we do have some oddities > here. 
> > In particular, a few interrupt types are very special: > NMI, SMI, INIT, ExtINT, or SIPI are handled early in the > interrupt acceptance logic, and are sent directly to the > CPU core, without going through the usual intermediate > IRR/ISR dance. > > And why might this matter? It's important because it > means that those kinds of interrupts must *not* do the > apic EOI that ack_APIC_irq() does. > > And we correctly don't do ack_APIC_irq() for NMI etc, but > it strikes me that ExtINT is odd and special. > > I think we still use ExtINT for some odd cases. We used > to have some magic with the legacy timer interrupt, for > example. And I think they all go through the normal > "do_IRQ()" logic regardless of whether they are ExtINT or > not. > > Now, what happens if we send an EOI for an ExtINT > interrupt? It basically ends up being a spurious IPI. And > I *think* that what normally happens is absolutely > nothing at all. But if in addition to the ExtINT, there > was a pending IPI (or other pending ISR bit set), maybe > we lose interrupts.. 1) I think you got it right. So the principle of EOI acknowledgement from the OS to the local APIC is specific to the IRQ that raised the interrupt and caused the vector to be executed, so it's not possible to ack the 'wrong' IRQ. But technically the EOI is state-less, i.e. (as you know) we write a constant value to a local APIC register without indicating which vector or external IRQ we meant. The OS wants to ack 'the IRQ that we are executing currently', but this leaves the situation a bit confused in cases where for example an IRQ handler enables IRQs, another IRQ comes in and stays unacked. So I _think_ it's not possible to accidentally acknowledge a pending IRQ that has not been issued to the CPU yet (unless we have hardirqs enabled), just by writing stray EOIs to the local APIC. So in that sense the ExtInt irq0 case should be mostly harmless. 
But I could be wrong :-/ 2) So my suggestion for this bug would be: The 'does a stray EOI matter' question could also be tested by deliberately writing two EOIs instead of just one - does this trigger the bug faster? Then perhaps try to make sure that no hardirqs get ever enabled in an irq handler, and figure out whether any of the IRQs in question are edge triggered - but AFAICS it could be 'any' IRQ handler or flow causing the problem, right? 3) I also fully share your frustration about the level of obfuscation the various APIC drivers display today. The lack of a simple single-IPI implementation is annoying as well - when that injury was first inflicted with clustered APICs I tried to resist, but AFAICR there were some good hardware arguments why it cannot be kept and I gave up. If you agree then I can declare a feature stop for new hardware support (that isn't a stop-ship issue for users) until it's all cleaned up for real, and Thomas started some of that work already. > .. and it's entirely possible that I'm just completely > full of shit. Who is the poor bastard who has worked most > with things like ExtINT, and can educate me? I'm adding > Ingo, hpa and Jiang Liu as primary contacts.. So the buck stops at my desk, but any help is welcome! Thanks, Ingo ^ permalink raw reply [flat|nested] 54+ messages in thread
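Ingo's point that the EOI write is stateless can be modelled directly: the write to the EOI register carries no vector at all, and (per the SDM) the local APIC responds by clearing the highest-priority bit currently set in the 256-bit In-Service Register. A toy model of just that mechanism:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of local APIC EOI semantics: the In-Service Register is a
 * bitmap over vectors 0..255; an EOI write names no vector, so the
 * APIC clears the highest-numbered (highest-priority) bit that is
 * set. This is what makes a stray EOI "stateless" - it acks whatever
 * happens to be in service, and is a no-op if nothing is.
 */
struct lapic_isr {
	uint64_t bits[4];	/* 256 vectors */
};

static void isr_set(struct lapic_isr *isr, unsigned int vec)
{
	isr->bits[vec / 64] |= 1ULL << (vec % 64);
}

/* Returns the vector that was EOI'd, or -1 if nothing was in service. */
static int lapic_eoi(struct lapic_isr *isr)
{
	for (int vec = 255; vec >= 0; vec--) {
		if (isr->bits[vec / 64] & (1ULL << (vec % 64))) {
			isr->bits[vec / 64] &= ~(1ULL << (vec % 64));
			return vec;
		}
	}
	return -1;	/* stray EOI: no effect on the ISR */
}
```

In this model, Ingo's suggested experiment of writing two EOIs instead of one only matters if a second ISR bit was set underneath - which is exactly the question about when the hardware sets the ISR bit relative to interrupt acceptance.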
* Re: smp_call_function_single lockups 2015-02-20 9:30 ` Ingo Molnar @ 2015-02-20 16:49 ` Linus Torvalds 2015-02-20 19:41 ` Ingo Molnar 0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-20 16:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges, the arch/x86 maintainers

[-- Attachment #1: Type: text/plain, Size: 3815 bytes --]

On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
thing.

> I think you got it right.
>
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actually should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.
It must clearly be mostly harmless even if I'm right, since we've always done the ACK for it, as far as I can tell. So you're probably right, and it doesn't matter. In particular, it would depend on exactly when the ISR bit actually gets set in the APIC. If it gets set only after the CPU has actually *accepted* the interrupt, then it should not be possible for a spurious EOI write to screw things up, because it's still happening in the context of the previously executing interrupt, so a new ISR bit hasn't been set yet, and any old ISR bits must be lower-priority than the currently executing interrupt routine (that were interrupted by a higher-priority one). That sounds like the likely scenario. It just worries me. And my test patch did actually trigger, even if it only seemed to trigger during device probing (when the probing itself probably caused and then cleared an interrupt). > I also fully share your frustration about the level of > obfuscation the various APIC drivers display today. > > The lack of a simple single-IPI implementation is annoying > as well - when that injury was first inflicted with > clustered APICs I tried to resist, but AFAICR there were > some good hardware arguments why it cannot be kept and I > gave up. > > If you agree then I can declare a feature stop for new > hardware support (that isn't a stop-ship issue for users) > until it's all cleaned up for real, and Thomas started some > of that work already. Well, the attached patch for that seems pretty trivial. And seems to work for me (my machine also defaults to x2apic clustered mode), and allows the APIC code to start doing a "send to specific cpu" thing one by one, since it falls back to the send_IPI_mask() function if no individual CPU IPI function exists. NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't that important, so I didn't bother with them. NOTE2! 
I've tested this, and it seems to work, but maybe there is something
seriously wrong. I skipped the "disable interrupts" part when doing
the "send_IPI", for example, because I think it's entirely unnecessary
for that case. But this has certainly *not* gotten any real
stress-testing.

But /proc/interrupts shows thousands of rescheduling IPI's, so this
should all have triggered.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2793 bytes --]

 arch/x86/include/asm/apic.h           |  1 +
 arch/x86/kernel/apic/x2apic_cluster.c |  9 +++++++++
 arch/x86/kernel/smp.c                 | 16 ++++++++++++++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..5d47ffcfa65d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -296,6 +296,7 @@ struct apic {
 					      unsigned int *apicid);
 
 	/* ipi */
+	void (*send_IPI)(int cpu, int vector);
 	void (*send_IPI_mask)(const struct cpumask *mask, int vector);
 	void (*send_IPI_mask_allbutself)(const struct cpumask *mask,
 					 int vector);
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index e658f21681c8..177c4fb027a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -23,6 +23,14 @@ static inline u32 x2apic_cluster(int cpu)
 	return per_cpu(x86_cpu_to_logical_apicid, cpu) >> 16;
 }
 
+static void x2apic_send_IPI(int cpu, int vector)
+{
+	u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
+
+	x2apic_wrmsr_fence();
+	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+}
+
 static void
 __x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
 {
@@ -266,6 +274,7 @@ static struct apic apic_x2apic_cluster = {
 
 	.cpu_mask_to_apicid_and		= x2apic_cpu_mask_to_apicid_and,
 
+	.send_IPI			= x2apic_send_IPI,
 	.send_IPI_mask			= x2apic_send_IPI_mask,
 	.send_IPI_mask_allbutself	= x2apic_send_IPI_mask_allbutself,
 	.send_IPI_allbutself		= x2apic_send_IPI_allbutself,
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..78c9b12d892a 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -114,6 +114,18 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Helper wrapper: not all apic definitions support sending to
+ * a single CPU, so we fall back to sending to a mask.
+ */
+static void send_IPI_cpu(int cpu, int vector)
+{
+	if (apic->send_IPI)
+		apic->send_IPI(cpu, vector);
+	else
+		apic->send_IPI_mask(cpumask_of(cpu), vector);
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...
@@ -124,12 +136,12 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
-	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
+	send_IPI_cpu(cpu, RESCHEDULE_VECTOR);
 }
 
 void native_send_call_func_single_ipi(int cpu)
 {
-	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
+	send_IPI_cpu(cpu, CALL_FUNCTION_SINGLE_VECTOR);
 }
 
 void native_send_call_func_ipi(const struct cpumask *mask)

^ permalink raw reply related	[flat|nested] 54+ messages in thread
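The core of the patch is the optional-method-with-fallback dispatch in send_IPI_cpu(): an apic driver may or may not provide a single-CPU .send_IPI method, and callers fall back to .send_IPI_mask with a one-CPU mask when it is absent. The shape of that dispatch can be exercised in isolation with a user-space analogue (stub ops and counters stand in for the real register writes; everything here is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space analogue of the patch's send_IPI_cpu() wrapper. A driver
 * that implements the new single-CPU hook gets called directly; a
 * driver that only has the mask-based method gets a one-bit mask.
 */
struct apic_ops {
	void (*send_IPI)(int cpu, int vector);		/* optional */
	void (*send_IPI_mask)(unsigned long mask, int vector);
};

static int single_sent, mask_sent;

static void stub_send_IPI(int cpu, int vector)
{
	(void)cpu; (void)vector;
	single_sent++;		/* counts direct single-CPU sends */
}

static void stub_send_IPI_mask(unsigned long mask, int vector)
{
	(void)mask; (void)vector;
	mask_sent++;		/* counts mask-based fallback sends */
}

static void send_IPI_cpu(const struct apic_ops *apic, int cpu, int vector)
{
	if (apic->send_IPI)
		apic->send_IPI(cpu, vector);
	else
		apic->send_IPI_mask(1UL << cpu, vector);
}
```

The benefit on the kernel side is that the single-CPU path skips all the cpumask generation and walking that the mask-based path drags in - the shortcut Daniel Blueman mentions for large core counts.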
* Re: smp_call_function_single lockups
  2015-02-20 16:49       ` Linus Torvalds
@ 2015-02-20 19:41        ` Ingo Molnar
  2015-02-20 20:03          ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2015-02-20 19:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So if my memory serves me right, I think it was for
> > local APICs, and even there mostly it was a performance
> > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > to a local APIC then the IO-APIC might be forced to
> > resend those IRQs, leading to excessive message traffic
> > on the relevant hardware bus.
>
> Hmm. I have a distinct memory of interrupts actually
> being lost, but I really can't find anything to support
> that memory, so it's probably some drug-induced confusion
> of mine. I don't find *anything* about interrupt "levels"
> any more in modern Intel documentation on the APIC, but
> maybe I missed something. But it might all have been an
> IO-APIC thing.

So I just found an older discussion of it:

  http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815

While it's not a comprehensive description, it matches what
I remember from it: with 3 vectors within a level of 16
vectors we'd get excessive "retries" sent by the IO-APIC
through the (then rather slow) APIC bus.

( It was possible for the same phenomenon to occur with
  IPIs as well, when a CPU sent an APIC message to another
  CPU, if the affected vectors were equal modulo 16 - but
  this was rare IIRC because most systems were dual CPU so
  only two IPIs could have occurred. )

> Well, the attached patch for that seems pretty trivial.
> And seems to work for me (my machine also defaults to
> x2apic clustered mode), and allows the APIC code to start
> doing a "send to specific cpu" thing one by one, since it
> falls back to the send_IPI_mask() function if no
> individual CPU IPI function exists.
>
> NOTE! There's a few cases in
> arch/x86/kernel/apic/vector.c that also do that
> "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> aren't that important, so I didn't bother with them.
>
> NOTE2! I've tested this, and it seems to work, but maybe
> there is something seriously wrong. I skipped the
> "disable interrupts" part when doing the "send_IPI", for
> example, because I think it's entirely unnecessary for
> that case. But this has certainly *not* gotten any real
> stress-testing.

I'm not so sure about that aspect: I think disabling IRQs
might be necessary with some APICs (if lower levels don't
disable IRQs), to make sure the 'local APIC busy' bit isn't
set:

we typically do a wait_icr_idle() call before sending an
IPI - and if IRQs are not off then the idleness of the APIC
might be gone. (Because a hardirq that arrives after a
wait_icr_idle() but before the actual IPI sending sent out
an IPI and the queue is full.)

So the IPI sending should be atomic in that sense.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
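As an aside on the "level of 16 vectors" mentioned earlier in the thread: the local APIC groups vectors into priority classes of 16, where the class is simply the upper four bits of the 8-bit vector number (per the Intel SDM's interrupt-priority description). A trivial helper makes that explicit; this is just illustrative arithmetic, not a kernel function:

```c
#include <assert.h>

/*
 * The local APIC's interrupt priority class ("level") is the upper
 * nibble of the 8-bit vector: 16 vectors share each class, and only a
 * limited number of interrupts per class can be pending at once.
 */
static inline unsigned int apic_priority_class(unsigned int vector)
{
	return (vector >> 4) & 0xf;
}
```

So three in-flight vectors falling inside the same class of 16 is exactly the "3 vectors within a level" condition that could trigger the old IO-APIC retry traffic.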
* Re: smp_call_function_single lockups
  2015-02-20 19:41        ` Ingo Molnar
@ 2015-02-20 20:03          ` Linus Torvalds
  2015-02-20 20:11            ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-02-20 20:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:

Right. But afaik not for the x2apic case, which this is. The x2apic
doesn't even have a busy bit, and sending the ipi is a single write.

I agree that when doing other apic implementations, we may need to
guarantee atomicity for things like "wait for apic idle, then send
the ipi".

	Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
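The distinction Linus and Ingo are drawing can be sketched in user space: a legacy xAPIC send is a poll-then-write sequence against a delivery-status ("busy") bit, so an interrupt landing between the poll and the ICR write can find the register busy again, while an x2APIC send is one MSR write with no busy bit to poll. The "hardware" below is simulated with plain variables; only the control flow is meaningful:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated ICR state; the real thing is an MMIO register or an MSR. */
static bool icr_busy;
static int ipis_sent;

/* Legacy xAPIC: poll delivery status, then write the ICR - two steps. */
static void xapic_send_ipi(void)
{
	while (icr_busy)	/* wait_icr_idle() */
		;
	/*
	 * A hardirq arriving *here* could send its own IPI and set the
	 * busy bit again - which is why this sequence wants IRQs off.
	 */
	icr_busy = true;	/* ICR write starts delivery */
	ipis_sent++;
	icr_busy = false;	/* delivery completes */
}

/* x2APIC: a single WRMSR, no busy bit, nothing to poll. */
static void x2apic_send_ipi(void)
{
	ipis_sent++;
}
```

The second function is trivially atomic with respect to interrupts, which is why Linus could skip the IRQ-disable in the x2apic path of his patch.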
* Re: smp_call_function_single lockups
  2015-02-20 20:03          ` Linus Torvalds
@ 2015-02-20 20:11            ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-02-20 20:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez, Christopher Arges,
	the arch/x86 maintainers

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > I'm not so sure about that aspect: I think disabling
> > IRQs might be necessary with some APICs (if lower
> > levels don't disable IRQs), to make sure the 'local
> > APIC busy' bit isn't set:
>
> Right. But afaik not for the x2apic case, which this is.
> The x2apic doesn't even have a busy bit, and sending the
> ipi is a single write,

Ah, ok! Then the patch looks good to me.

( Originally we didn't wait for the ICR bit either, but then
  it was added due to later errata and was eventually made
  an architectural requirement. )

> I agree that when doing other apic implementations, we
> may need to guarantee atomicity for things like "wait for
> apic idle, then send the ipi".

Yeah.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
  2015-02-12 16:38   ` Rafael David Tinoco
  2015-02-18 22:25   ` Peter Zijlstra
@ 2015-03-20 10:15   ` Peter Zijlstra
  2015-03-20 16:26     ` Linus Torvalds
  2015-04-01 14:22   ` Frederic Weisbecker
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-03-20 10:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
	Frederic Weisbecker

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

Linus, any plans for this patch? I think it does solve a fair few issues
the current code has.

Do you want me to route it through tip (and write a changelog while I'm
at it) or will you be taking it yourself?

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-20 10:15   ` Peter Zijlstra
@ 2015-03-20 16:26     ` Linus Torvalds
  2015-03-20 17:14       ` Mike Galbraith
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-03-20 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe,
	Frederic Weisbecker

On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Linus, any plans for this patch? I think it does solve a fair few issues
> the current code has.

So I didn't really have any plans. I think it's a good patch, and it
might even fix a bug and improve code generation, but it didn't fix
the lockup problem, so..

> do you want me to route it through tip (and write a changelog while I'm
> at it) or will you be taking it yourself?

If you think it's worth committing, then go ahead. Add my signed-off,
although I have to note that I didn't do a lot of testing of that patch
(I think it was compile-tested when I sent it out, and then I ended up
running it for a while, and obviously it got *some* testing by the
person who saw the lockup - which just showed that it didn't fix the
problem ;(

	Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-03-20 16:26     ` Linus Torvalds
@ 2015-03-20 17:14       ` Mike Galbraith
  0 siblings, 0 replies; 54+ messages in thread
From: Mike Galbraith @ 2015-03-20 17:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Rafael David Tinoco, LKML, Thomas Gleixner,
	Jens Axboe, Frederic Weisbecker

On Fri, 2015-03-20 at 09:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 20, 2015 at 3:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Linus, any plans for this patch? I think it does solve a fair few issues
> > the current code has.
>
> So I didn't really have any plans. I think it's a good patch, and it
> might even fix a bug and improve code generation, but it didn't fix
> the lockup problem, so..
>
> > do you want me to route it through tip (and write a changelog while I'm
> > at it) or will you be taking it yourself?
>
> If you think it's worth committing, then go ahead. Add my signed-off,
> although I have to note that I didn't do a lot of testing of that
> patch (I think it was compile-tested when I sent it out, and then I
> ended up running it for a while, and obviously it got *some* testing
> by the person who saw the lockup that just showed that it didn't fix
> the problem ;(

I carried it for a while on boxen big and small, so it got some burn-in.

	-Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: smp_call_function_single lockups
  2015-02-11 20:42 ` Linus Torvalds
    ` (2 preceding siblings ...)
  2015-03-20 10:15   ` Peter Zijlstra
@ 2015-04-01 14:22   ` Frederic Weisbecker
  3 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2015-04-01 14:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rafael David Tinoco, LKML, Thomas Gleixner, Jens Axboe

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> >     WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
> 		Linus

>  kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 47 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..2aaac2c47683 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,7 @@
>  
>  enum {
>  	CSD_FLAG_LOCK		= 0x01,
> -	CSD_FLAG_WAIT		= 0x02,
> +	CSD_FLAG_SYNCHRONOUS	= 0x02,
>  };
>  
>  struct call_function_data {
> @@ -107,7 +107,7 @@ void __init call_function_init(void)
>   */
>  static void csd_lock_wait(struct call_single_data *csd)
>  {
> -	while (csd->flags & CSD_FLAG_LOCK)
> +	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
>  		cpu_relax();
>  }
>  
> @@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
>  	 * to ->flags with any subsequent assignments to other
>  	 * fields of the specified call_single_data structure:
>  	 */
> -	smp_mb();
> +	smp_wmb();
>  }
>  
>  static void csd_unlock(struct call_single_data *csd)
>  {
> -	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> +	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>  
>  	/*
>  	 * ensure we're all done before releasing data:
>  	 */
> -	smp_mb();
> -
> -	csd->flags &= ~CSD_FLAG_LOCK;
> +	smp_store_release(&csd->flags, 0);
>  }
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> @@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
>   * ->func, ->info, and ->flags set.
>   */
>  static int generic_exec_single(int cpu, struct call_single_data *csd,
> -			       smp_call_func_t func, void *info, int wait)
> +			       smp_call_func_t func, void *info)
>  {
> -	struct call_single_data csd_stack = { .flags = 0 };
> -	unsigned long flags;
> -
> -
>  	if (cpu == smp_processor_id()) {
> +		unsigned long flags;
> +
> +		/*
> +		 * We can unlock early even for the synchronous on-stack case,
> +		 * since we're doing this from the same CPU..
> +		 */
> +		csd_unlock(csd);
>  		local_irq_save(flags);
>  		func(info);
>  		local_irq_restore(flags);
> @@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
>  		return -ENXIO;
>  
> -
> -	if (!csd) {
> -		csd = &csd_stack;
> -		if (!wait)
> -			csd = this_cpu_ptr(&csd_data);
> -	}
> -
> -	csd_lock(csd);
> -
>  	csd->func = func;
>  	csd->info = info;
>  
> -	if (wait)
> -		csd->flags |= CSD_FLAG_WAIT;
> -
>  	/*
>  	 * The list addition should be visible before sending the IPI
>  	 * handler locks the list to pull the entry off it because of
> @@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
>  		arch_send_call_function_single_ipi(cpu);
>  
> -	if (wait)
> -		csd_lock_wait(csd);
> -
>  	return 0;
>  }
>  
> @@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
>  	}
>  
>  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> -		csd->func(csd->info);
> -		csd_unlock(csd);
> +		smp_call_func_t func = csd->func;
> +		void *info = csd->info;
> +
> +		/* Do we wait until *after* callback? */
> +		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
> +			func(info);
> +			csd_unlock(csd);
> +		} else {
> +			csd_unlock(csd);
> +			func(info);

Just to clarify things, the kind of lockup this is expected to fix is the
case where the IPI is resent from the IPI itself, right?

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread
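The sync-versus-async unlock ordering the patch introduces can be modelled in a self-contained user-space sketch: C11 atomics stand in for smp_load_acquire()/smp_store_release(), and struct csd and flush_one() below are simplified stand-ins for the kernel's call_single_data and flush loop, not the real code.

```c
#include <assert.h>
#include <stdatomic.h>

enum { CSD_FLAG_LOCK = 0x01, CSD_FLAG_SYNCHRONOUS = 0x02 };

/* Simplified stand-in for the kernel's call_single_data. */
struct csd {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

/* smp_store_release() maps onto a C11 release store in this model. */
static void csd_unlock(struct csd *csd)
{
	atomic_store_explicit(&csd->flags, 0, memory_order_release);
}

/*
 * Core of the patched flush loop: func/info are read out first, because
 * in the async case the csd may be reused (or freed, or used to resend
 * an IPI) the instant it is unlocked.
 */
static void flush_one(struct csd *csd)
{
	void (*func)(void *) = csd->func;
	void *info = csd->info;

	if (atomic_load_explicit(&csd->flags, memory_order_relaxed)
	    & CSD_FLAG_SYNCHRONOUS) {
		func(info);		/* sync: caller spins until after  */
		csd_unlock(csd);	/* the callback has finished       */
	} else {
		csd_unlock(csd);	/* async: owner may reuse the csd  */
		func(info);		/* immediately, even inside func() */
	}
}

/* Probe callback: records whether the csd was still locked when run. */
static unsigned int flags_seen_in_callback;

static void probe(void *info)
{
	struct csd *csd = info;

	flags_seen_in_callback =
		atomic_load_explicit(&csd->flags, memory_order_relaxed);
}
```

Unlocking before the callback in the async case is what lets a callback legitimately re-queue the same csd - the scenario Frederic asks about - without deadlocking against its own CSD_FLAG_LOCK.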
end of thread, other threads:[~2015-04-06 17:24 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
2015-02-22 10:37 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2015-02-11 13:19 Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59 ` Linus Torvalds
2015-02-11 20:42 ` Linus Torvalds
2015-02-12 16:38 ` Rafael David Tinoco
2015-02-18 22:25 ` Peter Zijlstra
2015-02-19 15:42 ` Rafael David Tinoco
2015-02-19 16:14 ` Linus Torvalds
2015-02-23 14:01 ` Rafael David Tinoco
2015-02-23 19:32 ` Linus Torvalds
2015-02-23 20:50 ` Peter Zijlstra
2015-02-23 21:02 ` Rafael David Tinoco
2015-02-19 16:16 ` Peter Zijlstra
2015-02-19 16:26 ` Linus Torvalds
2015-02-19 16:32 ` Rafael David Tinoco
2015-02-19 16:59 ` Linus Torvalds
2015-02-19 17:30 ` Rafael David Tinoco
2015-02-19 17:39 ` Linus Torvalds
2015-02-19 20:29 ` Linus Torvalds
2015-02-19 21:59 ` Linus Torvalds
2015-02-19 22:45 ` Linus Torvalds
2015-03-31  3:15 ` Chris J Arges
2015-03-31  4:28 ` Linus Torvalds
2015-03-31  4:46 ` Linus Torvalds
2015-03-31 15:08 ` Linus Torvalds
2015-03-31 22:23 ` Chris J Arges
2015-03-31 23:07 ` Linus Torvalds
2015-04-01 14:32 ` Chris J Arges
2015-04-01 15:36 ` Linus Torvalds
2015-04-02  9:55 ` Ingo Molnar
2015-04-02 17:35 ` Linus Torvalds
2015-04-01 12:43 ` Ingo Molnar
2015-04-01 16:10 ` Chris J Arges
2015-04-01 16:14 ` Linus Torvalds
2015-04-01 21:59 ` Chris J Arges
2015-04-02 17:31 ` Linus Torvalds
2015-04-02 18:26 ` Ingo Molnar
2015-04-02 18:51 ` Chris J Arges
2015-04-02 19:07 ` Ingo Molnar
2015-04-02 20:57 ` Linus Torvalds
2015-04-02 21:13 ` Chris J Arges
2015-04-03  5:45 ` Ingo Molnar
2015-04-06 17:23 ` Chris J Arges
2015-02-20  9:30 ` Ingo Molnar
2015-02-20 16:49 ` Linus Torvalds
2015-02-20 19:41 ` Ingo Molnar
2015-02-20 20:03 ` Linus Torvalds
2015-02-20 20:11 ` Ingo Molnar
2015-03-20 10:15 ` Peter Zijlstra
2015-03-20 16:26 ` Linus Torvalds
2015-03-20 17:14 ` Mike Galbraith
2015-04-01 14:22 ` Frederic Weisbecker