* Re: smp_call_function_single lockups
@ 2015-02-22  8:59 Daniel J Blueman
  2015-02-22 10:37 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel J Blueman @ 2015-02-22  8:59 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra,
	LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
	Christopher Arges, the arch/x86 maintainers

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
 > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
 >
 > > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@kernel.org> wrote:
 > > >
 > > > So if my memory serves me right, I think it was for
 > > > local APICs, and even there mostly it was a performance
 > > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
 > > > to a local APIC then the IO-APIC might be forced to
 > > > resend those IRQs, leading to excessive message traffic
 > > > on the relevant hardware bus.
 > >
 > > Hmm. I have a distinct memory of interrupts actually
 > > being lost, but I really can't find anything to support
 > > that memory, so it's probably some drug-induced confusion
 > > of mine. I don't find *anything* about interrupt "levels"
 > > any more in modern Intel documentation on the APIC, but
 > > maybe I missed something. But it might all have been an
 > > IO-APIC thing.
 >
 > So I just found an older discussion of it:
 >
 > http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815
 >
 > while it's not a comprehensive description, it matches what
 > I remember from it: with 3 vectors within a level of 16
 > vectors we'd get excessive "retries" sent by the IO-APIC
 > through the (then rather slow) APIC bus.
 >
 > ( It was possible for the same phenomenon to occur with
 >   IPIs as well, when a CPU sent an APIC message to another
 >   CPU, if the affected vectors were equal modulo 16 - but
 >   this was rare IIRC because most systems were dual CPU so
 >   only two IPIs could have occurred. )
 >
 > > Well, the attached patch for that seems pretty trivial.
 > > And seems to work for me (my machine also defaults to
 > > x2apic clustered mode), and allows the APIC code to start
 > > doing a "send to specific cpu" thing one by one, since it
 > > falls back to the send_IPI_mask() function if no
 > > individual CPU IPI function exists.
 > >
 > > NOTE! There's a few cases in
 > > arch/x86/kernel/apic/vector.c that also do that
 > > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
 > > aren't that important, so I didn't bother with them.
 > >
 > > NOTE2! I've tested this, and it seems to work, but maybe
 > > there is something seriously wrong. I skipped the
 > > "disable interrupts" part when doing the "send_IPI", for
 > > example, because I think it's entirely unnecessary for
 > > that case. But this has certainly *not* gotten any real
 > > stress-testing.

 > I'm not so sure about that aspect: I think disabling IRQs
 > might be necessary with some APICs (if lower levels don't
 > disable IRQs), to make sure the 'local APIC busy' bit isn't
 > set:
 >
 > we typically do a wait_icr_idle() call before sending an
 > IPI - and if IRQs are not off then the idleness of the APIC
 > might be gone. (Because a hardirq that arrives after the
 > wait_icr_idle() but before the actual IPI send may itself have
 > sent an IPI, leaving the queue full.)

The Intel SDM [1] and AMD Fam15h BKDG [2] state that IPIs are queued, so 
the wait_icr_idle() polling is only needed on the P6 family and earlier 
parts, and even then perhaps only to avoid delivery retries. The polling 
needlessly ties up the IPI caller, so we bypass it in the Numachip APIC 
driver's IPI-to-self path.
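
To make the tradeoff concrete, here is a minimal sketch (not the actual
Linux helpers; the register offset, Delivery Status bit and SELF shorthand
are from the SDM, everything else is made up for illustration) of the
classic poll-then-send xAPIC path next to a direct write for parts that
queue IPIs:

#define APIC_ICR        0x300           /* ICR low, per the SDM */
#define APIC_ICR_BUSY   (1u << 12)      /* Delivery Status: send pending */
#define APIC_DEST_SELF  0x40000         /* destination shorthand: self */

/* apic_read()/apic_write() stand in for 32-bit MMIO accessors on the
 * local APIC register page. */

static void classic_send_ipi_self(unsigned int vector)
{
        /* P6-and-earlier behaviour: poll until the previous IPI has been
         * accepted before writing a new command. */
        while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
                cpu_relax();
        apic_write(APIC_ICR, APIC_DEST_SELF | vector);
}

static void queued_send_ipi_self(unsigned int vector)
{
        /* On parts that queue IPIs ([1], [2]), the poll above only ties up
         * the sender, so the ICR can be written directly. */
        apic_write(APIC_ICR, APIC_DEST_SELF | vector);
}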

On Linus's earlier point: with the large core counts on Numascale 
systems, I previously implemented a shortcut that lets single IPIs 
bypass all the cpumask generation and walking. It's way down my list, 
but I'll see if I can generalise it and post a patch series at some 
point, if there's interest.
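
Roughly, the idea is the sketch below (names invented for illustration; 
'send_IPI_one' is not an existing callback): let the generic code use a 
per-driver single-target hook when one is provided, and fall back to the 
current mask-based path otherwise.

/* Sketch only, not mainline code. */
static void send_single_ipi(int cpu, int vector)
{
        if (apic->send_IPI_one)                 /* hypothetical fast path */
                apic->send_IPI_one(cpu, vector);
        else                                    /* existing mask-based path */
                apic->send_IPI_mask(cpumask_of(cpu), vector);
}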

Dan

-- [1] Intel SDM 3, p10-30 
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

If more than one interrupt is generated with the same vector number, the 
local APIC can set the bit for the vector both in the IRR and the ISR. 
This means that for the Pentium 4 and Intel Xeon processors, the IRR and 
ISR can queue two interrupts for each interrupt vector: one in the IRR 
and one in the ISR. Any additional interrupts issued for the same 
interrupt vector are collapsed into the single bit in the IRR. For the 
P6 family and Pentium processors, the IRR and ISR registers can queue no 
more than two interrupts per interrupt vector and will reject other 
interrupts that are received within the same vector.

-- [2] AMD Fam15h BKDG p470 
http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this 
bit is set to indicate that the interrupt has not yet been accepted by 
the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC 
mode. Software may repeatedly write ICRL without polling the DS bit; all 
requested IPIs will be delivered.

* smp_call_function_single lockups
@ 2015-02-11 13:19 Rafael David Tinoco
  2015-02-11 18:18 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael David Tinoco @ 2015-02-11 13:19 UTC (permalink / raw)
  To: LKML; +Cc: torvalds, Thomas Gleixner, Jens Axboe

Linus, Thomas, Jens..

During the 3.18 - 3.19 "frequent lockups" discussion, at some point you
observed possible synchronization problems between csd_lock() and
csd_unlock(). I think we have managed to reproduce that issue on a
consistent basis with 3.13 (Ubuntu) and 3.19 (latest vanilla).

- When running OpenStack Tempest in a nested-KVM environment we are
able to cause the lockup within a matter of hours (usually 2 to 20).
Trace from the nested hypervisor (Ubuntu 3.13):

crash> bt
PID: 29130  TASK: ffff8804288ac800  CPU: 1   COMMAND: "qemu-system-x86"
 #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
 #2 [ffff88043fd03e30] panic at ffffffff81719ff4
 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
 #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
 #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
 #9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbe62  RSP: ffff8804284a3c70  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8804284a3c40  RCX: 0000000000000001
    RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff8804284a3ca0   R8: ffffffff8180ad48   R9: 0000000000000001
    R10: ffffffff81185cac  R11: ffffea00109b4a00  R12: ffff88042829f400
    R13: 0000000000000000  R14: ffffea001017d640  R15: 0000000000000005
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
    RIP: 00007fa3ed16c587  RSP: 00007fa3536f5c10  RFLAGS: 00000246
    RAX: 000000000000000b  RBX: ffffffff8173196d  RCX: 0000001d00000007
    RDX: 0000000000000000  RSI: 0000000000801000  RDI: 00007fa315ff4000
    RBP: 00007fa3167f49c0   R8: 0000000000000000   R9: 00007fa3f5396738
    R10: 00007fa3536f5a60  R11: 0000000000000202  R12: 00007fa3ed6562a0
    R13: 00007fa350ef19c0  R14: ffffffff81181482  R15: ffff8804284a3f78
    ORIG_RAX: 000000000000000b  CS: 0033  SS: 002b

- After applying the patch provided by Thomas, we were able to trigger
the lockup only after 6 days (again locked inside
smp_call_function_single). Test performance (even for nested KVM) was
reduced substantially with 3.19 + this patch. Trace from the nested
hypervisor (3.19 + patch):

crash> bt
PID: 10467  TASK: ffff880817b3b1c0  CPU: 1   COMMAND: "qemu-system-x86"
 #0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
 #1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
 #2 [ffff88083fd03de0] panic at ffffffff8176f713
 #3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
 #4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
 #5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
 #6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
 #7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
 #8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
 #9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
    [exception RIP: generic_exec_single+218]
    RIP: ffffffff810ee0ca  RSP: ffff880817973b18  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: 0000000000000292  RCX: 0000000000000001
    RDX: ffffffff8180e6e0  RSI: 0000000000000000  RDI: 0000000000000292
    RBP: ffff880817973b58   R8: ffffffff8180e6c8   R9: 0000000000000001
    R10: 000000000000b6e0  R11: 0000000000000001  R12: ffffffff811f6626
    R13: ffff880817973ab8  R14: ffffffff8109cfd2  R15: ffff880817973a78
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
    RIP: 00007f42f987fec7  RSP: 00007f42ef1bebd8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8177cc2d  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000e
    RBP: 00007f430047b040   R8: 0000000000000000   R9: 00000000000000ff
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f42ff920240
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000001
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

Not sure if you are still pursuing this; anyway, let me know if you
think of any other change worth trying, and I'll keep the environment around.

-Tinoco

On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> >          llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > -                csd->func(csd->info);
> > +                smp_call_func_t func = csd->func;
> > +                void *info = csd->info;
> >                  csd_unlock(csd);
> > +
> > +                func(info);
>
> No, that won't work for synchronous calls:
>
>     CPU 0              CPU 1
>
>     csd_lock(csd);
>     queue_csd();
>     ipi();
>                        func = csd->func;
>                        info = csd->info;
>                        csd_unlock(csd);
>     csd_lock_wait();
>                        func(info);
>
> The csd_lock_wait() side will succeed and therefore assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.
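
For reference, the csd lock protocol this hinges on is roughly the sketch
below (simplified, not the exact kernel/smp.c code; barriers and flag
handling reduced to the essentials):

static void csd_lock_wait(struct call_single_data *csd)
{
        /* A synchronous caller spins here until the flag is dropped. */
        while (csd->flags & CSD_FLAG_LOCK)
                cpu_relax();
}

static void csd_unlock(struct call_single_data *csd)
{
        /* Make the callee's work visible before releasing the waiter. */
        smp_mb();
        csd->flags &= ~CSD_FLAG_LOCK;
}

So the moment csd_unlock() runs, the waiting CPU is free to return and
reuse the csd, which is why the unlock must not happen before func(info)
in the synchronous (CSD_FLAG_WAIT) case.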

So a combo of both (Jens' and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

     CPU 0                    CPU 1

     csd_lock(csd);
     queue_csd();
     ipi();
                              csd->func(csd->info);
     wait_for_completion(csd);
                              complete(csd);
     reuse_csd(csd);
                              csd_unlock(csd);

Thanks,

tglx

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_

 static void csd_unlock(struct call_single_data *csd)
 {
-        WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+        WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

         /*
          * ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
         }

         llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-                csd->func(csd->info);
-                csd_unlock(csd);
+
+                /*
+                 * For synchronous calls we are not allowed to unlock
+                 * before the callback has returned. For the async case
+                 * it's the responsibility of the caller to keep
+                 * csd->info consistent while the callback runs.
+                 */
+                if (csd->flags & CSD_FLAG_WAIT) {
+                        csd->func(csd->info);
+                        csd_unlock(csd);
+                } else {
+                        smp_call_func_t func = csd->func;
+                        void *info = csd->info;
+
+                        csd_unlock(csd);
+                        func(info);
+                }
         }

         /*


Thread overview: 54+ messages
2015-02-22  8:59 smp_call_function_single lockups Daniel J Blueman
2015-02-22 10:37 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2015-02-11 13:19 Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59   ` Linus Torvalds
2015-02-11 20:42     ` Linus Torvalds
2015-02-12 16:38       ` Rafael David Tinoco
2015-02-18 22:25       ` Peter Zijlstra
2015-02-19 15:42         ` Rafael David Tinoco
2015-02-19 16:14           ` Linus Torvalds
2015-02-23 14:01             ` Rafael David Tinoco
2015-02-23 19:32               ` Linus Torvalds
2015-02-23 20:50                 ` Peter Zijlstra
2015-02-23 21:02                   ` Rafael David Tinoco
2015-02-19 16:16           ` Peter Zijlstra
2015-02-19 16:26           ` Linus Torvalds
2015-02-19 16:32             ` Rafael David Tinoco
2015-02-19 16:59               ` Linus Torvalds
2015-02-19 17:30                 ` Rafael David Tinoco
2015-02-19 17:39                 ` Linus Torvalds
2015-02-19 20:29                   ` Linus Torvalds
2015-02-19 21:59                     ` Linus Torvalds
2015-02-19 22:45                       ` Linus Torvalds
2015-03-31  3:15                         ` Chris J Arges
2015-03-31  4:28                           ` Linus Torvalds
2015-03-31  4:46                           ` Linus Torvalds
2015-03-31 15:08                           ` Linus Torvalds
2015-03-31 22:23                             ` Chris J Arges
2015-03-31 23:07                               ` Linus Torvalds
2015-04-01 14:32                                 ` Chris J Arges
2015-04-01 15:36                                   ` Linus Torvalds
2015-04-02  9:55                                     ` Ingo Molnar
2015-04-02 17:35                                       ` Linus Torvalds
2015-04-01 12:43                               ` Ingo Molnar
2015-04-01 16:10                                 ` Chris J Arges
2015-04-01 16:14                                   ` Linus Torvalds
2015-04-01 21:59                                     ` Chris J Arges
2015-04-02 17:31                                       ` Linus Torvalds
2015-04-02 18:26                                         ` Ingo Molnar
2015-04-02 18:51                                           ` Chris J Arges
2015-04-02 19:07                                             ` Ingo Molnar
2015-04-02 20:57                                               ` Linus Torvalds
2015-04-02 21:13                                               ` Chris J Arges
2015-04-03  5:45                                                 ` Ingo Molnar
2015-04-06 17:23                                         ` Chris J Arges
2015-02-20  9:30                     ` Ingo Molnar
2015-02-20 16:49                       ` Linus Torvalds
2015-02-20 19:41                         ` Ingo Molnar
2015-02-20 20:03                           ` Linus Torvalds
2015-02-20 20:11                             ` Ingo Molnar
2015-03-20 10:15       ` Peter Zijlstra
2015-03-20 16:26         ` Linus Torvalds
2015-03-20 17:14           ` Mike Galbraith
2015-04-01 14:22       ` Frederic Weisbecker
