linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* smp_call_function_single lockups
@ 2015-02-11 13:19 Rafael David Tinoco
  2015-02-11 18:18 ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Rafael David Tinoco @ 2015-02-11 13:19 UTC (permalink / raw)
  To: LKML; +Cc: torvalds, Thomas Gleixner, Jens Axboe

Linus, Thomas, Jens..

During the 3.18 - 3.19 "frequent lockups discussion", in some point
you have observed csd_lock() and csd_unlock() possible synchronization
problems. I think we have managed to reproduce that issue in a
constant basis with 3.13 (ubuntu) and 3.19 (latest vanilla).

- When running "open-stack tempest" in a nested-kvm environment we are
able to cause a lockup in question of hours (from 2 to 20 hours
usually). Trace from nested hypervisor (ubuntu 3.13):

crash> bt
PID: 29130  TASK: ffff8804288ac800  CPU: 1   COMMAND: "qemu-system-x86"
 #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
 #2 [ffff88043fd03e30] panic at ffffffff81719ff4
 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
 #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
 #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
 #9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbe62  RSP: ffff8804284a3c70  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8804284a3c40  RCX: 0000000000000001
    RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff8804284a3ca0   R8: ffffffff8180ad48   R9: 0000000000000001
    R10: ffffffff81185cac  R11: ffffea00109b4a00  R12: ffff88042829f400
    R13: 0000000000000000  R14: ffffea001017d640  R15:
0000000000000005    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
    RIP: 00007fa3ed16c587  RSP: 00007fa3536f5c10  RFLAGS: 00000246
    RAX: 000000000000000b  RBX: ffffffff8173196d  RCX: 0000001d00000007
    RDX: 0000000000000000  RSI: 0000000000801000  RDI: 00007fa315ff4000
    RBP: 00007fa3167f49c0   R8: 0000000000000000   R9: 00007fa3f5396738
    R10: 00007fa3536f5a60  R11: 0000000000000202  R12: 00007fa3ed6562a0
    R13: 00007fa350ef19c0  R14: ffffffff81181482  R15: ffff8804284a3f78
    ORIG_RAX: 000000000000000b  CS: 0033  SS: 002b

- After applying patch provided by Thomas we were able to cause the
lockup only after 6 days (also locked inside
smp_call_function_single). Test performance (even for a nested kvm)
was reduced substantially with 3.19 + this patch. Trace from the
nested hypervisor (3.19 + patch):

 crash> bt
PID: 10467  TASK: ffff880817b3b1c0  CPU: 1   COMMAND: "qemu-system-x86"
 #0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
 #1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
 #2 [ffff88083fd03de0] panic at ffffffff8176f713
 #3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
 #4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
 #5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
 #6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
 #7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
 #8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
 #9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
    [exception RIP: generic_exec_single+218]
    RIP: ffffffff810ee0ca  RSP: ffff880817973b18  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: 0000000000000292  RCX: 0000000000000001
    RDX: ffffffff8180e6e0  RSI: 0000000000000000  RDI: 0000000000000292
    RBP: ffff880817973b58   R8: ffffffff8180e6c8   R9: 0000000000000001
    R10: 000000000000b6e0  R11: 0000000000000001  R12: ffffffff811f6626
    R13: ffff880817973ab8  R14: ffffffff8109cfd2  R15: ffff880817973a78
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
    RIP: 00007f42f987fec7  RSP: 00007f42ef1bebd8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8177cc2d  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000e
    RBP: 00007f430047b040   R8: 0000000000000000   R9: 00000000000000ff
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f42ff920240
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000001
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

Not sure if you are still pursuing this, anyway.. let me know if you
think of any other change, I'll keep the environment.

-Tinoco

On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> >   llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > - csd->func(csd->info);
> > + smp_call_func_t func = csd->func;
> > + void *info = csd->info;
> >   csd_unlock(csd);
> > +
> > + func(info);
>
> No, that won't work for synchronous calls:
>
>     CPU 0           CPU 1
>
>     csd_lock(csd);
>     queue_csd();
>     ipi();
> func = csd->func;
> info = csd->info;
> csd_unlock(csd);
>     csd_lock_wait();
> func(info);
>
> The csd_lock_wait() side will succeed and therefor assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.

So a combo of both (Jens and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

     CPU 0           CPU 1

     csd_lock(csd);
     queue_csd();
     ipi();
  csd->func(csd->info);
     wait_for_completion(csd);
  complete(csd);
     reuse_csd(csd);
csd_unlock(csd);
Thanks,

tglx

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_

 static void csd_unlock(struct call_single_data *csd)
 {
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

  /*
  * ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
  }

  llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+
+ /*
+ * For synchronous calls we are not allowed to unlock
+ * before the callback returned. For the async case
+ * its the responsibility of the caller to keep
+ * csd->info consistent while the callback runs.
+ */
+ if (csd->flags & CSD_FLAG_WAIT) {
+ csd->func(csd->info);
+ csd_unlock(csd);
+ } else {
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ csd_unlock(csd);
+ func(info);
+ }
  }

  /*

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2015-05-20 18:20 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-11 13:19 smp_call_function_single lockups Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59   ` Linus Torvalds
2015-02-11 20:42     ` Linus Torvalds
2015-02-12 16:38       ` Rafael David Tinoco
2015-02-18 22:25       ` Peter Zijlstra
2015-02-19 15:42         ` Rafael David Tinoco
2015-02-19 16:14           ` Linus Torvalds
2015-02-23 14:01             ` Rafael David Tinoco
2015-02-23 19:32               ` Linus Torvalds
2015-02-23 20:50                 ` Peter Zijlstra
2015-02-23 21:02                   ` Rafael David Tinoco
2015-02-19 16:16           ` Peter Zijlstra
2015-02-19 16:26           ` Linus Torvalds
2015-02-19 16:32             ` Rafael David Tinoco
2015-02-19 16:59               ` Linus Torvalds
2015-02-19 17:30                 ` Rafael David Tinoco
2015-02-19 17:39                 ` Linus Torvalds
2015-02-19 20:29                   ` Linus Torvalds
2015-02-19 21:59                     ` Linus Torvalds
2015-02-19 22:45                       ` Linus Torvalds
2015-03-31  3:15                         ` Chris J Arges
2015-03-31  4:28                           ` Linus Torvalds
2015-03-31 10:56                             ` [debug PATCHes] " Ingo Molnar
2015-03-31 22:38                               ` Chris J Arges
2015-04-01 12:39                                 ` Ingo Molnar
2015-04-01 14:10                                   ` Chris J Arges
2015-04-01 14:55                                     ` Ingo Molnar
2015-03-31  4:46                           ` Linus Torvalds
2015-03-31 15:08                           ` Linus Torvalds
2015-03-31 22:23                             ` Chris J Arges
2015-03-31 23:07                               ` Linus Torvalds
2015-04-01 14:32                                 ` Chris J Arges
2015-04-01 15:36                                   ` Linus Torvalds
2015-04-02  9:55                                     ` Ingo Molnar
2015-04-02 17:35                                       ` Linus Torvalds
2015-04-01 12:43                               ` Ingo Molnar
2015-04-01 16:10                                 ` Chris J Arges
2015-04-01 16:14                                   ` Linus Torvalds
2015-04-01 21:59                                     ` Chris J Arges
2015-04-02 17:31                                       ` Linus Torvalds
2015-04-02 18:26                                         ` Ingo Molnar
2015-04-02 18:51                                           ` Chris J Arges
2015-04-02 19:07                                             ` Ingo Molnar
2015-04-02 20:57                                               ` Linus Torvalds
2015-04-02 21:13                                               ` Chris J Arges
2015-04-03  5:43                                                 ` [PATCH] smp/call: Detect stuck CSD locks Ingo Molnar
2015-04-03  5:47                                                   ` Ingo Molnar
2015-04-06 16:58                                                   ` Chris J Arges
2015-04-06 17:32                                                     ` Linus Torvalds
2015-04-07  9:21                                                       ` Ingo Molnar
2015-04-07 20:59                                                         ` Chris J Arges
2015-04-07 21:15                                                           ` Linus Torvalds
2015-04-08  6:47                                                           ` Ingo Molnar
2015-04-13  3:56                                                             ` Chris J Arges
2015-04-13  6:14                                                               ` Ingo Molnar
2015-04-15 19:54                                                                 ` Chris J Arges
2015-04-16 11:04                                                                   ` Ingo Molnar
2015-04-16 15:58                                                                     ` Chris J Arges
2015-04-16 16:31                                                                       ` Ingo Molnar
2015-04-29 21:08                                                                         ` Chris J Arges
2015-05-11 14:00                                                                           ` Ingo Molnar
2015-05-20 18:19                                                                             ` Chris J Arges
2015-04-03  5:45                                                 ` smp_call_function_single lockups Ingo Molnar
2015-04-06 17:23                                         ` Chris J Arges
2015-02-20  9:30                     ` Ingo Molnar
2015-02-20 16:49                       ` Linus Torvalds
2015-02-20 19:41                         ` Ingo Molnar
2015-02-20 20:03                           ` Linus Torvalds
2015-02-20 20:11                             ` Ingo Molnar
2015-03-20 10:15       ` Peter Zijlstra
2015-03-20 16:26         ` Linus Torvalds
2015-03-20 17:14           ` Mike Galbraith
2015-04-01 14:22       ` Frederic Weisbecker
2015-04-18 10:13       ` [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking tip-bot for Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).