From: Ian Campbell <Ian.Campbell@citrix.com>
To: Stefan Bader <stefan.bader@canonical.com>
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: Strange PVM spinlock case revisited
Date: Mon, 11 Feb 2013 17:29:04 +0000
Message-ID: <1360603744.20449.74.camel@zakaz.uk.xensource.com>
In-Reply-To: <51151ABF.1070007@canonical.com>

On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I reported an issue which causes Xen PVM guests
> to hang (apparently related to spinlocks). Since then I have been able
> to gather a few more facts, which I try to provide below. As for the
> root cause, I am still puzzled and would be thankful for any hints or
> directions.
> 
> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging, at least 2 VCPUs are in xen_spin_lock_slow with the lock
>   free. IIRC the weird lock always belongs to one VCPU's runqueue.
> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>     "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting on this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
> 
> Maybe allowing or not allowing interrupts in xen_spin_lock_slow
> just manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow).
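
For the archives, the dance in question (v3.2-era arch/x86/xen/spinlock.c)
goes roughly like this -- paraphrased from memory and trimmed of the
stats accounting, so treat it as a sketch rather than the exact source:

static noinline int xen_spin_lock_slow(struct arch_spinlock *lock,
				       bool irq_enable)
{
	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
	struct xen_spinlock *prev;
	int irq = __this_cpu_read(lock_kicker_irq);
	int ret;

	/* announce we are spinning: publish xl in this cpu's
	 * lock_spinners slot, remembering any outer lock we preempted */
	prev = spinning_lock(xl);

	do {
		unsigned long flags;

		/* clear pending, then recheck the lock under us */
		xen_clear_irq_pending(irq);
		ret = xen_spin_trylock(lock);
		if (ret) {
			/* if we interrupted another spinner, leave the
			 * irq pending so it rechecks the lock instead of
			 * blocking on a kick we may have consumed */
			if (prev != NULL)
				xen_set_irq_pending(irq);
			goto out;
		}

		flags = arch_local_save_flags();
		if (irq_enable)
			raw_local_irq_enable();	/* the unmasking you
						 * experimented with */

		/* block in SCHEDOP_poll until the unlocker kicks us */
		xen_poll_irq(irq);

		raw_local_irq_restore(flags);
	} while (!xen_test_irq_pending(irq));	/* spurious wakeup? */
out:
	unspinning_lock(xl, prev);
	return ret;
}

All the nesting is carried by prev: an interrupt handler which takes a
contended lock while we are polling overwrites this cpu's lock_spinners
entry and restores it on the way out.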

>  The same
> way that the changed TLB flush may reduce it by causing fewer IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. VCPU2 seems to be doing something but
> looks to be rather deep in event handling. The irq_count is at 2, but
> irq_count seems to be used in a few more places than just the
> xen_hypervisor callback. Though that at least seems to be the entry on
> the stack I can see for VCPU2. The contents of the per-cpu variables
> irq_regs and irq_stack_ptr at least seem consistent, as if a callback
> came in while in the sched_op hypercall. But then the bt command seems
> to be operating on a completely different stack.
> VCPU2 seems to be active, not polling (event processing could be initiated
> from enable_irqs via the force callback or before finally exiting the
> sched_op hypercall). There seems to be a pending upcall, but upcalls are
> masked.
> The next thing I would try is to instrument the sites incrementing and
> decrementing irq_count...
> ---
> 
> Below some of the info seen from dom0 debugging keys and looking
> at a dump with crash:
> 
> Dom0 Info:
> 
> Polling vCPUs: {1,5-7}
> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
> 
> 10 [0/1]: s=6 n=1 x=0
> 40 [0/1]: s=6 n=6 x=0
> 46 [0/1]: s=6 n=7 x=0
> 
> Guest Info:
> 
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly, this means that two threads (VCPUs 2 and 6, I
suppose) are spinning on this lock, but no one is holding it?
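
(For reference, the pre-ticketlock lock itself is just the following --
again paraphrased:

struct xen_spinlock {
	unsigned char lock;		/* 0 -> free; 1 -> locked */
	unsigned short spinners;	/* count of waiting cpus */
};

so "lock=0, spinners=2" really does read as a free lock with two waiters
still registered against it.)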

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is in the nested wakeup path.
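
Concretely, in the loop sketched above, that would mean replacing the
xen_poll_irq() call with something like the below -- untested, and
assuming I'm reading the SCHEDOP_poll semantics right (the timeout is an
absolute Xen system time in ns, with 0 meaning no timeout, which is what
xen_poll_irq() passes today):

		/* untested: bound the block to ~1ms so a lost kick shows
		 * up as a spurious-looking wakeup instead of a hang */
		xen_poll_irq_timeout(irq, xen_clocksource_read() +
					  NSEC_PER_MSEC);

If the hang then clears itself (or degrades into a stream of timeout
wakeups) we would know the waiter blocked after everybody who might have
kicked it had already looked.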

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock, and that might show us something?
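
Something along these lines, perhaps -- untested, stats dropped, and the
two per-cpu variables are invented purely so crash can find them
post-mortem:

static DEFINE_PER_CPU(struct xen_spinlock *, last_kicked_lock);
static DEFINE_PER_CPU(int, last_kicked_cpu);

static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (per_cpu(lock_spinners, cpu) == xl) {
			/* record who we kicked, and for which lock */
			__this_cpu_write(last_kicked_lock, xl);
			__this_cpu_write(last_kicked_cpu, cpu);
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
		}
	}
}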

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is handled
in xen_spin_lock_slow somehow?

In a similar vein, do we not need a barrier in unspinning_lock to ensure
that the remote cpu sees prev getting restored?
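
To make that concrete, the pair in question looks roughly like this
(paraphrased once more), with the suspect spots marked:

static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev;

	prev = __this_cpu_read(lock_spinners);
	__this_cpu_write(lock_spinners, xl);	/* plain store; nothing
						 * orders this against a
						 * remote reader in
						 * xen_spin_unlock */

	wmb();		/* set lock of interest before incrementing count */

	asm(LOCK_PREFIX " incw %0"
	    : "+m" (xl->spinners) : : "memory");

	return prev;
}

static inline void unspinning_lock(struct xen_spinlock *xl,
				   struct xen_spinlock *prev)
{
	asm(LOCK_PREFIX " decw %0"
	    : "+m" (xl->spinners) : : "memory");
	wmb();		/* decrement count before restoring lock */
	__this_cpu_write(lock_spinners, prev);	/* no barrier after this;
						 * when is the remote
						 * unlocker guaranteed to
						 * see prev? */
}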

> [3] (struct xen_spinlock *) 0x
> [4] (struct xen_spinlock *) 0x
> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
> 
> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
> 
> irq_count[1] = 1
> irq_count[2] = 2
> It is -1 for the rest.
> 
> *(struct pt_regs *) irq_regs[2] = {
> 	.ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
> 	.sp = 0xffff8803bfc83ce8
> }
> 
> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
> 
> #> bt -T ffff88033d65c4a0
> PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>   [ffff88033d72da90] ip_output at ffffffff81577518
>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>   --- bt -t would end here
>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>     RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
> 
> #> bt -T ffff88033b812dc0
> PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033b897690] find_get_page at ffffffff811161ee
>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>   [ffff88033b897930] get_request at ffffffff812f004e
>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>   --- bt -t ends here
>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>   [ffff88033b897b80] load_balance at ffffffff8105e788
>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>     RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
> 
