From: Ian Campbell <Ian.Campbell@citrix.com>
To: Stefan Bader <stefan.bader@canonical.com>
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: Strange PVM spinlock case revisited
Date: Mon, 11 Feb 2013 17:29:04 +0000
Message-ID: <1360603744.20449.74.camel@zakaz.uk.xensource.com>
In-Reply-To: <51151ABF.1070007@canonical.com>

On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I reported an issue which causes Xen PVM guests
> to hang (apparently related to spinlocks). Since then I have been able
> to gather a few more facts, which I try to provide below. As for the
> root cause, I am still puzzled and would be thankful for any hints or
> directions.
> 
> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging, at least 2 VCPUs are in xen_spin_lock_slow with the lock
>   free. IIRC the weird lock always belongs to one VCPU's runqueue.
> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>     "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting on this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
> 
> Maybe allowing or not allowing interrupts in xen_spin_lock_slow
> just manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow).
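
For the archives, the dance in question (v3.2-era arch/x86/xen/spinlock.c)
goes roughly like this -- paraphrased from memory and trimmed of the
stats accounting, so treat it as a sketch rather than the exact source:

static noinline int xen_spin_lock_slow(struct arch_spinlock *lock,
				       bool irq_enable)
{
	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
	struct xen_spinlock *prev;
	int irq = __this_cpu_read(lock_kicker_irq);
	int ret;

	/* announce we are spinning: publish xl in this cpu's
	 * lock_spinners slot, remembering any outer lock we preempted */
	prev = spinning_lock(xl);

	do {
		unsigned long flags;

		/* clear pending, then recheck the lock under us */
		xen_clear_irq_pending(irq);
		ret = xen_spin_trylock(lock);
		if (ret) {
			/* if we interrupted another spinner, leave the
			 * irq pending so it rechecks the lock instead of
			 * blocking on a kick we may have consumed */
			if (prev != NULL)
				xen_set_irq_pending(irq);
			goto out;
		}

		flags = arch_local_save_flags();
		if (irq_enable)
			raw_local_irq_enable();	/* the unmasking you
						 * experimented with */

		/* block in SCHEDOP_poll until the unlocker kicks us */
		xen_poll_irq(irq);

		raw_local_irq_restore(flags);
	} while (!xen_test_irq_pending(irq));	/* spurious wakeup? */
out:
	unspinning_lock(xl, prev);
	return ret;
}

All the nesting is carried by prev: an interrupt handler which takes a
contended lock while we are polling overwrites this cpu's lock_spinners
entry and restores it on the way out.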

>  The same
> way that the changed TLB flush may reduce it by causing fewer IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. VCPU2 seems to be doing something but
> looks to be rather deep in event handling. The irq_count is at 2, but
> irq_count seems to be used in a few more places than just the
> xen_hypervisor callback. Though that at least seems to be the entry on
> the stack I can see for VCPU2. The contents of the per-cpu variables
> irq_regs and irq_stack_ptr at least seem consistent, as if a callback
> came in while in the sched_op hypercall. But then the bt command seems
> to be operating on a completely different stack.
> VCPU2 seems to be active, not polling (event processing could be initiated
> from enable_irqs via the force callback or before finally exiting the
> sched_op hypercall). There seems to be a pending upcall, but upcalls are
> masked.
> The next thing I would try is to instrument the sites incrementing and
> decrementing irq_count...
> ---
> 
> Below some of the info seen from dom0 debugging keys and looking
> at a dump with crash:
> 
> Dom0 Info:
> 
> Polling vCPUs: {1,5-7}
> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
> 
> 10 [0/1]: s=6 n=1 x=0
> 40 [0/1]: s=6 n=6 x=0
> 46 [0/1]: s=6 n=7 x=0
> 
> Guest Info:
> 
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly, this means that two threads (VCPUs 2 and 6, I
suppose) are spinning on this lock, but no one is holding it?
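
(For reference, the pre-ticketlock lock itself is just the following --
again paraphrased:

struct xen_spinlock {
	unsigned char lock;		/* 0 -> free; 1 -> locked */
	unsigned short spinners;	/* count of waiting cpus */
};

so "lock=0, spinners=2" really does read as a free lock with two waiters
still registered against it.)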

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is in the nested wakeup path.
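
Concretely, in the loop sketched above, that would mean replacing the
xen_poll_irq() call with something like the below -- untested, and
assuming I'm reading the SCHEDOP_poll semantics right (the timeout is an
absolute Xen system time in ns, with 0 meaning no timeout, which is what
xen_poll_irq() passes today):

		/* untested: bound the block to ~1ms so a lost kick shows
		 * up as a spurious-looking wakeup instead of a hang */
		xen_poll_irq_timeout(irq, xen_clocksource_read() +
					  NSEC_PER_MSEC);

If the hang then clears itself (or degrades into a stream of timeout
wakeups) we would know the waiter blocked after everybody who might have
kicked it had already looked.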

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock, and that might show us something?
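
Something along these lines, perhaps -- untested, stats dropped, and the
two per-cpu variables are invented purely so crash can find them
post-mortem:

static DEFINE_PER_CPU(struct xen_spinlock *, last_kicked_lock);
static DEFINE_PER_CPU(int, last_kicked_cpu);

static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (per_cpu(lock_spinners, cpu) == xl) {
			/* record who we kicked, and for which lock */
			__this_cpu_write(last_kicked_lock, xl);
			__this_cpu_write(last_kicked_cpu, cpu);
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
		}
	}
}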

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is handled
in xen_spin_lock_slow somehow?

In a similar vein, do we not need a barrier in unspinning_lock to ensure
that the remote cpu sees prev getting restored?
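
To make that concrete, the pair in question looks roughly like this
(paraphrased once more), with the suspect spots marked:

static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev;

	prev = __this_cpu_read(lock_spinners);
	__this_cpu_write(lock_spinners, xl);	/* plain store; nothing
						 * orders this against a
						 * remote reader in
						 * xen_spin_unlock */

	wmb();		/* set lock of interest before incrementing count */

	asm(LOCK_PREFIX " incw %0"
	    : "+m" (xl->spinners) : : "memory");

	return prev;
}

static inline void unspinning_lock(struct xen_spinlock *xl,
				   struct xen_spinlock *prev)
{
	asm(LOCK_PREFIX " decw %0"
	    : "+m" (xl->spinners) : : "memory");
	wmb();		/* decrement count before restoring lock */
	__this_cpu_write(lock_spinners, prev);	/* no barrier after this;
						 * when is the remote
						 * unlocker guaranteed to
						 * see prev? */
}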

> [3] (struct xen_spinlock *) 0x
> [4] (struct xen_spinlock *) 0x
> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
> 
> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
> 
> irq_count[1] = 1
> irq_count[2] = 2
> It is -1 for the rest.
> 
> *(struct pt_regs *) irq_regs[2] = {
> 	.ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
> 	.sp = 0xffff8803bfc83ce8
> }
> 
> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
> 
> #> bt -T ffff88033d65c4a0
> PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>   [ffff88033d72da90] ip_output at ffffffff81577518
>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>   --- bt -t would end here
>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>     RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
> 
> #> bt -T ffff88033b812dc0
> PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033b897690] find_get_page at ffffffff811161ee
>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>   [ffff88033b897930] get_request at ffffffff812f004e
>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>   --- bt -t ends here
>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>   [ffff88033b897b80] load_balance at ffffffff8105e788
>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>     RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
> 
