* Strange PVM spinlock case revisited
@ 2013-02-08 15:33 Stefan Bader
  2013-02-11 17:29 ` Ian Campbell
  2013-02-14 11:04 ` Strange PVM spinlock case revisited (nailed down) Stefan Bader
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Bader @ 2013-02-08 15:33 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Ian Campbell



A while ago I reported an issue which causes Xen PVM guests to hang
(apparently related to spinlocks). Since then I was able to gather a
few more facts, which I try to provide below. As for the real cause,
I am still puzzled and would be thankful for any hints or direction.

- Happens with Xen 4.1.2 and Xen 4.2 host-side.
- PVM guest with 8 VCPUs is running 800 threads doing DB updates.
- When hanging, at least 2 VCPUs sit in xen_spin_lock_slow() although the
  lock is free. IIRC the weird lock always belongs to one VCPU's runqueue.
- Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
  verified farther back). Since v3.6 it cannot be reproduced. Bisect
  ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
    "x86/tlb: enable tlb flush range support for x86"
- The other potential fix for this was to prevent xen_spin_lock_slow()
  from unmasking events (enabling interrupts), which it otherwise does
  if those were enabled before the spinlock call (see the sketch below
  this list).
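
Roughly, the interesting part of xen_spin_lock_slow() in the 3.2-era
arch/x86/xen/spinlock.c looks like the sketch below (reconstructed from
memory, so not a verbatim copy; irq is the per-cpu lock-kicker event and
irq_enable says whether the caller had interrupts enabled). The workaround
mentioned above simply skips the raw_local_irq_enable() step:

        do {
                unsigned long flags;

                /* clear any pending state of our lock-kicker event */
                xen_clear_irq_pending(irq);

                /* the lock may have become free while we were setting up */
                if (xen_spin_trylock(lock))
                        goto out;

                flags = arch_local_save_flags();
                if (irq_enable)                 /* caller had interrupts on */
                        raw_local_irq_enable(); /* <- the workaround drops this */

                /* block in SCHEDOP_poll until the event becomes pending */
                xen_poll_irq(irq);

                raw_local_irq_restore(flags);
        } while (!xen_test_irq_pending(irq));   /* anything else was spurious */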

Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
tips the likelihood of getting nested deeper, the same way that the
changed TLB flush may reduce it by causing fewer IPIs.
From the static information I can extract from a dump it is hard to
tell what exactly is going on. VCPU2 seems to be doing something but
looks to be rather deep in event handling. The irq_count is at 2, but
irq_count seems to be used in a few more places than just the
xen_hypervisor callback. Though that at least seems to be the entry on
the stack I can see for VCPU2. The contents of the per-cpu variables
irq_regs and irq_stack_ptr at least seem consistent with a callback
coming in while being on the sched_op hypercall. But then the bt
command seems to be operating on a completely different stack.
VCPU2 seems to be active, not polling (event processing could have been
initiated from enable_irqs via the force callback, or before finally
exiting the sched_op hypercall); there seems to be a pending upcall,
but upcalls are masked.
The next thing I would try is to instrument the sites incrementing and
decrementing irq_count...
---

Below some of the info seen from dom0 debugging keys and looking
at a dump with crash:

Dom0 Info:

Polling vCPUs: {1,5-7}
VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01

10 [0/1]: s=6 n=1 x=0
40 [0/1]: s=6 n=6 x=0
46 [0/1]: s=6 n=7 x=0

Guest Info:

lock_spinners:
[0] (struct xen_spinlock *) 0x
[1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
[2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
[3] (struct xen_spinlock *) 0x
[4] (struct xen_spinlock *) 0x
[5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
[6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
[7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }

VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700

irq_count[1] = 1
irq_count[2] = 2
It is -1 for the rest.

*(struct pt_regs *) irq_regs[2] = {
	.ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
	.sp = 0xffff8803bfc83ce8
}

char *irq_stack_ptr[2] = 0xffff8803bfc83fc0

#> bt -T ffff88033d65c4a0
PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
  [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033d72d620] check_preempt_curr at ffffffff81050254
  [ffff88033d72d640] pull_task at ffffffff8105e33e
  [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
  [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033d72d690] find_get_page at ffffffff811161ee
  [ffff88033d72d6b0] find_lock_page at ffffffff81116286
  [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
  [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
  [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
  [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
  [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
  [ffff88033d72d920] netif_rx at ffffffff8153f400
  [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
  [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
  [ffff88033d72d9d0] do_softirq at ffffffff81016284
  [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
  [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
  [ffff88033d72da50] ip_finish_output at ffffffff815769ab
  [ffff88033d72da90] ip_output at ffffffff81577518
  [ffff88033d72dac0] ip_local_out at ffffffff81576c09
  [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
  [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
  [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
  [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
  [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
  [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
  [ffff88033d72dc10] ttwu_queue at ffffffff81052672
  [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
  --- bt -t would end here
  [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
  [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
  [ffff88033d72dd70] md_make_request at ffffffff814e0606
  [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
  [ffff88033d72de10] mempool_free_slab at ffffffff81118507
  [ffff88033d72de20] mempool_free at ffffffff811185b7
  [ffff88033d72de50] sys_sendto at ffffffff8152974d
  [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
  [ffff88033d72df00] vfs_write at ffffffff811764b0
  [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
    RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
    RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
    RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
    RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
    R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
    R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b

#> bt -T ffff88033b812dc0
PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
  [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897650] irq_to_desc at ffffffff810d70b7
  [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897670] balance_tasks at ffffffff8105e493
  [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033b897690] find_get_page at ffffffff811161ee
  [ffff88033b8976b0] find_lock_page at ffffffff81116286
  [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
  [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
  [ffff88033b8977b0] unlock_page at ffffffff8111589a
  [ffff88033b8977d0] __do_fault at ffffffff81138ac9
  [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
  [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
  [ffff88033b897900] sched_clock at ffffffff8101bd79
  [ffff88033b897910] blk_rq_init at ffffffff812ec262
  [ffff88033b897930] get_request at ffffffff812f004e
  [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
  [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
  --- bt -t ends here
  [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
  [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897a60] info_for_irq at ffffffff813a636e
  [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
  [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
  [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
  [ffff88033b897b20] xen_spin_lock at ffffffff810121da
  [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
  [ffff88033b897b50] double_rq_lock at ffffffff8105119c
  [ffff88033b897b80] load_balance at ffffffff8105e788
  [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033b897c10] idle_balance at ffffffff8163d57c
  [ffff88033b897c60] __schedule at ffffffff81653ea9
  [ffff88033b897cc0] sleep_on_page at ffffffff81115810
  [ffff88033b897ce0] schedule at ffffffff81653fcf
  [ffff88033b897cf0] io_schedule at ffffffff8165407f
  [ffff88033b897d10] sleep_on_page at ffffffff8111581e
  [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
  [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
  [ffff88033b897da0] wake_bit_function at ffffffff8108a140
  [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
  [ffff88033b897e60] do_writepages at ffffffff81120ee1
  [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
  [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
  [ffff88033b897f00] vfs_write at ffffffff811764b0
  [ffff88033b897f40] do_fsync at ffffffff811a4a36
  [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
  [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
    RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
    RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
    RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
    RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
    R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
    R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
    ORIG_RAX: 000000000000004b  CS: e033  SS: e02b



* Re: Strange PVM spinlock case revisited
  2013-02-08 15:33 Strange PVM spinlock case revisited Stefan Bader
@ 2013-02-11 17:29 ` Ian Campbell
  2013-02-12 13:42   ` Stefan Bader
  2013-02-13 11:31   ` Stefan Bader
  2013-02-14 11:04 ` Strange PVM spinlock case revisited (nailed down) Stefan Bader
  1 sibling, 2 replies; 13+ messages in thread
From: Ian Campbell @ 2013-02-11 17:29 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Andrew Cooper, xen-devel

On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I had been reporting an issue which causes Xen PVM guests
> to hang (apparently related to spinlocks). Since then I was able to
> gather a few more facts which I try to provide below. For the real
> reasons, I am still puzzled and would be thankful for any hints or
> direction.
> 
> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging at least 2 VCPUs in xen_spin_lock_slow with the lock free.
>   IIRC the weird lock always belongs to one VCPU runqueue.
> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>     "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting on this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
> 
> Maybe allowing or not allowing the interrupts in xen_spin_lock_slow
> just manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow).

>  The same
> way that the changed TLB flush may reduce it by causing lesser IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. VCPU2 seems to be doing something but
> looks to be rather deep in event handling. The irq_count is at 2, but
> irq_count seems to be used in a few more places than just xen_hypervisor
> callback. Though that at least seems to be the entry on the stack I can
> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
> irq_stack_ptr at least seem consistent as if a callback came in while
> being on the sched_op hypercall. But then the bt command seems to be
> operating on a completely different stack.
> VCPU2 seems to be active, not polling (event processing could be initiated
> from enable_irqs via force callback or before finally exiting the sched_op
> hypercall), there seems to be a pending upcall but upcalls are masked.
> The next thing I would try is to instrument the sites incrementing and
> decrementing irq_count...
> ---
> 
> Below some of the info seen from dom0 debugging keys and looking
> at a dump with crash:
> 
> Dom0 Info:
> 
> Polling vCPUs: {1,5-7}
> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
> 
> 10 [0/1]: s=6 n=1 x=0
> 40 [0/1]: s=6 n=6 x=0
> 46 [0/1]: s=6 n=7 x=0
> 
> Guest Info:
> 
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly this means that 2 threads (VCPU2 & 6, I
suppose) are spinning on this lock, but no one is holding it?

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is on nested wakeup.
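
Concretely, something along these lines in the poll loop of
xen_spin_lock_slow() (xen_poll_irq_timeout() is already there, it shows up in
the guest backtrace; the deadline expression is only illustrative, SCHEDOP_poll
wants an absolute system time in ns and now_ns() is just a stand-in for
however the guest reads that clock):

        /* bounded poll: a lost wakeup then only stalls us for ~1ms */
        xen_poll_irq_timeout(irq, now_ns() + 1000000ULL);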

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock and that might show us something?
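
Something like this, perhaps (sketch only; the loop mirrors how I remember
xen_spin_unlock_slow() scanning lock_spinners, and the two per-cpu trace
variables are made up for the experiment):

        static DEFINE_PER_CPU(struct xen_spinlock *, last_kicked_lock);
        static DEFINE_PER_CPU(int, last_kicked_cpu);

        static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
        {
                int cpu;

                for_each_online_cpu(cpu) {
                        if (per_cpu(lock_spinners, cpu) == xl) {
                                /* record which lock we kick and whom */
                                __this_cpu_write(last_kicked_lock, xl);
                                __this_cpu_write(last_kicked_cpu, cpu);
                                xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                                /* IIRC only the first spinner found is kicked */
                                break;
                        }
                }
        }

Dumping those two per-cpu variables from crash next to lock_spinners should
then show whether the kick went to the VCPU we expect.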

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is handled
in xen_spin_lock_slow somehow?

In a similar vein do we not need a barrier in unspinning_lock to ensure
that the remote cpu sees prev getting restored?
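
For reference, the two helpers look roughly like this in the 3.x code (quoted
from memory, so possibly not verbatim); the question above is essentially
whether these plain stores and the wmb()s are enough ordering versus
xen_spin_unlock_slow() reading lock_spinners on another CPU:

        static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
        {
                struct xen_spinlock *prev;

                prev = __this_cpu_read(lock_spinners);  /* remember a nested lock */
                __this_cpu_write(lock_spinners, xl);    /* advertise what we spin on */

                wmb();  /* set lock of interest before incrementing count */

                asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

                return prev;
        }

        static inline void unspinning_lock(struct xen_spinlock *xl,
                                           struct xen_spinlock *prev)
        {
                asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");

                wmb();  /* decrement count before restoring lock */

                __this_cpu_write(lock_spinners, prev);  /* back to the outer lock */
        }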

> [3] (struct xen_spinlock *) 0x
> [4] (struct xen_spinlock *) 0x
> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
> 
> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
> 
> irq_count[1] = 1
> irq_count[2] = 2
> It is -1 for the rest.
> 
> *(struct pt_regs *) irq_regs[2] = {
> 	.ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
> 	.sp = 0xffff8803bfc83ce8
> }
> 
> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
> 
> #> bt -T ffff88033d65c4a0
> PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>   [ffff88033d72da90] ip_output at ffffffff81577518
>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>   --- bt -t would end here
>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>     RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
> 
> #> bt -T ffff88033b812dc0
> PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>   [ffff88033b897690] find_get_page at ffffffff811161ee
>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>   [ffff88033b897930] get_request at ffffffff812f004e
>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>   --- bt -t ends here
>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>   [ffff88033b897b80] load_balance at ffffffff8105e788
>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>     RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
> 


* Re: Strange PVM spinlock case revisited
  2013-02-11 17:29 ` Ian Campbell
@ 2013-02-12 13:42   ` Stefan Bader
  2013-02-12 14:07     ` Ian Campbell
  2013-02-13 11:31   ` Stefan Bader
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Bader @ 2013-02-12 13:42 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, xen-devel



On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I had been reporting an issue which causes Xen PVM guests
>> to hang (apparently related to spinlocks). Since then I was able to
>> gather a few more facts which I try to provide below. For the real
>> reasons, I am still puzzled and would be thankful for any hints or
>> direction.
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> 
> This is on a host with >= 8 PCPUs, correct?

Maybe >=4 but I cannot remember for sure anymore.
> 
>> - When hanging at least 2 VCPUs in xen_spin_lock_slow with the lock free.
>>   IIRC the weird lock always belongs to one VCPU runqueue.
>> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>>     "x86/tlb: enable tlb flush range support for x86"
> 
> That seems like a very odd candidate for impacting on this issue.
> 
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing the interrupts in xen_spin_lock_slow
>> just manages to tip the likelihood of getting nested deeper.
> 
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).
> 

I am suspecting something like this. Or maybe generally a nesting issue revealed
by enabling interrupts here.

>>  The same
>> way that the changed TLB flush may reduce it by causing lesser IPIs.
> 
> Ah, that might explain why that patch is relevant, yes.
> 
>> From the static information I can extract from a dump it is hard to
>> tell what exactly is going on. VCPU2 seems to be doing something but
>> looks to be rather deep in event handling. The irq_count is at 2, but
>> irq_count seems to be used in a few more places than just xen_hypervisor
>> callback. Though that at least seems to be the entry on the stack I can
>> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
>> irq_stack_ptr at least seem consistent as if a callback came in while
>> being on the sched_op hypercall. But then the bt command seems to be
>> operating on a completely different stack.
>> VCPU2 seems to be active, not polling (event processing could be initiated
>> from enable_irqs via force callback or before finally exiting the sched_op
>> hypercall), there seems to be a pending upcall but upcalls are masked.
>> The next thing I would try is to instrument the sites incrementing and
>> decrementing irq_count...
>> ---
>>
>> Below some of the info seen from dom0 debugging keys and looking
>> at a dump with crash:
>>
>> Dom0 Info:
>>
>> Polling vCPUs: {1,5-7}
>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>
>> 10 [0/1]: s=6 n=1 x=0
>> 40 [0/1]: s=6 n=6 x=0
>> 46 [0/1]: s=6 n=7 x=0
>>
>> Guest Info:
>>
>> lock_spinners:
>> [0] (struct xen_spinlock *) 0x
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
> 
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spin_lock_slow(). VCPU2 may be either in the
raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
after receiving the wakeup irq but before returning from the hv call. So the
VCPU that had the lock has released it, and likely the wakeup has been sent.

> 
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. This would at least re-activate the other waiters (in the
case shown, VCPU6), and if something in the path of execution on VCPU2
deadlocked there, this would move things ahead.

> 
> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock and that might show us something?

IIRC I had done this but it did not really show much. What I have done in the
meantime was to instrument the IRQ service functions in
arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
another experiment where VCPUs 1,2,5,6 wait on the same lock and 2,5,6 are
still polling) that the last events on the VCPU not polling are:

xen_do_hypervisor_callback+127
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
xen_do_hypervisor_callback+28
call_softirq+29
xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher offsets
are where irq_count gets decremented before ending the callback. This shows
that at least in this case there was an upcall, a softirq and another upcall
being triggered without finishing the previous one. In another experiment I
saw softirq, upcall, upcall. But it seems that there is always a three-level
stack.
I believe that softirq, upcall would be ok, as softirq processing enables
interrupts, but is it expected to have an upcall interrupted by another one?
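
(The instrumentation itself is just a small per-cpu ring buffer of addresses
that the entry code stores into next to each irq_count increment/decrement.
A C equivalent of the idea, with all names made up for illustration, would be
roughly:

        #define CB_HIST_LEN 16

        static DEFINE_PER_CPU(unsigned long [CB_HIST_LEN], cb_hist);
        static DEFINE_PER_CPU(unsigned int, cb_hist_idx);

        /* called from entry_64.S with the address of the inc/dec site */
        void record_cb_event(unsigned long site)
        {
                unsigned int idx = __this_cpu_read(cb_hist_idx);

                __this_cpu_write(cb_hist[idx % CB_HIST_LEN], site);
                __this_cpu_write(cb_hist_idx, idx + 1);
        }

The symbol+offset list above is simply the tail of that buffer for the stuck
VCPU, resolved with crash.)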

> 
> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? maybe this is handled
> in xen_spin_lock_slow somehow?

Isn't that safe because lock_spinners is percpu?

> 
> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
> 
>> [3] (struct xen_spinlock *) 0x
>> [4] (struct xen_spinlock *) 0x
>> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
>> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
>> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
>>
>> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
>>
>> irq_count[1] = 1
>> irq_count[2] = 2
>> It is -1 for the rest.
>>
>> *(struct pt_regs *) irq_regs[2] = {
>> 	.ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
>> 	.sp = 0xffff8803bfc83ce8
>> }
>>
>> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
>>
>> #> bt -T ffff88033d65c4a0
>> PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
>>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>>   [ffff88033d72da90] ip_output at ffffffff81577518
>>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>>   --- bt -t would end here
>>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>>     RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
>>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
>>
>> #> bt -T ffff88033b812dc0
>> PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
>>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>>   [ffff88033b897690] find_get_page at ffffffff811161ee
>>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>>   [ffff88033b897930] get_request at ffffffff812f004e
>>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>>   --- bt -t ends here
>>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>>   [ffff88033b897b80] load_balance at ffffffff8105e788
>>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>>     RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
>>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
>>





* Re: Strange PVM spinlock case revisited
  2013-02-12 13:42   ` Stefan Bader
@ 2013-02-12 14:07     ` Ian Campbell
  2013-02-12 14:50       ` Stefan Bader
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2013-02-12 14:07 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Andrew Cooper, xen-devel

On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
> On 11.02.2013 18:29, Ian Campbell wrote:
> > On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> >> A while ago I had been reporting an issue which causes Xen PVM guests
> >> to hang (apparently related to spinlocks). Since then I was able to
> >> gather a few more facts which I try to provide below. For the real
> >> reasons, I am still puzzled and would be thankful for any hints or
> >> direction.
> >>
> >> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> >> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> > 
> > This is on a host with >= 8 PCPUs, correct?
> 
> Maybe >=4 but I cannot remember for sure anymore.

OK, so the important point I was getting at was whether the guest was
overcommitting the host or not; it seems like it is (2x VCPUs for each
PCPU).


> >> From the static information I can extract from a dump it is hard to
> >> tell what exactly is going on. VCPU2 seems to be doing something but
> >> looks to be rather deep in event handling. The irq_count is at 2, but
> >> irq_count seems to be used in a few more places than just xen_hypervisor
> >> callback. Though that at least seems to be the entry on the stack I can
> >> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
> >> irq_stack_ptr at least seem consistent as if a callback came in while
> >> being on the sched_op hypercall. But then the bt command seems to be
> >> operating on a completely different stack.
> >> VCPU2 seems to be active, not polling (event processing could be initiated
> >> from enable_irqs via force callback or before finally exiting the sched_op
> >> hypercall), there seems to be a pending upcall but upcalls are masked.
> >> The next thing I would try is to instrument the sites incrementing and
> >> decrementing irq_count...
> >> ---
> >>
> >> Below some of the info seen from dom0 debugging keys and looking
> >> at a dump with crash:
> >>
> >> Dom0 Info:
> >>
> >> Polling vCPUs: {1,5-7}
> >> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> >> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
> >> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
> >> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> >> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> >> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
> >> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
> >> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
> >>
> >> 10 [0/1]: s=6 n=1 x=0
> >> 40 [0/1]: s=6 n=6 x=0
> >> 46 [0/1]: s=6 n=7 x=0
> >>
> >> Guest Info:
> >>
> >> lock_spinners:
> >> [0] (struct xen_spinlock *) 0x
> >> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> >> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
> > 
> > If I understand correctly this means that 2 threads (VCPU2 & 6, I
> > suppose) are spinning on this lock, but no one is holding it?
> 
> To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
> raw_local_irq_enable() before doing poll_irq or maybe in handling some event
> after receiving the wakeup irq but before returning from the hv call. So the
> VCPU that had the lock has returned it and likely the wakeup has been sent.

Makes sense, so we are in the window between one holder unlocking and
the next one locking, which explains why no one is currently holding it
(which I previously thought odd).

> > An interesting hack^Wexperiment might be to make xen_poll_irq use a
> > timeout and see if that unwedges things -- this would help confirm that
> > the issue is on nested wakeup.
> 
> I could do that. This would at least re-activate the other waiters. In the case
> shown VCPU6 and if something in the path of execution on VCPU2 deadlocks there
> this would move things ahead.

Yes, it would be interesting e.g. if only one of 6 or 2 made progress
and the other remained wedged for all time.

> > I suppose xen_spin_unlock could also be instrumented to indicate who it
> > last kicked and for which lock and that might show us something?
> 
> IIRC I had done this but it did not really show much. What I have done in the
> meantime was to instrument the IRQ service functions in
> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
> another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
> that the last events on the vcpu not polling are:
> 
> xen_do_hypervisor_callback+127
> call_softirq+29
> call_softirq+125
> call_softirq+29
> call_softirq+125
> call_softirq+29
> call_softirq+125
> xen_do_hypervisor_callback+28
> call_softirq+29
> xen_do_hypervisor_callback+28
> 
> The lower offsets are when irq_count gets incremented and the higher offsets are
> when irq_count gets decremented before ending the callback. This shows that at
> least in this case there was an upcall, a softirq and another upcall being
> triggered without finishing the previous one. There was another experiment where
> I saw softirq, upcall, upcall. But at least it seems that there is always a
> three level stack.
> I believe that softirq, upcall would be ok as softirq processing enables
> interrupts but is it expected to have an upcall interrupted by another one?

I'm not sure. evtchns don't have a priority mechanism so I expect not
in general, but individual interrupt handlers could well reenable
interrupts, I think (it might be one of the flags you pass when calling
request_irq?).

> > Can someone explain why the non-locked update of lock_spinners in
> > spinning_lock() is safe against xen_spin_unlock reading the field from
> > another VCPU? I suspect it isn't, so what happens if the other VCPU
> > kicks the lock which is just about to become prev? maybe this is handled
> > in xen_spin_lock_slow somehow?
> 
> Isn't that safe because lock_spinners is percpu?

xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
is what I think might be unsafe.

Ian.


* Re: Strange PVM spinlock case revisited
  2013-02-12 14:07     ` Ian Campbell
@ 2013-02-12 14:50       ` Stefan Bader
  2013-02-12 15:07         ` Ian Campbell
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Bader @ 2013-02-12 14:50 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, xen-devel



On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>> gather a few more facts which I try to provide below. For the real
>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>> direction.
>>>>
>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
> 
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err not so much. I run on an 8-core host. ;)

> 
> 
>>>> From the static information I can extract from a dump it is hard to
>>>> tell what exactly is going on. VCPU2 seems to be doing something but
>>>> looks to be rather deep in event handling. The irq_count is at 2, but
>>>> irq_count seems to be used in a few more places than just xen_hypervisor
>>>> callback. Though that at least seems to be the entry on the stack I can
>>>> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
>>>> irq_stack_ptr at least seem consistent as if a callback came in while
>>>> being on the sched_op hypercall. But then the bt command seems to be
>>>> operating on a completely different stack.
>>>> VCPU2 seems to be active, not polling (event processing could be initiated
>>>> from enable_irqs via force callback or before finally exiting the sched_op
>>>> hypercall), there seems to be a pending upcall but upcalls are masked.
>>>> The next thing I would try is to instrument the sites incrementing and
>>>> decrementing irq_count...
>>>> ---
>>>>
>>>> Below some of the info seen from dom0 debugging keys and looking
>>>> at a dump with crash:
>>>>
>>>> Dom0 Info:
>>>>
>>>> Polling vCPUs: {1,5-7}
>>>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>>>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>>>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>>>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>>>
>>>> 10 [0/1]: s=6 n=1 x=0
>>>> 40 [0/1]: s=6 n=6 x=0
>>>> 46 [0/1]: s=6 n=7 x=0
>>>>
>>>> Guest Info:
>>>>
>>>> lock_spinners:
>>>> [0] (struct xen_spinlock *) 0x
>>>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>>>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>>>
>>> If I understand correctly this means that 2 threads (VCPU2 & 6, I
>>> suppose) are spinning on this lock, but no one is holding it?
>>
>> To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
>> raw_local_irq_enable() before doing poll_irq or maybe in handling some event
>> after receiving the wakeup irq but before returning from the hv call. So the
>> VCPU that had the lock has returned it and likely the wakeup has been sent.
> 
> Makes sense, so we are in the window between one holder unlocking and
> the next unlocking, which explains why noone is currently holding it
> (which I previously thought odd).
> 
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is on nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In the case
>> shown VCPU6 and if something in the path of execution on VCPU2 deadlocks there
>> this would move things ahead.
> 
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

> 
>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>> last kicked and for which lock and that might show us something?
>>
>> IIRC I had done this but it did not really show much. What I have done in the
>> meantime was to instrument the IRQ service functions in
>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
>> another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
>> that the last events on the vcpu not polling are:
>>
>> xen_do_hypervisor_callback+127
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> xen_do_hypervisor_callback+28
>> call_softirq+29
>> xen_do_hypervisor_callback+28
>>
>> The lower offsets are when irq_count gets incremented and the higher offsets are
>> when irq_count gets decremented before ending the callback. This shows that at
>> least in this case there was an upcall, a softirq and another upcall being
>> triggered without finishing the previous one. There was another experiment where
>> I saw softirq, upcall, upcall. But at least it seems that there is always a
>> three level stack.
>> I believe that softirq, upcall would be ok as softirq processing enables
>> interrupts but is it expected to have an upcall interrupted by another one?
> 
> I'm not sure. evtchn's don't have a priority mechanism so I expect not
> in general but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes
longer should be pushed off to a workqueue...
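
(I.e. the usual pattern, roughly; names made up, nothing Xen-specific:

        static void heavy_work_fn(struct work_struct *work)
        {
                /* runs later in process context, interrupts enabled, may sleep */
        }
        static DECLARE_WORK(heavy_work, heavy_work_fn);

        static irqreturn_t my_irq_handler(int irq, void *dev_id)
        {
                /* ack/record the event with interrupts still masked ... */
                schedule_work(&heavy_work);     /* ... and defer the heavy part */
                return IRQ_HANDLED;
        }

rather than re-enabling interrupts inside the handler itself.)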

> 
>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
> 
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together always was a bit
brain-hurting. Though somehow it feels like the top-level poll_irq would then
return and count things as a spurious wakeup. I think in that case I would
expect the lock_spinners entry of one VCPU not to match the freed lock...
Not a really scientific argument.
> 
> Ian.
> 




* Re: Strange PVM spinlock case revisited
  2013-02-12 14:50       ` Stefan Bader
@ 2013-02-12 15:07         ` Ian Campbell
  2013-02-12 15:53           ` Stefan Bader
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2013-02-12 15:07 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Andrew Cooper, xen-devel

On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
> On 12.02.2013 15:07, Ian Campbell wrote:
> > On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
> >> On 11.02.2013 18:29, Ian Campbell wrote:
> >>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> >>>> A while ago I had been reporting an issue which causes Xen PVM guests
> >>>> to hang (apparently related to spinlocks). Since then I was able to
> >>>> gather a few more facts which I try to provide below. For the real
> >>>> reasons, I am still puzzled and would be thankful for any hints or
> >>>> direction.
> >>>>
> >>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
> >>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> >>>
> >>> This is on a host with >= 8 PCPUs, correct?
> >>
> >> Maybe >=4 but I cannot remember for sure anymore.
> > 
> > OK, so the important point I was getting at was whether the guest was
> > overcommitting the host or not, it seems like it is (2xVCPUs for each
> > PCPU)
> 
> Err not so much. I run on an 8-core host. ;)

Um, which is what I asked. What is >= 4 then?


> > 
> >>> I suppose xen_spin_unlock could also be instrumented to indicate who it
> >>> last kicked and for which lock and that might show us something?
> >>
> >> IIRC I had done this but it did not really show much. What I have done in the
> >> meantime was to instrument the IRQ service functions in
> >> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
> >> another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
> >> that the last events on the vcpu not polling are:
> >>
> >> xen_do_hypervisor_callback+127
> >> call_softirq+29
> >> call_softirq+125
> >> call_softirq+29
> >> call_softirq+125
> >> call_softirq+29
> >> call_softirq+125
> >> xen_do_hypervisor_callback+28
> >> call_softirq+29
> >> xen_do_hypervisor_callback+28
> >>
> >> The lower offsets are when irq_count gets incremented and the higher offsets are
> >> when irq_count gets decremented before ending the callback. This shows that at
> >> least in this case there was an upcall, a softirq and another upcall being
> >> triggered without finishing the previous one. There was another experiment where
> >> I saw softirq, upcall, upcall. But at least it seems that there is always a
> >> three level stack.
> >> I believe that softirq, upcall would be ok as softirq processing enables
> >> interrupts but is it expected to have an upcall interrupted by another one?
> > 
> > I'm not sure. evtchn's don't have a priority mechanism so I expect not
> > in general but individual interrupt handlers could well reenable
> > interrupts, I think (it might be one of the flags you pass when calling
> > request_irq?).
> 
> Just a gut feeling but it feels wrong to enable interrupts in any interrupt
> handler. I thought for anything that needs interrupts enabled and/or takes
> longer it should pushed off to a workqueue...

Either that or some sort of interrupt prioritisation mechanism in the
hardware so only higher priority interrupts than the current one will
occur.

> 
> > 
> >>> Can someone explain why the non-locked update of lock_spinners in
> >>> spinning_lock() is safe against xen_spin_unlock reading the field from
> >>> another VCPU? I suspect it isn't, so what happens if the other VCPU
> >>> kicks the lock which is just about to become prev? maybe this is handled
> >>> in xen_spin_lock_slow somehow?
> >>
> >> Isn't that safe because lock_spinners is percpu?
> > 
> > xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> > is what I think might be unsafe.
> 
> Ah right, that. Hm, the way those two play together always was a bit brain
> hurting.

Yes...

>  Though somehow it feels like if the top level poll_irq would return and
> count things as spurious wakeup. I think in that case I would expect the
> lock_spinner entry of one vcpu to be not matching the freed lock...
> Not a really scientific argument.

Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
unspinning_lock might be interesting?
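
Placed right at the top of the helper, i.e. (same from-memory sketch of
unspinning_lock as earlier in the thread):

        static inline void unspinning_lock(struct xen_spinlock *xl,
                                           struct xen_spinlock *prev)
        {
                /* the lock being popped must be the one this cpu advertised */
                BUG_ON(__this_cpu_read(lock_spinners) != xl);

                asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
                wmb();  /* decrement count before restoring lock */
                __this_cpu_write(lock_spinners, prev);
        }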

I wonder if this is something where the ftrace and related kernel
infrastructure might be useful?
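
E.g. simply sprinkling trace_printk() into the slow paths and pulling the
trace buffer out of the dump afterwards; a sketch of the sort of annotation I
mean, with the arguments just picked for illustration:

        /* in xen_spin_lock_slow() */
        trace_printk("lock_slow %p spinners %d irq %d\n", xl, xl->spinners, irq);

        /* in xen_spin_unlock_slow() */
        trace_printk("unlock_slow %p kick cpu %d\n", xl, cpu);

ftrace keeps that per-cpu and timestamped, which is pretty much the history
being reconstructed by hand above.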

Ian.


* Re: Strange PVM spinlock case revisited
  2013-02-12 15:07         ` Ian Campbell
@ 2013-02-12 15:53           ` Stefan Bader
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Bader @ 2013-02-12 15:53 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 5711 bytes --]

On 12.02.2013 16:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
>> On 12.02.2013 15:07, Ian Campbell wrote:
>>> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>>>> On 11.02.2013 18:29, Ian Campbell wrote:
>>>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>>>> gather a few more facts which I try to provide below. For the real
>>>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>>>> direction.
>>>>>>
>>>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>>>
>>>>> This is on a host with >= 8 PCPUs, correct?
>>>>
>>>> Maybe >=4 but I cannot remember for sure anymore.
>>>
>>> OK, so the important point I was getting at was whether the guest was
>>> overcommitting the host or not, it seems like it is (2xVCPUs for each
>>> PCPU)
>>
>> Err not so much. I run on an 8-core host. ;)
> 
> Um, which is what I asked. What is >= 4 then?

Uhm, that was me not being able to discern between P and V at some point. What I
was thinking was that I believe I saw this when running with either 4 or 8 VCPUs
on the same 8-PCPU host...

> 
> 
>>>
>>>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>>>> last kicked and for which lock and that might show us something?
>>>>
>>>> IIRC I had done this but it did not really show much. What I have done in the
>>>> meantime was to instrument the IRQ service functions in
>>>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
>>>> another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
>>>> that the last events on the vcpu not polling are:
>>>>
>>>> xen_do_hypervisor_callback+127
>>>> call_softirq+29
>>>> call_softirq+125
>>>> call_softirq+29
>>>> call_softirq+125
>>>> call_softirq+29
>>>> call_softirq+125
>>>> xen_do_hypervisor_callback+28
>>>> call_softirq+29
>>>> xen_do_hypervisor_callback+28
>>>>
>>>> The lower offsets are when irq_count gets incremented and the higher offsets are
>>>> when irq_count gets decremented before ending the callback. This shows that at
>>>> least in this case there was an upcall, a softirq and another upcall being
>>>> triggered without finishing the previous one. There was another experiment where
>>>> I saw softirq, upcall, upcall. But at least it seems that there is always a
>>>> three level stack.
>>>> I believe that softirq, upcall would be ok as softirq processing enables
>>>> interrupts but is it expected to have an upcall interrupted by another one?
>>>
>>> I'm not sure. evtchn's don't have a priority mechanism so I expect not
>>> in general but individual interrupt handlers could well reenable
>>> interrupts, I think (it might be one of the flags you pass when calling
>>> request_irq?).
>>
>> Just a gut feeling but it feels wrong to enable interrupts in any interrupt
>> handler. I thought anything that needs interrupts enabled and/or takes
>> longer should be pushed off to a workqueue...
> 
> Either that or some sort of interrupt prioritisation mechanism in the
> hardware so only higher priority interrupts than the current one will
> occur.
> 
>>
>>>
>>>>> Can someone explain why the non-locked update of lock_spinners in
>>>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>>>> kicks the lock which is just about to become prev? maybe this is handled
>>>>> in xen_spin_lock_slow somehow?
>>>>
>>>> Isn't that safe because lock_spinners is percpu?
>>>
>>> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
>>> is what I think might be unsafe.
>>
>> Ah right, that. Hm, the way those two play together always was a bit brain
>> hurting.
> 
> Yes...
> 
>>  Though somehow it feels like if the top level poll_irq would return and
>> count things as spurious wakeup. I think in that case I would expect the
>> lock_spinner entry of one vcpu to be not matching the freed lock...
>> Not a really scientific argument.
> 
> Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
> unspinning_lock might be interesting?

Hm, besides writing prev into there (which should never be xl), I think that
since write access only happens in spin_lock_slow this should stack ok. The two
cases that would go wrong are sending a wakeup event when the lock was just
replaced on the other vcpu, or not sending one when it should be.
IIRC I had one attempt that printed when xen_spin_unlock_slow went through all
vcpus without finding one to signal. That seemed to happen rather often, but it
would then resolve itself, as xen_spin_lock_slow for the other lock would set
the event pending for the prev lock if there was one (after being successful
with the top-level lock).

I think there is room for improvement: spin_unlock_slow could avoid always
starting its search at vcpu 0, and spin_lock_slow could check whether returning
from poll without the current lock being free is because prev has become free
in the meantime. That way, when there is more than one spinner, more than one
may be able to proceed...
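
A rough sketch of the first idea, rotating the starting VCPU for the search in
xen_spin_unlock_slow (based on the v3.2-era byte-lock code; the names and the
rotating counter are assumptions, not a tested diff):

static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
        static unsigned int kick_start;         /* hypothetical rotating offset */
        unsigned int start = kick_start++ % nr_cpu_ids;
        unsigned int i, cpu;

        /* Start the scan at a rotating offset instead of always
         * favouring the low-numbered VCPUs. */
        for (i = 0; i < nr_cpu_ids; i++) {
                cpu = (start + i) % nr_cpu_ids;
                if (!cpu_online(cpu))
                        continue;
                if (per_cpu(lock_spinners, cpu) == xl) {
                        xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                        break;
                }
        }
}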

> 
> I wonder if this is something where the ftrace and related kernel
> infrastructure might be useful?

Maybe. Not sure there are hooks for what we are looking for, and whether the
system is not hanging too badly for that. My usual approach of looking at the
dump would require manually finding the buffer and parsing the records...





* Re: Strange PVM spinlock case revisited
  2013-02-11 17:29 ` Ian Campbell
  2013-02-12 13:42   ` Stefan Bader
@ 2013-02-13 11:31   ` Stefan Bader
  2013-02-13 12:16     ` Stefan Bader
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Bader @ 2013-02-13 11:31 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1937 bytes --]

On 11.02.2013 18:29, Ian Campbell wrote:

> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.
> 

So I went ahead and replaced xen_poll_irq with xen_poll_irq_timeout, and that
did get rid of the hang. Though I think there is a big caveat there. There was
only one other user of poll_irq_timeout in the kernel code, and that one uses
"jiffies + <timeout>*HZ". But when I look at the Xen side in do_poll, it looks
like the timeout is interpreted as an absolute "ns since boot" (of hv/dom0)
value. Not sure how that can ever work. The ns since boot in the guest is
clearly always behind the host (and jiffies isn't ns either).
Effectively I likely got rid of any wait time in the hypervisor and went back to
mostly spinning, which matches the observation that the test run never gets
stuck waiting for a timeout. That maybe proves the stacking is an issue, but it
is also likely a bit too aggressive in not having any wait at all... :/
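
For reference, the guest side hands the value straight through to the hypercall
(roughly like this, reconstructed from memory of drivers/xen/events.c, so treat
the details as an assumption). That is why a jiffies-based value ends up being
interpreted by do_poll as an absolute Xen system time which, per the above, is
most likely already in the past:

void xen_poll_irq_timeout(int irq, u64 timeout)
{
        int evtchn = evtchn_from_irq(irq);

        if (VALID_EVTCHN(evtchn)) {
                struct sched_poll poll;

                poll.nr_ports = 1;
                poll.timeout = timeout;  /* passed on unchanged; absolute ns for do_poll */
                set_xen_guest_handle(poll.ports, &evtchn);

                if (HYPERVISOR_sched_op(SCHEDOP_poll, &poll) != 0)
                        BUG();
        }
}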

I will try to think of some better way. Not sure the thinking is realistic but
maybe that could happen:

xen_spin_lock_slow(a)
  ...
  enables irq and upcalls are pending
    upcall processing wants lock b
    xen_spin_lock_slow(b)
                    --- just before replacing lock_spinners ---
                                                   xen_spin_unlock_slow(a)
                                                   finds other vcpu, triggers
                                                     IRQ
    lock b is top spinner
    going into poll_irq
    poll_irq returns
    lock a gets restored
    so maybe no spinners on b
    dropping out to xen_spin_lock
                                                  unlock of b not finding any
                                                  spinners
    lock b acquired

That way the irq for lock a maybe get lost...



* Re: Strange PVM spinlock case revisited
  2013-02-13 11:31   ` Stefan Bader
@ 2013-02-13 12:16     ` Stefan Bader
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Bader @ 2013-02-13 12:16 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1162 bytes --]

On 13.02.2013 12:31, Stefan Bader wrote:

> I will try to think of some better way. Not sure the thinking is realistic but
> maybe that could happen:
> 
> xen_spin_lock_slow(a)
>   ...
>   enables irq and upcalls are pending
>     upcall processing wants lock b
>     xen_spin_lock_slow(b)
>                     --- just before replacing lock_spinners ---
>                                                    xen_spin_unlock_slow(a)
>                                                    finds other vcpu, triggers
>                                                      IRQ
>     lock b is top spinner
>     going into poll_irq
>     poll_irq returns
>     lock a gets restored
>     so maybe no spinners on b
>     dropping out to xen_spin_lock
>                                                   unlock of b not finding any
>                                                   spinners
>     lock b acquired
> 
> That way the irq for lock a maybe get lost...

Darn, maybe not, since the pending status is not cleared when leaving
xen_spin_lock_slow. So I assume the interrupted lock does at least one more
cycle out of the slow path and back...
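
For context, the polling loop in the v3.2 xen_spin_lock_slow looks roughly like
this (reconstructed from memory, so details may be off). The pending bit is only
cleared at the top of the loop, which is why an interrupted waiter should do at
least one more poll cycle rather than losing the kick outright:

do {
        /* clear pending before (re)checking the lock */
        xen_clear_irq_pending(irq);

        /* did the lock become free while we were not looking? */
        if (xen_spin_trylock(lock))
                break;

        flags = arch_local_save_flags();
        if (irq_enable)
                raw_local_irq_enable();   /* the problematic re-enable */

        xen_poll_irq(irq);                /* block until the event is pending */

        raw_local_irq_restore(flags);
} while (!xen_test_irq_pending(irq));     /* retry on spurious wakeup */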




* Strange PVM spinlock case revisited (nailed down)
  2013-02-08 15:33 Strange PVM spinlock case revisited Stefan Bader
  2013-02-11 17:29 ` Ian Campbell
@ 2013-02-14 11:04 ` Stefan Bader
  2013-02-14 11:43   ` Jan Beulich
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Bader @ 2013-02-14 11:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Ian Campbell, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 4235 bytes --]

I think I can finally prove why allowing interrupts for the duration of poll_irq
in xen_spin_lock_slow is bad! \o/

So with another run/dump:

VCPUs 1, 2, 5 and 6 are spinning on the runq lock of VCPU 1. The number of
spinners is 4 (ok) and the lock itself is released.
According to the dom0 debug info, VCPUs 2, 5 and 6 are polling; VCPU 1 is not,
but seems to be running (has=T). The event channel for spinlock1 is pending.

Checking the interrupt stack of VCPU 1 in the guest dump shows:

hypercall_page+938
xen_poll_irq+16
xen_spin_lock_slow+151
xen_spin_lock_flags+99
_raw_spin_lock_irqsave+46
update_shares+156
rebalance_domains+72
run_rebalance_domains+72
__do_softirq+168
call_softirq+99
do_softirq+101
irq_exit+142
xen_evtchn_do_upcall+53
xen_do_hypervisor_callback+101

So right before finishing a previous callback, irq_exit triggers softirq
processing (with interrupts enabled), and while updating the shares this tries
to grab the runq lock that we see in lock_spinners.

Since irq_count is 2, there is one more interruption executing right now (though
irq_regs won't show this). So I just proceeded upwards in the interrupt stack
and found:
try_to_wake_up+57
default_wake_function+18
autoremove_wake_function+22
wake_bit_function+59
__wake_up_common+88
__wake_up+72
__wake_up_bit+49
end_page_writeback+64
put_io_page+36
ext4_end_bio+338
bio_endio+29
req_bio_endio.isra.45+163
blk_update_request+245
blk_update_bidi_request+49
__blk_end_bidi_request+32
__blk_end_request_all+31
blkif_interrupt+460
handle_irq_event_percpu+85
handle_irq_event+78
handle_edge_irq+132
__xen_evtchn_do_upcall+409
xen_evtchn_do_upcall+47
xen_do_hypervisor_callback+101

There was a bit more on the stack but try_to_wake_up seemed a believable current
path. Even more so after looking into the function:

#ifdef CONFIG_SMP
        /*
         * If the owning (remote) cpu is still in the middle of schedule() with
         * this task as prev, wait until its done referencing the task.
         */
        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif

And prying the task in question out of the stack, it was one currently
accounted to VCPU 6 and on_cpu. VCPU 6 is one of the other waiters for the
runq lock of VCPU 1, which would get picked up as soon as this callback is
done. Unfortunately we "stay awhile... stay forever!".

Ok, as an experiment I defined __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86, and a
kernel built that way no longer locks up in the PVM guest.
The problem is that this define is gone since v3.7 (f3e9478674). And the same
testcase is not able to cause the issue since v3.6 (611ae8e3f5). The TLB flush
change is unlikely to be directly related (it rather changes the number of IPIs
and thereby the likelihood of observing this).

So for v3.6+ kernels, the question is whether it can be ruled out that during
softirq handling rebalance_domains->update_blocked_averages (was update_shares)
is called, which can then get into spin_lock_slow and enable interrupts.
I have not seen it, nor am I aware of reports about it, but it is likely one of
those rare occurrences.

As for a solution, the simplest one would be never to re-enable interrupts
before poll_irq... Anything else feels rather complicated at the moment (like
trying to make the scheduler disable interrupts explicitly and use plain
spin_lock variants instead of the spin_lock_irq ones... weird hidden
implications in common code "just" for PVM).
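
A minimal sketch of that simplest option, against the v3.2 byte-lock code (the
surrounding names are from memory, so treat this as an assumption rather than a
ready patch): in the polling loop of xen_spin_lock_slow, poll with events left
masked instead of honouring irq_enable.

        /* Instead of:
         *      flags = arch_local_save_flags();
         *      if (irq_enable)
         *              raw_local_irq_enable();
         *      xen_poll_irq(irq);
         *      raw_local_irq_restore(flags);
         * keep interrupts masked across the poll:
         */
        xen_poll_irq(irq);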

-Stefan

Note: comparing the number of db records created during a 15min run of the
testcase seems to yield similar results with interrupts enabled (with that arch
define in v3.2) or disabled. Though I do not have a large sample, and that PVM
guest is the only one running on the host.





* Re: Strange PVM spinlock case revisited (nailed down)
  2013-02-14 11:04 ` Strange PVM spinlock case revisited (nailed down) Stefan Bader
@ 2013-02-14 11:43   ` Jan Beulich
  2013-02-14 14:21     ` Stefan Bader
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2013-02-14 11:43 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Andrew Cooper, Ian Campbell, xen-devel

>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
> There was a bit more on the stack but try_to_wake_up seemed a believable 
> current
> path. Even more so after looking into the function:
> 
> #ifdef CONFIG_SMP
>         /*
>          * If the owning (remote) cpu is still in the middle of schedule() with
>          * this task as prev, wait until its done referencing the task.
>          */
>         while (p->on_cpu) {
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>                 /*
>                  * In case the architecture enables interrupts in
>                  * context_switch(), we cannot busy wait, since that
>                  * would lead to deadlocks when an interrupt hits and
>                  * tries to wake up @prev. So bail and do a complete
>                  * remote wakeup.
>                  */
>                 if (ttwu_activate_remote(p, wake_flags))
>                         goto stat;
> #else
>                 cpu_relax();
> #endif
> 
> And prying out the task in question from the stack, it was one currently
> being accounted for VCPU 6 and on_cpu. VCPU 6 is one of the other waiters 
> for
> the runq lock of VCPU 1. Which would get picked up as soon as this callback 
> is
> done. Unfortunately we "stay awhile... stay forever!".

When analyzing a similar problem with our old PV ticket lock
implementation (and the interrupt re-enabling there when possible
before going into polling mode), it was precisely this open coded
lock construct that caused problems for us. Back then I didn't,
however, realize that this would even affect the simplistic byte
locks used by the pv-ops Xen code (or else I would have pointed
this out).

Being relatively certain that with our new implementation we don't
have any such problems, I can only recommend against dropping
the re-enabling of interrupts - this needlessly introduces high
interrupt latencies in a broader range of cases. Instead, the
interactions ought to be fixed properly. And it's time for using
ticket locks in the Xen code too...

Jan


* Re: Strange PVM spinlock case revisited (nailed down)
  2013-02-14 11:43   ` Jan Beulich
@ 2013-02-14 14:21     ` Stefan Bader
  2013-02-14 14:48       ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Bader @ 2013-02-14 14:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Ian Campbell, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 3085 bytes --]

On 14.02.2013 12:43, Jan Beulich wrote:
>>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
>> There was a bit more on the stack but try_to_wake_up seemed a believable 
>> current
>> path. Even more so after looking into the function:
>>
>> #ifdef CONFIG_SMP
>>         /*
>>          * If the owning (remote) cpu is still in the middle of schedule() with
>>          * this task as prev, wait until its done referencing the task.
>>          */
>>         while (p->on_cpu) {
>> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>>                 /*
>>                  * In case the architecture enables interrupts in
>>                  * context_switch(), we cannot busy wait, since that
>>                  * would lead to deadlocks when an interrupt hits and
>>                  * tries to wake up @prev. So bail and do a complete
>>                  * remote wakeup.
>>                  */
>>                 if (ttwu_activate_remote(p, wake_flags))
>>                         goto stat;
>> #else
>>                 cpu_relax();
>> #endif
>>
>> And prying out the task in question from the stack, it was one currently
>> being accounted for VCPU 6 and on_cpu. VCPU 6 is one of the other waiters 
>> for
>> the runq lock of VCPU 1. Which would get picked up as soon as this callback 
>> is
>> done. Unfortunately we "stay awhile... stay forever!".
> 
> When analyzing a similar problem with our old PV ticket lock
> implementation (and the interrupt re-enabling there when possible
> before going into polling mode), it was precisely this open coded
> lock construct that caused problems for us. Back then I didn't,
> however, realize that this would even affect the simplistic byte
> locks used by the pv-ops Xen code (or else I would have pointed
> this out).
> 
> Being relatively certain that with our new implementation we don't
> have any such problems, I can only recommend against dropping
> the re-enabling of interrupts - this needlessly introduces high
> interrupt latencies in a broader range of cases. Instead, the
> interactions ought to be fixed properly. And it's time for using
> ticket locks in the Xen code too...
> 
Not sure what *your new* implementation is. ;) I am more concerned about the
currently released/stable kernels which potentially have this problem. I agree
that a proper solution is preferable. Ticket locks likely have a much lower
chance of hitting this, as part of the current problem is that lower-numbered
VCPUs are preferred in unlock_slow.
In the end, though, we would need something that could go upstream (and from
there back into stable), so it should not be too complicated; a proper solution
can then be applied on top.
Having more understanding now, I had a different idea. Not sure it is foolproof,
and it surely causes some more active spinning, but it looks like I can also
keep the system from locking up (v3.2 kernel and the testcase) if I change
xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners instead of only
the first one found.
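
A sketch of that change, based on the v3.2 xen_spin_unlock_slow (names assumed
from memory; the whole point is simply not stopping at the first match):

static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
        int cpu;

        /* Kick every VCPU that announced itself as spinning on this
         * lock, not only the first one found. */
        for_each_online_cpu(cpu) {
                if (per_cpu(lock_spinners, cpu) == xl)
                        xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                /* deliberately no break here */
        }
}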

-Stefan



* Re: Strange PVM spinlock case revisited (nailed down)
  2013-02-14 14:21     ` Stefan Bader
@ 2013-02-14 14:48       ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2013-02-14 14:48 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Andrew Cooper, Ian Campbell, xen-devel

>>> On 14.02.13 at 15:21, Stefan Bader <stefan.bader@canonical.com> wrote:
> Having more understanding now, I had a different idea. Not sure this is
> foolproof and surely is causing some more active spinning. But it looks like I
> also can keep the system from locking up (v3.2 kernel and the testcase) if I
> change xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners instead 
> of only the first one found.

That, afaict, is an absolute requirement with the open-coded
lock that you pointed at, as otherwise the code makes a
pseudo-ticket assumption, believing it knows who is going to
be the next owner of the lock.

Jan


Thread overview: 13+ messages
2013-02-08 15:33 Strange PVM spinlock case revisited Stefan Bader
2013-02-11 17:29 ` Ian Campbell
2013-02-12 13:42   ` Stefan Bader
2013-02-12 14:07     ` Ian Campbell
2013-02-12 14:50       ` Stefan Bader
2013-02-12 15:07         ` Ian Campbell
2013-02-12 15:53           ` Stefan Bader
2013-02-13 11:31   ` Stefan Bader
2013-02-13 12:16     ` Stefan Bader
2013-02-14 11:04 ` Strange PVM spinlock case revisited (nailed down) Stefan Bader
2013-02-14 11:43   ` Jan Beulich
2013-02-14 14:21     ` Stefan Bader
2013-02-14 14:48       ` Jan Beulich
