From: Stefan Bader
Subject: Re: Strange PVM spinlock case revisited
Date: Tue, 12 Feb 2013 14:42:04 +0100
Message-ID: <511A46AC.5020202@canonical.com>
In-Reply-To: <1360603744.20449.74.camel@zakaz.uk.xensource.com>
To: Ian Campbell
Cc: Andrew Cooper, xen-devel@lists.xen.org

On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I reported an issue that causes Xen PVM guests to hang
>> (apparently related to spinlocks). Since then I have been able to
>> gather a few more facts, which I try to provide below. As for the real
>> cause, I am still puzzled and would be thankful for any hints or
>> directions.
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>
> This is on a host with >= 8 PCPUs, correct?

Maybe >= 4, but I cannot remember for sure anymore.

>> - When hanging, at least 2 VCPUs are in xen_spin_lock_slow with the lock free.
>>   IIRC the weird lock always belongs to one VCPU's runqueue.
>> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>>   "x86/tlb: enable tlb flush range support for x86"
>
> That seems like a very odd candidate for impacting this issue.
>
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
>> manages to tip the likelihood of getting nested deeper.
>
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).

I am suspecting something like this. Or maybe it is generally a nesting
issue revealed by enabling interrupts here.

>> The same way that the changed TLB flush may reduce it by causing fewer IPIs.
>
> Ah, that might explain why that patch is relevant, yes.
>
>> From the static information I can extract from a dump it is hard to
>> tell what exactly is going on. VCPU2 seems to be doing something but
>> looks to be rather deep in event handling. The irq_count is at 2, but
>> irq_count seems to be used in a few more places than just the
>> xen_hypervisor_callback. Though that at least seems to be the entry on
>> the stack I can see for VCPU2. The contents of the per-cpu variables
>> irq_regs and irq_stack_ptr at least seem consistent with a callback
>> coming in while being in the sched_op hypercall. But then the bt
>> command seems to be operating on a completely different stack.
>> VCPU2 seems to be active, not polling (event processing could be
>> initiated from enable_irqs via the force callback or before finally
>> exiting the sched_op hypercall). There seems to be a pending upcall,
>> but upcalls are masked. The next thing I would try is to instrument
>> the sites incrementing and decrementing irq_count...
>> ---
>>
>> Below some of the info seen from dom0 debugging keys and looking
>> at a dump with crash:
>>
>> Dom0 Info:
>>
>> Polling vCPUs: {1,5-7}
>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>
>> 10 [0/1]: s=6 n=1 x=0
>> 40 [0/1]: s=6 n=6 x=0
>> 46 [0/1]: s=6 n=7 x=0
>>
>> Guest Info:
>>
>> lock_spinners:
>> [0] (struct xen_spinlock *) 0x
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spin_lock_slow(). VCPU2 may be either in
the raw_local_irq_enable() before doing poll_irq, or maybe in handling
some event after receiving the wakeup irq but before returning from the
hv call. So the VCPU that had the lock has released it, and likely the
wakeup has been sent.
>
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. This would at least re-activate the other waiters, in the
case shown VCPU6, and if something in the path of execution on VCPU2
deadlocks there, this would move things ahead.

>
> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock, and that might show us something?

IIRC I had done this but it did not really show much. What I have done in
the meantime was to instrument the IRQ service functions in
arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows
(in another experiment where it is 1,2,5,6 in the same lock and 2,5,6
still polling) that the last events on the vcpu not polling are:

xen_do_hypervisor_callback+127
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
xen_do_hypervisor_callback+28
call_softirq+29
xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher
offsets are where irq_count gets decremented before ending the callback.
This shows that, at least in this case, there was an upcall, a softirq
and another upcall being triggered without finishing the previous one.
There was another experiment where I saw softirq, upcall, upcall. But at
least it seems that there is always a three-level stack. I believe that
softirq, upcall would be ok, as softirq processing enables interrupts,
but is it expected to have an upcall interrupted by another one?

>
> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? Maybe this is
> handled in xen_spin_lock_slow somehow?
Isn't that safe because lock_spinners is percpu?

>
> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
>
>> [3] (struct xen_spinlock *) 0x
>> [4] (struct xen_spinlock *) 0x
>> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
>> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
>> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
>>
>> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
>>
>> irq_count[1] = 1
>> irq_count[2] = 2
>> It is -1 for the rest.
>>
>> *(struct pt_regs *) irq_regs[2] = {
>>   .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
>>   .sp = 0xffff8803bfc83ce8
>> }
>>
>> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
>>
>> #> bt -T ffff88033d65c4a0
>> PID: 2050  TASK: ffff88033d65c4a0  CPU: 2  COMMAND: "postgres"
>>  [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>>  [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>>  [ffff88033d72d640] pull_task at ffffffff8105e33e
>>  [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>>  [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>>  [ffff88033d72d690] find_get_page at ffffffff811161ee
>>  [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>>  [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>>  [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>>  [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>>  [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>>  [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>>  [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>>  [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>>  [ffff88033d72d920] netif_rx at ffffffff8153f400
>>  [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>>  [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>>  [ffff88033d72d9d0] do_softirq at ffffffff81016284
>>  [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>>  [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>>  [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>>  [ffff88033d72da90] ip_output at ffffffff81577518
>>  [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>>  [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>>  [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>>  [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>>  [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>>  [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>>  [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>>  [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>>  [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>  [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>> --- bt -t would end here
>>  [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>>  [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>>  [ffff88033d72dd70] md_make_request at ffffffff814e0606
>>  [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>>  [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>>  [ffff88033d72de20] mempool_free at ffffffff811185b7
>>  [ffff88033d72de50] sys_sendto at ffffffff8152974d
>>  [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>>  [ffff88033d72df00] vfs_write at ffffffff811764b0
>>  [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>>     RBP: 0000000000000150  R8:  0000000000000015  R9:  0000000000000000
>>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
>>
>> #> bt -T ffff88033b812dc0
>> PID: 2069  TASK: ffff88033b812dc0  CPU: 6  COMMAND: "postgres"
>>  [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>>  [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>>  [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>>  [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>>  [ffff88033b897670] balance_tasks at ffffffff8105e493
>>  [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>>  [ffff88033b897690] find_get_page at ffffffff811161ee
>>  [ffff88033b8976b0] find_lock_page at ffffffff81116286
>>  [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>>  [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>>  [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>>  [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>>  [ffff88033b8977b0] unlock_page at ffffffff8111589a
>>  [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>>  [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>>  [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>>  [ffff88033b897900] sched_clock at ffffffff8101bd79
>>  [ffff88033b897910] blk_rq_init at ffffffff812ec262
>>  [ffff88033b897930] get_request at ffffffff812f004e
>>  [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>>  [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>> --- bt -t ends here
>>  [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>>  [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>>  [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>>  [ffff88033b897a60] info_for_irq at ffffffff813a636e
>>  [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>>  [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>>  [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>>  [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>>  [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>>  [ffff88033b897b80] load_balance at ffffffff8105e788
>>  [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>  [ffff88033b897c10] idle_balance at ffffffff8163d57c
>>  [ffff88033b897c60] __schedule at ffffffff81653ea9
>>  [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>>  [ffff88033b897ce0] schedule at ffffffff81653fcf
>>  [ffff88033b897cf0] io_schedule at ffffffff8165407f
>>  [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>>  [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>>  [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>>  [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>>  [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>>  [ffff88033b897e60] do_writepages at ffffffff81120ee1
>>  [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>>  [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>>  [ffff88033b897f00] vfs_write at ffffffff811764b0
>>  [ffff88033b897f40] do_fsync at ffffffff811a4a36
>>  [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>>  [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>>     RBP: 00000000000000c5  R8:  000000000073a550  R9:  0000000000000000
>>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b