On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>> gather a few more facts which I try to provide below. For the real
>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>> direction.
>>>>
>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
>
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err, not so much. I run on an 8-core host. ;)

>
>
>>>> From the static information I can extract from a dump it is hard to
>>>> tell what exactly is going on. VCPU2 seems to be doing something but
>>>> looks to be rather deep in event handling. The irq_count is at 2, but
>>>> irq_count seems to be used in a few more places than just the
>>>> xen_hypervisor_callback. Though that at least seems to be the entry on
>>>> the stack I can see for VCPU2. The contents of the per-cpu variables
>>>> for irq_regs and irq_stack_ptr at least seem consistent, as if a
>>>> callback came in while in the sched_op hypercall. But then the bt
>>>> command seems to be operating on a completely different stack.
>>>> VCPU2 seems to be active, not polling (event processing could be
>>>> initiated from enable_irqs via force callback or before finally
>>>> exiting the sched_op hypercall); there seems to be a pending upcall,
>>>> but upcalls are masked. The next thing I would try is to instrument
>>>> the sites incrementing and decrementing irq_count...
>>>> ---
>>>>
>>>> Below some of the info seen from dom0 debugging keys and looking
>>>> at a dump with crash:
>>>>
>>>> Dom0 Info:
>>>>
>>>> Polling vCPUs: {1,5-7}
>>>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>>>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>>>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>>>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>>>
>>>> 10 [0/1]: s=6 n=1 x=0
>>>> 40 [0/1]: s=6 n=6 x=0
>>>> 46 [0/1]: s=6 n=7 x=0
>>>>
>>>> Guest Info:
>>>>
>>>> lock_spinners:
>>>> [0] (struct xen_spinlock *) 0x
>>>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>>>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>>>
>>> If I understand correctly this means that 2 threads (VCPU2 & 6, I
>>> suppose) are spinning on this lock, but no one is holding it?
>>
>> To my reading both went into xen_spin_lock_slow(). VCPU2 may be either in
>> the raw_local_irq_enable() before doing poll_irq, or maybe in handling
>> some event after receiving the wakeup irq but before returning from the
>> hv call. So the VCPU that had the lock has returned it and likely the
>> wakeup has been sent.
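
For reference, the slow-path bookkeeping we are talking about looks roughly
like this (paraphrased from memory from arch/x86/xen/spinlock.c, so not
verbatim, but the struct layout matches what the dump above decodes):

/* Byte lock plus a count of VCPUs waiting in the slow path. */
struct xen_spinlock {
        unsigned char lock;             /* 0 -> free, 1 -> held */
        unsigned short spinners;        /* VCPUs in the slow path for this lock */
};

/* Which lock this CPU is currently polling for; xen_spin_unlock_slow()
 * scans these remotely to decide whom to kick. */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

/* On entering xen_spin_lock_slow(): publish our lock of interest, remembering
 * the previous one in case we interrupted an outer slow-path spin. Note the
 * plain, non-atomic store to lock_spinners. */
static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
        struct xen_spinlock *prev;

        prev = __get_cpu_var(lock_spinners);    /* non-NULL if nested */
        __get_cpu_var(lock_spinners) = xl;
        wmb();                                  /* publish lock before bumping count */
        asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

        return prev;
}

So "lock=0, spinners=2" means the byte lock itself is free again while two
VCPUs still have their slow-path registration active, which fits the picture
of the wakeup having been sent but not yet acted upon.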
>
> Makes sense, so we are in the window between one holder unlocking and
> the next one locking it, which explains why no one is currently holding
> it (which I previously thought odd).
>
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is in nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In
>> the case shown that would be VCPU6, and if something in the path of
>> execution on VCPU2 deadlocks there, this would move things ahead.
>
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

>
>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>> last kicked and for which lock and that might show us something?
>>
>> IIRC I had done this but it did not really show much. What I have done
>> in the meantime was to instrument the IRQ service functions in
>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows
>> (in another experiment where it is 1,2,5,6 in the same lock and 2,5,6
>> still polling) that the last events on the VCPU not polling are:
>>
>> xen_do_hypervisor_callback+127
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> xen_do_hypervisor_callback+28
>> call_softirq+29
>> xen_do_hypervisor_callback+28
>>
>> The lower offsets are where irq_count gets incremented and the higher
>> offsets are where irq_count gets decremented before ending the callback.
>> This shows that at least in this case there was an upcall, a softirq and
>> another upcall being triggered without finishing the previous one. There
>> was another experiment where I saw softirq, upcall, upcall. But at least
>> it seems that there is always a three-level stack.
>> I believe that softirq, upcall would be ok as softirq processing enables
>> interrupts, but is it expected to have an upcall interrupted by another
>> one?
>
> I'm not sure. evtchns don't have a priority mechanism so I expect not
> in general, but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes longer
should be pushed off to a workqueue...

>
>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? Maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
>
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together always was a bit brain
hurting. Though somehow it feels like in that case the top-level poll_irq
would just return and count it as a spurious wakeup. I would then expect the
lock_spinners entry of one VCPU to not match the freed lock... Not a really
scientific argument.

>
> Ian.
>
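
To make the suspected race concrete, the unlock side does roughly this
(again paraphrased from memory, not verbatim; kick_waiter() is just a
stand-in name for the real wakeup, which goes out as an event/IPI to that
CPU's lock kicker irq):

/* Called when a lock with a non-zero spinners count is released. */
static void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
        int cpu;

        for_each_online_cpu(cpu) {
                /* Plain remote read of another CPU's lock_spinners entry,
                 * racing against the plain store in spinning_lock(). */
                if (per_cpu(lock_spinners, cpu) == xl)
                        kick_waiter(cpu);       /* stand-in for the real send */
        }
}

If that remote read races with a waiter flipping its entry between prev and
the new lock, the kick can be aimed at a VCPU that is no longer (or not yet)
polling for this lock -- the scenario described above. In that case the
top-level poll would indeed just have to come back and treat it as a
spurious wakeup.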