On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>> gather a few more facts which I try to provide below. For the real
>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>> direction.
>>>>
>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
>
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err, not so much. I run on an 8-core host. ;)

>
>
>>>> From the static information I can extract from a dump it is hard to
>>>> tell what exactly is going on. VCPU2 seems to be doing something but
>>>> looks to be rather deep in event handling. The irq_count is at 2, but
>>>> irq_count seems to be used in a few more places than just the
>>>> xen_hypervisor_callback. Though that at least seems to be the entry on
>>>> the stack I can see for VCPU2. The contents of the per-cpu variables
>>>> for irq_regs and irq_stack_ptr at least seem consistent, as if a
>>>> callback came in while in the sched_op hypercall. But then the bt
>>>> command seems to be operating on a completely different stack.
>>>> VCPU2 seems to be active, not polling (event processing could be
>>>> initiated from enable_irqs via force callback or before finally
>>>> exiting the sched_op hypercall); there seems to be a pending upcall,
>>>> but upcalls are masked. The next thing I would try is to instrument
>>>> the sites incrementing and decrementing irq_count...
>>>> ---
>>>>
>>>> Below some of the info seen from dom0 debugging keys and looking
>>>> at a dump with crash:
>>>>
>>>> Dom0 Info:
>>>>
>>>> Polling vCPUs: {1,5-7}
>>>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>>>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>>>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>>>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>>>
>>>> 10 [0/1]: s=6 n=1 x=0
>>>> 40 [0/1]: s=6 n=6 x=0
>>>> 46 [0/1]: s=6 n=7 x=0
>>>>
>>>> Guest Info:
>>>>
>>>> lock_spinners:
>>>> [0] (struct xen_spinlock *) 0x
>>>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>>>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>>>
>>> If I understand correctly this means that 2 threads (VCPU2 & 6, I
>>> suppose) are spinning on this lock, but no one is holding it?
>>
>> To my reading both went into xen_spin_lock_slow(). VCPU2 may be either in
>> the raw_local_irq_enable() before doing poll_irq, or maybe in handling
>> some event after receiving the wakeup irq but before returning from the
>> hv call. So the VCPU that had the lock has returned it and likely the
>> wakeup has been sent.
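
For reference, the slow-path bookkeeping we are talking about looks roughly
like this (paraphrased from memory from arch/x86/xen/spinlock.c, so not
verbatim, but the struct layout matches what the dump above decodes):

/* Byte lock plus a count of VCPUs waiting in the slow path. */
struct xen_spinlock {
        unsigned char lock;             /* 0 -> free, 1 -> held */
        unsigned short spinners;        /* VCPUs in the slow path for this lock */
};

/* Which lock this CPU is currently polling for; xen_spin_unlock_slow()
 * scans these remotely to decide whom to kick. */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

/* On entering xen_spin_lock_slow(): publish our lock of interest, remembering
 * the previous one in case we interrupted an outer slow-path spin. Note the
 * plain, non-atomic store to lock_spinners. */
static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
        struct xen_spinlock *prev;

        prev = __get_cpu_var(lock_spinners);    /* non-NULL if nested */
        __get_cpu_var(lock_spinners) = xl;
        wmb();                                  /* publish lock before bumping count */
        asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

        return prev;
}

So "lock=0, spinners=2" means the byte lock itself is free again while two
VCPUs still have their slow-path registration active, which fits the picture
of the wakeup having been sent but not yet acted upon.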
>
> Makes sense, so we are in the window between one holder unlocking and
> the next one locking it, which explains why no one is currently holding
> it (which I previously thought odd).
>
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is in nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In
>> the case shown that would be VCPU6, and if something in the path of
>> execution on VCPU2 deadlocks there, this would move things ahead.
>
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

>
>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>> last kicked and for which lock and that might show us something?
>>
>> IIRC I had done this but it did not really show much. What I have done
>> in the meantime was to instrument the IRQ service functions in
>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows
>> (in another experiment where it is 1,2,5,6 in the same lock and 2,5,6
>> still polling) that the last events on the VCPU not polling are:
>>
>> xen_do_hypervisor_callback+127
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> xen_do_hypervisor_callback+28
>> call_softirq+29
>> xen_do_hypervisor_callback+28
>>
>> The lower offsets are where irq_count gets incremented and the higher
>> offsets are where irq_count gets decremented before ending the callback.
>> This shows that at least in this case there was an upcall, a softirq and
>> another upcall being triggered without finishing the previous one. There
>> was another experiment where I saw softirq, upcall, upcall. But at least
>> it seems that there is always a three-level stack.
>> I believe that softirq, upcall would be ok as softirq processing enables
>> interrupts, but is it expected to have an upcall interrupted by another
>> one?
>
> I'm not sure. evtchns don't have a priority mechanism so I expect not
> in general, but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes longer
should be pushed off to a workqueue...

>
>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? Maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
>
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together always was a bit brain
hurting. Though somehow it feels like in that case the top-level poll_irq
would just return and count it as a spurious wakeup. I would then expect the
lock_spinners entry of one VCPU to not match the freed lock... Not a really
scientific argument.

>
> Ian.
>
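
To make the suspected race concrete, the unlock side does roughly this
(again paraphrased from memory, not verbatim; kick_waiter() is just a
stand-in name for the real wakeup, which goes out as an event/IPI to that
CPU's lock kicker irq):

/* Called when a lock with a non-zero spinners count is released. */
static void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
        int cpu;

        for_each_online_cpu(cpu) {
                /* Plain remote read of another CPU's lock_spinners entry,
                 * racing against the plain store in spinning_lock(). */
                if (per_cpu(lock_spinners, cpu) == xl)
                        kick_waiter(cpu);       /* stand-in for the real send */
        }
}

If that remote read races with a waiter flipping its entry between prev and
the new lock, the kick can be aimed at a VCPU that is no longer (or not yet)
polling for this lock -- the scenario described above. In that case the
top-level poll would indeed just have to come back and treat it as a
spurious wakeup.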