From: Stefan Bader
Subject: Re: Strange PVM spinlock case revisited
Date: Tue, 12 Feb 2013 15:50:13 +0100
Message-ID: <511A56A5.1040305@canonical.com>
In-Reply-To: <1360678069.20449.169.camel@zakaz.uk.xensource.com>
To: Ian Campbell
Cc: Andrew Cooper, xen-devel@lists.xen.org

On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> A while ago I had been reporting an issue which causes Xen PVM guests
>>>> to hang (apparently related to spinlocks). Since then I was able to
>>>> gather a few more facts which I try to provide below. For the real
>>>> reasons, I am still puzzled and would be thankful for any hints or
>>>> direction.
>>>>
>>>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
>
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err not so much. I run on an 8-core host. ;)

>
>
>>>> From the static information I can extract from a dump it is hard to
>>>> tell what exactly is going on. VCPU2 seems to be doing something but
>>>> looks to be rather deep in event handling. The irq_count is at 2, but
>>>> irq_count seems to be used in a few more places than just xen_hypervisor
>>>> callback. Though that at least seems to be the entry on the stack I can
>>>> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
>>>> irq_stack_ptr at least seem consistent as if a callback came in while
>>>> being on the sched_op hypercall. But then the bt command seems to be
>>>> operating on a completely different stack.
>>>> VCPU2 seems to be active, not polling (event processing could be initiated
>>>> from enable_irqs via force callback or before finally exiting the sched_op
>>>> hypercall), there seems to be a pending upcall but upcalls are masked.
>>>> The next thing I would try is to instrument the sites incrementing and
>>>> decrementing irq_count...
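Side note, so that the data below is easier to read: the PV spinlock slow
path in arch/x86/xen/spinlock.c roughly looks like the sketch below. This
is heavily simplified and written down from memory (no flag save/restore,
no handling of the nested-poll corner case, no stats), so take it only as
an illustration of where lock_spinners, the spinners count and the poll
hypercall come in, not as the exact code.

struct xen_spinlock {
        unsigned char lock;             /* 0 -> free, 1 -> taken */
        unsigned short spinners;        /* vcpus waiting in the slow path */
};

/* per-vcpu wakeup event channel and "lock I am polling on" marker */
static DEFINE_PER_CPU(int, lock_kicker_irq);
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

static int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable)
{
        struct xen_spinlock *xl = (struct xen_spinlock *)lock;
        struct xen_spinlock *prev;
        int irq = __this_cpu_read(lock_kicker_irq);
        int got_it = 0;

        /* advertise which lock we are waiting for (this is the
         * lock_spinners entry seen in the dump) and bump xl->spinners */
        prev = spinning_lock(xl);

        do {
                xen_clear_irq_pending(irq);

                got_it = xen_spin_trylock(lock);        /* freed meanwhile? */
                if (got_it)
                        break;

                if (irq_enable)
                        raw_local_irq_enable();         /* window for nested upcalls */

                xen_poll_irq(irq);      /* SCHEDOP_poll until the unlocker
                                         * kicks our event channel */

                if (irq_enable)
                        raw_local_irq_disable();
        } while (!xen_test_irq_pending(irq));           /* spurious wakeup -> retry */

        unspinning_lock(xl, prev);      /* drop xl->spinners, restore prev entry */
        return got_it;
}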
>>>> ---
>>>>
>>>> Below some of the info seen from dom0 debugging keys and looking
>>>> at a dump with crash:
>>>>
>>>> Dom0 Info:
>>>>
>>>> Polling vCPUs: {1,5-7}
>>>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>>>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>>>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>>>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>>>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>>>
>>>> 10 [0/1]: s=6 n=1 x=0
>>>> 40 [0/1]: s=6 n=6 x=0
>>>> 46 [0/1]: s=6 n=7 x=0
>>>>
>>>> Guest Info:
>>>>
>>>> lock_spinners:
>>>> [0] (struct xen_spinlock *) 0x
>>>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>>>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>>>
>>> If I understand correctly this means that 2 threads (VCPU2 & 6, I
>>> suppose) are spinning on this lock, but no one is holding it?
>>
>> To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
>> raw_local_irq_enable() before doing poll_irq or maybe in handling some event
>> after receiving the wakeup irq but before returning from the hv call. So the
>> VCPU that had the lock has returned it and likely the wakeup has been sent.
>
> Makes sense, so we are in the window between one holder unlocking and
> the next unlocking, which explains why noone is currently holding it
> (which I previously thought odd).
>
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is on nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In the case
>> shown VCPU6 and if something in the path of execution on VCPU2 deadlocks there
>> this would move things ahead.
>
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

>
>>> I suppose xen_spin_unlock could also be instrumented to indicate who it
>>> last kicked and for which lock and that might show us something?
>>
>> IIRC I had done this but it did not really show much. What I have done in the
>> meantime was to instrument the IRQ service functions in
>> arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
>> another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
>> that the last events on the vcpu not polling are:
>>
>> xen_do_hypervisor_callback+127
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> call_softirq+29
>> call_softirq+125
>> xen_do_hypervisor_callback+28
>> call_softirq+29
>> xen_do_hypervisor_callback+28
>>
>> The lower offsets are when irq_count gets incremented and the higher offsets are
>> when irq_count gets decremented before ending the callback. This shows that at
>> least in this case there was an upcall, a softirq and another upcall being
>> triggered without finishing the previous one. There was another experiment where
>> I saw softirq, upcall, upcall. But at least it seems that there is always a
>> three level stack.
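For completeness, the recording side of that instrumentation boils down to
a small per-cpu ring buffer fed from the entry stubs. The sketch below is
hypothetical rather than the literal patch; trace_irq_count, irq_hist and
IRQ_HIST_LEN are made-up names, and the actual calls are made from the
spots in entry_64.S that increment and decrement irq_count.

#define IRQ_HIST_LEN 16

struct irq_hist {
        unsigned int idx;
        struct {
                unsigned long caller;   /* return address inside the entry stub */
                int count;              /* irq_count value after the inc/dec */
        } ent[IRQ_HIST_LEN];
};

static DEFINE_PER_CPU(struct irq_hist, irq_hist);

/* called with interrupts off from the instrumented entry code */
notrace void trace_irq_count(unsigned long caller, int count)
{
        struct irq_hist *h = this_cpu_ptr(&irq_hist);
        unsigned int i = h->idx++ % IRQ_HIST_LEN;

        h->ent[i].caller = caller;
        h->ent[i].count = count;
}

The recorded addresses can then be resolved with crash into the
symbol+offset form listed above.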
>> I believe that softirq, upcall would be ok as softirq processing enables
>> interrupts but is it expected to have an upcall interrupted by another one?
>
> I'm not sure. evtchn's don't have a priority mechanism so I expect not
> in general but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes
longer should be pushed off to a workqueue...

>
>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
>
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together has always been a bit
brain-hurting. Though somehow it feels as if the top level poll_irq would
return and count things as a spurious wakeup. I think in that case I would
expect the lock_spinner entry of one vcpu not to match the freed lock...
Not a really scientific argument.

>
> Ian.
>
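P.S. To make that last point a bit more concrete, the two paths we are
talking about look roughly like the pair below. Again this is simplified
and written down from memory (xen_send_IPI_one, XEN_SPIN_UNLOCK_VECTOR and
the exact barriers are from my recollection of the code), so the details
may be off.

/* Lock side: advertise which lock this vcpu is about to poll on.  The
 * write to lock_spinners is a plain per-cpu store; only the spinners
 * count inside the lock itself is updated atomically. */
static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
        struct xen_spinlock *prev;

        prev = __this_cpu_read(lock_spinners);  /* may be a nested lock */
        __this_cpu_write(lock_spinners, xl);

        wmb();          /* publish the lock before bumping the count */

        asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

        return prev;
}

/* Unlock side: look for a vcpu that advertises this lock and kick its
 * wakeup event channel.  Note that this reads the *remote* cpu's
 * lock_spinners entry, which is the interaction in question, and IIRC
 * it only kicks the first match. */
static void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (per_cpu(lock_spinners, cpu) == xl) {
                        xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                        break;
                }
        }
}

For the timeout experiment I would simply replace the xen_poll_irq(irq)
call in the poll loop with xen_poll_irq_timeout(irq, <deadline>), assuming
that variant is available in this tree, so that a wedged waiter eventually
rechecks the lock on its own and we can see whether only one of the vcpus
stays stuck.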