From: Stefan Bader
Subject: Re: Strange PVM spinlock case revisited
Date: Tue, 12 Feb 2013 14:42:04 +0100
Message-ID: <511A46AC.5020202@canonical.com>
In-Reply-To: <1360603744.20449.74.camel@zakaz.uk.xensource.com>
To: Ian Campbell
Cc: Andrew Cooper, xen-devel@lists.xen.org

On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I reported an issue that causes Xen PVM guests to hang
>> (apparently related to spinlocks). Since then I have been able to
>> gather a few more facts, which I try to provide below. As for the real
>> cause, I am still puzzled and would be thankful for any hints or
>> directions.
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>
> This is on a host with >= 8 PCPUs, correct?

Maybe >= 4, but I cannot remember for sure anymore.

>> - When hanging, at least 2 VCPUs are in xen_spin_lock_slow with the lock free.
>>   IIRC the weird lock always belongs to one VCPU's runqueue.
>> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>>   "x86/tlb: enable tlb flush range support for x86"
>
> That seems like a very odd candidate for impacting this issue.
>
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
>> manages to tip the likelihood of getting nested deeper.
>
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).

I am suspecting something like this. Or maybe it is generally a nesting
issue revealed by enabling interrupts here.

>> The same way that the changed TLB flush may reduce it by causing fewer IPIs.
>
> Ah, that might explain why that patch is relevant, yes.
>
>> From the static information I can extract from a dump it is hard to
>> tell what exactly is going on. VCPU2 seems to be doing something but
>> looks to be rather deep in event handling. The irq_count is at 2, but
>> irq_count seems to be used in a few more places than just the
>> xen_hypervisor_callback. Though that at least seems to be the entry on
>> the stack I can see for VCPU2. The contents of the per-cpu variables
>> irq_regs and irq_stack_ptr at least seem consistent with a callback
>> coming in while being in the sched_op hypercall. But then the bt
>> command seems to be operating on a completely different stack.
>> VCPU2 seems to be active, not polling (event processing could be
>> initiated from enable_irqs via the force callback or before finally
>> exiting the sched_op hypercall). There seems to be a pending upcall,
>> but upcalls are masked. The next thing I would try is to instrument
>> the sites incrementing and decrementing irq_count...
>> ---
>>
>> Below some of the info seen from dom0 debugging keys and looking
>> at a dump with crash:
>>
>> Dom0 Info:
>>
>> Polling vCPUs: {1,5-7}
>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>
>> 10 [0/1]: s=6 n=1 x=0
>> 40 [0/1]: s=6 n=6 x=0
>> 46 [0/1]: s=6 n=7 x=0
>>
>> Guest Info:
>>
>> lock_spinners:
>> [0] (struct xen_spinlock *) 0x
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spin_lock_slow(). VCPU2 may be either in
the raw_local_irq_enable() before doing poll_irq, or maybe in handling
some event after receiving the wakeup irq but before returning from the
hv call. So the VCPU that had the lock has released it, and likely the
wakeup has been sent.
>
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. This would at least re-activate the other waiters, in the
case shown VCPU6, and if something in the path of execution on VCPU2
deadlocks there, this would move things ahead.

>
> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock, and that might show us something?

IIRC I had done this but it did not really show much. What I have done in
the meantime was to instrument the IRQ service functions in
arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows
(in another experiment where it is 1,2,5,6 in the same lock and 2,5,6
still polling) that the last events on the vcpu not polling are:

xen_do_hypervisor_callback+127
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
xen_do_hypervisor_callback+28
call_softirq+29
xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher
offsets are where irq_count gets decremented before ending the callback.
This shows that, at least in this case, there was an upcall, a softirq
and another upcall being triggered without finishing the previous one.
There was another experiment where I saw softirq, upcall, upcall. But at
least it seems that there is always a three-level stack. I believe that
softirq, upcall would be ok, as softirq processing enables interrupts,
but is it expected to have an upcall interrupted by another one?

>
> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? Maybe this is
> handled in xen_spin_lock_slow somehow?
Isn't that safe because lock_spinners is percpu?

>
> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
>
>> [3] (struct xen_spinlock *) 0x
>> [4] (struct xen_spinlock *) 0x
>> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
>> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
>> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
>>
>> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
>>
>> irq_count[1] = 1
>> irq_count[2] = 2
>> It is -1 for the rest.
>>
>> *(struct pt_regs *) irq_regs[2] = {
>>   .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
>>   .sp = 0xffff8803bfc83ce8
>> }
>>
>> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
>>
>> #> bt -T ffff88033d65c4a0
>> PID: 2050  TASK: ffff88033d65c4a0  CPU: 2  COMMAND: "postgres"
>>  [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>>  [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>>  [ffff88033d72d640] pull_task at ffffffff8105e33e
>>  [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>>  [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>>  [ffff88033d72d690] find_get_page at ffffffff811161ee
>>  [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>>  [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>>  [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>>  [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>>  [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>>  [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>>  [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>>  [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>>  [ffff88033d72d920] netif_rx at ffffffff8153f400
>>  [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>>  [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>>  [ffff88033d72d9d0] do_softirq at ffffffff81016284
>>  [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>>  [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>>  [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>>  [ffff88033d72da90] ip_output at ffffffff81577518
>>  [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>>  [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>>  [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>>  [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>>  [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>>  [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>>  [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>>  [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>>  [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>  [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>> --- bt -t would end here
>>  [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>>  [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>>  [ffff88033d72dd70] md_make_request at ffffffff814e0606
>>  [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>>  [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>>  [ffff88033d72de20] mempool_free at ffffffff811185b7
>>  [ffff88033d72de50] sys_sendto at ffffffff8152974d
>>  [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>>  [ffff88033d72df00] vfs_write at ffffffff811764b0
>>  [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>>     RBP: 0000000000000150  R8:  0000000000000015  R9:  0000000000000000
>>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
>>
>> #> bt -T ffff88033b812dc0
>> PID: 2069  TASK: ffff88033b812dc0  CPU: 6  COMMAND: "postgres"
>>  [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>>  [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>>  [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>>  [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>>  [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>>  [ffff88033b897670] balance_tasks at ffffffff8105e493
>>  [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>>  [ffff88033b897690] find_get_page at ffffffff811161ee
>>  [ffff88033b8976b0] find_lock_page at ffffffff81116286
>>  [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>>  [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>>  [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>>  [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>>  [ffff88033b8977b0] unlock_page at ffffffff8111589a
>>  [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>>  [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>>  [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>>  [ffff88033b897900] sched_clock at ffffffff8101bd79
>>  [ffff88033b897910] blk_rq_init at ffffffff812ec262
>>  [ffff88033b897930] get_request at ffffffff812f004e
>>  [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>>  [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>> --- bt -t ends here
>>  [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>>  [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>>  [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>>  [ffff88033b897a60] info_for_irq at ffffffff813a636e
>>  [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>>  [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>>  [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>>  [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>>  [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>>  [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>>  [ffff88033b897b80] load_balance at ffffffff8105e788
>>  [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>  [ffff88033b897c10] idle_balance at ffffffff8163d57c
>>  [ffff88033b897c60] __schedule at ffffffff81653ea9
>>  [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>>  [ffff88033b897ce0] schedule at ffffffff81653fcf
>>  [ffff88033b897cf0] io_schedule at ffffffff8165407f
>>  [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>>  [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>>  [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>>  [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>>  [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>>  [ffff88033b897e60] do_writepages at ffffffff81120ee1
>>  [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>>  [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>>  [ffff88033b897f00] vfs_write at ffffffff811764b0
>>  [ffff88033b897f40] do_fsync at ffffffff811a4a36
>>  [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>>  [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>>     RBP: 00000000000000c5  R8:  000000000073a550  R9:  0000000000000000
>>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b