xen-devel.lists.xenproject.org archive mirror
* Xen 4.7 crash
@ 2016-06-01 19:54 Aaron Cornelius
  2016-06-01 20:00 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-01 19:54 UTC (permalink / raw)
  To: Xen-devel

I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some strange behavior after I create/destroy enough domains, so I put together a script to do the add/remove for me.  For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating a new one, and so on.

After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):

(d846) Virtual -> physical offset = 3fc00000
(d846) Checking DTB at 023ff000...
(d846) [32;1mMirageOS booting...[0m
(d846) Initialising console ... done.
(d846) gnttab_stubs.c: initialised mini-os gntmap
(d846) allocate_ondemand(1, 1) returning 2300000
(d846) allocate_ondemand(1, 1) returning 2301000
(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
(XEN) p2m.c: dom1101: VMID pool exhausted
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN) ----[ Xen-4.7.0-rc  arm32  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) PC:     0021fdd4 free_domheap_pages+0x1c/0x324
(XEN) CPSR:   6001011a MODE:Hypervisor
(XEN)      R0: 00000000 R1: 00000001 R2: 00000003 R3: 00304320
(XEN)      R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
(XEN)      R8: 41c57180 R9: 43fdfe60 R10:00000000 R11:43fdfd5c R12:00000000
(XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 00010000bfb0e000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)    HCR_EL2: 000000000038663f
(XEN)  TTBR0_EL2: 00000000bfafc000
(XEN)
(XEN)    ESR_EL2: 94000006
(XEN)  HPFAR_EL2: 000000000001c810
(XEN)      HDFAR: 00000014
(XEN)      HIFAR: 84e37182
(XEN)
(XEN) Xen stack trace from sp=43fdfd2c:
(XEN)    002cf1b7 43fdfd64 41c57000 00000100 41c57000 41c57188 00200200 00100100
(XEN)    41c57180 43fdfe60 00000000 43fdfd7c 0025b0cc 41c57000 fffffff0 43fdfe60
(XEN)    0000001f 0000044d 43fdfe60 43fdfd8c 0024f668 41c57000 fffffff0 43fdfda4
(XEN)    0024f8f0 41c57000 00000000 00000000 0000001f 43fdfddc 0020854c 43fdfddc
(XEN)    00000000 cccccccd 00304600 002822bc 00000000 b6f20004 0000044d 00304600
(XEN)    00304320 d767a000 00000000 43fdfeec 00206d6c 43fdfe6c 00218f8c 00000000
(XEN)    00000007 43fdfe30 43fdfe34 00000000 43fdfe20 00000002 43fdfe48 43fdfe78
(XEN)    00000000 00000000 00000000 00007622 00002b0e 40023000 00000000 43fdfec8
(XEN)    00000002 43fdfebc 00218f8c 00000001 0000000b 0000ffff b6eba880 0000000b
(XEN)    5abab87d f34aab2c 6adc50b8 e1713cd0 00000000 00000000 00000000 00000000
(XEN)    b6eba8d8 00000000 50043f00 b6eb5038 b6effba8 0000003e 00000000 000c3034
(XEN)    000b9cb8 000bda30 000bda30 00000000 b6eba56c 0000003e b6effba8 b6effdb0
(XEN)    be9558d4 000000d0 be9558d4 00000071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
(XEN)    c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000 00000000 43fdff54
(XEN)    00260130 00000000 43fdfefc 43fdff1c 200f019a 400238f4 00000004 00000004
(XEN)    002c9f00 00000000 00304600 c094c240 00000000 00305000 be9557a0 d767a000
(XEN)    00000000 43fdff44 00000000 c094c240 00000000 00305000 be9557c8 d767a000
(XEN)    00000000 43fdff58 00263b10 b6f20004 00000000 00000000 00000000 00000000
(XEN)    c094c240 00000000 00305000 be9557c8 d767a000 00000000 00000001 00000024
(XEN)    ffffffff b691ab34 c01077f8 60010013 00000000 be9557c4 c0a38600 c010c400
(XEN) Xen call trace:
(XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)    [<0020854c>] domain_create+0x2dc/0x510
(XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VM ID, which caused domain creation to fail and led to the NULL pointer dereference when trying to clean up the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why I should run out of VM IDs.

- Aaron Cornelius


* Re: Xen 4.7 crash
  2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
@ 2016-06-01 20:00 ` Andrew Cooper
  2016-06-01 20:45   ` Aaron Cornelius
  2016-06-01 21:35 ` Andrew Cooper
  2016-06-01 22:35 ` Julien Grall
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2016-06-01 20:00 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel

On 01/06/2016 20:54, Aaron Cornelius wrote:
> I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some strange behavior after I create/destroy enough domains and put together a script to do the add/remove for me.  For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new one, and so on.
>
> After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):
>
> (d846) Virtual -> physical offset = 3fc00000
> (d846) Checking DTB at 023ff000...
> (d846) [32;1mMirageOS booting...[0m
> (d846) Initialising console ... done.
> (d846) gnttab_stubs.c: initialised mini-os gntmap
> (d846) allocate_ondemand(1, 1) returning 2300000
> (d846) allocate_ondemand(1, 1) returning 2301000
> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
> (XEN) p2m.c: dom1101: VMID pool exhausted
> (XEN) CPU0: Unexpected Trap: Data Abort
> <snip>
>
> I'm not 100% sure, from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VM ID, which caused domain creation to fail, and the NULL pointer dereference when trying to clean up the not-fully-created domain.
>
> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VM IDs.

Sounds like a VMID resource leak.  Check to see whether it is freed
properly in domain_destroy().

~Andrew


* Re: Xen 4.7 crash
  2016-06-01 20:00 ` Andrew Cooper
@ 2016-06-01 20:45   ` Aaron Cornelius
  2016-06-01 21:24     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-01 20:45 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel

> -----Original Message-----
> From: Andrew Cooper [mailto:amc96@hermes.cam.ac.uk] On Behalf Of
> Andrew Cooper
> Sent: Wednesday, June 1, 2016 4:01 PM
> To: Aaron Cornelius <Aaron.Cornelius@dornerworks.com>; Xen-devel <xen-
> devel@lists.xenproject.org>
> Subject: Re: [Xen-devel] Xen 4.7 crash
> 
> On 01/06/2016 20:54, Aaron Cornelius wrote:
> > I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've
> noticed some strange behavior after I create/destroy enough domains and
> put together a script to do the add/remove for me.  For this particular test I
> am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it,
> creating the new one, and so on.
> >
> > After running this for a while, I get the following error (with version
> 8478c9409a2c6726208e8dbc9f3e455b76725a33):
> >
> > (d846) Virtual -> physical offset = 3fc00000
> > (d846) Checking DTB at 023ff000...
> > (d846) [32;1mMirageOS booting...[0m
> > (d846) Initialising console ... done.
> > (d846) gnttab_stubs.c: initialised mini-os gntmap
> > (d846) allocate_ondemand(1, 1) returning 2300000
> > (d846) allocate_ondemand(1, 1) returning 2301000
> > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2)
> > dom:(0)
> > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
> > dom:(0)
> > (XEN) p2m.c: dom1101: VMID pool exhausted
> > (XEN) CPU0: Unexpected Trap: Data Abort <snip>
> >
> > I'm not 100% sure, from the "VMID pool exhausted" message it would
> appear that the p2m_init() function failed to allocate a VM ID, which caused
> domain creation to fail, and the NULL pointer dereference when trying to
> clean up the not-fully-created domain.
> >
> > However, since I only have 1 domain active at a time, I'm not sure why I
> should run out of VM IDs.
> 
> Sounds like a VMID resource leak.  Check to see whether it is freed properly
> in domain_destroy().
> 
> ~Andrew

That would be my assumption.  But as far as I can tell, arch_domain_destroy() calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality related to freeing a VM ID appears to have changed in years.

- Aaron

* Re: Xen 4.7 crash
  2016-06-01 20:45   ` Aaron Cornelius
@ 2016-06-01 21:24     ` Andrew Cooper
  2016-06-01 22:18       ` Julien Grall
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2016-06-01 21:24 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel

On 01/06/2016 21:45, Aaron Cornelius wrote:
>>
>>> However, since I only have 1 domain active at a time, I'm not sure why I
>> should run out of VM IDs.
>>
>> Sounds like a VMID resource leak.  Check to see whether it is freed properly
>> in domain_destroy().
>>
>> ~Andrew
> That would be my assumption.  But as far as I can tell, arch_domain_destroy() calls pwm_teardown() which calls p2m_free_vmid(), and none of the functionality related to freeing a VM ID appears to have changed in years.

The VMID handling looks suspect.  It can be called repeatedly during
domain destruction, and it will repeatedly clear the same bit out of the
vmid_mask.

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 838d004..7adb39a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
     struct p2m_domain *p2m = &d->arch.p2m;
     spin_lock(&vmid_alloc_lock);
     if ( p2m->vmid != INVALID_VMID )
-        clear_bit(p2m->vmid, vmid_mask);
+    {
+        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
+        p2m->vmid = INVALID_VMID;
+    }

     spin_unlock(&vmid_alloc_lock);
 }

Having said that, I can't explain why that bug would result in the
symptoms you are seeing.  It is also possible that your issue is memory
corruption from a separate source.

Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
vmid_alloc_lock held) to see which vmid is being allocated/freed?
After the initial boot of the system, you should see the same vmid being
allocated and freed for each of your domains.
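
Something along these lines would do (untested, log message purely
illustrative), with an equivalent printk dropped into p2m_alloc_vmid()
just after the bit is set:

--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ static void p2m_free_vmid(struct domain *d)
     struct p2m_domain *p2m = &d->arch.p2m;
     spin_lock(&vmid_alloc_lock);
+    /* Diagnostic only: report which VMID this domain is giving back. */
+    printk(XENLOG_G_INFO "d%d: freeing VMID %u\n",
+           d->domain_id, (unsigned int)p2m->vmid);
     if ( p2m->vmid != INVALID_VMID )
         clear_bit(p2m->vmid, vmid_mask);
 
     spin_unlock(&vmid_alloc_lock);
 }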

~Andrew



* Re: Xen 4.7 crash
  2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
  2016-06-01 20:00 ` Andrew Cooper
@ 2016-06-01 21:35 ` Andrew Cooper
  2016-06-01 22:24   ` Julien Grall
  2016-06-01 22:35 ` Julien Grall
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2016-06-01 21:35 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel

On 01/06/2016 20:54, Aaron Cornelius wrote:
> <snip>
> (XEN) Xen call trace:
> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) CPU0: Unexpected Trap: Data Abort
> (XEN)
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...

As for this specific crash itself: in the case of an early error path,
p2m->root can be NULL in p2m_teardown(), in which case
free_domheap_pages() will fall over in a heap.  This patch should
resolve it.

@@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
     while ( (pg = page_list_remove_head(&p2m->pages)) )
         free_domheap_page(pg);

-    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
+    if ( p2m->root )
+        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);

     p2m->root = NULL;

I would be tempted to suggest making free_domheap_pages() tolerate NULL
pointers, except that would only be a safe thing to do if we assert that
the order parameter is 0, which won't help this specific case.

~Andrew


* Re: Xen 4.7 crash
  2016-06-01 21:24     ` Andrew Cooper
@ 2016-06-01 22:18       ` Julien Grall
  2016-06-01 22:26         ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2016-06-01 22:18 UTC (permalink / raw)
  To: Andrew Cooper, Aaron Cornelius, Xen-devel, Stefano Stabellini

Hi Andrew,

On 01/06/2016 22:24, Andrew Cooper wrote:
> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>>
>>>> However, since I only have 1 domain active at a time, I'm not sure why I
>>> should run out of VM IDs.
>>>
>>> Sounds like a VMID resource leak.  Check to see whether it is freed properly
>>> in domain_destroy().
>>>
>>> ~Andrew
>> That would be my assumption.  But as far as I can tell, arch_domain_destroy() calls pwm_teardown() which calls p2m_free_vmid(), and none of the functionality related to freeing a VM ID appears to have changed in years.
>
> The VMID handling looks suspect.  It can be called repeatedly during
> domain destruction, and it will repeatedly clear the same bit out of the
> vmid_mask.

Can you explain how p2m_free_vmid can be called multiple times?

We have the following path:
    arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.

And I can find only 3 callers of arch_domain_destroy, which should only be 
called once per domain.

If arch_domain_destroy is called multiple times, p2m_free_vmid will not 
be the only place where Xen will be in trouble.

> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 838d004..7adb39a 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>      struct p2m_domain *p2m = &d->arch.p2m;
>      spin_lock(&vmid_alloc_lock);
>      if ( p2m->vmid != INVALID_VMID )
> -        clear_bit(p2m->vmid, vmid_mask);
> +    {
> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
> +        p2m->vmid = INVALID_VMID;
> +    }
>
>      spin_unlock(&vmid_alloc_lock);
>  }
>
> Having said that, I can't explain why that bug would result in the
> symptoms you are seeing.  It is also possibly that your issue is memory
> corruption from a separate source.
>
> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
> vmid_alloc_lock held) to see which vmid is being allocated/freed ?
> After the initial boot of the system, you should see the same vmid being
> allocated and freed for each of your domains.

Looking quickly at the log, the domain is dom1101. However, the maximum 
number of VMIDs supported is 256, so the exhaustion might be a race 
somewhere.

I would be interested to get a reproducer. I wrote a script to cycle a 
domain (create/destroy) in a loop, and I have not seen any issue after 1200 
cycles (and counting).

Cheers,

-- 
Julien Grall


* Re: Xen 4.7 crash
  2016-06-01 21:35 ` Andrew Cooper
@ 2016-06-01 22:24   ` Julien Grall
  2016-06-01 22:31     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2016-06-01 22:24 UTC (permalink / raw)
  To: Andrew Cooper, Aaron Cornelius, Xen-devel, Stefano Stabellini

Hi,

On 01/06/2016 22:35, Andrew Cooper wrote:
> On 01/06/2016 20:54, Aaron Cornelius wrote:
>> <snip>
>> (XEN) Xen call trace:
>> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
>> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
>> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
>> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
>> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
>> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
>> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) CPU0: Unexpected Trap: Data Abort
>> (XEN)
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>
> As for this specific crash itself,  In the case of an early error path,
> p2m->root can be NULL in p2m_teardown(), in which case
> free_domheap_pages() will fall over in a heap.  This patch should
> resolve it.

Good catch!

>
> @@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
>      while ( (pg = page_list_remove_head(&p2m->pages)) )
>          free_domheap_page(pg);
>
> -    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
> +    if ( p2m->root )
> +        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>
>      p2m->root = NULL;
>
> I would be tempted to suggest making free_domheap_pages() tolerate NULL
> pointers, except that would only be a safe thing to do if we assert that
> the order parameter is 0, which won't help this specific case.

free_xenheap_pages already tolerates NULL (even if an order != 0). Is 
there any reason to not do the same for free_domheap_pages?

Regards,

-- 
Julien Grall


* Re: Xen 4.7 crash
  2016-06-01 22:18       ` Julien Grall
@ 2016-06-01 22:26         ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2016-06-01 22:26 UTC (permalink / raw)
  To: Julien Grall, Aaron Cornelius, Xen-devel, Stefano Stabellini

On 01/06/2016 23:18, Julien Grall wrote:
> Hi Andrew,
>
> On 01/06/2016 22:24, Andrew Cooper wrote:
>> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>>>
>>>>> However, since I only have 1 domain active at a time, I'm not sure
>>>>> why I
>>>> should run out of VM IDs.
>>>>
>>>> Sounds like a VMID resource leak.  Check to see whether it is freed
>>>> properly
>>>> in domain_destroy().
>>>>
>>>> ~Andrew
>>> That would be my assumption.  But as far as I can tell,
>>> arch_domain_destroy() calls pwm_teardown() which calls
>>> p2m_free_vmid(), and none of the functionality related to freeing a
>>> VM ID appears to have changed in years.
>>
>> The VMID handling looks suspect.  It can be called repeatedly during
>> domain destruction, and it will repeatedly clear the same bit out of the
>> vmid_mask.
>
> Can you explain how the p2m_free_vmid can be called multiple time?
>
> We have the following path:
>    arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.
>
> And I can find only 3 call of arch_domain_destroy we should only be
> done once per domain.
>
> If arch_domain_destroy is called multiple time, p2m_free_vmid will not
> be the only place where Xen will be in trouble.

You are correct.  I was getting my phases of domain destruction mixed
up.  arch_domain_destroy() is called strictly once, after the RCU reference of
the domain has dropped to 0.

>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 838d004..7adb39a 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>>      struct p2m_domain *p2m = &d->arch.p2m;
>>      spin_lock(&vmid_alloc_lock);
>>      if ( p2m->vmid != INVALID_VMID )
>> -        clear_bit(p2m->vmid, vmid_mask);
>> +    {
>> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
>> +        p2m->vmid = INVALID_VMID;
>> +    }
>>
>>      spin_unlock(&vmid_alloc_lock);
>>  }
>>
>> Having said that, I can't explain why that bug would result in the
>> symptoms you are seeing.  It is also possibly that your issue is memory
>> corruption from a separate source.
>>
>> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
>> vmid_alloc_lock held) to see which vmid is being allocated/freed ?
>> After the initial boot of the system, you should see the same vmid being
>> allocated and freed for each of your domains.
>
> Looking quickly at the log, the domain is dom1101. However, the number
> maximum number of VMID supported is 256, so the exhaustion might be a
> race somewhere.
>
> I would be interested to get a reproducer. I wrote a script to cycle a
> domain (create/domain) in loop, and I have not seen any issue after
> 1200 cycles (and counting).

Given that my previous thought was wrong, I am going to suggest that
some other form of memory corruption is a more likely cause.

~Andrew


* Re: Xen 4.7 crash
  2016-06-01 22:24   ` Julien Grall
@ 2016-06-01 22:31     ` Andrew Cooper
  2016-06-02  8:47       ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2016-06-01 22:31 UTC (permalink / raw)
  To: Julien Grall, Aaron Cornelius, Xen-devel, Stefano Stabellini

On 01/06/2016 23:24, Julien Grall wrote:
> Hi,
>
> On 01/06/2016 22:35, Andrew Cooper wrote:
>> On 01/06/2016 20:54, Aaron Cornelius wrote:
>>> <snip>
>>> (XEN) Xen call trace:
>>> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
>>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
>>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
>>> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
>>> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
>>> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
>>> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
>>> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
>>> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) CPU0: Unexpected Trap: Data Abort
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>
>> As for this specific crash itself,  In the case of an early error path,
>> p2m->root can be NULL in p2m_teardown(), in which case
>> free_domheap_pages() will fall over in a heap.  This patch should
>> resolve it.
>
> Good catch!
>
>>
>> @@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
>>      while ( (pg = page_list_remove_head(&p2m->pages)) )
>>          free_domheap_page(pg);
>>
>> -    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>> +    if ( p2m->root )
>> +        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>>
>>      p2m->root = NULL;
>>
>> I would be tempted to suggest making free_domheap_pages() tolerate NULL
>> pointers, except that would only be a safe thing to do if we assert that
>> the order parameter is 0, which won't help this specific case.
>
> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
> there any reason to not do the same for free_domheap_pages?

The xenheap allocation functions deal in terms of plain virtual
addresses, while the domheap functions deal in terms of struct page_info *.

Overall, this means that the domheap functions have a more restricted
input/output set than their xenheap variants.

As there is already precedent with xenheap, making domheap tolerate NULL
is probably fine, and indeed the preferred course of action.
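
For reference, the two pairs of interfaces being compared are (from
memory, so double-check against xen/include/xen/mm.h):

/* xenheap: deals in plain virtual addresses. */
void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
void free_xenheap_pages(void *v, unsigned int order);

/* domheap: deals in struct page_info pointers. */
struct page_info *alloc_domheap_pages(struct domain *d, unsigned int order,
                                      unsigned int memflags);
void free_domheap_pages(struct page_info *pg, unsigned int order);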

~Andrew


* Re: Xen 4.7 crash
  2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
  2016-06-01 20:00 ` Andrew Cooper
  2016-06-01 21:35 ` Andrew Cooper
@ 2016-06-01 22:35 ` Julien Grall
  2016-06-02  1:32   ` Aaron Cornelius
  2 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2016-06-01 22:35 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel

Hello Aaron,

On 01/06/2016 20:54, Aaron Cornelius wrote:
> I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some strange behavior after I create/destroy enough domains and put together a script to do the add/remove for me.  For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new one, and so on.
>
> After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):
>
> (d846) Virtual -> physical offset = 3fc00000
> (d846) Checking DTB at 023ff000...
> (d846) [32;1mMirageOS booting...[0m
> (d846) Initialising console ... done.
> (d846) gnttab_stubs.c: initialised mini-os gntmap
> (d846) allocate_ondemand(1, 1) returning 2300000
> (d846) allocate_ondemand(1, 1) returning 2301000
> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
> (XEN) p2m.c: dom1101: VMID pool exhausted
> (XEN) CPU0: Unexpected Trap: Data Abort
> (XEN) ----[ Xen-4.7.0-rc  arm32  debug=y  Not tainted ]----
> (XEN) CPU:    0
> (XEN) PC:     0021fdd4 free_domheap_pages+0x1c/0x324
> (XEN) CPSR:   6001011a MODE:Hypervisor
> (XEN)      R0: 00000000 R1: 00000001 R2: 00000003 R3: 00304320
> (XEN)      R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
> (XEN)      R8: 41c57180 R9: 43fdfe60 R10:00000000 R11:43fdfd5c R12:00000000
> (XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
> (XEN)
> (XEN)   VTCR_EL2: 80003558
> (XEN)  VTTBR_EL2: 00010000bfb0e000
> (XEN)
> (XEN)  SCTLR_EL2: 30cd187f
> (XEN)    HCR_EL2: 000000000038663f
> (XEN)  TTBR0_EL2: 00000000bfafc000
> (XEN)
> (XEN)    ESR_EL2: 94000006
> (XEN)  HPFAR_EL2: 000000000001c810
> (XEN)      HDFAR: 00000014
> (XEN)      HIFAR: 84e37182
> (XEN)
> (XEN) Xen stack trace from sp=43fdfd2c:
> (XEN)    002cf1b7 43fdfd64 41c57000 00000100 41c57000 41c57188 00200200 00100100
> (XEN)    41c57180 43fdfe60 00000000 43fdfd7c 0025b0cc 41c57000 fffffff0 43fdfe60
> (XEN)    0000001f 0000044d 43fdfe60 43fdfd8c 0024f668 41c57000 fffffff0 43fdfda4
> (XEN)    0024f8f0 41c57000 00000000 00000000 0000001f 43fdfddc 0020854c 43fdfddc
> (XEN)    00000000 cccccccd 00304600 002822bc 00000000 b6f20004 0000044d 00304600
> (XEN)    00304320 d767a000 00000000 43fdfeec 00206d6c 43fdfe6c 00218f8c 00000000
> (XEN)    00000007 43fdfe30 43fdfe34 00000000 43fdfe20 00000002 43fdfe48 43fdfe78
> (XEN)    00000000 00000000 00000000 00007622 00002b0e 40023000 00000000 43fdfec8
> (XEN)    00000002 43fdfebc 00218f8c 00000001 0000000b 0000ffff b6eba880 0000000b
> (XEN)    5abab87d f34aab2c 6adc50b8 e1713cd0 00000000 00000000 00000000 00000000
> (XEN)    b6eba8d8 00000000 50043f00 b6eb5038 b6effba8 0000003e 00000000 000c3034
> (XEN)    000b9cb8 000bda30 000bda30 00000000 b6eba56c 0000003e b6effba8 b6effdb0
> (XEN)    be9558d4 000000d0 be9558d4 00000071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
> (XEN)    c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000 00000000 43fdff54
> (XEN)    00260130 00000000 43fdfefc 43fdff1c 200f019a 400238f4 00000004 00000004
> (XEN)    002c9f00 00000000 00304600 c094c240 00000000 00305000 be9557a0 d767a000
> (XEN)    00000000 43fdff44 00000000 c094c240 00000000 00305000 be9557c8 d767a000
> (XEN)    00000000 43fdff58 00263b10 b6f20004 00000000 00000000 00000000 00000000
> (XEN)    c094c240 00000000 00305000 be9557c8 d767a000 00000000 00000001 00000024
> (XEN)    ffffffff b691ab34 c01077f8 60010013 00000000 be9557c4 c0a38600 c010c400
> (XEN) Xen call trace:
> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) CPU0: Unexpected Trap: Data Abort
> (XEN)
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
>
> I'm not 100% sure, from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VM ID, which caused domain creation to fail, and the NULL pointer dereference when trying to clean up the not-fully-created domain.
>
> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VM IDs.

arch_domain_destroy (and p2m_teardown) is only called when all the 
references on the given domain are released.

It may take a while to release all the resources, so if you launch the new 
domain at the same time as you destroy the previous guest, you will have 
more than 1 domain active.

Can you detail how you create/destroy guest?

Regards,

-- 
Julien Grall


* Re: Xen 4.7 crash
  2016-06-01 22:35 ` Julien Grall
@ 2016-06-02  1:32   ` Aaron Cornelius
  2016-06-02  8:49     ` Jan Beulich
  2016-06-02  9:07     ` Julien Grall
  0 siblings, 2 replies; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-02  1:32 UTC (permalink / raw)
  To: Xen-devel, Julien Grall

On 6/1/2016 6:35 PM, Julien Grall wrote:
> Hello Aaron,
>
> On 01/06/2016 20:54, Aaron Cornelius wrote:
<snip>
>> I'm not 100% sure, from the "VMID pool exhausted" message it would
>> appear that the p2m_init() function failed to allocate a VM ID, which
>> caused domain creation to fail, and the NULL pointer dereference when
>> trying to clean up the not-fully-created domain.
>>
>> However, since I only have 1 domain active at a time, I'm not sure why
>> I should run out of VM IDs.
>
> arch_domain_destroy (and p2m_teardown) is only called when all the
> reference on the given domain are released.
>
> It may take a while to release all the resources. So if you launch the
> domain as the same time as you destroy the previous guest. You will have
> more than 1 domain active.
>
> Can you detail how you create/destroy guest?
>

This is with a custom application, we use the libxl APIs to interact 
with Xen.  Domains are created using the libxl_domain_create_new() 
function, and domains are destroyed using the libxl_domain_destroy() 
function.

The test in this case creates a domain, waits a minute, then 
deletes/creates the next domain, waits a minute, and so on.  So I 
wouldn't be surprised to see the VMID occasionally indicate there are 2 
active domains since there could be one being created and one being 
destroyed in a very short time.  However, I wouldn't expect to ever have 
256 domains.
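
Stripped of error handling and our vchan setup, the test loop is roughly 
the following (build_minios_config() is a stand-in for our code that fills 
in the 32MB Mirage guest configuration, so don't read too much into the 
details):

#include <stdio.h>
#include <unistd.h>
#include <xentoollog.h>
#include <libxl.h>

/* Stand-in for our code that fills in the 32MB Mirage guest config. */
void build_minios_config(libxl_domain_config *cfg);

int main(void)
{
    xentoollog_logger_stdiostream *lg =
        xtl_createlogger_stdiostream(stderr, XTL_PROGRESS, 0);
    libxl_ctx *ctx = NULL;

    if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)lg))
        return 1;

    for (int i = 0; i < 10000; i++) {
        libxl_domain_config cfg;
        uint32_t domid = 0;

        libxl_domain_config_init(&cfg);
        build_minios_config(&cfg);

        /* Create the guest, let it run for a minute, then tear it down. */
        if (libxl_domain_create_new(ctx, &cfg, &domid, NULL, NULL) == 0) {
            sleep(60);
            libxl_domain_destroy(ctx, domid, NULL);
        }

        libxl_domain_config_dispose(&cfg);
    }

    libxl_ctx_free(ctx);
    xtl_logger_destroy((xentoollog_logger *)lg);
    return 0;
}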

The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which 
means that only 48 of the the Mirage domains (with 32MB of RAM) would 
work at the same time anyway.  Which doesn't account for the various 
inter-domain resources or the RAM used by Xen itself.

If the p2m_teardown() function checked for NULL it would prevent the 
crash, but I suspect Xen would be just as broken since all of my 
resources have leaked away.  More broken in fact, since if the board 
reboots at least the applications will restart and domains can be recreated.

It certainly appears that some resources are leaking when domains are 
deleted (possibly only on the ARM or ARM32 platforms).  We will try to 
add some debug prints and see if we can discover exactly what is going on.

- Aaron Cornelius



* Re: Xen 4.7 crash
  2016-06-01 22:31     ` Andrew Cooper
@ 2016-06-02  8:47       ` Jan Beulich
  2016-06-02  8:53         ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2016-06-02  8:47 UTC (permalink / raw)
  To: Julien Grall, Andrew Cooper
  Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
> On 01/06/2016 23:24, Julien Grall wrote:
>> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
>> there any reason to not do the same for free_domheap_pages?
> 
> The xenheap allocation functions deal in terms of plain virtual
> addresses, while the domheap functions deal in terms of struct page_info *.
> 
> Overall, this means that the domheap functions have a more restricted
> input/output set than their xenheap variants.
> 
> As there is already precedent with xenheap, making domheap tolerate NULL
> is probably fine, and indeed the preferred course of action.

I disagree, for the very reason you mention above.

Jan



* Re: Xen 4.7 crash
  2016-06-02  1:32   ` Aaron Cornelius
@ 2016-06-02  8:49     ` Jan Beulich
  2016-06-02  9:07     ` Julien Grall
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2016-06-02  8:49 UTC (permalink / raw)
  To: Aaron Cornelius; +Cc: Xen-devel, Julien Grall

>>> On 02.06.16 at 03:32, <aaron.cornelius@dornerworks.com> wrote:
> The test in this case creates a domain, waits a minute, then 
> deletes/creates the next domain, waits a minute, and so on.  So I 
> wouldn't be surprised to see the VMID occasionally indicate there are 2 
> active domains since there could be one being created and one being 
> destroyed in a very short time.  However, I wouldn't expect to ever have 
> 256 domains.

But - did you check? Things may pile up over time...

Jan



* Re: Xen 4.7 crash
  2016-06-02  8:47       ` Jan Beulich
@ 2016-06-02  8:53         ` Andrew Cooper
  2016-06-02  9:07           ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2016-06-02  8:53 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall; +Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

On 02/06/16 09:47, Jan Beulich wrote:
>>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
>> On 01/06/2016 23:24, Julien Grall wrote:
>>> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
>>> there any reason to not do the same for free_domheap_pages?
>> The xenheap allocation functions deal in terms of plain virtual
>> addresses, while the domheap functions deal in terms of struct page_info *.
>>
>> Overall, this means that the domheap functions have a more restricted
>> input/output set than their xenheap variants.
>>
>> As there is already precedent with xenheap, making domheap tolerate NULL
>> is probably fine, and indeed the preferred course of action.
> I disagree, for the very reason you mention above.

Which?  Dealing with a struct page_info pointer?  It's still just a
pointer, whose value is expected to be NULL if not allocated.

~Andrew


* Re: Xen 4.7 crash
  2016-06-02  8:53         ` Andrew Cooper
@ 2016-06-02  9:07           ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2016-06-02  9:07 UTC (permalink / raw)
  To: Julien Grall, Andrew Cooper
  Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

>>> On 02.06.16 at 10:53, <andrew.cooper3@citrix.com> wrote:
> On 02/06/16 09:47, Jan Beulich wrote:
>>>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
>>> On 01/06/2016 23:24, Julien Grall wrote:
>>>> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
>>>> there any reason to not do the same for free_domheap_pages?
>>> The xenheap allocation functions deal in terms of plain virtual
>>> addresses, while the domheap functions deal in terms of struct page_info *.
>>>
>>> Overall, this means that the domheap functions have a more restricted
>>> input/output set than their xenheap variants.
>>>
>>> As there is already precedent with xenheap, making domheap tolerate NULL
>>> is probably fine, and indeed the preferred course of action.
>> I disagree, for the very reason you mention above.
> 
> Which?  Dealing with struct page_info pointer?  Its still just a
> pointer, whose value is expected to be NULL if not allocated.

Yes, but it still makes the interface not malloc()-like, other than - as
you say yourself - e.g. the xenheap one. Just look at Linux for
comparison: __free_pages() also doesn't accept NULL, while
free_pages() does. I think we should stick to that distinction.
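
For reference, the Linux prototypes in question are (from memory):

/* Takes a virtual address; addr == 0 is silently ignored. */
void free_pages(unsigned long addr, unsigned int order);

/* Takes a struct page pointer; a NULL page is not tolerated. */
void __free_pages(struct page *page, unsigned int order);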

Jan



* Re: Xen 4.7 crash
  2016-06-02  1:32   ` Aaron Cornelius
  2016-06-02  8:49     ` Jan Beulich
@ 2016-06-02  9:07     ` Julien Grall
  2016-06-06 13:58       ` Aaron Cornelius
  1 sibling, 1 reply; 29+ messages in thread
From: Julien Grall @ 2016-06-02  9:07 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel, Jan Beulich

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:
> This is with a custom application, we use the libxl APIs to interact
> with Xen.  Domains are created using the libxl_domain_create_new()
> function, and domains are destroyed using the libxl_domain_destroy()
> function.
>
> The test in this case creates a domain, waits a minute, then
> deletes/creates the next domain, waits a minute, and so on.  So I
> wouldn't be surprised to see the VMID occasionally indicate there are 2
> active domains since there could be one being created and one being
> destroyed in a very short time.  However, I wouldn't expect to ever have
> 256 domains.

Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)

Which suggest that some grants are still mapped in DOM0.

>
> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
> means that only 48 of the the Mirage domains (with 32MB of RAM) would
> work at the same time anyway.  Which doesn't account for the various
> inter-domain resources or the RAM used by Xen itself.

All the pages that belong to the domain could have been freed except the 
ones referenced by DOM0, so the footprint of this domain will be limited 
at that time.

I would recommend you check how many domains are running at that time 
and whether DOM0 has effectively released all the resources.

> If the p2m_teardown() function checked for NULL it would prevent the
> crash, but I suspect Xen would be just as broken since all of my
> resources have leaked away.  More broken in fact, since if the board
> reboots at least the applications will restart and domains can be
> recreated.
>
> It certainly appears that some resources are leaking when domains are
> deleted (possibly only on the ARM or ARM32 platforms).  We will try to
> add some debug prints and see if we can discover exactly what is going on.

The leakage could also happen from DOM0. FWIW, I have been able to cycle 
2000 guests overnight on an ARM platform.

Regards,

-- 
Julien Grall


* Re: Xen 4.7 crash
  2016-06-02  9:07     ` Julien Grall
@ 2016-06-06 13:58       ` Aaron Cornelius
  2016-06-06 14:05         ` Julien Grall
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-06 13:58 UTC (permalink / raw)
  To: Julien Grall, Xen-devel, Jan Beulich

On 6/2/2016 5:07 AM, Julien Grall wrote:
> Hello Aaron,
>
> On 02/06/2016 02:32, Aaron Cornelius wrote:
>> This is with a custom application, we use the libxl APIs to interact
>> with Xen.  Domains are created using the libxl_domain_create_new()
>> function, and domains are destroyed using the libxl_domain_destroy()
>> function.
>>
>> The test in this case creates a domain, waits a minute, then
>> deletes/creates the next domain, waits a minute, and so on.  So I
>> wouldn't be surprised to see the VMID occasionally indicate there are 2
>> active domains since there could be one being created and one being
>> destroyed in a very short time.  However, I wouldn't expect to ever have
>> 256 domains.
>
> Your log has:
>
> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
>
> Which suggest that some grants are still mapped in DOM0.
>
>>
>> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
>> means that only 48 of the the Mirage domains (with 32MB of RAM) would
>> work at the same time anyway.  Which doesn't account for the various
>> inter-domain resources or the RAM used by Xen itself.
>
> All the pages who belongs to the domain could have been freed except the
> one referenced by DOM0. So the footprint of this domain will be limited
> at the time.
>
> I would recommend you to check how many domain are running at this time
> and if DOM0 effectively released all the resources.
>
>> If the p2m_teardown() function checked for NULL it would prevent the
>> crash, but I suspect Xen would be just as broken since all of my
>> resources have leaked away.  More broken in fact, since if the board
>> reboots at least the applications will restart and domains can be
>> recreated.
>>
>> It certainly appears that some resources are leaking when domains are
>> deleted (possibly only on the ARM or ARM32 platforms).  We will try to
>> add some debug prints and see if we can discover exactly what is going on.
>
> The leakage could also happen from DOM0. FWIW, I have been able to cycle
> 2000 guests over the night on an ARM platforms.
>

We've done some more testing regarding this issue.  Further testing 
shows that it doesn't matter whether we delete the vchans before the 
domains are deleted; those appear to be cleaned up correctly when the 
domain is destroyed.

What does stop this issue from happening (using the same version of Xen 
that the issue was detected on) is removing any non-standard xenstore 
references before deleting the domain.  In this case our application 
grants created domains permissions on non-standard xenstore paths.  
Making sure to remove those domain permissions before deleting the 
domain prevents the crash.

It does not appear to matter if we delete the standard domain xenstore 
path (/local/domain/<id>) since libxl handles removing this path when 
the domain is destroyed.

Based on this I would guess that the xenstore is hanging onto the VMID.

- Aaron Cornelius


* Re: Xen 4.7 crash
  2016-06-06 13:58       ` Aaron Cornelius
@ 2016-06-06 14:05         ` Julien Grall
  2016-06-06 14:19           ` Wei Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Julien Grall @ 2016-06-06 14:05 UTC (permalink / raw)
  To: Aaron Cornelius, Xen-devel, Jan Beulich
  Cc: Ian Jackson, Stefano Stabellini, Wei Liu

(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:
> On 6/2/2016 5:07 AM, Julien Grall wrote:
>> Hello Aaron,
>>
>> On 02/06/2016 02:32, Aaron Cornelius wrote:
>>> This is with a custom application, we use the libxl APIs to interact
>>> with Xen.  Domains are created using the libxl_domain_create_new()
>>> function, and domains are destroyed using the libxl_domain_destroy()
>>> function.
>>>
>>> The test in this case creates a domain, waits a minute, then
>>> deletes/creates the next domain, waits a minute, and so on.  So I
>>> wouldn't be surprised to see the VMID occasionally indicate there are 2
>>> active domains since there could be one being created and one being
>>> destroyed in a very short time.  However, I wouldn't expect to ever have
>>> 256 domains.
>>
>> Your log has:
>>
>> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
>> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
>> dom:(0)
>>
>> Which suggest that some grants are still mapped in DOM0.
>>
>>>
>>> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
>>> means that only 48 of the the Mirage domains (with 32MB of RAM) would
>>> work at the same time anyway.  Which doesn't account for the various
>>> inter-domain resources or the RAM used by Xen itself.
>>
>> All the pages who belongs to the domain could have been freed except the
>> one referenced by DOM0. So the footprint of this domain will be limited
>> at the time.
>>
>> I would recommend you to check how many domain are running at this time
>> and if DOM0 effectively released all the resources.
>>
>>> If the p2m_teardown() function checked for NULL it would prevent the
>>> crash, but I suspect Xen would be just as broken since all of my
>>> resources have leaked away.  More broken in fact, since if the board
>>> reboots at least the applications will restart and domains can be
>>> recreated.
>>>
>>> It certainly appears that some resources are leaking when domains are
>>> deleted (possibly only on the ARM or ARM32 platforms).  We will try to
>>> add some debug prints and see if we can discover exactly what is
>>> going on.
>>
>> The leakage could also happen from DOM0. FWIW, I have been able to cycle
>> 2000 guests over the night on an ARM platforms.
>>
>
> We've done some more testing regarding this issue.  And further testing
> shows that it doesn't matter if we delete the vchans before the domains
> are deleted.  Those appear to be cleaned up correctly when the domain is
> destroyed.
>
> What does stop this issue from happening (using the same version of Xen
> that the issue was detected on) is removing any non-standard xenstore
> references before deleting the domain.  In this case our application
> allocates permissions for created domains to non-standard xenstore
> paths.  Making sure to remove those domain permissions before deleting
> the domain prevents this issue from happening.

I am not sure I understand what you mean here. Could you give a quick 
example?

>
> It does not appear to matter if we delete the standard domain xenstore
> path (/local/domain/<id>) since libxl handles removing this path when
> the domain is destroyed.
>
> Based on this I would guess that the xenstore is hanging onto the VMID.

Regards,

-- 
Julien Grall


* Re: Xen 4.7 crash
  2016-06-06 14:05         ` Julien Grall
@ 2016-06-06 14:19           ` Wei Liu
  2016-06-06 15:02             ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Liu @ 2016-06-06 14:19 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Wei Liu, Aaron Cornelius, Ian Jackson,
	Jan Beulich, Xen-devel

On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
> (CC Ian, Stefano and Wei)
> 
> Hello Aaron,
> 
> On 06/06/16 14:58, Aaron Cornelius wrote:
> >On 6/2/2016 5:07 AM, Julien Grall wrote:
> >>Hello Aaron,
> >>
> >>On 02/06/2016 02:32, Aaron Cornelius wrote:
> >>>This is with a custom application, we use the libxl APIs to interact
> >>>with Xen.  Domains are created using the libxl_domain_create_new()
> >>>function, and domains are destroyed using the libxl_domain_destroy()
> >>>function.
> >>>
> >>>The test in this case creates a domain, waits a minute, then
> >>>deletes/creates the next domain, waits a minute, and so on.  So I
> >>>wouldn't be surprised to see the VMID occasionally indicate there are 2
> >>>active domains since there could be one being created and one being
> >>>destroyed in a very short time.  However, I wouldn't expect to ever have
> >>>256 domains.
> >>
> >>Your log has:
> >>
> >>(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> >>(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
> >>dom:(0)
> >>
> >>Which suggest that some grants are still mapped in DOM0.
> >>
> >>>
> >>>The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
> >>>means that only 48 of the the Mirage domains (with 32MB of RAM) would
> >>>work at the same time anyway.  Which doesn't account for the various
> >>>inter-domain resources or the RAM used by Xen itself.
> >>
> >>All the pages who belongs to the domain could have been freed except the
> >>one referenced by DOM0. So the footprint of this domain will be limited
> >>at the time.
> >>
> >>I would recommend you to check how many domain are running at this time
> >>and if DOM0 effectively released all the resources.
> >>
> >>>If the p2m_teardown() function checked for NULL it would prevent the
> >>>crash, but I suspect Xen would be just as broken since all of my
> >>>resources have leaked away.  More broken in fact, since if the board
> >>>reboots at least the applications will restart and domains can be
> >>>recreated.
> >>>
> >>>It certainly appears that some resources are leaking when domains are
> >>>deleted (possibly only on the ARM or ARM32 platforms).  We will try to
> >>>add some debug prints and see if we can discover exactly what is
> >>>going on.
> >>
> >>The leakage could also happen from DOM0. FWIW, I have been able to cycle
> >>2000 guests over the night on an ARM platforms.
> >>
> >
> >We've done some more testing regarding this issue.  And further testing
> >shows that it doesn't matter if we delete the vchans before the domains
> >are deleted.  Those appear to be cleaned up correctly when the domain is
> >destroyed.
> >
> >What does stop this issue from happening (using the same version of Xen
> >that the issue was detected on) is removing any non-standard xenstore
> >references before deleting the domain.  In this case our application
> >allocates permissions for created domains to non-standard xenstore
> >paths.  Making sure to remove those domain permissions before deleting
> >the domain prevents this issue from happening.
> 
> I am not sure to understand what you mean here. Could you give a quick
> example?
> 
> >
> >It does not appear to matter if we delete the standard domain xenstore
> >path (/local/domain/<id>) since libxl handles removing this path when
> >the domain is destroyed.
> >
> >Based on this I would guess that the xenstore is hanging onto the VMID.
> 

This is a somewhat strange conclusion. I guess the root cause is still
unclear at this point.

Is it possible that something else relies on those xenstore nodes to
free up resources?

Wei.

> Regards,
> 
> -- 
> Julien Grall


* Re: Xen 4.7 crash
  2016-06-06 14:19           ` Wei Liu
@ 2016-06-06 15:02             ` Aaron Cornelius
  2016-06-07  9:53               ` Ian Jackson
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-06 15:02 UTC (permalink / raw)
  To: Wei Liu, Julien Grall
  Cc: Xen-devel, Stefano Stabellini, Ian Jackson, Jan Beulich

On 6/6/2016 10:19 AM, Wei Liu wrote:
> On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
>> (CC Ian, Stefano and Wei)
>>
>> Hello Aaron,
>>
>> On 06/06/16 14:58, Aaron Cornelius wrote:
>>> On 6/2/2016 5:07 AM, Julien Grall wrote:
>>>> Hello Aaron,
>>>>
>>>> On 02/06/2016 02:32, Aaron Cornelius wrote:
>>>>> This is with a custom application, we use the libxl APIs to interact
>>>>> with Xen.  Domains are created using the libxl_domain_create_new()
>>>>> function, and domains are destroyed using the libxl_domain_destroy()
>>>>> function.
>>>>>
>>>>> The test in this case creates a domain, waits a minute, then
>>>>> deletes/creates the next domain, waits a minute, and so on.  So I
>>>>> wouldn't be surprised to see the VMID occasionally indicate there are 2
>>>>> active domains since there could be one being created and one being
>>>>> destroyed in a very short time.  However, I wouldn't expect to ever have
>>>>> 256 domains.
>>>>
>>>> Your log has:
>>>>
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
>>>> dom:(0)
>>>>
>>>> Which suggest that some grants are still mapped in DOM0.
>>>>
>>>>>
>>>>> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
>>>>> means that only 48 of the the Mirage domains (with 32MB of RAM) would
>>>>> work at the same time anyway.  Which doesn't account for the various
>>>>> inter-domain resources or the RAM used by Xen itself.
>>>>
>>>> All the pages who belongs to the domain could have been freed except the
>>>> one referenced by DOM0. So the footprint of this domain will be limited
>>>> at the time.
>>>>
>>>> I would recommend you to check how many domain are running at this time
>>>> and if DOM0 effectively released all the resources.
>>>>
>>>>> If the p2m_teardown() function checked for NULL it would prevent the
>>>>> crash, but I suspect Xen would be just as broken since all of my
>>>>> resources have leaked away.  More broken in fact, since if the board
>>>>> reboots at least the applications will restart and domains can be
>>>>> recreated.
>>>>>
>>>>> It certainly appears that some resources are leaking when domains are
>>>>> deleted (possibly only on the ARM or ARM32 platforms).  We will try to
>>>>> add some debug prints and see if we can discover exactly what is
>>>>> going on.
>>>>
>>>> The leakage could also happen from DOM0. FWIW, I have been able to cycle
>>>> 2000 guests over the night on an ARM platforms.
>>>>
>>>
>>> We've done some more testing regarding this issue.  And further testing
>>> shows that it doesn't matter if we delete the vchans before the domains
>>> are deleted.  Those appear to be cleaned up correctly when the domain is
>>> destroyed.
>>>
>>> What does stop this issue from happening (using the same version of Xen
>>> that the issue was detected on) is removing any non-standard xenstore
>>> references before deleting the domain.  In this case our application
>>> allocates permissions for created domains to non-standard xenstore
>>> paths.  Making sure to remove those domain permissions before deleting
>>> the domain prevents this issue from happening.
>>
>> I am not sure to understand what you mean here. Could you give a quick
>> example?

So we have a custom xenstore path for our tool (/tool/custom/ for the 
sake of this example), and we allow every domain created using this 
tool to read that path.  When a domain is created, it is explicitly 
given read permissions using xs_set_permissions().  More precisely 
(sketched in code below), we:
1. retrieve the current list of permissions with xs_get_permissions()
2. realloc the permissions list to increase it by 1
3. update the list of permissions to give the new domain read only access
4. then set the new permissions list with xs_set_permissions()
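
For illustration, a minimal sketch of those four steps against the C 
libxenstore API (the helper name and the error handling are mine, not 
our actual tool's code):

#include <stdbool.h>
#include <stdlib.h>
#include <xenstore.h>

/* Append read-only access for a newly created domain to the ACL on an
 * existing xenstore path.  Illustrative only. */
static bool grant_read_access(struct xs_handle *h, const char *path,
                              unsigned int domid)
{
    unsigned int num = 0;

    /* 1. fetch the current ACL (a malloc'd array the caller must free) */
    struct xs_permissions *perms =
        xs_get_permissions(h, XBT_NULL, path, &num);
    if (!perms)
        return false;

    /* 2. grow the list by one entry */
    struct xs_permissions *p = realloc(perms, (num + 1) * sizeof(*perms));
    if (!p) {
        free(perms);
        return false;
    }

    /* 3. append the new domain with read-only access */
    p[num].id = domid;
    p[num].perms = XS_PERM_READ;

    /* 4. write the updated ACL back */
    bool ok = xs_set_permissions(h, XBT_NULL, path, p, num + 1);
    free(p);
    return ok;
}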

We saw errors logged because this list of permissions was getting 
prohibitively large, but this error did not appear to be directly 
connected to the Xen crash I submitted last week.  Or so we thought at 
the time.

We realized that we had forgotten to remove the domain from the 
permissions list when the domain is deleted (which would cause the error 
we saw).  The application was updated to remove the domain from the 
permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain 
from the permissions list
4. update the permissions with xs_set_permissions()
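
Again as an illustrative sketch (not our exact code), the removal looks 
roughly like:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <xenstore.h>

/* Drop a destroyed domain from the ACL on the given path. */
static bool revoke_access(struct xs_handle *h, const char *path,
                          unsigned int domid)
{
    unsigned int num = 0, i;

    /* 1. fetch the current ACL */
    struct xs_permissions *perms =
        xs_get_permissions(h, XBT_NULL, path, &num);
    if (!perms)
        return false;

    /* 2. find the entry for the domain being deleted */
    for (i = 0; i < num && perms[i].id != domid; i++)
        ;
    if (i == num) {                 /* not in the list, nothing to do */
        free(perms);
        return true;
    }

    /* 3. memmove() the remaining entries down by one */
    memmove(&perms[i], &perms[i + 1], (num - i - 1) * sizeof(*perms));

    /* 4. write the shortened ACL back */
    bool ok = xs_set_permissions(h, XBT_NULL, path, perms, num - 1);
    free(perms);
    return ok;
}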

After we made that change, a load test over the weekend confirmed that 
the Xen crash no longer happens.  We checked this morning first thing 
and confirmed that without this change the crash reliably occurs.

>>> It does not appear to matter if we delete the standard domain xenstore
>>> path (/local/domain/<id>) since libxl handles removing this path when
>>> the domain is destroyed.
>>>
>>> Based on this I would guess that the xenstore is hanging onto the VMID.
>>
>
> This is a somewhat strange conclusion. I guess the root cause is still
> unclear at this point.

We originally tested a fix that explicitly cleaned up the vchans 
(created to communicate with the domains) before the 
xen_domain_destroy() function was called, and there was no change.  We 
have confirmed that the vchans do not appear to cause issues when they 
are not deleted prior to the domain being destroyed.

Our application did delete them eventually, but last week they were only 
deleted _after_ the domain was destroyed.  I would guess that if they 
are not explicitly deleted they could cause this same problem.

> Is it possible that something else relies on those xenstore nodes to
> free up resources?

It was stated earlier in this thread that the VMID is only deleted once 
all references to it are destroyed.  I would speculate that the xenstore 
permissions list is one of these references that could prevent a domain 
reference (and VMID) from being completely cleaned up.

- Aaron Cornelius

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-06 15:02             ` Aaron Cornelius
@ 2016-06-07  9:53               ` Ian Jackson
  2016-06-07 13:40                 ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Jackson @ 2016-06-07  9:53 UTC (permalink / raw)
  To: Aaron Cornelius
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> We realized that we had forgotten to remove the domain from the 
> permissions list when the domain is deleted (which would cause the error 
> we saw).  The application was updated to remove the domain from the 
> permissions list:
> 1. retrieve the permissions with xs_get_permissions()
> 2. find the domain ID that is being deleted
> 3. memmove() the remaining domains down by 1 to "delete" the old domain 
> from the permissions list
> 4. update the permissions with xs_set_permissions()
> 
> After we made that change, a load test over the weekend confirmed that 
> the Xen crash no longer happens.  We checked this morning first thing 
> and confirmed that without this change the crash reliably occurs.

This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-07  9:53               ` Ian Jackson
@ 2016-06-07 13:40                 ` Aaron Cornelius
  2016-06-07 15:13                   ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-07 13:40 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

On 6/7/2016 5:53 AM, Ian Jackson wrote:
> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>> We realized that we had forgotten to remove the domain from the
>> permissions list when the domain is deleted (which would cause the error
>> we saw).  The application was updated to remove the domain from the
>> permissions list:
>> 1. retrieve the permissions with xs_get_permissions()
>> 2. find the domain ID that is being deleted
>> 3. memmove() the remaining domains down by 1 to "delete" the old domain
>> from the permissions list
>> 4. update the permissions with xs_set_permissions()
>>
>> After we made that change, a load test over the weekend confirmed that
>> the Xen crash no longer happens.  We checked this morning first thing
>> and confirmed that without this change the crash reliably occurs.
>
> This is rather odd behaviour.  I don't think xenstored should hang
> onto the domain's xs ring page just because the domain is still
> mentioned in a permission list.
>
> But it may do.  I haven't checked the code.  Are you using the
> ocaml xenstored (oxenstored) or the C one ?

I don't remember specifying anything special when building the Xen 
tools, but I did run into trouble where the ocaml tools appeared to 
conflict with the opam-installed Mirage packages and libraries.  Running 
the "sudo make dist-install" command installs the ocaml libraries as 
root, which made using opam difficult.  So I disabled the ocaml tools 
during my build.

I double checked and confirmed that the C version of xenstored was 
built.  We will try to test the failure scenario with oxenstored to see 
if it behaves any differently.

- Aaron

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-07 13:40                 ` Aaron Cornelius
@ 2016-06-07 15:13                   ` Aaron Cornelius
  2016-06-09 11:14                     ` Ian Jackson
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-07 15:13 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

On 6/7/2016 9:40 AM, Aaron Cornelius wrote:
> On 6/7/2016 5:53 AM, Ian Jackson wrote:
>> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>>> We realized that we had forgotten to remove the domain from the
>>> permissions list when the domain is deleted (which would cause the error
>>> we saw).  The application was updated to remove the domain from the
>>> permissions list:
>>> 1. retrieve the permissions with xs_get_permissions()
>>> 2. find the domain ID that is being deleted
>>> 3. memmove() the remaining domains down by 1 to "delete" the old domain
>>> from the permissions list
>>> 4. update the permissions with xs_set_permissions()
>>>
>>> After we made that change, a load test over the weekend confirmed that
>>> the Xen crash no longer happens.  We checked this morning first thing
>>> and confirmed that without this change the crash reliably occurs.
>>
>> This is rather odd behaviour.  I don't think xenstored should hang
>> onto the domain's xs ring page just because the domain is still
>> mentioned in a permission list.
>>
>> But it may do.  I haven't checked the code.  Are you using the
>> ocaml xenstored (oxenstored) or the C one ?
>
> I didn't remember specifying anything special when building the xen
> tools, but I did run into troubles where the ocaml tools appeared to
> conflict with the opam installed mirage packages and libraries. Running
> "sudo make dist-install" command installs the ocaml libraries as root
> which made using opam difficult.  So I did disable the ocaml tools
> during my build.
>
> I double checked and confirmed that the C version of xenstored was
> built.  We will try to test the failure scenario with oxenstored to see
> if it behaves any differently.

I am not that familiar with the xenstored code, but as far as I can tell 
the grant mapping will be held by the xenstore until the xs_release() 
function is called (which is not called by libxl, and I do not 
explicitly call it in my software, although I might now just to be 
safe), or until the last reference to a domain is released and the 
registered destructor (destroy_domain), set by talloc_set_destructor(), 
is called.
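
To illustrate the mechanism I'm referring to, here is a toy example of 
the generic talloc destructor pattern (plain talloc, not xenstored's 
actual code; the struct and names are made up):

#include <stdio.h>
#include <talloc.h>

struct fake_domain {
    unsigned int domid;
    /* xenstored's real domain object also tracks things like the
     * mapped ring page and event channel. */
};

/* Runs when the last talloc reference to the object is freed;
 * returning 0 allows the free, -1 would veto it. */
static int fake_domain_destructor(struct fake_domain *d)
{
    printf("releasing resources for dom%u\n", d->domid);
    return 0;
}

int main(void)
{
    struct fake_domain *d = talloc_zero(NULL, struct fake_domain);
    d->domid = 42;
    talloc_set_destructor(d, fake_domain_destructor);

    /* The destructor fires here, when the object is freed. */
    talloc_free(d);
    return 0;
}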

I tried to follow the oxenstored code, but I certainly don't consider 
myself an expert at OCaml.  The oxenstored code does not appear to 
allocate grant mappings at all, which makes me think I am probably 
misunderstanding the code :)

- Aaron

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-07 15:13                   ` Aaron Cornelius
@ 2016-06-09 11:14                     ` Ian Jackson
  2016-06-14 13:11                       ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Jackson @ 2016-06-09 11:14 UTC (permalink / raw)
  To: Aaron Cornelius
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> I am not that familiar with the xenstored code, but as far as I can tell 
> the grant mapping will be held by the xenstore until the xs_release() 
> function is called (which is not called by libxl, and I do not 
> explicitly call it in my software, although I might now just to be 
> safe), or until the last reference to a domain is released and the 
> registered destructor (destroy_domain), set by talloc_set_destructor(), 
> is called.

I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-09 11:14                     ` Ian Jackson
@ 2016-06-14 13:11                       ` Aaron Cornelius
  2016-06-14 13:15                         ` Wei Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:11 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

On 6/9/2016 7:14 AM, Ian Jackson wrote:
> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>> I am not that familiar with the xenstored code, but as far as I can tell
>> the grant mapping will be held by the xenstore until the xs_release()
>> function is called (which is not called by libxl, and I do not
>> explicitly call it in my software, although I might now just to be
>> safe), or until the last reference to a domain is released and the
>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>> is called.
>
> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>
> The grant mapping is released by destroy_domain, which is called via
> the talloc destructor as a result of talloc_free(domain->conn) in
> domain_cleanup.  I don't see other references to domain->conn.
>
> domain_cleanup calls talloc_free on domain->conn when it sees the
> domain marked as dying in domain_cleanup.
>
> So I still think that your acl reference ought not to keep the grant
> mapping alive.

It took a while to complete the testing, but we've finished trying to 
reproduce the error using oxenstored instead of the C xenstored.  When 
the condition occurs that caused the error with the C xenstored (on 
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not 
cause the crash.

So for whatever reason, it would appear that the C xenstored does keep 
the grant allocations open, but oxenstored does not.

- Aaron Cornelius


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-14 13:11                       ` Aaron Cornelius
@ 2016-06-14 13:15                         ` Wei Liu
  2016-06-14 13:26                           ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Liu @ 2016-06-14 13:15 UTC (permalink / raw)
  To: Aaron Cornelius
  Cc: Stefano Stabellini, Wei Liu, Ian Jackson, Julien Grall,
	Jan Beulich, Xen-devel

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> On 6/9/2016 7:14 AM, Ian Jackson wrote:
> >Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> >>I am not that familiar with the xenstored code, but as far as I can tell
> >>the grant mapping will be held by the xenstore until the xs_release()
> >>function is called (which is not called by libxl, and I do not
> >>explicitly call it in my software, although I might now just to be
> >>safe), or until the last reference to a domain is released and the
> >>registered destructor (destroy_domain), set by talloc_set_destructor(),
> >>is called.
> >
> >I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> >
> >The grant mapping is released by destroy_domain, which is called via
> >the talloc destructor as a result of talloc_free(domain->conn) in
> >domain_cleanup.  I don't see other references to domain->conn.
> >
> >domain_cleanup calls talloc_free on domain->conn when it sees the
> >domain marked as dying in domain_cleanup.
> >
> >So I still think that your acl reference ought not to keep the grant
> >mapping alive.
> 
> It took a while to complete the testing, but we've finished trying to
> reproduce the error using oxenstored instead of the C xenstored.  When the
> condition occurs that caused the error with the C xenstored (on
> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
> cause the crash.
> 
> So for whatever reason, it would appear that the C xenstored does keep the
> grant allocations open, but oxenstored does not.
> 

Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?

Wei.

> - Aaron Cornelius
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-14 13:15                         ` Wei Liu
@ 2016-06-14 13:26                           ` Aaron Cornelius
  2016-06-14 13:38                             ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:26 UTC (permalink / raw)
  To: Wei Liu
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Ian Jackson, Jan Beulich

On 6/14/2016 9:15 AM, Wei Liu wrote:
> On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
>> On 6/9/2016 7:14 AM, Ian Jackson wrote:
>>> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>>>> I am not that familiar with the xenstored code, but as far as I can tell
>>>> the grant mapping will be held by the xenstore until the xs_release()
>>>> function is called (which is not called by libxl, and I do not
>>>> explicitly call it in my software, although I might now just to be
>>>> safe), or until the last reference to a domain is released and the
>>>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>>>> is called.
>>>
>>> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>>>
>>> The grant mapping is released by destroy_domain, which is called via
>>> the talloc destructor as a result of talloc_free(domain->conn) in
>>> domain_cleanup.  I don't see other references to domain->conn.
>>>
>>> domain_cleanup calls talloc_free on domain->conn when it sees the
>>> domain marked as dying in domain_cleanup.
>>>
>>> So I still think that your acl reference ought not to keep the grant
>>> mapping alive.
>>
>> It took a while to complete the testing, but we've finished trying to
>> reproduce the error using oxenstored instead of the C xenstored.  When the
>> condition occurs that caused the error with the C xenstored (on
>> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
>> cause the crash.
>>
>> So for whatever reason, it would appear that the C xenstored does keep the
>> grant allocations open, but oxenstored does not.
>>
>
> Can you provide some easy to follow steps to reproduce this issue?
>
> AFAICT your environment is very specialised, but we should be able to
> trigger the issue with plain xenstore-* utilities?

I am not sure if the plain xenstore-* utilities will work, but here are 
the steps to follow:

1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100% 
sure how to do this with the xenstore-* utilities)
    a. call xs_get_permissions()
    b. realloc() the permissions block to add the new domain
    c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error 
because the list of domains has grown too large.  Sometime after that is 
when the crash occurs with the C xenstored and the 4.7.0-rc4 version of 
Xen.  It usually takes around 1200 or so iterations for the crash to occur.
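
Roughly, our test loop looks like the sketch below.  The domain 
create/destroy helpers are hypothetical stand-ins for the 
libxl_domain_create_new()/libxl_domain_destroy() setup, and 
grant_read_access() stands for the xs_get_permissions() / realloc() / 
xs_set_permissions() sequence described earlier; this is not our actual 
tool:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <xenstore.h>

/* Hypothetical helpers: create a small domU and return 0 on success,
 * destroy it again, and append read access to the xenstore ACL. */
extern int create_small_domu(uint32_t *domid);
extern void destroy_domu(uint32_t domid);
extern bool grant_read_access(struct xs_handle *h, const char *path,
                              unsigned int domid);

int main(void)
{
    struct xs_handle *h = xs_open(0);
    if (!h)
        return 1;

    /* step 1: create the non-standard path */
    xs_write(h, XBT_NULL, "/tool/test", "", 0);

    for (int i = 0; i < 2000; i++) {
        uint32_t domid;

        if (create_small_domu(&domid))                  /* step 2 */
            break;
        if (!grant_read_access(h, "/tool/test", domid)) /* step 3 */
            fprintf(stderr, "iteration %d: ACL update failed\n", i);
        destroy_domu(domid);                            /* step 4 */
    }                                                   /* step 5 */

    xs_close(h);
    return 0;
}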

- Aaron Cornelius

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-14 13:26                           ` Aaron Cornelius
@ 2016-06-14 13:38                             ` Aaron Cornelius
  2016-06-14 13:47                               ` Wei Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:38 UTC (permalink / raw)
  To: Wei Liu
  Cc: Xen-devel, Julien Grall, Stefano Stabellini, Ian Jackson, Jan Beulich

On 6/14/2016 9:26 AM, Aaron Cornelius wrote:
> On 6/14/2016 9:15 AM, Wei Liu wrote:
>> On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
>>> On 6/9/2016 7:14 AM, Ian Jackson wrote:
>>>> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>>>>> I am not that familiar with the xenstored code, but as far as I can tell
>>>>> the grant mapping will be held by the xenstore until the xs_release()
>>>>> function is called (which is not called by libxl, and I do not
>>>>> explicitly call it in my software, although I might now just to be
>>>>> safe), or until the last reference to a domain is released and the
>>>>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>>>>> is called.
>>>>
>>>> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>>>>
>>>> The grant mapping is released by destroy_domain, which is called via
>>>> the talloc destructor as a result of talloc_free(domain->conn) in
>>>> domain_cleanup.  I don't see other references to domain->conn.
>>>>
>>>> domain_cleanup calls talloc_free on domain->conn when it sees the
>>>> domain marked as dying in domain_cleanup.
>>>>
>>>> So I still think that your acl reference ought not to keep the grant
>>>> mapping alive.
>>>
>>> It took a while to complete the testing, but we've finished trying to
>>> reproduce the error using oxenstored instead of the C xenstored.  When the
>>> condition occurs that caused the error with the C xenstored (on
>>> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
>>> cause the crash.
>>>
>>> So for whatever reason, it would appear that the C xenstored does keep the
>>> grant allocations open, but oxenstored does not.
>>>
>>
>> Can you provide some easy to follow steps to reproduce this issue?
>>
>> AFAICT your environment is very specialised, but we should be able to
>> trigger the issue with plain xenstore-* utilities?
>
> I am not sure if the plain xenstore-* utilities will work, but here are
> the steps to follow:
>
> 1. Create a non-standard xenstore path: /tool/test
> 2. Create a domU (mini-os/mirage/something small)
> 3. Add the new domU to the /tool/test permissions list (I'm not 100%
> sure how to do this with the xenstore-* utilities)
>     a. call xs_get_permissions()
>     b. realloc() the permissions block to add the new domain
>     c. call xs_set_permissions()
> 4. Delete the domU from step 2
> 5. Repeat steps 2-4
>
> Eventually the xs_set_permissions() function will return an E2BIG error
> because the list of domains has grown too large.  Sometime after that is
> when the crash occurs with the C xenstored and the 4.7.0-rc4 version of
> Xen.  It usually takes around 1200 or so iterations for the crash to occur.

After writing up those steps I suddenly realized that I think I have a 
bug in my test that might have been causing the problem in the first 
place.  Once I got errors back from xs_set_permissions() I was not 
properly cleaning up the created domains.  So I think this was just a 
simple case of VMID exhaustion caused by creating more than 255 domUs 
at the same time.

In which case this is completely unrelated to xenstore holding on to 
grant allocations, and the C xenstore most likely behaves correctly.

- Aaron Cornelius


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Xen 4.7 crash
  2016-06-14 13:38                             ` Aaron Cornelius
@ 2016-06-14 13:47                               ` Wei Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Wei Liu @ 2016-06-14 13:47 UTC (permalink / raw)
  To: Aaron Cornelius
  Cc: Stefano Stabellini, Wei Liu, Ian Jackson, Julien Grall,
	Jan Beulich, Xen-devel

On Tue, Jun 14, 2016 at 09:38:22AM -0400, Aaron Cornelius wrote:
> On 6/14/2016 9:26 AM, Aaron Cornelius wrote:
> >On 6/14/2016 9:15 AM, Wei Liu wrote:
> >>On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> >>>On 6/9/2016 7:14 AM, Ian Jackson wrote:
> >>>>Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> >>>>>I am not that familiar with the xenstored code, but as far as I can tell
> >>>>>the grant mapping will be held by the xenstore until the xs_release()
> >>>>>function is called (which is not called by libxl, and I do not
> >>>>>explicitly call it in my software, although I might now just to be
> >>>>>safe), or until the last reference to a domain is released and the
> >>>>>registered destructor (destroy_domain), set by talloc_set_destructor(),
> >>>>>is called.
> >>>>
> >>>>I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> >>>>
> >>>>The grant mapping is released by destroy_domain, which is called via
> >>>>the talloc destructor as a result of talloc_free(domain->conn) in
> >>>>domain_cleanup.  I don't see other references to domain->conn.
> >>>>
> >>>>domain_cleanup calls talloc_free on domain->conn when it sees the
> >>>>domain marked as dying in domain_cleanup.
> >>>>
> >>>>So I still think that your acl reference ought not to keep the grant
> >>>>mapping alive.
> >>>
> >>>It took a while to complete the testing, but we've finished trying to
> >>>reproduce the error using oxenstored instead of the C xenstored.  When the
> >>>condition occurs that caused the error with the C xenstored (on
> >>>4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
> >>>cause the crash.
> >>>
> >>>So for whatever reason, it would appear that the C xenstored does keep the
> >>>grant allocations open, but oxenstored does not.
> >>>
> >>
> >>Can you provide some easy to follow steps to reproduce this issue?
> >>
> >>AFAICT your environment is very specialised, but we should be able to
> >>trigger the issue with plain xenstore-* utilities?
> >
> >I am not sure if the plain xenstore-* utilities will work, but here are
> >the steps to follow:
> >
> >1. Create a non-standard xenstore path: /tool/test
> >2. Create a domU (mini-os/mirage/something small)
> >3. Add the new domU to the /tool/test permissions list (I'm not 100%
> >sure how to do this with the xenstore-* utilities)
> >    a. call xs_get_permissions()
> >    b. realloc() the permissions block to add the new domain
> >    c. call xs_set_permissions()
> >4. Delete the domU from step 2
> >5. Repeat steps 2-4
> >
> >Eventually the xs_set_permissions() function will return an E2BIG error
> >because the list of domains has grown too large.  Sometime after that is
> >when the crash occurs with the C xenstored and the 4.7.0-rc4 version of
> >Xen.  It usually takes around 1200 or so iterations for the crash to occur.
> 
> After writing up those steps I suddenly realized that I think I have a bug
> in my test that might have been causing the bug in the first place. Once I
> get errors returned from xs_set_permissions() I was not properly cleaning up
> the created domains.  So I think this was just a simple case of VMID
> exhaustion by creating more than 255 domUs at the same time.
> 
> In which case this is completely unrelated to xenstore holding on to grant
> allocations, and the C xenstore most likely behaves correctly.
> 

OK, so I will treat this issue as resolved for now. Let us know if you
discover something new.

Wei.

> - Aaron Cornelius
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-06-14 13:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
2016-06-01 20:00 ` Andrew Cooper
2016-06-01 20:45   ` Aaron Cornelius
2016-06-01 21:24     ` Andrew Cooper
2016-06-01 22:18       ` Julien Grall
2016-06-01 22:26         ` Andrew Cooper
2016-06-01 21:35 ` Andrew Cooper
2016-06-01 22:24   ` Julien Grall
2016-06-01 22:31     ` Andrew Cooper
2016-06-02  8:47       ` Jan Beulich
2016-06-02  8:53         ` Andrew Cooper
2016-06-02  9:07           ` Jan Beulich
2016-06-01 22:35 ` Julien Grall
2016-06-02  1:32   ` Aaron Cornelius
2016-06-02  8:49     ` Jan Beulich
2016-06-02  9:07     ` Julien Grall
2016-06-06 13:58       ` Aaron Cornelius
2016-06-06 14:05         ` Julien Grall
2016-06-06 14:19           ` Wei Liu
2016-06-06 15:02             ` Aaron Cornelius
2016-06-07  9:53               ` Ian Jackson
2016-06-07 13:40                 ` Aaron Cornelius
2016-06-07 15:13                   ` Aaron Cornelius
2016-06-09 11:14                     ` Ian Jackson
2016-06-14 13:11                       ` Aaron Cornelius
2016-06-14 13:15                         ` Wei Liu
2016-06-14 13:26                           ` Aaron Cornelius
2016-06-14 13:38                             ` Aaron Cornelius
2016-06-14 13:47                               ` Wei Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).