* Xen 4.7 crash
@ 2016-06-01 19:54 Aaron Cornelius
  2016-06-01 20:00 ` Andrew Cooper
  ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread

From: Aaron Cornelius @ 2016-06-01 19:54 UTC (permalink / raw)
To: Xen-devel

I am doing some work with Xen 4.7 on the CubieTruck (ARM32). I've noticed some strange behavior after I create/destroy enough domains, so I put together a script to do the add/remove for me. For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the next one, and so on.

After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):

(d846) Virtual -> physical offset = 3fc00000
(d846) Checking DTB at 023ff000...
(d846) [32;1mMirageOS booting...[0m
(d846) Initialising console ... done.
(d846) gnttab_stubs.c: initialised mini-os gntmap
(d846) allocate_ondemand(1, 1) returning 2300000
(d846) allocate_ondemand(1, 1) returning 2301000
(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
(XEN) p2m.c: dom1101: VMID pool exhausted
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN) ----[ Xen-4.7.0-rc arm32 debug=y Not tainted ]----
(XEN) CPU:    0
(XEN) PC:     0021fdd4 free_domheap_pages+0x1c/0x324
(XEN) CPSR:   6001011a MODE:Hypervisor
(XEN)      R0: 00000000 R1: 00000001 R2: 00000003 R3: 00304320
(XEN)      R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
(XEN)      R8: 41c57180 R9: 43fdfe60 R10:00000000 R11:43fdfd5c R12:00000000
(XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 00010000bfb0e000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)    HCR_EL2: 000000000038663f
(XEN)  TTBR0_EL2: 00000000bfafc000
(XEN)
(XEN)    ESR_EL2: 94000006
(XEN)  HPFAR_EL2: 000000000001c810
(XEN)      HDFAR: 00000014
(XEN)      HIFAR: 84e37182
(XEN)
(XEN) Xen stack trace from sp=43fdfd2c:
(XEN)    002cf1b7 43fdfd64 41c57000 00000100 41c57000 41c57188 00200200 00100100
(XEN)    41c57180 43fdfe60 00000000 43fdfd7c 0025b0cc 41c57000 fffffff0 43fdfe60
(XEN)    0000001f 0000044d 43fdfe60 43fdfd8c 0024f668 41c57000 fffffff0 43fdfda4
(XEN)    0024f8f0 41c57000 00000000 00000000 0000001f 43fdfddc 0020854c 43fdfddc
(XEN)    00000000 cccccccd 00304600 002822bc 00000000 b6f20004 0000044d 00304600
(XEN)    00304320 d767a000 00000000 43fdfeec 00206d6c 43fdfe6c 00218f8c 00000000
(XEN)    00000007 43fdfe30 43fdfe34 00000000 43fdfe20 00000002 43fdfe48 43fdfe78
(XEN)    00000000 00000000 00000000 00007622 00002b0e 40023000 00000000 43fdfec8
(XEN)    00000002 43fdfebc 00218f8c 00000001 0000000b 0000ffff b6eba880 0000000b
(XEN)    5abab87d f34aab2c 6adc50b8 e1713cd0 00000000 00000000 00000000 00000000
(XEN)    b6eba8d8 00000000 50043f00 b6eb5038 b6effba8 0000003e 00000000 000c3034
(XEN)    000b9cb8 000bda30 000bda30 00000000 b6eba56c 0000003e b6effba8 b6effdb0
(XEN)    be9558d4 000000d0 be9558d4 00000071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
(XEN)    c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000 00000000 43fdff54
(XEN)    00260130 00000000 43fdfefc 43fdff1c 200f019a 400238f4 00000004 00000004
(XEN)    002c9f00 00000000 00304600 c094c240 00000000 00305000 be9557a0 d767a000
(XEN)    00000000 43fdff44 00000000 c094c240 00000000 00305000 be9557c8 d767a000
(XEN)    00000000 43fdff58 00263b10 b6f20004 00000000 00000000 00000000 00000000
(XEN)    c094c240 00000000 00305000 be9557c8 d767a000 00000000 00000001 00000024
(XEN)    ffffffff b691ab34 c01077f8 60010013 00000000 be9557c4 c0a38600 c010c400
(XEN) Xen call trace:
(XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)    [<0020854c>] domain_create+0x2dc/0x510
(XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VMID, which caused domain creation to fail, and then a NULL pointer dereference when trying to clean up the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.

- Aaron Cornelius

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
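As context for the discussion that follows: on ARM, Xen assigns each domain a hardware VMID from a small fixed pool (256 on ARM32) tracked in a bitmap under a spinlock, and a leak on the free side drains the pool after enough create/destroy cycles. The following is a minimal, self-contained sketch of that bookkeeping in plain C; the names and sizes are illustrative, not the actual Xen code (which also takes vmid_alloc_lock around the bitmap):

```c
#include <assert.h>
#include <limits.h>

#define MAX_VMID      256           /* 8-bit VMID field on ARM32 */
#define INVALID_VMID  UINT_MAX
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static unsigned long vmid_mask[MAX_VMID / BITS_PER_WORD];

/* Claim and return the first free VMID, or INVALID_VMID if exhausted. */
static unsigned int vmid_alloc(void)
{
    for (unsigned int id = 0; id < MAX_VMID; id++) {
        unsigned long *word = &vmid_mask[id / BITS_PER_WORD];
        unsigned long bit = 1UL << (id % BITS_PER_WORD);

        if (!(*word & bit)) {
            *word |= bit;
            return id;
        }
    }
    return INVALID_VMID;            /* "VMID pool exhausted" */
}

/* Return a VMID to the pool; must be called exactly once per allocation. */
static void vmid_free(unsigned int id)
{
    if (id != INVALID_VMID)
        vmid_mask[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}
```

If every destroy path calls vmid_free() exactly once, a create/destroy cycle keeps reusing the same ID; a missed free leaks one ID per cycle, and the pool drains after 256 cycles, which is consistent with a failure that only appears after many iterations.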
* Re: Xen 4.7 crash
  2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
@ 2016-06-01 20:00 ` Andrew Cooper
  2016-06-01 20:45   ` Aaron Cornelius
  2016-06-01 21:35 ` Andrew Cooper
  2016-06-01 22:35 ` Julien Grall
  2 siblings, 1 reply; 29+ messages in thread

From: Andrew Cooper @ 2016-06-01 20:00 UTC (permalink / raw)
To: Aaron Cornelius, Xen-devel

On 01/06/2016 20:54, Aaron Cornelius wrote:
> I am doing some work with Xen 4.7 on the CubieTruck (ARM32). I've noticed some strange behavior after I create/destroy enough domains, so I put together a script to do the add/remove for me. For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the next one, and so on.
>
> After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):
>
> (d846) Virtual -> physical offset = 3fc00000
> (d846) Checking DTB at 023ff000...
> (d846) [32;1mMirageOS booting...[0m
> (d846) Initialising console ... done.
> (d846) gnttab_stubs.c: initialised mini-os gntmap
> (d846) allocate_ondemand(1, 1) returning 2300000
> (d846) allocate_ondemand(1, 1) returning 2301000
> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
> (XEN) p2m.c: dom1101: VMID pool exhausted
> (XEN) CPU0: Unexpected Trap: Data Abort
> <snip>
>
> I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VMID, which caused domain creation to fail, and then a NULL pointer dereference when trying to clean up the not-fully-created domain.
>
> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.

Sounds like a VMID resource leak. Check to see whether it is freed properly in domain_destroy().

~Andrew
* Re: Xen 4.7 crash
  2016-06-01 20:00 ` Andrew Cooper
@ 2016-06-01 20:45   ` Aaron Cornelius
  2016-06-01 21:24     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread

From: Aaron Cornelius @ 2016-06-01 20:45 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel

> -----Original Message-----
> From: Andrew Cooper [mailto:amc96@hermes.cam.ac.uk] On Behalf Of Andrew Cooper
> Sent: Wednesday, June 1, 2016 4:01 PM
> To: Aaron Cornelius <Aaron.Cornelius@dornerworks.com>; Xen-devel <xen-devel@lists.xenproject.org>
> Subject: Re: [Xen-devel] Xen 4.7 crash
>
> On 01/06/2016 20:54, Aaron Cornelius wrote:
> > I am doing some work with Xen 4.7 on the CubieTruck (ARM32). I've noticed some strange behavior after I create/destroy enough domains, so I put together a script to do the add/remove for me. For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the next one, and so on.
> >
> > After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33):
> >
> > (d846) Virtual -> physical offset = 3fc00000
> > (d846) Checking DTB at 023ff000...
> > (d846) [32;1mMirageOS booting...[0m
> > (d846) Initialising console ... done.
> > (d846) gnttab_stubs.c: initialised mini-os gntmap
> > (d846) allocate_ondemand(1, 1) returning 2300000
> > (d846) allocate_ondemand(1, 1) returning 2301000
> > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
> > (XEN) p2m.c: dom1101: VMID pool exhausted
> > (XEN) CPU0: Unexpected Trap: Data Abort
> <snip>
> >
> > I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VMID, which caused domain creation to fail, and then a NULL pointer dereference when trying to clean up the not-fully-created domain.
> >
> > However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.
>
> Sounds like a VMID resource leak. Check to see whether it is freed properly in domain_destroy().
>
> ~Andrew

That would be my assumption. But as far as I can tell, arch_domain_destroy() calls p2m_teardown(), which calls p2m_free_vmid(), and none of the functionality related to freeing a VMID appears to have changed in years.

- Aaron
* Re: Xen 4.7 crash
  2016-06-01 20:45 ` Aaron Cornelius
@ 2016-06-01 21:24   ` Andrew Cooper
  2016-06-01 22:18     ` Julien Grall
  0 siblings, 1 reply; 29+ messages in thread

From: Andrew Cooper @ 2016-06-01 21:24 UTC (permalink / raw)
To: Aaron Cornelius, Xen-devel

On 01/06/2016 21:45, Aaron Cornelius wrote:
>>
>>> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.
>>
>> Sounds like a VMID resource leak. Check to see whether it is freed properly in domain_destroy().
>>
>> ~Andrew
> That would be my assumption. But as far as I can tell, arch_domain_destroy() calls p2m_teardown(), which calls p2m_free_vmid(), and none of the functionality related to freeing a VMID appears to have changed in years.

The VMID handling looks suspect. It can be called repeatedly during domain destruction, and it will repeatedly clear the same bit out of the vmid_mask.

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 838d004..7adb39a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
     struct p2m_domain *p2m = &d->arch.p2m;
     spin_lock(&vmid_alloc_lock);
     if ( p2m->vmid != INVALID_VMID )
-        clear_bit(p2m->vmid, vmid_mask);
+    {
+        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
+        p2m->vmid = INVALID_VMID;
+    }
 
     spin_unlock(&vmid_alloc_lock);
 }

Having said that, I can't explain why that bug would result in the symptoms you are seeing. It is also possible that your issue is memory corruption from a separate source.

Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with vmid_alloc_lock held) to see which VMID is being allocated/freed? After the initial boot of the system, you should see the same VMID being allocated and freed for each of your domains.

~Andrew
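A side note on the diff above: Xen's ASSERT() normally compiles to nothing in non-debug builds, so wrapping test_and_clear_bit() inside ASSERT() would make the clear itself disappear from release builds. The sketch below reproduces that pitfall with a stand-in assertion macro (MY_ASSERT and the bit helper are hypothetical, not Xen code); the safer shape performs the side effect unconditionally and asserts on the captured result:

```c
#include <assert.h>

/* Stand-in for a debug-only assertion macro: with MY_NDEBUG defined,
 * the expression is never evaluated, just like assert() under NDEBUG. */
#define MY_NDEBUG 1
#ifdef MY_NDEBUG
#define MY_ASSERT(expr) ((void)0)
#else
#define MY_ASSERT(expr) assert(expr)
#endif

/* Simplified, non-atomic version of test_and_clear_bit(). */
static int test_and_clear(unsigned long *word, unsigned int bit)
{
    int was_set = (*word >> bit) & 1;

    *word &= ~(1UL << bit);
    return was_set;
}

/* Buggy: the clear vanishes entirely in "release" builds. */
static void free_id_buggy(unsigned long *mask, unsigned int id)
{
    MY_ASSERT(test_and_clear(mask, id));
    (void)mask;
    (void)id;
}

/* Safe: side effect happens first, the assertion checks the result. */
static void free_id_safe(unsigned long *mask, unsigned int id)
{
    int was_set = test_and_clear(mask, id);

    MY_ASSERT(was_set);
    (void)was_set;
}
```

With MY_NDEBUG defined, the buggy variant leaves the bit set, which is exactly the kind of silent behavioral difference between debug and release builds that is hard to track down later.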
* Re: Xen 4.7 crash
  2016-06-01 21:24 ` Andrew Cooper
@ 2016-06-01 22:18   ` Julien Grall
  2016-06-01 22:26     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread

From: Julien Grall @ 2016-06-01 22:18 UTC (permalink / raw)
To: Andrew Cooper, Aaron Cornelius, Xen-devel, Stefano Stabellini

Hi Andrew,

On 01/06/2016 22:24, Andrew Cooper wrote:
> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>
>>> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.
>>
>> Sounds like a VMID resource leak. Check to see whether it is freed properly in domain_destroy().
>>
>> ~Andrew
> That would be my assumption. But as far as I can tell, arch_domain_destroy() calls p2m_teardown(), which calls p2m_free_vmid(), and none of the functionality related to freeing a VMID appears to have changed in years.
>
> The VMID handling looks suspect. It can be called repeatedly during domain destruction, and it will repeatedly clear the same bit out of the vmid_mask.

Can you explain how p2m_free_vmid() can be called multiple times?

We have the following path:

    arch_domain_destroy -> p2m_teardown -> p2m_free_vmid

and I can find only 3 callers of arch_domain_destroy(), which should only be called once per domain.

If arch_domain_destroy() is called multiple times, p2m_free_vmid() will not be the only place where Xen will be in trouble.

> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 838d004..7adb39a 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>      struct p2m_domain *p2m = &d->arch.p2m;
>      spin_lock(&vmid_alloc_lock);
>      if ( p2m->vmid != INVALID_VMID )
> -        clear_bit(p2m->vmid, vmid_mask);
> +    {
> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
> +        p2m->vmid = INVALID_VMID;
> +    }
>
>      spin_unlock(&vmid_alloc_lock);
>  }
>
> Having said that, I can't explain why that bug would result in the symptoms you are seeing. It is also possible that your issue is memory corruption from a separate source.
>
> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with vmid_alloc_lock held) to see which VMID is being allocated/freed? After the initial boot of the system, you should see the same VMID being allocated and freed for each of your domains.

Looking quickly at the log, the domain is dom1101. However, the maximum number of VMIDs supported is 256, so the exhaustion might be a race somewhere.

I would be interested to get a reproducer. I wrote a script to cycle a domain (create/destroy) in a loop, and I have not seen any issue after 1200 cycles (and counting).

Cheers,

--
Julien Grall
* Re: Xen 4.7 crash
  2016-06-01 22:18 ` Julien Grall
@ 2016-06-01 22:26   ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread

From: Andrew Cooper @ 2016-06-01 22:26 UTC (permalink / raw)
To: Julien Grall, Aaron Cornelius, Xen-devel, Stefano Stabellini

On 01/06/2016 23:18, Julien Grall wrote:
> Hi Andrew,
>
> On 01/06/2016 22:24, Andrew Cooper wrote:
>> On 01/06/2016 21:45, Aaron Cornelius wrote:
>>>>
>>>>> However, since I only have 1 domain active at a time, I'm not sure why I should run out of VMIDs.
>>>>
>>>> Sounds like a VMID resource leak. Check to see whether it is freed properly in domain_destroy().
>>>>
>>>> ~Andrew
>>> That would be my assumption. But as far as I can tell, arch_domain_destroy() calls p2m_teardown(), which calls p2m_free_vmid(), and none of the functionality related to freeing a VMID appears to have changed in years.
>>
>> The VMID handling looks suspect. It can be called repeatedly during domain destruction, and it will repeatedly clear the same bit out of the vmid_mask.
>
> Can you explain how p2m_free_vmid() can be called multiple times?
>
> We have the following path:
>
>     arch_domain_destroy -> p2m_teardown -> p2m_free_vmid
>
> and I can find only 3 callers of arch_domain_destroy(), which should only be called once per domain.
>
> If arch_domain_destroy() is called multiple times, p2m_free_vmid() will not be the only place where Xen will be in trouble.

You are correct. I was getting my phases of domain destruction mixed up. arch_domain_destroy() is strictly called once, after the RCU references on the domain have dropped to 0.

>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 838d004..7adb39a 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>>      struct p2m_domain *p2m = &d->arch.p2m;
>>      spin_lock(&vmid_alloc_lock);
>>      if ( p2m->vmid != INVALID_VMID )
>> -        clear_bit(p2m->vmid, vmid_mask);
>> +    {
>> +        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
>> +        p2m->vmid = INVALID_VMID;
>> +    }
>>
>>      spin_unlock(&vmid_alloc_lock);
>>  }
>>
>> Having said that, I can't explain why that bug would result in the symptoms you are seeing. It is also possible that your issue is memory corruption from a separate source.
>>
>> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with vmid_alloc_lock held) to see which VMID is being allocated/freed? After the initial boot of the system, you should see the same VMID being allocated and freed for each of your domains.
>
> Looking quickly at the log, the domain is dom1101. However, the maximum number of VMIDs supported is 256, so the exhaustion might be a race somewhere.
>
> I would be interested to get a reproducer. I wrote a script to cycle a domain (create/destroy) in a loop, and I have not seen any issue after 1200 cycles (and counting).

Given that my previous thought was wrong, I am going to suggest that some other form of memory corruption is a more likely cause.

~Andrew
* Re: Xen 4.7 crash
  2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
  2016-06-01 20:00 ` Andrew Cooper
@ 2016-06-01 21:35 ` Andrew Cooper
  2016-06-01 22:24   ` Julien Grall
  2016-06-01 22:35 ` Julien Grall
  2 siblings, 1 reply; 29+ messages in thread

From: Andrew Cooper @ 2016-06-01 21:35 UTC (permalink / raw)
To: Aaron Cornelius, Xen-devel

On 01/06/2016 20:54, Aaron Cornelius wrote:
> <snip>
> (XEN) Xen call trace:
> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) CPU0: Unexpected Trap: Data Abort
> (XEN)
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...

As for this specific crash itself: in the case of an early error path, p2m->root can be NULL in p2m_teardown(), in which case free_domheap_pages() will fall over in a heap. This patch should resolve it.

@@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
     while ( (pg = page_list_remove_head(&p2m->pages)) )
         free_domheap_page(pg);
 
-    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
+    if ( p2m->root )
+        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
 
     p2m->root = NULL;

I would be tempted to suggest making free_domheap_pages() tolerate NULL pointers, except that would only be a safe thing to do if we assert that the order parameter is 0, which won't help this specific case.

~Andrew
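The failure mode described here, a destructor reached through an early error path before the object was fully constructed, is easy to model in miniature. A hypothetical stand-alone sketch (not the Xen structures; the mock init fails before allocating the root, the way p2m_init() fails when no VMID is available, and the teardown carries the same guard the patch above adds):

```c
#include <assert.h>
#include <stdlib.h>

struct p2m_mock {
    void *root;                    /* NULL until the root tables exist */
};

/* Mock init: with vmid_ok == 0 it fails before allocating the root,
 * leaving a partially constructed object behind. */
static int p2m_init_mock(struct p2m_mock *p2m, int vmid_ok)
{
    p2m->root = NULL;
    if (!vmid_ok)
        return -1;                 /* early error path: root stays NULL */

    p2m->root = malloc(4096);
    return p2m->root ? 0 : -1;
}

/* Teardown must tolerate a partially constructed object: freeing a
 * NULL root unguarded is exactly the crash seen in the trace above. */
static void p2m_teardown_mock(struct p2m_mock *p2m)
{
    if (p2m->root)
        free(p2m->root);
    p2m->root = NULL;
}
```

The general rule this illustrates: every teardown routine reachable from a constructor's error path has to cope with every field still being in its initial state.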
* Re: Xen 4.7 crash
  2016-06-01 21:35 ` Andrew Cooper
@ 2016-06-01 22:24   ` Julien Grall
  2016-06-01 22:31     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread

From: Julien Grall @ 2016-06-01 22:24 UTC (permalink / raw)
To: Andrew Cooper, Aaron Cornelius, Xen-devel, Stefano Stabellini

Hi,

On 01/06/2016 22:35, Andrew Cooper wrote:
> On 01/06/2016 20:54, Aaron Cornelius wrote:
>> <snip>
>> (XEN) Xen call trace:
>> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
>> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
>> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
>> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
>> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
>> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
>> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) CPU0: Unexpected Trap: Data Abort
>> (XEN)
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>
> As for this specific crash itself: in the case of an early error path, p2m->root can be NULL in p2m_teardown(), in which case free_domheap_pages() will fall over in a heap. This patch should resolve it.

Good catch!

>
> @@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
>      while ( (pg = page_list_remove_head(&p2m->pages)) )
>          free_domheap_page(pg);
>
> -    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
> +    if ( p2m->root )
> +        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>
>      p2m->root = NULL;
>
> I would be tempted to suggest making free_domheap_pages() tolerate NULL pointers, except that would only be a safe thing to do if we assert that the order parameter is 0, which won't help this specific case.

free_xenheap_pages() already tolerates NULL (even with an order != 0). Is there any reason not to do the same for free_domheap_pages()?

Regards,

--
Julien Grall
* Re: Xen 4.7 crash
  2016-06-01 22:24 ` Julien Grall
@ 2016-06-01 22:31   ` Andrew Cooper
  2016-06-02  8:47     ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread

From: Andrew Cooper @ 2016-06-01 22:31 UTC (permalink / raw)
To: Julien Grall, Aaron Cornelius, Xen-devel, Stefano Stabellini

On 01/06/2016 23:24, Julien Grall wrote:
> Hi,
>
> On 01/06/2016 22:35, Andrew Cooper wrote:
>> On 01/06/2016 20:54, Aaron Cornelius wrote:
>>> <snip>
>>> (XEN) Xen call trace:
>>> (XEN)    [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
>>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
>>> (XEN)    [<0025b0cc>] p2m_teardown+0xa0/0x108
>>> (XEN)    [<0024f668>] arch_domain_destroy+0x20/0x50
>>> (XEN)    [<0024f8f0>] arch_domain_create+0x258/0x284
>>> (XEN)    [<0020854c>] domain_create+0x2dc/0x510
>>> (XEN)    [<00206d6c>] do_domctl+0x5b4/0x1928
>>> (XEN)    [<00260130>] do_trap_hypervisor+0x1170/0x15b0
>>> (XEN)    [<00263b10>] entry.o#return_from_trap+0/0x4
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) CPU0: Unexpected Trap: Data Abort
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>
>> As for this specific crash itself: in the case of an early error path, p2m->root can be NULL in p2m_teardown(), in which case free_domheap_pages() will fall over in a heap. This patch should resolve it.
>
> Good catch!
>
>>
>> @@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
>>      while ( (pg = page_list_remove_head(&p2m->pages)) )
>>          free_domheap_page(pg);
>>
>> -    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>> +    if ( p2m->root )
>> +        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>>
>>      p2m->root = NULL;
>>
>> I would be tempted to suggest making free_domheap_pages() tolerate NULL pointers, except that would only be a safe thing to do if we assert that the order parameter is 0, which won't help this specific case.
>
> free_xenheap_pages() already tolerates NULL (even with an order != 0). Is there any reason not to do the same for free_domheap_pages()?

The xenheap allocation functions deal in terms of plain virtual addresses, while the domheap functions deal in terms of struct page_info *.

Overall, this means that the domheap functions have a more restricted input/output set than their xenheap variants.

As there is already precedent with xenheap, making domheap tolerate NULL is probably fine, and indeed the preferred course of action.

~Andrew
* Re: Xen 4.7 crash
  2016-06-01 22:31 ` Andrew Cooper
@ 2016-06-02  8:47   ` Jan Beulich
  2016-06-02  8:53     ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread

From: Jan Beulich @ 2016-06-02 8:47 UTC (permalink / raw)
To: Julien Grall, Andrew Cooper
Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
> On 01/06/2016 23:24, Julien Grall wrote:
>> free_xenheap_pages() already tolerates NULL (even with an order != 0). Is there any reason not to do the same for free_domheap_pages()?
>
> The xenheap allocation functions deal in terms of plain virtual addresses, while the domheap functions deal in terms of struct page_info *.
>
> Overall, this means that the domheap functions have a more restricted input/output set than their xenheap variants.
>
> As there is already precedent with xenheap, making domheap tolerate NULL is probably fine, and indeed the preferred course of action.

I disagree, for the very reason you mention above.

Jan
* Re: Xen 4.7 crash
  2016-06-02  8:47 ` Jan Beulich
@ 2016-06-02  8:53   ` Andrew Cooper
  2016-06-02  9:07     ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread

From: Andrew Cooper @ 2016-06-02 8:53 UTC (permalink / raw)
To: Jan Beulich, Julien Grall
Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

On 02/06/16 09:47, Jan Beulich wrote:
>>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
>> On 01/06/2016 23:24, Julien Grall wrote:
>>> free_xenheap_pages() already tolerates NULL (even with an order != 0). Is there any reason not to do the same for free_domheap_pages()?
>> The xenheap allocation functions deal in terms of plain virtual addresses, while the domheap functions deal in terms of struct page_info *.
>>
>> Overall, this means that the domheap functions have a more restricted input/output set than their xenheap variants.
>>
>> As there is already precedent with xenheap, making domheap tolerate NULL is probably fine, and indeed the preferred course of action.
> I disagree, for the very reason you mention above.

Which? Dealing with the struct page_info pointer? It's still just a pointer, whose value is expected to be NULL if not allocated.

~Andrew
* Re: Xen 4.7 crash
  2016-06-02  8:53 ` Andrew Cooper
@ 2016-06-02  9:07   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread

From: Jan Beulich @ 2016-06-02 9:07 UTC (permalink / raw)
To: Julien Grall, Andrew Cooper
Cc: Aaron Cornelius, Stefano Stabellini, Xen-devel

>>> On 02.06.16 at 10:53, <andrew.cooper3@citrix.com> wrote:
> On 02/06/16 09:47, Jan Beulich wrote:
>>>>> On 02.06.16 at 00:31, <andrew.cooper3@citrix.com> wrote:
>>> On 01/06/2016 23:24, Julien Grall wrote:
>>>> free_xenheap_pages() already tolerates NULL (even with an order != 0). Is there any reason not to do the same for free_domheap_pages()?
>>> The xenheap allocation functions deal in terms of plain virtual addresses, while the domheap functions deal in terms of struct page_info *.
>>>
>>> Overall, this means that the domheap functions have a more restricted input/output set than their xenheap variants.
>>>
>>> As there is already precedent with xenheap, making domheap tolerate NULL is probably fine, and indeed the preferred course of action.
>> I disagree, for the very reason you mention above.
>
> Which? Dealing with the struct page_info pointer? It's still just a pointer, whose value is expected to be NULL if not allocated.

Yes, but it still makes the interface not malloc()-like, unlike - as you say yourself - e.g. the xenheap one. Just look at Linux for comparison: __free_pages() also doesn't accept NULL, while free_pages() does. I think we should stick to that distinction.

Jan
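The convention Jan points at can be summarized in two toy wrappers: an address-based free that, like free_xenheap_pages() or free(3), treats NULL as a no-op, and a descriptor-based free that, like free_domheap_pages() or Linux's __free_pages(), requires a valid argument and leaves any NULL check to the caller. This is an illustrative sketch, not the Xen implementation (struct page_info here is a stand-in):

```c
#include <assert.h>
#include <stdlib.h>

struct page_info {
    void *mem;          /* stand-in for the real page descriptor */
};

/* Descriptor-based free, in the spirit of free_domheap_pages() or
 * __free_pages(): passing NULL is a caller bug, so callers on error
 * paths must guard: if ( pg ) free_desc(pg); */
static void free_desc(struct page_info *pg)
{
    assert(pg != NULL);
    free(pg->mem);
    pg->mem = NULL;
}

/* Address-based free, in the spirit of free_xenheap_pages(), free_pages()
 * or free(3): NULL is an accepted no-op, so no guard is needed.
 * Returns 1 if something was freed, 0 for the NULL no-op (the return
 * value is only for illustration; the real APIs return void). */
static int free_addr(void *va)
{
    if (va == NULL)
        return 0;
    free(va);
    return 1;
}
```

The trade-off is the one argued in the thread: NULL-tolerant frees remove a class of error-path crashes like the p2m->root one, at the cost of a less strict interface contract.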
* Re: Xen 4.7 crash 2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius 2016-06-01 20:00 ` Andrew Cooper 2016-06-01 21:35 ` Andrew Cooper @ 2016-06-01 22:35 ` Julien Grall 2016-06-02 1:32 ` Aaron Cornelius 2 siblings, 1 reply; 29+ messages in thread From: Julien Grall @ 2016-06-01 22:35 UTC (permalink / raw) To: Aaron Cornelius, Xen-devel Hello Aaron, On 01/06/2016 20:54, Aaron Cornelius wrote: > I am doing some work with Xen 4.7 on the cubietruck (ARM32). I've noticed some strange behavior after I create/destroy enough domains and put together a script to do the add/remove for me. For this particular test I am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new one, and so on. > > After running this for a while, I get the following error (with version 8478c9409a2c6726208e8dbc9f3e455b76725a33): > > (d846) Virtual -> physical offset = 3fc00000 > (d846) Checking DTB at 023ff000... > (d846) [32;1mMirageOS booting...[0m > (d846) Initialising console ... done. 
> (d846) gnttab_stubs.c: initialised mini-os gntmap > (d846) allocate_ondemand(1, 1) returning 2300000 > (d846) allocate_ondemand(1, 1) returning 2301000 > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0) > (XEN) p2m.c: dom1101: VMID pool exhausted > (XEN) CPU0: Unexpected Trap: Data Abort > (XEN) ----[ Xen-4.7.0-rc arm32 debug=y Not tainted ]---- > (XEN) CPU: 0 > (XEN) PC: 0021fdd4 free_domheap_pages+0x1c/0x324 > (XEN) CPSR: 6001011a MODE:Hypervisor > (XEN) R0: 00000000 R1: 00000001 R2: 00000003 R3: 00304320 > (XEN) R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100 > (XEN) R8: 41c57180 R9: 43fdfe60 R10:00000000 R11:43fdfd5c R12:00000000 > (XEN) HYP: SP: 43fdfd2c LR: 0025b0cc > (XEN) > (XEN) VTCR_EL2: 80003558 > (XEN) VTTBR_EL2: 00010000bfb0e000 > (XEN) > (XEN) SCTLR_EL2: 30cd187f > (XEN) HCR_EL2: 000000000038663f > (XEN) TTBR0_EL2: 00000000bfafc000 > (XEN) > (XEN) ESR_EL2: 94000006 > (XEN) HPFAR_EL2: 000000000001c810 > (XEN) HDFAR: 00000014 > (XEN) HIFAR: 84e37182 > (XEN) > (XEN) Xen stack trace from sp=43fdfd2c: > (XEN) 002cf1b7 43fdfd64 41c57000 00000100 41c57000 41c57188 00200200 00100100 > (XEN) 41c57180 43fdfe60 00000000 43fdfd7c 0025b0cc 41c57000 fffffff0 43fdfe60 > (XEN) 0000001f 0000044d 43fdfe60 43fdfd8c 0024f668 41c57000 fffffff0 43fdfda4 > (XEN) 0024f8f0 41c57000 00000000 00000000 0000001f 43fdfddc 0020854c 43fdfddc > (XEN) 00000000 cccccccd 00304600 002822bc 00000000 b6f20004 0000044d 00304600 > (XEN) 00304320 d767a000 00000000 43fdfeec 00206d6c 43fdfe6c 00218f8c 00000000 > (XEN) 00000007 43fdfe30 43fdfe34 00000000 43fdfe20 00000002 43fdfe48 43fdfe78 > (XEN) 00000000 00000000 00000000 00007622 00002b0e 40023000 00000000 43fdfec8 > (XEN) 00000002 43fdfebc 00218f8c 00000001 0000000b 0000ffff b6eba880 0000000b > (XEN) 5abab87d f34aab2c 6adc50b8 e1713cd0 00000000 00000000 00000000 00000000 > (XEN) b6eba8d8 00000000 50043f00 b6eb5038 b6effba8 0000003e 
00000000 000c3034 > (XEN) 000b9cb8 000bda30 000bda30 00000000 b6eba56c 0000003e b6effba8 b6effdb0 > (XEN) be9558d4 000000d0 be9558d4 00000071 b6effba8 b6effd6c b6ed6fb4 4a000ea1 > (XEN) c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000 00000000 43fdff54 > (XEN) 00260130 00000000 43fdfefc 43fdff1c 200f019a 400238f4 00000004 00000004 > (XEN) 002c9f00 00000000 00304600 c094c240 00000000 00305000 be9557a0 d767a000 > (XEN) 00000000 43fdff44 00000000 c094c240 00000000 00305000 be9557c8 d767a000 > (XEN) 00000000 43fdff58 00263b10 b6f20004 00000000 00000000 00000000 00000000 > (XEN) c094c240 00000000 00305000 be9557c8 d767a000 00000000 00000001 00000024 > (XEN) ffffffff b691ab34 c01077f8 60010013 00000000 be9557c4 c0a38600 c010c400 > (XEN) Xen call trace: > (XEN) [<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC) > (XEN) [<0025b0cc>] p2m_teardown+0xa0/0x108 (LR) > (XEN) [<0025b0cc>] p2m_teardown+0xa0/0x108 > (XEN) [<0024f668>] arch_domain_destroy+0x20/0x50 > (XEN) [<0024f8f0>] arch_domain_create+0x258/0x284 > (XEN) [<0020854c>] domain_create+0x2dc/0x510 > (XEN) [<00206d6c>] do_domctl+0x5b4/0x1928 > (XEN) [<00260130>] do_trap_hypervisor+0x1170/0x15b0 > (XEN) [<00263b10>] entry.o#return_from_trap+0/0x4 > (XEN) > (XEN) > (XEN) **************************************** > (XEN) Panic on CPU 0: > (XEN) CPU0: Unexpected Trap: Data Abort > (XEN) > (XEN) **************************************** > (XEN) > (XEN) Reboot in five seconds... > > I'm not 100% sure, from the "VMID pool exhausted" message it would appear that the p2m_init() function failed to allocate a VM ID, which caused domain creation to fail, and the NULL pointer dereference when trying to clean up the not-fully-created domain. > > However, since I only have 1 domain active at a time, I'm not sure why I should run out of VM IDs. arch_domain_destroy (and p2m_teardown) is only called when all the reference on the given domain are released. It may take a while to release all the resources. 
So if you launch a domain at the same time as you destroy the previous guest, you will have more than 1 domain active. Can you detail how you create/destroy guests? Regards, -- Julien Grall _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 29+ messages in thread
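Julien's point, that a "destroyed" guest keeps its VMID until the last reference to it is dropped, can be modelled with a short sketch. The code below is a hypothetical stand-in for Xen's ARM VMID allocator (the real one lives in xen/arch/arm/p2m.c and is not reproduced here); it shows how one leaked reference per create/destroy cycle drains a 256-entry VMID pool even though only one guest is ever "active" from the toolstack's point of view.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an 8-bit VMID space: at most 256 p2m
 * instances can coexist.  A VMID only returns to the pool when the
 * last reference to the domain is dropped, not when the toolstack's
 * destroy call returns. */
#define MAX_VMID 256

static uint8_t vmid_in_use[MAX_VMID];

/* Returns a free VMID, or -1 when the pool is exhausted (the
 * condition behind the "VMID pool exhausted" message in the log). */
static int vmid_alloc(void)
{
    for (int i = 0; i < MAX_VMID; i++) {
        if (!vmid_in_use[i]) {
            vmid_in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

static void vmid_free(int vmid)
{
    vmid_in_use[vmid] = 0;
}

/* Run create/destroy cycles.  With leak == 0 every cycle frees its
 * VMID and the pool never empties; with leak != 0 a lingering
 * reference (e.g. a stray grant mapping in dom0) keeps each dead
 * domain alive, and allocations start failing after 256 cycles. */
static int failed_allocs(int cycles, int leak)
{
    int failures = 0;
    memset(vmid_in_use, 0, sizeof(vmid_in_use));
    for (int n = 0; n < cycles; n++) {
        int vmid = vmid_alloc();
        if (vmid < 0)
            failures++;
        else if (!leak)
            vmid_free(vmid);
    }
    return failures;
}
```

Under this model, 1000 leaky cycles produce exactly 1000 - 256 = 744 failed allocations, which matches the thread's observation that the crash only appears "after running this for a while".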
* Re: Xen 4.7 crash 2016-06-01 22:35 ` Julien Grall @ 2016-06-02 1:32 ` Aaron Cornelius 2016-06-02 8:49 ` Jan Beulich 2016-06-02 9:07 ` Julien Grall 0 siblings, 2 replies; 29+ messages in thread From: Aaron Cornelius @ 2016-06-02 1:32 UTC (permalink / raw) To: Xen-devel, Julien Grall On 6/1/2016 6:35 PM, Julien Grall wrote: > Hello Aaron, > > On 01/06/2016 20:54, Aaron Cornelius wrote: <snip> >> I'm not 100% sure, from the "VMID pool exhausted" message it would >> appear that the p2m_init() function failed to allocate a VM ID, which >> caused domain creation to fail, and the NULL pointer dereference when >> trying to clean up the not-fully-created domain. >> >> However, since I only have 1 domain active at a time, I'm not sure why >> I should run out of VM IDs. > > arch_domain_destroy (and p2m_teardown) is only called when all the > reference on the given domain are released. > > It may take a while to release all the resources. So if you launch the > domain as the same time as you destroy the previous guest. You will have > more than 1 domain active. > > Can you detail how you create/destroy guest? > This is with a custom application; we use the libxl APIs to interact with Xen. Domains are created using the libxl_domain_create_new() function, and domains are destroyed using the libxl_domain_destroy() function. The test in this case creates a domain, waits a minute, then deletes/creates the next domain, waits a minute, and so on. So I wouldn't be surprised to see the VMID occasionally indicate there are 2 active domains, since there could be one being created and one being destroyed in a very short time. However, I wouldn't expect to ever have 256 domains. The CubieTruck only has 2GB of RAM, and I allocate 512MB for dom0, which means that only 48 of the Mirage domains (with 32MB of RAM each) would fit at the same time anyway. And that doesn't account for the various inter-domain resources or the RAM used by Xen itself. 
If the p2m_teardown() function checked for NULL it would prevent the crash, but I suspect Xen would be just as broken, since all of my resources would have leaked away. More broken, in fact: if the board reboots, at least the applications will restart and domains can be recreated. It certainly appears that some resources are leaking when domains are deleted (possibly only on the ARM or ARM32 platforms). We will try to add some debug prints and see if we can discover exactly what is going on. - Aaron Cornelius
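The NULL check Aaron suggests is a standard defensive-teardown pattern. The sketch below is hypothetical (struct p2m_domain and free_domheap_pages_stub are illustrative stand-ins, not Xen's real definitions): teardown tolerates a partially initialised p2m instead of dereferencing a NULL root pointer, which is consistent with the data abort at the tiny address 0x14 in the trace above.

```c
#include <stddef.h>

/* Stand-in for the per-domain p2m state; in Xen the root table is a
 * page allocated by p2m_init(), which here is modelled as a plain
 * pointer that may still be NULL if init failed early (e.g. because
 * no VMID could be allocated). */
struct p2m_domain {
    void *root;   /* NULL until initialisation completed */
    int vmid;
};

static int freed_pages;

/* Stand-in for Xen's free_domheap_pages(); only counts calls. */
static void free_domheap_pages_stub(void *pg)
{
    (void)pg;
    freed_pages++;
}

/* Teardown that guards against a half-constructed domain: if init
 * never got as far as allocating the root table, there is nothing to
 * free, and dereferencing would reproduce the crash in the log. */
static void p2m_teardown_sketch(struct p2m_domain *p2m)
{
    if (!p2m || !p2m->root)
        return;
    free_domheap_pages_stub(p2m->root);
    p2m->root = NULL;
}
```

As Aaron notes, this only turns a hypervisor panic into a silent leak; the guard is worth having, but the leaked references still need to be found.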
* Re: Xen 4.7 crash 2016-06-02 1:32 ` Aaron Cornelius @ 2016-06-02 8:49 ` Jan Beulich 2016-06-02 9:07 ` Julien Grall 1 sibling, 0 replies; 29+ messages in thread From: Jan Beulich @ 2016-06-02 8:49 UTC (permalink / raw) To: Aaron Cornelius; +Cc: Xen-devel, Julien Grall >>> On 02.06.16 at 03:32, <aaron.cornelius@dornerworks.com> wrote: > The test in this case creates a domain, waits a minute, then > deletes/creates the next domain, waits a minute, and so on. So I > wouldn't be surprised to see the VMID occasionally indicate there are 2 > active domains since there could be one being created and one being > destroyed in a very short time. However, I wouldn't expect to ever have > 256 domains. But - did you check? Things may pile up over time... Jan
* Re: Xen 4.7 crash 2016-06-02 1:32 ` Aaron Cornelius 2016-06-02 8:49 ` Jan Beulich @ 2016-06-02 9:07 ` Julien Grall 2016-06-06 13:58 ` Aaron Cornelius 1 sibling, 1 reply; 29+ messages in thread From: Julien Grall @ 2016-06-02 9:07 UTC (permalink / raw) To: Aaron Cornelius, Xen-devel, Jan Beulich Hello Aaron, On 02/06/2016 02:32, Aaron Cornelius wrote: > This is with a custom application, we use the libxl APIs to interact > with Xen. Domains are created using the libxl_domain_create_new() > function, and domains are destroyed using the libxl_domain_destroy() > function. > > The test in this case creates a domain, waits a minute, then > deletes/creates the next domain, waits a minute, and so on. So I > wouldn't be surprised to see the VMID occasionally indicate there are 2 > active domains since there could be one being created and one being > destroyed in a very short time. However, I wouldn't expect to ever have > 256 domains. Your log has: (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0) Which suggests that some grants are still mapped in DOM0. > > The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which > means that only 48 of the the Mirage domains (with 32MB of RAM) would > work at the same time anyway. Which doesn't account for the various > inter-domain resources or the RAM used by Xen itself. All the pages that belong to the domain could have been freed except the ones referenced by DOM0, so the footprint of this domain will be limited by then. I would recommend you check how many domains are running at this time and whether DOM0 has effectively released all the resources. > If the p2m_teardown() function checked for NULL it would prevent the > crash, but I suspect Xen would be just as broken since all of my > resources have leaked away. 
More broken in fact, since if the board > reboots at least the applications will restart and domains can be > recreated. > > It certainly appears that some resources are leaking when domains are > deleted (possibly only on the ARM or ARM32 platforms). We will try to > add some debug prints and see if we can discover exactly what is going on. The leakage could also happen from DOM0. FWIW, I have been able to cycle 2000 guests overnight on an ARM platform. Regards, -- Julien Grall
* Re: Xen 4.7 crash 2016-06-02 9:07 ` Julien Grall @ 2016-06-06 13:58 ` Aaron Cornelius 2016-06-06 14:05 ` Julien Grall 0 siblings, 1 reply; 29+ messages in thread From: Aaron Cornelius @ 2016-06-06 13:58 UTC (permalink / raw) To: Julien Grall, Xen-devel, Jan Beulich On 6/2/2016 5:07 AM, Julien Grall wrote: > Hello Aaron, > > On 02/06/2016 02:32, Aaron Cornelius wrote: >> This is with a custom application, we use the libxl APIs to interact >> with Xen. Domains are created using the libxl_domain_create_new() >> function, and domains are destroyed using the libxl_domain_destroy() >> function. >> >> The test in this case creates a domain, waits a minute, then >> deletes/creates the next domain, waits a minute, and so on. So I >> wouldn't be surprised to see the VMID occasionally indicate there are 2 >> active domains since there could be one being created and one being >> destroyed in a very short time. However, I wouldn't expect to ever have >> 256 domains. > > Your log has: > > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0) > > Which suggest that some grants are still mapped in DOM0. > >> >> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which >> means that only 48 of the the Mirage domains (with 32MB of RAM) would >> work at the same time anyway. Which doesn't account for the various >> inter-domain resources or the RAM used by Xen itself. > > All the pages who belongs to the domain could have been freed except the > one referenced by DOM0. So the footprint of this domain will be limited > at the time. > > I would recommend you to check how many domain are running at this time > and if DOM0 effectively released all the resources. > >> If the p2m_teardown() function checked for NULL it would prevent the >> crash, but I suspect Xen would be just as broken since all of my >> resources have leaked away. 
More broken in fact, since if the board >> reboots at least the applications will restart and domains can be >> recreated. >> >> It certainly appears that some resources are leaking when domains are >> deleted (possibly only on the ARM or ARM32 platforms). We will try to >> add some debug prints and see if we can discover exactly what is going on. > > The leakage could also happen from DOM0. FWIW, I have been able to cycle > 2000 guests over the night on an ARM platforms. > We've done some more testing regarding this issue. Further testing shows that it doesn't matter if we delete the vchans before the domains are deleted. Those appear to be cleaned up correctly when the domain is destroyed. What does stop this issue from happening (using the same version of Xen that the issue was detected on) is removing any non-standard xenstore references before deleting the domain. In this case our application grants created domains permissions on non-standard xenstore paths. Making sure to remove those domain permissions before deleting the domain prevents this issue from happening. It does not appear to matter if we delete the standard domain xenstore path (/local/domain/<id>), since libxl handles removing this path when the domain is destroyed. Based on this I would guess that the xenstore is hanging onto the VMID. - Aaron Cornelius
* Re: Xen 4.7 crash 2016-06-06 13:58 ` Aaron Cornelius @ 2016-06-06 14:05 ` Julien Grall 2016-06-06 14:19 ` Wei Liu 0 siblings, 1 reply; 29+ messages in thread From: Julien Grall @ 2016-06-06 14:05 UTC (permalink / raw) To: Aaron Cornelius, Xen-devel, Jan Beulich Cc: Ian Jackson, Stefano Stabellini, Wei Liu (CC Ian, Stefano and Wei) Hello Aaron, On 06/06/16 14:58, Aaron Cornelius wrote: > On 6/2/2016 5:07 AM, Julien Grall wrote: >> Hello Aaron, >> >> On 02/06/2016 02:32, Aaron Cornelius wrote: >>> This is with a custom application, we use the libxl APIs to interact >>> with Xen. Domains are created using the libxl_domain_create_new() >>> function, and domains are destroyed using the libxl_domain_destroy() >>> function. >>> >>> The test in this case creates a domain, waits a minute, then >>> deletes/creates the next domain, waits a minute, and so on. So I >>> wouldn't be surprised to see the VMID occasionally indicate there are 2 >>> active domains since there could be one being created and one being >>> destroyed in a very short time. However, I wouldn't expect to ever have >>> 256 domains. >> >> Your log has: >> >> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) >> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) >> dom:(0) >> >> Which suggest that some grants are still mapped in DOM0. >> >>> >>> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which >>> means that only 48 of the the Mirage domains (with 32MB of RAM) would >>> work at the same time anyway. Which doesn't account for the various >>> inter-domain resources or the RAM used by Xen itself. >> >> All the pages who belongs to the domain could have been freed except the >> one referenced by DOM0. So the footprint of this domain will be limited >> at the time. >> >> I would recommend you to check how many domain are running at this time >> and if DOM0 effectively released all the resources. 
>> >>> If the p2m_teardown() function checked for NULL it would prevent the >>> crash, but I suspect Xen would be just as broken since all of my >>> resources have leaked away. More broken in fact, since if the board >>> reboots at least the applications will restart and domains can be >>> recreated. >>> >>> It certainly appears that some resources are leaking when domains are >>> deleted (possibly only on the ARM or ARM32 platforms). We will try to >>> add some debug prints and see if we can discover exactly what is >>> going on. >> >> The leakage could also happen from DOM0. FWIW, I have been able to cycle >> 2000 guests over the night on an ARM platforms. >> > > We've done some more testing regarding this issue. And further testing > shows that it doesn't matter if we delete the vchans before the domains > are deleted. Those appear to be cleaned up correctly when the domain is > destroyed. > > What does stop this issue from happening (using the same version of Xen > that the issue was detected on) is removing any non-standard xenstore > references before deleting the domain. In this case our application > allocates permissions for created domains to non-standard xenstore > paths. Making sure to remove those domain permissions before deleting > the domain prevents this issue from happening. I am not sure to understand what you mean here. Could you give a quick example? > > It does not appear to matter if we delete the standard domain xenstore > path (/local/domain/<id>) since libxl handles removing this path when > the domain is destroyed. > > Based on this I would guess that the xenstore is hanging onto the VMID. Regards, -- Julien Grall _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Xen 4.7 crash 2016-06-06 14:05 ` Julien Grall @ 2016-06-06 14:19 ` Wei Liu 2016-06-06 15:02 ` Aaron Cornelius 0 siblings, 1 reply; 29+ messages in thread From: Wei Liu @ 2016-06-06 14:19 UTC (permalink / raw) To: Julien Grall Cc: Stefano Stabellini, Wei Liu, Aaron Cornelius, Ian Jackson, Jan Beulich, Xen-devel On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote: > (CC Ian, Stefano and Wei) > > Hello Aaron, > > On 06/06/16 14:58, Aaron Cornelius wrote: > >On 6/2/2016 5:07 AM, Julien Grall wrote: > >>Hello Aaron, > >> > >>On 02/06/2016 02:32, Aaron Cornelius wrote: > >>>This is with a custom application, we use the libxl APIs to interact > >>>with Xen. Domains are created using the libxl_domain_create_new() > >>>function, and domains are destroyed using the libxl_domain_destroy() > >>>function. > >>> > >>>The test in this case creates a domain, waits a minute, then > >>>deletes/creates the next domain, waits a minute, and so on. So I > >>>wouldn't be surprised to see the VMID occasionally indicate there are 2 > >>>active domains since there could be one being created and one being > >>>destroyed in a very short time. However, I wouldn't expect to ever have > >>>256 domains. > >> > >>Your log has: > >> > >>(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) > >>(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) > >>dom:(0) > >> > >>Which suggest that some grants are still mapped in DOM0. > >> > >>> > >>>The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which > >>>means that only 48 of the the Mirage domains (with 32MB of RAM) would > >>>work at the same time anyway. Which doesn't account for the various > >>>inter-domain resources or the RAM used by Xen itself. > >> > >>All the pages who belongs to the domain could have been freed except the > >>one referenced by DOM0. So the footprint of this domain will be limited > >>at the time. 
> >> > >>I would recommend you to check how many domain are running at this time > >>and if DOM0 effectively released all the resources. > >> > >>>If the p2m_teardown() function checked for NULL it would prevent the > >>>crash, but I suspect Xen would be just as broken since all of my > >>>resources have leaked away. More broken in fact, since if the board > >>>reboots at least the applications will restart and domains can be > >>>recreated. > >>> > >>>It certainly appears that some resources are leaking when domains are > >>>deleted (possibly only on the ARM or ARM32 platforms). We will try to > >>>add some debug prints and see if we can discover exactly what is > >>>going on. > >> > >>The leakage could also happen from DOM0. FWIW, I have been able to cycle > >>2000 guests over the night on an ARM platforms. > >> > > > >We've done some more testing regarding this issue. And further testing > >shows that it doesn't matter if we delete the vchans before the domains > >are deleted. Those appear to be cleaned up correctly when the domain is > >destroyed. > > > >What does stop this issue from happening (using the same version of Xen > >that the issue was detected on) is removing any non-standard xenstore > >references before deleting the domain. In this case our application > >allocates permissions for created domains to non-standard xenstore > >paths. Making sure to remove those domain permissions before deleting > >the domain prevents this issue from happening. > > I am not sure to understand what you mean here. Could you give a quick > example? > > > > >It does not appear to matter if we delete the standard domain xenstore > >path (/local/domain/<id>) since libxl handles removing this path when > >the domain is destroyed. > > > >Based on this I would guess that the xenstore is hanging onto the VMID. > This is a somewhat strange conclusion. I guess the root cause is still unclear at this point. 
Is it possible that something else relies on those xenstore nodes to free up resources? Wei. > Regards, > > -- > Julien Grall
* Re: Xen 4.7 crash 2016-06-06 14:19 ` Wei Liu @ 2016-06-06 15:02 ` Aaron Cornelius 2016-06-07 9:53 ` Ian Jackson 0 siblings, 1 reply; 29+ messages in thread From: Aaron Cornelius @ 2016-06-06 15:02 UTC (permalink / raw) To: Wei Liu, Julien Grall Cc: Xen-devel, Stefano Stabellini, Ian Jackson, Jan Beulich On 6/6/2016 10:19 AM, Wei Liu wrote: > On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote: >> (CC Ian, Stefano and Wei) >> >> Hello Aaron, >> >> On 06/06/16 14:58, Aaron Cornelius wrote: >>> On 6/2/2016 5:07 AM, Julien Grall wrote: >>>> Hello Aaron, >>>> >>>> On 02/06/2016 02:32, Aaron Cornelius wrote: >>>>> This is with a custom application, we use the libxl APIs to interact >>>>> with Xen. Domains are created using the libxl_domain_create_new() >>>>> function, and domains are destroyed using the libxl_domain_destroy() >>>>> function. >>>>> >>>>> The test in this case creates a domain, waits a minute, then >>>>> deletes/creates the next domain, waits a minute, and so on. So I >>>>> wouldn't be surprised to see the VMID occasionally indicate there are 2 >>>>> active domains since there could be one being created and one being >>>>> destroyed in a very short time. However, I wouldn't expect to ever have >>>>> 256 domains. >>>> >>>> Your log has: >>>> >>>> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0) >>>> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) >>>> dom:(0) >>>> >>>> Which suggest that some grants are still mapped in DOM0. >>>> >>>>> >>>>> The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which >>>>> means that only 48 of the the Mirage domains (with 32MB of RAM) would >>>>> work at the same time anyway. Which doesn't account for the various >>>>> inter-domain resources or the RAM used by Xen itself. >>>> >>>> All the pages who belongs to the domain could have been freed except the >>>> one referenced by DOM0. So the footprint of this domain will be limited >>>> at the time. 
>>>> >>>> I would recommend you to check how many domain are running at this time >>>> and if DOM0 effectively released all the resources. >>>> >>>>> If the p2m_teardown() function checked for NULL it would prevent the >>>>> crash, but I suspect Xen would be just as broken since all of my >>>>> resources have leaked away. More broken in fact, since if the board >>>>> reboots at least the applications will restart and domains can be >>>>> recreated. >>>>> >>>>> It certainly appears that some resources are leaking when domains are >>>>> deleted (possibly only on the ARM or ARM32 platforms). We will try to >>>>> add some debug prints and see if we can discover exactly what is >>>>> going on. >>>> >>>> The leakage could also happen from DOM0. FWIW, I have been able to cycle >>>> 2000 guests over the night on an ARM platforms. >>>> >>> >>> We've done some more testing regarding this issue. And further testing >>> shows that it doesn't matter if we delete the vchans before the domains >>> are deleted. Those appear to be cleaned up correctly when the domain is >>> destroyed. >>> >>> What does stop this issue from happening (using the same version of Xen >>> that the issue was detected on) is removing any non-standard xenstore >>> references before deleting the domain. In this case our application >>> allocates permissions for created domains to non-standard xenstore >>> paths. Making sure to remove those domain permissions before deleting >>> the domain prevents this issue from happening. >> >> I am not sure to understand what you mean here. Could you give a quick >> example? So we have a custom xenstore path for our tool (/tool/custom/ for the sake of this example), and we then allow every domain created using this tool to read that path. When the domain is created, the domain is explicitly given read permissions using xs_set_permissions(). More precisely we: 1. retrieve the current list of permissions with xs_get_permissions() 2. 
realloc the permissions list to increase it by 1
3. update the list of permissions to give the new domain read-only access
4. then set the new permissions list with xs_set_permissions()

We saw errors logged because this list of permissions was getting prohibitively large, but this error did not appear to be directly connected to the Xen crash I submitted last week. Or so we thought at the time. We realized that we had forgotten to remove the domain from the permissions list when the domain is deleted (which would cause the error we saw). The application was updated to remove the domain from the permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain from the permissions list
4. update the permissions with xs_set_permissions()

After we made that change, a load test over the weekend confirmed that the Xen crash no longer happens. We checked this morning first thing and confirmed that without this change the crash reliably occurs. >>> It does not appear to matter if we delete the standard domain xenstore >>> path (/local/domain/<id>) since libxl handles removing this path when >>> the domain is destroyed. >>> >>> Based on this I would guess that the xenstore is hanging onto the VMID. >> > > This is a somewhat strange conclusion. I guess the root cause is still > unclear at this point. We originally tested a fix that explicitly cleaned up the vchans (created to communicate with the domains) before the xen_domain_destroy() function is called and there was no change. We have confirmed that the vchans do not appear to cause issues when they are not deleted prior to the domain being destroyed. Our application did delete them eventually, but last week they were only deleted _after_ the domain was destroyed. I would guess that if they are not explicitly deleted they could cause this same problem. 
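The add/remove steps Aaron describes can be shown concretely. This is a minimal, self-contained sketch: struct perm mirrors the shape of xenstore's struct xs_permissions, and the xs_get_permissions()/xs_set_permissions() round-trips are elided so that only the list manipulation (the part whose removal step was missing) is modelled and can be tested on its own.

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the shape of struct xs_permissions from xenstore.h:
 * a domain id plus permission flags (XS_PERM_READ etc. in the
 * real API). */
struct perm {
    unsigned int domid;
    unsigned int flags;
};

/* "Create" path, steps 2-3: grow the list by one entry and append
 * the new domain with the requested (e.g. read-only) access. */
static struct perm *perm_add(struct perm *list, unsigned int *n,
                             unsigned int domid, unsigned int flags)
{
    struct perm *grown = realloc(list, (*n + 1) * sizeof(*grown));
    if (!grown)
        return list;            /* keep the old list on failure */
    grown[*n].domid = domid;
    grown[*n].flags = flags;
    (*n)++;
    return grown;
}

/* "Destroy" path, steps 2-3: find the dying domain and memmove()
 * the tail down over it.  Without this step, every dead domid stays
 * in the ACL forever, which is the leak suspected in this thread. */
static void perm_remove(struct perm *list, unsigned int *n,
                        unsigned int domid)
{
    for (unsigned int i = 0; i < *n; i++) {
        if (list[i].domid == domid) {
            memmove(&list[i], &list[i + 1],
                    (*n - i - 1) * sizeof(*list));
            (*n)--;
            return;
        }
    }
}
```

In the real application the shrunken list would then be written back with xs_set_permissions(), completing step 4.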
> Is it possible that something else what rely on those xenstore node to > free up resources? It was stated earlier in this thread that the VMID is only deleted once all references to it are destroyed. I would speculate that the xenstore permissions list is one of these references that could prevent a domain reference (and VMID) from being completely cleaned up. - Aaron Cornelius
* Re: Xen 4.7 crash 2016-06-06 15:02 ` Aaron Cornelius @ 2016-06-07 9:53 ` Ian Jackson 2016-06-07 13:40 ` Aaron Cornelius 0 siblings, 1 reply; 29+ messages in thread From: Ian Jackson @ 2016-06-07 9:53 UTC (permalink / raw) To: Aaron Cornelius Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"): > We realized that we had forgotten to remove the domain from the > permissions list when the domain is deleted (which would cause the error > we saw). The application was updated to remove the domain from the > permissions list: > 1. retrieve the permissions with xs_get_permissions() > 2. find the domain ID that is being deleted > 3. memmove() the remaining domains down by 1 to "delete" the old domain > from the permissions list > 4. update the permissions with xs_set_permissions() > > After we made that change, a load test over the weekend confirmed that > the Xen crash no longer happens. We checked this morning first thing > and confirmed that without this change the crash reliably occurs. This is rather odd behaviour. I don't think xenstored should hang onto the domain's xs ring page just because the domain is still mentioned in a permission list. But it may do. I haven't checked the code. Are you using the ocaml xenstored (oxenstored) or the C one ? Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Xen 4.7 crash 2016-06-07 9:53 ` Ian Jackson @ 2016-06-07 13:40 ` Aaron Cornelius 2016-06-07 15:13 ` Aaron Cornelius 0 siblings, 1 reply; 29+ messages in thread From: Aaron Cornelius @ 2016-06-07 13:40 UTC (permalink / raw) To: Ian Jackson Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich On 6/7/2016 5:53 AM, Ian Jackson wrote: > Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"): >> We realized that we had forgotten to remove the domain from the >> permissions list when the domain is deleted (which would cause the error >> we saw). The application was updated to remove the domain from the >> permissions list: >> 1. retrieve the permissions with xs_get_permissions() >> 2. find the domain ID that is being deleted >> 3. memmove() the remaining domains down by 1 to "delete" the old domain >> from the permissions list >> 4. update the permissions with xs_set_permissions() >> >> After we made that change, a load test over the weekend confirmed that >> the Xen crash no longer happens. We checked this morning first thing >> and confirmed that without this change the crash reliably occurs. > > This is rather odd behaviour. I don't think xenstored should hang > onto the domain's xs ring page just because the domain is still > mentioned in a permission list. > > But it may do. I haven't checked the code. Are you using the > ocaml xenstored (oxenstored) or the C one ? I didn't remember specifying anything special when building the xen tools, but I did run into troubles where the ocaml tools appeared to conflict with the opam installed mirage packages and libraries. Running "sudo make dist-install" command installs the ocaml libraries as root which made using opam difficult. So I did disable the ocaml tools during my build. I double checked and confirmed that the C version of xenstored was built. We will try to test the failure scenario with oxenstored to see if it behaves any differently. 
- Aaron
* Re: Xen 4.7 crash 2016-06-07 13:40 ` Aaron Cornelius @ 2016-06-07 15:13 ` Aaron Cornelius 2016-06-09 11:14 ` Ian Jackson 0 siblings, 1 reply; 29+ messages in thread From: Aaron Cornelius @ 2016-06-07 15:13 UTC (permalink / raw) To: Ian Jackson Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich On 6/7/2016 9:40 AM, Aaron Cornelius wrote: > On 6/7/2016 5:53 AM, Ian Jackson wrote: >> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"): >>> We realized that we had forgotten to remove the domain from the >>> permissions list when the domain is deleted (which would cause the error >>> we saw). The application was updated to remove the domain from the >>> permissions list: >>> 1. retrieve the permissions with xs_get_permissions() >>> 2. find the domain ID that is being deleted >>> 3. memmove() the remaining domains down by 1 to "delete" the old domain >>> from the permissions list >>> 4. update the permissions with xs_set_permissions() >>> >>> After we made that change, a load test over the weekend confirmed that >>> the Xen crash no longer happens. We checked this morning first thing >>> and confirmed that without this change the crash reliably occurs. >> >> This is rather odd behaviour. I don't think xenstored should hang >> onto the domain's xs ring page just because the domain is still >> mentioned in a permission list. >> >> But it may do. I haven't checked the code. Are you using the >> ocaml xenstored (oxenstored) or the C one ? > > I didn't remember specifying anything special when building the xen > tools, but I did run into troubles where the ocaml tools appeared to > conflict with the opam installed mirage packages and libraries. Running > "sudo make dist-install" command installs the ocaml libraries as root > which made using opam difficult. So I did disable the ocaml tools > during my build. > > I double checked and confirmed that the C version of xenstored was > built. 
We will try to test the failure scenario with oxenstored to see > if it behaves any differently. I am not that familiar with the xenstored code, but as far as I can tell the grant mapping will be held by the xenstore until the xs_release() function is called (which is not called by libxl, and I do not explicitly call it in my software, although I might now just to be safe), or until the last reference to a domain is released and the registered destructor (destroy_domain), set by talloc_set_destructor(), is called. I tried to follow the oxenstored code, but I certainly don't consider myself an expert at OCaml. The oxenstored code does not appear to allocate grant mappings at all, which makes me think I am probably misunderstanding the code :) - Aaron
* Re: Xen 4.7 crash 2016-06-07 15:13 ` Aaron Cornelius @ 2016-06-09 11:14 ` Ian Jackson 2016-06-14 13:11 ` Aaron Cornelius 0 siblings, 1 reply; 29+ messages in thread From: Ian Jackson @ 2016-06-09 11:14 UTC (permalink / raw) To: Aaron Cornelius Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"): > I am not that familiar with the xenstored code, but as far as I can tell > the grant mapping will be held by the xenstore until the xs_release() > function is called (which is not called by libxl, and I do not > explicitly call it in my software, although I might now just to be > safe), or until the last reference to a domain is released and the > registered destructor (destroy_domain), set by talloc_set_destructor(), > is called. I'm not sure I follow. Or maybe I disagree. ISTM that: The grant mapping is released by destroy_domain, which is called via the talloc destructor as a result of talloc_free(domain->conn) in domain_cleanup. I don't see other references to domain->conn. domain_cleanup calls talloc_free on domain->conn when it sees the domain marked as dying. So I still think that your acl reference ought not to keep the grant mapping alive. Ian.
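The lifecycle Ian describes can be made concrete with a toy destructor model. This is not the real talloc API, only its shape: a destructor registered on an object (as the C xenstored reportedly does via talloc_set_destructor()) runs exactly when the object is finally freed, so the ring/grant cleanup in destroy_domain happens at talloc_free() time, not at "xl destroy" time.

```c
#include <stdlib.h>

/* Toy stand-in for a talloc-managed object: the registered
 * destructor runs when, and only when, the object is freed. */
typedef void (*destructor_fn)(void *obj);

struct tobj {
    destructor_fn destructor;
    int payload;
};

static struct tobj *tobj_new(int payload)
{
    struct tobj *o = calloc(1, sizeof(*o));
    if (o)
        o->payload = payload;
    return o;
}

/* Like talloc_set_destructor(): attach the cleanup callback. */
static void tobj_set_destructor(struct tobj *o, destructor_fn fn)
{
    o->destructor = fn;
}

/* Like talloc_free(): run the destructor, then release the memory.
 * In xenstored, destroy_domain() would unmap the grant-mapped
 * xenstore ring page at this point. */
static void tobj_free(struct tobj *o)
{
    if (o->destructor)
        o->destructor(o);
    free(o);
}

static int ring_unmapped;

/* Stand-in for xenstored's destroy_domain(): records that the
 * grant mapping was released. */
static void destroy_domain_stub(void *obj)
{
    (void)obj;
    ring_unmapped = 1;
}
```

The question in the thread reduces to whether anything (such as an ACL entry) can hold a reference that delays the free, and with it the destructor, indefinitely.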
* Re: Xen 4.7 crash
  2016-06-09 11:14       ` Ian Jackson
@ 2016-06-14 13:11         ` Aaron Cornelius
  2016-06-14 13:15           ` Wei Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:11 UTC (permalink / raw)
To: Ian Jackson
Cc: Xen-devel, Julien Grall, Stefano Stabellini, Wei Liu, Jan Beulich

On 6/9/2016 7:14 AM, Ian Jackson wrote:
> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>> I am not that familiar with the xenstored code, but as far as I can tell
>> the grant mapping will be held by the xenstore until the xs_release()
>> function is called (which is not called by libxl, and I do not
>> explicitly call it in my software, although I might now just to be
>> safe), or until the last reference to a domain is released and the
>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>> is called.
>
> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>
> The grant mapping is released by destroy_domain, which is called via
> the talloc destructor as a result of talloc_free(domain->conn) in
> domain_cleanup.  I don't see other references to domain->conn.
>
> domain_cleanup calls talloc_free on domain->conn when it sees the
> domain marked as dying in domain_cleanup.
>
> So I still think that your acl reference ought not to keep the grant
> mapping alive.

It took a while to complete the testing, but we've finished trying to
reproduce the error using oxenstored instead of the C xenstored.  When
the condition occurs that caused the error with the C xenstored (on
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does
not cause the crash.

So for whatever reason, it would appear that the C xenstored does keep
the grant allocations open, but oxenstored does not.

- Aaron Cornelius
* Re: Xen 4.7 crash
  2016-06-14 13:11         ` Aaron Cornelius
@ 2016-06-14 13:15           ` Wei Liu
  2016-06-14 13:26             ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Liu @ 2016-06-14 13:15 UTC (permalink / raw)
To: Aaron Cornelius
Cc: Stefano Stabellini, Wei Liu, Ian Jackson, Julien Grall,
    Jan Beulich, Xen-devel

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> On 6/9/2016 7:14 AM, Ian Jackson wrote:
> > Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> > > I am not that familiar with the xenstored code, but as far as I can tell
> > > the grant mapping will be held by the xenstore until the xs_release()
> > > function is called (which is not called by libxl, and I do not
> > > explicitly call it in my software, although I might now just to be
> > > safe), or until the last reference to a domain is released and the
> > > registered destructor (destroy_domain), set by talloc_set_destructor(),
> > > is called.
> >
> > I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> >
> > The grant mapping is released by destroy_domain, which is called via
> > the talloc destructor as a result of talloc_free(domain->conn) in
> > domain_cleanup.  I don't see other references to domain->conn.
> >
> > domain_cleanup calls talloc_free on domain->conn when it sees the
> > domain marked as dying in domain_cleanup.
> >
> > So I still think that your acl reference ought not to keep the grant
> > mapping alive.
>
> It took a while to complete the testing, but we've finished trying to
> reproduce the error using oxenstored instead of the C xenstored.  When
> the condition occurs that caused the error with the C xenstored (on
> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does
> not cause the crash.
>
> So for whatever reason, it would appear that the C xenstored does keep
> the grant allocations open, but oxenstored does not.
>

Can you provide some easy-to-follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?

Wei.

> - Aaron Cornelius
* Re: Xen 4.7 crash
  2016-06-14 13:15           ` Wei Liu
@ 2016-06-14 13:26             ` Aaron Cornelius
  2016-06-14 13:38               ` Aaron Cornelius
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:26 UTC (permalink / raw)
To: Wei Liu
Cc: Xen-devel, Julien Grall, Stefano Stabellini, Ian Jackson, Jan Beulich

On 6/14/2016 9:15 AM, Wei Liu wrote:
> On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
>> On 6/9/2016 7:14 AM, Ian Jackson wrote:
>>> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>>>> I am not that familiar with the xenstored code, but as far as I can tell
>>>> the grant mapping will be held by the xenstore until the xs_release()
>>>> function is called (which is not called by libxl, and I do not
>>>> explicitly call it in my software, although I might now just to be
>>>> safe), or until the last reference to a domain is released and the
>>>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>>>> is called.
>>>
>>> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>>>
>>> The grant mapping is released by destroy_domain, which is called via
>>> the talloc destructor as a result of talloc_free(domain->conn) in
>>> domain_cleanup.  I don't see other references to domain->conn.
>>>
>>> domain_cleanup calls talloc_free on domain->conn when it sees the
>>> domain marked as dying in domain_cleanup.
>>>
>>> So I still think that your acl reference ought not to keep the grant
>>> mapping alive.
>>
>> It took a while to complete the testing, but we've finished trying to
>> reproduce the error using oxenstored instead of the C xenstored.  When
>> the condition occurs that caused the error with the C xenstored (on
>> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does
>> not cause the crash.
>>
>> So for whatever reason, it would appear that the C xenstored does keep
>> the grant allocations open, but oxenstored does not.
>>
>
> Can you provide some easy-to-follow steps to reproduce this issue?
>
> AFAICT your environment is very specialised, but we should be able to
> trigger the issue with plain xenstore-* utilities?

I am not sure if the plain xenstore-* utilities will work, but here are
the steps to follow:

1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100%
   sure how to do this with the xenstore-* utilities)
   a. call xs_get_permissions()
   b. realloc() the permissions block to add the new domain
   c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error
because the list of domains has grown too large.  Sometime after that
is when the crash occurs with the C xenstored and the 4.7.0-rc4 version
of Xen.  It usually takes around 1200 or so iterations for the crash to
occur.

- Aaron Cornelius
* Re: Xen 4.7 crash
  2016-06-14 13:26             ` Aaron Cornelius
@ 2016-06-14 13:38               ` Aaron Cornelius
  2016-06-14 13:47                 ` Wei Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Cornelius @ 2016-06-14 13:38 UTC (permalink / raw)
To: Wei Liu
Cc: Xen-devel, Julien Grall, Stefano Stabellini, Ian Jackson, Jan Beulich

On 6/14/2016 9:26 AM, Aaron Cornelius wrote:
> On 6/14/2016 9:15 AM, Wei Liu wrote:
>> On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
>>> On 6/9/2016 7:14 AM, Ian Jackson wrote:
>>>> Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
>>>>> I am not that familiar with the xenstored code, but as far as I can tell
>>>>> the grant mapping will be held by the xenstore until the xs_release()
>>>>> function is called (which is not called by libxl, and I do not
>>>>> explicitly call it in my software, although I might now just to be
>>>>> safe), or until the last reference to a domain is released and the
>>>>> registered destructor (destroy_domain), set by talloc_set_destructor(),
>>>>> is called.
>>>>
>>>> I'm not sure I follow.  Or maybe I disagree.  ISTM that:
>>>>
>>>> The grant mapping is released by destroy_domain, which is called via
>>>> the talloc destructor as a result of talloc_free(domain->conn) in
>>>> domain_cleanup.  I don't see other references to domain->conn.
>>>>
>>>> domain_cleanup calls talloc_free on domain->conn when it sees the
>>>> domain marked as dying in domain_cleanup.
>>>>
>>>> So I still think that your acl reference ought not to keep the grant
>>>> mapping alive.
>>>
>>> It took a while to complete the testing, but we've finished trying to
>>> reproduce the error using oxenstored instead of the C xenstored.  When
>>> the condition occurs that caused the error with the C xenstored (on
>>> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does
>>> not cause the crash.
>>>
>>> So for whatever reason, it would appear that the C xenstored does keep
>>> the grant allocations open, but oxenstored does not.
>>>
>>
>> Can you provide some easy-to-follow steps to reproduce this issue?
>>
>> AFAICT your environment is very specialised, but we should be able to
>> trigger the issue with plain xenstore-* utilities?
>
> I am not sure if the plain xenstore-* utilities will work, but here are
> the steps to follow:
>
> 1. Create a non-standard xenstore path: /tool/test
> 2. Create a domU (mini-os/mirage/something small)
> 3. Add the new domU to the /tool/test permissions list (I'm not 100%
>    sure how to do this with the xenstore-* utilities)
>    a. call xs_get_permissions()
>    b. realloc() the permissions block to add the new domain
>    c. call xs_set_permissions()
> 4. Delete the domU from step 2
> 5. Repeat steps 2-4
>
> Eventually the xs_set_permissions() function will return an E2BIG error
> because the list of domains has grown too large.  Sometime after that
> is when the crash occurs with the C xenstored and the 4.7.0-rc4 version
> of Xen.  It usually takes around 1200 or so iterations for the crash to
> occur.

After writing up those steps I suddenly realized that I think I have a
bug in my test that might have been causing the problem in the first
place.  Once I got errors returned from xs_set_permissions(), I was not
properly cleaning up the created domains.  So I think this was just a
simple case of VMID exhaustion, caused by creating more than 255 domUs
at the same time.

In which case this is completely unrelated to xenstore holding on to
grant allocations, and the C xenstored most likely behaves correctly.

- Aaron Cornelius
* Re: Xen 4.7 crash
  2016-06-14 13:38               ` Aaron Cornelius
@ 2016-06-14 13:47                 ` Wei Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Wei Liu @ 2016-06-14 13:47 UTC (permalink / raw)
To: Aaron Cornelius
Cc: Stefano Stabellini, Wei Liu, Ian Jackson, Julien Grall,
    Jan Beulich, Xen-devel

On Tue, Jun 14, 2016 at 09:38:22AM -0400, Aaron Cornelius wrote:
> On 6/14/2016 9:26 AM, Aaron Cornelius wrote:
> > On 6/14/2016 9:15 AM, Wei Liu wrote:
> > > On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> > > > On 6/9/2016 7:14 AM, Ian Jackson wrote:
> > > > > Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> > > > > > I am not that familiar with the xenstored code, but as far as I
> > > > > > can tell the grant mapping will be held by the xenstore until
> > > > > > the xs_release() function is called (which is not called by
> > > > > > libxl, and I do not explicitly call it in my software, although
> > > > > > I might now just to be safe), or until the last reference to a
> > > > > > domain is released and the registered destructor
> > > > > > (destroy_domain), set by talloc_set_destructor(), is called.
> > > > >
> > > > > I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> > > > >
> > > > > The grant mapping is released by destroy_domain, which is called
> > > > > via the talloc destructor as a result of talloc_free(domain->conn)
> > > > > in domain_cleanup.  I don't see other references to domain->conn.
> > > > >
> > > > > domain_cleanup calls talloc_free on domain->conn when it sees the
> > > > > domain marked as dying in domain_cleanup.
> > > > >
> > > > > So I still think that your acl reference ought not to keep the
> > > > > grant mapping alive.
> > > >
> > > > It took a while to complete the testing, but we've finished trying
> > > > to reproduce the error using oxenstored instead of the C xenstored.
> > > > When the condition occurs that caused the error with the C xenstored
> > > > (on 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored
> > > > does not cause the crash.
> > > >
> > > > So for whatever reason, it would appear that the C xenstored does
> > > > keep the grant allocations open, but oxenstored does not.
> > > >
> > >
> > > Can you provide some easy-to-follow steps to reproduce this issue?
> > >
> > > AFAICT your environment is very specialised, but we should be able to
> > > trigger the issue with plain xenstore-* utilities?
> >
> > I am not sure if the plain xenstore-* utilities will work, but here are
> > the steps to follow:
> >
> > 1. Create a non-standard xenstore path: /tool/test
> > 2. Create a domU (mini-os/mirage/something small)
> > 3. Add the new domU to the /tool/test permissions list (I'm not 100%
> >    sure how to do this with the xenstore-* utilities)
> >    a. call xs_get_permissions()
> >    b. realloc() the permissions block to add the new domain
> >    c. call xs_set_permissions()
> > 4. Delete the domU from step 2
> > 5. Repeat steps 2-4
> >
> > Eventually the xs_set_permissions() function will return an E2BIG error
> > because the list of domains has grown too large.  Sometime after that
> > is when the crash occurs with the C xenstored and the 4.7.0-rc4 version
> > of Xen.  It usually takes around 1200 or so iterations for the crash to
> > occur.
>
> After writing up those steps I suddenly realized that I think I have a
> bug in my test that might have been causing the problem in the first
> place.  Once I got errors returned from xs_set_permissions(), I was not
> properly cleaning up the created domains.  So I think this was just a
> simple case of VMID exhaustion, caused by creating more than 255 domUs
> at the same time.
>
> In which case this is completely unrelated to xenstore holding on to
> grant allocations, and the C xenstored most likely behaves correctly.
>

OK, so I will treat this issue as resolved for now.  Let us know if you
discover something new.

Wei.

> - Aaron Cornelius
end of thread, other threads:[~2016-06-14 13:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01 19:54 Xen 4.7 crash Aaron Cornelius
2016-06-01 20:00 ` Andrew Cooper
2016-06-01 20:45 ` Aaron Cornelius
2016-06-01 21:24 ` Andrew Cooper
2016-06-01 22:18 ` Julien Grall
2016-06-01 22:26 ` Andrew Cooper
2016-06-01 21:35 ` Andrew Cooper
2016-06-01 22:24 ` Julien Grall
2016-06-01 22:31 ` Andrew Cooper
2016-06-02  8:47 ` Jan Beulich
2016-06-02  8:53 ` Andrew Cooper
2016-06-02  9:07 ` Jan Beulich
2016-06-01 22:35 ` Julien Grall
2016-06-02  1:32 ` Aaron Cornelius
2016-06-02  8:49 ` Jan Beulich
2016-06-02  9:07 ` Julien Grall
2016-06-06 13:58 ` Aaron Cornelius
2016-06-06 14:05 ` Julien Grall
2016-06-06 14:19 ` Wei Liu
2016-06-06 15:02 ` Aaron Cornelius
2016-06-07  9:53 ` Ian Jackson
2016-06-07 13:40 ` Aaron Cornelius
2016-06-07 15:13 ` Aaron Cornelius
2016-06-09 11:14 ` Ian Jackson
2016-06-14 13:11 ` Aaron Cornelius
2016-06-14 13:15 ` Wei Liu
2016-06-14 13:26 ` Aaron Cornelius
2016-06-14 13:38 ` Aaron Cornelius
2016-06-14 13:47 ` Wei Liu