From: Aaron Cornelius <aaron.cornelius@dornerworks.com>
To: Wei Liu <wei.liu2@citrix.com>, Julien Grall <julien.grall@arm.com>
Cc: Xen-devel <xen-devel@lists.xenproject.org>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: Xen 4.7 crash
Date: Mon, 6 Jun 2016 11:02:49 -0400
Message-ID: <9fe65e93-542d-aad7-a820-6c1edb0260b3@dornerworks.com>
In-Reply-To: <20160606141931.GH14588@citrix.com>

On 6/6/2016 10:19 AM, Wei Liu wrote:
> On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
>> (CC Ian, Stefano and Wei)
>>
>> Hello Aaron,
>>
>> On 06/06/16 14:58, Aaron Cornelius wrote:
>>> On 6/2/2016 5:07 AM, Julien Grall wrote:
>>>> Hello Aaron,
>>>>
>>>> On 02/06/2016 02:32, Aaron Cornelius wrote:
>>>>> This is with a custom application, we use the libxl APIs to interact
>>>>> with Xen.  Domains are created using the libxl_domain_create_new()
>>>>> function, and domains are destroyed using the libxl_domain_destroy()
>>>>> function.
>>>>>
>>>>> The test in this case creates a domain, waits a minute, then
>>>>> deletes/creates the next domain, waits a minute, and so on.  So I
>>>>> wouldn't be surprised to see the VMID occasionally indicate there are 2
>>>>> active domains since there could be one being created and one being
>>>>> destroyed in a very short time.  However, I wouldn't expect to ever have
>>>>> 256 domains.
>>>>
>>>> Your log has:
>>>>
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
>>>> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
>>>>
>>>> Which suggests that some grants are still mapped in DOM0.
>>>>
>>>>>
>>>>> The CubieTruck only has 2GB of RAM, and I allocate 512MB for dom0,
>>>>> which means that only 48 of the Mirage domains (with 32MB of RAM
>>>>> each) would fit at the same time anyway.  That doesn't even account
>>>>> for the various inter-domain resources or the RAM used by Xen itself.
>>>>
>>>> All the pages that belong to the domain could have been freed except
>>>> the ones referenced by DOM0. So the footprint of this domain would be
>>>> limited at that point.
>>>>
>>>> I would recommend you check how many domains are running at that time
>>>> and whether DOM0 has effectively released all the resources.
>>>>
>>>>> If the p2m_teardown() function checked for NULL it would prevent the
>>>>> crash, but I suspect Xen would be just as broken since all of my
>>>>> resources have leaked away.  More broken in fact, since if the board
>>>>> reboots at least the applications will restart and domains can be
>>>>> recreated.
>>>>>
>>>>> It certainly appears that some resources are leaking when domains are
>>>>> deleted (possibly only on the ARM or ARM32 platforms).  We will try to
>>>>> add some debug prints and see if we can discover exactly what is
>>>>> going on.
>>>>
>>>> The leakage could also happen from DOM0. FWIW, I have been able to
>>>> cycle 2000 guests overnight on an ARM platform.
>>>>
>>>
>>> We've done some more testing regarding this issue, and it shows that
>>> it doesn't matter whether we delete the vchans before the domains are
>>> deleted.  Those appear to be cleaned up correctly when the domain is
>>> destroyed.
>>>
>>> What does stop this issue from happening (using the same version of Xen
>>> that the issue was detected on) is removing any non-standard xenstore
>>> references before deleting the domain.  In this case our application
>>> grants created domains permissions on non-standard xenstore paths.
>>> Making sure to remove those permissions before deleting the domain
>>> prevents this issue from happening.
>>
>> I am not sure I understand what you mean here. Could you give a quick
>> example?

So we have a custom xenstore path for our tool (/tool/custom/ for the
sake of this example), and we allow every domain created using this
tool to read that path.  When a domain is created, it is explicitly
given read permission using xs_set_permissions().  More precisely we
(see the sketch after this list):
1. retrieve the current list of permissions with xs_get_permissions()
2. realloc the permissions list to grow it by one entry
3. update the list to give the new domain read-only access
4. then set the new permissions list with xs_set_permissions()
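
For concreteness, a minimal sketch of that sequence against the
libxenstore C API.  The path, helper name, and error handling are
illustrative, not our actual application code:

#include <stdlib.h>
#include <xenstore.h>

/* Illustrative only: grant "domid" read-only access to our custom
 * path.  "xsh" is an open xenstore connection (e.g. from xs_open(0)). */
static int grant_read_access(struct xs_handle *xsh, unsigned int domid)
{
    const char *path = "/tool/custom";
    unsigned int num;
    struct xs_permissions *perms, *tmp;

    /* 1. Retrieve the current permissions list (malloc'd by
     *    libxenstore, so the caller must free it). */
    perms = xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return -1;

    /* 2. Grow the list by one entry. */
    tmp = realloc(perms, (num + 1) * sizeof(*perms));
    if (!tmp) {
        free(perms);
        return -1;
    }
    perms = tmp;

    /* 3. Append a read-only entry for the new domain. */
    perms[num].id = domid;
    perms[num].perms = XS_PERM_READ;

    /* 4. Write the updated list back. */
    if (!xs_set_permissions(xsh, XBT_NULL, path, perms, num + 1)) {
        free(perms);
        return -1;
    }

    free(perms);
    return 0;
}

The xs_handle here would be the application's long-lived xenstore
connection, opened once at startup.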

We saw errors logged because this list of permissions was getting
prohibitively large, but those errors did not appear to be directly
connected to the Xen crash I submitted last week.  Or so we thought at
the time.

We realized that we had forgotten to remove the domain from the
permissions list when the domain was deleted (which would cause the
errors we saw).  The application was updated to remove the domain from
the permissions list (see the sketch after this list):
1. retrieve the permissions with xs_get_permissions()
2. find the entry for the domain ID that is being deleted
3. memmove() the remaining entries down by 1 to "delete" the old domain
from the permissions list
4. set the updated permissions with xs_set_permissions()
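
The matching removal, with the same caveats as the sketch above (note
that index 0 of the array returned by xs_get_permissions() is the
node's owner, which is why the search starts at 1):

#include <stdlib.h>
#include <string.h>
#include <xenstore.h>

/* Illustrative only: drop "domid" from the permissions list on our
 * custom path before the domain is destroyed. */
static int revoke_read_access(struct xs_handle *xsh, unsigned int domid)
{
    const char *path = "/tool/custom";
    unsigned int num, i;
    struct xs_permissions *perms;

    /* 1. Retrieve the current permissions list. */
    perms = xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return -1;

    /* 2. Find the entry for the domain being deleted (entry 0 is the
     *    owner of the node, so skip it). */
    for (i = 1; i < num; i++)
        if (perms[i].id == domid)
            break;

    if (i < num) {
        /* 3. Shift the remaining entries down by one. */
        memmove(&perms[i], &perms[i + 1],
                (num - i - 1) * sizeof(*perms));

        /* 4. Write back the shortened list. */
        if (!xs_set_permissions(xsh, XBT_NULL, path, perms, num - 1)) {
            free(perms);
            return -1;
        }
    }

    free(perms);
    return 0;
}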

After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens.  We checked first thing this morning
and confirmed that without this change the crash still reliably occurs.

>>> It does not appear to matter if we delete the standard domain xenstore
>>> path (/local/domain/<id>) since libxl handles removing this path when
>>> the domain is destroyed.
>>>
>>> Based on this I would guess that the xenstore is hanging onto the VMID.
>>
>
> This is a somewhat strange conclusion. I guess the root cause is still
> unclear at this point.

We originally tested a fix that explicitly cleaned up the vchans
(created to communicate with the domains) before the
libxl_domain_destroy() function was called, and there was no change.  We
have confirmed that the vchans do not appear to cause issues when they
are not deleted prior to the domain being destroyed.

Our application did delete them eventually, but last week they were only
deleted _after_ the domain was destroyed.  I would guess that if they
were never explicitly deleted they could cause this same problem.
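
For reference, the ordering we tested looked roughly like this.  The
helper name is made up, and "ctx" and "ctrl" are assumed to be the
application's existing libxl context and vchan handle:

#include <libxenvchan.h>
#include <libxl.h>

/* Illustrative only: close the vchan to the domain before destroying
 * the domain itself.  In our testing this ordering made no difference. */
static void teardown_domain(libxl_ctx *ctx, struct libxenvchan *ctrl,
                            uint32_t domid)
{
    if (ctrl)
        libxenvchan_close(ctrl);            /* tear down the vchan first */

    libxl_domain_destroy(ctx, domid, NULL); /* then destroy the domain */
}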

> Is it possible that something else relies on those xenstore nodes to
> free up resources?

It was stated earlier in this thread that the VMID is only released once
all references to the domain are destroyed.  I would speculate that the
xenstore permissions list holds one of those references and can prevent
a domain (and its VMID) from being completely cleaned up.

- Aaron Cornelius
