From: Glenn Enright <glenn@rimuhosting.com>
To: "Juergen Gross" <jgross@suse.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jennifer Herbert <Jennifer.Herbert@citrix.com>,
	xen-devel@lists.xen.org, Steven Haigh <netwiz@crc.id.au>,
	Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: null domains after xl destroy
Date: Tue, 16 May 2017 12:49:51 +1200
Message-ID: <7f15f2b2-e969-0d5d-cf2f-c234910e1884@rimuhosting.com>
In-Reply-To: <a492a326-b777-52fc-343b-0ed0dd0e9bea@suse.com>

On 15/05/17 21:57, Juergen Gross wrote:
> On 13/05/17 06:02, Glenn Enright wrote:
>> On 09/05/17 21:24, Roger Pau Monné wrote:
>>> On Mon, May 08, 2017 at 11:10:24AM +0200, Juergen Gross wrote:
>>>> On 04/05/17 00:17, Glenn Enright wrote:
>>>>> On 04/05/17 04:58, Steven Haigh wrote:
>>>>>> On 04/05/17 01:53, Juergen Gross wrote:
>>>>>>> On 03/05/17 12:45, Steven Haigh wrote:
>>>>>>>> Just wanted to give this a little nudge now people seem to be
>>>>>>>> back on deck...
>>>>>>>
>>>>>>> Glenn, could you please give the attached patch a try?
>>>>>>>
>>>>>>> It should be applied on top of the other correction, the old debug
>>>>>>> patch should not be applied.
>>>>>>>
>>>>>>> I have added some debug output to make sure we see what is happening.
>>>>>>
>>>>>> This patch is included in kernel-xen-4.9.26-1
>>>>>>
>>>>>> It should be in the repos now.
>>>>>>
>>>>>
>>>>> Still seeing the same issue. Without the extra debug patch all I see in
>>>>> the logs after destroy is this...
>>>>>
>>>>> xen-blkback: xen_blkif_disconnect: busy
>>>>> xen-blkback: xen_blkif_free: delayed = 0
>>>>
>>>> Hmm, to me it seems as if some grant isn't being unmapped.
>>>>
>>>> Looking at gnttab_unmap_refs_async() I wonder how this is supposed to
>>>> work:
>>>>
>>>> I don't see how a grant would ever be unmapped in case of
>>>> page_count(item->pages[pc]) > 1 in __gnttab_unmap_refs_async(). All it
>>>> does is defer the call to the unmap operation again and again. Or
>>>> am I missing something here?
>>>
>>> No, I don't think you are missing anything, but I cannot see how this
>>> can be solved in a better way. Unmapping a page that's still referenced
>>> is certainly not the best option, or else we risk triggering a
>>> page fault elsewhere.
>>>
>>> IMHO, gnttab_unmap_refs_async should have a timeout, and return an
>>> error at some point. Also, I'm wondering whether there's a way to keep
>>> track of who has references on a specific page, but so far I haven't
>>> been able to figure out how to get this information from Linux.
>>>
>>> Also, I've noticed that __gnttab_unmap_refs_async uses page_count;
>>> shouldn't it use page_ref_count instead?
>>>
>>> Roger.
>>>
>>
>> In case it helps, I have continued to work on this. I noticed processes
>> left behind (under 4.9.27). The same issue is ongoing.
>>
>> # ps auxf | grep [x]vda
>> root      2983  0.0  0.0      0     0 ?        S    01:44   0:00  \_ [1.xvda1-1]
>> root      5457  0.0  0.0      0     0 ?        S    02:06   0:00  \_ [3.xvda1-1]
>> root      7382  0.0  0.0      0     0 ?        S    02:36   0:00  \_ [4.xvda1-1]
>> root      9668  0.0  0.0      0     0 ?        S    02:51   0:00  \_ [6.xvda1-1]
>> root     11080  0.0  0.0      0     0 ?        S    02:57   0:00  \_ [7.xvda1-1]
>>
>> # xl list
>> Name                              ID   Mem VCPUs      State   Time(s)
>> Domain-0                          0  1512     2     r-----     118.5
>> (null)                            1     8     4     --p--d      43.8
>> (null)                            3     8     4     --p--d       6.3
>> (null)                            4     8     4     --p--d      73.4
>> (null)                            6     8     4     --p--d      14.7
>> (null)                            7     8     4     --p--d      30
>>
>> Those all have...
>>
>> [root 11080]# cat wchan
>> xen_blkif_schedule
>>
>> [root 11080]# cat stack
>> [<ffffffff814eaee8>] xen_blkif_schedule+0x418/0xb40
>> [<ffffffff810a0555>] kthread+0xe5/0x100
>> [<ffffffff816f1c45>] ret_from_fork+0x25/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> And found another reference count bug. Would you like to give the
> attached patch (to be applied additionally to the previous ones) a try?
>
>
> Juergen
>

This seems to have solved the issue in 4.9.28, with all three patches 
applied. Awesome!

On my main test machine I can no longer replicate what I was originally 
seeing, and in dmesg I now see this flow...

xen-blkback: xen_blkif_disconnect: busy
xen-blkback: xen_blkif_free: delayed = 1
xen-blkback: xen_blkif_free: delayed = 0

xl list is clean, xenstore looks right. No extraneous processes left over.
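
For anyone who wants to repeat this check after each destroy, here is a rough sketch of how it could be scripted. It is not part of the patches and not the exact procedure used above: the function names are made up for illustration, the column layout is taken from the xl list output quoted earlier, and reading the leading number in a thread name such as [1.xvda1-1] as the domain ID is an assumption based on that same output.

```python
import re

# Thread names as they appear in the quoted ps output, e.g. "[1.xvda1-1]".
# Assumption: the leading number is the ID of the domain the thread served.
BLKBACK_THREAD = re.compile(r"\[(\d+)\.(xvd\w+)-\d+\]")


def null_domain_ids(xl_list_output):
    """Return the IDs of domains that `xl list` reports with the name "(null)"."""
    ids = []
    for line in xl_list_output.splitlines():
        fields = line.split()
        # Columns as shown in the thread: Name ID Mem VCPUs State Time(s)
        if len(fields) >= 2 and fields[0] == "(null)" and fields[1].isdigit():
            ids.append(int(fields[1]))
    return ids


def leftover_blkback_threads(ps_output):
    """Return (domain_id, device) pairs for blkback-style thread names in ps output."""
    return [(int(m.group(1)), m.group(2)) for m in BLKBACK_THREAD.finditer(ps_output)]


# Sample input copied from the output quoted earlier in this thread.
XL_SAMPLE = """\
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                          0  1512     2     r-----     118.5
(null)                            1     8     4     --p--d      43.8
(null)                            3     8     4     --p--d       6.3
"""
PS_SAMPLE = r"root      2983  0.0  0.0      0     0 ?        S    01:44   0:00  \_ [1.xvda1-1]"

if __name__ == "__main__":
    print(null_domain_ids(XL_SAMPLE))
    print(leftover_blkback_threads(PS_SAMPLE))
```

On a live dom0 the two functions would be fed the captured text of `xl list` and `ps auxf` instead of the samples; both returning an empty list corresponds to the clean state described above.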

Thank you Juergen, so much. I really appreciate your persistence with this.
If there is anything I can do to help push this upstream, please let me know.
Feel free to add a Reported-by line with my name if you think it appropriate.

Regards, Glenn
http://rimuhosting.com
