* mem_sharing: summarized problems when domain is dying
@ 2011-01-21 16:19 Jui-Hao Chiang
  2011-01-21 16:29 ` George Dunlap
  2011-01-21 19:45 ` Jui-Hao Chiang
  0 siblings, 2 replies; 12+ messages in thread
From: Jui-Hao Chiang @ 2011-01-21 16:19 UTC (permalink / raw)
  To: Tim Deegan; +Cc: MaoXiaoyun, xen devel

Hi, Tim:

From tinnycloud's results, here I summarize the current problems and
findings of mem_sharing when a domain is dying.
(1) When a domain is dying, alloc_domheap_page() and
set_shared_p2m_entry() simply fail. So the shr_lock is not enough
to ensure that the domain won't die in the middle of the mem_sharing
code. As tinnycloud's code shows, would it be better to call
rcu_lock_domain_by_id() before the above two functions?
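To make the failure mode concrete, here is a minimal self-contained C
model (the struct and function bodies are simplified stand-ins, not the
real Xen implementations): once is_dying is set, allocation on behalf
of the domain is refused by policy, so a sharing path holding shr_lock
can still fail part-way through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct domain: only the field that matters here. */
struct domain {
    bool is_dying;
};

/* Models alloc_domheap_page(): refuses to hand memory to a dying domain. */
static void *alloc_domheap_page_model(struct domain *d)
{
    static char page[4096];
    if (d->is_dying)
        return NULL;    /* fails by policy, not because the heap is empty */
    return page;
}

/* Models the unshare path: shr_lock does not prevent is_dying from having
 * been set, so the allocation can fail and the caller must cope. */
static int unshare_model(struct domain *d)
{
    void *page = alloc_domheap_page_model(d);
    if (page == NULL)
        return -1;      /* must be handled gracefully, not BUG() */
    return 0;
}
```

The model also shows why printing heap info is misleading here: the
failure is tied to is_dying, so plenty of free memory can remain.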

(2) What's the proper behavior of nominate/share/unshare when a domain is dying?
The following is just my current guess; please give comments as well.

(2.1) nominate: return failure; but we need to check blktap2's code to
make sure it understands and acts properly (should be a minor issue now)

(2.2) share: return success but skip the gfns of the dying domain, i.e.,
we don't remove them from the hash list and don't update their p2m
entries (set_shared_p2m_entry). We believe p2m_teardown will clean
them up later.

(2.3) unshare: this is the most problematic part. Because we cannot
call alloc_domheap_page() at this point, the only thing we can do is
simply skip the page and return. But what are the side effects?
(a) If p2m_teardown comes in, there is no problem: just destroy it and be done.
(b) hvm_hap_nested_page_fault: if we return failure, will this cause a
problem for the guest? Or we could simply return success to cheat the
guest, but then the guest will trigger another page fault when it
writes the page again.
(c) gnttab_map_grant_ref: this function specifies must_succeed to
gfn_to_mfn_unshare(), which would BUG() if unshare() fails.

Do we really need (b) and (c) in the last steps of a domain's death? If
so, we need a special alloc_domheap_page() for the dying domain.
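The skip-if-dying behavior proposed in (2.2) can be sketched with a
small self-contained C model (hypothetical names; the hash list and p2m
update are reduced to flags, so this only illustrates the bookkeeping,
not real Xen code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain {
    bool is_dying;
};

/* Hypothetical record for one shared gfn on the sharing hash list. */
struct shr_entry {
    struct domain *owner;
    bool on_list;        /* still on the sharing hash list */
    bool p2m_updated;    /* set_shared_p2m_entry() has been applied */
};

/* Models (2.2): share returns success overall, but entries owned by a
 * dying domain are skipped -- left on the list with their p2m entries
 * untouched -- on the assumption that p2m_teardown reclaims them later. */
static int share_model(struct shr_entry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (entries[i].owner->is_dying)
            continue;                   /* leave for p2m_teardown */
        entries[i].on_list = false;     /* removed from the hash list */
        entries[i].p2m_updated = true;  /* p2m entry updated */
    }
    return 0;                           /* still reports success */
}
```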


On Thu, Jan 20, 2011 at 4:19 AM, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 07:19 +0000 on 20 Jan (1295507976), MaoXiaoyun wrote:
>> Hi:
>>
>>             The latest BUG in mem_sharing_alloc_page from mem_sharing_unshare_page.
>>             I printed heap info, which shows plenty memory left.
>>             Could domain be NULL during in unshare, or should it be locked by rcu_lock_domain_by_id ?
>>
>
> 'd' probably isn't NULL; more likely is that the domain is not allowed
> to have any more memory.  You should look at the values of d->max_pages
> and d->tot_pages when the failure happens.
>
> Cheers.
>
> Tim.
>

Bests,
Jui-Hao

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:19 mem_sharing: summarized problems when domain is dying Jui-Hao Chiang
@ 2011-01-21 16:29 ` George Dunlap
  2011-01-21 16:32   ` George Dunlap
  2011-01-21 19:45 ` Jui-Hao Chiang
  1 sibling, 1 reply; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:29 UTC (permalink / raw)
  To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan

On Fri, Jan 21, 2011 at 4:19 PM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> (b) hap_nested_page_fault: if we return fail, will this cause problem
> to guest? or we can simply return success to cheat the guest. But
> later the guest will trigger another page fault if it write the page
> again.
> (c) gnttab_map_grant_ref: this function specify must_succeed to
> gfn_to_mfn_unshare(), which would BUG if unshare() fails.

I took a glance around the code this morning, and it seems like:

(b) should never happen.  If a domain is dying, all of its vcpus
should be offline.  If I'm wrong and there's a race between
d->is_dying being set and the vcpus being paused, then the vcpus
should just be paused if they get an un-handleable page fault.

(c) happens because backend drivers may still be servicing requests
(finishing disk I/O, incoming network packets) before being torn down.
It should be OK for those to fail if the domain is dying.

I'm not sure of the exact rationale behind the "cannot fail" flag, but
it looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
handle the case where the returned p2m entry is just


* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:29 ` George Dunlap
@ 2011-01-21 16:32   ` George Dunlap
  2011-01-21 16:41     ` George Dunlap
  0 siblings, 1 reply; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:32 UTC (permalink / raw)
  To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan

[sorry, accidentally sent too early]

On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> handle the case where the returned p2m entry is just

...invalid.  I wonder if "unsharing" the page, but marking the entry
invalid during death would help.

I suppose the problem there is that if you're keeping the VM around
but paused for analysis, you'll have holes in your address space.  But
just returning an invalid entry to the callers who try to unshare
pages might work.

 -George


* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:32   ` George Dunlap
@ 2011-01-21 16:41     ` George Dunlap
  2011-01-21 16:53       ` Tim Deegan
  2011-01-22 11:17       ` MaoXiaoyun
  0 siblings, 2 replies; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:41 UTC (permalink / raw)
  To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan

[-- Attachment #1: Type: text/plain, Size: 839 bytes --]

Tim / Xiaoyun, do you think something like this might work?

 -George

On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [sorry, accidentally sent too early]
>
> On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
>> I'm not sure the exact rationale behind the "cannot fail" flag; but it
>> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
>> handle the case where the returned p2m entry is just
>
> ...invalid.  I wonder if "unsharing" the page, but marking the entry
> invalid during death would help.
>
> I suppose the problem there is that if you're keeping the VM around
> but paused for analysis, you'll have holes in your address space.  But
> just returning an invalid entry to the callers who try to unshare
> pages might work.
>
>  -George
>

[-- Attachment #2: interpret_must_succeed_if_dying.diff --]
[-- Type: text/plain, Size: 680 bytes --]

diff -r 9ca9331c9780 xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h	Fri Jan 21 15:37:36 2011 +0000
+++ b/xen/include/asm-x86/p2m.h	Fri Jan 21 16:41:58 2011 +0000
@@ -390,7 +390,14 @@
                                       must_succeed 
                                       ? MEM_SHARING_MUST_SUCCEED : 0) )
         {
-            BUG_ON(must_succeed);
+            if ( must_succeed
+                 && p2m->domain->is_dying )
+            {
+                mfn = INVALID_MFN;
+                *p2mt=p2m_invalid;
+            }
+            else
+                BUG_ON(must_succeed);
             return mfn;
         }
         mfn = gfn_to_mfn(p2m, gfn, p2mt);

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:41     ` George Dunlap
@ 2011-01-21 16:53       ` Tim Deegan
  2011-01-22 11:17       ` MaoXiaoyun
  1 sibling, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-01-21 16:53 UTC (permalink / raw)
  To: George Dunlap; +Cc: MaoXiaoyun, xen devel, Jui-Hao Chiang

At 16:41 +0000 on 21 Jan (1295628107), George Dunlap wrote:
> Tim / Xiaoyun, do you think something like this might work?

Worth a try.  I don't think it will do much harm -- there should be no
cases where dom0 really must map a dying domain's memory. 

Tim.

> On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> > [sorry, accidentally sent too early]
> >
> > On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> >> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> >> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> >> handle the case where the returned p2m entry is just
> >
> > ...invalid.  I wonder if "unsharing" the page, but marking the entry
> > invalid during death would help.
> >
> > I suppose the problem there is that if you're keeping the VM around
> > but paused for analysis, you'll have holes in your address space.  But
> > just returning an invalid entry to the callers who try to unshare
> > pages might work.
> >
> >  -George
> >

> diff -r 9ca9331c9780 xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h	Fri Jan 21 15:37:36 2011 +0000
> +++ b/xen/include/asm-x86/p2m.h	Fri Jan 21 16:41:58 2011 +0000
> @@ -390,7 +390,14 @@
>                                        must_succeed 
>                                        ? MEM_SHARING_MUST_SUCCEED : 0) )
>          {
> -            BUG_ON(must_succeed);
> +            if ( must_succeed
> +                 && p2m->domain->is_dying )
> +            {
> +                mfn = INVALID_MFN;
> +                *p2mt=p2m_invalid;
> +            }
> +            else
> +                BUG_ON(must_succeed);
>              return mfn;
>          }
>          mfn = gfn_to_mfn(p2m, gfn, p2mt);


-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)


* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:19 mem_sharing: summarized problems when domain is dying Jui-Hao Chiang
  2011-01-21 16:29 ` George Dunlap
@ 2011-01-21 19:45 ` Jui-Hao Chiang
  2011-01-24 13:14   ` MaoXiaoyun
  2011-01-24 14:02   ` mem_sharing: summarized problems when domain is dying Tim Deegan
  1 sibling, 2 replies; 12+ messages in thread
From: Jui-Hao Chiang @ 2011-01-21 19:45 UTC (permalink / raw)
  To: Tim Deegan; +Cc: MaoXiaoyun, xen devel

Hi

On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> Hi, Tim:
>
> From tinnycloud's result, here I summarize the current problem and
> findings of mem_sharing due to domain dying.
> (1) When domain is dying, alloc_domheap_page() and
> set_shared_p2m_entry() would just fail. So the shr_lock is not enough
> to ensure that the domain won't die in the middle of mem_sharing code.
> As tinnycloud's code shows, is that better to use
> rcu_lock_domain_by_id before calling the above two functions?
>

There seems to be no good lock that protects a domain's is_dying state
from changing, so the unshare function could fail mid-way at several
points, e.g., alloc_domheap_page() and set_shared_p2m_entry().
If that's the case, we need to add some checks, and probably revert
what we have already done when is_dying changes in the middle.
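The check-and-revert idea can be sketched as follows (simplified
stand-ins for alloc_domheap_page() and set_shared_p2m_entry(), not the
real functions): each fallible step is checked, and on a later failure
the earlier steps are undone so nothing is left half-shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain {
    bool is_dying;
};

/* Stand-in for alloc_domheap_page(): fails once the domain is dying. */
static void *alloc_page_model(struct domain *d)
{
    static char page[4096];
    return d->is_dying ? NULL : (void *)page;
}

/* Stand-in for set_shared_p2m_entry(): also fails for a dying domain. */
static bool set_p2m_model(struct domain *d)
{
    return !d->is_dying;
}

/* Stand-in for releasing the page again (the "revert" step). */
static void free_page_model(void *page)
{
    (void)page;
}

/* Models the proposed unshare: if a step fails because is_dying was
 * set, revert the completed steps rather than leave the operation
 * half-done. */
static int unshare_with_revert(struct domain *d)
{
    void *page = alloc_page_model(d);
    if (page == NULL)
        return -1;                /* nothing done yet, nothing to undo */
    if (!set_p2m_model(d)) {
        free_page_model(page);    /* revert the completed allocation */
        return -1;
    }
    return 0;
}
```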

Any comments?

Jui-Hao


* RE: mem_sharing: summarized problems when domain is dying
  2011-01-21 16:41     ` George Dunlap
  2011-01-21 16:53       ` Tim Deegan
@ 2011-01-22 11:17       ` MaoXiaoyun
  1 sibling, 0 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-22 11:17 UTC (permalink / raw)
  To: xen devel; +Cc: george.dunlap, tim.deegan, juihaochiang


[-- Attachment #1.1: Type: text/plain, Size: 1937 bytes --]


Hi George:

  Thanks for your kind help.

I think the page type should also be changed inside mem_sharing_unshare_page() under shr_lock,
to prevent someone from unsharing the page again. So your patch and mine together make the
whole solution.

As for my patch, it seems that using put_page_and_type(page) to clean up the page is enough,
and we don't need BUG_ON(set_shared_p2m_entry_invalid(d, gfn) == 0) (which actually calls
set_p2m_entry(d, gfn, _mfn(INVALID_MFN), 0, p2m_invalid)), right?

One other thing is rcu_lock_domain_by_id(d->domain_id). When someone holds this lock and
d->is_dying == 0, does that mean d->is_dying will not change until rcu_unlock_domain() is
called? That is to say, does the lock actually protect the whole d structure?

 
> Date: Fri, 21 Jan 2011 16:41:47 +0000
> Subject: Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
> From: George.Dunlap@eu.citrix.com
> To: juihaochiang@gmail.com
> CC: Tim.Deegan@citrix.com; tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> Tim / Xiaoyun, do you think something like this might work?
> 
> -George
> 
> On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> > [sorry, accidentally sent too early]
> >
> > On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> >> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> >> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> >> handle the case where the returned p2m entry is just
> >
> > ...invalid.  I wonder if "unsharing" the page, but marking the entry
> > invalid during death would help.
> >
> > I suppose the problem there is that if you're keeping the VM around
> > but paused for analysis, you'll have holes in your address space.  But
> > just returning an invalid entry to the callers who try to unshare
> > pages might work.
> >
> >  -George
> >

[-- Attachment #1.2: Type: text/html, Size: 5906 bytes --]



* RE: mem_sharing: summarized problems when domain is dying
  2011-01-21 19:45 ` Jui-Hao Chiang
@ 2011-01-24 13:14   ` MaoXiaoyun
  2011-01-24 14:08     ` George Dunlap
  2011-01-25  4:13     ` Linux Guest Crash on stress test of memory sharing MaoXiaoyun
  2011-01-24 14:02   ` mem_sharing: summarized problems when domain is dying Tim Deegan
  1 sibling, 2 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-24 13:14 UTC (permalink / raw)
  To: xen devel; +Cc: george.dunlap, tim.deegan, juihaochiang


[-- Attachment #1.1: Type: text/plain, Size: 3623 bytes --]


Hi:

       Another bug was found when testing memory sharing.
       In this test, I start 24 Linux HVMs; each of them reboots through "xm reboot" every 30 minutes.
       After several hours, some of the HVMs will crash. All of the crashed HVMs stopped during booting.
       The bug still exists even if I forbid page sharing by cheating tapdisk so that
       xc_memshr_nominate_gref() returns failure.

       And no special log was found.

       I was able to dump the crash stack.
       What could have happened?
       Thanks.
 
PID: 2307   TASK: ffff810014166100  CPU: 0   COMMAND: "setfont"
 #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
 #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
 #2 [ffff8100123cd940] panic at ffffffff8009094a
 #3 [ffff8100123cda30] oops_end at ffffffff80064fca
 #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
 #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+363]
    RIP: ffffffff800515e5  RSP: ffff8100123cdbe8  RFLAGS: 00010203
    RAX: 0000000000000000  RBX: ffffffff804b3740  RCX: ffff8100000a03fc
    RDX: 00000000000003fd  RSI: ffff810011cec000  RDI: ffffffff803244c4
    RBP: ffff810011cec000   R8: d0d6999996000000   R9: 0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
 #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
 #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
 #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
#10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
#11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
#12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
#13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff54c4aec8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff54c4aee0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001d747ab0   R8: 0000000000000010   R9: 0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

 
> Date: Fri, 21 Jan 2011 14:45:14 -0500
> Subject: Re: mem_sharing: summarized problems when domain is dying
> From: juihaochiang@gmail.com
> To: Tim.Deegan@citrix.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
> 
> Hi
> 
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> > Hi, Tim:
> >
> > From tinnycloud's result, here I summarize the current problem and
> > findings of mem_sharing due to domain dying.
> > (1) When domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() would just fail. So the shr_lock is not enough
> > to ensure that the domain won't die in the middle of mem_sharing code.
> > As tinnycloud's code shows, is that better to use
> > rcu_lock_domain_by_id before calling the above two functions?
> >
> 
> There seems no good locking to protect a domain from changing the
> is_dying state. So the unshare function could fail in the middle in
> several points, e.g., alloc_domheap_page and set_shared_p2m_entry.
> If that's the case, we need to add some checking, and probably revert
> the things we have done when is_dying is changed in the middle.
> 
> Any comments?
> 
> Jui-Hao
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 5911 bytes --]



* Re: mem_sharing: summarized problems when domain is dying
  2011-01-21 19:45 ` Jui-Hao Chiang
  2011-01-24 13:14   ` MaoXiaoyun
@ 2011-01-24 14:02   ` Tim Deegan
  1 sibling, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-01-24 14:02 UTC (permalink / raw)
  To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel

At 19:45 +0000 on 21 Jan (1295639114), Jui-Hao Chiang wrote:
> Hi
> 
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> > Hi, Tim:
> >
> > From tinnycloud's result, here I summarize the current problem and
> > findings of mem_sharing due to domain dying.
> > (1) When domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() would just fail. So the shr_lock is not enough
> > to ensure that the domain won't die in the middle of mem_sharing code.
> > As tinnycloud's code shows, is that better to use
> > rcu_lock_domain_by_id before calling the above two functions?
> >
> 
> There seems no good locking to protect a domain from changing the
> is_dying state. So the unshare function could fail in the middle in
> several points, e.g., alloc_domheap_page and set_shared_p2m_entry.
> If that's the case, we need to add some checking, and probably revert
> the things we have done when is_dying is changed in the middle.

That sounds correct.  It would be a good idea to handle failures from
those functions anyway!

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)


* Re: RE: mem_sharing: summarized problems when domain is dying
  2011-01-24 13:14   ` MaoXiaoyun
@ 2011-01-24 14:08     ` George Dunlap
  2011-01-25  4:13     ` Linux Guest Crash on stress test of memory sharing MaoXiaoyun
  1 sibling, 0 replies; 12+ messages in thread
From: George Dunlap @ 2011-01-24 14:08 UTC (permalink / raw)
  To: MaoXiaoyun; +Cc: xen devel, tim.deegan, juihaochiang

I think it would be best if every separate issue you're facing is a
separate thread.  This looks like a Linux crash -- please include the
kernel version you're using, and whatever other information might be
appropriate.

 -George

2011/1/24 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
>        Another BUG found when testing memory sharing.
>        In this test, I start 24 linux HVMS, each of them reboot through "xm
> reboot" every 30minutes.
>        After several hours, some of the HVM will crash. All of the crash HVM
> are stopped during booting.
>        The bug still exists even I forbid page sharing by cheating tapdisk
> that xc_memshr_nominate_gref()
>        return failure.
>
>        And no special log found.
>
>        I was able to dump the crash stack.
>        what could happen?
>        thanks.
>
> PID: 2307   TASK: ffff810014166100  CPU: 0   COMMAND: "setfont"
>  #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
>  #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
>  #2 [ffff8100123cd940] panic at ffffffff8009094a
>  #3 [ffff8100123cda30] oops_end at ffffffff80064fca
>  #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
>  #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
>     [exception RIP: vgacon_do_font_op+363]
>     RIP: ffffffff800515e5  RSP: ffff8100123cdbe8  RFLAGS: 00010203
>     RAX: 0000000000000000  RBX: ffffffff804b3740  RCX: ffff8100000a03fc
>     RDX: 00000000000003fd  RSI: ffff810011cec000  RDI: ffffffff803244c4
>     RBP: ffff810011cec000   R8: d0d6999996000000   R9: 0000009090b0b0ff
>     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
>     R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
>  #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
>  #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
>  #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
> #10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
> #11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
> #12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
> #13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
>     RIP: 00000039294cc557  RSP: 00007fff54c4aec8  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
>     RDX: 00007fff54c4aee0  RSI: 0000000000004b72  RDI: 0000000000000003
>     RBP: 000000001d747ab0   R8: 0000000000000010   R9: 0000000000800000
>     R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
>     R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
>     ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
>
>> Date: Fri, 21 Jan 2011 14:45:14 -0500
>> Subject: Re: mem_sharing: summarized problems when domain is dying
>> From: juihaochiang@gmail.com
>> To: Tim.Deegan@citrix.com
>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>>
>> Hi
>>
>> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com>
>> wrote:
>> > Hi, Tim:
>> >
>> > From tinnycloud's result, here I summarize the current problem and
>> > findings of mem_sharing due to domain dying.
>> > (1) When domain is dying, alloc_domheap_page() and
>> > set_shared_p2m_entry() would just fail. So the shr_lock is not enough
>> > to ensure that the domain won't die in the middle of mem_sharing code.
>> > As tinnycloud's code shows, is that better to use
>> > rcu_lock_domain_by_id before calling the above two functions?
>> >
>>
>> There seems no good locking to protect a domain from changing the
>> is_dying state. So the unshare function could fail in the middle in
>> several points, e.g., alloc_domheap_page and set_shared_p2m_entry.
>> If that's the case, we need to add some checking, and probably revert
>> the things we have done when is_dying is changed in the middle.
>>
>> Any comments?
>>
>> Jui-Hao
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>


* Linux Guest Crash on stress test of memory sharing
  2011-01-24 13:14   ` MaoXiaoyun
  2011-01-24 14:08     ` George Dunlap
@ 2011-01-25  4:13     ` MaoXiaoyun
       [not found]       ` <BLU157-w350046650B3C4960C4B1F2DAFC0@phx.gbl>
  1 sibling, 1 reply; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-25  4:13 UTC (permalink / raw)
  To: xen devel; +Cc: george.dunlap, zpfalpc23, tim.deegan, juihaochiang


[-- Attachment #1.1: Type: text/plain, Size: 3889 bytes --]





Hi:

       Following George's suggestion, I am submitting the bug in this new thread.

       Start 24 Linux HVMs on a physical host; each of them reboots through "xm reboot" every 30 minutes.
       After several hours, some of the HVMs will crash.

       All of the crashed HVMs stopped during booting.
       The bug still exists even if I forbid page sharing by cheating tapdisk so that
       xc_memshr_nominate_gref() returns failure. There is no bug if memory sharing is disabled.
       (This means only mem_sharing_nominate_page() is called, and in mem_sharing_nominate_page()
        the page type is set to p2m_shared, so the page needs to be unshared later when someone
        tries to use it.)

       I remember there is a call path in memory sharing,
       hvm_hap_nested_page_fault() -> mem_sharing_unshare_page();
       compared to the crash dump, it might indicate some connection.
 
DomU kernel is from ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-164.el5.src.rpm
Xen version: 4.0.0
 
crash dump stack :
 
crash> bt -l
PID: 2422   TASK: ffff810013b40860  CPU: 1   COMMAND: "setfont"
 #0 [ffff810012cef900] xen_panic_event at ffffffff88001d28
 #1 [ffff810012cef920] notifier_call_chain at ffffffff80066eaa
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
 #2 [ffff810012cef940] panic at ffffffff8009094a
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
 #3 [ffff810012cefa30] oops_end at ffffffff80064fca
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 539
 #4 [ffff810012cefa40] do_page_fault at ffffffff80066dc0
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 591
 #5 [ffff810012cefb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+435]
    RIP: ffffffff8005162d  RSP: ffff810012cefbe8  RFLAGS: 00010287
    RAX: ffff8100000a6000  RBX: ffffffff804b3740  RCX: ffff8100000a4ae0
    RDX: ffff810012d16ae1  RSI: ffff810012d14000  RDI: ffffffff803244c4
    RBP: ffff810012d14000   R8: d0d6999996000000   R9: 0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff810012cefc20] vgacon_font_set at ffffffff8016bec5
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/video/console/vgacon.c: 1238
 #7 [ffff810012cefc60] con_font_op at ffffffff801aa86b
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt.c: 3645
 #8 [ffff810012cefcd0] vt_ioctl at ffffffff801a5af4
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt_ioctl.c: 965
 #9 [ffff810012cefd70] tty_ioctl at ffffffff80038a2c
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/tty_io.c: 3340
#10 [ffff810012cefeb0] do_ioctl at ffffffff800420d9
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 39
#11 [ffff810012cefed0] vfs_ioctl at ffffffff800302ce
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 500
#12 [ffff810012ceff40] sys_ioctl at ffffffff8004c766
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 520
#13 [ffff810012ceff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff1a57ed98  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff1a57edb0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001e33dab0   R8: 0000000000000010   R9: 0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

[-- Attachment #1.2: Type: text/html, Size: 5838 bytes --]



* RE: Linux Guest Crash on stress test of memory sharing
       [not found]       ` <BLU157-w350046650B3C4960C4B1F2DAFC0@phx.gbl>
@ 2011-01-25  6:23         ` MaoXiaoyun
  0 siblings, 0 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-25  6:23 UTC (permalink / raw)
  To: xen devel; +Cc: george.dunlap, zpfalpc23, tim.deegan, juihaochiang


[-- Attachment #1.1: Type: text/plain, Size: 5422 bytes --]


Hi:

      Most of the core dumps have the same stack as the one submitted before; here is another
      stack we saw.
      Thanks.
 
crash> bt -l
PID: 1      TASK: ffff8100011df7a0  CPU: 0   COMMAND: "init"
 #0 [ffff8100011fddf0] xen_panic_event at ffffffff88001d28
 #1 [ffff8100011fde10] notifier_call_chain at ffffffff80066eaa
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
 #2 [ffff8100011fde30] panic at ffffffff8009094a
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
 #3 [ffff8100011fdf20] do_exit at ffffffff80015477
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/exit.c: 835
 #4 [ffff8100011fdf80] system_call at ffffffff8005d116
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/entry.S
    RIP: 000000000055a5ff  RSP: 00007fff2b8c2e10  RFLAGS: 00010246
    RAX: 00000000000000e7  RBX: ffffffff8005d116  RCX: 0000000000000047
    RDX: 0000000000000001  RSI: 000000000000003c  RDI: 0000000000000001
    RBP: 0000000000000000   R8: 00000000000000e7   R9: ffffffffffffffb4
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000604ea8  R14: ffffffff80049281  R15: 0000000000000000
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b
crash> 
       
 
>From: tinnycloud@hotmail.com
>To: tinnycloud@hotmail.com
>Subject: Linux Guest Crash on stress test of memory sharing
>Date: Tue, 25 Jan 2011 13:07:15 +0800
>
>Hi:
> 
>       Follow George's suggestion to summit the bug in this new thread.
> 
>       Start 24 linux HVMS on a physical host, each of them reboot through "xm reboot" every 30minutes.
>       After several hours, some of the HVM will crash. 
> 
>       All of the crash HVM are stopped during booting.
>       The bug still exists even I forbid page sharing by cheating tapdisk that xc_memshr_nominate_gref()
>       return failure. No bug if memory sharing is disabled.
>       (This means only mem_sharing_nominate_page() are called, and in mem_sharing_nominate_page()
>        page type is set to p2m_shared, so later it needs to be unshared when someone try to use it)
> 
>       I remember there is a call routine in memory sharing,
>       hvm_hap_nested_page_fault()->mem_sharing_unshare_page() 
>       compare to the crash dump, it might indicates some connections.
> 
>DomU kernel is from ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-164.el5.src.rpm
>Xen version: 4.0.0
> 
>crash dump stack :
> 
>crash> bt -l
>PID: 2422   TASK: ffff810013b40860  CPU: 1   COMMAND: "setfont"
> #0 [ffff810012cef900] xen_panic_event at ffffffff88001d28
> #1 [ffff810012cef920] notifier_call_chain at ffffffff80066eaa
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
> #2 [ffff810012cef940] panic at ffffffff8009094a
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
> #3 [ffff810012cefa30] oops_end at ffffffff80064fca
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 539
> #4 [ffff810012cefa40] do_page_fault at ffffffff80066dc0
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 591
> #5 [ffff810012cefb30] error_exit at ffffffff8005dde9
>    [exception RIP: vgacon_do_font_op+435]
>    RIP: ffffffff8005162d  RSP: ffff810012cefbe8  RFLAGS: 00010287
>    RAX: ffff8100000a6000  RBX: ffffffff804b3740  RCX: ffff8100000a4ae0
>    RDX: ffff810012d16ae1  RSI: ffff810012d14000  RDI: ffffffff803244c4
>    RBP: ffff810012d14000   R8: d0d6999996000000   R9: 0000009090b0b0ff
>    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
>    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
>    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #6 [ffff810012cefc20] vgacon_font_set at ffffffff8016bec5
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/video/console/vgacon.c: 1238
> #7 [ffff810012cefc60] con_font_op at ffffffff801aa86b
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt.c: 3645
> #8 [ffff810012cefcd0] vt_ioctl at ffffffff801a5af4
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt_ioctl.c: 965
> #9 [ffff810012cefd70] tty_ioctl at ffffffff80038a2c
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/tty_io.c: 3340
>#10 [ffff810012cefeb0] do_ioctl at ffffffff800420d9
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 39
>#11 [ffff810012cefed0] vfs_ioctl at ffffffff800302ce
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 500
>#12 [ffff810012ceff40] sys_ioctl at ffffffff8004c766
>    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 520
>#13 [ffff810012ceff80] tracesys at ffffffff8005d28d (via system_call)
>    RIP: 00000039294cc557  RSP: 00007fff1a57ed98  RFLAGS: 00000246
>    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
>    RDX: 00007fff1a57edb0  RSI: 0000000000004b72  RDI: 0000000000000003
>    RBP: 000000001e33dab0   R8: 0000000000000010   R9: 0000000000800000
>    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
>    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
>    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b


Thread overview: 12+ messages
2011-01-21 16:19 mem_sharing: summarized problems when domain is dying Jui-Hao Chiang
2011-01-21 16:29 ` George Dunlap
2011-01-21 16:32   ` George Dunlap
2011-01-21 16:41     ` George Dunlap
2011-01-21 16:53       ` Tim Deegan
2011-01-22 11:17       ` MaoXiaoyun
2011-01-21 19:45 ` Jui-Hao Chiang
2011-01-24 13:14   ` MaoXiaoyun
2011-01-24 14:08     ` George Dunlap
2011-01-25  4:13     ` Linux Guest Crash on stress test of memory sharing MaoXiaoyun
     [not found]       ` <BLU157-w350046650B3C4960C4B1F2DAFC0@phx.gbl>
2011-01-25  6:23         ` MaoXiaoyun
2011-01-24 14:02   ` mem_sharing: summarized problems when domain is dying Tim Deegan
