* mem_sharing: summarized problems when domain is dying
@ 2011-01-21 16:19 Jui-Hao Chiang
2011-01-21 16:29 ` George Dunlap
2011-01-21 19:45 ` Jui-Hao Chiang
0 siblings, 2 replies; 12+ messages in thread
From: Jui-Hao Chiang @ 2011-01-21 16:19 UTC (permalink / raw)
To: Tim Deegan; +Cc: MaoXiaoyun, xen devel
Hi, Tim:
From tinnycloud's results, here I summarize the current problems and
findings about mem_sharing when a domain is dying.
(1) When a domain is dying, alloc_domheap_page() and
set_shared_p2m_entry() simply fail. So shr_lock is not enough to
ensure that the domain won't die in the middle of the mem_sharing code.
As tinnycloud's code shows, would it be better to use
rcu_lock_domain_by_id() before calling the above two functions?
(2) What is the proper behavior of nominate/share/unshare when a domain is dying?
The following is just my current guess; please comment.
(2.1) nominate: return failure; but we need to check blktap2's code to make
sure it understands and acts properly (this should be a minor issue now).
(2.2) share: return success but skip the gfns of the dying domain, i.e.,
we don't remove them from the hash list and don't update their p2m
entries (set_shared_p2m_entry). We believe p2m_teardown will clean
them up later.
(2.3) unshare: this is the most problematic part. Because we cannot
alloc_domheap_page() at this point, the only thing we can do is
skip the page and return. But what are the side effects?
(a) If p2m_teardown comes in, there is no problem: it just destroys the entry and is done.
(b) hap_nested_page_fault: if we return failure, will this cause a problem
for the guest? Or we could simply return success to fool the guest, but
then the guest will trigger another page fault if it writes the page again.
(c) gnttab_map_grant_ref: this function specifies must_succeed to
gfn_to_mfn_unshare(), which BUG()s if unshare() fails.
Do we really need (b) and (c) in the last steps of domain death? If
so, we need a special alloc_domheap_page() for dying domains.
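A rough model of the pattern suggested in (1) can be sketched as follows. Stub types stand in for the real struct domain and RCU helpers, so this only illustrates the intended control flow, not actual Xen code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the pattern from point (1): take a domain reference
 * before the two fallible calls.  Note that (as discussed later in
 * the thread) the real rcu_lock_domain_by_id() does not stop
 * is_dying from changing, so callees must still handle failure. */
struct domain { int domain_id; int is_dying; };

static struct domain dom = { .domain_id = 1, .is_dying = 0 };

/* Modeled rcu_lock_domain_by_id(): fails once the domain is dying. */
static struct domain *rcu_lock_domain_by_id(int id)
{
    return (id == dom.domain_id && !dom.is_dying) ? &dom : NULL;
}
static void rcu_unlock_domain(struct domain *d) { (void)d; }

/* Modeled unshare path: both fallible steps run under the reference. */
static int unshare_page_sketch(int domid)
{
    struct domain *d = rcu_lock_domain_by_id(domid);
    if ( d == NULL )
        return -1;  /* domain already dying: fail cleanly */
    /* alloc_domheap_page(d) and set_shared_p2m_entry(d, ...) would go
     * here; each can still fail and must be checked individually. */
    rcu_unlock_domain(d);
    return 0;
}
```

All names here are illustrative stand-ins; the real functions live in the Xen tree and have different signatures.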
On Thu, Jan 20, 2011 at 4:19 AM, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 07:19 +0000 on 20 Jan (1295507976), MaoXiaoyun wrote:
>> Hi:
>>
>> The latest BUG in mem_sharing_alloc_page from mem_sharing_unshare_page.
>> I printed the heap info, which shows plenty of memory left.
>> Could domain be NULL during unshare, or should it be locked by rcu_lock_domain_by_id()?
>>
>
> 'd' probably isn't NULL; more likely is that the domain is not allowed
> to have any more memory. You should look at the values of d->max_pages
> and d->tot_pages when the failure happens.
>
> Cheers.
>
> Tim.
>
Bests,
Jui-Hao
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 16:19 mem_sharing: summarized problems when domain is dying Jui-Hao Chiang
@ 2011-01-21 16:29 ` George Dunlap
2011-01-21 16:32 ` George Dunlap
2011-01-21 19:45 ` Jui-Hao Chiang
1 sibling, 1 reply; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:29 UTC (permalink / raw)
To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan
On Fri, Jan 21, 2011 at 4:19 PM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> (b) hap_nested_page_fault: if we return failure, will this cause a problem
> for the guest? Or we could simply return success to fool the guest, but
> then the guest will trigger another page fault if it writes the page again.
> (c) gnttab_map_grant_ref: this function specifies must_succeed to
> gfn_to_mfn_unshare(), which BUG()s if unshare() fails.
I took a glance at the code this morning, and it seems like:
(b) should never happen. If a domain is dying, all of its vcpus
should be offline. If I'm wrong and there's a race between
d->is_dying being set and the vcpus being paused, then the vcpus
should just be paused if they hit an unhandleable page fault.
(c) happens because backend drivers may still be servicing requests
(finishing disk I/O, incoming network packets) before being torn down.
It should be OK for those to fail if the domain is dying.
I'm not sure the exact rationale behind the "cannot fail" flag; but it
looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
handle the case where the returned p2m entry is just
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 16:29 ` George Dunlap
@ 2011-01-21 16:32 ` George Dunlap
2011-01-21 16:41 ` George Dunlap
0 siblings, 1 reply; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:32 UTC (permalink / raw)
To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan
[sorry, accidentally sent too early]
On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> handle the case where the returned p2m entry is just
...invalid. I wonder if "unsharing" the page, but marking the entry
invalid during death would help.
I suppose the problem there is that if you're keeping the VM around
but paused for analysis, you'll have holes in your address space. But
just returning an invalid entry to the callers who try to unshare
pages might work.
-George
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 16:32 ` George Dunlap
@ 2011-01-21 16:41 ` George Dunlap
2011-01-21 16:53 ` Tim Deegan
2011-01-22 11:17 ` MaoXiaoyun
0 siblings, 2 replies; 12+ messages in thread
From: George Dunlap @ 2011-01-21 16:41 UTC (permalink / raw)
To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel, Tim Deegan
Tim / Xiaoyun, do you think something like this might work?
-George
On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [sorry, accidentally sent too early]
>
> On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
>> I'm not sure the exact rationale behind the "cannot fail" flag; but it
>> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
>> handle the case where the returned p2m entry is just
>
> ...invalid. I wonder if "unsharing" the page, but marking the entry
> invalid during death would help.
>
> I suppose the problem there is that if you're keeping the VM around
> but paused for analysis, you'll have holes in your address space. But
> just returning an invalid entry to the callers who try to unshare
> pages might work.
>
> -George
>
[-- Attachment #2: interpret_must_succeed_if_dying.diff --]
[-- Type: text/plain, Size: 680 bytes --]
diff -r 9ca9331c9780 xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h Fri Jan 21 15:37:36 2011 +0000
+++ b/xen/include/asm-x86/p2m.h Fri Jan 21 16:41:58 2011 +0000
@@ -390,7 +390,14 @@
must_succeed
? MEM_SHARING_MUST_SUCCEED : 0) )
{
- BUG_ON(must_succeed);
+ if ( must_succeed
+ && p2m->domain->is_dying )
+ {
+ mfn = INVALID_MFN;
+ *p2mt=p2m_invalid;
+ }
+ else
+ BUG_ON(must_succeed);
return mfn;
}
mfn = gfn_to_mfn(p2m, gfn, p2mt);
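The patch's intent can also be modeled as a self-contained sketch: a must-succeed unshare on a dying domain hands back an invalid entry instead of BUG()ing. The types, constants, and the `unshare_failed` flag below are illustrative only, not the real p2m.h definitions:

```c
#include <assert.h>

/* Toy model of interpret_must_succeed_if_dying.diff.  All names and
 * constants are stand-ins for the real Xen definitions. */
#define INVALID_MFN (~0UL)
typedef enum { p2m_ram_rw, p2m_invalid } p2m_type_t;

struct domain { int is_dying; };

static unsigned long
gfn_to_mfn_unshare_sketch(struct domain *d, int must_succeed,
                          int unshare_failed, p2m_type_t *p2mt)
{
    if ( unshare_failed )
    {
        if ( must_succeed && d->is_dying )
        {
            *p2mt = p2m_invalid;   /* caller sees a hole, not a crash */
            return INVALID_MFN;
        }
        assert(!must_succeed);     /* stand-in for BUG_ON() */
        return INVALID_MFN;
    }
    *p2mt = p2m_ram_rw;            /* normal path: return the mfn */
    return 0x1234;                 /* arbitrary illustrative mfn */
}
```

As George notes, this only works because both grant-table callers already cope with an invalid p2m entry.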
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 16:41 ` George Dunlap
@ 2011-01-21 16:53 ` Tim Deegan
2011-01-22 11:17 ` MaoXiaoyun
1 sibling, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-01-21 16:53 UTC (permalink / raw)
To: George Dunlap; +Cc: MaoXiaoyun, xen devel, Jui-Hao Chiang
At 16:41 +0000 on 21 Jan (1295628107), George Dunlap wrote:
> Tim / Xiaoyun, do you think something like this might work?
Worth a try. I don't think it will do much harm -- there should be no
cases where dom0 really must map a dying domain's memory.
Tim.
> On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> > [sorry, accidentally sent too early]
> >
> > On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> >> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> >> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> >> handle the case where the returned p2m entry is just
> >
> > ...invalid. I wonder if "unsharing" the page, but marking the entry
> > invalid during death would help.
> >
> > I suppose the problem there is that if you're keeping the VM around
> > but paused for analysis, you'll have holes in your address space. But
> > just returning an invalid entry to the callers who try to unshare
> > pages might work.
> >
> > -George
> >
> diff -r 9ca9331c9780 xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h Fri Jan 21 15:37:36 2011 +0000
> +++ b/xen/include/asm-x86/p2m.h Fri Jan 21 16:41:58 2011 +0000
> @@ -390,7 +390,14 @@
> must_succeed
> ? MEM_SHARING_MUST_SUCCEED : 0) )
> {
> - BUG_ON(must_succeed);
> + if ( must_succeed
> + && p2m->domain->is_dying )
> + {
> + mfn = INVALID_MFN;
> + *p2mt=p2m_invalid;
> + }
> + else
> + BUG_ON(must_succeed);
> return mfn;
> }
> mfn = gfn_to_mfn(p2m, gfn, p2mt);
--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 16:19 mem_sharing: summarized problems when domain is dying Jui-Hao Chiang
2011-01-21 16:29 ` George Dunlap
@ 2011-01-21 19:45 ` Jui-Hao Chiang
2011-01-24 13:14 ` MaoXiaoyun
2011-01-24 14:02 ` mem_sharing: summarized problems when domain is dying Tim Deegan
1 sibling, 2 replies; 12+ messages in thread
From: Jui-Hao Chiang @ 2011-01-21 19:45 UTC (permalink / raw)
To: Tim Deegan; +Cc: MaoXiaoyun, xen devel
Hi
On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> Hi, Tim:
>
> From tinnycloud's results, here I summarize the current problems and
> findings about mem_sharing when a domain is dying.
> (1) When a domain is dying, alloc_domheap_page() and
> set_shared_p2m_entry() simply fail. So shr_lock is not enough to
> ensure that the domain won't die in the middle of the mem_sharing code.
> As tinnycloud's code shows, would it be better to use
> rcu_lock_domain_by_id() before calling the above two functions?
>
There seems to be no good locking to prevent a domain from changing
its is_dying state. So the unshare function could fail midway at
several points, e.g., alloc_domheap_page() and set_shared_p2m_entry().
If that's the case, we need to add some checks, and probably revert
what we have already done when is_dying changes in the middle.
Any comments?
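The check-and-revert idea can be sketched as a toy model. The stub allocator and p2m update below are not the real mem_sharing.c code paths; the `p2m_update_ok` knob simulates is_dying flipping between the two steps:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of "check and revert": if the p2m update fails after the
 * page allocation already succeeded (e.g. because is_dying flipped in
 * between), undo the allocation instead of leaking it.  All names are
 * illustrative; the real steps live in mem_sharing_unshare_page(). */
struct page_info { int in_use; };

static struct page_info heap_page;
static int p2m_update_ok = 1;   /* knob simulating the mid-way failure */

static struct page_info *alloc_domheap_page_stub(void)
{
    heap_page.in_use = 1;
    return &heap_page;
}
static void free_domheap_page_stub(struct page_info *pg) { pg->in_use = 0; }
static int set_shared_p2m_entry_stub(void) { return p2m_update_ok; }

static int unshare_with_revert(void)
{
    struct page_info *pg = alloc_domheap_page_stub();
    if ( pg == NULL )
        return -1;
    if ( !set_shared_p2m_entry_stub() )
    {
        free_domheap_page_stub(pg);   /* revert the completed step */
        return -1;
    }
    return 0;
}
```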
Jui-Hao
* RE: mem_sharing: summarized problems when domain is dying
2011-01-21 16:41 ` George Dunlap
2011-01-21 16:53 ` Tim Deegan
@ 2011-01-22 11:17 ` MaoXiaoyun
1 sibling, 0 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-22 11:17 UTC (permalink / raw)
To: xen devel; +Cc: george.dunlap, tim.deegan, juihaochiang
Hi George:
Thanks for your kind help.
I think the page type should also be changed inside mem_sharing_unshare_page() under shr_lock,
to prevent someone from unsharing the page again. So your patch and mine together make the whole solution.
As for my patch, it seems that using put_page_and_type(page) to clean up the page is enough, and
we don't need BUG_ON(set_shared_p2m_entry_invalid(d, gfn) == 0) (which actually calls
set_p2m_entry(d, gfn, _mfn(INVALID_MFN), 0, p2m_invalid)), right?
Another thing is rcu_lock_domain_by_id(d->domain_id): when someone holds this lock and
d->is_dying == 0, does that mean d->is_dying will not change until rcu_unlock_domain() is called?
That is to say, does the lock actually protect the whole d structure?
> Date: Fri, 21 Jan 2011 16:41:47 +0000
> Subject: Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
> From: George.Dunlap@eu.citrix.com
> To: juihaochiang@gmail.com
> CC: Tim.Deegan@citrix.com; tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>
> Tim / Xiaoyun, do you think something like this might work?
>
> -George
>
> On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> > [sorry, accidentally sent too early]
> >
> > On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> >> I'm not sure the exact rationale behind the "cannot fail" flag; but it
> >> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> >> handle the case where the returned p2m entry is just
> >
> > ...invalid. I wonder if "unsharing" the page, but marking the entry
> > invalid during death would help.
> >
> > I suppose the problem there is that if you're keeping the VM around
> > but paused for analysis, you'll have holes in your address space. But
> > just returning an invalid entry to the callers who try to unshare
> > pages might work.
> >
> > -George
> >
* RE: mem_sharing: summarized problems when domain is dying
2011-01-21 19:45 ` Jui-Hao Chiang
@ 2011-01-24 13:14 ` MaoXiaoyun
2011-01-24 14:08 ` George Dunlap
2011-01-25 4:13 ` Linux Guest Crash on stress test of memory sharing MaoXiaoyun
2011-01-24 14:02 ` mem_sharing: summarized problems when domain is dying Tim Deegan
1 sibling, 2 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-24 13:14 UTC (permalink / raw)
To: xen devel; +Cc: george.dunlap, tim.deegan, juihaochiang
Hi:
Another BUG found while testing memory sharing.
In this test, I start 24 Linux HVMs, each of which reboots through "xm reboot" every 30 minutes.
After several hours, some of the HVMs crash. All of the crashed HVMs are stopped during booting.
The bug still exists even if I forbid page sharing by making tapdisk believe that xc_memshr_nominate_gref()
returned failure.
No special log was found.
I was able to dump the crash stack.
What could be happening?
Thanks.
PID: 2307 TASK: ffff810014166100 CPU: 0 COMMAND: "setfont"
#0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
#1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
#2 [ffff8100123cd940] panic at ffffffff8009094a
#3 [ffff8100123cda30] oops_end at ffffffff80064fca
#4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
#5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
[exception RIP: vgacon_do_font_op+363]
RIP: ffffffff800515e5 RSP: ffff8100123cdbe8 RFLAGS: 00010203
RAX: 0000000000000000 RBX: ffffffff804b3740 RCX: ffff8100000a03fc
RDX: 00000000000003fd RSI: ffff810011cec000 RDI: ffffffff803244c4
RBP: ffff810011cec000 R8: d0d6999996000000 R9: 0000009090b0b0ff
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: 0000000000000001 R15: 000000000000000e
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
#7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
#8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
#9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
#10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
#11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
#12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
#13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
RIP: 00000039294cc557 RSP: 00007fff54c4aec8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
RDX: 00007fff54c4aee0 RSI: 0000000000004b72 RDI: 0000000000000003
RBP: 000000001d747ab0 R8: 0000000000000010 R9: 0000000000800000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
R13: 0000000000000200 R14: 0000000000000008 R15: 0000000000000008
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
> Date: Fri, 21 Jan 2011 14:45:14 -0500
> Subject: Re: mem_sharing: summarized problems when domain is dying
> From: juihaochiang@gmail.com
> To: Tim.Deegan@citrix.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>
> Hi
>
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> > Hi, Tim:
> >
> > From tinnycloud's results, here I summarize the current problems and
> > findings about mem_sharing when a domain is dying.
> > (1) When a domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() simply fail. So shr_lock is not enough to
> > ensure that the domain won't die in the middle of the mem_sharing code.
> > As tinnycloud's code shows, would it be better to use
> > rcu_lock_domain_by_id() before calling the above two functions?
> >
>
> There seems to be no good locking to prevent a domain from changing
> its is_dying state. So the unshare function could fail midway at
> several points, e.g., alloc_domheap_page() and set_shared_p2m_entry().
> If that's the case, we need to add some checks, and probably revert
> what we have already done when is_dying changes in the middle.
>
> Any comments?
>
> Jui-Hao
* Re: mem_sharing: summarized problems when domain is dying
2011-01-21 19:45 ` Jui-Hao Chiang
2011-01-24 13:14 ` MaoXiaoyun
@ 2011-01-24 14:02 ` Tim Deegan
1 sibling, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-01-24 14:02 UTC (permalink / raw)
To: Jui-Hao Chiang; +Cc: MaoXiaoyun, xen devel
At 19:45 +0000 on 21 Jan (1295639114), Jui-Hao Chiang wrote:
> Hi
>
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> > Hi, Tim:
> >
> > From tinnycloud's results, here I summarize the current problems and
> > findings about mem_sharing when a domain is dying.
> > (1) When a domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() simply fail. So shr_lock is not enough to
> > ensure that the domain won't die in the middle of the mem_sharing code.
> > As tinnycloud's code shows, would it be better to use
> > rcu_lock_domain_by_id() before calling the above two functions?
> >
>
> There seems to be no good locking to prevent a domain from changing
> its is_dying state. So the unshare function could fail midway at
> several points, e.g., alloc_domheap_page() and set_shared_p2m_entry().
> If that's the case, we need to add some checks, and probably revert
> what we have already done when is_dying changes in the middle.
That sounds correct. It would be a good idea to handle failures from
those functions anyway!
Tim.
--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
* Re: RE: mem_sharing: summarized problems when domain is dying
2011-01-24 13:14 ` MaoXiaoyun
@ 2011-01-24 14:08 ` George Dunlap
2011-01-25 4:13 ` Linux Guest Crash on stress test of memory sharing MaoXiaoyun
1 sibling, 0 replies; 12+ messages in thread
From: George Dunlap @ 2011-01-24 14:08 UTC (permalink / raw)
To: MaoXiaoyun; +Cc: xen devel, tim.deegan, juihaochiang
I think it would be best if every separate issue you're facing is a
separate thread. This looks like a Linux crash -- please include the
kernel version you're using, and whatever other information might be
appropriate.
-George
2011/1/24 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
> Another BUG found while testing memory sharing.
> In this test, I start 24 Linux HVMs, each of which reboots through
> "xm reboot" every 30 minutes.
> After several hours, some of the HVMs crash. All of the crashed HVMs
> are stopped during booting.
> The bug still exists even if I forbid page sharing by making tapdisk
> believe that xc_memshr_nominate_gref() returned failure.
>
> No special log was found.
>
> I was able to dump the crash stack.
> What could be happening?
> Thanks.
>
> PID: 2307 TASK: ffff810014166100 CPU: 0 COMMAND: "setfont"
> #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
> #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
> #2 [ffff8100123cd940] panic at ffffffff8009094a
> #3 [ffff8100123cda30] oops_end at ffffffff80064fca
> #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
> #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
> [exception RIP: vgacon_do_font_op+363]
> RIP: ffffffff800515e5 RSP: ffff8100123cdbe8 RFLAGS: 00010203
> RAX: 0000000000000000 RBX: ffffffff804b3740 RCX: ffff8100000a03fc
> RDX: 00000000000003fd RSI: ffff810011cec000 RDI: ffffffff803244c4
> RBP: ffff810011cec000 R8: d0d6999996000000 R9: 0000009090b0b0ff
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
> R13: 0000000000000001 R14: 0000000000000001 R15: 000000000000000e
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
> #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
> #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
> #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
> #10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
> #11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
> #12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
> #13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
> RIP: 00000039294cc557 RSP: 00007fff54c4aec8 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
> RDX: 00007fff54c4aee0 RSI: 0000000000004b72 RDI: 0000000000000003
> RBP: 000000001d747ab0 R8: 0000000000000010 R9: 0000000000800000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
> R13: 0000000000000200 R14: 0000000000000008 R15: 0000000000000008
> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
>
>> Date: Fri, 21 Jan 2011 14:45:14 -0500
>> Subject: Re: mem_sharing: summarized problems when domain is dying
>> From: juihaochiang@gmail.com
>> To: Tim.Deegan@citrix.com
>> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>>
>> Hi
>>
>> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com>
>> wrote:
>> > Hi, Tim:
>> >
>> > From tinnycloud's results, here I summarize the current problems and
>> > findings about mem_sharing when a domain is dying.
>> > (1) When a domain is dying, alloc_domheap_page() and
>> > set_shared_p2m_entry() simply fail. So shr_lock is not enough to
>> > ensure that the domain won't die in the middle of the mem_sharing code.
>> > As tinnycloud's code shows, would it be better to use
>> > rcu_lock_domain_by_id() before calling the above two functions?
>> >
>>
>> There seems to be no good locking to prevent a domain from changing
>> its is_dying state. So the unshare function could fail midway at
>> several points, e.g., alloc_domheap_page() and set_shared_p2m_entry().
>> If that's the case, we need to add some checks, and probably revert
>> what we have already done when is_dying changes in the middle.
>>
>> Any comments?
>>
>> Jui-Hao
>
* Linux Guest Crash on stress test of memory sharing
2011-01-24 13:14 ` MaoXiaoyun
2011-01-24 14:08 ` George Dunlap
@ 2011-01-25 4:13 ` MaoXiaoyun
[not found] ` <BLU157-w350046650B3C4960C4B1F2DAFC0@phx.gbl>
1 sibling, 1 reply; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-25 4:13 UTC (permalink / raw)
To: xen devel; +Cc: george.dunlap, zpfalpc23, tim.deegan, juihaochiang
Hi:
Following George's suggestion, I am submitting the bug in this new thread.
I start 24 Linux HVMs on a physical host, each of which reboots through "xm reboot" every 30 minutes.
After several hours, some of the HVMs crash.
All of the crashed HVMs are stopped during booting.
The bug still exists even if I forbid page sharing by making tapdisk believe that xc_memshr_nominate_gref()
returned failure. There is no bug if memory sharing is disabled.
(This means only mem_sharing_nominate_page() is called, and in mem_sharing_nominate_page()
the page type is set to p2m_shared, so the page needs to be unshared later when someone tries to use it.)
I remember there is a call path in memory sharing,
hvm_hap_nested_page_fault() -> mem_sharing_unshare_page();
compared to the crash dump, it might indicate a connection.
DomU kernel is from ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-164.el5.src.rpm
Xen version: 4.0.0
crash dump stack :
crash> bt -l
PID: 2422 TASK: ffff810013b40860 CPU: 1 COMMAND: "setfont"
#0 [ffff810012cef900] xen_panic_event at ffffffff88001d28
#1 [ffff810012cef920] notifier_call_chain at ffffffff80066eaa
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
#2 [ffff810012cef940] panic at ffffffff8009094a
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
#3 [ffff810012cefa30] oops_end at ffffffff80064fca
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 539
#4 [ffff810012cefa40] do_page_fault at ffffffff80066dc0
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 591
#5 [ffff810012cefb30] error_exit at ffffffff8005dde9
[exception RIP: vgacon_do_font_op+435]
RIP: ffffffff8005162d RSP: ffff810012cefbe8 RFLAGS: 00010287
RAX: ffff8100000a6000 RBX: ffffffff804b3740 RCX: ffff8100000a4ae0
RDX: ffff810012d16ae1 RSI: ffff810012d14000 RDI: ffffffff803244c4
RBP: ffff810012d14000 R8: d0d6999996000000 R9: 0000009090b0b0ff
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: 0000000000000001 R15: 000000000000000e
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff810012cefc20] vgacon_font_set at ffffffff8016bec5
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/video/console/vgacon.c: 1238
#7 [ffff810012cefc60] con_font_op at ffffffff801aa86b
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt.c: 3645
#8 [ffff810012cefcd0] vt_ioctl at ffffffff801a5af4
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt_ioctl.c: 965
#9 [ffff810012cefd70] tty_ioctl at ffffffff80038a2c
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/tty_io.c: 3340
#10 [ffff810012cefeb0] do_ioctl at ffffffff800420d9
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 39
#11 [ffff810012cefed0] vfs_ioctl at ffffffff800302ce
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 500
#12 [ffff810012ceff40] sys_ioctl at ffffffff8004c766
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 520
#13 [ffff810012ceff80] tracesys at ffffffff8005d28d (via system_call)
RIP: 00000039294cc557 RSP: 00007fff1a57ed98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
RDX: 00007fff1a57edb0 RSI: 0000000000004b72 RDI: 0000000000000003
RBP: 000000001e33dab0 R8: 0000000000000010 R9: 0000000000800000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
R13: 0000000000000200 R14: 0000000000000008 R15: 0000000000000008
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
* RE: Linux Guest Crash on stress test of memory sharing
[not found] ` <BLU157-w350046650B3C4960C4B1F2DAFC0@phx.gbl>
@ 2011-01-25 6:23 ` MaoXiaoyun
0 siblings, 0 replies; 12+ messages in thread
From: MaoXiaoyun @ 2011-01-25 6:23 UTC (permalink / raw)
To: xen devel; +Cc: george.dunlap, zpfalpc23, tim.deegan, juihaochiang
Hi:
Most of the core dumps have the same stack as submitted before; here is another stack we have seen.
Thanks.
crash> bt -l
PID: 1 TASK: ffff8100011df7a0 CPU: 0 COMMAND: "init"
#0 [ffff8100011fddf0] xen_panic_event at ffffffff88001d28
#1 [ffff8100011fde10] notifier_call_chain at ffffffff80066eaa
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
#2 [ffff8100011fde30] panic at ffffffff8009094a
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
#3 [ffff8100011fdf20] do_exit at ffffffff80015477
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/exit.c: 835
#4 [ffff8100011fdf80] system_call at ffffffff8005d116
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/entry.S
RIP: 000000000055a5ff RSP: 00007fff2b8c2e10 RFLAGS: 00010246
RAX: 00000000000000e7 RBX: ffffffff8005d116 RCX: 0000000000000047
RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
RBP: 0000000000000000 R8: 00000000000000e7 R9: ffffffffffffffb4
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000604ea8 R14: ffffffff80049281 R15: 0000000000000000
ORIG_RAX: 00000000000000e7 CS: 0033 SS: 002b
crash>
>From: tinnycloud@hotmail.com
>To: tinnycloud@hotmail.com
>Subject: Linux Guest Crash on stress test of memory sharing
>Date: Tue, 25 Jan 2011 13:07:15 +0800
>
>Hi:
>
> Following George's suggestion, I am submitting the bug in this new thread.
>
> I start 24 Linux HVMs on a physical host, each of which reboots through "xm reboot" every 30 minutes.
> After several hours, some of the HVMs crash.
>
> All of the crashed HVMs are stopped during booting.
> The bug still exists even if I forbid page sharing by making tapdisk believe that xc_memshr_nominate_gref()
> returned failure. There is no bug if memory sharing is disabled.
> (This means only mem_sharing_nominate_page() is called, and in mem_sharing_nominate_page()
> the page type is set to p2m_shared, so the page needs to be unshared later when someone tries to use it.)
>
> I remember there is a call path in memory sharing,
> hvm_hap_nested_page_fault() -> mem_sharing_unshare_page();
> compared to the crash dump, it might indicate a connection.
>
>DomU kernel is from ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-164.el5.src.rpm
>Xen version: 4.0.0
>
>crash dump stack :
>
>crash> bt -l
>PID: 2422 TASK: ffff810013b40860 CPU: 1 COMMAND: "setfont"
> #0 [ffff810012cef900] xen_panic_event at ffffffff88001d28
> #1 [ffff810012cef920] notifier_call_chain at ffffffff80066eaa
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
> #2 [ffff810012cef940] panic at ffffffff8009094a
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
> #3 [ffff810012cefa30] oops_end at ffffffff80064fca
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 539
> #4 [ffff810012cefa40] do_page_fault at ffffffff80066dc0
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 591
> #5 [ffff810012cefb30] error_exit at ffffffff8005dde9
> [exception RIP: vgacon_do_font_op+435]
> RIP: ffffffff8005162d RSP: ffff810012cefbe8 RFLAGS: 00010287
> RAX: ffff8100000a6000 RBX: ffffffff804b3740 RCX: ffff8100000a4ae0
> RDX: ffff810012d16ae1 RSI: ffff810012d14000 RDI: ffffffff803244c4
> RBP: ffff810012d14000 R8: d0d6999996000000 R9: 0000009090b0b0ff
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
> R13: 0000000000000001 R14: 0000000000000001 R15: 000000000000000e
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #6 [ffff810012cefc20] vgacon_font_set at ffffffff8016bec5
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/video/console/vgacon.c: 1238
> #7 [ffff810012cefc60] con_font_op at ffffffff801aa86b
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt.c: 3645
> #8 [ffff810012cefcd0] vt_ioctl at ffffffff801a5af4
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt_ioctl.c: 965
> #9 [ffff810012cefd70] tty_ioctl at ffffffff80038a2c
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/tty_io.c: 3340
>#10 [ffff810012cefeb0] do_ioctl at ffffffff800420d9
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 39
>#11 [ffff810012cefed0] vfs_ioctl at ffffffff800302ce
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 500
>#12 [ffff810012ceff40] sys_ioctl at ffffffff8004c766
> /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 520
>#13 [ffff810012ceff80] tracesys at ffffffff8005d28d (via system_call)
> RIP: 00000039294cc557 RSP: 00007fff1a57ed98 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
> RDX: 00007fff1a57edb0 RSI: 0000000000004b72 RDI: 0000000000000003
> RBP: 000000001e33dab0 R8: 0000000000000010 R9: 0000000000800000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
> R13: 0000000000000200 R14: 0000000000000008 R15: 0000000000000008
> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b