* [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
@ 2009-09-17 17:51 George Dunlap
  2009-09-18  7:43 ` Keir Fraser
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2009-09-17 17:51 UTC (permalink / raw)
  To: xen-devel, Keir Fraser, Tim Deegan, Xin Li, Jun Nakajima, Xiaohui Xin

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

ept_sync_domain() is called whenever the p2m changes.  The current
code calls sync on all cpus; this is extremely wasteful, especially
for a single-vcpu VM on a 16-way (2x4x2) box.

This patch will only call sync on cpus where there may be dirty cpu
state.  I've done a basic test, but I'd appreciate someone from Intel
verifying that there shouldn't be any problems.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

[-- Attachment #2: 20090917-unstable-ept-sync.diff --]
[-- Type: text/x-diff, Size: 451 bytes --]

diff -r 0814d57b5a29 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Tue Sep 15 11:13:56 2009 +0100
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Sep 17 18:51:30 2009 +0100
@@ -1219,7 +1219,7 @@
     if ( d->arch.hvm_domain.hap_enabled && d->vcpu && d->vcpu[0] )
     {
         ASSERT(local_irq_is_enabled());
-        on_each_cpu(__ept_sync_domain, d, 1);
+        on_selected_cpus(&d->domain_dirty_cpumask, __ept_sync_domain, d, 1);
     }
 }
 


* Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-17 17:51 [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running George Dunlap
@ 2009-09-18  7:43 ` Keir Fraser
  2009-09-18  9:09   ` George Dunlap
  0 siblings, 1 reply; 8+ messages in thread
From: Keir Fraser @ 2009-09-18  7:43 UTC (permalink / raw)
  To: George Dunlap, xen-devel, Tim Deegan, Xin Li, Jun Nakajima

This is definitely not safe. See Intel SDM Vol 3B Section 24.3.3. EPTP-tagged
cached mappings (that is, partial mappings from guest-phys to host-phys
addresses, tagged by the EPT base pointer value) only need be flushed when a
suitable INVEPT instruction is executed. So a HVM VCPU can leave EPTP-tagged
droppings lying around in other TLBs as it migrates from CPU to CPU -- the
domain_dirty_cpumask does not track this!

The way to fix this is to add 'if ( hap_enabled ) __invept(1,
d->arch.hvm_domain.vmx.ept_control.eptp, 0)' in vmx_ctxt_switch_from(). That
then makes your patch correct.
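
In context, that would look roughly like this (a sketch only, untested;
the surrounding save logic is elided):

static void vmx_ctxt_switch_from(struct vcpu *v)
{
    struct domain *d = v->domain;

    /* Flush this domain's EPTP-tagged TLB entries whenever one of its
     * VCPUs is descheduled, so that domain_dirty_cpumask once again
     * bounds where stale guest-phys mappings can survive. */
    if ( d->arch.hvm_domain.hap_enabled )
        __invept(1 /* single-context invalidation */,
                 d->arch.hvm_domain.vmx.ept_control.eptp, 0);

    /* ... existing save logic (guest MSRs, debug registers, ...) ... */
}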

Care to test this and spin another patch?

 -- Keir

On 17/09/2009 18:51, "George Dunlap" <dunlapg@umich.edu> wrote:

> ept_sync_domain() is called whenever the p2m changes.  The current
> code calls sync on all cpus; this is extremely wasteful, especially
> for a single-vcpu VM on a 16-way (2x4x2) box.
> 
> This patch will only call sync on cpus where there may be dirty cpu
> state.  I've done a basic test, but I'd appreciate someone from Intel
> verifying that there shouldn't be any problems.
> 
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>


* Re: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18  7:43 ` Keir Fraser
@ 2009-09-18  9:09   ` George Dunlap
  2009-09-18 13:21     ` Keir Fraser
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2009-09-18  9:09 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, Xiaohui Xin, xen-devel, Xin Li, Jun Nakajima

Just making sure I understand the failure mode... the risk is that if
we change the p2m for this domain while a vcpu is not running on that
core, and it's scheduled there again, it may have stale mappings that
get used (tagged by the EPT base pointer).

Hmm... any idea the performance implications of flushing the EPT
mappings when switching out?  The main performance problem I need to
fix is the ballooning case.  Without this patch, on my 2x4x2 Nehalem
box, a single-vcpu guest ballooning from 16GiB down to 8GiB for an
initial boot takes 12-15 minutes; with the patch, it takes around 40
seconds.

Perhaps we could start with the flush-ept-on-context-switch-out, and
do some performance testing later to see if it's worth maintaining a
"might-need-an-ept-flush" mask.

 -George

On Fri, Sep 18, 2009 at 8:43 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> This is definitely not safe. See Intel SDM Vol 3B Section 24.3.3. EPTP-tagged
> cached mappings (that is, partial mappings from guest-phys to host-phys
> addresses, tagged by the EPT base pointer value) only need be flushed when a
> suitable INVEPT instruction is executed. So a HVM VCPU can leave EPTP-tagged
> droppings lying around in other TLBs as it migrates from CPU to CPU -- the
> domain_dirty_cpumask does not track this!
>
> The way to fix this is to add 'if ( hap_enabled ) __invept(1,
> d->arch.hvm_domain.vmx.ept_control.eptp, 0)' in vmx_ctxt_switch_from(). That
> then makes your patch correct.
>
> Care to test this and spin another patch?
>
>  -- Keir
>
> On 17/09/2009 18:51, "George Dunlap" <dunlapg@umich.edu> wrote:
>
>> ept_sync_domain() is called whenever the p2m changes.  The current
>> code calls sync on all cpus; this is extremely wasteful, especially
>> for a single-vcpu VM on a 16-way (2x4x2) box.
>>
>> This patch will only call sync on cpus where there may be dirty cpu
>> state.  I've done a basic test, but I'd appreciate someone from Intel
>> verifying that there shouldn't be any problems.
>>
>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>


* Re: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18  9:09   ` George Dunlap
@ 2009-09-18 13:21     ` Keir Fraser
  2009-09-18 14:55       ` George Dunlap
  0 siblings, 1 reply; 8+ messages in thread
From: Keir Fraser @ 2009-09-18 13:21 UTC (permalink / raw)
  To: George Dunlap; +Cc: Tim Deegan, Xiaohui Xin, xen-devel, Xin Li, Jun Nakajima

On 18/09/2009 10:09, "George Dunlap" <dunlapg@umich.edu> wrote:

> Just making sure I understand the failure mode... the risk is that if
> we change the p2m for this domain while a vcpu is not running on that
> core, and it's scheduled there again, it may have stale mappings that
> get used (tagged by the EPT base pointer).

Correct.

> Hmm... any idea the performance implications of flushing the EPT
> mappings when switching out?  The main performance problem I need to
> fix is the ballooning case.  Without this patch, on my 2x4x2 Nehalem
> box, a single-vcpu guest ballooning from 16GiB down to 8GiB for an
> initial boot takes 12-15 minutes; with the patch, it takes around 40
> seconds.

How bad is it if you don't flush at all? 40s is still not great and maybe we
could batch the INVEPT invocations? That might also greatly reduce the cost
of the current naïve flush-all implementation, to a point where it is
actually acceptable.

> Perhaps we could start with the flush-ept-on-context-switch-out, and
> do some performance testing later to see if it's worth maintaining a
> "might-need-an-ept-flush" mask.

Well, that's a possible alternative, yes. Just needs a bit of care in
implementation but it wouldn't need much extra code.

 -- Keir


* Re: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18 13:21     ` Keir Fraser
@ 2009-09-18 14:55       ` George Dunlap
  2009-09-18 16:51         ` Paul Durrant
                           ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: George Dunlap @ 2009-09-18 14:55 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Xiaohui Xin, xen-devel, Tim Deegan, Xin Li, Jun Nakajima, Paul Durrant

On Fri, Sep 18, 2009 at 2:21 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> How bad is it if you don't flush at all? 40s is still not great and maybe we
> could batch the INVEPT invocations? That might also greatly reduce the cost
> of the current naïve flush-all implementation, to a point where it is
> actually acceptable.

In theory, the memory handed back by the balloon driver shouldn't be
touched by the OS.  I think it would be OK if accesses to that gfn
space by the guest giving up the pages didn't fail; however, we can't
give the memory to another guest until we know for sure that the first
guest can't possibly access it anymore.  I think we should be able to
modify the balloon driver to "batch" some number of updates; say, 1024
at a time.  Paul, any thoughts on this?

The other thing is the level of indirection -- we have to add a
parameter to set_p2m_entry() that says, "Don't sync this right away",
and then add another function that says, "Now commit all the changes I
just made".  That may take some thought to do correctly.
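
Hypothetically, something of this shape (the deferral parameter and the
explicit commit step are invented for illustration; set_p2m_entry()'s
real signature may differ):

/* Hypothetical caller, e.g. a batched balloon/p2m teardown path.
 * Batch many p2m updates without paying a flush for each... */
for ( i = 0; i < nr; i++ )
    set_p2m_entry(d, gfns[i], _mfn(INVALID_MFN), 0,
                  p2m_invalid /*, sync_deferred = 1 -- hypothetical */);

/* ...then commit all the changes just made with a single sync. */
ept_sync_domain(d);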

I think avoiding flush-all is still a good idea, as we have other
things like populate-on-demand and zero-page sweep doing lots of
modifications of the p2m table during boot that can't be batched this
way.  For example, the zero sweep *must* remove the page from the p2m
before doing a final scan to make sure it's still zero.  I suppose we
could scan a list of pages, remove them all from the p2m (taking the
flush-all), and then scan them again... but it seems like now we're
starting to get a lot more complicated than just keeping a mask or two
around.

Thoughts?

I think starting with a "flush-on-switch-to" would be good; it should
be fairly straightforward to make that flush happen only when:
* the domain has run on that cpu before, and
* the domain has had p2m changes since the last time it ran

 -George


* Re: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18 14:55       ` George Dunlap
@ 2009-09-18 16:51         ` Paul Durrant
  2009-09-18 17:06         ` Dan Magenheimer
  2009-09-19 10:34         ` Keir Fraser
  2 siblings, 0 replies; 8+ messages in thread
From: Paul Durrant @ 2009-09-18 16:51 UTC (permalink / raw)
  To: George Dunlap
  Cc: Xiaohui Xin, xen-devel, Tim Deegan, Xin Li, Keir Fraser, Jun Nakajima

George Dunlap wrote:
> 
> In theory, the memory handed back by the balloon driver shouldn't be
> touched by the OS.  I think it would be OK if accesses to that gfn
> space by the guest giving up the pages didn't fail; however, we can't
> give the memory to another guest until we know for sure that the first
> guest can't possibly access it anymore.  I think we should be able to
> modify the balloon driver to "batch" some number of updates; say, 1024
> at a time.  Paul, any thoughts on this?
>

The decrease_reservation op is called with however many PFNs can fit 
into the Windows MDL structure (8184 for a 32-bit VM) so we already do 
things in pretty big batches.
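
(That is, roughly this shape at the hypercall level -- a sketch based on
the classic Linux balloon driver, since the Windows PV driver's
internals differ; the function name here is illustrative:)

static long balloon_release_batch(xen_pfn_t *frame_list,
                                  unsigned long nr_pages)
{
    struct xen_memory_reservation reservation = {
        .extent_order = 0,          /* order-0, i.e. 4kB pages */
        .domid        = DOMID_SELF,
    };

    /* frame_list[] holds the gmfns being handed back to Xen. */
    set_xen_guest_handle(reservation.extent_start, frame_list);
    reservation.nr_extents = nr_pages;   /* one hypercall per batch */

    /* Returns the number of extents actually released. */
    return HYPERVISOR_memory_op(XENMEM_decrease_reservation,
                                &reservation);
}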

   Paul

-- 
===============================
Paul Durrant, Software Engineer

Citrix Systems (R&D) Ltd.
First Floor, Building 101
Cambridge Science Park
Milton Road
Cambridge CB4 0FY
United Kingdom

TEL: x35957 (+44 1223 225957)
===============================


* RE: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18 14:55       ` George Dunlap
  2009-09-18 16:51         ` Paul Durrant
@ 2009-09-18 17:06         ` Dan Magenheimer
  2009-09-19 10:34         ` Keir Fraser
  2 siblings, 0 replies; 8+ messages in thread
From: Dan Magenheimer @ 2009-09-18 17:06 UTC (permalink / raw)
  To: George Dunlap, Keir Fraser; +Cc: Xiaohui Xin, xen-devel, Tim Deegan

> In theory, the memory handed back by the balloon driver shouldn't be
> touched by the OS.  I think it would be OK if accesses to that gfn
> space by the guest giving up the pages didn't fail; however, we can't
> give the memory to another guest until we know for sure that the first
> guest can't possibly access it anymore.  I think we should be able to
> modify the balloon driver to "batch" some number of updates; say, 1024
> at a time.  Paul, any thoughts on this?

Not sure if this is relevant, but tmem can and will grab
memory handed back by the balloon driver for one guest
and immediately use it for another guest's data.
So I hope the ballooning guest can't possibly access it
anymore!

Dan


* Re: Re: [PATCH] EPT: Only sync pcpus on which a domain's vcpus might be running
  2009-09-18 14:55       ` George Dunlap
  2009-09-18 16:51         ` Paul Durrant
  2009-09-18 17:06         ` Dan Magenheimer
@ 2009-09-19 10:34         ` Keir Fraser
  2 siblings, 0 replies; 8+ messages in thread
From: Keir Fraser @ 2009-09-19 10:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Xiaohui Xin, xen-devel, Tim Deegan, Paul Durrant, Xin Li, Jun Nakajima

On 18/09/2009 15:55, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> I think starting with a "flush-on-switch-to" would be good; it should
> be fairly straightforward to make that flush happen only when:
> * the domain has run on that cpu before, and
> * the domain has had p2m changes since the last time it ran

Well, I'm okay with that, but I will wait for a tested patch from you.

 -- Keir
