* [RFC][PATCH] 0/9 Populate-on-demand memory
@ 2008-12-23 12:55 George Dunlap
  2008-12-23 19:06 ` Dan Magenheimer
  2008-12-24  1:46 ` Tian, Kevin
  0 siblings, 2 replies; 19+ messages in thread
From: George Dunlap @ 2008-12-23 12:55 UTC (permalink / raw)
  To: xen-devel

This set of patches introduces the mechanisms and interfaces needed to
implement populate-on-demand memory.  The purpose of
populate-on-demand memory is to allow non-paravirtualized guests (such
as Windows or Linux HVM guests) to boot in a ballooned state.

BACKGROUND

When non-PV domains boot, they typically read the e820 map to
determine how much memory they have, and then assume that much memory
thereafter.  Memory requirements can be reduced using a balloon
driver, but memory cannot be increased past this initial value.
Currently, this means that a non-PV domain must be booted with the
maximum amount of memory you want that VM ever to be able to use.

Populate-on-demand allows us to "boot ballooned", in the following manner:
* Mark the entire range of memory (memory_static_max aka maxmem) with
a new p2m type, populate_on_demand, reporting memory_static_max in the
e820 map.  No memory is allocated at this stage.
* Allocate the "memory_dynamic_max" (aka "target") amount of memory
for a "PoD cache".  This memory is kept on a separate list in the
domain struct.
* Boot the guest.
* Populate the p2m table on demand, as it is accessed, with pages from
the PoD cache (sketched below).
* When the balloon driver loads, it inflates the balloon size to
(maxmem - target), giving the memory back to Xen.  When this is
accomplished, the "populate-on-demand" portion of boot is effectively
finished.
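
In pseudo-C, the demand-populate path looks roughly like the sketch
below.  This is illustrative only: pod_cache_get() and
pod_emergency_sweep() are stand-in names, and the p2m call is
simplified, so don't read it as the actual patch code.

    /* Illustrative sketch, not the actual patch code. */
    static int pod_demand_populate(struct domain *d, unsigned long gfn)
    {
        /* Back the faulting gfn with a page from the PoD cache... */
        struct page_info *page = pod_cache_get(d);

        /* ...or, if the cache is empty, try to reclaim zeroed guest
         * pages (the "emergency sweep" described below). */
        if ( page == NULL )
            page = pod_emergency_sweep(d);

        if ( page == NULL )
            return -ENOMEM;  /* nothing left to back this entry */

        /* Turn the populate_on_demand entry into a real RAM mapping. */
        set_p2m_entry(d, gfn, page_to_mfn(page), p2m_ram_rw);
        return 0;
    }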

One complication is that many operating systems have start-of-day page
scrubbers, which touch all of memory to zero it.  This scrubber may
run before the balloon driver can return memory to Xen.  These zeroed
pages, however, don't contain any information; we can safely replace
them with PoD entries again.  So when we run out of PoD cache, we do
an "emergency sweep" to look for zero pages we can reclaim for the
populate-on-demand cache.  When we find a page range which is entirely
zero, we mark the gfn range PoD again, and put the memory back into
the PoD cache.
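
The reclaim test itself is simple: a gfn qualifies only if every byte
of its page is zero.  Below is a self-contained sketch of that check,
operating on a plain buffer rather than on a mapped guest frame as the
real patch does:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Core of the sweep test: a page can be replaced by a PoD entry
     * again only if it carries no information, i.e. is entirely zero. */
    static bool page_is_reclaimable(const void *page)
    {
        const uint64_t *p = page;
        size_t i;

        for ( i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++ )
            if ( p[i] != 0 )
                return false;

        return true;
    }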

NB that this code is designed to work only in conjunction with a
balloon driver.  If the balloon driver is not loaded, eventually all
pages will be dirtied (non-zero), the emergency sweep will fail, and
there will be no memory to back outstanding PoD pages.  When this
happens, the domain will crash.

The code works for both shadow mode and HAP mode; it has been tested
with NPT/RVI and shadow, but not yet with EPT.  It also attempts to
avoid splintering superpages, to allow HAP to function more
effectively.

To use:
* Ensure that you have a functioning balloon driver in the guest
(e.g., xen_balloon.ko for Linux HVM guests).
* Set maxmem/memory_static_max to one value, and
memory/memory_dynamic_max to another when creating the domain; e.g.:
 # xm create debian-hvm maxmem=512 memory=256

The patches are as follows:
01 - Add a p2m_query_type to core gfn_to_mfn*() functions.

02 - Change some gfn_to_mfn() calls to gfn_to_mfn_query(), which will
not populate PoD entries.  Specifically, since gfn_to_mfn() may grab
the p2m lock, it must not be called while the shadow lock is held.

03 - Populate-on-demand core.  Introduce new p2m type, PoD cache
structures, and core functionality.  Add PoD checking to audit_p2m().
Add PoD information to the 'q' debug key.

04 - Implement p2m_decrease_reservation.  As the balloon driver
returns gfns to Xen, this handles PoD entries properly; if necessary,
it also "steals" memory being returned, putting it into the PoD cache
instead of freeing it.

05 - Emergency sweep: implement an emergency sweep for zeroed memory
when the cache is low.  If it finds pages (or page ranges) that are
entirely zero, it replaces each entry with a PoD entry again,
reclaiming the memory for the PoD cache.

06 - Deal with splintering, both of PoD pages (to back singleton PoD
entries) and of PoD ranges.

07 - Xen interface for populate-on-demand functionality: a PoD flag for
populate_physmap, and {get,set}_pod_target for interacting with the PoD
cache.  set_pod_target() should be called for any domain that may have
PoD entries.  It will increase the size of the cache if necessary, but
will never decrease it.  (Shrinking happens as the balloon driver
balloons down.)

08 - libxc interface.  Add new libxc functions (a usage sketch follows
the patch list):
+ xc_hvm_build_target_mem(), which accepts memsize and target.  If
these are equal, PoD functionality is not invoked.  Otherwise, memsize
is marked PoD, and target MiB are allocated to the PoD cache.
+ xc_[sg]et_pod_target(): get / set the PoD target.  set_pod_target()
should be called whenever you change the guest memory target on a domain
which may have outstanding PoD entries.  This may increase the size of
the PoD cache up to the number of outstanding PoD entries, but will
not reduce the size of the cache.  (The cache may be reduced as the
balloon driver returns gfn space to Xen.)

09 - xend integration.
+ Always calls xc_hvm_build_target_mem() with memsize=maxmem and
target=memory.  If these are the same, the internal function will not
use PoD.
+ Calls xc_set_target_mem() whenever a domain's target is changed.
Also calls balloon.free(), causing dom0 itself to balloon down if there
would not otherwise be enough memory.
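
As an illustration of how a toolstack might drive the new libxc calls,
see the sketch below.  The prototypes are inferred from the
descriptions of patches 08 and 09 above; treat the argument order,
types, and units as assumptions, not the final libxc API.

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sketch only: prototypes/units are assumptions based on the patch
     * descriptions above, not the final libxc code. */
    static int build_pod_guest(int xc_handle, uint32_t domid,
                               int maxmem_mib, int target_mib,
                               const char *image)
    {
        /* memsize = maxmem, target = memory.  If they are equal, PoD
         * is not used at all. */
        int rc = xc_hvm_build_target_mem(xc_handle, domid,
                                         maxmem_mib, target_mib, image);
        if ( rc != 0 )
            return rc;

        /* Whenever the guest's target changes later on, tell Xen so the
         * PoD cache can grow to cover any outstanding PoD entries. */
        return xc_set_pod_target(xc_handle, domid,
                                 (uint64_t)target_mib << (20 - 12) /* pages */);
    }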

Things still to do:
* When reduce_reservation() is called with a superpage, keep the
superpage intact.
* Create a hypercall continuation for set_pod_target.


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-23 12:55 [RFC][PATCH] 0/9 Populate-on-demand memory George Dunlap
@ 2008-12-23 19:06 ` Dan Magenheimer
  2008-12-24 13:55   ` George Dunlap
  2008-12-24  1:46 ` Tian, Kevin
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Magenheimer @ 2008-12-23 19:06 UTC (permalink / raw)
  To: George Dunlap, xen-devel

Very nice!

One thing that might be worth adding to the requirements list or
README is that this approach (or any which depends on ballooning)
will now almost certainly require any participating HVM domain
to have an adequately-sized, properly-configured swap disk.
Ballooning is insufficiently responsive to grow memory fast
enough to handle the rapidly growing memory needs of an active domain.
The consequence with no swap disk is application failures;
the consequence even if a swap disk IS configured is temporarily
very poor performance.

I'm working on fixing that (at least on pv domains).  Watch
this list after the new year.

So this won't work for any domain that does start-of-day
scrubbing with a non-zero value?  I suppose that's OK.

Happy holidays to all!

Dan


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-23 12:55 [RFC][PATCH] 0/9 Populate-on-demand memory George Dunlap
  2008-12-23 19:06 ` Dan Magenheimer
@ 2008-12-24  1:46 ` Tian, Kevin
  2008-12-24 14:42   ` George Dunlap
  1 sibling, 1 reply; 19+ messages in thread
From: Tian, Kevin @ 2008-12-24  1:46 UTC (permalink / raw)
  To: 'George Dunlap', xen-devel

>From: George Dunlap
>Sent: Tuesday, December 23, 2008 8:55 PM
>BACKGROUND
>
>When non-PV domains boots, they typically read the e820 maps to
>determine how much memory they have, and then assume that much memory
>thereafter.  Memory requirements can be reduced using a balloon
>driver, but it cannot be increased past this initial value.

Isn't that also true for PV guests?  Unless the guest supports memory
hot-add, the balloon driver can never increase memory past the initial
maximum.  But your patch is nice, since more VMs can be created without
the hard boot-time limitation quoted below.

>Currently, this means that a non-PV domain must be booted with the
>maximum amount of memory you want that VM every to be able to use.
>
>Populate-on-demand allows us to "boot ballooned", in the 
>following manner:
>* Mark the entire range of memory (memory_static_max aka maxmem) with
>a new p2m type, populate_on_demand, reporting memory_static_max in th
>e820 map.  No memory is allocated at this stage.
>* Allocate the "memory_dynamic_max" (aka "target") amount of memory
>for a "PoD cache".  This memory is kept on a separate list in the
>domain struct.
>* Boot the guest.
>* Populate the p2m table on-demand as it's accessed with pages from
>the PoD cache.
>* When the balloon driver loads, it inflates the balloon size to
>(maxmem - target), giving the memory back to Xen.  When this is
>accomplished, the "populate-on-demand" portion of boot is effectively
>finished.
>

Another tricky point could be VT-d.  If a guest page is used as a DMA
target before the balloon driver is installed, and there is no earlier
access to that page (e.g. by the start-of-day scrubber), then the PoD
action will not be triggered...  I'm not sure how likely such a condition
is, but you may need to give it some thought or add a guard for it.
Hmm... after more thinking, PoD pages may actually still be live even
after the balloon driver is installed.  I guess that before coming up
with a solution, you may want to add a check on whether the target domain
has a passthrough device, to decide on the fly whether this feature is enabled.

PoD is in any case a bit different from the balloon driver, since the
latter claims ownership of ballooned pages, which then will not be used
as DMA targets within the guest.

>
>NB that this code is designed to work only in conjunction with a
>balloon driver.  If the balloon driver is not loaded, eventually all
>pages will be dirtied (non-zero), the emergency sweep will fail, and
>there will be no memory to back outstanding PoD pages.  When this
>happens, the domain will crash.

In that case, would it be better to increase the PoD target to the
configured max mem?  It seems uncomfortable to crash a domain just
because some optimization doesn't apply. :-)

Last, do you have any performance data on how this patch may impact
the boot process, or even some workload after login?

Thanks,
Kevin


* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-23 19:06 ` Dan Magenheimer
@ 2008-12-24 13:55   ` George Dunlap
  2008-12-24 14:32     ` Dan Magenheimer
  0 siblings, 1 reply; 19+ messages in thread
From: George Dunlap @ 2008-12-24 13:55 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel

On Tue, Dec 23, 2008 at 7:06 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> Very nice!

Thanks!

> One thing that might be worth adding to the requirements list or
> README is that this approach (or any which depends on ballooning)
> will now almost certainly require any participating hvm domain
> to have an adequately-sized properly-configured swap disk.
> Ballooning is insufficiently responsive to grow memory fast
> enough to handle rapidly growing memory needs of an active domain
> The consequence for a no-swap-disk is application failures
> and the consequence even if a swap disk IS configured is temporarily
> very poor performance.

I don't think this is particular to the PoD patches, or even
ballooning per se.  A swap disk would be required any time you boot
with a small amount of memory, whether it could be increased or not.

But you're right, in that this differs from a typical operating
system's "demang-paging" mechanism, where the goal is to give a
process only the memory it actually needs, so you can use it for other
processes.  You're still allocating a fixed amount of memory to a
guest at start-up.  The un-populated memory is not available to use by
other VMs, and allocating more memory is a (relatively) slow process.
I guess a brief note pointing out the difference between "populate on
demand" and "allocate on demand" would be useful.

> So this won't work for any domain that does start-of-day
> scrubbing with a non-zero value?  I suppose that's OK.

Not if the scrubber might win the race against the balloon driver. :-)
 If this really becomes an issue, it should be straightforward to add
functionality to handle it.  It just requires having a simple way of
specifying what "scrubbed" pages look like, an extra p2m type for "PoD
scrubbed" (rather than PoD zero, the default), and how to change from
scrubbed <-> zero.

Did you have a particular system in mind?

-George


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 13:55   ` George Dunlap
@ 2008-12-24 14:32     ` Dan Magenheimer
  2008-12-24 15:13       ` George Dunlap
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Magenheimer @ 2008-12-24 14:32 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

> > The consequence for a no-swap-disk is application failures
> > and the consequence even if a swap disk IS configured is temporarily
> > very poor performance.
> 
> I don't think this is particular to the PoD patches, or even
> ballooning per se.  A swap disk would be required any time you boot
> with a small amount of memory, whether it could be increased or not.
>
> But you're right, in that this differs from a typical operating
> system's "demang-paging" mechanism, where the goal is to give a
> process only the memory it actually needs, so you can use it for other
> processes.  You're still allocating a fixed amount of memory to a
> guest at start-up.  The un-populated memory is not available to use by
> other VMs, and allocating more memory is a (relatively) slow process.
> I guess a brief note pointing out the difference between "populate on
> demand" and "allocate on demand" would be useful.

Yes, it's just that with your fix, Windows VM users are much more
likely to use memory overcommit and will need to be "trained" to
always configure a swap disk to ensure bad things don't happen.
And this swap disk had better be on a network-based medium or
live migration won't work.
 
> > So this won't work for any domain that does start-of-day
> > scrubbing with a non-zero value?  I suppose that's OK.
> 
> Not if the scrubber might win the race against the balloon driver. :-)
>  If this really becomes an issue, it should be straightforward to add
> functionality to handle it.  It just requires having a simple way of
> specifying what "scrubbed" pages look like, an extra p2m type for "PoD
> scrubbed" (rather than PoD zero, the default), and how to change from
> scrubbed <-> zero.
> 
> Did you have a particular system in mind?

No, I had just given some limited thought to this problem previously,
had considered the idea of sharing a zero page for the Windows
start-of-day scrubbing problem, but didn't know if the scrubbing
always only used zeroes.  If it does, great!  I was worried that
something like a secure version of Windows might use some other random
bit pattern, but I'll bet Windows elsewhere assumes that all pages
start as zero-filled and is thus dependent on start-of-day ZERO
scrubbing, so I'll bet your approach will always work.


* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24  1:46 ` Tian, Kevin
@ 2008-12-24 14:42   ` George Dunlap
  2008-12-24 15:35     ` Dan Magenheimer
  2008-12-25  2:36     ` Tian, Kevin
  0 siblings, 2 replies; 19+ messages in thread
From: George Dunlap @ 2008-12-24 14:42 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: xen-devel

On Wed, Dec 24, 2008 at 1:46 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>>* When the balloon driver loads, it inflates the balloon size to
>>(maxmem - target), giving the memory back to Xen.  When this is
>>accomplished, the "populate-on-demand" portion of boot is effectively
>>finished.
>>
>
> Another tricky point could be with VT-d. If one guest page is used as
> DMA target before balloon driver is installed, and no early access on
> that page (like start-of-day scrubber), then PoD action will not be triggered...
> Not sure the possibility of such condition, but you may need to have
> some thought or guard on that. em... after more thinking, actually PoD
> pages may be alive even after balloon driver is installed. I guess before
> coming up a solution you may add a check on whether target domain
> has passthrough device to decide whether this feature is on on-the-fly.

Hmm, I haven't looked at VT-d integration; it at least requires some
examination.  How are gfns translated to mfns for the VT-d hardware?
Does it use the hardware EPT tables?  Is the transaction re-startable
if we get an EPT fault and then fix the EPT table?

Any time gfn_to_mfn() is called, unless it's specifically called with
the "query" type, the gfn is populated.  That's why qemu, the domain
builder, &c work currently without any modifications.  But if VT-d
uses the EPT tables to translate requests for a guest in hardware, and
the device requests can't be easily re-started after an EPT fault,
then this won't work.

A second issue is with the emergency sweep: if a page which happens to
be zero ends up being the target of a DMA, we may get:
* Device request to write to gfn X, which translates to mfn Y.
* Demand-fault on gfn Z, with no pages in the cache.
* Emergency sweep scans through gfn space, finds that mfn Y is empty.
It replaces gfn X with a PoD entry, and puts mfn Y behind gfn Z.
* The request finishes.  Either the request then fails (because EPT
translation for gfn X is not valid anymore), or it silently succeeds
in writing to mfn Y, which is now behind gfn Z instead of gfn X.

If we can't tell that there's an outstanding I/O on the page, then we
can't do an emergency sweep.  If we have some way of knowing that
there's *some* outstanding I/O to *some* page, we could pause the
guest until the I/O completes, then do the sweep.

At any rate, until we have that worked out, we should probably add
some "seatbelt" code to make sure that people don't use PoD for a VT-d
enabled domain.  I know absolutely nothing about the VT-d code; could
you either write a patch to do this check, or give me an idea of the
simplest thing to check?

>>NB that this code is designed to work only in conjunction with a
>>balloon driver.  If the balloon driver is not loaded, eventually all
>>pages will be dirtied (non-zero), the emergency sweep will fail, and
>>there will be no memory to back outstanding PoD pages.  When this
>>happens, the domain will crash.
>
> In that case, is it better to increase PoD target to configured max mem?
> It looks uncomfortable to crash a domain just because some optimization
> doesn't apply. :-)

If this happened, it wouldn't be because an optimization didn't apply,
but because we purposely tried to use a feature for which a key
component failed or wasn't properly in place.  If we set up a domain
with VT-d access on a box with no VT-d hardware, it would fail as well
-- just during boot, not 5 minutes after it. :-)

We could try to allocate a new page at that point; but it's likely that
the allocation will fail unless there happens to be memory lying
around somewhere, not used by dom0 or any other domain.  And if that
were the case, why not just start it with that much memory to begin
with?

The only way to make this more robust would be to pause the domain,
send a message back to xend, have it try to balloon down domain 0 (or
possibly other domains), increase the PoD cache size, and then unpause
the domain again.  This is not only a lot of work, but many of the
failure modes will be really hard to handle; e.g., if qemu makes a
hypercall that ends up doing a gfn_to_mfn() translation which fails,
we would need to make that whole operation re-startable.  I did look
at this, but it's a ton of work, and a lot of code changes (including
interface changes between Xen and dom0 components), for a situation
which really should never happen in a properly configured system.
There's no reason that with a balloon driver which loads during boot,
and a properly configured target (i.e., not unreasonably small), the
driver shouldn't be able to quickly reach its target.

> Last, do you have any performance data on how this patch may impact
> the boot process, or even some workload after login?

I do not have any solid numbers.  Perceptually, I haven't noticed
anything too slow.  I'll do some simple benchmarks.

 -George


* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 14:32     ` Dan Magenheimer
@ 2008-12-24 15:13       ` George Dunlap
  2008-12-24 15:54         ` Dan Magenheimer
  0 siblings, 1 reply; 19+ messages in thread
From: George Dunlap @ 2008-12-24 15:13 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel

On Wed, Dec 24, 2008 at 2:32 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> Yes, its just that with your fix, Windows VM users are much more
> likely to use memory overcommit and will need to be "trained" to
> always configure a swap disk to ensure bad things don't happen.
> And this swap disk had better be on a network-based medium or
> live migration won't work.

You mean they may be much more likely to under-provision memory to
their VMs, booting with (say) 64M on the assumption that they can
balloon it up to 512M if they want to?  That seems rather unlikely to
me... if they're not likely to start a Windows VM with 64M normally,
why would they be more likely to start with 64M now?  I'd've thought
it would be likely to go the other way: if they normally boot a guest
with 256M, they can now start with maxmem=1G and memory=256M, and
balloon it up if they want.

> No, I had just given some limited thought to this problem previously,
> had considered the idea of sharing a zero page for the Windows
> start-of-day scrubbing problem, but didn't know if the scrubbing
> always only used zeroes.  If it does, great!  I was worried that
> something like a secure version of Windows might use some other random
> bit pattern, but I'll bet Windows elsewhere assumes that all pages
> start as zero-filled and is thus dependent on start-of-day ZERO
> scrubbing, so I'll bet your approach will always work.

AIUI, Windows has two "free page" lists: zeroed, and dirty.  The
scrubber moves pages from the dirty list to the zero list.  Most of
the page allocation interfaces promise zeroed pages, as would mapping
"anonymous" process memory (not sure the Windows term for that).  So
the most useful state for an un-allocated page to be in is zero,
because there's a high probability that it will have to be zeroed
before it's used anyway.

At any rate, we can cross that bridge if we ever come to it. :-)

 -George


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 14:42   ` George Dunlap
@ 2008-12-24 15:35     ` Dan Magenheimer
  2008-12-24 15:46       ` George Dunlap
  2008-12-25  2:47       ` Tian, Kevin
  2008-12-25  2:36     ` Tian, Kevin
  1 sibling, 2 replies; 19+ messages in thread
From: Dan Magenheimer @ 2008-12-24 15:35 UTC (permalink / raw)
  To: George Dunlap, Tian, Kevin; +Cc: xen-devel

> We could to allocate a new page at that point; but it's likely that
> the allocation will fail unless there happens to be memory lying
> around somewhere, not used by dom0 or any other doamin.  And if that
> were the case, why not just start it with that much memory to begin
> with?

Actually, if dom0_mem is used rather than the default of letting
domain0 absorb all free memory and dole it out as needed to launching
VMs, there will almost always be some memory lying around.

And in the not-to-distant future, when live migration is
more widely used, there had better be memory lying around
or migration won't work.

As for "why not just start it with that much memory to begin
with?"... because in most environments VMs are sized once
(e.g. 512MB) and almost never changed... because sysadmins
rarely want to be bothered with constantly fine tuning
just to use an extra spare few MB of memory.

That's why your patch is so important!

Dan


* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 15:35     ` Dan Magenheimer
@ 2008-12-24 15:46       ` George Dunlap
  2008-12-30  9:26         ` Tim Deegan
  2008-12-25  2:47       ` Tian, Kevin
  1 sibling, 1 reply; 19+ messages in thread
From: George Dunlap @ 2008-12-24 15:46 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Tian, Kevin, xen-devel

On Wed, Dec 24, 2008 at 3:35 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>> We could to allocate a new page at that point; but it's likely that
>> the allocation will fail unless there happens to be memory lying
>> around somewhere, not used by dom0 or any other doamin.  And if that
>> were the case, why not just start it with that much memory to begin
>> with?
>
> Actually, if dom0_mem is used rather than the default of letting
> domain0 absorb all free memory and dole it as needed to launching
> VMs, there will almost always be some memory lying around.

At any rate, I suppose it might not be a bad idea to *try* to allocate
more memory in an emergency.  I'll add that to the list of
improvements.

 -George


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 15:13       ` George Dunlap
@ 2008-12-24 15:54         ` Dan Magenheimer
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Magenheimer @ 2008-12-24 15:54 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

> On Wed, Dec 24, 2008 at 2:32 PM, Dan Magenheimer
> <dan.magenheimer@oracle.com> wrote:
> > Yes, its just that with your fix, Windows VM users are much more
> > likely to use memory overcommit and will need to be "trained" to
> > always configure a swap disk to ensure bad things don't happen.
> > And this swap disk had better be on a network-based medium or
> > live migration won't work.
> 
> You mean they may be much more likely to under-provision memory to
> their VMs, booting with (say) 64M on the assumption that they can
> balloon it up to 512M if they want to?  That seems rather unlikely to
> me... if they're not likely to start a Windows VM with 64M normally,
> why would they be more likely to start with 64M now?  I'd've thought
> it would be likely to go the other way: if they normally boot a guest
> with 256M, they can now start with maxmem=1G and memory=256M, and
> balloon it up if they want.

What I mean is that now that they CAN start with memory=256M and
maxmem=1G, it is much more likely that ballooning and memory
overcommit will be used, possibly hidden by vendors' tools.

Once ballooning is used at all, memory can not only go above
the starting memory= threshold but can also go below.

Thus, your patch will make it more likely that "memory pressure"
will be dynamically applied to Windows VMs, which means swapping
is more likely to occur, which means there had better be a
properly-sized swap disk.

For example, on a 2GB system, a reasonable configuration might be:

Windows VM1: memory=256M maxmem=1GB
Windows VM2: memory=256M maxmem=1GB
Windows VM3: memory=256M maxmem=1GB
Windows VM4: memory=256M maxmem=1GB
(dom0_mem=256M, Xen+heap=256M for the sake of argument)

Assume that VM1 and VM2 are heavily loaded and VM3 and VM4
are idle (or nearly so).  So VM1 and VM2 are ballooned up
towards 1G by taking memory away from VM3 and VM4.  Say
VM3 and VM4 are ballooned down to about 128M each.  Now
VM3 and VM4 suddenly get loaded and need more memory.
But VM1 and VM2 are hesitant to surrender memory because
it is fully utilized.  SOME VM is going to have to start
swapping!

So, I'm just saying that your patch makes this kind of
scenario more likely, so listing the need for a swap disk
in your README would be a good idea.


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 14:42   ` George Dunlap
  2008-12-24 15:35     ` Dan Magenheimer
@ 2008-12-25  2:36     ` Tian, Kevin
  2008-12-25  5:43       ` Han, Weidong
  1 sibling, 1 reply; 19+ messages in thread
From: Tian, Kevin @ 2008-12-25  2:36 UTC (permalink / raw)
  To: 'George Dunlap'; +Cc: xen-devel, Han, Weidong

>From: George Dunlap
>Sent: Wednesday, December 24, 2008 10:43 PM
>> Another tricky point could be with VT-d. If one guest page is used as
>> DMA target before balloon driver is installed, and no early access on
>> that page (like start-of-day scrubber), then PoD action will 
>not be triggered...
>> Not sure the possibility of such condition, but you may need to have
>> some thought or guard on that. em... after more thinking, 
>actually PoD
>> pages may be alive even after balloon driver is installed. I 
>guess before
>> coming up a solution you may add a check on whether target domain
>> has passthrough device to decide whether this feature is on 
>on-the-fly.
>
>Hmm, I haven't looked at VT-d integration; it at least requires some
>examination.  How are gfns translated to mfns for the VT-d hardware?
>Does it use the hardware EPT tables?  Is the transaction re-startable
>if we get an EPT fault and then fix the EPT table?

There's a VT-d page table walked by the VT-d engine, similar in content
to the EPT tables.  When a device DMA request is intercepted by the
VT-d engine, the VT-d page table corresponding to that device is walked
for a valid mapping.  Unlike EPT, which is restartable, a VT-d page
fault is only logged, since the PCI bus doesn't support I/O restart yet
(although the PCI-SIG is looking at this possibility).  That is to say,
if we can't find a way to trigger a CPU page fault before a PoD page is
used as a DMA target, one of the two features should be disabled when
both are configured.

>
>A second issue is with the emergency sweep: if a page which happens to
>be zero ends up being the target of a DMA, we may get:
>* Device request to write to gfn X, which translates to mfn Y.
>* Demand-fault on gfn Z, with no pages in the cache.
>* Emergency sweep scans through gfn space, finds that mfn Y is empty.
>It replaces gfn X with a PoD entry, and puts mfn Y behind gfn Z.
>* The request finishes.  Either the request then fails (because EPT
>translation for gfn X is not valid anymore), or it silently succeeds
>in writing to mfn Y, which is now behind gfn Z instead of gfn X.

Yes, this is also an issue.  The request will fail, since the DMA
address written to the device is a gfn, while the X->Y mapping has been
cut off by the sweep.

>
>If we can't tell that there's an outstanding I/O on the page, then we
>can't do an emergency sweep.  If we have some way of knowing that
>there's *some* outstanding I/O to *some* page, we could pause the
>guest until the I/O completes, then do the sweep.

One possibility is to have a PV DMA engine or a virtual VT-d engine
within the guest, but that's another story.

>
>At any rate, until we have that worked out, we should probably add
>some "seatbelt" code to make sure that people don't use PoD for a VT-d
>enabled domain.  I know absolutely nothing about the VT-d code; could
>you either write a patch to do this check, or give me an idea of the
>simplest thing to check?

Weidong works on VT-d and can comment on the exact thing to check.

>
>>>NB that this code is designed to work only in conjunction with a
>>>balloon driver.  If the balloon driver is not loaded, eventually all
>>>pages will be dirtied (non-zero), the emergency sweep will fail, and
>>>there will be no memory to back outstanding PoD pages.  When this
>>>happens, the domain will crash.
>>
>> In that case, is it better to increase PoD target to 
>configured max mem?
>> It looks uncomfortable to crash a domain just because some 
>optimization
>> doesn't apply. :-)
>
>If this happened, it wouldn't be because an optimization didn't apply,
>but because we purposely tried to use a feature for which a key
>component failed or wasn't properly in place.  If we set up a domain
>with VT-d access on a box with no VT-d hardware, it would fail as well
>-- just during boot, not 5 minutes after it. :-)

The VT-d case is a different story, since, as you said, domain creation
fails for lack of VT-d support, and the user is aware of what's
happening immediately and can then make the appropriate change to the
configuration file.  Nothing is impacted.  In the PoD case, however,
the failure of the emergency sweep may happen five minutes after boot,
or even later if the guest doesn't use much memory, and then... crash.
This is a bad user experience, and in particular some unsynced data
could be lost.

Anyway, PoD looks like a nice-to-have feature, much like superpages.
In both cases, as long as there's a fallback available, we'd better
fall back instead of crashing: for example, as long as there are enough
free domheap pages, use 4k pages when a superpage allocation fails, and
expand PoD to max mem for a domain which doesn't successfully install
a balloon driver.  In an environment with such over-commitment support,
not all VMs are expected to join that party. :-)

A side question is how an emergency sweep failure could be detected
and reported to the user...

>
>We could to allocate a new page at that point; but it's likely that
>the allocation will fail unless there happens to be memory lying
>around somewhere, not used by dom0 or any other doamin.  And if that
>were the case, why not just start it with that much memory to begin
>with?

This is a case where the user's willingness to use PoD doesn't mean it
will always succeed.  You wouldn't expect the user to disable PoD and
allocate that much memory only after several rounds of crashes.

>
>The only way to make this more robust would be to pause the domain,
>send a message back to xend, have it try to balloon down domain 0 (or
>possibly other domains), increase the PoD cache size, and then unpause
>the domain again.  This is not only a lot of work, but many of the
>failure modes will be really hard to handle; e.g., if qemu makes a
>hypercall that ends up doing a gfn_to_mfn() translation which fails,
>we would need to make that whole operation re-startable.  I did look
>at this, but it's a ton of work, and a lot of code changes (including
>interface changes bewteen Xen and dom0 components), for a situation
>which really should never happen in a properly configured system.
>There's no reason that with a balloon driver which loads during boot,
>and a properly configured target (i.e., not unreasonably small), the
>driver shouldn't be able to quickly reach its target.

So I think a simple fallback that automatically expands PoD to maxmem
can avoid such complexity.

>
>> Last, do you have any performance data on how this patch may impact
>> the boot process, or even some workload after login?
>
>I do not have any solid numbers.  Perceptually, I haven't noticed
>anything too slow.  I'll do some simple benchmarks.
>

Thanks for your good work.
Kevin


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 15:35     ` Dan Magenheimer
  2008-12-24 15:46       ` George Dunlap
@ 2008-12-25  2:47       ` Tian, Kevin
  1 sibling, 0 replies; 19+ messages in thread
From: Tian, Kevin @ 2008-12-25  2:47 UTC (permalink / raw)
  To: 'Dan Magenheimer', George Dunlap; +Cc: xen-devel

>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] 
>Sent: Wednesday, December 24, 2008 11:35 PM
>
>> We could to allocate a new page at that point; but it's likely that
>> the allocation will fail unless there happens to be memory lying
>> around somewhere, not used by dom0 or any other doamin.  And if that
>> were the case, why not just start it with that much memory to begin
>> with?
>
>Actually, if dom0_mem is used rather than the default of letting
>domain0 absorb all free memory and dole it as needed to launching
>VMs, there will almost always be some memory lying around.

I recall some previous discussion about having an explicit dom0_mem
setting instead of blindly giving all memory to dom0.  What is others'
preference on this option?  Another benefit of limiting dom0_mem size,
IIRC, is NUMA-node-aware memory allocation.  Currently Xen can allocate
memory taking the node into consideration, but once all memory is
allocated to dom0 at the start it becomes much more complex, since the
balloon driver is not node-aware and thus can't selectively give back
pages from dom0, which nullifies Xen's node-aware allocator.

>
>And in the not-to-distant future, when live migration is
>more widely used, there had better be memory lying around
>or migration won't work.

Live migration seems orthogonal here, since a new domain is created
and the condition doesn't change as long as dom0 has enough memory to
balloon back. :-)

>
>As for "why not just start it with that much memory to begin
>with?"... because in most environments VMs are sized once
>(e.g. 512MB) and almost never changed... because sysadmins
>rarely want to be bothered with constantly fine tuning
>just to use an extra spare few MB of memory.
>
>That's why your patch is so important!

Agree.

Thanks,
Kevin


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-25  2:36     ` Tian, Kevin
@ 2008-12-25  5:43       ` Han, Weidong
  2008-12-25 11:45         ` Tian, Kevin
  0 siblings, 1 reply; 19+ messages in thread
From: Han, Weidong @ 2008-12-25  5:43 UTC (permalink / raw)
  To: Tian, Kevin, 'George Dunlap'; +Cc: xen-devel

Tian, Kevin wrote:
>> From: George Dunlap
>> Sent: Wednesday, December 24, 2008 10:43 PM
>>> Another tricky point could be with VT-d. If one guest page is used
>>> as DMA target before balloon driver is installed, and no early
>>> access on that page (like start-of-day scrubber), then PoD action
>>> will not be triggered... Not sure the possibility of such
>>> condition, but you may need to have some thought or guard on that.
>>> em... after more thinking, actually PoD pages may be alive even
>>> after balloon driver is installed. I guess before coming up a
>>> solution you may add a check on whether target domain has
>>> passthrough device to decide whether this feature is on on-the-fly. 
>> 
>> Hmm, I haven't looked at VT-d integration; it at least requires some
>> examination.  How are gfns translated to mfns for the VT-d hardware?
>> Does it use the hardware EPT tables?  Is the transaction re-startable
>> if we get an EPT fault and then fix the EPT table?
> 
> there's a VT-d page table walked by VT-d engine, which is similar to
> EPT content. When device dma request is intercepted by VT-d engine,
> VT-d page table corresponding to that device is walked for valid
> mapping. Not like EPT which is restartable, VT-d page fault is just
> for log purpose since pci bus doesn't support I/O restart yet
> (although pcisig is looking at this possibility). That says, if we
> can't find a chance to trigger a cpu page fault before PoD page is
> used as dma target, either one should be disabled if both are
> configured. 
> 
>> 
>> A second issue is with the emergency sweep: if a page which happens
>> to be zero ends up being the target of a DMA, we may get:
>> * Device request to write to gfn X, which translates to mfn Y.
>> * Demand-fault on gfn Z, with no pages in the cache.
>> * Emergency sweep scans through gfn space, finds that mfn Y is empty.
>> It replaces gfn X with a PoD entry, and puts mfn Y behind gfn Z.
>> * The request finishes.  Either the request then fails (because EPT
>> translation for gfn X is not valid anymore), or it silently succeeds
>> in writing to mfn Y, which is now behind gfn Z instead of gfn X.
> 
> yes, this is also one issue. the request will fail since the dma
> address written to device is gfn, while X->Y mapping is cut off due
> to sweep. 
> 
>> 
>> If we can't tell that there's an outstanding I/O on the page, then we
>> can't do an emergency sweep.  If we have some way of knowing that
>> there's *some* outstanding I/O to *some* page, we could pause the
>> guest until the I/O completes, then do the sweep.
> 
> one possibility is to have a pv dma engine or virtual VT-d engine
> within guest, but that's another story.
> 
>> 
>> At any rate, until we have that worked out, we should probably add
>> some "seatbelt" code to make sure that people don't use PoD for a
>> VT-d enabled domain.  I know absolutely nothing about the VT-d code;
>> could you either write a patch to do this check, or give me an idea
>> of the simplest thing to check?
> 
> Weidong works on VT-d and could give comments on exact point
> to check.
> 

You can simply check "iommu_enabled" to know whether an IOMMU (VT-d or AMD IOMMU) is in use or not.

Regards,
Weidong


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-25  5:43       ` Han, Weidong
@ 2008-12-25 11:45         ` Tian, Kevin
  2008-12-26  0:42           ` Han, Weidong
  0 siblings, 1 reply; 19+ messages in thread
From: Tian, Kevin @ 2008-12-25 11:45 UTC (permalink / raw)
  To: Han, Weidong, 'George Dunlap'; +Cc: xen-devel

 
>From: Han, Weidong 
>Sent: Thursday, December 25, 2008 1:43 PM
>>> 
>>> At any rate, until we have that worked out, we should probably add
>>> some "seatbelt" code to make sure that people don't use PoD for a
>>> VT-d enabled domain.  I know absolutely nothing about the VT-d code;
>>> could you either write a patch to do this check, or give me an idea
>>> of the simplest thing to check?
>> 
>> Weidong works on VT-d and could give comments on exact point
>> to check.
>> 
>
>You can simply check "iommu_enabled" to know whether IOMMU 
>including VT-d and AMD IOMMU is used or not.
>

Weidong, does iommu_enabled indicate IOMMU h/w availability?
Then you'll have this nice feature disabled on most new platforms
shipping with an IOMMU. :-)  A domain-based check is required here,
i.e. PoD is only applicable when the target domain has no passthrough
device.

Thanks,
Kevin


* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-25 11:45         ` Tian, Kevin
@ 2008-12-26  0:42           ` Han, Weidong
  0 siblings, 0 replies; 19+ messages in thread
From: Han, Weidong @ 2008-12-26  0:42 UTC (permalink / raw)
  To: Tian, Kevin, 'George Dunlap'; +Cc: xen-devel

Tian, Kevin wrote:
>> From: Han, Weidong
>> Sent: Thursday, December 25, 2008 1:43 PM
>>>> 
>>>> At any rate, until we have that worked out, we should probably add
>>>> some "seatbelt" code to make sure that people don't use PoD for a
>>>> VT-d enabled domain.  I know absolutely nothing about the VT-d
>>>> code; could you either write a patch to do this check, or give me
>>>> an idea of the simplest thing to check?
>>> 
>>> Weidong works on VT-d and could give comments on exact point to
>>> check. 
>>> 
>> 
>> You can simply check "iommu_enabled" to know whether IOMMU
>> including VT-d and AMD IOMMU is used or not.
>> 
> 
> Weidong, does iommu_enabled indicate IOMMU h/w availability?
> Then you'll have this nice feature disabled on most new platform
> with IOMMU shipped. :-) Here a domain-based check is required,
> i.e. PoD is only appliable when target domain is not passthroughed
> with any device.
> 
> Thanks,
> Kevin

iommu_enabled will be set when IOMMU h/w is available and the user sets "iommu=1" in grub to use it. Because device hotplug with VT-d is already supported, I think a domain-based check is not enough; it's better to disable PoD when iommu_enabled is set.

Regards,
Weidong


* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-24 15:46       ` George Dunlap
@ 2008-12-30  9:26         ` Tim Deegan
  2008-12-31  1:40           ` Tian, Kevin
  0 siblings, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2008-12-30  9:26 UTC (permalink / raw)
  To: George Dunlap; +Cc: Dan Magenheimer, xen-devel, Tian, Kevin

At 15:46 +0000 on 24 Dec (1230133560), George Dunlap wrote:
> On Wed, Dec 24, 2008 at 3:35 PM, Dan Magenheimer
> <dan.magenheimer@oracle.com> wrote:
> >> We could to allocate a new page at that point; but it's likely that
> >> the allocation will fail unless there happens to be memory lying
> >> around somewhere, not used by dom0 or any other doamin.  And if that
> >> were the case, why not just start it with that much memory to begin
> >> with?
> >
> > Actually, if dom0_mem is used rather than the default of letting
> > domain0 absorb all free memory and dole it as needed to launching
> > VMs, there will almost always be some memory lying around.
> 
> At any rate, I suppose it might not be a bad idea to *try* to allocate
> more memory in an emergency.  I'll add that to the list of
> improvements.

Please don't do this.  It's not OK for a domain to start using more
memory without the say-so of the tool stack.  Since this emergency
condition means something has gone wrong (the balloon driver failed to
start), you're probably just postponing the inevitable, and in the
meantime you might cause problems for domains that *aren't* misbehaving.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-30  9:26         ` Tim Deegan
@ 2008-12-31  1:40           ` Tian, Kevin
  2009-01-02 10:03             ` Tim Deegan
  0 siblings, 1 reply; 19+ messages in thread
From: Tian, Kevin @ 2008-12-31  1:40 UTC (permalink / raw)
  To: 'Tim Deegan', George Dunlap; +Cc: Dan Magenheimer, xen-devel

>From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
>Sent: Tuesday, December 30, 2008 5:27 PM
>
>At 15:46 +0000 on 24 Dec (1230133560), George Dunlap wrote:
>> On Wed, Dec 24, 2008 at 3:35 PM, Dan Magenheimer
>> <dan.magenheimer@oracle.com> wrote:
>> >> We could try to allocate a new page at that point; but it's likely
>> >> that the allocation will fail unless there happens to be memory
>> >> lying around somewhere, not used by dom0 or any other domain.  And
>> >> if that were the case, why not just start it with that much memory
>> >> to begin with?
>> >
>> > Actually, if dom0_mem is used rather than the default of letting
>> > domain0 absorb all free memory and dole it out as needed to
>> > launching VMs, there will almost always be some memory lying around.
>> 
>> At any rate, I suppose it might not be a bad idea to *try* to allocate
>> more memory in an emergency.  I'll add that to the list of
>> improvements.
>
>Please don't do this.  It's not OK for a domain to start using more
>memory without the say-so of the tool stack.  Since this emergency
>condition means something has gone wrong (the balloon driver failed to
>start), you're probably just postponing the inevitable, and in the
>meantime you might cause problems for domains that *aren't*
>misbehaving.
>

Then a user-controlled option would fit here, indicating whether a
given domain is important; emergency expansion could then be allowed
in such a case, if a mandatory kill is not acceptable.
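
Purely as an illustration of what I mean (the option name below is
made up and not part of George's patches), the knob could be a simple
per-domain flag in the guest config:

 # xm create important-hvm maxmem=512 memory=256 pod_emergency_alloc=1

The toolstack would pass it down at domain creation, and only domains
with the flag set would be allowed to attempt an emergency allocation
instead of being crashed when the PoD cache and the zero-page sweep
both come up empty.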

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH] 0/9 Populate-on-demand memory
  2008-12-31  1:40           ` Tian, Kevin
@ 2009-01-02 10:03             ` Tim Deegan
  2009-01-05  6:08               ` Tian, Kevin
  0 siblings, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2009-01-02 10:03 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: George Dunlap, Dan Magenheimer, xen-devel

Hi, 

At 09:40 +0800 on 31 Dec (1230716432), Tian, Kevin wrote:
> >From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
> >At 15:46 +0000 on 24 Dec (1230133560), George Dunlap wrote:
> >> At any rate, I suppose it might not be a bad idea to *try* to allocate
> >> more memory in an emergency.  I'll add that to the list of
> >> improvements.
> >
> >Please don't do this.  It's not OK for a domain to start using more
> >memory without the say-so of the tool stack.  Since this emergency
> >condition means something has gone wrong (the balloon driver failed to
> >start), you're probably just postponing the inevitable, and in the
> >meantime you might cause problems for domains that *aren't*
> >misbehaving.
> >
> 
> Then a user-controlled option would fit here, indicating whether a
> given domain is important; emergency expansion could then be allowed
> in such a case, if a mandatory kill is not acceptable.

What if you're booting two important domains, one of which misbehaves
and uses extra memory, causing the second boot to fail?  They were both
important, and you've just chosen the buggy one. :)

Anyway, the only way to guarantee that a domain will boot even if it
fails to launch its balloon driver is to make sure there is enough
memory around for it to populate its entire p2m -- in which case you
might as well just allocate it all that memory in the first place and
avoid the extra risk of a bug in the PoD code nobbling this important
domain.

The marginal benefit of allowing it to break the rules in the case where
things go "slightly wrong" (i.e. it overruns its allocation but somehow
recovers before using all available memory) seems so small to me that
it's not even worth the extra lines of code in Xen and xend.  Especially
since probably either nobody would turn it on, or everyone would turn it
on for every domain.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC][PATCH] 0/9 Populate-on-demand memory
  2009-01-02 10:03             ` Tim Deegan
@ 2009-01-05  6:08               ` Tian, Kevin
  0 siblings, 0 replies; 19+ messages in thread
From: Tian, Kevin @ 2009-01-05  6:08 UTC (permalink / raw)
  To: 'Tim Deegan'; +Cc: George Dunlap, Dan Magenheimer, xen-devel

>From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
>Sent: Friday, January 02, 2009 6:04 PM
>Hi, 
>
>At 09:40 +0800 on 31 Dec (1230716432), Tian, Kevin wrote:
>> >From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
>> >At 15:46 +0000 on 24 Dec (1230133560), George Dunlap wrote:
>> >> At any rate, I suppose it might not be a bad idea to *try* to
>> >> allocate more memory in an emergency.  I'll add that to the list
>> >> of improvements.
>> >
>> >Please don't do this.  It's not OK for a domain to start using more
>> >memory without the say-so of the tool stack.  Since this emergency
>> >condition means something has gone wrong (the balloon driver failed
>> >to start), you're probably just postponing the inevitable, and in
>> >the meantime you might cause problems for domains that *aren't*
>> >misbehaving.
>> >
>> 
>> Then a user-controlled option would fit here, indicating whether a
>> given domain is important; emergency expansion could then be allowed
>> in such a case, if a mandatory kill is not acceptable.
>
>What if you're booting two important domains, one of which misbehaves
>and uses extra memory, causing the second boot to fail?  They were both
>important, and you've just chosen the buggy one. :)
>
>Anyway, the only way to guarantee that a domain will boot even if it
>fails to launch its balloon driver is to make sure there is enough
>memory around for it to populate its entire p2m -- in which case you
>might as well just allocate it all that memory in the first place and
>avoid the extra risk of a bug in the PoD code nobbling this important
>domain.
>
>The marginal benefit of allowing it to break the rules in the case
>where things go "slightly wrong" (i.e. it overruns its allocation but
>somehow recovers before using all available memory) seems so small to
>me that it's not even worth the extra lines of code in Xen and xend.
>Especially since probably either nobody would turn it on, or everyone
>would turn it on for every domain.
>

OK, a sound argument.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-01-05  6:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-23 12:55 [RFC][PATCH] 0/9 Populate-on-demand memory George Dunlap
2008-12-23 19:06 ` Dan Magenheimer
2008-12-24 13:55   ` George Dunlap
2008-12-24 14:32     ` Dan Magenheimer
2008-12-24 15:13       ` George Dunlap
2008-12-24 15:54         ` Dan Magenheimer
2008-12-24  1:46 ` Tian, Kevin
2008-12-24 14:42   ` George Dunlap
2008-12-24 15:35     ` Dan Magenheimer
2008-12-24 15:46       ` George Dunlap
2008-12-30  9:26         ` Tim Deegan
2008-12-31  1:40           ` Tian, Kevin
2009-01-02 10:03             ` Tim Deegan
2009-01-05  6:08               ` Tian, Kevin
2008-12-25  2:47       ` Tian, Kevin
2008-12-25  2:36     ` Tian, Kevin
2008-12-25  5:43       ` Han, Weidong
2008-12-25 11:45         ` Tian, Kevin
2008-12-26  0:42           ` Han, Weidong
