Proposed new "memory capacity claim" hypercall/feature

* Proposed new "memory capacity claim" hypercall/feature
@ 2012-10-29 17:06 Dan Magenheimer
  2012-10-29 18:24 ` Keir Fraser
  2012-10-29 22:35 ` Tim Deegan
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-29 17:06 UTC (permalink / raw)
  To: Keir (Xen.org), Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

Keir, Jan (et al) --

In a recent long thread [1], there was a great deal of discussion
about the possible need for a "memory reservation" hypercall.
While there was some confusion due to the two worldviews of static
vs dynamic management of physical memory capacity, one worldview
definitely has a requirement for this new capability.  It is still
uncertain whether the other worldview will benefit as well, though
I believe it eventually will, especially when page sharing is
fully deployed.

Note that to avoid confusion with existing usages of various
terms (such as "reservation"), I am now using the distinct
word "claim" as in a "land claim" or "mining claim":
http://dictionary.cambridge.org/dictionary/british/stake-a-claim 
When a toolstack creates a domain, it can first "stake a claim"
to the amount of memory capacity necessary to ensure the domain
launch will succeed.

In order to explore feasibility, I wanted to propose a possible
hypervisor design and would very much appreciate feedback!

The objective of the design is to ensure that a multi-threaded
toolstack can atomically claim a specific amount of RAM capacity for a
domain, especially in the presence of independent dynamic memory demand
(such as tmem and selfballooning) which the toolstack is not able to track.
"Claim X 50G" means that, on completion of the call, either (A) 50G of
capacity has been claimed for use by domain X and the call returns
success or (B) the call returns failure.  Note that in the above,
"claim" explicitly does NOT mean that specific physical RAM pages have
been assigned, only that the 50G of RAM capacity is not available either
to a subsequent "claim" or for most[2] independent dynamic memory demands.

I think the underlying hypervisor issue is that the current process
of "reserving" memory capacity (which currently does assign specific
physical RAM pages) is, by necessity when used for large quantities of RAM,
batched and slow and, consequently, can NOT be atomic.  One way to think
of the newly proposed "claim" is as "lazy reserving":  The capacity is
set aside even though specific physical RAM pages have not been assigned.
In another way, claiming is really just an accounting illusion, similar
to how an accountant must "accrue" future liabilities.

Hypervisor design/implementation overview:

A domain currently does RAM accounting with two primary counters
"tot_pages" and "max_pages".  (For now, let's ignore shr_pages,
paged_pages, and xenheap_pages, and I hope Olaf/Andre/others can
provide further expertise and input.)

Tot_pages is a struct_domain element in the hypervisor that tracks
the number of physical RAM pageframes "owned" by the domain.  The
hypervisor enforces that tot_pages is never allowed to exceed another
struct_domain element called max_pages.

I would like to introduce a new counter, which records how
much capacity is claimed for a domain which may or may not yet be
mapped to physical RAM pageframes.  To do so, I'd like to split
the concept of tot_pages into two variables, tot_phys_pages and
tot_claimed_pages and require the hypervisor to also enforce:

d.tot_phys_pages <= d.tot_claimed_pages[3] <= d.max_pages

I'd also split the hypervisor global "total_avail_pages" into
"total_free_pages" and "total_unclaimed_pages".  (I'm definitely
going to need to study more the two-dimensional array "avail"...)
The hypervisor must now do additional accounting to keep track
of the sum of claims across all domains and also enforce the
global:

total_unclaimed_pages <= total_free_pages

I think the memory_op hypercall can be extended to add two
additional subops, XENMEM_claim and XENMEM_release.  (Note: To
support tmem, there will need to be two variations of XEN_claim,
"hard claim" and "soft claim" [3].)  The XEN_claim subop atomically
evaluates total_unclaimed_pages against the new claim, claims
the pages for the domain if possible and returns success or failure.
The XEN_release "unsets" the domain's tot_claimed_pages (to an
"illegal" value such as zero or MINUS_ONE).

The hypervisor must also enforce some semantics:  If an allocation
occurs such that a domain's tot_phys_pages would equal or exceed
d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
This enforces the temporary nature of a claim:  Once a domain
fully "occupies" its claim, the claim silently expires.

In the case of a dying domain, a XENMEM_release operation
is implied and must be executed by the hypervisor.

Ideally, the quantity of unclaimed memory for each domain and
for the system should be query-able.  This may require additional
memory_op hypercalls.

I'd very much appreciate feedback on this proposed design!

Thanks,
Dan

[1] http://lists.xen.org/archives/html/xen-devel/2012-09/msg02229.html
    and continued in October (the archives don't thread across months)
    http://lists.xen.org/archives/html/xen-devel/2012-10/msg00080.html 
[2] Pages used to store tmem "ephemeral" data may be an exception
    because those pages are "free-on-demand".
[3] I'd be happy to explain the minor additional work necessary to
    support tmem but have mostly left it out of the proposal for clarity.

^ permalink raw reply	[flat|nested] 58+ messages in thread