* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
       [not found] <mailman.18000.1354568068.1399.xen-devel@lists.xen.org>
@ 2012-12-04  3:24 ` Andres Lagar-Cavilla
  2012-12-18 22:17   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2012-12-04  3:24 UTC (permalink / raw)
  To: xen-devel
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Ian Jackson, Tim Deegan,
	Jan Beulich

> I earlier promised a complete analysis of the problem
> addressed by the proposed claim hypercall as well as
> an analysis of the alternate solutions.  I had not
> yet provided these analyses when I asked for approval
> to commit the hypervisor patch, so there was still
> a good amount of misunderstanding, and I am trying
> to fix that here.
> 
> I had hoped this essay could be both concise and complete
> but quickly found it to be impossible to be both at the
> same time.  So I have erred on the side of verbosity,
> but also have attempted to ensure that the analysis
> flows smoothly and is understandable to anyone interested
> in learning more about memory allocation in Xen.
> I'd appreciate feedback from other developers to understand
> if I've also achieved that goal.
> 
> Ian, Ian, George, and Tim -- I have tagged a few
> out-of-flow questions to you with [IIGT].  If I lose
> you at any point, I'd especially appreciate your feedback
> at those points.  I trust that, first, you will read
> this completely.  As I've said, I understand that
> Oracle's paradigm may differ in many ways from your
> own, so I also trust that you will read it completely
> with an open mind.
> 
> Thanks,
> Dan
> 
> PROBLEM STATEMENT OVERVIEW
> 
> The fundamental problem is a race; two entities are
> competing for part or all of a shared resource: in this case,
> physical system RAM.  Normally, a lock is used to mediate
> a race.
> 
> For memory allocation in Xen, there are two significant
> entities, the toolstack and the hypervisor.  And, in
> general terms, there are currently two important locks:
> one used in the toolstack for domain creation;
> and one in the hypervisor used for the buddy allocator.
> 
> Considering first only domain creation, the toolstack
> lock is taken to ensure that domain creation is serialized.
> The lock is taken when domain creation starts, and released
> when domain creation is complete.
> 
> As system and domain memory requirements grow, the amount
> of time to allocate all necessary memory to launch a large
> domain is growing and may now exceed several minutes, so
> this serialization is increasingly problematic.  The result
> is a customer reported problem:  If a customer wants to
> launch two or more very large domains, the "wait time"
> required by the serialization is unacceptable.
> 
> Oracle would like to solve this problem.  And Oracle
> would like to solve this problem not just for a single
> customer sitting in front of a single machine console, but
> for the very complex case of a large number of machines,
> with the "agent" on each machine taking independent
> actions including automatic load balancing and power
> management via migration.
Hi Dan,
an issue with your reasoning throughout has been the constant invocation of the multi-host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi-host VM management).


>  (This complex environment
> is sold by Oracle today; it is not a "future vision".)
> 
> [IIGT] Completely ignoring any possible solutions to this
> problem, is everyone in agreement that this _is_ a problem
> that _needs_ to be solved with _some_ change in the Xen
> ecosystem?
> 
> SOME IMPORTANT BACKGROUND INFORMATION
> 
> In the subsequent discussion, it is important to
> understand a few things:
> 
> While the toolstack lock is held, allocating memory for
> the domain creation process is done as a sequence of one
> or more hypercalls, each asking the hypervisor to allocate
> one or more -- "X" -- slabs of physical RAM, where a slab
> is 2**N contiguous aligned pages, also known as an
> "order N" allocation.  While the hypercall is defined
> to work with any value of N, common values are N=0
> (individual pages), N=9 ("hugepages" or "superpages"),
> and N=18 ("1GiB pages").  So, for example, if the toolstack
> requires 201MiB of memory, it will make two hypercalls:
> One with X=100 and N=9, and one with X=256 and N=0.
> 
> While the toolstack may ask for a smaller number X of
> order==9 slabs, system fragmentation may unpredictably
> cause the hypervisor to fail the request, in which case
> the toolstack will fall back to a request for 512*X
> individual pages.  If there is sufficient RAM in the system,
> this request for order==0 pages is guaranteed to succeed.
> Thus for a 1TiB domain, the hypervisor must be prepared
> to allocate up to 256Mi individual pages.
> 
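For concreteness, the fallback logic described above might look roughly like
the following in a toolstack; this is only a sketch, with a hypothetical
try_populate() wrapper standing in for the real populate_physmap hypercall:

    /* Sketch only: try_populate(d, count, order) stands in for the real
     * populate_physmap hypercall wrapper and returns 0 on success. */
    #define SUPERPAGE_ORDER 9

    static int allocate_domain_memory(domid_t d, unsigned long nr_pages)
    {
        unsigned long nr_super  = nr_pages >> SUPERPAGE_ORDER;
        unsigned long remainder = nr_pages & ((1UL << SUPERPAGE_ORDER) - 1);

        /* Ask for order-9 slabs first; fragmentation may make this fail. */
        if ( nr_super && try_populate(d, nr_super, SUPERPAGE_ORDER) != 0 )
        {
            /* Fall back to 512 individual order-0 pages per slab. */
            if ( try_populate(d, nr_super << SUPERPAGE_ORDER, 0) != 0 )
                return -ENOMEM;
        }

        /* Any leftover is requested as order-0 pages; with enough free RAM
         * this is the request that is guaranteed to succeed. */
        if ( remainder && try_populate(d, remainder, 0) != 0 )
            return -ENOMEM;

        return 0;
    }
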
> Note carefully that when the toolstack hypercall asks for
> 100 slabs, the hypervisor "heaplock" is currently taken
> and released 100 times.  Similarly, for 256M individual
> pages... 256 million spin_lock-alloc_page-spin_unlocks.
> This means that domain creation is not "atomic" inside
> the hypervisor, which means that races can and will still
> occur.
> 
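Schematically, the per-slab locking just described looks like this; a
simplified sketch (the real alloc_heap_pages() takes more parameters),
meant only to show why nothing makes the whole request atomic:

    static int allocate_slabs(struct domain *d, unsigned long nr_slabs,
                              unsigned int order)
    {
        unsigned long i;

        for ( i = 0; i < nr_slabs; i++ )
        {
            struct page_info *pg;

            spin_lock(&heap_lock);
            pg = alloc_heap_pages(order);      /* one order-N slab */
            spin_unlock(&heap_lock);

            /* Between iterations, any other allocation may sneak in. */
            if ( pg == NULL )
                return -ENOMEM;

            assign_pages_to_domain(d, pg, order);   /* hypothetical helper */
        }

        return 0;
    }
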
> RULING OUT SOME SIMPLE SOLUTIONS
> 
> Is there an elegant simple solution here?
> 
> Let's first consider the possibility of removing the toolstack
> serialization entirely and/or the possibility that two
> independent toolstack threads (or "agents") can simultaneously
> request a very large domain creation in parallel.  As described
> above, the hypervisor's heaplock is insufficient to serialize RAM
> allocation, so the two domain creation processes race.  If there
> is sufficient resource for either one to launch, but insufficient
> resource for both to launch, the winner of the race is indeterminate,
> and one or both launches will fail, possibly after one or both 
> domain creation threads have been working for several minutes.
> This is a classic "TOCTOU" (time-of-check-time-of-use) race.
> If a customer is unhappy waiting several minutes to launch
> a domain, they will be even more unhappy waiting for several
> minutes to be told that one or both of the launches has failed.
> Multi-minute failure is even more unacceptable for an automated
> agent trying to, for example, evacuate a machine that the
> data center administrator needs to powercycle.
> 
> [IIGT: Please hold your objections for a moment... the paragraph
> above is discussing the simple solution of removing the serialization;
> your suggested solution will be discussed soon.]
> 
> Next, let's consider the possibility of changing the heaplock
> strategy in the hypervisor so that the lock is held not
> for one slab but for the entire request of X slabs.  As with
> any core hypervisor lock, holding the heaplock for a "long time"
> is unacceptable.  To a hypervisor, several minutes is an eternity.
> And, in any case, by serializing domain creation in the hypervisor,
> we have really only moved the problem from the toolstack into
> the hypervisor, not solved the problem.
> 
> [IIGT] Are we in agreement that these simple solutions can be
> safely ruled out?
> 
> CAPACITY ALLOCATION VS RAM ALLOCATION
> 
> Looking for a creative solution, one may realize that it is the
> page allocation -- especially in large quantities -- that is very
> time-consuming.  But, thinking outside of the box, it is not
> the actual pages of RAM that we are racing on, but the quantity of pages required to launch a domain!  If we instead have a way to
> "claim" a quantity of pages cheaply now and then allocate the actual
> physical RAM pages later, we have changed the race to require only serialization of the claiming process!  In other words, if some entity
> knows the number of pages available in the system, and can "claim"
> N pages for the benefit of a domain being launched, the successful launch of the domain can be ensured.  Well... the domain launch may
> still fail for an unrelated reason, but not due to a memory TOCTOU
> race.  But, in this case, if the cost (in time) of the claiming
> process is very small compared to the cost of the domain launch,
> we have solved the memory TOCTOU race with hardly any delay added
> to a non-memory-related failure that would have occurred anyway.
> 
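In outline, the split between the cheap, serialized claim and the slow,
unserialized allocation might look like this (a sketch with hypothetical
names, not an actual interface):

    static int claim_then_populate(struct domain *d, unsigned long nr_needed)
    {
        int rc = -ENOMEM;

        spin_lock(&capacity_lock);
        if ( free_pages_in_system() >= nr_needed )   /* hypothetical query */
        {
            stake_claim(d, nr_needed);   /* bookkeeping only, no RAM touched */
            rc = 0;
        }
        spin_unlock(&capacity_lock);

        if ( rc )
            return rc;                   /* fails in microseconds, not minutes */

        /* The slow page-by-page allocation can no longer lose the race:
         * the capacity is already reserved for this domain. */
        return populate_domain_memory(d, nr_needed);   /* hypothetical */
    }
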
> This "claim" sounds promising.  But we have made an assumption that
> an "entity" has certain knowledge.  In the Xen system, that entity
> must be either the toolstack or the hypervisor.  Or, in the Oracle
> environment, an "agent"... but an agent and a toolstack are similar
> enough for our purposes that we will just use the more broadly-used
> term "toolstack".  In using this term, however, it's important to
> remember it is necessary to consider the existence of multiple
> threads within this toolstack.
> 
> Now I quote Ian Jackson: "It is a key design principle of a system
> like Xen that the hypervisor should provide only those facilities
> which are strictly necessary.  Any functionality which can be
> reasonably provided outside the hypervisor should be excluded
> from it."
> 
> So let's examine the toolstack first.
> 
> [IIGT] Still all on the same page (pun intended)?
> 
> TOOLSTACK-BASED CAPACITY ALLOCATION
> 
> Does the toolstack know how many physical pages of RAM are available?
> Yes, it can use a hypercall to find out this information after Xen and
> dom0 launch, but before it launches any domain.  Then if it subtracts
> the number of pages used when it launches a domain and is aware of
> when any domain dies, and adds them back, the toolstack has a pretty
> good estimate.  In actuality, the toolstack doesn't _really_ know the
> exact number of pages used when a domain is launched, but there
> is a poorly-documented "fuzz factor"... the toolstack knows the
> number of pages within a few megabytes, which is probably close enough.
> 
> This is a fairly good description of how the toolstack works today
> and the accounting seems simple enough, so does toolstack-based
> capacity allocation solve our original problem?  It would seem so.
> Even if there are multiple threads, the accounting -- not the extended
> sequence of page allocation for the domain creation -- can be
> serialized by a lock in the toolstack.  But note carefully, either
> the toolstack and the hypervisor must always be in sync on the
> number of available pages (within an acceptable margin of error);
> or any query to the hypervisor _and_ the toolstack-based claim must
> be paired atomically, i.e. the toolstack lock must be held across
> both.  Otherwise we again have another TOCTOU race. Interesting,
> but probably not really a problem.
> 
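A sketch of the toolstack-side accounting being described, where the
free-page estimate and the claim are updated under one lock so the check
and the claim cannot be separated by another thread (the names and the
FUZZ_PAGES margin are made up):

    #include <pthread.h>

    #define FUZZ_PAGES 2048   /* made-up margin for the "fuzz factor" */

    static unsigned long free_page_estimate;  /* refreshed from a hypervisor query */
    static pthread_mutex_t capacity_mutex = PTHREAD_MUTEX_INITIALIZER;

    int toolstack_claim(unsigned long nr_pages)
    {
        int rc = 0;

        pthread_mutex_lock(&capacity_mutex);
        if ( free_page_estimate < nr_pages + FUZZ_PAGES )
            rc = -1;                           /* would not fit */
        else
            free_page_estimate -= nr_pages;    /* claim recorded */
        pthread_mutex_unlock(&capacity_mutex);

        return rc;
    }
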
> Wait, isn't it possible for the toolstack to dynamically change the
> number of pages assigned to a domain?  Yes, this is often called
> ballooning and the toolstack can do this via a hypercall.  But

> that's still OK because each call goes through the toolstack and
> it simply needs to add more accounting for when it uses ballooning
> to adjust the domain's memory footprint.  So we are still OK.
> 
> But wait again... that brings up an interesting point.  Are there
> any significant allocations that are done in the hypervisor without
> the knowledge and/or permission of the toolstack?  If so, the
> toolstack may be missing important information.
> 
> So are there any such allocations?  Well... yes. There are a few.
> Let's take a moment to enumerate them:
> 
> A) In Linux, a privileged user can write to a sysfs file which writes
> to the balloon driver which makes hypercalls from the guest kernel to

A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host? This may be justified with arguments pertaining to the stability of the in-guest workload, but what such arguments really reveal are limitations of ballooning. And the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hypercall.

> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
> The toolstack controls constraints (essentially a minimum and maximum)
> which the hypervisor enforces.  The toolstack can ensure that the
> minimum and maximum are identical to essentially disallow Linux from
> using this functionality.  Indeed, this is precisely what Citrix's
> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
> footprint changes.  But DMC is not prescribed by the toolstack,

Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained a control to apply, how often to be alerted, and when to stall the domain.

> and some real Oracle Linux customers use and depend on the flexibility
> provided by in-guest ballooning.   So guest-privileged-user-driven-
> ballooning is a potential issue for toolstack-based capacity allocation.
> 
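The min==max enforcement mentioned above boils down to pinning the domain's
maximum at its current target, so that any in-guest balloon-up is refused by
the hypervisor. A rough sketch, assuming xc_domain_setmaxmem() is the
relevant libxc call (treat the exact signature and units as assumptions to
verify against your tree):

    /* Sketch: clamp a domain so in-guest ballooning cannot grow its
     * footprint without the toolstack's involvement. */
    static int clamp_domain_footprint(xc_interface *xch, uint32_t domid,
                                      uint64_t target_kb)
    {
        return xc_domain_setmaxmem(xch, domid, target_kb);  /* max == target */
    }
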
> [IIGT: This is why I have brought up DMC several times and have
> called this the "Citrix model,".. I'm not trying to be snippy
> or impugn your morals as maintainers.]
> 
> B) Xen's page sharing feature has slowly been completed over a number
> of recent Xen releases.  It takes advantage of the fact that many
> pages often contain identical data; the hypervisor merges them to save

Great care has been taken for this statement to not be exactly true. The hypervisor discards one of the two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discarded page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"

> physical RAM.  When any "shared" page is written, the hypervisor
> "splits" the page (aka, copy-on-write) by allocating a new physical
> page.  There is a long history of this feature in other virtualization
> products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second.  The
> hypervisor does not notify or ask permission of the toolstack.
> So, page-splitting is an issue for toolstack-based capacity
> allocation, at least as currently coded in Xen.
> 
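The allocation that matters for capacity accounting is the one on the write
fault; schematically (a sketch with hypothetical helpers, not the real
mem_sharing code):

    /* A write to a shared page forces the hypervisor to allocate a fresh
     * page, with no toolstack round-trip anywhere on this path. */
    static int unshare_page(struct domain *d, unsigned long gfn,
                            struct page_info *shared)
    {
        struct page_info *pg = alloc_domheap_page(d, 0);  /* new physical page */

        if ( pg == NULL )
            return -ENOMEM;   /* the case the batching proposal below targets */

        copy_page_contents(pg, shared);   /* hypothetical helper */
        remap_gfn(d, gfn, pg);            /* hypothetical helper */
        put_shared_ref(shared);           /* hypothetical helper */

        return 0;
    }
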
> [Andre: Please hold your objection here until you read further.]

Name is Andres. And please cc me if you'll be addressing me directly!

Note that I don't disagree with your previous statement in itself. Although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.

> 
> C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
> toolstack for over three years.  It depends on an in-guest-kernel
> adaptive technique to constantly adjust the domain memory footprint as
> well as hooks in the in-guest-kernel to move data to and from the
> hypervisor.  While the data is in the hypervisor's care, interesting
> memory-load balancing between guests is done, including optional
> compression and deduplication.  All of this has been in Xen since 2009
> and has been awaiting changes in the (guest-side) Linux kernel. Those
> changes are now merged into the mainstream kernel and are fully
> functional in shipping distros.
> 
> While a complete description of tmem's guest<->hypervisor interaction
> is beyond the scope of this document, it is important to understand
> that any tmem-enabled guest kernel may unpredictably request thousands
> or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second with absolutely no interaction with the toolstack.  Further, the guest-side hypercalls that allocate pages
> via the hypervisor are done in "atomic" code deep in the Linux mm
> subsystem.
> 
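Conceptually, the hypervisor side of a tmem "put" looks something like the
sketch below; the names are illustrative rather than Xen's actual tmem
internals, and the point is only that the allocation is driven entirely by a
guest hypercall:

    static int tmem_put(struct domain *d, void *guest_page)
    {
        struct page_info *pg;

        spin_lock(&heap_lock);
        pg = alloc_heap_page_for(d);   /* domain footprint grows here */
        spin_unlock(&heap_lock);

        if ( pg == NULL )
            return -ENOMEM;   /* guest simply falls back to normal paging */

        copy_from_guest_page(pg, guest_page);   /* hypothetical helper */
        return 0;
    }
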
> Indeed, if one truly understands tmem, it should become clear that
> tmem is fundamentally incompatible with toolstack-based capacity
> allocation. But let's stop discussing tmem for now and move on.

You have not discussed tmem pool thaw and freeze in this proposal.

> 
> OK.  So with existing code both in Xen and Linux guests, there are
> three challenges to toolstack-based capacity allocation.  We'd
> really still like to do capacity allocation in the toolstack.  Can
> something be done in the toolstack to "fix" these three cases?
> 
> Possibly.  But let's first look at hypervisor-based capacity
> allocation: the proposed "XENMEM_claim_pages" hypercall.
> 
> HYPERVISOR-BASED CAPACITY ALLOCATION
> 
> The posted patch for the claim hypercall is quite simple, but let's
> look at it in detail.  The claim hypercall is actually a subop
> of an existing hypercall.  After checking parameters for validity,
> a new function is called in the core Xen memory management code.
> This function takes the hypervisor heaplock, checks for a few
> special cases, does some arithmetic to ensure a valid claim, stakes
> the claim, releases the hypervisor heaplock, and then returns.  To
> review from earlier, the hypervisor heaplock protects _all_ page/slab
> allocations, so we can be absolutely certain that there are no other
> page allocation races.  This new function is about 35 lines of code,
> not counting comments.
> 
> The patch includes two other significant changes to the hypervisor:
> First, when any adjustment to a domain's memory footprint is made
> (either through a toolstack-aware hypercall or one of the three
> toolstack-unaware methods described above), the heaplock is
> taken, arithmetic is done, and the heaplock is released.  This
> is 12 lines of code.  Second, when any memory is allocated within
> Xen, a check must be made (with the heaplock already held) to
> determine if, given a previous claim, the domain has exceeded
> its upper bound, maxmem.  This code is a single conditional test.
> 
> With some declarations, but not counting the copious comments,
> all told, the new code provided by the patch is well under 100 lines.
> 
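As a rough sketch of that logic (illustrative names and fields, not the
actual patch):

    /* Everything happens under heap_lock, the same lock that protects page
     * allocation, so the claim cannot race with any allocation. */
    static int stake_domain_claim(struct domain *d, unsigned long pages)
    {
        int rc = -ENOMEM;

        spin_lock(&heap_lock);
        if ( pages >= d->tot_pages &&        /* claim covers current usage   */
             pages <= d->max_pages &&        /* and respects maxmem          */
             pages - d->tot_pages <= free_pages_total - claimed_pages_total )
        {
            d->outstanding_claim = pages - d->tot_pages;  /* illustrative field */
            claimed_pages_total += d->outstanding_claim;
            rc = 0;
        }
        spin_unlock(&heap_lock);

        return rc;
    }
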
> What about the toolstack side?  First, it's important to note that
> the toolstack changes are entirely optional.  If any toolstack
> wishes either to not fix the original problem, or avoid toolstack-
> unaware allocation completely by ignoring the functionality provided
> by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
> not use the new hypercall.

You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.

>  Second, it's very relevant to note that the Oracle product uses a combination of a proprietary "manager"
> which oversees many machines, and the older open-source xm/xend
> toolstack, for which the current Xen toolstack maintainers are no
> longer accepting patches.
> 
> The preface of the published patch does suggest, however, some
> straightforward pseudo-code, as follows:
> 
> Current toolstack domain creation memory allocation code fragment:
> 
> 1. call populate_physmap repeatedly to achieve mem=N memory
> 2. if any populate_physmap call fails, report -ENOMEM up the stack
> 3. memory is held until domain dies or the toolstack decreases it
> 
> Proposed toolstack domain creation memory allocation code fragment
> (new code marked with "+"):
> 
> +  call claim for mem=N amount of memory
> +. if claim succeeds:
> 1.  call populate_physmap repeatedly to achieve mem=N memory (failsafe)
> +  else
> 2.  report -ENOMEM up the stack
> +  claim is held until mem=N is achieved or the domain dies or
>    forced to 0 by a second hypercall
> 3. memory is held until domain dies or the toolstack decreases it
> 
> Reviewing the pseudo-code, one can readily see that the toolstack
> changes required to implement the hypercall are quite small.
> 
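In C, that pseudo-code translates to something like the fragment below,
assuming a libxc wrapper for the new subop (the wrapper name and signature
are assumptions, and populate_all_memory() stands in for the existing
populate loop):

    static int create_domain_memory(xc_interface *xch, uint32_t domid,
                                    unsigned long nr_pages)
    {
        int rc;

        /* Stake the claim first; only then pay the multi-minute cost of
         * populating the physmap, knowing the memory cannot be stolen. */
        rc = xc_domain_claim_pages(xch, domid, nr_pages);  /* assumed wrapper */
        if ( rc != 0 )
            return -ENOMEM;          /* fails in milliseconds, not minutes */

        rc = populate_all_memory(xch, domid, nr_pages);    /* existing slow path */

        /* Drop any unconsumed claim whether or not population succeeded. */
        xc_domain_claim_pages(xch, domid, 0);

        return rc;
    }
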
> To complete this discussion, it has been pointed out that
> the proposed hypercall doesn't solve the original problem
> for certain classes of legacy domains... but also neither
> does it make the problem worse.  It has also been pointed
> out that the proposed patch is not (yet) NUMA-aware.
> 
> Now let's return to the earlier question:  There are three 
> challenges to toolstack-based capacity allocation, which are
> all handled easily by in-hypervisor capacity allocation. But we'd
> really still like to do capacity allocation in the toolstack.
> Can something be done in the toolstack to "fix" these three cases?
> 
> The answer is, of course, certainly... anything can be done in
> software.  So, recalling Ian Jackson's stated requirement:
> 
> "Any functionality which can be reasonably provided outside the
>  hypervisor should be excluded from it."
> 
> we are now left to evaluate the subjective term "reasonably".
> 
> CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?
> 
> In earlier discussion on this topic, when page-splitting was raised
> as a concern, some of the authors of Xen's page-sharing feature
> pointed out that a mechanism could be designed such that "batches"
> of pages were pre-allocated by the toolstack and provided to the
> hypervisor to be utilized as needed for page-splitting.  Should the
> batch run dry, the hypervisor could stop the domain that was provoking
> the page-split until the toolstack could be consulted and the toolstack, at its leisure, could request the hypervisor to refill
> the batch, which then allows the page-split-causing domain to proceed.
> 
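A sketch of what such a (so far unimplemented) batch scheme might look like,
with made-up names, just to make the proposal concrete:

    /* The unshare path draws from a per-domain pool pre-filled by the
     * toolstack, and pauses the domain when the pool runs dry. */
    static struct page_info *get_split_page(struct domain *d)
    {
        struct page_info *pg = pool_take(&d->split_pool);   /* hypothetical */

        if ( pg == NULL )
        {
            domain_pause_nosync(d);   /* stop the page-split-causing domain */
            notify_toolstack(d);      /* hypothetical: ask for a pool refill */
        }

        return pg;
    }
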
> But this batch page-allocation isn't implemented in Xen today.
> 
> Andres Lagar-Cavilla says "... this is because of shortcomings in the
> [Xen] mm layer and its interaction with wait queues, documented
> elsewhere."  In other words, this batching proposal requires
> significant changes to the hypervisor, which I think we
> all agreed we were trying to avoid.

This is a misunderstanding. There is no connection between the batching proposal and what I was referring to in the quote. Certainly I never advocated for pre-allocations.

The "significant changes to the hypervisor" statement is FUD. Everyone you've addressed on this email makes significant changes to the hypervisor, under the proviso that they are necessary/useful changes.

The interactions between the mm layer and wait queues need fixing, sooner or later, claim hypercall or not. But they are not a blocker, they are essentially a race that may trigger under certain circumstances. That is why they remain a low-priority fix.

> 
> [Note to Andre: I'm not objecting to the need for this functionality
> for page-sharing to work with proprietary kernels and DMC; just

Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor is focused only on proprietary kernels...

> pointing out that it, too, is dependent on further hypervisor changes.]

… with 4.2 Xen. It is not perfect and has limitations that I am trying to fix. But our product ships, and page sharing works for anyone who would want to consume it, independently of further hypervisor changes.

> 
> Such an approach makes sense in the min==max model enforced by
> DMC but, again, DMC is not prescribed by the toolstack.
> 
> Further, this waitqueue solution for page-splitting only awkwardly
> works around in-guest ballooning (probably only with more hypervisor
> changes, TBD) and would be useless for tmem.  [IIGT: Please argue
> this last point only if you feel confident you truly understand how
> tmem works.]

I will argue though that "waitqueue solution … ballooning" is not true. Ballooning has never needed hypervisor wait queues, nor does it suddenly need them now.

> 
> So this as-yet-unimplemented solution only really solves a part
> of the problem.

As per the previous comments, I don't see your characterization as accurate.

Andres
> 
> Are there any other possibilities proposed?  Ian Jackson has
> suggested a somewhat different approach:
> 
> Let me quote Ian Jackson again:
> 
> "Of course if it is really desired to have each guest make its own
> decisions and simply for them to somehow agree to divvy up the
> available resources, then even so a new hypervisor mechanism is
> not needed.  All that is needed is a way for those guests to
> synchronise their accesses and updates to shared records of the
> available and in-use memory."
> 
> Ian then goes on to say:  "I don't have a detailed counter-proposal
> design of course..."
> 
> This proposal is certainly possible, but I think most would agree that
> it would require some fairly massive changes in OS memory management
> design that would run contrary to many years of computing history.
> It requires guest OS's to cooperate with each other about basic memory
> management decisions.  And to work for tmem, it would require
> communication from atomic code in the kernel to user-space, then communication from user-space in a guest to user-space-in-domain0
> and then (presumably... I don't have a design either) back again.
> One must also wonder what the performance impact would be.
> 
> CONCLUDING REMARKS
> 
> "Any functionality which can be reasonably provided outside the
>  hypervisor should be excluded from it."
> 
> I think this document has described a real customer problem and
> a good solution that could be implemented either in the toolstack
> or in the hypervisor.  Memory allocation in existing Xen functionality
> has been shown to interfere significantly with the toolstack-based
> solution and suggested partial solutions to those issues either
> require even more hypervisor work, or are completely undesigned and,
> at least, call into question the definition of "reasonably".
> 
> The hypervisor-based solution has been shown to be extremely
> simple, fits very logically with existing Xen memory management
> mechanisms/code, and has been reviewed through several iterations
> by Xen hypervisor experts.
> 
> While I understand completely the Xen maintainers' desire to
> fend off unnecessary additions to the hypervisor, I believe
> XENMEM_claim_pages is a reasonable and natural hypervisor feature
> and I hope you will now Ack the patch.
> 
> Acknowledgements: Thanks very much to Konrad for his thorough
> read-through and for suggestions on how to soften my combative
> style which may have alienated the maintainers more than the
> proposal itself.


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-04  3:24 ` Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions Andres Lagar-Cavilla
@ 2012-12-18 22:17   ` Konrad Rzeszutek Wilk
  2012-12-19 12:53     ` George Dunlap
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-12-18 22:17 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Tim Deegan, Ian Jackson, xen-devel,
	Jan Beulich

Hey Andres,

Thanks for your response. Sorry for the really late reply - I
had it in my postponed mailbox and thought it had been sent
already.

On Mon, Dec 03, 2012 at 10:24:40PM -0500, Andres Lagar-Cavilla wrote:
> > I earlier promised a complete analysis of the problem
> > addressed by the proposed claim hypercall as well as
> > an analysis of the alternate solutions.  I had not
> > yet provided these analyses when I asked for approval
> > to commit the hypervisor patch, so there was still
> > a good amount of misunderstanding, and I am trying
> > to fix that here.
> > 
> > I had hoped this essay could be both concise and complete
> > but quickly found it to be impossible to be both at the
> > same time.  So I have erred on the side of verbosity,
> > but also have attempted to ensure that the analysis
> > flows smoothly and is understandable to anyone interested
> > in learning more about memory allocation in Xen.
> > I'd appreciate feedback from other developers to understand
> > if I've also achieved that goal.
> > 
> > Ian, Ian, George, and Tim -- I have tagged a few
> > out-of-flow questions to you with [IIGT].  If I lose
> > you at any point, I'd especially appreciate your feedback
> > at those points.  I trust that, first, you will read
> > this completely.  As I've said, I understand that
> > Oracle's paradigm may differ in many ways from your
> > own, so I also trust that you will read it completely
> > with an open mind.
> > 
> > Thanks,
> > Dan
> > 
> > PROBLEM STATEMENT OVERVIEW
> > 
> > The fundamental problem is a race; two entities are
> > competing for part or all of a shared resource: in this case,
> > physical system RAM.  Normally, a lock is used to mediate
> > a race.
> > 
> > For memory allocation in Xen, there are two significant
> > entities, the toolstack and the hypervisor.  And, in
> > general terms, there are currently two important locks:
> > one used in the toolstack for domain creation;
> > and one in the hypervisor used for the buddy allocator.
> > 
> > Considering first only domain creation, the toolstack
> > lock is taken to ensure that domain creation is serialized.
> > The lock is taken when domain creation starts, and released
> > when domain creation is complete.
> > 
> > As system and domain memory requirements grow, the amount
> > of time to allocate all necessary memory to launch a large
> > domain is growing and may now exceed several minutes, so
> > this serialization is increasingly problematic.  The result
> > is a customer reported problem:  If a customer wants to
> > launch two or more very large domains, the "wait time"
> > required by the serialization is unacceptable.
> > 
> > Oracle would like to solve this problem.  And Oracle
> > would like to solve this problem not just for a single
> > customer sitting in front of a single machine console, but
> > for the very complex case of a large number of machines,
> > with the "agent" on each machine taking independent
> > actions including automatic load balancing and power
> > management via migration.
> Hi Dan,
> an issue with your reasoning throughout has been the constant invocation of the multi-host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi-host VM management).
> 

Heh. I hadn't realized that the emails need to conform to
the way legal briefs are written in the US :-) Meaning that
each topic must be addressed.

Anyhow, the multi-host env and a single-host env have the same
issue - you try to launch multiple guests and some of
them might not launch.

The changes that Dan is proposing (the claim hypercall)
would provide the functionality to fix this problem.

> 
> >  (This complex environment
> > is sold by Oracle today; it is not a "future vision".)
> > 
> > [IIGT] Completely ignoring any possible solutions to this
> > problem, is everyone in agreement that this _is_ a problem
> > that _needs_ to be solved with _some_ change in the Xen
> > ecosystem?
> > 
> > SOME IMPORTANT BACKGROUND INFORMATION
> > 
> > In the subsequent discussion, it is important to
> > understand a few things:
> > 
> > While the toolstack lock is held, allocating memory for
> > the domain creation process is done as a sequence of one
> > or more hypercalls, each asking the hypervisor to allocate
> > one or more -- "X" -- slabs of physical RAM, where a slab
> > is 2**N contiguous aligned pages, also known as an
> > "order N" allocation.  While the hypercall is defined
> > to work with any value of N, common values are N=0
> > (individual pages), N=9 ("hugepages" or "superpages"),
> > and N=18 ("1GiB pages").  So, for example, if the toolstack
> > requires 201MiB of memory, it will make two hypercalls:
> > One with X=100 and N=9, and one with X=256 and N=0.
> > 
> > While the toolstack may ask for a smaller number X of
> > order==9 slabs, system fragmentation may unpredictably
> > cause the hypervisor to fail the request, in which case
> > the toolstack will fall back to a request for 512*X
> > individual pages.  If there is sufficient RAM in the system,
> > this request for order==0 pages is guaranteed to succeed.
> > Thus for a 1TiB domain, the hypervisor must be prepared
> > to allocate up to 256Mi individual pages.
> > 
> > Note carefully that when the toolstack hypercall asks for
> > 100 slabs, the hypervisor "heaplock" is currently taken
> > and released 100 times.  Similarly, for 256M individual
> > pages... 256 million spin_lock-alloc_page-spin_unlocks.
> > This means that domain creation is not "atomic" inside
> > the hypervisor, which means that races can and will still
> > occur.
> > 
> > RULING OUT SOME SIMPLE SOLUTIONS
> > 
> > Is there an elegant simple solution here?
> > 
> > Let's first consider the possibility of removing the toolstack
> > serialization entirely and/or the possibility that two
> > independent toolstack threads (or "agents") can simultaneously
> > request a very large domain creation in parallel.  As described
> > above, the hypervisor's heaplock is insufficient to serialize RAM
> > allocation, so the two domain creation processes race.  If there
> > is sufficient resource for either one to launch, but insufficient
> > resource for both to launch, the winner of the race is indeterminate,
> > and one or both launches will fail, possibly after one or both 
> > domain creation threads have been working for several minutes.
> > This is a classic "TOCTOU" (time-of-check-time-of-use) race.
> > If a customer is unhappy waiting several minutes to launch
> > a domain, they will be even more unhappy waiting for several
> > minutes to be told that one or both of the launches has failed.
> > Multi-minute failure is even more unacceptable for an automated
> > agent trying to, for example, evacuate a machine that the
> > data center administrator needs to powercycle.
> > 
> > [IIGT: Please hold your objections for a moment... the paragraph
> > above is discussing the simple solution of removing the serialization;
> > your suggested solution will be discussed soon.]
> > 
> > Next, let's consider the possibility of changing the heaplock
> > strategy in the hypervisor so that the lock is held not
> > for one slab but for the entire request of X slabs.  As with
> > any core hypervisor lock, holding the heaplock for a "long time"
> > is unacceptable.  To a hypervisor, several minutes is an eternity.
> > And, in any case, by serializing domain creation in the hypervisor,
> > we have really only moved the problem from the toolstack into
> > the hypervisor, not solved the problem.
> > 
> > [IIGT] Are we in agreement that these simple solutions can be
> > safely ruled out?
> > 
> > CAPACITY ALLOCATION VS RAM ALLOCATION
> > 
> > Looking for a creative solution, one may realize that it is the
> > page allocation -- especially in large quantities -- that is very
> > time-consuming.  But, thinking outside of the box, it is not
> > the actual pages of RAM that we are racing on, but the quantity of pages required to launch a domain!  If we instead have a way to
> > "claim" a quantity of pages cheaply now and then allocate the actual
> > physical RAM pages later, we have changed the race to require only serialization of the claiming process!  In other words, if some entity
> > knows the number of pages available in the system, and can "claim"
> > N pages for the benefit of a domain being launched, the successful launch of the domain can be ensured.  Well... the domain launch may
> > still fail for an unrelated reason, but not due to a memory TOCTOU
> > race.  But, in this case, if the cost (in time) of the claiming
> > process is very small compared to the cost of the domain launch,
> > we have solved the memory TOCTOU race with hardly any delay added
> > to a non-memory-related failure that would have occurred anyway.
> > 
> > This "claim" sounds promising.  But we have made an assumption that
> > an "entity" has certain knowledge.  In the Xen system, that entity
> > must be either the toolstack or the hypervisor.  Or, in the Oracle
> > environment, an "agent"... but an agent and a toolstack are similar
> > enough for our purposes that we will just use the more broadly-used
> > term "toolstack".  In using this term, however, it's important to
> > remember it is necessary to consider the existence of multiple
> > threads within this toolstack.
> > 
> > Now I quote Ian Jackson: "It is a key design principle of a system
> > like Xen that the hypervisor should provide only those facilities
> > which are strictly necessary.  Any functionality which can be
> > reasonably provided outside the hypervisor should be excluded
> > from it."
> > 
> > So let's examine the toolstack first.
> > 
> > [IIGT] Still all on the same page (pun intended)?
> > 
> > TOOLSTACK-BASED CAPACITY ALLOCATION
> > 
> > Does the toolstack know how many physical pages of RAM are available?
> > Yes, it can use a hypercall to find out this information after Xen and
> > dom0 launch, but before it launches any domain.  Then if it subtracts
> > the number of pages used when it launches a domain and is aware of
> > when any domain dies, and adds them back, the toolstack has a pretty
> > good estimate.  In actuality, the toolstack doesn't _really_ know the
> > exact number of pages used when a domain is launched, but there
> > is a poorly-documented "fuzz factor"... the toolstack knows the
> > number of pages within a few megabytes, which is probably close enough.
> > 
> > This is a fairly good description of how the toolstack works today
> > and the accounting seems simple enough, so does toolstack-based
> > capacity allocation solve our original problem?  It would seem so.
> > Even if there are multiple threads, the accounting -- not the extended
> > sequence of page allocation for the domain creation -- can be
> > serialized by a lock in the toolstack.  But note carefully, either
> > the toolstack and the hypervisor must always be in sync on the
> > number of available pages (within an acceptable margin of error);
> > or any query to the hypervisor _and_ the toolstack-based claim must
> > be paired atomically, i.e. the toolstack lock must be held across
> > both.  Otherwise we again have another TOCTOU race. Interesting,
> > but probably not really a problem.
> > 
> > Wait, isn't it possible for the toolstack to dynamically change the
> > number of pages assigned to a domain?  Yes, this is often called
> > ballooning and the toolstack can do this via a hypercall.  But
> 
> > that's still OK because each call goes through the toolstack and
> > it simply needs to add more accounting for when it uses ballooning
> > to adjust the domain's memory footprint.  So we are still OK.
> > 
> > But wait again... that brings up an interesting point.  Are there
> > any significant allocations that are done in the hypervisor without
> > the knowledge and/or permission of the toolstack?  If so, the
> > toolstack may be missing important information.
> > 
> > So are there any such allocations?  Well... yes. There are a few.
> > Let's take a moment to enumerate them:
> > 
> > A) In Linux, a privileged user can write to a sysfs file which writes
> > to the balloon driver which makes hypercalls from the guest kernel to
> 
> A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host? This may be justified with arguments pertaining to the stability of the in-guest workload, but what such arguments really reveal are limitations of ballooning. And the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hypercall.

Why is this a limitation? Why shouldn't the guest be allowed to change
its memory usage? It can go up and down as it sees fit.
And if it goes down and it gets better performance - well, why shouldn't
it do it?

I concur it is odd - but it has been like that for decades.


> 
> > the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
> > The toolstack controls constraints (essentially a minimum and maximum)
> > which the hypervisor enforces.  The toolstack can ensure that the
> > minimum and maximum are identical to essentially disallow Linux from
> > using this functionality.  Indeed, this is precisely what Citrix's
> > Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
> > footprint changes.  But DMC is not prescribed by the toolstack,
> 
> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained a control to apply, how often to be alerted, and when to stall the domain.

Is there a down-call (that is, events) to the toolstack from the hypervisor
when the guest tries to balloon in/out? So the need to handle this problem
arose, but the mechanism to deal with it has been shifted to user-space?
What is one to do when the guest balloons in/out at frequent
intervals?

I am actually missing the reasoning behind wanting to stall the domain.
Is that to compress/swap the pages that the guest requests? Meaning
a user-space daemon that does "things" and has ownership
of the pages?

> 
> > and some real Oracle Linux customers use and depend on the flexibility
> > provided by in-guest ballooning.   So guest-privileged-user-driven-
> > ballooning is a potential issue for toolstack-based capacity allocation.
> > 
> > [IIGT: This is why I have brought up DMC several times and have
> > called this the "Citrix model,".. I'm not trying to be snippy
> > or impugn your morals as maintainers.]
> > 
> > B) Xen's page sharing feature has slowly been completed over a number
> > of recent Xen releases.  It takes advantage of the fact that many
> > pages often contain identical data; the hypervisor merges them to save
> 
> Great care has been taken for this statement to not be exactly true. The hypervisor discards one of the two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discarded page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"
> 

Is the toolstack (or a daemon in userspace) doing this? I would
have thought that there would be some optimization to do this
somewhere?

> > physical RAM.  When any "shared" page is written, the hypervisor
> > "splits" the page (aka, copy-on-write) by allocating a new physical
> > page.  There is a long history of this feature in other virtualization
> > products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second.  The
> > hypervisor does not notify or ask permission of the toolstack.
> > So, page-splitting is an issue for toolstack-based capacity
> > allocation, at least as currently coded in Xen.
> > 
> > [Andre: Please hold your objection here until you read further.]
> 
> Name is Andres. And please cc me if you'll be addressing me directly!
> 
> Note that I don't disagree with your previous statement in itself. Although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.

<nods>
> 
> > 
> > C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
> > toolstack for over three years.  It depends on an in-guest-kernel
> > adaptive technique to constantly adjust the domain memory footprint as
> > well as hooks in the in-guest-kernel to move data to and from the
> > hypervisor.  While the data is in the hypervisor's care, interesting
> > memory-load balancing between guests is done, including optional
> > compression and deduplication.  All of this has been in Xen since 2009
> > and has been awaiting changes in the (guest-side) Linux kernel. Those
> > changes are now merged into the mainstream kernel and are fully
> > functional in shipping distros.
> > 
> > While a complete description of tmem's guest<->hypervisor interaction
> > is beyond the scope of this document, it is important to understand
> > that any tmem-enabled guest kernel may unpredictably request thousands
> > or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second with absolutely no interaction with the toolstack.  Further, the guest-side hypercalls that allocate pages
> > via the hypervisor are done in "atomic" code deep in the Linux mm
> > subsystem.
> > 
> > Indeed, if one truly understands tmem, it should become clear that
> > tmem is fundamentally incompatible with toolstack-based capacity
> > allocation. But let's stop discussing tmem for now and move on.
> 
> You have not discussed tmem pool thaw and freeze in this proposal.

Oooh, you know about it :-) Dan didn't want to get too verbose on
people. It is a bit of a rathole - and this hypercall would
allow us to deprecate said freeze/thaw calls.

> 
> > 
> > OK.  So with existing code both in Xen and Linux guests, there are
> > three challenges to toolstack-based capacity allocation.  We'd
> > really still like to do capacity allocation in the toolstack.  Can
> > something be done in the toolstack to "fix" these three cases?
> > 
> > Possibly.  But let's first look at hypervisor-based capacity
> > allocation: the proposed "XENMEM_claim_pages" hypercall.
> > 
> > HYPERVISOR-BASED CAPACITY ALLOCATION
> > 
> > The posted patch for the claim hypercall is quite simple, but let's
> > look at it in detail.  The claim hypercall is actually a subop
> > of an existing hypercall.  After checking parameters for validity,
> > a new function is called in the core Xen memory management code.
> > This function takes the hypervisor heaplock, checks for a few
> > special cases, does some arithmetic to ensure a valid claim, stakes
> > the claim, releases the hypervisor heaplock, and then returns.  To
> > review from earlier, the hypervisor heaplock protects _all_ page/slab
> > allocations, so we can be absolutely certain that there are no other
> > page allocation races.  This new function is about 35 lines of code,
> > not counting comments.
> > 
> > The patch includes two other significant changes to the hypervisor:
> > First, when any adjustment to a domain's memory footprint is made
> > (either through a toolstack-aware hypercall or one of the three
> > toolstack-unaware methods described above), the heaplock is
> > taken, arithmetic is done, and the heaplock is released.  This
> > is 12 lines of code.  Second, when any memory is allocated within
> > Xen, a check must be made (with the heaplock already held) to
> > determine if, given a previous claim, the domain has exceeded
> > its upper bound, maxmem.  This code is a single conditional test.
> > 
> > With some declarations, but not counting the copious comments,
> > all told, the new code provided by the patch is well under 100 lines.
> > 
> > What about the toolstack side?  First, it's important to note that
> > the toolstack changes are entirely optional.  If any toolstack
> > wishes either to not fix the original problem, or avoid toolstack-
> > unaware allocation completely by ignoring the functionality provided
> > by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
> > not use the new hypercall.
> 
> You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.

The one max_pages check that comes to my mind is the one that Xapi
uses. That is, it has a daemon that sets the max_pages of all the
guests at some value so that it can squeeze in as many guests as
possible. It also balloons pages out of a guest to make space if it
needs to launch a new one. The heuristic for how many pages, or the
ratio of max/min, looks to be proportional (so to make space for 1GB
for a new guest, and say we have 10 guests, we will subtract
101MB from each guest - the extra 1MB is for overhead).
This depends on one hypercall that the 'xl' and 'xm' toolstacks do not
use - the one which sets max_pages.

That code makes certain assumptions - that the guest will not balloon
up/down once the toolstack has decreed how much memory the guest
should use. It also assumes that the operations are semi-atomic -
and, to make that as true as it can, it executes these operations
serially.

This goes back to the problem statement - if we try to parallelize
this we run into the problem that the amount of memory we thought
was free is no longer accurate. The start of this email has a good
description of some of the issues.

In essence, max_pages does work - _if_ one does these operations
serially. We are trying to make this work in parallel and without
any failures - and for that, one quite simple approach
is the claim hypercall. It sets up a 'stake' on the amount of
memory that the hypervisor should reserve. This way other
guest creations/ballooning do not infringe on the 'claimed' amount.

I believe that with this hypercall Xapi could be made to do its
operations in parallel as well.
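For what it's worth, the squeeze heuristic described above boils down to
something like the sketch below (entirely made-up names); it illustrates why
the scheme is only safe when these operations run serially:

    /* To free "needed_kb" for a new guest, shave an equal share plus a
     * little slack off every running guest's max_pages, then wait for the
     * balloon drivers to hand the memory back before starting the guest. */
    static void squeeze_guests(struct host *h, uint64_t needed_kb)
    {
        uint64_t share_kb = needed_kb / h->nr_guests + SLACK_KB; /* SLACK_KB: made up */
        unsigned int i;

        for ( i = 0; i < h->nr_guests; i++ )
            set_guest_max_kb(h->guest[i],
                             get_guest_max_kb(h->guest[i]) - share_kb);

        wait_for_balloon_targets(h);  /* nothing else may run in parallel here */
    }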

> 
> >  Second, it's very relevant to note that the Oracle product uses a combination of a proprietary "manager"
> > which oversees many machines, and the older open-source xm/xend
> > toolstack, for which the current Xen toolstack maintainers are no
> > longer accepting patches.
> > 
> > The preface of the published patch does suggest, however, some
> > straightforward pseudo-code, as follows:
> > 
> > Current toolstack domain creation memory allocation code fragment:
> > 
> > 1. call populate_physmap repeatedly to achieve mem=N memory
> > 2. if any populate_physmap call fails, report -ENOMEM up the stack
> > 3. memory is held until domain dies or the toolstack decreases it
> > 
> > Proposed toolstack domain creation memory allocation code fragment
> > (new code marked with "+"):
> > 
> > +  call claim for mem=N amount of memory
> > +. if claim succeeds:
> > 1.  call populate_physmap repeatedly to achieve mem=N memory (failsafe)
> > +  else
> > 2.  report -ENOMEM up the stack
> > +  claim is held until mem=N is achieved or the domain dies or
> >    forced to 0 by a second hypercall
> > 3. memory is held until domain dies or the toolstack decreases it
> > 
> > Reviewing the pseudo-code, one can readily see that the toolstack
> > changes required to implement the hypercall are quite small.
> > 
> > To complete this discussion, it has been pointed out that
> > the proposed hypercall doesn't solve the original problem
> > for certain classes of legacy domains... but also neither
> > does it make the problem worse.  It has also been pointed
> > out that the proposed patch is not (yet) NUMA-aware.
> > 
> > Now let's return to the earlier question:  There are three 
> > challenges to toolstack-based capacity allocation, which are
> > all handled easily by in-hypervisor capacity allocation. But we'd
> > really still like to do capacity allocation in the toolstack.
> > Can something be done in the toolstack to "fix" these three cases?
> > 
> > The answer is, of course, certainly... anything can be done in
> > software.  So, recalling Ian Jackson's stated requirement:
> > 
> > "Any functionality which can be reasonably provided outside the
> >  hypervisor should be excluded from it."
> > 
> > we are now left to evaluate the subjective term "reasonably".
> > 
> > CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?
> > 
> > In earlier discussion on this topic, when page-splitting was raised
> > as a concern, some of the authors of Xen's page-sharing feature
> > pointed out that a mechanism could be designed such that "batches"
> > of pages were pre-allocated by the toolstack and provided to the
> > hypervisor to be utilized as needed for page-splitting.  Should the
> > batch run dry, the hypervisor could stop the domain that was provoking
> > the page-split until the toolstack could be consulted and the toolstack, at its leisure, could request the hypervisor to refill
> > the batch, which then allows the page-split-causing domain to proceed.
> > 
> > But this batch page-allocation isn't implemented in Xen today.
> > 
> > Andres Lagar-Cavilla says "... this is because of shortcomings in the
> > [Xen] mm layer and its interaction with wait queues, documented
> > elsewhere."  In other words, this batching proposal requires
> > significant changes to the hypervisor, which I think we
> > all agreed we were trying to avoid.
> 
> This is a misunderstanding. There is no connection between the batching proposal and what I was referring to in the quote. Certainly I never advocated for pre-allocations.
> 
> The "significant changes to the hypervisor" statement is FUD. Everyone you've addressed on this email makes significant changes to the hypervisor, under the proviso that they are necessary/useful changes.
> 
> The interactions between the mm layer and wait queues need fixing, sooner or later, claim hypercall or not. But they are not a blocker, they are essentially a race that may trigger under certain circumstances. That is why they remain a low-priority fix.
> 
> > 
> > [Note to Andre: I'm not objecting to the need for this functionality
> > for page-sharing to work with proprietary kernels and DMC; just
> 
> Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor is focused only on proprietary kernels...
> 
> > pointing out that it, too, is dependent on further hypervisor changes.]
> 
> … with 4.2 Xen. It is not perfect and has limitations that I am trying to fix. But our product ships, and page sharing works for anyone who would want to consume it, independently of further hypervisor changes.
> 

I believe what Dan is saying is that it is not enabled by default.
Meaning it does not get executed by /etc/init.d/xencommons and
as such it never gets run (or does it now?) - unless one knows
about it, or it is enabled by default in a product. But perhaps
we are both mistaken? Is it enabled by default now in xen-unstable?

> > 
> > Such an approach makes sense in the min==max model enforced by
> > DMC but, again, DMC is not prescribed by the toolstack.
> > 
> > Further, this waitqueue solution for page-splitting only awkwardly
> > works around in-guest ballooning (probably only with more hypervisor
> > changes, TBD) and would be useless for tmem.  [IIGT: Please argue
> > this last point only if you feel confident you truly understand how
> > tmem works.]
> 
> I will argue though that "waitqueue solution … ballooning" is not true. Ballooning has never needed hypervisor wait queues, nor does it suddenly need them now.

It is the use case of parallel starts that we are trying to solve.
Worse - we want to start 16GB or 32GB guests, and those seem to take
quite a bit of time.

> 
> > 
> > So this as-yet-unimplemented solution only really solves a part
> > of the problem.
> 
> As per the previous comments, I don't see your characterization as accurate.
> 
> Andres
> > 
> > Are there any other possibilities proposed?  Ian Jackson has
> > suggested a somewhat different approach:
> > 
> > Let me quote Ian Jackson again:
> > 
> > "Of course if it is really desired to have each guest make its own
> > decisions and simply for them to somehow agree to divvy up the
> > available resources, then even so a new hypervisor mechanism is
> > not needed.  All that is needed is a way for those guests to
> > synchronise their accesses and updates to shared records of the
> > available and in-use memory."
> > 
> > Ian then goes on to say:  "I don't have a detailed counter-proposal
> > design of course..."
> > 
> > This proposal is certainly possible, but I think most would agree that
> > it would require some fairly massive changes in OS memory management
> > design that would run contrary to many years of computing history.
> > It requires guest OS's to cooperate with each other about basic memory
> > management decisions.  And to work for tmem, it would require
> > communication from atomic code in the kernel to user-space, then communication from user-space in a guest to user-space-in-domain0
> > and then (presumably... I don't have a design either) back again.
> > One must also wonder what the performance impact would be.
> > 
> > CONCLUDING REMARKS
> > 
> > "Any functionality which can be reasonably provided outside the
> >  hypervisor should be excluded from it."
> > 
> > I think this document has described a real customer problem and
> > a good solution that could be implemented either in the toolstack
> > or in the hypervisor.  Memory allocation in existing Xen functionality
> > has been shown to interfere significantly with the toolstack-based
> > solution and suggested partial solutions to those issues either
> > require even more hypervisor work, or are completely undesigned and,
> > at least, call into question the definition of "reasonably".
> > 
> > The hypervisor-based solution has been shown to be extremely
> > simple, fits very logically with existing Xen memory management
> > mechanisms/code, and has been reviewed through several iterations
> > by Xen hypervisor experts.
> > 
> > While I understand completely the Xen maintainers' desire to
> > fend off unnecessary additions to the hypervisor, I believe
> > XENMEM_claim_pages is a reasonable and natural hypervisor feature
> > and I hope you will now Ack the patch.


Just as a summary, as this is getting to be a long thread - my
understanding has been that the hypervisor is supposed to be
toolstack independent.

Our first goal is to implement this in 'xend' as that
is what we use right now. The problem, of course, will be finding
somebody to review it :-(

We certainly want to implement this also in the 'xl' toolstack,
as that is what we want to use in the future when we rebase
our product on Xen 4.2 or greater.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-18 22:17   ` Konrad Rzeszutek Wilk
@ 2012-12-19 12:53     ` George Dunlap
  2012-12-19 13:48       ` George Dunlap
  2013-01-02 21:59       ` Konrad Rzeszutek Wilk
  2012-12-20 16:04     ` Tim Deegan
  2013-01-02 15:29     ` Andres Lagar-Cavilla
  2 siblings, 2 replies; 53+ messages in thread
From: George Dunlap @ 2012-12-19 12:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tim (Xen.org), Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Ian Jackson, xen-devel,
	Jan Beulich

On 18/12/12 22:17, Konrad Rzeszutek Wilk wrote:
>> Hi Dan, an issue with your reasoning throughout has been the constant 
>> invocation of the multi host environment as a justification for your 
>> proposal. But this argument is not used in your proposal below beyond 
>> this mention in passing. Further, there is no relation between what 
>> you are changing (the hypervisor) and what you are claiming it is 
>> needed for (multi host VM management). 
> Heh. I hadn't realized that the emails need to conform to
> the way legal briefs are written in the US :-) Meaning that
> each topic must be addressed.

Every time we try to suggest alternatives, Dan goes on some rant about 
how we're on different planets, how we're all old-guard stuck in 
static-land thinking, and how we're focused on single-server use cases, 
but that multi-server use cases are so different.  That's not a one-off: 
Dan has brought up the multi-server case several times as a reason that a 
user-space version won't work.  But when it comes down to it, he 
(apparently) has barely mentioned it.  If it's such a key point, 
why does he not bring it up here?  It turns out we were right all along 
-- the whole multi-server thing has nothing to do with it.  That's the 
point Andres is getting at, I think.

(FYI I'm not wasting my time reading mail from Dan anymore on this 
subject.  As far as I can tell in this entire discussion he has never 
changed his mind or his core argument in response to anything anyone has 
said, nor has he understood better our ideas or where we are coming 
from.  He has only responded by generating more verbiage than anyone has 
the time to read and understand, much less respond to.  That's why I 
suggested to Dan that he ask someone else to take over the conversation.)

> Anyhow, the multi-host env or a single-host env has the same
> issue - you try to launch multiple guests and some of
> them might not launch.
>
> The changes that Dan is proposing (the claim hypercall)
> would provide the functionality to fix this problem.
>
>> A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host. This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hyper call.
> Why is this a limitation? Why shouldn't the guest be allowed to change
> its memory usage? It can go up and down as it sees fit.
> And if it goes down and it gets better performance - well, why shouldn't
> it do it?
>
> I concur it is odd - but it has been like that for decades.

Well, it shouldn't be allowed to do it because it causes this problem 
you're having with creating guests in parallel.  Ultimately, that is the 
core of your problem.  So if you want us to solve the problem by 
implementing something in the hypervisor, then you need to justify why 
"Just don't have guests balloon down" is an unacceptable option.  Saying 
"why shouldn't it", and "it's been that way for decades*" isn't a good 
enough reason.

* Xen is only just 10, so "decades" is a bit of a hyperbole. :-)

>
>
>>> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
>>> The toolstack controls constraints (essentially a minimum and maximum)
>>> which the hypervisor enforces.  The toolstack can ensure that the
>>> minimum and maximum are identical to essentially disallow Linux from
>>> using this functionality.  Indeed, this is precisely what Citrix's
>>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
>>> footprint changes.  But DMC is not prescribed by the toolstack,
>> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.
> Is there a down-call (i.e. events) to the tool-stack from the hypervisor when
> the guest tries to balloon in/out? So the need to handle this arose,
> but the mechanism to deal with it has been shifted to user-space
> then? What to do when the guest does this in/out ballooning at frequent
> intervals?
>
> I am actually missing the reasoning behind wanting to stall the domain.
> Is that to compress/swap the pages that the guest requests? Meaning
> a user-space daemon that does "things" and has ownership
> of the pages?
>
>>> and some real Oracle Linux customers use and depend on the flexibility
>>> provided by in-guest ballooning.   So guest-privileged-user-driven-
>>> ballooning is a potential issue for toolstack-based capacity allocation.
>>>
>>> [IIGT: This is why I have brought up DMC several times and have
>>> called this the "Citrix model,".. I'm not trying to be snippy
>>> or impugn your morals as maintainers.]
>>>
>>> B) Xen's page sharing feature has slowly been completed over a number
>>> of recent Xen releases.  It takes advantage of the fact that many
>>> pages often contain identical data; the hypervisor merges them to save
>> Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discard page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"
>>
> Is the toolstack (or a daemon in userspace) doing this? I would
> have thought that there would be some optimization to do this
> somewhere?
>
>>> physical RAM.  When any "shared" page is written, the hypervisor
>>> "splits" the page (aka, copy-on-write) by allocating a new physical
>>> page.  There is a long history of this feature in other virtualization
>>> products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second.  The
>>> hypervisor does not notify or ask permission of the toolstack.
>>> So, page-splitting is an issue for toolstack-based capacity
>>> allocation, at least as currently coded in Xen.
>>>
>>> [Andre: Please hold your objection here until you read further.]
>> Name is Andres. And please cc me if you'll be addressing me directly!
>>
>> Note that I don't disagree with your previous statement in itself. Although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.
> <nods>
>>> C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
>>> toolstack for over three years.  It depends on an in-guest-kernel
>>> adaptive technique to constantly adjust the domain memory footprint as
>>> well as hooks in the in-guest-kernel to move data to and from the
>>> hypervisor.  While the data is in the hypervisor's care, interesting
>>> memory-load balancing between guests is done, including optional
>>> compression and deduplication.  All of this has been in Xen since 2009
>>> and has been awaiting changes in the (guest-side) Linux kernel. Those
>>> changes are now merged into the mainstream kernel and are fully
>>> functional in shipping distros.
>>>
>>> While a complete description of tmem's guest<->hypervisor interaction
>>> is beyond the scope of this document, it is important to understand
>>> that any tmem-enabled guest kernel may unpredictably request thousands
>>> or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second with absolutely no interaction with the toolstack.  Further, the guest-side hypercalls that allocate pages
>>> via the hypervisor are done in "atomic" code deep in the Linux mm
>>> subsystem.
>>>
>>> Indeed, if one truly understands tmem, it should become clear that
>>> tmem is fundamentally incompatible with toolstack-based capacity
>>> allocation. But let's stop discussing tmem for now and move on.
>> You have not discussed tmem pool thaw and freeze in this proposal.
> Oooh, you know about it :-) Dan didn't want to get too verbose for
> people. It is a bit of a rathole - and this hypercall would
> allow us to deprecate said freeze/thaw calls.
>
>>> OK.  So with existing code both in Xen and Linux guests, there are
>>> three challenges to toolstack-based capacity allocation.  We'd
>>> really still like to do capacity allocation in the toolstack.  Can
>>> something be done in the toolstack to "fix" these three cases?
>>>
>>> Possibly.  But let's first look at hypervisor-based capacity
>>> allocation: the proposed "XENMEM_claim_pages" hypercall.
>>>
>>> HYPERVISOR-BASED CAPACITY ALLOCATION
>>>
>>> The posted patch for the claim hypercall is quite simple, but let's
>>> look at it in detail.  The claim hypercall is actually a subop
>>> of an existing hypercall.  After checking parameters for validity,
>>> a new function is called in the core Xen memory management code.
>>> This function takes the hypervisor heaplock, checks for a few
>>> special cases, does some arithmetic to ensure a valid claim, stakes
>>> the claim, releases the hypervisor heaplock, and then returns.  To
>>> review from earlier, the hypervisor heaplock protects _all_ page/slab
>>> allocations, so we can be absolutely certain that there are no other
>>> page allocation races.  This new function is about 35 lines of code,
>>> not counting comments.
>>>
>>> The patch includes two other significant changes to the hypervisor:
>>> First, when any adjustment to a domain's memory footprint is made
>>> (either through a toolstack-aware hypercall or one of the three
>>> toolstack-unaware methods described above), the heaplock is
>>> taken, arithmetic is done, and the heaplock is released.  This
>>> is 12 lines of code.  Second, when any memory is allocated within
>>> Xen, a check must be made (with the heaplock already held) to
>>> determine if, given a previous claim, the domain has exceeded
>>> its upper bound, maxmem.  This code is a single conditional test.
>>>
>>> With some declarations, but not counting the copious comments,
>>> all told, the new code provided by the patch is well under 100 lines.
>>>
>>> What about the toolstack side?  First, it's important to note that
>>> the toolstack changes are entirely optional.  If any toolstack
>>> wishes either to not fix the original problem, or avoid toolstack-
>>> unaware allocation completely by ignoring the functionality provided
>>> by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
>>> not use the new hyper call.
>> You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.
> The one max_page check that comes to my mind is the one that Xapi
> uses. That is it has a daemon that sets the max_pages of all the
> guests at some value so that it can squeeze in as many guests as
> possible. It also balloons pages out of a guest to make space if
> it needs to launch. The heuristic of how many pages or the ratio
> of max/min looks to be proportional (so to make space for 1GB
> for a guest, and say we have 10 guests, we will subtract
> 101MB from each guest - the extra 1MB is for extra overhead).
> This depends on one hypercall that 'xl' or 'xm' toolstack do not
> use - which sets the max_pages.
>
> That code makes certain assumptions - that the guest will not go up/down
> in the ballooning once the toolstack has decreed how much
> memory the guest should use. It also assumes that the operations
> are semi-atomic - and to make it so as much as it can - it executes
> these operations in serial.

No, the xapi code makes no such assumptions.  After it tells a guest to 
balloon down, it watches to see what actually happens, and has 
heuristics to deal with "non-cooperative guests".  It does assume that 
if it sets max_pages lower than or equal to the current amount of used 
memory, that the hypervisor will not allow the guest to balloon up -- 
but that's a pretty safe assumption.  A guest can balloon down if it 
wants to, but as xapi does not consider that memory free, it will never 
use it.

BTW, I don't know if you realize this: Originally Xen would return an 
error if you tried to set max_pages below tot_pages.  But as a result of 
the DMC work, it was seen as useful to allow the toolstack to tell the 
hypervisor once, "Once the VM has ballooned down to X, don't let it 
balloon up above X anymore."
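
To make that concrete: capping a guest at whatever it currently has 
allocated is essentially a single xc_domain_setmaxmem() call from the 
toolstack side.  A rough, untested sketch against libxc (the exact type 
of the size argument to xc_domain_setmaxmem() has varied between 
releases):

#include <stdint.h>
#include <xenctrl.h>

/* Clamp a domain so it cannot balloon back up past its current
 * allocation: "once it has ballooned down to X, keep it at X". */
static int clamp_domain(xc_interface *xch, uint32_t domid)
{
    xc_dominfo_t info;

    if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
        info.domid != domid)
        return -1;

    /* tot_pages is the current allocation; convert pages to KiB. */
    return xc_domain_setmaxmem(xch, domid,
               (uint64_t)info.tot_pages * (XC_PAGE_SIZE / 1024));
}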

> This goes back to the problem statement - if we try to parallelize
> this we run into the problem that the amount of memory we thought
> was free is not true anymore. The start of this email has a good
> description of some of the issues.
>
> In essence, the max_pages does work - _if_ one does these operations
> in serial. We are trying to make this work in parallel and without
> any failures - and for that, one way that is quite simplistic
> is the claim hypercall. It sets up a 'stake' of the amount of
> memory that the hypervisor should reserve. This way other
> guest creations/ballooning do not infringe on the 'claimed' amount.

I'm not sure what you mean by "do these operations in serial" in this 
context.  Each of your "reservation hypercalls" has to happen in 
serial.  If we had a user-space daemon that was in charge of freeing up 
or reserving memory, each request to that daemon would happen in serial 
as well.  But once the allocation / reservation happened, the domain 
builds could happen in parallel.

> I believe with this hypercall the Xapi can be made to do its operations
> in parallel as well.

xapi can already boot guests in parallel when there's enough memory to 
do so -- what operations did you have in mind?

I haven't followed all of the discussion (for reasons mentioned above), 
but I think the alternative to Dan's solution is something like below.  
Maybe you can tell me why it's not very suitable:

Have one place in the user-space -- either in the toolstack, or a 
separate daemon -- that is responsible for knowing all the places where 
memory might be in use.  Memory can be in use either by Xen, or by one 
of several VMs, or in a tmem pool.

In your case, when not creating VMs, it can remove all limitations -- 
allow the guests or tmem to grow or shrink as much as they want.

When a request comes in for a certain amount of memory, it will go and 
set each VM's max_pages, and the max tmem pool size.  It can then check 
whether there is enough free memory to complete the allocation or not 
(since there's a race between checking how much memory a guest is using 
and setting max_pages).  If that succeeds, it can return "success".  If, 
while that VM is being built, another request comes in, it can again go 
around and set the max sizes lower.  It has to know how much of the 
memory is "reserved" for the first guest being built, but if there's 
enough left after that, it can return "success" and allow the second VM 
to start being built.

After the VMs are built, the toolstack can remove the limits again if it 
wants, again allowing the free flow of memory.

Do you see any problems with this scheme?  All it requires is for the 
toolstack to be able to temporarily set limits on both guests ballooning 
up and on tmem allocating more than a certain amount of memory.  We 
already have mechanisms for the first, so if we had a "max_pages" for 
tmem, then you'd have all the tools you need to implement it.
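
To be a bit more concrete about the reservation step, here is a rough, 
untested sketch of what such a daemon would do with plain libxc calls.  
The only missing piece is the tmem cap, so set_tmem_max_kb() below is a 
made-up placeholder, and MAX_DOMS is arbitrary:

#include <stdint.h>
#include <xenctrl.h>

#define MAX_DOMS 1024

/* Placeholder: a per-host cap on tmem growth does not exist today. */
static int set_tmem_max_kb(xc_interface *xch, uint64_t kb)
{
    (void)xch; (void)kb;
    return 0;
}

static int reserve_kb_for_new_domain(xc_interface *xch, uint64_t need_kb)
{
    xc_dominfo_t info[MAX_DOMS];
    xc_physinfo_t phys;
    uint64_t free_kb;
    int i, n;

    /* 1. Freeze every existing domain at its current footprint so
     *    nothing can grow underneath us while the new VM is built. */
    n = xc_domain_getinfo(xch, 0, MAX_DOMS, info);
    for (i = 0; i < n; i++)
        xc_domain_setmaxmem(xch, info[i].domid,
            (uint64_t)info[i].tot_pages * (XC_PAGE_SIZE / 1024));
    set_tmem_max_kb(xch, 0 /* or current tmem usage */);

    /* 2. Only now is "free memory" a stable number; check the claim. */
    if (xc_physinfo(xch, &phys))
        return -1;
    free_kb = (uint64_t)phys.free_pages * (XC_PAGE_SIZE / 1024);

    /* Caller starts the domain build(s) on success, and lifts the
     * limits again once they are done. */
    return free_kb >= need_kb ? 0 : -1;
}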

This is the point at which Dan says something about giant multi-host 
deployments, which has absolutely no bearing on the issue -- the 
reservation happens at a host level, whether it's in userspace or the 
hypervisor.

It's also where he goes on about how we're stuck in an old stodgy static 
world and he lives in a magical dynamic hippie world of peace and free 
love... er, free memory.  Which is also not true -- in the scenario I 
describe above, tmem is actively being used, and guests can actively 
balloon down and up, while the VM builds are happening.  In Dan's 
proposal, tmem and guests are prevented from allocating "reserved" 
memory by some complicated scheme inside the allocator; in the above 
proposal, tmem and guests are prevented from allocating "reserved" 
memory by simple hypervisor-enforced max_page settings.  The end result 
looks the same to me.

  -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-19 12:53     ` George Dunlap
@ 2012-12-19 13:48       ` George Dunlap
  2013-01-03 20:38         ` Dan Magenheimer
  2013-01-02 21:59       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 53+ messages in thread
From: George Dunlap @ 2012-12-19 13:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tim (Xen.org), Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Ian Jackson, xen-devel,
	Jan Beulich

On 19/12/12 12:53, George Dunlap wrote:
> When a request comes in for a certain amount of memory, it will go and 
> set each VM's max_pages, and the max tmem pool size.  It can then 
> check whether there is enough free memory to complete the allocation 
> or not (since there's a race between checking how much memory a guest 
> is using and setting max_pages).  If that succeeds, it can return 
> "success".  If, while that VM is being built, another request comes 
> in, it can again go around and set the max sizes lower.  It has to 
> know how much of the memory is "reserved" for the first guest being 
> built, but if there's enough left after that, it can return "success" 
> and allow the second VM to start being built.
>
> After the VMs are built, the toolstack can remove the limits again if 
> it wants, again allowing the free flow of memory.
>
> Do you see any problems with this scheme?  All it requires is for the 
> toolstack to be able to temporarily set limits on both guests 
> ballooning up and on tmem allocating more than a certain amount of 
> memory.  We already have mechanisms for the first, so if we had a 
> "max_pages" for tmem, then you'd have all the tools you need to 
> implement it.

I should also point out, this scheme has some distinct *advantages*: 
Namely, that if there isn't enough free memory, such a daemon can easily 
be modified to *make* free memory by cranking down balloon targets 
and/or tmem pool size.
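
(For what it's worth, "cranking down balloon targets" is just a xenstore 
write per domain - something like the untested sketch below.  It assumes 
the usual convention that memory/target is in KiB; the guest may lag 
behind or ignore the new target, so the daemon still has to re-check 
free memory afterwards.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>   /* xs.h on older trees */

/* Ask a guest's balloon driver to shrink by lowering its target. */
static int lower_balloon_target(struct xs_handle *xsh, int domid,
                                uint64_t new_target_kb)
{
    char path[64], val[32];

    snprintf(path, sizeof(path),
             "/local/domain/%d/memory/target", domid);
    snprintf(val, sizeof(val), "%llu",
             (unsigned long long)new_target_kb);

    return xs_write(xsh, XBT_NULL, path, val, strlen(val)) ? 0 : -1;
}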

  -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-18 22:17   ` Konrad Rzeszutek Wilk
  2012-12-19 12:53     ` George Dunlap
@ 2012-12-20 16:04     ` Tim Deegan
  2013-01-02 15:31       ` Andres Lagar-Cavilla
  2013-01-02 21:38       ` Dan Magenheimer
  2013-01-02 15:29     ` Andres Lagar-Cavilla
  2 siblings, 2 replies; 53+ messages in thread
From: Tim Deegan @ 2012-12-20 16:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Jan Beulich

Hi,

At 17:17 -0500 on 18 Dec (1355851071), Konrad Rzeszutek Wilk wrote:
> In essence, the max_pages does work - _if_ one does these operations
> in serial. We are trying to make this work in parallel and without
> any failures - and for that, one way that is quite simplistic
> is the claim hypercall. It sets up a 'stake' of the amount of
> memory that the hypervisor should reserve. This way other
> guest creations/ballooning do not infringe on the 'claimed' amount.
> 
> I believe with this hypercall the Xapi can be made to do its operations
> in parallel as well.

The question of starting VMs in parallel seems like a red herring to me:
- TTBOMK Xapi already can start VMs in parallel.  Since it knows what
  constraints it's placed on existing VMs and what VMs it's currently
  building, there is nothing stopping it.  Indeed, AFAICS any toolstack
  that can guarantee enough RAM to build one VM at a time could do the
  same for multiple parallel builds with a bit of bookkeeping.
- Dan's stated problem (failure during VM build in the presence of
  unconstrained guest-controlled allocations) happens even if there is
  only one VM being created.

> > > Andres Lagar-Cavilla says "... this is because of shortcomings in the
> > > [Xen] mm layer and its interaction with wait queues, documented
> > > elsewhere."  In other words, this batching proposal requires
> > > significant changes to the hypervisor, which I think we
> > > all agreed we were trying to avoid.
> >
> > Let me nip this in the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor one focused only on proprietary kernels...
> 
> I believe what Dan is saying is that it is not enabled by default.
> Meaning it does not get executed by /etc/init.d/xencommons and
> as such it never gets run (or does it now?) - unless one knows
> about it - or it is enabled by default in a product. But perhaps
> we are both mistaken? Is it enabled by default now on xen-unstable?

I think the point Dan was trying to make is that if you use page-sharing
to do overcommit, you can end up with the same problem that self-balloon
has: guest activity might consume all your RAM while you're trying to
build a new VM.

That could be fixed by a 'further hypervisor change' (constraining the
total amount of free memory that CoW unsharing can consume).  I suspect
that it can also be resolved by using d->max_pages on each shared-memory
VM to put a limit on how much memory they can (severally) consume.

> Just as a summary as this is getting to be a long thread - my
> understanding has been that the hypervisor is suppose to toolstack
> independent.

Let's keep calm.  If people were arguing "xl (or xapi) doesn't need this
so we shouldn't do it" that would certainly be wrong, but I don't think
that's the case.  At least I certainly hope not!

The discussion ought to be around the actual problem, which is (as far
as I can see) that in a system where guests are ballooning without
limits, VM creation failure can happen after a long delay.  In
particular it is the delay that is the problem, rather than the failure.
Some solutions that have been proposed so far:
 - don't do that, it's silly (possibly true but not helpful);
 - this reservation hypercall, to pull the failure forward;
 - make allocation faster to avoid the delay (a good idea anyway,
   but can it be made fast enough?);
 - use max_pages or similar to stop other VMs using all of RAM.

My own position remains that I can live with the reservation hypercall,
as long as it's properly done - including handling PV 32-bit and PV
superpage guests.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-18 22:17   ` Konrad Rzeszutek Wilk
  2012-12-19 12:53     ` George Dunlap
  2012-12-20 16:04     ` Tim Deegan
@ 2013-01-02 15:29     ` Andres Lagar-Cavilla
  2013-01-11 16:03       ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-02 15:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tim Deegan, Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Jan Beulich

Konrad et al.:
On Dec 18, 2012, at 5:17 PM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:

> Hey Andres,
> 
> Thanks for your response. Sorry for the really late response - I
> had it in my postponed mailbox and thought it has been sent
> already.

Been on vacation myself. No worries.

> 
> On Mon, Dec 03, 2012 at 10:24:40PM -0500, Andres Lagar-Cavilla wrote:
>>> I earlier promised a complete analysis of the problem
>>> addressed by the proposed claim hypercall as well as
>>> an analysis of the alternate solutions.  I had not
>>> yet provided these analyses when I asked for approval
>>> to commit the hypervisor patch, so there was still
>>> a good amount of misunderstanding, and I am trying
>>> to fix that here.
>>> 
>>> I had hoped this essay could be both concise and complete
>>> but quickly found it to be impossible to be both at the
>>> same time.  So I have erred on the side of verbosity,
>>> but also have attempted to ensure that the analysis
>>> flows smoothly and is understandable to anyone interested
>>> in learning more about memory allocation in Xen.
>>> I'd appreciate feedback from other developers to understand
>>> if I've also achieved that goal.
>>> 
>>> Ian, Ian, George, and Tim -- I have tagged a few
>>> out-of-flow questions to you with [IIGF].  If I lose
>>> you at any point, I'd especially appreciate your feedback
>>> at those points.  I trust that, first, you will read
>>> this completely.  As I've said, I understand that
>>> Oracle's paradigm may differ in many ways from your
>>> own, so I also trust that you will read it completely
>>> with an open mind.
>>> 
>>> Thanks,
>>> Dan
>>> 
>>> PROBLEM STATEMENT OVERVIEW
>>> 
>>> The fundamental problem is a race; two entities are
>>> competing for part or all of a shared resource: in this case,
>>> physical system RAM.  Normally, a lock is used to mediate
>>> a race.
>>> 
>>> For memory allocation in Xen, there are two significant
>>> entities, the toolstack and the hypervisor.  And, in
>>> general terms, there are currently two important locks:
>>> one used in the toolstack for domain creation;
>>> and one in the hypervisor used for the buddy allocator.
>>> 
>>> Considering first only domain creation, the toolstack
>>> lock is taken to ensure that domain creation is serialized.
>>> The lock is taken when domain creation starts, and released
>>> when domain creation is complete.
>>> 
>>> As system and domain memory requirements grow, the amount
>>> of time to allocate all necessary memory to launch a large
>>> domain is growing and may now exceed several minutes, so
>>> this serialization is increasingly problematic.  The result
>>> is a customer reported problem:  If a customer wants to
>>> launch two or more very large domains, the "wait time"
>>> required by the serialization is unacceptable.
>>> 
>>> Oracle would like to solve this problem.  And Oracle
>>> would like to solve this problem not just for a single
>>> customer sitting in front of a single machine console, but
>>> for the very complex case of a large number of machines,
>>> with the "agent" on each machine taking independent
>>> actions including automatic load balancing and power
>>> management via migration.
>> Hi Dan,
>> an issue with your reasoning throughout has been the constant invocation of the multi host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi host VM management).
>> 
> 
> Heh. I hadn't realized that the emails need to conform to
> the way legal briefs are written in the US :-) Meaning that
> each topic must be addressed.
> 
> Anyhow, the multi-host env or a single-host env has the same
> issue - you try to launch multiple guests and some of
> them might not launch.
> 
> The changes that Dan is proposing (the claim hypercall)
> would provide the functionality to fix this problem.
> 
>> 
>>> (This complex environment
>>> is sold by Oracle today; it is not a "future vision".)
>>> 
>>> [IIGT] Completely ignoring any possible solutions to this
>>> problem, is everyone in agreement that this _is_ a problem
>>> that _needs_ to be solved with _some_ change in the Xen
>>> ecosystem?
>>> 
>>> SOME IMPORTANT BACKGROUND INFORMATION
>>> 
>>> In the subsequent discussion, it is important to
>>> understand a few things:
>>> 
>>> While the toolstack lock is held, allocating memory for
>>> the domain creation process is done as a sequence of one
>>> or more hypercalls, each asking the hypervisor to allocate
>>> one or more -- "X" -- slabs of physical RAM, where a slab
>>> is 2**N contiguous aligned pages, also known as an
>>> "order N" allocation.  While the hypercall is defined
>>> to work with any value of N, common values are N=0
>>> (individual pages), N=9 ("hugepages" or "superpages"),
>>> and N=18 ("1GiB pages").  So, for example, if the toolstack
>>> requires 201MiB of memory, it will make two hypercalls:
>>> One with X=100 and N=9, and one with X=1 and N=0.
>>> 
>>> While the toolstack may ask for a smaller number X of
>>> order==9 slabs, system fragmentation may unpredictably
>>> cause the hypervisor to fail the request, in which case
>>> the toolstack will fall back to a request for 512*X
>>> individual pages.  If there is sufficient RAM in the system,
>>> this request for order==0 pages is guaranteed to succeed.
>>> Thus for a 1TiB domain, the hypervisor must be prepared
>>> to allocate up to 256Mi individual pages.
>>> 
>>> Note carefully that when the toolstack hypercall asks for
>>> 100 slabs, the hypervisor "heaplock" is currently taken
>>> and released 100 times.  Similarly, for 256M individual
>>> pages... 256 million spin_lock-alloc_page-spin_unlocks.
>>> This means that domain creation is not "atomic" inside
>>> the hypervisor, which means that races can and will still
>>> occur.
>>> 
>>> RULING OUT SOME SIMPLE SOLUTIONS
>>> 
>>> Is there an elegant simple solution here?
>>> 
>>> Let's first consider the possibility of removing the toolstack
>>> serialization entirely and/or the possibility that two
>>> independent toolstack threads (or "agents") can simultaneously
>>> request a very large domain creation in parallel.  As described
>>> above, the hypervisor's heaplock is insufficient to serialize RAM
>>> allocation, so the two domain creation processes race.  If there
>>> is sufficient resource for either one to launch, but insufficient
>>> resource for both to launch, the winner of the race is indeterminate,
>>> and one or both launches will fail, possibly after one or both 
>>> domain creation threads have been working for several minutes.
>>> This is a classic "TOCTOU" (time-of-check-time-of-use) race.
>>> If a customer is unhappy waiting several minutes to launch
>>> a domain, they will be even more unhappy waiting for several
>>> minutes to be told that one or both of the launches has failed.
>>> Multi-minute failure is even more unacceptable for an automated
>>> agent trying to, for example, evacuate a machine that the
>>> data center administrator needs to powercycle.
>>> 
>>> [IIGT: Please hold your objections for a moment... the paragraph
>>> above is discussing the simple solution of removing the serialization;
>>> your suggested solution will be discussed soon.]
>>> 
>>> Next, let's consider the possibility of changing the heaplock
>>> strategy in the hypervisor so that the lock is held not
>>> for one slab but for the entire request of N slabs.  As with
>>> any core hypervisor lock, holding the heaplock for a "long time"
>>> is unacceptable.  To a hypervisor, several minutes is an eternity.
>>> And, in any case, by serializing domain creation in the hypervisor,
>>> we have really only moved the problem from the toolstack into
>>> the hypervisor, not solved the problem.
>>> 
>>> [IIGT] Are we in agreement that these simple solutions can be
>>> safely ruled out?
>>> 
>>> CAPACITY ALLOCATION VS RAM ALLOCATION
>>> 
>>> Looking for a creative solution, one may realize that it is the
>>> page allocation -- especially in large quantities -- that is very
>>> time-consuming.  But, thinking outside of the box, it is not
>>> the actual pages of RAM that we are racing on, but the quantity of pages required to launch a domain!  If we instead have a way to
>>> "claim" a quantity of pages cheaply now and then allocate the actual
>>> physical RAM pages later, we have changed the race to require only serialization of the claiming process!  In other words, if some entity
>>> knows the number of pages available in the system, and can "claim"
>>> N pages for the benefit of a domain being launched, the successful launch of the domain can be ensured.  Well... the domain launch may
>>> still fail for an unrelated reason, but not due to a memory TOCTOU
>>> race.  But, in this case, if the cost (in time) of the claiming
>>> process is very small compared to the cost of the domain launch,
>>> we have solved the memory TOCTOU race with hardly any delay added
>>> to a non-memory-related failure that would have occurred anyway.
>>> 
>>> This "claim" sounds promising.  But we have made an assumption that
>>> an "entity" has certain knowledge.  In the Xen system, that entity
>>> must be either the toolstack or the hypervisor.  Or, in the Oracle
>>> environment, an "agent"... but an agent and a toolstack are similar
>>> enough for our purposes that we will just use the more broadly-used
>>> term "toolstack".  In using this term, however, it's important to
>>> remember it is necessary to consider the existence of multiple
>>> threads within this toolstack.
>>> 
>>> Now I quote Ian Jackson: "It is a key design principle of a system
>>> like Xen that the hypervisor should provide only those facilities
>>> which are strictly necessary.  Any functionality which can be
>>> reasonably provided outside the hypervisor should be excluded
>>> from it."
>>> 
>>> So let's examine the toolstack first.
>>> 
>>> [IIGT] Still all on the same page (pun intended)?
>>> 
>>> TOOLSTACK-BASED CAPACITY ALLOCATION
>>> 
>>> Does the toolstack know how many physical pages of RAM are available?
>>> Yes, it can use a hypercall to find out this information after Xen and
>>> dom0 launch, but before it launches any domain.  Then if it subtracts
>>> the number of pages used when it launches a domain and is aware of
>>> when any domain dies, and adds them back, the toolstack has a pretty
>>> good estimate.  In actuality, the toolstack doesn't _really_ know the
>>> exact number of pages used when a domain is launched, but there
>>> is a poorly-documented "fuzz factor"... the toolstack knows the
>>> number of pages within a few megabytes, which is probably close enough.
>>> 
>>> This is a fairly good description of how the toolstack works today
>>> and the accounting seems simple enough, so does toolstack-based
>>> capacity allocation solve our original problem?  It would seem so.
>>> Even if there are multiple threads, the accounting -- not the extended
>>> sequence of page allocation for the domain creation -- can be
>>> serialized by a lock in the toolstack.  But note carefully, either
>>> the toolstack and the hypervisor must always be in sync on the
>>> number of available pages (within an acceptable margin of error);
>>> or any query to the hypervisor _and_ the toolstack-based claim must
>>> be paired atomically, i.e. the toolstack lock must be held across
>>> both.  Otherwise we again have another TOCTOU race. Interesting,
>>> but probably not really a problem.
>>> 
>>> Wait, isn't it possible for the toolstack to dynamically change the
>>> number of pages assigned to a domain?  Yes, this is often called
>>> ballooning and the toolstack can do this via a hypercall.  But
>> 
>>> that's still OK because each call goes through the toolstack and
>>> it simply needs to add more accounting for when it uses ballooning
>>> to adjust the domain's memory footprint.  So we are still OK.
>>> 
>>> But wait again... that brings up an interesting point.  Are there
>>> any significant allocations that are done in the hypervisor without
>>> the knowledge and/or permission of the toolstack?  If so, the
>>> toolstack may be missing important information.
>>> 
>>> So are there any such allocations?  Well... yes. There are a few.
>>> Let's take a moment to enumerate them:
>>> 
>>> A) In Linux, a privileged user can write to a sysfs file which writes
>>> to the balloon driver which makes hypercalls from the guest kernel to
>> 
>> A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host. This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hyper call.
> 
> Why is this a limitation? Why shouldn't the guest be allowed to change
> its memory usage? It can go up and down as it sees fit.

No no. Can the guest change its cpu utilization outside scheduler constraints? NIC/block dev quotas? Why should an unprivileged guest be able to take a massive s*it over the host controller's memory allocation, at the guest's whim?

I'll be happy with a balloon the day I see an OS that can't be rooted :)

Obviously this points to a problem with sharing & paging. And this is why I still spam this thread. More below.
 
> And if it goes down and it gets better performance - well, why shouldn't
> it do it?
> 
> I concur it is odd - but it has been like that for decades.

Heh. Decades … one?
> 
> 
>> 
>>> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
>>> The toolstack controls constraints (essentially a minimum and maximum)
>>> which the hypervisor enforces.  The toolstack can ensure that the
>>> minimum and maximum are identical to essentially disallow Linux from
>>> using this functionality.  Indeed, this is precisely what Citrix's
>>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
>>> footprint changes.  But DMC is not prescribed by the toolstack,
>> 
>> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.
> 
> Is there a down-call (i.e. events) to the tool-stack from the hypervisor when
> the guest tries to balloon in/out? So the need to handle this arose,
> but the mechanism to deal with it has been shifted to user-space
> then? What to do when the guest does this in/out ballooning at frequent
> intervals?
>
> I am actually missing the reasoning behind wanting to stall the domain.
> Is that to compress/swap the pages that the guest requests? Meaning
> a user-space daemon that does "things" and has ownership
> of the pages?

The (my) reasoning is that this enables control over unsupervised growth. I was being facetious a couple lines above. Paging and sharing also have the same problem with badly behaved guests. So this is where you stop these guys, allow the toolstack to catch a breath, and figure out what to do with this domain (more RAM? page out? foo?).

All your questions are very valid, but they are policy in toolstack-land. Luckily the hypervisor needs no knowledge of that.
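
(To be concrete about the sort of policy I mean, it is nothing fancier 
than the shape of the untested sketch below: compare the domain's 
footprint against the cap the toolstack set, and pick an action.  The 
4 MiB slack threshold and the action names are made up for illustration, 
and a real version would react to an event rather than poll.)

#include <stdint.h>
#include <xenctrl.h>

enum policy_action { LEAVE_ALONE, GROW_CAP, PAGE_OUT };

static enum policy_action check_pressure(xc_interface *xch, uint32_t domid)
{
    xc_dominfo_t info;
    uint64_t cur_kb;

    if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
        info.domid != domid)
        return LEAVE_ALONE;

    cur_kb = (uint64_t)info.tot_pages * (XC_PAGE_SIZE / 1024);

    /* Within a few MiB of max_pages: the guest is (about to be) stalled,
     * so the toolstack gets to decide - more RAM, page out, etc. */
    if (cur_kb + 4096 >= info.max_memkb)
        return PAGE_OUT;            /* or GROW_CAP, per policy */

    return LEAVE_ALONE;
}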

> 
>> 
>>> and some real Oracle Linux customers use and depend on the flexibility
>>> provided by in-guest ballooning.   So guest-privileged-user-driven-
>>> ballooning is a potential issue for toolstack-based capacity allocation.
>>> 
>>> [IIGT: This is why I have brought up DMC several times and have
>>> called this the "Citrix model,".. I'm not trying to be snippy
>>> or impugn your morals as maintainers.]
>>> 
>>> B) Xen's page sharing feature has slowly been completed over a number
>>> of recent Xen releases.  It takes advantage of the fact that many
>>> pages often contain identical data; the hypervisor merges them to save
>> 
>> Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discard page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"
>> 
> 
> Is the toolstack (or a daemon in userspace) doing this? I would
> have thought that there would be some optimization to do this
> somewhere?

You could optimize but then you are baking policy where it does not belong. This is what KSM did, which I dislike. Seriously, does the kernel need to scan memory to find duplicates? Can't something else do it given suitable interfaces? Now any other form of sharing policy that tries to use VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried to avoid this. I don't know that it was a conscious or deliberate effort, but it worked out that way.
 
> 
>>> physical RAM.  When any "shared" page is written, the hypervisor
>>> "splits" the page (aka, copy-on-write) by allocating a new physical
>>> page.  There is a long history of this feature in other virtualization
>>> products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second.  The
>>> hypervisor does not notify or ask permission of the toolstack.
>>> So, page-splitting is an issue for toolstack-based capacity
>>> allocation, at least as currently coded in Xen.
>>> 
>>> [Andre: Please hold your objection here until you read further.]
>> 
>> Name is Andres. And please cc me if you'll be addressing me directly!
>> 
>> Note that I don't disagree with your previous statement in itself. Although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.
> 
> <nods>
>> 
>>> 
>>> C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
>>> toolstack for over three years.  It depends on an in-guest-kernel
>>> adaptive technique to constantly adjust the domain memory footprint as
>>> well as hooks in the in-guest-kernel to move data to and from the
>>> hypervisor.  While the data is in the hypervisor's care, interesting
>>> memory-load balancing between guests is done, including optional
>>> compression and deduplication.  All of this has been in Xen since 2009
>>> and has been awaiting changes in the (guest-side) Linux kernel. Those
>>> changes are now merged into the mainstream kernel and are fully
>>> functional in shipping distros.
>>> 
>>> While a complete description of tmem's guest<->hypervisor interaction
>>> is beyond the scope of this document, it is important to understand
>>> that any tmem-enabled guest kernel may unpredictably request thousands
>>> or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second with absolutely no interaction with the toolstack.  Further, the guest-side hypercalls that allocate pages
>>> via the hypervisor are done in "atomic" code deep in the Linux mm
>>> subsystem.
>>> 
>>> Indeed, if one truly understands tmem, it should become clear that
>>> tmem is fundamentally incompatible with toolstack-based capacity
>>> allocation. But let's stop discussing tmem for now and move on.
>> 
>> You have not discussed tmem pool thaw and freeze in this proposal.
> 
> Oooh, you know about it :-) Dan didn't want to get too verbose for
> people. It is a bit of a rathole - and this hypercall would
> allow us to deprecate said freeze/thaw calls.
> 
>> 
>>> 
>>> OK.  So with existing code both in Xen and Linux guests, there are
>>> three challenges to toolstack-based capacity allocation.  We'd
>>> really still like to do capacity allocation in the toolstack.  Can
>>> something be done in the toolstack to "fix" these three cases?
>>> 
>>> Possibly.  But let's first look at hypervisor-based capacity
>>> allocation: the proposed "XENMEM_claim_pages" hypercall.
>>> 
>>> HYPERVISOR-BASED CAPACITY ALLOCATION
>>> 
>>> The posted patch for the claim hypercall is quite simple, but let's
>>> look at it in detail.  The claim hypercall is actually a subop
>>> of an existing hypercall.  After checking parameters for validity,
>>> a new function is called in the core Xen memory management code.
>>> This function takes the hypervisor heaplock, checks for a few
>>> special cases, does some arithmetic to ensure a valid claim, stakes
>>> the claim, releases the hypervisor heaplock, and then returns.  To
>>> review from earlier, the hypervisor heaplock protects _all_ page/slab
>>> allocations, so we can be absolutely certain that there are no other
>>> page allocation races.  This new function is about 35 lines of code,
>>> not counting comments.
>>> 
>>> The patch includes two other significant changes to the hypervisor:
>>> First, when any adjustment to a domain's memory footprint is made
>>> (either through a toolstack-aware hypercall or one of the three
>>> toolstack-unaware methods described above), the heaplock is
>>> taken, arithmetic is done, and the heaplock is released.  This
>>> is 12 lines of code.  Second, when any memory is allocated within
>>> Xen, a check must be made (with the heaplock already held) to
>>> determine if, given a previous claim, the domain has exceeded
>>> its upper bound, maxmem.  This code is a single conditional test.
>>> 
>>> With some declarations, but not counting the copious comments,
>>> all told, the new code provided by the patch is well under 100 lines.
>>> 
>>> What about the toolstack side?  First, it's important to note that
>>> the toolstack changes are entirely optional.  If any toolstack
>>> wishes either to not fix the original problem, or avoid toolstack-
>>> unaware allocation completely by ignoring the functionality provided
>>> by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
>>> not use the new hyper call.
>> 
>> You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.
> 
> The one max_page check that comes to my mind is the one that Xapi
> uses. That is it has a daemon that sets the max_pages of all the
> guests at some value so that it can squeeze in as many guests as
> possible. It also balloons pages out of a guest to make space if
> it needs to launch. The heuristic of how many pages or the ratio
> of max/min looks to be proportional (so to make space for 1GB
> for a guest, and say we have 10 guests, we will subtract
> 101MB from each guest - the extra 1MB is for extra overhead).
> This depends on one hypercall that 'xl' or 'xm' toolstack do not
> use - which sets the max_pages.
> 
> That code makes certain assumptions - that the guest will not go up/down
> in the ballooning once the toolstack has decreed how much
> memory the guest should use. It also assumes that the operations
> are semi-atomic - and to make it so as much as it can - it executes
> these operations in serial.
> 
> This goes back to the problem statement - if we try to parallelize
> this we run into the problem that the amount of memory we thought
> was free is not true anymore. The start of this email has a good
> description of some of the issues.

Just set max_pages (bad name...) everywhere as needed to make room. Then kick tmem (everywhere, in parallel) to free memory. Wait until enough is free …. Allocate your domain(s, in parallel). If any vcpus become stalled because a tmem guest driver is trying to allocate beyond max_pages, you need to adjust your allocations. As usual.
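
(And the "wait until enough is free" step is nothing exotic either.  An 
untested sketch, with a made-up timeout policy, just to show no 
hypervisor changes are needed for it:)

#include <stdint.h>
#include <unistd.h>
#include <xenctrl.h>

/* Poll the hypervisor's free-page count until a claim-sized amount of
 * memory is available, or give up after timeout_s seconds. */
static int wait_for_free_kb(xc_interface *xch, uint64_t need_kb,
                            int timeout_s)
{
    xc_physinfo_t phys;
    int t;

    for (t = 0; t < timeout_s; t++) {
        if (xc_physinfo(xch, &phys))
            return -1;
        if ((uint64_t)phys.free_pages * (XC_PAGE_SIZE / 1024) >= need_kb)
            return 0;       /* enough slack: start building domain(s) */
        sleep(1);
    }
    return -1;              /* not enough freed: adjust max_pages/targets */
}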

> 
> In essence, the max_pages does work - _if_ one does these operations
> in serial. We are trying to make this work in parallel and without
> any failures - and for that, one way that is quite simplistic
> is the claim hypercall. It sets up a 'stake' of the amount of
> memory that the hypervisor should reserve. This way other
> guest creations/ballooning do not infringe on the 'claimed' amount.
> 
> I believe with this hypercall the Xapi can be made to do its operations
> in parallel as well.
> 
>> 
>>> Second, it's very relevant to note that the Oracle product uses a combination of a proprietary "manager"
>>> which oversees many machines, and the older open-source xm/xend
>>> toolstack, for which the current Xen toolstack maintainers are no
>>> longer accepting patches.
>>> 
>>> The preface of the published patch does suggest, however, some
>>> straightforward pseudo-code, as follows:
>>> 
>>> Current toolstack domain creation memory allocation code fragment:
>>> 
>>> 1. call populate_physmap repeatedly to achieve mem=N memory
>>> 2. if any populate_physmap call fails, report -ENOMEM up the stack
>>> 3. memory is held until domain dies or the toolstack decreases it
>>> 
>>> Proposed toolstack domain creation memory allocation code fragment
>>> (new code marked with "+"):
>>> 
>>> +  call claim for mem=N amount of memory
>>> +. if claim succeeds:
>>> 1.  call populate_physmap repeatedly to achieve mem=N memory (failsafe)
>>> +  else
>>> 2.  report -ENOMEM up the stack
>>> +  claim is held until mem=N is achieved or the domain dies or
>>>   forced to 0 by a second hypercall
>>> 3. memory is held until domain dies or the toolstack decreases it
>>> 
>>> Reviewing the pseudo-code, one can readily see that the toolstack
>>> changes required to implement the hypercall are quite small.
>>> 
>>> To complete this discussion, it has been pointed out that
>>> the proposed hypercall doesn't solve the original problem
>>> for certain classes of legacy domains... but also neither
>>> does it make the problem worse.  It has also been pointed
>>> out that the proposed patch is not (yet) NUMA-aware.
>>> 
>>> Now let's return to the earlier question:  There are three 
>>> challenges to toolstack-based capacity allocation, which are
>>> all handled easily by in-hypervisor capacity allocation. But we'd
>>> really still like to do capacity allocation in the toolstack.
>>> Can something be done in the toolstack to "fix" these three cases?
>>> 
>>> The answer is, of course, certainly... anything can be done in
>>> software.  So, recalling Ian Jackson's stated requirement:
>>> 
>>> "Any functionality which can be reasonably provided outside the
>>> hypervisor should be excluded from it."
>>> 
>>> we are now left to evaluate the subjective term "reasonably".
>>> 
>>> CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?
>>> 
>>> In earlier discussion on this topic, when page-splitting was raised
>>> as a concern, some of the authors of Xen's page-sharing feature
>>> pointed out that a mechanism could be designed such that "batches"
>>> of pages were pre-allocated by the toolstack and provided to the
>>> hypervisor to be utilized as needed for page-splitting.  Should the
>>> batch run dry, the hypervisor could stop the domain that was provoking
>>> the page-split until the toolstack could be consulted and the toolstack, at its leisure, could request the hypervisor to refill
>>> the batch, which then allows the page-split-causing domain to proceed.
>>> 
>>> But this batch page-allocation isn't implemented in Xen today.
>>> 
>>> Andres Lagar-Cavilla says "... this is because of shortcomings in the
>>> [Xen] mm layer and its interaction with wait queues, documented
>>> elsewhere."  In other words, this batching proposal requires
>>> significant changes to the hypervisor, which I think we
>>> all agreed we were trying to avoid.
>> 
>> This is a misunderstanding. There is no connection between the batching proposal and what I was referring to in the quote. Certainly I never advocated for pre-allocations.
>> 
>> The "significant changes to the hypervisor" statement is FUD. Everyone you've addressed on this email makes significant changes to the hypervisor, under the proviso that they are necessary/useful changes.
>> 
>> The interactions between the mm layer and wait queues need fixing, sooner or later, claim hypercall or not. But they are not a blocker; they are essentially a race that may trigger under certain circumstances. That is why they remain a low-priority fix.
>> 
>>> 
>>> [Note to Andre: I'm not objecting to the need for this functionality
>>> for page-sharing to work with proprietary kernels and DMC; just
>> 
>> Let me nip this in the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor one focused only on proprietary kernels...
>> 
>>> pointing out that it, too, is dependent on further hypervisor changes.]
>> 
>> … with 4.2 Xen. It is not perfect and has limitations that I am trying to fix. But our product ships, and page sharing works for anyone who would want to consume it, independently of further hypervisor changes.
>> 
> 
> I believe what Dan is saying is that it is not enabled by default.
> Meaning it does not get executed by /etc/init.d/xencommons and
> as such it never gets run (or does it now?) - unless one knows
> about it - or it is enabled by default in a product. But perhaps
> we are both mistaken? Is it enabled by default now on xen-unstable?

I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging daemon? Neither daemon requires wait queue work, batch allocations, etc. I can't figure out what this portion of the conversation is about.

Having said that, thanks for the thoughtful follow-up
Andres

> 
>>> 
>>> Such an approach makes sense in the min==max model enforced by
>>> DMC but, again, DMC is not prescribed by the toolstack.
>>> 
>>> Further, this waitqueue solution for page-splitting only awkwardly
>>> works around in-guest ballooning (probably only with more hypervisor
>>> changes, TBD) and would be useless for tmem.  [IIGT: Please argue
>>> this last point only if you feel confident you truly understand how
>>> tmem works.]
>> 
>> I will argue though that "waitqueue solution … ballooning" is not true. Ballooning has never needed, nor does it suddenly need now, hypervisor wait queues.
> 
> It is the use case of parallel starts that we are trying to solve.
> Worse - we want to start 16GB or 32GB guests, and those seem to take
> quite a bit of time.
> 
>> 
>>> 
>>> So this as-yet-unimplemented solution only really solves a part
>>> of the problem.
>> 
>> As per the previous comments, I don't see your characterization as accurate.
>> 
>> Andres
>>> 
>>> Are there any other possibilities proposed?  Ian Jackson has
>>> suggested a somewhat different approach:
>>> 
>>> Let me quote Ian Jackson again:
>>> 
>>> "Of course if it is really desired to have each guest make its own
>>> decisions and simply for them to somehow agree to divvy up the
>>> available resources, then even so a new hypervisor mechanism is
>>> not needed.  All that is needed is a way for those guests to
>>> synchronise their accesses and updates to shared records of the
>>> available and in-use memory."
>>> 
>>> Ian then goes on to say:  "I don't have a detailed counter-proposal
>>> design of course..."
>>> 
>>> This proposal is certainly possible, but I think most would agree that
>>> it would require some fairly massive changes in OS memory management
>>> design that would run contrary to many years of computing history.
>>> It requires guest OSes to cooperate with each other about basic memory
>>> management decisions.  And to work for tmem, it would require
>>> communication from atomic code in the kernel to user-space, then
>>> communication from user-space in a guest to user-space in domain0,
>>> and then (presumably... I don't have a design either) back again.
>>> One must also wonder what the performance impact would be.
>>> 
>>> CONCLUDING REMARKS
>>> 
>>> "Any functionality which can be reasonably provided outside the
>>> hypervisor should be excluded from it."
>>> 
>>> I think this document has described a real customer problem and
>>> a good solution that could be implemented either in the toolstack
>>> or in the hypervisor.  Memory allocation in existing Xen functionality
>>> has been shown to interfere significantly with the toolstack-based
>>> solution, and the suggested partial solutions to those issues either
>>> require even more hypervisor work or are completely undesigned and,
>>> at the least, call into question the definition of "reasonably".
>>> 
>>> The hypervisor-based solution has been shown to be extremely
>>> simple, fits very logically with existing Xen memory management
>>> mechanisms/code, and has been reviewed through several iterations
>>> by Xen hypervisor experts.
>>> 
>>> While I understand completely the Xen maintainers' desire to
>>> fend off unnecessary additions to the hypervisor, I believe
>>> XENMEM_claim_pages is a reasonable and natural hypervisor feature
>>> and I hope you will now Ack the patch.
> 
> 
> Just as a summary, as this is getting to be a long thread - my
> understanding has been that the hypervisor is supposed to be toolstack
> independent.
> 
> Our first goal is to implement this in 'xend' as that
> is what we use right now. The problem, of course, will be finding somebody
> to review it :-(
> 
> We certainly want to implement this also in the 'xl' toolstack
> as in the future that is what we want to use when we rebase
> our product on Xen 4.2 or greater.


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-20 16:04     ` Tim Deegan
@ 2013-01-02 15:31       ` Andres Lagar-Cavilla
  2013-01-02 21:43         ` Dan Magenheimer
  2013-01-02 21:38       ` Dan Magenheimer
  1 sibling, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-02 15:31 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

Hello,
On Dec 20, 2012, at 11:04 AM, Tim Deegan <tim@xen.org> wrote:

> Hi,
> 
> At 17:17 -0500 on 18 Dec (1355851071), Konrad Rzeszutek Wilk wrote:
>> In essence, the max_pages does work - _if_ one does these operations
>> in serial. We are trying to make this work in parallel and without
>> any failures - for that, one way that is quite simplistic
>> is the claim hypercall. It sets up a 'stake' of the amount of
>> memory that the hypervisor should reserve. This way other
>> guest creations/ballooning do not infringe on the 'claimed' amount.
>> 
>> I believe with this hypercall the Xapi can be made to do its operations
>> in parallel as well.
> 
> The question of starting VMs in parallel seems like a red herring to me:
> - TTBOMK Xapi already can start VMs in parallel.  Since it knows what
>  constraints it's placed on existing VMs and what VMs it's currently
>  building, there is nothing stopping it.  Indeed, AFAICS any toolstack
>  that can guarantee enough RAM to build one VM at a time could do the
>  same for multiple parallel builds with a bit of bookkeeping.
> - Dan's stated problem (failure during VM build in the presence of
>  unconstrained guest-controlled allocations) happens even if there is
>  only one VM being created.
> 
>>>> Andres Lagar-Cavilla says "... this is because of shortcomings in the
>>>> [Xen] mm layer and its interaction with wait queues, documented
>>>> elsewhere."  In other words, this batching proposal requires
>>>> significant changes to the hypervisor, which I think we
>>>> all agreed we were trying to avoid.
>>> 
>>> Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor is focused only on proprietary kernels...
>> 
>> I believe Dan is saying is that it is not enabled by default.
>> Meaning it does not get executed in by /etc/init.d/xencommons and
>> as such it never gets run (or does it now?) - unless one knows
>> about it - or it is enabled by default in a product. But perhaps
>> we are both mistaken? Is it enabled by default now on xen-unstable?
> 
> I think the point Dan was trying to make is that if you use page-sharing
> to do overcommit, you can end up with the same problem that self-balloon
> has: guest activity might consume all your RAM while you're trying to
> build a new VM.
> 
> That could be fixed by a 'further hypervisor change' (constraining the
> total amount of free memory that CoW unsharing can consume).  I suspect
> that it can also be resolved by using d->max_pages on each shared-memory
> VM to put a limit on how much memory they can (severally) consume.

To be completely clear. I don't think we need a separate allocation/list of pages/foo to absorb CoW hits. I think the solution is using d->max_pages. Sharing will hit that limit and then send a notification via the "sharing" (which is actually an enomem) mem event ring.

Andres
> 
>> Just as a summary as this is getting to be a long thread - my
>> understanding has been that the hypervisor is suppose to toolstack
>> independent.
> 
> Let's keep calm.  If people were arguing "xl (or xapi) doesn't need this
> so we shouldn't do it" that would certainly be wrong, but I don't think
> that's the case.  At least I certainly hope not!
> 
> The discussion ought to be around the actual problem, which is (as far
> as I can see) that in a system where guests are ballooning without
> limits, VM creation failure can happen after a long delay.  In
> particular it is the delay that is the problem, rather than the failure.
> Some solutions that have been proposed so far:
> - don't do that, it's silly (possibly true but not helpful);
> - this reservation hypercall, to pull the failure forward;
> - make allocation faster to avoid the delay (a good idea anyway,
>   but can it be made fast enough?);
> - use max_pages or similar to stop other VMs using all of RAM.
> 
> My own position remains that I can live with the reservation hypercall,
> as long as it's properly done - including handling PV 32-bit and PV
> superpage guests.
> 
> Cheers,
> 
> Tim.


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-20 16:04     ` Tim Deegan
  2013-01-02 15:31       ` Andres Lagar-Cavilla
@ 2013-01-02 21:38       ` Dan Magenheimer
  2013-01-03 16:24         ` Andres Lagar-Cavilla
  2013-01-10 17:13         ` Tim Deegan
  1 sibling, 2 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-02 21:38 UTC (permalink / raw)
  To: Tim Deegan, Konrad Rzeszutek Wilk
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Jan Beulich

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> Hi,

Happy New Year Tim, and thanks for trying to add some clarity to the
discussion.

> The question of starting VMs in parallel seems like a red herring to me:
> - TTBOMK Xapi already can start VMs in parallel.  Since it knows what
>   constraints it's placed on existing VMs and what VMs it's currently
>   building, there is nothing stopping it.  Indeed, AFAICS any toolstack
>   that can guarantee enough RAM to build one VM at a time could do the
>   same for multiple parallel builds with a bit of bookkeeping.
> - Dan's stated problem (failure during VM build in the presence of
>   unconstrained guest-controlled allocations) happens even if there is
>   only one VM being created.

Agreed.  The parallel VM discussion was simply trying to point out
that races can occur even without guest-controlled allocations,
so is distracting from the actual issue (which is, according to
wikipedia, one of the definitions of "red herring").

(As an aside, your use of the word "unconstrained" is a red herring. ;-)
 
> > > > Andres Lagar-Cavilla says "... this is because of shortcomings in the
> > > > [Xen] mm layer and its interaction with wait queues, documented
> > > > elsewhere."  In other words, this batching proposal requires
> > > > significant changes to the hypervisor, which I think we
> > > > all agreed we were trying to avoid.
> > >
> > > Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't
> use Citrix's DMC, nor is focused only on proprietary kernels...
> >
> > I believe Dan is saying is that it is not enabled by default.
> > Meaning it does not get executed in by /etc/init.d/xencommons and
> > as such it never gets run (or does it now?) - unless one knows
> > about it - or it is enabled by default in a product. But perhaps
> > we are both mistaken? Is it enabled by default now on xen-unstable?
> 
> I think the point Dan was trying to make is that if you use page-sharing
> to do overcommit, you can end up with the same problem that self-balloon
> has: guest activity might consume all your RAM while you're trying to
> build a new VM.
> 
> That could be fixed by a 'further hypervisor change' (constraining the
> total amount of free memory that CoW unsharing can consume).  I suspect
> that it can also be resolved by using d->max_pages on each shared-memory
> VM to put a limit on how much memory they can (severally) consume.

(I will respond to this in the context of Andres' response shortly...)

> > Just as a summary as this is getting to be a long thread - my
> > understanding has been that the hypervisor is suppose to toolstack
> > independent.
> 
> Let's keep calm.  If people were arguing "xl (or xapi) doesn't need this
> so we shouldn't do it"

Well Tim, I think this is approximately what some people ARE arguing.
AFAICT, "people" _are_ arguing that "the toolstack" must have knowledge
of and control over all memory allocation.  Since the primary toolstack
is "xl", even though xl does not currently have this knowledge/control
(and, IMHO, never can or should), I think people _are_ arguing:

"xl (or xapi) SHOULDn't need this so we shouldn't do it".

> that would certainly be wrong, but I don't think
> that's the case.  At least I certainly hope not!

I agree that would certainly be wrong, but it seems to be happening
anyway. :-(  Indeed, some are saying that we should disable existing
working functionality (eg. in-guest ballooning) so that the toolstack
CAN have complete knowledge and control.

So let me check, Tim, do you agree that some entity, either the toolstack
or the hypervisor, must have knowledge of and control over all memory
allocation, or the allocation race condition is present?

> The discussion ought to be around the actual problem, which is (as far
> as I can see) that in a system where guests are ballooning without
> limits, VM creation failure can happen after a long delay.  In
> particular it is the delay that is the problem, rather than the failure.
> Some solutions that have been proposed so far:
>  - don't do that, it's silly (possibly true but not helpful);
>  - this reservation hypercall, to pull the failure forward;
>  - make allocation faster to avoid the delay (a good idea anyway,
>    but can it be made fast enough?);
>  - use max_pages or similar to stop other VMs using all of RAM.

Good summary.  So, would you agree that the solution selection
comes down to: "Can max_pages or similar be used effectively to
stop other VMs using all of RAM? If so, who is implementing that?
Else the reservation hypercall is a good solution." ?

> My own position remains that I can live with the reservation hypercall,
> as long as it's properly done - including handling PV 32-bit and PV
> superpage guests.

Tim, would you at least agree that "properly" is a red herring?
Solving 100% of a problem is clearly preferable and I would gladly
change my loyalty to someone else's 100% solution.  But solving 98%*
of a problem while not making the other 2% any worse is not "improper",
just IMHO sensible engineering.

* I'm approximating the total number of PV 32-bit and PV superpage
guests as 2%.  Substitute a different number if you like, but
the number is certainly getting smaller over time, not growing.

Tim, thanks again for your useful input.

Thanks,
Dan


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 15:31       ` Andres Lagar-Cavilla
@ 2013-01-02 21:43         ` Dan Magenheimer
  2013-01-03 16:25           ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-02 21:43 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, Tim Deegan
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Ian Jackson, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> Hello,

Happy New Year, Andres!  (yay, I spelled it right this time! ;)

> On Dec 20, 2012, at 11:04 AM, Tim Deegan <tim@xen.org> wrote:
> 
> > I think the point Dan was trying to make is that if you use page-sharing
> > to do overcommit, you can end up with the same problem that self-balloon
> > has: guest activity might consume all your RAM while you're trying to
> > build a new VM.
> >
> > That could be fixed by a 'further hypervisor change' (constraining the
> > total amount of free memory that CoW unsharing can consume).  I suspect
> > that it can also be resolved by using d->max_pages on each shared-memory
> > VM to put a limit on how much memory they can (severally) consume.
> 
> To be completely clear. I don't think we need a separate allocation/list
> of pages/foo to absorb CoW hits. I think the solution is using d->max_pages.
> Sharing will hit that limit and then send a notification via the "sharing"
> (which is actually an enomem) mem event ring.

And here is the very crux of our disagreement.

You say "I think the solution is using d->max_pages".  Unless
I misunderstand completely, this means your model is what I've
called the "Citrix model" (because Citrix DMC uses it), in which
d->max_pages is dynamically adjusted regularly for each running
guest based on external inferences by (what I have sarcastically
called) a "omniscient toolstack".

In the Oracle model, d->max_pages is a fixed hard limit set when
the guest is launched; only d->curr_pages dynamically varies across
time (e.g. via in-guest selfballooning).

I reject the omniscient toolstack model as unimplementable [1]
and, without it, I think you either do need a separate allocation/list,
with all the issues that entails, or you need the proposed
XENMEM_claim_pages hypercall to resolve memory allocation races
(i.e. vs domain creation).

So, please Andres, assume for a moment you have neither "the
solution using d->max_pages" nor "a separate allocation/list".
IIUC if one uses your implementation of page-sharing when d->max_pages
is permanently fixed, it is impossible for a "CoW hit" to result in
exceeding d->max_pages; and so the _only_ time a CoW hit would
result in a toolstack notification and/or host swapping is if
physical memory in the machine is fully allocated.  True?

Now does it make more sense what I and Konrad (and now Tim)
are trying to point out?

Thanks,
Dan

[1] excerpted from my own email at:
http://lists.xen.org/archives/html/xen-devel/2012-12/msg00107.html 

> The last 4+ years of my life have been built on the fundamental
> assumption that nobody, not even one guest kernel itself,
> can adequately predict when memory usage is going to spike.
> Accurate inference from an external entity across potentially dozens
> of VMs is IMHO.... well... um... unlikely.  I could be wrong
> but I believe, even in academia, there is no realistic research
> solution proposed for this.  (If I'm wrong, please send a pointer.)


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-19 12:53     ` George Dunlap
  2012-12-19 13:48       ` George Dunlap
@ 2013-01-02 21:59       ` Konrad Rzeszutek Wilk
  2013-01-14 18:28         ` George Dunlap
  1 sibling, 1 reply; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-02 21:59 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

. snip..
> >Heh. I hadn't realized that the emails need to conform to
> >the way legal briefs are written in the US :-) Meaning that
> >each topic must be addressed.
> 
> Every time we try to suggest alternatives, Dan goes on some rant
> about how we're on different planets, how we're all old-guard stuck
> in static-land thinking, and how we're focused on single-server use

.. snip..

> than anyone has the time to read and understand, much less respond
> to.  That's why I suggested to Dan that he ask someone else to take
> over the conversation.)

First off, let's leave the characterization of people out of this.
I have great respect for Dan and I am hurt that you would treat him
so cavalierly. But that is your choice, and let's keep this thread to just
a technical discussion.

> 
> >Anyhow, the multi-host env or a single-host env has the same
> >issue - you try to launch multiple guests and some of
> >them might not launch.
> >
> >The changes that Dan is proposing (the claim hypercall)
> >would provide the functionality to fix this problem.
> >
> >>A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host? This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hypercall.
> >Why is this a limitation? Why shouldn't the guest be allowed to change
> >its memory usage? It can go up and down as it sees fit.
> >And if it goes down and it gets better performance - well, why shouldn't
> >it do it?
> >
> >I concur it is odd - but it has been like that for decades.
> 
> Well, it shouldn't be allowed to do it because it causes this
> problem you're having with creating guests in parallel.  Ultimately,
> that is the core of your problem.  So if you want us to solve the
> problem by implementing something in the hypervisor, then you need
> to justify why "Just don't have guests balloon down" is an
> unacceptable option.  Saying "why shouldn't it", and "it's been that
> way for decades*" isn't a good enough reason.

We find the balloon usage very flexible and see no problems with it.

.. snip..

> >>>What about the toolstack side?  First, it's important to note that
> >>>the toolstack changes are entirely optional.  If any toolstack
> >>>wishes either to not fix the original problem, or avoid toolstack-
> >>>unaware allocation completely by ignoring the functionality provided
> >>>by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
> >>>not use the new hypercall.
> >>You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.
> >The one max_pages check that comes to my mind is the one that Xapi
> >uses. That is, it has a daemon that sets the max_pages of all the
> >guests at some value so that it can squeeze in as many guests as
> >possible. It also balloons pages out of a guest to make space if
> >it needs to launch one. The heuristic for how many pages, or the ratio
> >of max/min, looks to be proportional (so to make space for 1GB
> >for a guest, and say we have 10 guests, we will subtract
> >101MB from each guest - the extra 1MB is for extra overhead).
> >This depends on one hypercall that the 'xl' or 'xm' toolstacks do not
> >use - the one which sets max_pages.
> >
> >That code makes certain assumptions - that the guest will not go up/down
> >in the ballooning once the toolstack has decreed how much
> >memory the guest should use. It also assumes that the operations
> >are semi-atomic - and to make it so as much as it can - it executes
> >these operations in serial.
> 
> No, the xapi code does no such assumptions.  After it tells a guest
> to balloon down, it watches to see  what actually happens, and has
> heuristics to deal with "non-cooperative guests".  It does assume
> that if it sets max_pages lower than or equal to the current amount
> of used memory, that the hypervisor will not allow the guest to
> balloon up -- but that's a pretty safe assumption.  A guest can
> balloon down if it wants to, but as xapi does not consider that
> memory free, it will never use it.

Thanks for the clarification. I am not that fluent in the
OCaml code.

> 
> BTW, I don't know if you realize this: Originally Xen would return
> an error if you tried to set max_pages below tot_pages.  But as a
> result of the DMC work, it was seen as useful to allow the toolstack
> to tell the hypervisor once, "Once the VM has ballooned down to X,
> don't let it balloon up above X anymore."
> 
> >This goes back to the problem statement - if we try to parallelize
> >this we run into the problem that the amount of memory we thought
> >was free is not true anymore. The start of this email has a good
> >description of some of the issues.
> >
> >In essence, the max_pages does work - _if_ one does these operations
> >in serial. We are trying to make this work in parallel and without
> >any failures - for that, one way that is quite simplistic
> >is the claim hypercall. It sets up a 'stake' of the amount of
> >memory that the hypervisor should reserve. This way other
> >guest creations/ballooning do not infringe on the 'claimed' amount.
> 
> I'm not sure what you mean by "do these operations in serial" in
> this context.  Each of your "reservation hypercalls" has to happen
> in serial.  If we had a user-space daemon that was in charge of
> freeing up or reserving memory, each request to that daemon would
> happen in serial as well.  But once the allocation / reservation
> happened, the domain builds could happen in parallel.
> 
> >I believe with this hypercall the Xapi can be made to do its operations
> >in parallel as well.
> 
> xapi can already boot guests in parallel when there's enough memory
> to do so -- what operations did you have in mind?

That - the booting. My understanding (wrongly) was that it did it
in serial.
> 
> I haven't followed all of the discussion (for reasons mentioned
> above), but I think the alternative to Dan's solution is something
> like below.  Maybe you can tell me why it's not very suitable:
> 
> Have one place in the user-space -- either in the toolstack, or a
> separate daemon -- that is responsible for knowing all the places
> where memory might be in use.  Memory can be in use either by Xen,
> or by one of several VMs, or in a tmem pool.
> 
> In your case, when not creating VMs, it can remove all limitations
> -- allow the guests or tmem to grow or shrink as much as they want.

We don't have those limitations right now.
> 
> When a request comes in for a certain amount of memory, it will go
> and set each VM's max_pages, and the max tmem pool size.  It can
> then check whether there is enough free memory to complete the
> allocation or not (since there's a race between checking how much
> memory a guest is using and setting max_pages).  If that succeeds,
> it can return "success".  If, while that VM is being built, another
> request comes in, it can again go around and set the max sizes
> lower.  It has to know how much of the memory is "reserved" for the
> first guest being built, but if there's enough left after that, it
> can return "success" and allow the second VM to start being built.
> 
> After the VMs are built, the toolstack can remove the limits again
> if it wants, again allowing the free flow of memory.

This sounds to me like what Xapi does?
> 
> Do you see any problems with this scheme?  All it requires is for
> the toolstack to be able to temporarily set limits on both guests
> ballooning up and on tmem allocating more than a certain amount of
> memory.  We already have mechanisms for the first, so if we had a
> "max_pages" for tmem, then you'd have all the tools you need to
> implement it.

Off the top of my head, the things that come to mind are:
 - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
   looks to solve the launching of guests in parallel.
   It will allow us to launch multiple guests - but it will also
   mean suppressing the tmem asynchronous calls and having to balloon
   the guests up/down. The claim hypercall does not do any of that and
   gives a definite 'yes' or 'no'.

 - Complex code that has to keep track of this in user-space.
   It also has to know of the extra 'reserved' space that is associated
   with a guest. I am not entirely sure how that would couple with
   PCI passthrough. The claim hypercall is fairly simple - albeit
   extending it to handle superpages and 32-bit PV guests could make it
   longer.

 - I am not sure whether the toolstack can manage all the memory
   allocation. It sounds like it could, but I am just wondering if there
   are some extra corners that we hadn't thought of.

 - Latency. With the locks being placed on the pools of memory, the
   existing workload can be negatively affected. Say that this means we
   need to balloon down a couple hundred guests, then launch the new
   guest. This process of 'lower all of them by X', then 'let's check the
   free amount' - oh no, not enough, let's do this again - would
   delay the creation process.

   The claim hypercall will avoid all of that by just declaring:
   "This is how much you will get." without having to balloon the rest
   of the guests.

   Here is how I see what your toolstack would do:

     [serial]
	1). Figure out how much memory we need for X guests.
	2). Round-robin existing guests to decrease their memory
	    consumption (if they can be ballooned down). Or this
	    can be executed in parallel for the guests.
	3). Check if the amount of free memory is at least X
	    [this check has to be done in serial].
     [parallel]
	4). Launch multiple guests at the same time.

   The claim hypercall would avoid the '3' part b/c it is inherently
   part of Xen's MM bureaucracy. It would allow:

     [parallel]
	1). Claim hypercall for each of the X guests.
	2). If any of the claims return 0 (so success), then launch that guest.
	3). If the errno was -ENOMEM then:
     [serial]
        3a). round-robin existing guests to decrease their memory
             consumption if allowed. Goto 1).

   So the 'error-case' only has to run in the slow-serial case (a rough
   C sketch of this loop follows at the end of this list).

 - This still has the race issue - how much memory you see vs the
   moment you launch it. Granted you can avoid it by having a "fudge"
   factor (so when a guest says it wants 1G you know it actually
   needs an extra 100MB on top of the 1GB or so). The claim hypercall
   would count all of that for you so you don't have to race.
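
   A rough C sketch, for illustration only, of the claim-based loop above.
   The claim_pages(), build_domain() and balloon_down_existing_guests()
   helpers are hypothetical toolstack functions standing in for the
   proposed hypercall and the existing balloon path; they are not real
   libxc/libxl calls.

	#include <errno.h>

	/* Hypothetical toolstack helpers -- placeholders, not real APIs. */
	extern int claim_pages(unsigned int domid, unsigned long nr_pages);   /* the proposed claim */
	extern int build_domain(unsigned int domid);                          /* normal domain build */
	extern int balloon_down_existing_guests(unsigned long nr_pages);      /* serial "make room" path */

	/* Claim up front; only the -ENOMEM case falls back to the slow,
	 * serialized balloon-down-and-retry loop. */
	int launch_with_claim(unsigned int domid, unsigned long nr_pages)
	{
	    for (;;) {
	        int rc = claim_pages(domid, nr_pages);
	        if (rc == 0)
	            return build_domain(domid);   /* claim held; builds can run in parallel */
	        if (rc != -ENOMEM)
	            return rc;                    /* unexpected failure */
	        if (balloon_down_existing_guests(nr_pages) < 0)
	            return -ENOMEM;               /* no room could be made */
	    }
	}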


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 21:38       ` Dan Magenheimer
@ 2013-01-03 16:24         ` Andres Lagar-Cavilla
  2013-01-03 18:33           ` Dan Magenheimer
  2013-01-10 17:13         ` Tim Deegan
  1 sibling, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-03 16:24 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Tim Deegan,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Jan 2, 2013, at 4:38 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

>> From: Tim Deegan [mailto:tim@xen.org]
>> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
>> solutions
>> 
>> Hi,
> 
> Happy New Year Tim, and thanks for trying to add some clarity to the
> discussion.
> 
>> The question of starting VMs in parallel seems like a red herring to me:
>> - TTBOMK Xapi already can start VMs in parallel.  Since it knows what
>>  constraints it's placed on existing VMs and what VMs it's currently
>>  building, there is nothing stopping it.  Indeed, AFAICS any toolstack
>>  that can guarantee enough RAM to build one VM at a time could do the
>>  same for multiple parallel builds with a bit of bookkeeping.
>> - Dan's stated problem (failure during VM build in the presence of
>>  unconstrained guest-controlled allocations) happens even if there is
>>  only one VM being created.
> 
> Agreed.  The parallel VM discussion was simply trying to point out
> that races can occur even without guest-controlled allocations,
> so is distracting from the actual issue (which is, according to
> wikipedia, one of the definitions of "red herring").
> 
> (As an aside, your use of the word "unconstrained" is a red herring. ;-)
> 
>>>>> Andres Lagar-Cavilla says "... this is because of shortcomings in the
>>>>> [Xen] mm layer and its interaction with wait queues, documented
>>>>> elsewhere."  In other words, this batching proposal requires
>>>>> significant changes to the hypervisor, which I think we
>>>>> all agreed we were trying to avoid.
>>>> 
>>>> Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't
>> use Citrix's DMC, nor is focused only on proprietary kernels...
>>> 
>>> I believe Dan is saying is that it is not enabled by default.
>>> Meaning it does not get executed in by /etc/init.d/xencommons and
>>> as such it never gets run (or does it now?) - unless one knows
>>> about it - or it is enabled by default in a product. But perhaps
>>> we are both mistaken? Is it enabled by default now on xen-unstable?
>> 
>> I think the point Dan was trying to make is that if you use page-sharing
>> to do overcommit, you can end up with the same problem that self-balloon
>> has: guest activity might consume all your RAM while you're trying to
>> build a new VM.
>> 
>> That could be fixed by a 'further hypervisor change' (constraining the
>> total amount of free memory that CoW unsharing can consume).  I suspect
>> that it can also be resolved by using d->max_pages on each shared-memory
>> VM to put a limit on how much memory they can (severally) consume.
> 
> (I will respond to this in the context of Andres' response shortly...)
> 
>>> Just as a summary as this is getting to be a long thread - my
>>> understanding has been that the hypervisor is suppose to toolstack
>>> independent.
>> 
>> Let's keep calm.  If people were arguing "xl (or xapi) doesn't need this
>> so we shouldn't do it"
> 
> Well Tim, I think this is approximately what some people ARE arguing.
> AFAICT, "people" _are_ arguing that "the toolstack" must have knowledge
> of and control over all memory allocation.  Since the primary toolstack
> is "xl", even though xl does not currently have this knowledge/control
> (and, IMHO, never can or should), I think people _are_ arguing:
> 
> "xl (or xapi) SHOULDn't need this so we shouldn't do it".
> 
>> that would certainly be wrong, but I don't think
>> that's the case.  At least I certainly hope not!
> 
> I agree that would certainly be wrong, but it seems to be happening
> anyway. :-(  Indeed, some are saying that we should disable existing
> working functionality (eg. in-guest ballooning) so that the toolstack
> CAN have complete knowledge and control.

If you refer to my opinion on the bizarre-ness of the balloon, what you say is not at all what I mean. Note that I took great care to not break balloon functionality in the face of paging or sharing, and vice-versa.

Andres
> 
> So let me check, Tim, do you agree that some entity, either the toolstack
> or the hypervisor, must have knowledge of and control over all memory
> allocation, or the allocation race condition is present?
> 
>> The discussion ought to be around the actual problem, which is (as far
>> as I can see) that in a system where guests are ballooning without
>> limits, VM creation failure can happen after a long delay.  In
>> particular it is the delay that is the problem, rather than the failure.
>> Some solutions that have been proposed so far:
>> - don't do that, it's silly (possibly true but not helpful);
>> - this reservation hypercall, to pull the failure forward;
>> - make allocation faster to avoid the delay (a good idea anyway,
>>   but can it be made fast enough?);
>> - use max_pages or similar to stop other VMs using all of RAM.
> 
> Good summary.  So, would you agree that the solution selection
> comes down to: "Can max_pages or similar be used effectively to
> stop other VMs using all of RAM? If so, who is implementing that?
> Else the reservation hypercall is a good solution." ?
> 
>> My own position remains that I can live with the reservation hypercall,
>> as long as it's properly done - including handling PV 32-bit and PV
>> superpage guests.
> 
> Tim, would you at least agree that "properly" is a red herring?
> Solving 100% of a problem is clearly preferable and I would gladly
> change my loyalty to someone else's 100% solution.  But solving 98%*
> of a problem while not making the other 2% any worse is not "improper",
> just IMHO sensible engineering.
> 
> * I'm approximating the total number of PV 32-bit and PV superpage
> guests as 2%.  Substitute a different number if you like, but
> the number is certainly getting smaller over time, not growing.
> 
> Tim, thanks again for your useful input.
> 
> Thanks,
> Dan
> 
> 


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 21:43         ` Dan Magenheimer
@ 2013-01-03 16:25           ` Andres Lagar-Cavilla
  2013-01-03 18:49             ` Dan Magenheimer
  0 siblings, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-03 16:25 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Tim Deegan,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Jan 2, 2013, at 4:43 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
>> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
>> solutions
>> 
>> Hello,
> 
> Happy New Year, Andres!  (yay, I spelled it right this time! ;)

Heh, cheers!

> 
>> On Dec 20, 2012, at 11:04 AM, Tim Deegan <tim@xen.org> wrote:
>> 
>>> I think the point Dan was trying to make is that if you use page-sharing
>>> to do overcommit, you can end up with the same problem that self-balloon
>>> has: guest activity might consume all your RAM while you're trying to
>>> build a new VM.
>>> 
>>> That could be fixed by a 'further hypervisor change' (constraining the
>>> total amount of free memory that CoW unsharing can consume).  I suspect
>>> that it can also be resolved by using d->max_pages on each shared-memory
>>> VM to put a limit on how much memory they can (severally) consume.
>> 
>> To be completely clear. I don't think we need a separate allocation/list
>> of pages/foo to absorb CoW hits. I think the solution is using d->max_pages.
>> Sharing will hit that limit and then send a notification via the "sharing"
>> (which is actually an enomem) mem event ring.
> 
> And here is the very crux of our disagreement.
> 
> You say "I think the solution is using d->max_pages".  Unless
> I misunderstand completely, this means your model is what I've
> called the "Citrix model" (because Citrix DMC uses it), in which
> d->max_pages is dynamically adjusted regularly for each running
> guest based on external inferences by (what I have sarcastically
> called) a "omniscient toolstack".
> 
> In the Oracle model, d->max_pages is a fixed hard limit set when
> the guest is launched; only d->curr_pages dynamically varies across
> time (e.g. via in-guest self ballooning).
> 
> I reject the omniscient toolstack model as unimplementable [1]
> and, without it, I think you either do need a separate allocation/list,
> with all the issues that entails, or you need the proposed
> XENMEM_claim_pages hypercall to resolve memory allocation races
> (i.e. vs domain creation).

That pretty much ends the discussion. If you ask me below to reason within the constraints your rejection places, then that's artificial reasoning. Your rejection seems to stem from philosophical reasons, rather than technical limitations.

Look, your hypercall doesn't kill kittens, so that's about as far as I will go in this discussion.

My purpose here was to a) dispel misconceptions about sharing and b) see if something better comes out of a discussion between all interested mm parties. I'm satisfied insofar as a) goes.

Thanks
Andres
> 
> So, please Andres, assume for a moment you have neither "the
> solution using d->max_pages" nor "a separate allocation/list".
> IIUC if one uses your implementation of page-sharing when d->max_pages
> is permanently fixed, it is impossible for a "CoW hit" to result in
> exceeding d->max_pages; and so the _only_ time a CoW hit would
> result in a toolstack notification and/or host swapping is if
> physical memory in the machine is fully allocated.  True?
> 
> Now does it make more sense what I and Konrad (and now Tim)
> are trying to point out?
> 
> Thanks,
> Dan
> 
> [1] excerpted from my own email at:
> http://lists.xen.org/archives/html/xen-devel/2012-12/msg00107.html 
> 
>> The last 4+ years of my life have been built on the fundamental
>> assumption that nobody, not even one guest kernel itself,
>> can adequately predict when memory usage is going to spike.
>> Accurate inference from an external entity across potentially dozens
>> of VMs is IMHO.... well... um... unlikely.  I could be wrong
>> but I believe, even in academia, there is no realistic research
>> solution proposed for this.  (If I'm wrong, please send a pointer.)


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-03 16:24         ` Andres Lagar-Cavilla
@ 2013-01-03 18:33           ` Dan Magenheimer
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-03 18:33 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Tim Deegan, Ian Jackson, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> >>> Just as a summary as this is getting to be a long thread - my
> >>> understanding has been that the hypervisor is suppose to toolstack
> >>> independent.
> >>
> >> Let's keep calm.  If people were arguing "xl (or xapi) doesn't need this
> >> so we shouldn't do it"
> >
> > Well Tim, I think this is approximately what some people ARE arguing.
> > AFAICT, "people" _are_ arguing that "the toolstack" must have knowledge
> > of and control over all memory allocation.  Since the primary toolstack
> > is "xl", even though xl does not currently have this knowledge/control
> > (and, IMHO, never can or should), I think people _are_ arguing:
> >
> > "xl (or xapi) SHOULDn't need this so we shouldn't do it".
> >
> >> that would certainly be wrong, but I don't think
> >> that's the case.  At least I certainly hope not!
> >
> > I agree that would certainly be wrong, but it seems to be happening
> > anyway. :-(  Indeed, some are saying that we should disable existing
> > working functionality (eg. in-guest ballooning) so that the toolstack
> > CAN have complete knowledge and control.
> 
> If you refer to my opinion on the bizarre-ness of the balloon, what you say is not at all what I mean.
> Note that I took great care to not break balloon functionality in the face of paging or sharing, and
> vice-versa.
> 
> Andres

And just to be clear, no, Andres, I was referring to George's statement
in http://lists.xen.org/archives/html/xen-devel/2012-12/msg01492.html 
where he says about a guest kernel doing ballooning:

"Well, it shouldn't be allowed to do it..."

I appreciate your great care to ensure backwards compatibility
and fully agree that both these functionalities (ballooning
and paging/sharing) are useful and valuable for significant
segments of the Xen customer base.  And for some smaller segment
they may need to safely co-exist and even, in the future, interact.

So... peace?

Dan


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-03 16:25           ` Andres Lagar-Cavilla
@ 2013-01-03 18:49             ` Dan Magenheimer
  2013-01-07 14:43               ` Ian Campbell
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-03 18:49 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Ian Jackson, Tim Deegan, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]

> On Jan 2, 2013, at 4:43 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > I reject the omniscient toolstack model as unimplementable [1]
> > and, without it, I think you either do need a separate allocation/list,
> > with all the issues that entails, or you need the proposed
> > XENMEM_claim_pages hypercall to resolve memory allocation races
> > (i.e. vs domain creation).
> 
> That pretty much ends the discussion. If you ask me below to reason within the constraints your
> rejection places, then that's artificial reasoning. Your rejection seems to stem from philosophical
> reasons, rather than technical limitations.

Well, perhaps my statement is a bit heavy-handed, but I don't see
how it ends the discussion... you simply need to prove my statement
incorrect! ;-)  To me, that would mean pointing out any existing
implementation or even university research that successfully
predicts or externally infers future memory demand for guests.
(That's a good approximation of my definition of an omniscient
toolstack.)

But let's save that for another time or thread.
 
> Look, your hypercall doesn't kill kittens, so that's about as far as I will go in this discussion.

Noted.  I will look at adding kitten-killing functionality
in the next revision. ;-)

> My purpose here was to a) dispel misconceptions about sharing b) see if something better comes out
> from a discussion between all interested mm parties. I'm satisfied insofar a).

At some point I hope to understand paging/sharing more completely,
and I apologize, Andres, if I have cast aspersions on its/your
implementation; I was simply trying to use it as another
example of an in-hypervisor page allocation that is not
directly under the control of the toolstack.
and agree that IF the toolstack is capable of intelligently
managing d->max_pages across all domains, then your model
for handling CoW hits will be sufficient.

So... again... peace?
Dan


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2012-12-19 13:48       ` George Dunlap
@ 2013-01-03 20:38         ` Dan Magenheimer
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-03 20:38 UTC (permalink / raw)
  To: George Dunlap, Konrad Rzeszutek Wilk
  Cc: Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Ian Jackson, Tim (Xen.org),
	lars.kurth, Jan Beulich, xen-devel

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Wednesday, December 19, 2012 6:49 AM
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions

George --

Your public personal attacks are hurtful and unprofessional, not to mention
inaccurate.  While I have tried to interpret them is if they are simply
banter or even sarcasm, they border on defamatory.  If we worked for the
same company, I would have already filed a complaint with HR and spoken
bluntly to your manager.

So, now, can we please focus on the technical discussion? **

Let me attempt to briefly summarize your position to see if
I understand it from your last email.  Your position is:

1) Certain existing Xen page allocation mechanisms that occur without
   the knowledge of the toolstack should be permanently disabled,
   regardless of backwards compatibility; and
2) All memory allocations for all future Xen functionality should
   be done only with the express permission of the toolstack; and
3) The toolstack should intelligently and dynamically adjust d->max_pages
   for all domains to match current and predict future memory demand for
   each domain; and
4) It is reasonable to expect and demand that ALL Xen implementations
   and toolstacks must conform to (2) and (3)
As a result, the proposed XENMEM_claim_pages hypercall is not needed.

So, George, you believe that (1) through (4) are the proper way forward
for the Xen community and the hypercall should be rejected.

Is that correct?  If not, please briefly clarify.  And, if it is
correct, I have a number of questions.

Now, George, would you like to attempt to briefly summarize my
position?

Dan

** It is clear to me, and hopefully is to others, that this is not
a discussion about how to fix a bug; it is a discussion about a
fundamental Xen architectural principle, namely where in the Xen
stack should memory be managed and controlled.  Two different Xen
vendors have based product decisions on different assumptions and
opinions colored perhaps in part by the demands of differing customer
bases (i.e. open source guests vs proprietary guests).  The resolution
of this discussion needs to be either: (1) one vendor is "right" and the
other must conform, or (2) both are "right" and the assumptions must
be allowed to co-exist.  I've intentionally added Lars to the cc list
in case this issue should be escalated within xen.org.


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-03 18:49             ` Dan Magenheimer
@ 2013-01-07 14:43               ` Ian Campbell
  2013-01-07 18:41                 ` Dan Magenheimer
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Campbell @ 2013-01-07 14:43 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> > From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> 
> > On Jan 2, 2013, at 4:43 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > > I reject the omniscient toolstack model as unimplementable [1]
> > > and, without it, I think you either do need a separate allocation/list,
> > > with all the issues that entails, or you need the proposed
> > > XENMEM_claim_pages hypercall to resolve memory allocation races
> > > (i.e. vs domain creation).
> > 
> > That pretty much ends the discussion. If you ask me below to reason within the constraints your
> > rejection places, then that's artificial reasoning. Your rejection seems to stem from philosophical
> > reasons, rather than technical limitations.
> 
> Well, perhaps my statement is a bit heavy-handed, but I don't see
> how it ends the discussion... you simply need to prove my statement
> incorrect! ;-)  To me, that would mean pointing out any existing
> implementation or even university research that successfully
> predicts or externally infers future memory demand for guests.
> (That's a good approximation of my definition of an omniscient
> toolstack.)

I don't think a solution involving massaging of tot_pages needs to involve
either frequent changes to tot_pages or omniscience from the
toolstack.

Start by separating the lifetime_maxmem from current_maxmem. The
lifetime_maxmem is internal to the toolstack (it is effectively your
tot_pages from today) and current_maxmem becomes whatever the toolstack
has actually pushed down into tot_pages at any given time.

In the normal steady state lifetime_maxmem == current_maxmem.

When you want to claim some memory in order to start a new domain of
size M you *temporarily* reduce current_maxmem for some set of domains
on the chosen host and arrange that the total of all the current_maxmems
on the host is such that "HOST_MEM - SUM(current_maxmems) > M".

Once the toolstack has built (or failed to build) the domain it can set
all the current_maxmems back to their lifetime_maxmem values.

If you want to build multiple domains in parallel then M just becomes
the sum over all the domains currently being built.
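
For illustration, a rough C sketch of the scheme just described. The
struct and the set_max_pages()/host_mem_pages()/sum_current_maxmems()
helpers are hypothetical stand-ins for whatever the toolstack would
really use (something like xc_domain_setmaxmem() plus a physinfo query),
and the "shave the shortfall evenly" policy is only an example.

	#include <errno.h>
	#include <stddef.h>

	struct dominfo {
	    unsigned int id;
	    unsigned long lifetime_maxmem;   /* pages; internal to the toolstack */
	    unsigned long current_maxmem;    /* pages; what gets pushed into d->max_pages */
	};

	/* Hypothetical helpers -- placeholders, not real libxc calls. */
	extern int set_max_pages(unsigned int domid, unsigned long pages);
	extern unsigned long host_mem_pages(void);
	extern unsigned long sum_current_maxmems(struct dominfo *doms, size_t n);

	/* Temporarily lower current_maxmem across existing domains until
	 * HOST_MEM - SUM(current_maxmems) > M, build, then restore. */
	int reserve_then_build(struct dominfo *doms, size_t n, unsigned long m,
	                       int (*build_domain)(void))
	{
	    unsigned long host = host_mem_pages();
	    unsigned long sum = sum_current_maxmems(doms, n);
	    size_t i;
	    int rc;

	    if (sum + m > host) {
	        unsigned long cut;

	        if (n == 0)
	            return -ENOMEM;          /* nothing to shrink */
	        /* Shave the shortfall evenly; a real toolstack would pick a
	         * smarter policy and guard against underflow. */
	        cut = (sum + m - host) / n + 1;
	        for (i = 0; i < n; i++) {
	            doms[i].current_maxmem -= cut;
	            if (set_max_pages(doms[i].id, doms[i].current_maxmem))
	                return -EIO;
	        }
	    }

	    rc = build_domain();    /* existing guests cannot grow into the reserved M */

	    for (i = 0; i < n; i++) {        /* back to the steady state */
	        doms[i].current_maxmem = doms[i].lifetime_maxmem;
	        set_max_pages(doms[i].id, doms[i].current_maxmem);
	    }
	    return rc;
	}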

Ian.


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-07 14:43               ` Ian Campbell
@ 2013-01-07 18:41                 ` Dan Magenheimer
  2013-01-08  9:03                   ` Ian Campbell
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-07 18:41 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> 
> On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> >
> > Well, perhaps my statement is a bit heavy-handed, but I don't see
> > how it ends the discussion... you simply need to prove my statement
> > incorrect! ;-)  To me, that would mean pointing out any existing
> > implementation or even university research that successfully
> > predicts or externally infers future memory demand for guests.
> > (That's a good approximation of my definition of an omniscient
> > toolstack.)
> 
> I don't think a solution involving massaging of tot_pages need involve
> either frequent changes to tot_pages nor omniscience from the tool
> stack.
> 
> Start by separating the lifetime_maxmem from current_maxmem. The
> lifetime_maxmem is internal to the toolstack (it is effectively your
> tot_pages from today) and current_maxmem becomes whatever the toolstack
> has actually pushed down into tot_pages at any given time.
> 
> In the normal steady state lifetime_maxmem == current_maxmem.
> 
> When you want to claim some memory in order to start a new domain of
> size M you *temporarily* reduce current_maxmem for some set of domains
> on the chosen host and arrange that the total of all the current_maxmems
> on the host is such that "HOST_MEM - SUM(current_maxmems) > M".
> 
> Once the toolstack has built (or failed to build) the domain it can set
> all the current_maxmems back to their lifetime_maxmem values.
> 
> If you want to build multiple domains in parallel then M just becomes
> the sum over all the domains currently being built.

Hi Ian --

Happy New Year!

Perhaps you are missing an important point that is leading
you to oversimplify and draw conclusions based on that
oversimplification...

We are _primarily_ discussing the case where physical RAM is
overcommitted, or to use your terminology IIUC:

   SUM(lifetime_maxmem) > HOST_MEM

Thus:

> In the normal steady state lifetime_maxmem == current_maxmem.

is a flawed assumption, except perhaps as an initial condition
or in systems where RAM is almost never a bottleneck.

Without that assumption, in your model, the toolstack must
make intelligent policy decisions about how to vary
current_maxmem relative to lifetime_maxmem, across all the
domains on the system.  Since the memory demands of any domain
often vary frequently, dramatically and unpredictably (i.e.
"spike") and since the performance consequences of inadequate
memory can be dire (i.e. "swap storm"), that is why I say the
toolstack (in your model) must both make frequent changes
to tot_pages and "be omniscient".

FWIW, I fully acknowledge that your model works fine when
there are no memory overcommitment technologies active.
I also acknowledge that your model is the best that can
be expected with legacy proprietary domains.  The Oracle
model however assumes both that RAM is frequently a bottleneck,
and that open-source guest kernels can intelligently participate
in optimizing their own memory usage; such guest kernels are
now shipping.

So, Ian, would you please acknowledge that the Oracle model
is valid and, in such cases where your maxmem assumption
is incorrect, that hypervisor-controlled capacity allocation
(i.e. XENMEM_claim_pages) is an acceptable solution?

Thanks,
Dan


* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-07 18:41                 ` Dan Magenheimer
@ 2013-01-08  9:03                   ` Ian Campbell
  2013-01-08 19:41                     ` Dan Magenheimer
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Campbell @ 2013-01-08  9:03 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Mon, 2013-01-07 at 18:41 +0000, Dan Magenheimer wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > 
> > On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> > >
> > > Well, perhaps my statement is a bit heavy-handed, but I don't see
> > > how it ends the discussion... you simply need to prove my statement
> > > incorrect! ;-)  To me, that would mean pointing out any existing
> > > implementation or even university research that successfully
> > > predicts or externally infers future memory demand for guests.
> > > (That's a good approximation of my definition of an omniscient
> > > toolstack.)
> > 
> > I don't think a solution involving massaging of tot_pages need involve
> > either frequent changes to tot_pages nor omniscience from the tool
> > stack.
> > 
> > Start by separating the lifetime_maxmem from current_maxmem. The
> > lifetime_maxmem is internal to the toolstack (it is effectively your
> > tot_pages from today) and current_maxmem becomes whatever the toolstack
> > has actually pushed down into tot_pages at any given time.
> > 
> > In the normal steady state lifetime_maxmem == current_maxmem.
> > 
> > When you want to claim some memory in order to start a new domain of
> > size M you *temporarily* reduce current_maxmem for some set of domains
> > on the chosen host and arrange that the total of all the current_maxmems
> > on the host is such that "HOST_MEM - SUM(current_maxmems) > M".
> > 
> > Once the toolstack has built (or failed to build) the domain it can set
> > all the current_maxmems back to their lifetime_maxmem values.
> > 
> > If you want to build multiple domains in parallel then M just becomes
> > the sum over all the domains currently being built.
> 
> Hi Ian --
> 
> Happy New Year!
> 
> Perhaps you are missing an important point that is leading
> you to oversimplify and draw conclusions based on that
> oversimplification...
> 
> We are _primarily_ discussing the case where physical RAM is
> overcommitted, or to use your terminology IIUC:
> 
>    SUM(lifetime_maxmem) > HOST_MEM

I understand this perfectly well.

> Thus:
> 
> > In the normal steady state lifetime_maxmem == current_maxmem.
> 
> is a flawed assumption, except perhaps as an initial condition
> or in systems where RAM is almost never a bottleneck.

I see that I have incorrectly (but it seems at least consistently) said
"d->tot_pages" where I meant d->max_pages. This was no doubt extremely
confusing and does indeed render the scheme unworkable. Sorry.

AIUI you currently set d->max_pages == lifetime_maxmem. In the steady
state, therefore, current_maxmem == lifetime_maxmem == d->max_pages, and
nothing changes compared with how things are for you today.

In the case where you are claiming some memory you change only max_pages
(and not tot_pages as I incorrectly stated before; tot_pages can
continue to vary dynamically, albeit with reduced range). So
d->max_pages == current_maxmem, which is derived as I described previously
(managing to keep my tot and max straight for once):

        When you want to claim some memory in order to start a new
        domain of size M you *temporarily* reduce current_maxmem for
        some set of domains on the chosen host and arrange that the
        total of all the current_maxmems on the host is such that
        "HOST_MEM - SUM(current_maxmems) > M".

I hope that clarifies what I was suggesting.

> Without that assumption, in your model, the toolstack must
> make intelligent policy decisions about how to vary
> current_maxmem relative to lifetime_maxmem, across all the
> domains on the system.  Since the memory demands of any domain
> often vary frequently, dramatically and unpredictably (i.e.
> "spike") and since the performance consequences of inadequate
> memory can be dire (i.e. "swap storm"), that is why I say the
> toolstack (in your model) must both make frequent changes
> to tot_pages and "be omniscient".

Agreed, I was mistaken in saying tot_pages where I meant max_pages.

My intention was to describe a scheme where max_pages would change only
a) when you start building a new domain and b) when you finish building
a domain. There should be no need to make adjustments between those
events.

The inputs into the calculations are lifetime_maxmems for all domains,
the current number of domains in the system, the initial allocation of
any domain(s) currently being built (AKA the current claim) and the
total physical RAM present in the host. AIUI all of those are either
static or dynamic but only actually changing when new domains are
introduced/removed (or otherwise only changing infrequently).
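
To make that concrete, here is a rough sketch of the whole sequence in
Python-style pseudocode.  All the helper names (set_max_pages,
build_domain) are invented placeholders for whatever would actually
drive XEN_DOMCTL_max_mem and the domain builder, so this shows only the
shape of the scheme, not an implementation:

    def claim_via_maxmem(domains, lifetime_maxmem, host_mem, m_pages,
                         set_max_pages, build_domain):
        # Derive temporary caps so that HOST_MEM - SUM(current_maxmems) > M.
        # Simplest policy: scale every lifetime_maxmem down proportionally;
        # smarter selections are of course possible.
        budget = host_mem - m_pages
        total = sum(lifetime_maxmem[d] for d in domains)
        scale = min(1.0, float(budget) / total) if total else 1.0
        current_maxmem = {d: int(lifetime_maxmem[d] * scale) for d in domains}

        # (a) Push the temporary caps into d->max_pages when the build starts.
        for d in domains:
            set_max_pages(d, current_maxmem[d])
        try:
            build_domain(m_pages)
        finally:
            # (b) Restore the lifetime values once the build succeeds or fails.
            for d in domains:
                set_max_pages(d, lifetime_maxmem[d])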

> So, Ian, would you please acknowledge that the Oracle model
> is valid and, in such cases where your maxmem assumption
> is incorrect, that hypervisor-controlled capacity allocation
> (i.e. XENMEM_claim_pages) is an acceptable solution?

I have no problem with the validity of the Oracle model. I don't think
we have reached the consensus that the hypervisor-controlled capacity
allocation is the only possible solution, or the preferable solution
from the PoV of the hypervisor maintainers. In that sense it is
"unacceptable" because things which can be done outside the hypervisor
should be and so I cannot acknowledge what you ask.

Apologies again for my incorrect use of tot_pages, which has led to this
confusion.

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-08  9:03                   ` Ian Campbell
@ 2013-01-08 19:41                     ` Dan Magenheimer
  2013-01-09 10:41                       ` Ian Campbell
  2013-01-10 10:31                       ` Ian Campbell
  0 siblings, 2 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-08 19:41 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Tuesday, January 08, 2013 2:03 AM
> To: Dan Magenheimer
> Cc: Andres Lagar-Cavilla; Tim (Xen.org); Konrad Rzeszutek Wilk; xen-devel@lists.xen.org; Keir
> (Xen.org); George Dunlap; Ian Jackson; Jan Beulich
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> On Mon, 2013-01-07 at 18:41 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > >
> > > On Thu, 2013-01-03 at 18:49 +0000, Dan Magenheimer wrote:
> > > >
> > > > Well, perhaps my statement is a bit heavy-handed, but I don't see
> > > > how it ends the discussion... you simply need to prove my statement
> > > > incorrect! ;-)  To me, that would mean pointing out any existing
> > > > implementation or even university research that successfully
> > > > predicts or externally infers future memory demand for guests.
> > > > (That's a good approximation of my definition of an omniscient
> > > > toolstack.)
> > >
> > > I don't think a solution involving massaging of tot_pages need involve
> > > either frequent changes to tot_pages nor omniscience from the tool
> > > stack.
> > >
> > > Start by separating the lifetime_maxmem from current_maxmem. The
> > > lifetime_maxmem is internal to the toolstack (it is effectively your
> > > tot_pages from today) and current_maxmem becomes whatever the toolstack
> > > has actually pushed down into tot_pages at any given time.
> > >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> > >
> > > When you want to claim some memory in order to start a new domain of
> > > size M you *temporarily* reduce current_maxmem for some set of domains
> > > on the chosen host and arrange that the total of all the current_maxmems
> > > on the host is such that "HOST_MEM - SUM(current_maxmems) > M".
> > >
> > > Once the toolstack has built (or failed to build) the domain it can set
> > > all the current_maxmems back to their lifetime_maxmem values.
> > >
> > > If you want to build multiple domains in parallel then M just becomes
> > > the sum over all the domains currently being built.
> >
> > Hi Ian --
> >
> > Happy New Year!
> >
> > Perhaps you are missing an important point that is leading
> > you to oversimplify and draw conclusions based on that
> > oversimplification...
> >
> > We are _primarily_ discussing the case where physical RAM is
> > overcommitted, or to use your terminology IIUC:
> >
> >    SUM(lifetime_maxmem) > HOST_MEM
> 
> I understand this perfectly well.
> 
> > Thus:
> >
> > > In the normal steady state lifetime_maxmem == current_maxmem.
> >
> > is a flawed assumption, except perhaps as an initial condition
> > or in systems where RAM is almost never a bottleneck.
> 
> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.
> 
> AIUI you currently set d->max_pages == lifetime_maxmem. In the steady
> state, therefore, current_maxmem == lifetime_maxmem == d->max_pages, and
> nothing changes compared with how things are for you today.
> 
> In the case where you are claiming some memory you change only max_pages
> (and not tot_pages as I incorrectly stated before; tot_pages can
> continue to vary dynamically, albeit with reduced range). So
> d->max_pages == current_maxmem, which is derived as I described previously
> (managing to keep my tot and max straight for once):
> 
>         When you want to claim some memory in order to start a new
>         domain of size M you *temporarily* reduce current_maxmem for
>         some set of domains on the chosen host and arrange that the
>         total of all the current_maxmems on the host is such that
>         "HOST_MEM - SUM(current_maxmems) > M".
> 
> I hope that clarifies what I was suggesting.
> 
> > Without that assumption, in your model, the toolstack must
> > make intelligent policy decisions about how to vary
> > current_maxmem relative to lifetime_maxmem, across all the
> > domains on the system.  Since the memory demands of any domain
> > often vary frequently, dramatically and unpredictably (i.e.
> > "spike") and since the performance consequences of inadequate
> > memory can be dire (i.e. "swap storm"), that is why I say the
> > toolstack (in your model) must both make frequent changes
> > to tot_pages and "be omniscient".
> 
> Agreed, I was mistaken in saying tot_pages where I meant max_pages.
> 
> My intention was to describe a scheme where max_pages would change only
> a) when you start building a new domain and b) when you finish building
> a domain. There should be no need to make adjustments between those
> events.
> 
> The inputs into the calculations are lifetime_maxmems for all domains,
> the current number of domains in the system, the initial allocation of
> any domain(s) currently being built (AKA the current claim) and the
> total physical RAM present in the host. AIUI all of those are either
> static or dynamic but only actually changing when new domains are
> introduced/removed (or otherwise only changing infrequently).
> 
> > So, Ian, would you please acknowledge that the Oracle model
> > is valid and, in such cases where your maxmem assumption
> > is incorrect, that hypervisor-controlled capacity allocation
> > (i.e. XENMEM_claim_pages) is an acceptable solution?
> 
> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be and so I cannot acknowledge what you ask.
> 
> Apologies again for my incorrect use of tot_pages, which has led to this
> confusion.

Hi Ian --

> I have no problem with the validity of the Oracle model. I don't think
> we have reached the consensus that the hypervisor-controlled capacity
> allocation is the only possible solution, or the preferable solution
> from the PoV of the hypervisor maintainers. In that sense it is
> "unacceptable" because things which can be done outside the hypervisor
> should be and so I cannot acknowledge what you ask.

IMHO, you have not yet demonstrated that your alternate proposal solves
the problem in the context which Oracle cares about, so I regret that we
must continue this discussion.

> I see that I have incorrectly (but it seems at least consistently) said
> "d->tot_pages" where I meant d->max_pages. This was no doubt extremely
> confusing and does indeed render the scheme unworkable. Sorry.

I am fairly sure I understood exactly what you were saying, and my
comments are the same even with your corrected text substituted: your
proposal works fine when there are no memory overcommit technologies
active (and thus on legacy proprietary domains), but it fails in the
Oracle context.

So let's ensure we agree on a few premises:

First, you said we agree that we are discussing the case of overcommitted
memory, where:

   SUM(lifetime_maxmem) > HOST_MEM

So that's good.

Then a second premise that I would like to check to ensure we
agree:  In the Oracle model, as I said, "open source guest kernels
can intelligently participate in optimizing their own memory usage...
such guests are now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).
With these mechanisms, there is direct guest->hypervisor interaction
that, without knowledge of the toolstack, causes d->tot_pages
to increase.  This interaction may (and does) occur from several
domains simultaneously and the increase for any domain may occur
frequently, unpredictably and sometimes dramatically.

Ian, do you agree with this premise and that a "capacity allocation
solution" (whether hypervisor-based or toolstack-based) must work
properly in this context?  Or are you maybe proposing to eliminate
all such interactions?  Or are you maybe proposing to insert the
toolstack in the middle of all such interactions?

Next, in your most recent reply, I think you skipped replying to my
comment of "[in your proposal] the toolstack must make intelligent
policy decisions about how to vary current_maxmem relative to
lifetime_maxmem, across all the domains on the system [1]".  We
seem to disagree on whether this need only be done twice per domain
launch (once at domain creation start and once at domain creation
finish, in your proposal) vs. more frequently.  But in either case,
do you agree that the toolstack is not equipped to make policy
decisions across multiple guests to do this and that poor
choices may have dire consequences (swapstorm, OOM) on a guest?
This is a third premise: Launching a domain should never cause
another unrelated domain to crash.  Do you agree?

I have more, but let's make sure we are on the same page
with these first.

Thanks,
Dan

[1] A clarification: In the Oracle model, there is only maxmem;
i.e. current_maxmem is always the same as lifetime_maxmem;
i.e. d->max_pages is fixed for the life of the domain and
only d->tot_pages varies; i.e. no intelligence is required
in the toolstack.  AFAIK, the distinction between current_maxmem
and lifetime_maxmem was added for Citrix DMC support.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-08 19:41                     ` Dan Magenheimer
@ 2013-01-09 10:41                       ` Ian Campbell
  2013-01-09 14:44                         ` Dan Magenheimer
  2013-01-10 10:31                       ` Ian Campbell
  1 sibling, 1 reply; 53+ messages in thread
From: Ian Campbell @ 2013-01-09 10:41 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> [1] A clarification: In the Oracle model, there is only maxmem;
> i.e. current_maxmem is always the same as lifetime_maxmem;

This is exactly what I am proposing that you change in order to
implement something like the claim mechanism in the toolstack.

If your model is fixed in stone and cannot accommodate changes of this
type then there isn't much point in continuing this conversation.

I think we need to agree on this before we consider the rest of your
mail in detail, so I have snipped all that for the time being.

> i.e. d->max_pages is fixed for the life of the domain and
> only d->tot_pages varies; i.e. no intelligence is required
> in the toolstack.  AFAIK, the distinction between current_maxmem
> and lifetime_maxmem was added for Citrix DMC support.

I don't believe Xen itself has any such concept; the distinction is
purely internal to the toolstack and a matter of which value it chooses
to push down to d->max_pages.

I don't know (or particularly care) what Citrix DMC does since I was not
involved with it other than when it triggered bugs in balloon drivers.

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-09 10:41                       ` Ian Campbell
@ 2013-01-09 14:44                         ` Dan Magenheimer
  2013-01-09 14:58                           ` Ian Campbell
  2013-01-14 15:45                           ` George Dunlap
  0 siblings, 2 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-09 14:44 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> > [1] A clarification: In the Oracle model, there is only maxmem;
> > i.e. current_maxmem is always the same as lifetime_maxmem;
> 
> This is exactly what I am proposing that you change in order to
> implement something like the claim mechanism in the toolstack.
> 
> If your model is fixed in stone and cannot accommodate changes of this
> type then there isn't much point in continuing this conversation.
> 
> I think we need to agree on this before we consider the rest of your
> mail in detail, so I have snipped all that for the time being.

Agreed that it is not fixed in stone.  I should have said
"In the _current_ Oracle model" and that footnote was only for
comparison purposes.  So, please, do proceed in commenting on the
two premises I outlined.
 
> > i.e. d->max_pages is fixed for the life of the domain and
> > only d->tot_pages varies; i.e. no intelligence is required
> > in the toolstack.  AFAIK, the distinction between current_maxmem
> > and lifetime_maxmem was added for Citrix DMC support.
> 
> I don't believe Xen itself has any such concept; the distinction is
> purely internal to the toolstack and a matter of which value it chooses
> to push down to d->max_pages.

Actually I believe a change was committed to the hypervisor specifically
to accommodate this.  George mentioned it earlier in this thread...
I'll have to dig to find the specific changeset but the change allows
the toolstack to reduce d->max_pages so that it is (temporarily)
less than d->tot_pages.  Such a change would clearly be unnecessary
if current_maxmem was always the same as lifetime_maxmem.
 
> I don't know (or particularly care) what Citrix DMC does since I was not
> involved with it other than when it triggered bugs in balloon drivers.

I bring up DMC not to impugn the maintainers' independence but
as I would if we were discussing an academic paper; DMC
is built on very similar concepts to the model you are proposing,
and (IMHO) DMC does not succeed in solving the memory overcommitment
problem.  Oracle has been building a different approach to memory
overcommit (selfballooning and tmem) for several years; it is
implemented in shipping Xen hypervisors and Linux kernels, and it is
in this context that we wish to ensure that any capacity allocation
mechanism, whether toolstack-based or hypervisor-based, works.

So, please, let's continue discussing the premises I outlined.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-09 14:44                         ` Dan Magenheimer
@ 2013-01-09 14:58                           ` Ian Campbell
  2013-01-14 15:45                           ` George Dunlap
  1 sibling, 0 replies; 53+ messages in thread
From: Ian Campbell @ 2013-01-09 14:58 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Wed, 2013-01-09 at 14:44 +0000, Dan Magenheimer wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> > solutions
> > 
> > On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> > > [1] A clarification: In the Oracle model, there is only maxmem;
> > > i.e. current_maxmem is always the same as lifetime_maxmem;
> > 
> > This is exactly what I am proposing that you change in order to
> > implement something like the claim mechanism in the toolstack.
> > 
> > If your model is fixed in stone and cannot accommodate changes of this
> > type then there isn't much point in continuing this conversation.
> > 
> > I think we need to agree on this before we consider the rest of your
> > mail in detail, so I have snipped all that for the time being.
> 
> Agreed that it is not fixed in stone.  I should have said
> "In the _current_ Oracle model" and that footnote was only for
> comparison purposes.  So, please, do proceed in commenting on the
> two premises I outlined.

I have a meeting in a moment, I'll take a look later.
 
Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-08 19:41                     ` Dan Magenheimer
  2013-01-09 10:41                       ` Ian Campbell
@ 2013-01-10 10:31                       ` Ian Campbell
  2013-01-10 18:42                         ` Dan Magenheimer
  1 sibling, 1 reply; 53+ messages in thread
From: Ian Campbell @ 2013-01-10 10:31 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> Then a second premise that I would like to check to ensure we
> agree:  In the Oracle model, as I said, "open source guest kernels
> can intelligently participate in optimizing their own memory usage...
> such guests are now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).
> With these mechanisms, there is direct guest->hypervisor interaction
> that, without knowledge of the toolstack, causes d->tot_pages
> to increase.  This interaction may (and does) occur from several
> domains simultaneously and the increase for any domain may occur
> frequently, unpredictably and sometimes dramatically.

Agreed.

> Ian, do you agree with this premise and that a "capacity allocation
> solution" (whether hypervisor-based or toolstack-based) must work
> properly in this context?

> Or are you maybe proposing to eliminate all such interactions?

I think these interactions are fine. They are obviously a key part of
your model. My intention is to suggest a possible userspace solution to
the claim proposal which continues to allow this behaviour.

> Or are you maybe proposing to insert the toolstack in the middle of
> all such interactions?

Not at all.

> Next, in your most recent reply, I think you skipped replying to my
> comment of "[in your proposal] the toolstack must make intelligent
> policy decisions about how to vary current_maxmem relative to
> lifetime_maxmem, across all the domains on the system [1]".  We
> seem to disagree on whether this need only be done twice per domain
> launch (once at domain creation start and once at domain creation
> finish, in your proposal) vs. more frequently.  But in either case,
> do you agree that the toolstack is not equipped to make policy
> decisions across multiple guests to do this

No, I don't agree.

> and that poor choices may have dire consequences (swapstorm, OOM) on a
> guest?

Setting maxmem on a domain does not immediately force a domain to that
amount of RAM, and so the act of setting maxmem is not going to
cause a swap storm. (I think this relates to the "distinction between
current_maxmem and lifetime_maxmem was added for Citrix DMC support"
patch you were referring to below; previously to that, Xen would reject
attempts to set max < current.)

Setting maxmem doesn't even ask the domain to try to head for that
limit (that is the target, which is a separate thing). So the domain
won't react to setting maxmem at all, and unless it goes specifically
looking I don't think it would even be aware that its maximum has been
temporarily reduced.

Having set all the maxmems on the domains, you would then immediately
check whether each domain has tot_pages under or over the temporary
maxmem limit. If all domains are under, then the claim has succeeded and
you may proceed to build the domain. If any one domain is over, then the
claim has failed and you need to reset all the maxmems back to the
lifetime value and try again on another host (I understand that this is
an accepted possibility with the h/v based claim approach too).

I forgot to say but you'd obviously want to use whatever controls tmem
provides to ensure it doesn't just gobble up the M bytes needed for the
new domain. It can of course continue to operate as normal on the
remainder of the spare RAM.
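
Putting the pieces together, the toolstack-side check amounts to
something like this sketch (get_tot_pages and set_max_pages are made-up
helpers standing in for however the toolstack reads d->tot_pages and
drives XEN_DOMCTL_max_mem; they are not real APIs):

    def try_claim(domains, current_maxmem, lifetime_maxmem,
                  get_tot_pages, set_max_pages):
        # Push the temporary caps down first.
        for d in domains:
            set_max_pages(d, current_maxmem[d])
        # If every domain is already at or under its temporary cap, the
        # claim has succeeded and the build can proceed.
        if all(get_tot_pages(d) <= current_maxmem[d] for d in domains):
            return True
        # Otherwise the claim has failed: restore the lifetime values and
        # try again on another host.
        for d in domains:
            set_max_pages(d, lifetime_maxmem[d])
        return False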

> AFAIK, the distinction between current_maxmem
> and lifetime_maxmem was added for Citrix DMC support.

As I mentioned above, I think you are thinking of the patch which caused
the XEN_DOMCTL_max_mem hypercall to succeed even if tot_pages is
currently greater than the newly requested maximum.

It's not quite the same thing as a distinction between current_maxmem
and lifetime_maxmem.

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 21:38       ` Dan Magenheimer
  2013-01-03 16:24         ` Andres Lagar-Cavilla
@ 2013-01-10 17:13         ` Tim Deegan
  2013-01-10 21:43           ` Dan Magenheimer
  1 sibling, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2013-01-10 17:13 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

Hi, 

At 13:38 -0800 on 02 Jan (1357133898), Dan Magenheimer wrote:
> > The discussion ought to be around the actual problem, which is (as far
> > as I can see) that in a system where guests are ballooning without
> > limits, VM creation failure can happen after a long delay.  In
> > particular it is the delay that is the problem, rather than the failure.
> > Some solutions that have been proposed so far:
> >  - don't do that, it's silly (possibly true but not helpful);
> >  - this reservation hypercall, to pull the failure forward;
> >  - make allocation faster to avoid the delay (a good idea anyway,
> >    but can it be made fast enough?);
> >  - use max_pages or similar to stop other VMs using all of RAM.
> 
> Good summary.  So, would you agree that the solution selection
> comes down to: "Can max_pages or similar be used effectively to
> stop other VMs using all of RAM? If so, who is implementing that?
> Else the reservation hypercall is a good solution." ?

Not quite.  I think there are other viable options, and I don't
particularly like the reservation hypercall.

I can still see something like max_pages working well enough.  AFAICS
the main problem with that solution is something like this: because it
limits the guests individually rather than collectively, it prevents
memory transfers between VMs even if they wouldn't clash with the VM
being built.  That could be worked around with an upcall to a toolstack
agent that reshuffles things on a coarse granularity based on need.  I
agree that's slower than having the hypervisor make the decisions but
I'm not convinced it'd be unmanageable.

Or, how about actually moving towards a memory scheduler like you
suggested -- for example by integrating memory allocation more tightly
with tmem.  There could be an xsm-style hook in the allocator for
tmem-enabled domains.  That way tmem would have complete control over
all memory allocations for the guests under its control, and it could
implement a shared upper limit.  Potentially in future the tmem
interface could be extended to allow it to force guests to give back
more kinds of memory, so that it could try to enforce fairness (e.g. if
two VMs are busy, why should the one that spiked first get to keep all
the RAM?) or other nice scheduler-like properties.
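
As a toy model of what I mean by a shared upper limit (names invented,
and nothing like the real allocator code):

    def tmem_alloc_hook(d, nr_pages, tmem_domains, shared_limit):
        # Hook consulted on each allocation for a tmem-enabled domain:
        # the group shares one limit instead of per-domain max_pages.
        if d not in tmem_domains:
            return True                 # non-tmem domains are unaffected
        group_total = sum(dom.tot_pages for dom in tmem_domains)
        return group_total + nr_pages <= shared_limit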

Or, you could consider booting the new guest pre-ballooned so it doesn't
have to allocate all that memory in the build phase.  It would boot much
quicker (solving the delayed-failure problem), and join the scramble for
resources on an equal footing with its peers.

> > My own position remains that I can live with the reservation hypercall,
> > as long as it's properly done - including handling PV 32-bit and PV
> > superpage guests.
> 
> Tim, would you at least agree that "properly" is a red herring?

I'm not quite sure what you mean by that.  To the extent that this isn't
a criticism of the high-level reservation design, maybe.  But I stand by
it as a criticism of the current implementation.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-10 10:31                       ` Ian Campbell
@ 2013-01-10 18:42                         ` Dan Magenheimer
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-10 18:42 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Thursday, January 10, 2013 3:32 AM
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions

Hi Ian --

Your email contains what I think is the most detailed description of the
mechanism of your proposal that I've seen yet, so I now understand it
better than before.  Thanks for that.

I'm still quite concerned about the policy issues, however, as
well as the unintended consequences of interactions between your
proposal and existing guest->hypervisor interactions including
tmem, in-guest ballooning, and (possibly) page-sharing.

So thanks much for continuing the discussion and please read on...

> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> > Then a second premise that I would like to check to ensure we
> > agree:  In the Oracle model, as I said, "open source guest kernels
> > can intelligently participate in optimizing their own memory usage...
> > such guests are now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).
> > With these mechanisms, there is direct guest->hypervisor interaction
> > that, without knowledge of the toolstack, causes d->tot_pages
> > to increase.  This interaction may (and does) occur from several
> > domains simultaneously and the increase for any domain may occur
> > frequently, unpredictably and sometimes dramatically.
> 
> Agreed.

OK, for brevity, I'm going to call these (guest->hypervisor interactions
that cause d->tot_pages to increase) "dynamic allocations".

> > Ian, do you agree with this premise and that a "capacity allocation
> > solution" (whether hypervisor-based or toolstack-based) must work
> > properly in this context?
> 
> > Or are you maybe proposing to eliminate all such interactions?
> 
> I think these interactions are fine. They are obviously a key part of
> your model. My intention is to suggest a possible userspace solution to
> the claim proposal which continues to allow this behaviour.

Good.  I believe George suggested much earlier in this thread that
such interactions should simply be disallowed, which made me a bit cross.
(I may also have misunderstood.)
 
> > Or are you maybe proposing to insert the toolstack in the middle of
> > all such interactions?
> 
> Not at all.

Good.  I believe Ian Jackson's proposal much earlier in a related thread
was something along these lines.  (Again, I may have misunderstood.)

So, Ian, for the sake of argument below, please envision a domain
in which d->tot_pages varies across time like a high-frequency
high-amplitude sine wave.  By bad luck, when d->tot_pages is sampled
at t=0, d->tot_pages is at the minimum point of the sine wave.
For brevity, let's call this a "worst-case domain."  (I realize
it is contrived, but it is not completely unrealistic either.)

And, as we've agreed, the toolstack is completely unaware of this
sine wave behavior.

> > Next, in your most recent reply, I think you skipped replying to my
> > comment of "[in your proposal] the toolstack must make intelligent
> > policy decisions about how to vary current_maxmem relative to
> > lifetime_maxmem, across all the domains on the system [1]".  We
> > seem to disagree on whether this need only be done twice per domain
> > launch (once at domain creation start and once at domain creation
> > finish, in your proposal) vs. more frequently.  But in either case,
> > do you agree that the toolstack is not equipped to make policy
> > decisions across multiple guests to do this
> 
> No, I don't agree.

OK, so then this is an important point of discussion.  You believe
the toolstack IS equipped to make policy decisions across multiple
guests.  Let's get back to that in a minute.

> > and that poor choices may have dire consequences (swapstorm, OOM) on a
> > guest?
> 
> Setting maxmem on a domain does not immediately force a domain to that
> amount of RAM, and so the act of setting maxmem is not going to
> cause a swap storm. (I think this relates to the "distinction between
> current_maxmem and lifetime_maxmem was added for Citrix DMC support"
> patch you were referring to below; previously to that, Xen would reject
> attempts to set max < current.)

Agreed that it doesn't "immediately force a domain", but let's
leave open the "not going to cause a swap storm" as a possible
point of disagreement.

> Setting maxmem doesn't even ask the domain to try to head for that
> limit (that is the target, which is a separate thing). So the domain
> won't react to setting maxmem at all, and unless it goes specifically
> looking I don't think it would even be aware that its maximum has been
> temporarily reduced.

Agreed, _except_ that during the period where its max_pages is temporarily
reduced (which, we've demonstrated earlier in a related thread, may
be a period of many minutes), there are now two differences:

1) if d->max_pages is set below d->tot_pages, all dynamic allocations
of the type that would otherwise cause d->tot_pages to increase will
now fail, and
2) if d->max_pages is set "somewhat" higher than d->tot_pages, the
possible increase of d->tot_pages has now been constrained; some
dynamic allocations will succeed and some will fail.

Do you agree that there is a possibility that these differences
may result in unintended consequences?
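
In other words, every dynamic allocation is effectively gated by a check
like the following (a toy model only, not actual hypervisor code), and
lowering d->max_pages narrows or closes that gate for the whole duration
of the claim:

    def dynamic_alloc(d, nr_pages):
        # Case 1: max_pages pushed below tot_pages -> every increase fails.
        # Case 2: max_pages set "somewhat" above tot_pages -> some increases
        #         succeed and some fail, i.e. growth is constrained.
        if d.tot_pages + nr_pages > d.max_pages:
            return False
        d.tot_pages += nr_pages
        return True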

> Having set all the maxmems on the domains, you would then immediately
> check whether each domain has tot_pages under or over the temporary
> maxmem limit.
>
> If all domains are under, then the claim has succeeded and you may
> proceed to build the domain. If any one domain is over, then the claim
> has failed and you need to reset all the maxmems back to the lifetime
> value and try again on another host (I understand that this is an
> accepted possibility with the h/v based claim approach too).

NOW you are getting into policy.  You say "set all the maxmems on
the domains" and "immediately check each domain's tot_pages".  Let me
interpret this as a policy statement and try to define it more precisely:

1) For the N domains running on the system (and N may be measured in
   the hundreds), you must select L domains (where 1<=L<=N) and, for
   each, make a hypercall to change d->max_pages.  How do you
   propose to select these L?  Or, in your proposal, is L==N?
   (i.e. L may also be >100)?
2) For each of the L domains, you must decide _how much_ to
   decrease d->max_pages.  (How do you propose to do this?  Maybe
   decrease each by the same amount, M-divided-by-L?)
3) You now make L (or is it N?) hypercalls to read each d->tot_pages.
4) I may be wrong, but I assume that _before_ you decrease d->max_pages
   you will likely want to sample d->tot_pages for each of the L domains
   to inform your selection process in (1) and (2) above.  If so, for
   each of the L (possibly N?) domains, a hypercall is required to check
   d->tot_pages, and a TOCTOU race is introduced because tot_pages
   may change unless and until you set d->max_pages lower than
   d->tot_pages.
5) Since the toolstack is unaware of dynamic allocations, your
   proposal might unwittingly decrease d->max_pages on a worst-case
   domain to the point where max_pages is much lower than the
   peak of the sine wave, and this constraint may be imposed for
   several minutes, potentially causing swapping or OOMs for our
   worst-case domains.  (Do you still disagree?)
6) You are imposing the above constraints on _all_ toolstacks.

Also, I'm not positive I understand, but it appears that your
solution as outlined will have false negatives; i.e. your
algorithm will cause some claims to fail when there is
actually sufficient RAM (in the case of "if any ONE domain is
over").  But unless you specify your selection criteria more
precisely, I don't know.
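
To spell out the sequence I believe you are proposing, here is a sketch
in Python-style pseudocode; every helper name is invented, and the
comments mark where the policy decisions and the TOCTOU window live:

    def toolstack_claim(all_domains, m_pages, get_tot_pages, get_max_pages,
                        set_max_pages, pick_victims, reduced_caps):
        # (4) Sample d->tot_pages for every domain -- each value can change
        # the moment it has been read (the TOCTOU window).
        snapshot = {d: get_tot_pages(d) for d in all_domains}
        # (1) and (2): the policy decisions -- which L domains, and how much
        # to take from each?
        victims = pick_victims(snapshot, m_pages)
        caps = reduced_caps(victims, snapshot, m_pages)

        saved = {d: get_max_pages(d) for d in victims}
        for d in victims:
            set_max_pages(d, caps[d])          # L more hypercalls

        # (3) Re-check: tot_pages may already have grown past the snapshot,
        # in which case the claim fails and everything must be undone.
        if any(get_tot_pages(d) > caps[d] for d in victims):
            for d in victims:
                set_max_pages(d, saved[d])
            return False
        return True                            # proceed to build the domain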

In sum, this all seems like a very high price to pay to avoid
less than a hundred lines of code (plus comments) in the
hypervisor.

> I forgot to say but you'd obviously want to use whatever controls tmem
> provides to ensure it doesn't just gobble up the M bytes needed for the
> new domain. It can of course continue to operate as normal on the
> remainder of the spare RAM.

Hmmm.. so you want to shut off _all_ dynamic allocations for
a period of possibly several minutes?   And how does tmem know
what the "remainder of the spare RAM" is... isn't that information
now only in the toolstack?  Forgive me if I am missing something
obvious, but in any case...

Tmem does have a gross ham-handed freeze/thaw mechanism to do this
via tmem hypercalls.  But AFAIK there is no equivalent mechanism for
controlling in-guest ballooning (nor AFAIK for shared-page
CoW resolution).  But reserving the M bytes in the hypervisor
(as the proposed XENMEM_claim_pages does) is atomic, so it solves any
TOCTOU races and both eliminates the need for tmem freeze/thaw and
solves the problem for in-guest-kernel selfballooning at the
same time. (And, I think, the shared-page CoW stuff as well.)
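
For comparison, the semantics of the reservation are roughly as follows
(a toy model of the bookkeeping only, not the actual hypervisor code;
Domain is a stand-in structure): the check and the recording of the
claim happen under the same lock the allocator already takes, so no
dynamic allocation can sneak in between them:

    import threading

    heap_lock = threading.Lock()
    free_pages = 1 << 20        # pages currently free (example value)
    outstanding_claims = 0      # pages promised to domains still being built

    class Domain:
        def __init__(self):
            self.tot_pages = 0
            self.claim_remaining = 0

    def claim_pages(d, nr_pages):
        global outstanding_claims
        with heap_lock:
            if free_pages - outstanding_claims < nr_pages:
                return False    # fails up front, before any building starts
            d.claim_remaining = nr_pages
            outstanding_claims += nr_pages
            return True

    def alloc_pages(d, nr_pages):
        global free_pages, outstanding_claims
        with heap_lock:
            # Pages promised to *other* domains are off limits.
            reserved_for_others = outstanding_claims - d.claim_remaining
            if free_pages - reserved_for_others < nr_pages:
                return False
            free_pages -= nr_pages
            d.tot_pages += nr_pages
            # Allocating against one's own claim consumes it.
            consumed = min(nr_pages, d.claim_remaining)
            d.claim_remaining -= consumed
            outstanding_claims -= consumed
            return True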
 
One more subtle but very important point, especially in the
context of memory overcommit:  Your toolstack-based proposal
explicitly constrains the growth of L independent domains.
This is a sum-of-maxes constraint.  The hypervisor-based proposal
constrains only the _total_ growth of N domains and is thus
a max-of-sums constraint.  Statistically, for any resource
management problem, a max-of-sums solution provides much
much more flexibility.  So even academically speaking, the
hypervisor solution is superior.  (If that's clear as mud,
please let me know and I can try to explain further.)
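
To put the same point in formulas (a sketch, writing tot[i] for
d->tot_pages and cap[i] for the temporary current_maxmem of domain i;
the function names are just for illustration):

    def sum_of_maxes_ok(tot, cap, host_mem, m):
        # Toolstack proposal: each domain is individually capped, and the
        # caps must sum to no more than HOST_MEM - M.  A domain that spikes
        # cannot borrow headroom left unused by a quiet domain.
        return (all(tot[i] <= cap[i] for i in tot)
                and sum(cap.values()) <= host_mem - m)

    def max_of_sums_ok(tot, host_mem, m):
        # Hypervisor claim: only the aggregate is constrained, so capacity
        # freed by one domain is immediately usable by another.
        return sum(tot.values()) <= host_mem - m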

Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-10 17:13         ` Tim Deegan
@ 2013-01-10 21:43           ` Dan Magenheimer
  2013-01-17 15:12             ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-10 21:43 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions

Hi Tim --

Thanks for the response.

> At 13:38 -0800 on 02 Jan (1357133898), Dan Magenheimer wrote:
> > > The discussion ought to be around the actual problem, which is (as far
> > > as I can see) that in a system where guests are ballooning without
> > > limits, VM creation failure can happen after a long delay.  In
> > > particular it is the delay that is the problem, rather than the failure.
> > > Some solutions that have been proposed so far:
> > >  - don't do that, it's silly (possibly true but not helpful);
> > >  - this reservation hypercall, to pull the failure forward;
> > >  - make allocation faster to avoid the delay (a good idea anyway,
> > >    but can it be made fast enough?);
> > >  - use max_pages or similar to stop other VMs using all of RAM.
> >
> > Good summary.  So, would you agree that the solution selection
> > comes down to: "Can max_pages or similar be used effectively to
> > stop other VMs using all of RAM? If so, who is implementing that?
> > Else the reservation hypercall is a good solution." ?
> 
> Not quite.  I think there are other viable options, and I don't
> particularly like the reservation hypercall.

Are you suggesting an alternative option other than the max_pages
toolstack-based proposal that Ian and I are discussing in a parallel
subthread?  Just checking, in case I am forgetting an alternative
you (or someone else) proposed.

Are there reasons other than "incompleteness" (see below) that
you dislike the reservation hypercall?  To me, it seems fairly
elegant in that it uses the same locks for capacity-allocation
as for page allocation, thus guaranteeing no races can occur.

> I can still see something like max_pages working well enough.  AFAICS
> the main problem with that solution is something like this: because it
> limits the guests individually rather than collectively, it prevents
> memory transfers between VMs even if they wouldn't clash with the VM
> being built.

Indeed, you are commenting on one of the same differences
I observed today in the subthread with Ian, where I said
that the hypervisor-based solution is only "max-of-sums"-
constrained whereas the toolstack-based solution is
"sum-of-maxes"-constrained.  With tmem/selfballooning active,
what you call "memory transfers between VMs" can be happening
constantly.  (To clarify for others, it is not the contents
of the memory that is being transferred, just the capacity...
i.e. VM A frees a page and VM B allocates a page.)

So thanks for reinforcing this point as I think it is subtle
but important.

> That could be worked around with an upcall to a toolstack
> agent that reshuffles things on a coarse granularity based on need.  I
> agree that's slower than having the hypervisor make the decisions but
> I'm not convinced it'd be unmanageable.

"Based on need" begs a number of questions, starting with how
"need" is defined and how conflicting needs are resolved.
Tmem balances need as a self-adapting system. For your upcalls,
you'd have to convince me that, even if "need" could be communicated
to an guest-external entity (i.e. a toolstack), that the entity
would/could have any data to inform a policy to intelligently resolve
conflicts.  I also don't see how it could be done without either
significant hypervisor or guest-kernel changes.

> Or, how about actually moving towards a memory scheduler like you
> suggested -- for example by integrating memory allocation more tightly
> with tmem.  There could be an xsm-style hook in the allocator for
> tmem-enabled domains.  That way tmem would have complete control over
> all memory allocations for the guests under its control, and it could
> implement a shared upper limit.  Potentially in future the tmem
> interface could be extended to allow it to force guests to give back
> more kinds of memory, so that it could try to enforce fairness (e.g. if
> two VMs are busy, why should the one that spiked first get to keep all
> the RAM?) or other nice scheduler-like properties.

Tmem (plus selfballooning), unchanged, already does some of this.
While I would be interested in discussing better solutions, the
now four-year odyssey of pushing what I thought were relatively
simple changes upstream into Linux has left a rather sour taste
in my mouth, so rather than consider any solution that requires
more guest kernel changes, I'd first prefer to ensure that you
thoroughly understand what tmem already does, and how and why.
Would you be interested in that?   I would be very happy to see
other core members of the Xen community (outside Oracle) understand
tmem, as I'd like to see the whole community benefit rather than
just Oracle.

> Or, you could consider booting the new guest pre-ballooned so it doesn't
> have to allocate all that memory in the build phase.  It would boot much
> quicker (solving the delayed-failure problem), and join the scramble for
> resources on an equal footing with its peers.

I'm not positive I understand "pre-ballooned" but IIUC, all Linux
guests already boot pre-ballooned, in that, from the vm.cfg file,
"mem=" is allocated, not "maxmem=".  If you mean something less than
"mem=", you'd have to explain to me how Xen guesses how much memory a
guest kernel needs when even the guest kernel doesn't know.

Tmem, with self-ballooning, launches the guest with "mem=", and
then the guest kernel "self adapts" to (dramatically) reduce its usage
soon after boot.  It can be fun to "watch(1)", meaning using the
Linux "watch -d 'head -1 /proc/meminfo'" command.

> > > My own position remains that I can live with the reservation hypercall,
> > > as long as it's properly done - including handling PV 32-bit and PV
> > > superpage guests.
> >
> > Tim, would you at least agree that "properly" is a red herring?
> 
> I'm not quite sure what you mean by that.  To the extent that this isn't
> a criticism of the high-level reservation design, maybe.  But I stand by
> it as a criticism of the current implementation.

Sorry, I was just picking on word usage.  IMHO, the hypercall
does work "properly" for the classes of domains it was designed
to work on (which I'd estimate in the range of 98% of domains
these days).  I do agree that it doesn't work for 2%, so I'd
claim that the claim hypercall is "properly done", but maybe
not "completely done".  Clearly, one would prefer a solution that
handles 100%, but I'd rather have a solution that solves 98%
(and doesn't make the other 2% any worse), than no solution at all.

Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 15:29     ` Andres Lagar-Cavilla
@ 2013-01-11 16:03       ` Konrad Rzeszutek Wilk
  2013-01-11 16:13         ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-11 16:03 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Tim Deegan, Ian Jackson, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich

Heya,

Much appreciate your input, and below are my responses.
> >>> A) In Linux, a privileged user can write to a sysfs file which writes
> >>> to the balloon driver which makes hypercalls from the guest kernel to
> >> 
> >> A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host. This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hyper call.
> > 
> > Why is this a limitation? Why shouldn't the guest be allowed to change
> > its memory usage? It can go up and down as it sees fit.
> 
> No no. Can the guest change its cpu utilization outside scheduler constraints? NIC/block dev quotas? Why should an unprivileged guest be able to take a massive s*it over the host controller's memory allocation, at the guest's whim?

There is a limit to what it can do. It is not an uncontrolled guest
causing mayhem - it does its stuff within the parameters of the guest config.
"Within", in my mind, also implies 'tmem' doing extra things in the hypervisor.

> 
> I'll be happy with a balloon the day I see an OS that can't be rooted :)
> 
> Obviously this points to a problem with sharing & paging. And this is why I still spam this thread. More below.
>  
> > And if it goes down and it gets better performance - well, why shouldn't
> > it do it?
> > 
> > I concur it is odd - but it has been like that for decades.
> 
> Heh. Decades … one?

Still - a decade.
> > 
> > 
> >> 
> >>> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
> >>> The toolstack controls constraints (essentially a minimum and maximum)
> >>> which the hypervisor enforces.  The toolstack can ensure that the
> >>> minimum and maximum are identical to essentially disallow Linux from
> >>> using this functionality.  Indeed, this is precisely what Citrix's
> >>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
> >>> footprint changes.  But DMC is not prescribed by the toolstack,
> >> 
> >> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.

That would also do the trick - but there are penalties to it.

If one just wants to launch multiple guests and "freeze" all the other guests
from using the balloon driver - that can certainly be done.

But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
even need that and can just allocate without having to worry about the other
guests at all - b/c you have reserved enough memory in the hypervisor (host) to
launch the guest.

> > 
> > There is a down-call (so events) to the tool-stack from the hypervisor when
> > the guest tries to balloon in/out? So the need for this problem arose
> > but the mechanism to deal with it has been shifted to the user-space
> > then? What to do when the guest does this in/out balloon at freq
> > intervals?
> > 
> > I am missing actually the reasoning behind wanting to stall the domain?
> > Is that to compress/swap the pages that the guest requests? Meaning
> > an user-space daemon that does "things" and has ownership
> > of the pages?
> 
> The (my) reasoning is that this enables control over unsupervised growth. I was being facetious a couple lines above. Paging and sharing also have the same problem with badly behaved guests. So this is where you stop these guys, allow the toolstack to catch a breath, and figure out what to do with this domain (more RAM? page out? foo?).

But what if we do not even need the toolstack to catch a breath? The goal
here is for it not to be involved in this and to let the hypervisor deal with
unsupervised growth, as it is better equipped to do so - and it is the ultimate
judge of whether the guest can grow wildly or not.

I mean, why make the toolstack become CPU-bound when you can just have
the hypervisor take this extra information into account and avoid
the CPU-bound problem altogether?

> 
> All your questions are very valid, but they are policy in toolstack-land. Luckily the hypervisor needs no knowledge of that.

My thinking is that some policy (say, how much the guests can grow) is something
that the host sets, and the hypervisor is the engine that takes these values
into account and runs with them.

I think you are advocating that the "engine" and the policy should both
be in user-land.

.. snip..
> >> Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discard page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"
> >> 
> > 
> > Is the toolstack (or a daemon in userspace) doing this? I would
> > have thought that there would be some optimization to do this
> > somewhere?
> 
> You could optimize but then you are baking policy where it does not belong. This is what KSM did, which I dislike. Seriously, does the kernel need to scan memory to find duplicates? Can't something else do it given suitable interfaces? Now any other form of sharing policy that tries to use VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried to avoid this. I don't know that it was a conscious or deliberate effort, but it worked out that way.

OK, I think I understand you - you are advocating for user-space
because the combination of policy/engine can be done there.

Dan's and my thinking is to piggyback on the hypervisor's MM engine
and just provide a means of tweaking one value. In some ways that
is similar to adding sysctls in the kernel to tell the MM how to
behave.

.. snip..
> > That code makes certain assumptions - that the guest will not go up/down
> > in the ballooning once the toolstack has decreed how much
> > memory the guest should use. It also assumes that the operations
> > are semi-atomic - and to make it so as much as it can - it executes
> > these operations serially.
> > 
> > This goes back to the problem statement - if we try to parallelize
> > this we run into the problem that the amount of memory we thought
> > was free is no longer accurate. The start of this email has a good
> > description of some of the issues.
> 
> Just set max_pages (bad name...) everywhere as needed to make room. Then kick tmem (everywhere, in parallel) to free memory. Wait until enough is free …. Allocate your domain(s, in parallel). If any vcpus become stalled because a tmem guest driver is trying to allocate beyond max_pages, you need to adjust your allocations. As usual.


Versus just one "reserve" that would remove the need for most of this.
That is - if we cannot "reserve" we would fall back to the mechanism you
stated, but if there is enough memory we do not have to play the "wait"
game (which on a 1TB machine takes forever and makes launching guests
sometimes take minutes) - and can launch the guest without having to
worry about the slow path.
.. snip.

> >> 
> > 
> > I believe Dan is saying that it is not enabled by default.
> > Meaning it does not get executed by /etc/init.d/xencommons and
> > as such it never gets run (or does it now?) - unless one knows
> > about it - or it is enabled by default in a product. But perhaps
> > we are both mistaken? Is it enabled by default now on xen-unstable?
> 
> I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging daemon? Neither daemon requires wait queue work, batch allocations, etc. I can't figure out what this portion of the conversation is about.

The xenshared daemon.
> 
> Having said that, thanks for the thoughtful follow-up

Thank you for your response!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-11 16:03       ` Konrad Rzeszutek Wilk
@ 2013-01-11 16:13         ` Andres Lagar-Cavilla
  2013-01-11 19:08           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-11 16:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Tim Deegan, Ian Jackson, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich


On Jan 11, 2013, at 11:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:

> Heya,
> 
> Much appreciate your input, and below are my responses.
>>>>> A) In Linux, a privileged user can write to a sysfs file which writes
>>>>> to the balloon driver which makes hypercalls from the guest kernel to
>>>> 
>>>> A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host. This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hyper call.
>>> 
>>> Why is this a limitation? Why shouldn't the guest be allowed to change
>>> its memory usage? It can go up and down as it sees fit.
>> 
>> No no. Can the guest change its cpu utilization outside scheduler constraints? NIC/block dev quotas? Why should an unprivileged guest be able to take a massive s*it over the host controller's memory allocation, at the guest's whim?
> 
> There is a limit to what it can do. It is not an uncontrolled guest
> causing mayhem - it does its stuff within the parameters of the guest config.
> "Within", in my mind, also implies 'tmem' doing extra things in the hypervisor.
> 
>> 
>> I'll be happy with a balloon the day I see an OS that can't be rooted :)
>> 
>> Obviously this points to a problem with sharing & paging. And this is why I still spam this thread. More below.
>> 
>>> And if it goes down and it gets better performance - well, why shouldn't
>>> it do it?
>>> 
>>> I concur it is odd - but it has been like that for decades.
>> 
>> Heh. Decades … one?
> 
> Still - a decade.
>>> 
>>> 
>>>> 
>>>>> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
>>>>> The toolstack controls constraints (essentially a minimum and maximum)
>>>>> which the hypervisor enforces.  The toolstack can ensure that the
>>>>> minimum and maximum are identical to essentially disallow Linux from
>>>>> using this functionality.  Indeed, this is precisely what Citrix's
>>>>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
>>>>> footprint changes.  But DMC is not prescribed by the toolstack,
>>>> 
>>>> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.
> 
> That would also do the trick - but there are penalties to it.
> 
> If one just wants to launch multiple guests and "freeze" all the other guests
> from using the balloon driver - that can certainly be done.
> 
> But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
> even need that and can just allocate without having to worry about the other
> guests at all - b/c you have reserved enough memory in the hypervisor (host) to
> launch the guest.

Konrad:
Ok, what happens when a guest is stalled because it cannot allocate more pages due to existing claims? Exactly the same that happens when it can't grow because it has hit d->max_pages.

> 
>>> 
>>> There is a down-call (so events) to the tool-stack from the hypervisor when
>>> the guest tries to balloon in/out? So the need to deal with this problem arose,
>>> but the mechanism for dealing with it has been shifted to user-space
>>> then? What happens when the guest does this in/out ballooning at frequent
>>> intervals?
>>> 
>>> I am missing actually the reasoning behind wanting to stall the domain?
>>> Is that to compress/swap the pages that the guest requests? Meaning
>>> an user-space daemon that does "things" and has ownership
>>> of the pages?
>> 
>> The (my) reasoning is that this enables control over unsupervised growth. I was being facetious a couple lines above. Paging and sharing also have the same problem with badly behaved guests. So this is where you stop these guys, allow the toolstack to catch a breath, and figure out what to do with this domain (more RAM? page out? foo?).
> 
> But what if the toolstack does not even have to catch a breath? The goal
> here is for it not to be involved in this and let the hypervisor deal with
> unsupervised growth as it is better equipped to do so - and it is the ultimate
> judge whether the guest can grow wildly or not.
> 
> I mean why make the toolstack become CPU bound when you can just set
> the hypervisor to take this extra information into account and you avoid
> the CPU-bound problem altogether.
> 
>> 
>> All your questions are very valid, but they are policy in toolstack-land. Luckily the hypervisor needs no knowledge of that.
> 
> My thinking is that some policy (say how much the guests can grow) is something
> that the host sets. And the hypervisor is the engine that takes these values
> into account and runs with them.
> 
> I think you are advocating that the "engine" and policy should be both
> in the user-land.
> 
> .. snip..
>>>> Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discard page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"
>>>> 
>>> 
>>> Is the toolstack (or a daemon in userspace) doing this? I would
>>> have thought that there would be some optimization to do this
>>> somewhere?
>> 
>> You could optimize but then you are baking policy where it does not belong. This is what KSM did, which I dislike. Seriously, does the kernel need to scan memory to find duplicates? Can't something else do it given suitable interfaces? Now any other form of sharing policy that tries to use VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried to avoid this. I don't know that it was a conscious or deliberate effort, but it worked out that way.
> 
> OK, I think I understand you - you are advocating for user-space
> because the combination of policy/engine can be done there.
> 
> Dan's and my thinking is to piggyback on the hypervisor's MM engine
> and just provide a means of tweaking one value. In some ways that
> is similar to making sysctls in the kernel to tell the MM how to
> behave.
> 
> .. snip..
>>> That code makes certain assumptions - that the guest will not go up/down
>>> in the ballooning once the toolstack has decreed how much
>>> memory the guest should use. It also assumes that the operations
>>> are semi-atomic - and to make it so as much as it can - it executes
>>> these operations in serial.
>>> 
>>> This goes back to the problem statement - if we try to parallelize
>>> this we run into the problem that the amount of memory we thought
>>> was free is not true anymore. The start of this email has a good
>>> description of some of the issues.
>> 
>> Just set max_pages (bad name...) everywhere as needed to make room. Then kick tmem (everywhere, in parallel) to free memory. Wait until enough is free …. Allocate your domain(s, in parallel). If any vcpus become stalled because a tmem guest driver is trying to allocate beyond max_pages, you need to adjust your allocations. As usual.
> 
> 
> Versus just one "reserve" that would remove the need for most of this.
> That is - if we cannot "reserve" we would fall back to the mechanism you
> stated, but if there is enough memory we do not have to do the "wait"
> game (which on a 1TB machine takes forever and makes launching guests sometimes
> take minutes) - and can launch the guest without having to worry
> about slow-path.
> .. snip.

The "wait" could be literally zero in a common case. And if not, because there is not enough free ram, the claim would have failed.

> 
>>>> 
>>> 
>>> I believe what Dan is saying is that it is not enabled by default.
>>> Meaning it does not get executed by /etc/init.d/xencommons and
>>> as such it never gets run (or does it now?) - unless one knows
>>> about it - or it is enabled by default in a product. But perhaps
>>> we are both mistaken? Is it enabled by default now on xen-unstable?
>> 
>> I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging daemon? Neither daemon requires wait queue work, batch allocations, etc. I can't figure out what this portion of the conversation is about.
> 
> The xenshared daemon.
That's not in the tree. Unbeknownst to me. Would appreciate to know more. Or is it a symbolic placeholder in this conversation?

Andres

>> 
>> Having said that, thanks for the thoughtful follow-up
> 
> Thank you for your response!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-11 16:13         ` Andres Lagar-Cavilla
@ 2013-01-11 19:08           ` Konrad Rzeszutek Wilk
  2013-01-14 16:00             ` George Dunlap
  2013-01-17 15:16             ` Tim Deegan
  0 siblings, 2 replies; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-11 19:08 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Ian Jackson, Tim Deegan, xen-devel,
	Konrad Rzeszutek Wilk, Jan Beulich

> >>>> Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.
> > 
> > That would also do the trick - but there are penalties to it.
> > 
> > If one just wants to launch multiple guests and "freeze" all the other guests
> > from using the balloon driver - that can certainly be done.
> > 
> > But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
> > even need that and can just allocate without having to worry about the other
> > guests at all - b/c you have reserved enough memory in the hypervisor (host) to
> > launch the guest.
> 
> Konrad:
> Ok, what happens when a guest is stalled because it cannot allocate more pages due to existing claims? Exactly the same that happens when it can't grow because it has hit d->max_pages.

But it wouldn't. I am going out on a limb here, b/c I believe this is what the code
does but I should double-check.

The variables for the guest to go up/down would still stay in place - so it
should not be impacted by the 'claim'. Meaning you just leave them alone
and let the guest do whatever it wants without influencing it.

If the claim hypercall fails, then yes - you could have this issue.

But the solutions to the hypercall failing are multiple - one is to
try to "squeeze" all the guests to make space or just try to allocate
the guest on another box that has more memory and where the claim
hypercall returned success. Or it can do these claim hypercalls
on all the nodes in parallel and pick amongst the ones that returned
success.

Perhaps the 'claim' call should be called 'probe_and_claim'?

.. snip..
> >>> That code makes certain assumptions - that the guest will not go up/down
> >>> in the ballooning once the toolstack has decreed how much
> >>> memory the guest should use. It also assumes that the operations
> >>> are semi-atomic - and to make it so as much as it can - it executes
> >>> these operations in serial.
> >>> 
> >>> This goes back to the problem statement - if we try to parallelize
> >>> this we run into the problem that the amount of memory we thought
> >>> was free is not true anymore. The start of this email has a good
> >>> description of some of the issues.
> >> 
> >> Just set max_pages (bad name...) everywhere as needed to make room. Then kick tmem (everywhere, in parallel) to free memory. Wait until enough is free …. Allocate your domain(s, in parallel). If any vcpus become stalled because a tmem guest driver is trying to allocate beyond max_pages, you need to adjust your allocations. As usual.
> > 
> > 
> > Versus just one "reserve" that would remove the need for most of this.
> > That is - if we cannot "reserve" we would fall back to the mechanism you
> > stated, but if there is enough memory we do not have to do the "wait"
> > game (which on a 1TB machine takes forever and makes launching guests sometimes
> > take minutes) - and can launch the guest without having to worry
> > about slow-path.
> > .. snip.
> 
> The "wait" could be literally zero in a common case. And if not, because there is not enough free ram, the claim would have failed.
> 

Absolutely. And that is the beauty of it. If it fails then we can
decide to pursue other options knowing that there was no race in finding
the value of free memory at all. The other options could be the
squeeze other guests down and try again; or just decide to claim/allocate
the guest on another host altogether.
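
As a rough sketch of that flow (an illustration only, not a patch): assume
a libxc wrapper for the proposed hypercall along the lines of
xc_domain_claim_pages(xch, domid, nr_pages), returning 0 on success and an
error when the claim cannot be satisfied; build_domain(), squeeze_guests()
and try_other_host() are hypothetical stand-ins for existing domain-build
and toolstack policy code, not real APIs.

    #include <xenctrl.h>

    /* Placeholders for existing toolstack code -- not real APIs. */
    int build_domain(xc_interface *xch, uint32_t domid);
    int squeeze_guests(unsigned long nr_pages);
    int try_other_host(uint32_t domid, unsigned long nr_pages);

    static int place_guest(xc_interface *xch, uint32_t domid,
                           unsigned long nr_pages)
    {
        /* Fast path: reserve the memory atomically, no free-memory race. */
        if (xc_domain_claim_pages(xch, domid, nr_pages) == 0)
            return build_domain(xch, domid);

        /* Slow path, only entered when the claim says "no":
         * make room locally and retry... */
        squeeze_guests(nr_pages);
        if (xc_domain_claim_pages(xch, domid, nr_pages) == 0)
            return build_domain(xch, domid);

        /* ...or give up locally and try to claim on another host. */
        return try_other_host(domid, nr_pages);
    }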


> >>> I believe what Dan is saying is that it is not enabled by default.
> >>> Meaning it does not get executed by /etc/init.d/xencommons and
> >>> as such it never gets run (or does it now?) - unless one knows
> >>> about it - or it is enabled by default in a product. But perhaps
> >>> we are both mistaken? Is it enabled by default now on xen-unstable?
> >> 
> >> I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging daemon? Neither daemon requires wait queue work, batch allocations, etc. I can't figure out what this portion of the conversation is about.
> > 
> > The xenshared daemon.
> That's not in the tree. Unbeknownst to me. Would appreciate to know more. Or is it a symbolic placeholder in this conversation?

OK, I am confused then. I thought there was now a daemon that would take
care of the PoD and swapping? Perhaps it's called something else?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-09 14:44                         ` Dan Magenheimer
  2013-01-09 14:58                           ` Ian Campbell
@ 2013-01-14 15:45                           ` George Dunlap
  2013-01-14 18:18                             ` Dan Magenheimer
  1 sibling, 1 reply; 53+ messages in thread
From: George Dunlap @ 2013-01-14 15:45 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org), Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On 09/01/13 14:44, Dan Magenheimer wrote:
>> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
>> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
>> solutions
>>
>> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
>>> [1] A clarification: In the Oracle model, there is only maxmem;
>>> i.e. current_maxmem is always the same as lifetime_maxmem;
>> This is exactly what I am proposing that you change in order to
>> implement something like the claim mechanism in the toolstack.
>>
>> If your model is fixed in stone and cannot accommodate changes of this
>> type then there isn't much point in continuing this conversation.
>>
>> I think we need to agree on this before we consider the rest of your
>> mail in detail, so I have snipped all that for the time being.
> Agreed that it is not fixed in stone.  I should have said
> "In the _current_ Oracle model" and that footnote was only for
> comparison purposes.  So, please, do proceed in commenting on the
> two premises I outlined.
>   
>>> i.e. d->max_pages is fixed for the life of the domain and
>>> only d->tot_pages varies; i.e. no intelligence is required
>>> in the toolstack.  AFAIK, the distinction between current_maxmem
>>> and lifetime_maxmem was added for Citrix DMC support.
>> I don't believe Xen itself has any such concept, the distinction is
>> purely internal to the toolstack and which value it chooses to push down
>> to d->max_pages.
> Actually I believe a change was committed to the hypervisor specifically
> to accommodate this.  George mentioned it earlier in this thread...
> I'll have to dig to find the specific changeset but the change allows
> the toolstack to reduce d->max_pages so that it is (temporarily)
> less than d->tot_pages.  Such a change would clearly be unnecessary
> if current_maxmem was always the same as lifetime_maxmem.

Not exactly.  You could always change d->max_pages; and so there was 
never a concept of "lifetime_maxmem" inside of Xen.

The change I think you're talking about is this.  While you could always 
change d->max_pages, it used to be the case that if you tried to set 
d->max_pages to a value less than d->tot_pages, it would return 
-EINVAL*.    What this meant was that if you wanted to use d->max_pages 
to enforce a ballooning request, you had to do the following:
  1. Issue a balloon request to the guest
  2. Wait for the guest to successfully balloon down to the new target
  3. Set d->max_pages to the new target.

The waiting made the logic more complicated, and also introduced a race 
between steps 2 and 3.  So the change was made so that Xen would 
tolerate setting max_pages to less than tot_pages.  Then things looked 
like this:
  1. Set d->max_pages to the new target
  2. Issue a balloon request to the guest.

The new semantics guaranteed that the guest would not be able to "change 
its mind" and ask for memory back after freeing it without the toolstack 
needing to closely monitor the actual current usage.

But even before the change, it was still possible to change max_pages; 
so the change doesn't have any bearing on the discussion here.

  -George

* I may have some of the details incorrect (e.g., maybe it was 
d->tot_pages+something else, maybe it didn't return -EINVAL but failed 
in some other way), but the general idea is correct.
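
(For concreteness, a minimal sketch of the clamp-then-balloon ordering
above, using two toolstack interfaces that do exist -- xc_domain_setmaxmem()
and the xenstore memory/target node the guest balloon driver watches.  The
wrapper below is illustrative rather than actual xl/xapi code, and error
handling is omitted.)

    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* Clamp first, then ask the guest to balloon down to the same target.
     * Because Xen tolerates max_pages < tot_pages, the guest cannot
     * "change its mind" and re-grow while the toolstack waits. */
    static void enforce_balloon_target(xc_interface *xch, struct xs_handle *xs,
                                       int domid, uint64_t target_kib)
    {
        char path[64], val[32];

        xc_domain_setmaxmem(xch, domid, target_kib);            /* step 1 */

        snprintf(path, sizeof(path), "/local/domain/%d/memory/target", domid);
        snprintf(val, sizeof(val), "%" PRIu64, target_kib);
        xs_write(xs, XBT_NULL, path, val, strlen(val));         /* step 2 */
    }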

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-11 19:08           ` Konrad Rzeszutek Wilk
@ 2013-01-14 16:00             ` George Dunlap
  2013-01-14 16:11               ` Andres Lagar-Cavilla
  2013-01-17 15:16             ` Tim Deegan
  1 sibling, 1 reply; 53+ messages in thread
From: George Dunlap @ 2013-01-14 16:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Ian Jackson, Tim (Xen.org),
	Konrad Rzeszutek Wilk, Jan Beulich, xen-devel

On 11/01/13 19:08, Konrad Rzeszutek Wilk wrote:
>>> The xenshared daemon.
>> That's not in the tree. Unbeknownst to me. Would appreciate to know more. Or is it a symbolic placeholder in this conversation?
> OK, I am confused then. I thought there was now a daemon that would take
> care of the PoD and swapping? Perhaps it's called something else?

FYI PoD at the moment is all handled within the hypervisor.  There was 
discussion of extending the swap daemon (sorry, also don't know the 
official name off the top of my head) to handle PoD, since PoD is 
essentially a degenerate case of swapping, but it hasn't happened yet.  
In any case it will need to be tested first, since it may cause boot 
time for pre-ballooned guests to slow down unacceptably.

  -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 16:00             ` George Dunlap
@ 2013-01-14 16:11               ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14 16:11 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Konrad Rzeszutek Wilk, Andres Lagar-Cavilla,
	Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson


On Jan 14, 2013, at 11:00 AM, George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 11/01/13 19:08, Konrad Rzeszutek Wilk wrote:
>>>> The xenshared daemon.
>>> That's not in the tree. Unbeknownst to me. Would appreciate to know more. Or is it a symbolic placeholder in this conversation?
>> OK, I am confused then. I thought there was now a daemon that would take
>> care of the PoD and swapping? Perhaps it's called something else?
> 
> FYI PoD at the moment is all handled within the hypervisor.  There was discussion of extending the swap daemon (sorry, also don't know the official name off the top of my head) to handle PoD, since PoD is essentially a degenerate case of swapping, but it hasn't happened yet.  In any case it will need to be tested first, since it may cause boot time for pre-ballooned guests to slow down unacceptably.

This need not involve the swapping daemon. I thought the idea was to have PoD use the existing hypervisor paging infrastructure. We could add a p2m type (paged_out_zero) and then there would be no need to actively involve a swapping daemon during allocations.

Andres

> 
> -George
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 15:45                           ` George Dunlap
@ 2013-01-14 18:18                             ` Dan Magenheimer
  2013-01-14 19:42                               ` George Dunlap
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-14 18:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Keir (Xen.org), Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions

Hi George -- I trust we have gotten past the recent unpleasantness?
I do value your technical input to this debate (even when we
disagree), so I thank you for continuing the discussion below.

> On 09/01/13 14:44, Dan Magenheimer wrote:
> >> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> >> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> >> solutions
> >>
> >> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> >>> [1] A clarification: In the Oracle model, there is only maxmem;
> >>> i.e. current_maxmem is always the same as lifetime_maxmem;
> >> This is exactly what I am proposing that you change in order to
> >> implement something like the claim mechanism in the toolstack.
> >>
> >> If your model is fixed in stone and cannot accommodate changes of this
> >> type then there isn't much point in continuing this conversation.
> >>
> >> I think we need to agree on this before we consider the rest of your
> >> mail in detail, so I have snipped all that for the time being.
> > Agreed that it is not fixed in stone.  I should have said
> > "In the _current_ Oracle model" and that footnote was only for
> > comparison purposes.  So, please, do proceed in commenting on the
> > two premises I outlined.
> >
> >>> i.e. d->max_pages is fixed for the life of the domain and
> >>> only d->tot_pages varies; i.e. no intelligence is required
> >>> in the toolstack.  AFAIK, the distinction between current_maxmem
> >>> and lifetime_maxmem was added for Citrix DMC support.
> >> I don't believe Xen itself has any such concept, the distinction is
> >> purely internal to the toolstack and which value it chooses to push down
> >> to d->max_pages.
> > Actually I believe a change was committed to the hypervisor specifically
> > to accommodate this.  George mentioned it earlier in this thread...
> > I'll have to dig to find the specific changeset but the change allows
> > the toolstack to reduce d->max_pages so that it is (temporarily)
> > less than d->tot_pages.  Such a change would clearly be unnecessary
> > if current_maxmem was always the same as lifetime_maxmem.
> 
> Not exactly.  You could always change d->max_pages; and so there was
> never a concept of "lifetime_maxmem" inside of Xen.

(Well, not exactly "always", but since Aug 2006... changeset 11257.
There being no documentation, it's not clear whether the addition
of a domctl to modify d->max_pages was intended to be used
frequently by the toolstack, as opposed to being used only rarely and only
by a responsible host system administrator.)

> The change I think you're talking about is this.  While you could always
> change d->max_pages, it used to be the case that if you tried to set
> d->max_pages to a value less than d->tot_pages, it would return
> -EINVAL*.    What this meant was that if you wanted to use d->max_pages
> to enforce a ballooning request, you had to do the following:
>   1. Issue a balloon request to the guest
>   2. Wait for the guest to successfully balloon down to the new target
>   3. Set d->max_pages to the new target.
> 
> The waiting made the logic more complicated, and also introduced a race
> between steps 2 and 3.  So the change was made so that Xen would
> tolerate setting max_pages to less than tot_pages.  Then things looked
> like this:
>   1. Set d->max_pages to the new target
>   2. Issue a balloon request to the guest.
> 
> The new semantics guaranteed that the guest would not be able to "change
> its mind" and ask for memory back after freeing it without the toolstack
> needing to closely monitor the actual current usage.
> 
> But even before the change, it was still possible to change max_pages;
> so the change doesn't have any bearing on the discussion here.
> 
>   -George
> 
> * I may have some of the details incorrect (e.g., maybe it was
> d->tot_pages+something else, maybe it didn't return -EINVAL but failed
> in some other way), but the general idea is correct.

Yes, understood.  Ian please correct me if I am wrong, but I believe
your proposal (at least as last stated) does indeed, in some cases,
set d->max_pages less than or equal to d->tot_pages.  So AFAICT the
change does very much have a bearing on the discussion here. 

> The new semantics guaranteed that the guest would not be able to "change
> its mind" and ask for memory back after freeing it without the toolstack
> needing to closely monitor the actual current usage.

Exactly.  So, in your/Ian's model, you are artificially constraining a
guest's memory growth, including any dynamic allocations*.  If, by bad luck,
you do that at a moment when the guest is growing and very much in
need of that additional memory, the guest may now swapstorm or OOM, and
the toolstack has seriously impacted a running guest.  Oracle considers
this both unacceptable and unnecessary.

In the Oracle model, d->max_pages never gets changed, except possibly
by explicit rare demand by a host administrator.  In the Oracle model,
the toolstack has no business arbitrarily changing a constraint for a
guest that can have a serious impact on the guest.  In the Oracle model,
each guest shrinks and grows its memory needs self-adaptively, only
constrained by the vm.cfg at the launch of the guest and the physical
limits of the machine (max-of-sums because it is done in the hypervisor,
not sum-of-maxes).  All this uses working shipping code upstream in
Xen and Linux... except that you are blocking from open source the
proposed XENMEM_claim_pages hypercall.
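
(To make the max-of-sums vs. sum-of-maxes distinction concrete, a toy
comparison -- illustrative only, not Xen code:)

    #include <stdbool.h>
    #include <stdint.h>

    struct dom { uint64_t max_pages; uint64_t tot_pages; };

    /* "Sum-of-maxes": static partitioning -- a new guest fits only if every
     * guest's lifetime maximum still fits in host RAM, even though most
     * guests never use that much at once. */
    static bool fits_sum_of_maxes(const struct dom *d, int n,
                                  uint64_t new_max, uint64_t host_pages)
    {
        uint64_t sum = new_max;
        for (int i = 0; i < n; i++)
            sum += d[i].max_pages;
        return sum <= host_pages;
    }

    /* "Max-of-sums": only the pages actually allocated right now have to
     * fit; the hypervisor's allocator (plus a claim for the new guest)
     * enforces this instantaneously, so guests share the slack. */
    static bool fits_max_of_sums(const struct dom *d, int n,
                                 uint64_t new_pages, uint64_t host_pages)
    {
        uint64_t in_use = new_pages;
        for (int i = 0; i < n; i++)
            in_use += d[i].tot_pages;
        return in_use <= host_pages;
    }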

So, I think it is very fair (not snide) to point out that a change was
made to the hypervisor to accommodate your/Ian's memory-management model,
a change that Oracle considers unnecessary, a change explicitly
supporting your/Ian's model, which is a model that has not been
implemented in open source and has no clear (let alone proven) policy
to guide it.  Yet you wish to block a minor hypervisor change which
is needed to accommodate Oracle's shipping memory-management model?

Please reconsider.

Thanks,
Dan

* To repeat my definition of that term, "dynamic allocations" means
any increase to d->tot_pages that is unbeknownst to the toolstack,
including specifically in-guest ballooning and certain tmem calls.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-02 21:59       ` Konrad Rzeszutek Wilk
@ 2013-01-14 18:28         ` George Dunlap
  2013-01-22 21:57           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2013-01-14 18:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
> Thanks for the clarification. I am not that fluent in the OCaml code.

I'm not fluent in OCaml either, I'm mainly going from memory based on 
the discussions I had with the author when it was being designed, as 
well as discussions with the xapi team when dealing with bugs at later 
points.

>> When a request comes in for a certain amount of memory, it will go
>> and set each VM's max_pages, and the max tmem pool size.  It can
>> then check whether there is enough free memory to complete the
>> allocation or not (since there's a race between checking how much
>> memory a guest is using and setting max_pages).  If that succeeds,
>> it can return "success".  If, while that VM is being built, another
>> request comes in, it can again go around and set the max sizes
>> lower.  It has to know how much of the memory is "reserved" for the
>> first guest being built, but if there's enough left after that, it
>> can return "success" and allow the second VM to start being built.
>>
>> After the VMs are built, the toolstack can remove the limits again
>> if it wants, again allowing the free flow of memory.
> This sounds to me like what Xapi does?

No, AFAIK xapi always sets the max_pages to what it wants the guest to 
be using at any given time.  I talked about removing the limits (and 
about operating without limits in the normal case) because it seems like 
something that Oracle wants (having to do with tmem).
>> Do you see any problems with this scheme?  All it requires is for
>> the toolstack to be able to temporarliy set limits on both guests
>> ballooning up and on tmem allocating more than a certain amount of
>> memory.  We already have mechanisms for the first, so if we had a
>> "max_pages" for tmem, then you'd have all the tools you need to
>> implement it.
> Of the top of my hat the thing that come in my mind are:
>   - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
>     looks to solve launching guests in parallel.
>     It will allow us to launch multiple guests - but it will also
>     suppress the tmem asynchronous calls and require ballooning the
>     guests up/down. The claim hypercall does not do any of those and
>     gives a definite 'yes' or 'no'.

So when you say, "tmem freeze", are you specifically talking about not 
allowing tmem to allocate more memory (what I called a "max_pages" for 
tmem)?  Or is there more to it?

Secondly, just to clarify: when a guest is using memory from the tmem 
pool, is that added to tot_pages?

I'm not sure what "gives a definite yes or no" is supposed to mean -- 
the scheme I described also gives a definite yes or no.

In any case, your point about ballooning is taken: if we set max_pages 
for a VM and just leave it there while VMs are being built, then VMs 
cannot balloon up, even if there is "free" memory (i.e., memory that 
will not be used for the currently-building VM), and cannot be moved 
*between* VMs either (i.e., by ballooning down one and ballooning the 
other up).  Both of these could be done by extending the toolstack with a 
memory model (see below), but that adds an extra level of complication.

>   - Complex code that has to keep track of this in the user-space.
>     It also has to know of the extra 'reserved' space that is associated
>     with a guest. I am not entirely sure how that would couple with
>     PCI passthrough. The claim hypercall is fairly simple - albeit
>     extending it to do superpages and 32-bit guests could make this
>     longer.

What do you mean by the extra 'reserved' space?  And what potential 
issues are there with PCI passthrough?

To be accepted, the reservation hypercall will certainly have to be 
extended to do superpages and 32-bit guests, so that's the case we 
should be considering.

>   - I am not sure whether the toolstack can manage all the memory
>     allocation. It sounds like it could but I am just wondering if there
>     are some extra corners that we hadn't thought of.

Wouldn't the same argument apply to the reservation hypercall? Suppose 
that there was enough domain memory but not enough Xen heap memory, or 
enough of some other resource -- the hypercall might succeed, but then 
the domain build could still fail at some later point when the other resource 
allocation failed.

>   - Latency. With the locks being placed on the pools of memory the
>     existing workload can be negatively affected. Say that this means we
>     need to balloon down a couple hundred guests, then launch the new
>     guest. This process of 'lower all of them by X', lets check the
>     'free amount'. Oh nope - not enough - let's do this again. That would
>     delay the creation process.
>
>     The claim hypercall will avoid all of that by just declaring:
>     "This is how much you will get." without having to balloon the rest
>     of the guests.
>
>     Here is how I see what your toolstack would do:
>
>       [serial]
> 	1). Figure out how much memory we need for X guests.
> 	2). round-robin existing guests to decrease their memory
> 	    consumption (if they can be ballooned down). Or this
> 	    can be executed in parallel for the guests.
> 	3). check if the amount of free memory is at least X
> 	    [this check has to be done in serial]
>       [parallel]
> 	4). launch multiple guests at the same time.
>
>     The claim hypercall would avoid the '3' part b/c it is inherently
>     part of Xen's MM bureaucracy. It would allow:
>
>       [parallel]
> 	1). claim hypercall for X guest.
> 	2). if any of the claim's return 0 (so success), then launch guest
> 	3). if the errno was -ENOMEM then:
>       [serial]
>          3a). round-robin existing guests to decrease their memory
>               consumption if allowed. Goto 1).
>
>     So the 'error-case' only has to run in the slow-serial case.
Hmm, I don't think what you wrote about mine is quite right.  Here's 
what I had in mind for mine (let me call it "limit-and-check"):

[serial]
1). Set limits on all guests, and tmem, and see how much memory is left.
2) Read free memory
[parallel]
2a) Claim memory for each guest from freshly-calculated pool of free memory.
3) For each claim that can be satisfied, launch a guest
4) If there are guests that can't be satisfied with the current free 
memory, then:
[serial]
4a) round-robin existing guests to decrease their memory consumption if 
allowed. Goto 2.
5) Remove limits on guests.

Note that 1 would only be done for the first such "request", and 5 would 
only be done after all such requests have succeeded or failed.  Also 
note that steps 1 and 5 are only necessary if you want to go without 
such limits -- xapi doesn't do them, because it always keeps max_pages 
set to what it wants the guest to be using.

Also, note that the "claiming" (2a for mine above and 1 for yours) has 
to be serialized with other "claims" in both cases (in the reservation 
hypercall case, with a lock inside the hypervisor), but that the 
building can begin in parallel with the "claiming" in both cases.

But I think I do see what you're getting at.  The "free memory" 
measurement has to be taken when the system is in a "quiescent" state -- 
or at least a "grow only" state -- otherwise it's meaningless.  So #4a 
should really be:

4a) Round-robin existing guests to decrease their memory consumption if 
allowed.
4b) Wait for currently-building guests to finish building (if any), then 
go to #2.
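
(A minimal sketch of that loop follows; every helper is a hypothetical
stand-in for toolstack policy code rather than an existing libxl/libxc
call.  The point is only the ordering: limit, measure, claim from the
measured pool, and only balloon and re-measure once building quiesces.)

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers corresponding to steps 1-5 above. */
    void     limit_all_guests_and_tmem(void);       /* step 1 */
    uint64_t read_free_memory(void);                /* step 2 */
    bool     build_next_guest(uint64_t *pool);      /* steps 2a/3: deduct the
                                                       guest's size from *pool
                                                       and kick off its build */
    bool     guests_pending(void);
    void     balloon_guests_down(void);             /* step 4a */
    void     wait_for_builds_to_finish(void);       /* step 4b */
    void     unlimit_all_guests_and_tmem(void);     /* step 5 */

    static void limit_and_check(void)
    {
        limit_all_guests_and_tmem();
        while (guests_pending()) {
            uint64_t pool = read_free_memory();
            /* Builds started here proceed in parallel. */
            while (guests_pending() && build_next_guest(&pool))
                ;
            if (guests_pending()) {              /* pool ran dry */
                balloon_guests_down();
                wait_for_builds_to_finish();     /* quiesce, then re-measure */
            }
        }
        unlimit_all_guests_and_tmem();
    }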

So suppose the following cases, in which several requests for guest 
creation come in over a short period of time (not necessarily all at once):
A. There is enough memory for all requested VMs to be built without 
ballooning / something else
B. There is enough for some, but not all of the VMs to be built without 
ballooning / something else

In case A, then I think "limit-and-check" and "reservation hypercall" 
should perform the same.  For each new request that comes in, the 
toolstack can say, "Well, when I checked I had 64GiB free; then I 
started to build a 16GiB VM.  So I should have 48GiB left, enough to 
build this 32GiB VM."  "Well, when I checked I had 64GiB free; then I 
started to build a 16GiB VM and a 32GiB VM, so I should have 16GiB left, 
enough to be able to build this 16GiB VM."

The main difference comes in case B.  The "reservation hypercall" method 
will not have to wait until all existing guests have finished building 
to be able to start subsequent guests; but "limit-and-check" would have 
to wait until the currently-building guests are finished before doing 
another check.

This limitation doesn't apply to xapi, because it doesn't use the 
hypervisor's free memory as a measure of the memory it has available to 
it.  Instead, it keeps an internal model of the free memory the 
hypervisor has available.  This is based on MAX(current_target, 
tot_pages) of each guest (where "current_target" for a domain in the 
process of being built is the amount of memory it will have 
eventually).  We might call this the "model" approach.

We could extend "limit-and-check" to "limit-check-and-model" (i.e., 
estimate how much memory is really free after ballooning based on how 
much the guests' tot_pages actually drop), or "limit-model" (basically, fully switch 
to a xapi-style "model" approach while you're doing domain creation).  
That would be significantly more complicated.  On the other hand, a lot 
of the work has already been done by the XenServer team, and (I believe) 
the code in question is all GPL'ed, so Oracle could just take the 
algorithms and adapt them with just a bit of tweaking (and a bit of code 
translation).  It seems to me that the "model" approach brings a lot of 
other benefits as well.
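
(For reference, a toy version of that model; the names are illustrative,
not the actual xapi/squeezed code.)

    #include <stdint.h>

    struct dom_model {
        uint64_t current_target;  /* for a building domain: its eventual size */
        uint64_t tot_pages;       /* pages it actually holds right now */
    };

    /* Modelled free memory = host total minus MAX(current_target, tot_pages)
     * summed over all domains, instead of asking Xen for its free count. */
    static uint64_t modelled_free_pages(uint64_t host_pages,
                                        const struct dom_model *d, int n)
    {
        uint64_t reserved = 0;
        for (int i = 0; i < n; i++)
            reserved += d[i].current_target > d[i].tot_pages
                        ? d[i].current_target : d[i].tot_pages;
        return host_pages > reserved ? host_pages - reserved : 0;
    }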

But at any rate -- without debating the value or cost of the "model" 
approach, would you agree with my analysis and conclusions?  Namely:

a. "limit-and-check" and "reservation hypercall" are similar wrt guest 
creation when there is enough memory currently free to build all 
requested guests
b. "limit-and-check" may be slower if some guests can succeed in being 
built but others must wait for memory to be freed up, since the "check" 
has to wait for current guests to finish building
c. (From further back) One downside of a pure "limit-and-check" approach 
is that while VMs are being built, VMs cannot increase in size, even if 
there is "free" memory (not being used to build the currently-building 
domain(s)) or if another VM can be ballooned down.
d. "model"-based approaches can mitigate b and c, at the cost of a more 
complicated algorithm

>   - This still has the race issue - how much memory you see vs the
>     moment you launch it. Granted you can avoid it by having a "fudge"
>     factor (so when a guest says it wants 1G you know it actually
>     needs an extra 100MB on top of the 1GB or so). The claim hypercall
>     would count all of that for you so you don't have to race.
I'm sorry, what race / fudge factor are you talking about?

  -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 18:18                             ` Dan Magenheimer
@ 2013-01-14 19:42                               ` George Dunlap
  2013-01-14 23:14                                 ` Dan Magenheimer
  0 siblings, 1 reply; 53+ messages in thread
From: George Dunlap @ 2013-01-14 19:42 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org), Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On 14/01/13 18:18, Dan Magenheimer wrote:
>>>>> i.e. d->max_pages is fixed for the life of the domain and
>>>>> only d->tot_pages varies; i.e. no intelligence is required
>>>>> in the toolstack.  AFAIK, the distinction between current_maxmem
>>>>> and lifetime_maxmem was added for Citrix DMC support.
[snip]
> Yes, understood.  Ian please correct me if I am wrong, but I believe
> your proposal (at least as last stated) does indeed, in some cases,
> set d->max_pages less than or equal to d->tot_pages.  So AFAICT the
> change does very much have a bearing on the discussion here.

Strictly speaking, no, it doesn't have to do with what we're proposing.  
To implement "limit-and-check", you only need to set d->max_pages to 
d->tot_pages.  This capability has been possible for quite a while, and 
was not introduced to support Citrix's DMC.

> Exactly.  So, in your/Ian's model, you are artificially constraining a
> guest's memory growth, including any dynamic allocations*.  If, by bad luck,
> you do that at a moment when the guest was growing and is very much in
> need of that additional memory, the guest may now swapstorm or OOM, and
> the toolstack has seriously impacted a running guest.  Oracle considers
> this both unacceptable and unnecessary.

Yes, I realized the limitation to dynamic allocation from my discussion 
with Konrad.  This is a constraint, but it can be worked around.

Even so you rather overstate your case.  Even in the "reservation 
hypercall" model, if after the "reservation" there's not enough memory 
for the guest to grow, the same thing will happen.  If Oracle really 
considered this "unacceptable and unnecessary", then the toolstack 
should have a model of when this is likely to happen and keep memory 
around for such a contingency.

> So, I think it is very fair (not snide) to point out that a change was
> made to the hypervisor to accommodate your/Ian's memory-management model,
> a change that Oracle considers unnecessary, a change explicitly
> supporting your/Ian's model, which is a model that has not been
> implemented in open source and has no clear (let alone proven) policy
> to guide it.  Yet you wish to block a minor hypervisor change which
> is needed to accommodate Oracle's shipping memory-management model?

We've been over this a number of times, but let me say it again. Whether 
a change gets accepted has nothing to do with who suggested it, but 
whether the person suggesting it can convince the community that it's 
worthwhile.  Fujitsu-Siemens implemented cpupools, which is a fairly 
invasive patch, in order to support their own business models; while the 
XenClient team has had a lot of resistance to getting v4v upstreamed, 
even though their product depends on it.  My max_pages change was 
accepted (along with many others), but many others have also been 
rejected.  For example, my "domain runstates" patch was rejected, and is 
still being carried in the XenServer patchqueue several years later.

If you have been unable to convince the community that your patch is 
necessary, then either:
1. It's not necessary / not ready in its current state
2. You're not very good at being persuasive
3. We're too closed-minded / biased whatever to understand it

You clearly believe #3 -- you began by accusing us of being 
closed-minded (i.e., "stuck in a static world", &c), but have since 
changed to accusing us of being biased.  You have now made this 
accusation several times, in spite of being presented evidence to the 
contrary each time.  This evidence has included important Citrix patches 
that have been rejected, patches from other organizations that have been 
accepted, and also evidence that most of the people opposing your patch 
(including Jan, IanC, IanJ, Keir, Tim, and Andres) don't know anything 
about DMC and have no direct connection with XenServer.

For my part, I'm willing to believe #2, which is why I suggested that 
you ask someone else to take up the cause, and why I am glad that Konrad 
has joined the discussion.

  -George

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 19:42                               ` George Dunlap
@ 2013-01-14 23:14                                 ` Dan Magenheimer
  2013-01-23 12:18                                   ` Ian Campbell
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-14 23:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Keir (Xen.org), Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
>
> On 14/01/13 18:18, Dan Magenheimer wrote:
> >>>>> i.e. d->max_pages is fixed for the life of the domain and
> >>>>> only d->tot_pages varies; i.e. no intelligence is required
> >>>>> in the toolstack.  AFAIK, the distinction between current_maxmem
> >>>>> and lifetime_maxmem was added for Citrix DMC support.
> [snip]
> > Yes, understood.  Ian please correct me if I am wrong, but I believe
> > your proposal (at least as last stated) does indeed, in some cases,
> > set d->max_pages less than or equal to d->tot_pages.  So AFAICT the
> > change does very much have a bearing on the discussion here.
> 
> Strictly speaking, no, it doesn't have to do with what we're proposing.
> To implement "limit-and-check", you only need to set d->max_pages to
> d->tot_pages.  This capability has been possible for quite a while, and
> was not introduced to support Citrix's DMC.
> 
> > Exactly.  So, in your/Ian's model, you are artificially constraining a
> > guest's memory growth, including any dynamic allocations*.  If, by bad luck,
> > you do that at a moment when the guest was growing and is very much in
> > need of that additional memory, the guest may now swapstorm or OOM, and
> > the toolstack has seriously impacted a running guest.  Oracle considers
> > this both unacceptable and unnecessary.
> 
> Yes, I realized the limitation to dynamic allocation from my discussion
> with Konrad.  This is a constraint, but it can be worked around.

Please say more about how you think it can be worked around.

> Even so you rather overstate your case.  Even in the "reservation
> hypercall" model, if after the "reservation" there's not enough memory
> for the guest to grow, the same thing will happen.  If Oracle really
> considered this "unacceptable and unnecessary", then the toolstack
> should have a model of when this is likely to happen and keep memory
> around for such a contingency.

Hmmm... I think you are still missing the point of how
Oracle's dynamic allocations work, as evidenced by the
fact that "Keeping memory around for such a contingency"
makes no sense at all in the Oracle model.  And the
"not enough memory for the guest to grow" only occurs in
the Oracle model when physical memory is completely exhausted
across all running domains in the system (i.e. max-of-sums
not sum-of-maxes), which is a very different constraint.
 
> > So, I think it is very fair (not snide) to point out that a change was
> > made to the hypervisor to accommodate your/Ian's memory-management model,
> > a change that Oracle considers unnecessary, a change explicitly
> > supporting your/Ian's model, which is a model that has not been
> > implemented in open source and has no clear (let alone proven) policy
> > to guide it.  Yet you wish to block a minor hypervisor change which
> > is needed to accommodate Oracle's shipping memory-management model?
> 
> We've been over this a number of times, but let me say it again. Whether
> a change gets accepted has nothing to do with who suggested it, but
> whether the person suggesting it can convince the community that it's
> worthwhile.  Fujitsu-Siemens implemented cpupools, which is a fairly
> invasive patch, in order to support their own business models; while the
> XenClient team has had a lot of resistance to getting v4v upstreamed,
> even though their product depends on it.  My max_pages change was
> accepted (along with many others), but many others have also been
> rejected.  For example, my "domain runstates" patch was rejected, and is
> still being carried in the XenServer patchqueue several years later.
> 
> If you have been unable to convince the community that your patch is
> necessary, then either:
> 1. It's not necessary / not ready in its current state
> 2. You're not very good at being persuasive
> 3. We're too closed-minded / biased whatever to understand it
> 
> You clearly believe #3 -- you began by accusing us of being
> closed-minded (i.e., "stuck in a static world", &c), but have since
> changed to accusing us of being biased.  You have now made this
> accusation several times, in spite of being presented evidence to the
> contrary each time.  This evidence has included important Citrix patches
> that have been rejected, patches from other organizations that have been
> accepted, and also evidence that most of the people opposing your patch
> (including Jan, IanC, IanJ, Keir, Tim, and Andres) don't know anything
> about DMC and have no direct connection with XenServer.

For the public record, I _partially_ believe #3.  I would restate it
as: You (and others with the same point-of-view) have a very fixed
idea of how memory-management should work in the Xen stack.  This
idea is not really implemented, AFAICT you haven't thought through
the policy issues, and you haven't yet realized the challenges
I believe it will present in the context of Oracle's dynamic model
(since AFAIK you have not understood tmem and selfballooning though
it is all open source upstream in Xen and Linux).

I fully believe if you fully understood those challenges and the
shipping implementation of Oracle's dynamic model, your position
would be different.  So this has been a long long education process
for all of us.

"Closed-minded" and "biased" are very subjective terms and have
negative connotations, so I will let others interpret my statements
above and will plead guilty only if the court of public opinion
deems I "clearly believe #3".
 
> For my part, I'm willing to believe #2, which is why I suggested that
> you ask someone else to take up the cause, and why I am glad that Konrad
> has joined the discussion.

I'm glad too. :-)

Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-10 21:43           ` Dan Magenheimer
@ 2013-01-17 15:12             ` Tim Deegan
  2013-01-17 15:26               ` Andres Lagar-Cavilla
  2013-01-22 19:22               ` Dan Magenheimer
  0 siblings, 2 replies; 53+ messages in thread
From: Tim Deegan @ 2013-01-17 15:12 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

Hi,

At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > Not quite.  I think there are other viable options, and I don't
> > particularly like the reservation hypercall.
> 
> Are you suggesting an alternative option other than the max_pages
> toolstack-based proposal that Ian and I are discussing in a parallel
> subthread?

Yes, I suggested three just below in that email.

> Are there reasons other than "incompleteness" (see below) that
> you dislike the reservation hypercall?

Yes.  Mostly it strikes me as treating a symptom.  That is, it solves
the specific problem of delayed build failure rather than looking at the
properties of the system that caused it. 

If I were given a self-ballooning system and asked to support it, I'd be
looking at other things first, and probably solving the delayed failure
of VM creation as a side-effect.  For example:
 - the lack of policy.  If we assume all VMs have the same admin,
   so we can ignore malicious attackers, a buggy guest or guests
   can still starve out well-behaved ones.  And because it implicitly
   relies on all OSes having an equivalent measure of how much they
   'need' memory, on a host with a mix of guest OSes, the aggressive
   ones will starve the others.
 - the lack of fairness: when a storm of activity hits an idle system,
   whichever VMs get busy first will get all the memory.
 - allocating _all_ memory with no slack makes the system more vulnerable
   to any bugs in the rest of xen where allocation failure isn't handled
   cleanly.  There shouldn't be any, but I bet there are. 
 - there's no way of forcing a new VM into a 'full' system; the admin must
   wait and hope for the existing VMs to shrink.  (If there were such
   a system, it would solve the delayed-failure problem because you'd
   just use it to enforce the 

Now, of course, I don't want to dictate what you do in your own system,
and in any case I haven't time to get involved in a long discussion
about it.  And as I've said this reservation hypercall seems harmless
enough.

> > That could be worked around with an upcall to a toolstack
> > agent that reshuffles things on a coarse granularity based on need.  I
> > agree that's slower than having the hypervisor make the decisions but
> > I'm not convinced it'd be unmanageable.
> 
> "Based on need" begs a number of questions, starting with how
> "need" is defined and how conflicting needs are resolved.
> Tmem balances need as a self-adapting system. For your upcalls,
> you'd have to convince me that, even if "need" could be communicated
> to an guest-external entity (i.e. a toolstack), that the entity
> would/could have any data to inform a policy to intelligently resolve
> conflicts. 

It can easily have all the information that Xen has -- that is, some VMs
are asking for more memory.  It can even make the same decision about
what to do that Xen might, though I think it can probably do better.

> I also don't see how it could be done without either
> significant hypervisor or guest-kernel changes.

The only hypervisor change would be a ring (or even an eventchn) to
notify the tools when a guest's XENMEM_populate_physmap fails.

> > Or, how about actually moving towards a memory scheduler like you
> > suggested -- for example by integrating memory allocation more tightly
> > with tmem.  There could be an xsm-style hook in the allocator for
> > tmem-enabled domains.  That way tmem would have complete control over
> > all memory allocations for the guests under its control, and it could
> > implement a shared upper limit.  Potentially in future the tmem
> > interface could be extended to allow it to force guests to give back
> > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > two VMs are busy, why should the one that spiked first get to keep all
> > the RAM?) or other nice scheduler-like properties.
> 
> Tmem (plus selfballooning), unchanged, already does some of this.
> While I would be interested in discussing better solutions, the
> now four-year odyssey of pushing what I thought were relatively
> simple changes upstream into Linux has left a rather sour taste
> in my mouth, so rather than consider any solution that requires
> more guest kernel changes [...]

I don't mean that you'd have to do all of that now, but if you were
considering moving in that direction, an easy first step would be to add
a hook allowing tmem to veto allocations for VMs under its control.
That would let tmem have proper control over its client VMs (so it can
solve the delayed-failure race for you), while at the same time being a
constructive step towards a more complete memory scheduler.
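
(Very roughly, and only to illustrate the shape of such a hook -- the
names and placement below are invented, not existing Xen code: the idea
is that the heap allocator would ask tmem before satisfying an allocation
for a tmem-managed domain.)

    #include <stdbool.h>
    #include <stddef.h>

    /* Invented names, for illustration only. */
    struct domain;                        /* stand-in for Xen's domain struct */
    bool tmem_client_over_budget(struct domain *d, unsigned int order);

    /* Somewhere on the allocation path for domain-owned pages: */
    static bool allocation_permitted(struct domain *d, unsigned int order)
    {
        /* tmem gets to veto allocations for the VMs under its control,
         * enforcing a shared upper limit across those clients. */
        if (d != NULL && tmem_client_over_budget(d, order))
            return false;
        return true;
    }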

> > Or, you could consider booting the new guest pre-ballooned so it doesn't
> > have to allocate all that memory in the build phase.  It would boot much
> > quicker (solving the delayed-failure problem), and join the scramble for
> > resources on an equal footing with its peers.
> 
> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
> guests already boot pre-ballooned, in that, from the vm.cfg file,
> "mem=" is allocated, not "maxmem=".

Absolutely.

> Tmem, with self-ballooning, launches the guest with "mem=", and
> then the guest kernel "self adapts" to (dramatically) reduce its usage
> soon after boot.  It can be fun to "watch(1)", meaning using the
> Linux "watch -d 'head -1 /proc/meminfo'" command.

If it were to launch the same guest with mem= a much smaller number and
then let it selfballoon _up_ to its chosen amount, vm-building failures
due to allocation races could be (a) much rarer and (b) much faster.  

> > > > My own position remains that I can live with the reservation hypercall,
> > > > as long as it's properly done - including handling PV 32-bit and PV
> > > > superpage guests.
> > >
> > > Tim, would you at least agree that "properly" is a red herring?
> > 
> > I'm not quite sure what you mean by that.  To the extent that this isn't
> > a criticism of the high-level reservation design, maybe.  But I stand by
> > it as a criticism of the current implementation.
> 
> Sorry, I was just picking on word usage.  IMHO, the hypercall
> does work "properly" for the classes of domains it was designed
> to work on (which I'd estimate in the range of 98% of domains
> these days).

But it's deliberately incorrect for PV-superpage guests, which are a
feature developed and maintained by Oracle.  I assume you'll want to
make them work with your own toolstack -- why would you not?

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-11 19:08           ` Konrad Rzeszutek Wilk
  2013-01-14 16:00             ` George Dunlap
@ 2013-01-17 15:16             ` Tim Deegan
  2013-01-18 21:45               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2013-01-17 15:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

At 14:08 -0500 on 11 Jan (1357913294), Konrad Rzeszutek Wilk wrote:
> But the solution to the hypercall failing are multiple - one is to 
> try to "squeeze" all the guests to make space

AFAICT if the toolstack can squeeze guests up to make room then the
reservation hypercall isn't necessary -- just use the squeezing
mechanism to make sure that running VMs don't use up the memory you want
for building new ones.

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-17 15:12             ` Tim Deegan
@ 2013-01-17 15:26               ` Andres Lagar-Cavilla
  2013-01-22 19:22               ` Dan Magenheimer
  1 sibling, 0 replies; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-17 15:26 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

On Jan 17, 2013, at 10:12 AM, Tim Deegan <tim@xen.org> wrote:

> Hi,
> 
> At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
>>> From: Tim Deegan [mailto:tim@xen.org]
>>> Not quite.  I think there are other viable options, and I don't
>>> particularly like the reservation hypercall.
>> 
>> Are you suggesting an alternative option other than the max_pages
>> toolstack-based proposal that Ian and I are discussing in a parallel
>> subthread?
> 
> Yes, I suggested three just below in that email.
> 
>> Are there reasons other than "incompleteness" (see below) that
>> you dislike the reservation hypercall?
> 
> Yes.  Mostly it strikes me as treating a symptom.  That is, it solves
> the specific problem of delayed build failure rather than looking at the
> properties of the system that caused it. 
> 
> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect.  For example:
> - the lack of policy.  If we assume all VMs have the same admin,
>   so we can ignore malicious attackers, a buggy guest or guests
>   can still starve out well-behaved ones.  And because it implicitly
>   relies on all OSes having an equivalent measure of how much they
>   'need' memory, on a host with a mix of guest OSes, the aggressive
>   ones will starve the others.
> - the lack of fairness: when a storm of activity hits an idle system,
>   whichever VMs get busy first will get all the memory.
> - allocating _all_ memory with no slack makes the system more vulnerable
>   to any bugs in the rest of xen where allocation failure isn't handled
>   cleanly.  There shouldn't be any, but I bet there are. 
> - there's no way of forcing a new VM into a 'full' system; the admin must
>   wait and hope for the existing VMs to shrink.  (If there were such
>   a system, it would solve the delayed-failure problem because you'd
>   just use it to enforce the 
> 
> Now, of course, I don't want to dictate what you do in your own system,
> and in any case I haven't time to get involved in a long discussion
> about it.  And as I've said this reservation hypercall seems harmless
> enough.
> 
>>> That could be worked around with an upcall to a toolstack
>>> agent that reshuffles things on a coarse granularity based on need.  I
>>> agree that's slower than having the hypervisor make the decisions but
>>> I'm not convinced it'd be unmanageable.
>> 
>> "Based on need" begs a number of questions, starting with how
>> "need" is defined and how conflicting needs are resolved.
>> Tmem balances need as a self-adapting system. For your upcalls,
>> you'd have to convince me that, even if "need" could be communicated
>> to an guest-external entity (i.e. a toolstack), that the entity
>> would/could have any data to inform a policy to intelligently resolve
>> conflicts. 
> 
> It can easily have all the information that Xen has -- that is, some VMs
> are asking for more memory.  It can even make the same decision about
> what to do that Xen might, though I think it can probably do better.
> 
>> I also don't see how it could be done without either
>> significant hypervisor or guest-kernel changes.
> 
> The only hypervisor change would be a ring (or even an eventchn) to
> notify the tools when a guest's XENMEM_populate_physmap fails.

We already have a notification ring for ENOMEM on unshare. It's named "sharing" ring, but frankly it's more like an "enomem" ring. It can be easily generalized. I hope…
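
(Purely for illustration, a generalized notification ring might look something
like the following; the struct and function names are invented, not the actual
sharing-ring interface.)

    #include <stdint.h>

    /* Invented layout -- not the real "sharing" ring. */
    struct enomem_event {
        uint32_t domain_id;      /* domain whose populate_physmap failed */
        uint32_t extent_order;   /* order of the failed allocation */
    };

    #define ENOMEM_RING_SIZE 64
    struct enomem_ring {
        uint32_t prod, cons;                       /* producer / consumer indices */
        struct enomem_event ev[ENOMEM_RING_SIZE];
    };

    /* Hypervisor side: queue an event for the toolstack agent to act on
     * (rebalance memory, retry elsewhere, ...).  Real code would add a
     * write barrier and kick an event channel after updating prod. */
    static int post_enomem_event(struct enomem_ring *r, uint32_t domid,
                                 uint32_t order)
    {
        if (r->prod - r->cons == ENOMEM_RING_SIZE)
            return -1;                             /* ring full */
        r->ev[r->prod % ENOMEM_RING_SIZE] =
            (struct enomem_event){ domid, order };
        r->prod++;
        return 0;
    }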

Andres

> 
>>> Or, how about actually moving towards a memory scheduler like you
>>> suggested -- for example by integrating memory allocation more tightly
>>> with tmem.  There could be an xsm-style hook in the allocator for
>>> tmem-enabled domains.  That way tmem would have complete control over
>>> all memory allocations for the guests under its control, and it could
>>> implement a shared upper limit.  Potentially in future the tmem
>>> interface could be extended to allow it to force guests to give back
>>> more kinds of memory, so that it could try to enforce fairness (e.g. if
>>> two VMs are busy, why should the one that spiked first get to keep all
>>> the RAM?) or other nice scheduler-like properties.
>> 
>> Tmem (plus selfballooning), unchanged, already does some of this.
>> While I would be interested in discussing better solutions, the
>> now four-year odyssey of pushing what I thought were relatively
>> simple changes upstream into Linux has left a rather sour taste
>> in my mouth, so rather than consider any solution that requires
>> more guest kernel changes [...]
> 
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.
> 
>>> Or, you could consider booting the new guest pre-ballooned so it doesn't
>>> have to allocate all that memory in the build phase.  It would boot much
>>> quicker (solving the delayed-failure problem), and join the scramble for
>>> resources on an equal footing with its peers.
>> 
>> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
>> guests already boot pre-ballooned, in that, from the vm.cfg file,
>> "mem=" is allocated, not "maxmem=".
> 
> Absolutely.
> 
>> Tmem, with self-ballooning, launches the guest with "mem=", and
>> then the guest kernel "self adapts" to (dramatically) reduce its usage
>> soon after boot.  It can be fun to "watch(1)", meaning using the
>> Linux "watch -d 'head -1 /proc/meminfo'" command.
> 
> If it were to launch the same guest with mem= a much smaller number and
> then let it selfballoon _up_ to its chosen amount, vm-building failures
> due to allocation races could be (a) much rarer and (b) much faster.  
> 
>>>>> My own position remains that I can live with the reservation hypercall,
>>>>> as long as it's properly done - including handling PV 32-bit and PV
>>>>> superpage guests.
>>>> 
>>>> Tim, would you at least agree that "properly" is a red herring?
>>> 
>>> I'm not quite sure what you mean by that.  To the extent that this isn't
>>> a criticism of the high-level reservation design, maybe.  But I stand by
>>> it as a criticism of the current implementation.
>> 
>> Sorry, I was just picking on word usage.  IMHO, the hypercall
>> does work "properly" for the classes of domains it was designed
>> to work on (which I'd estimate in the range of 98% of domains
>> these days).
> 
> But it's deliberately incorrect for PV-superpage guests, which are a
> feature developed and maintained by Oracle.  I assume you'll want to
> make them work with your own toolstack -- why would you not?
> 
> Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-17 15:16             ` Tim Deegan
@ 2013-01-18 21:45               ` Konrad Rzeszutek Wilk
  2013-01-21 10:29                 ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-18 21:45 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

On Thu, Jan 17, 2013 at 03:16:31PM +0000, Tim Deegan wrote:
> At 14:08 -0500 on 11 Jan (1357913294), Konrad Rzeszutek Wilk wrote:
> > But the solutions to the hypercall failing are multiple - one is to 
> > try to "squeeze" all the guests to make space
> 
> AFAICT if the toolstack can squeeze guests up to make room then the
> reservation hypercall isn't necessary -- just use the squeezing
> mechanism to make sure that running VMs don't use up the memory you want
> for building new ones.

We might want to not do that until we have run out of options (this would
be a toolstack option to select the right choice). The other option is
to just launch the guest on another node.

The reasoning for not wanting to squeeze the guests is that it might cause
a guest to fall into the OOM camp.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-18 21:45               ` Konrad Rzeszutek Wilk
@ 2013-01-21 10:29                 ` Tim Deegan
  2013-02-12 15:54                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2013-01-21 10:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

At 16:45 -0500 on 18 Jan (1358527542), Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 17, 2013 at 03:16:31PM +0000, Tim Deegan wrote:
> > At 14:08 -0500 on 11 Jan (1357913294), Konrad Rzeszutek Wilk wrote:
> > > But the solutions to the hypercall failing are multiple - one is to 
> > > try to "squeeze" all the guests to make space
> > 
> > AFAICT if the toolstack can squeeze guests up to make room then the
> > reservation hypercall isn't necessary -- just use the squeezing
> > mechanism to make sure that running VMs don't use up the memory you want
> > for building new ones.
> 
> We might want to not do that until we have run out of options (this would
> be a toolstack option to select the right choice). The other option is
> to just launch the guest on another node.

Sure, I see that; but what I meant was: the reservation hypercall only
makes any kind of sense if the toolstack can't squeeze the existing guests. 

If it can squeeze VMs, as part of that it must have some mechanism to
stop them from immediately re-allocating all the memory as it frees it.
So in the case where enough memory is already free, you just use that
mechanism to protect it while you build the new VM.

Or (since I get the impression that losing this allocation race is a
rare event) you can take the optimistic route: after you've checked that
enough memory is free, just start building the VM.  If you run out of
memory part-way through, you can squeeze the other VMs back out so you can
finish the job.
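
(Sketched with invented helper names, the optimistic route amounts to
something like this; it is only meant to make the flow concrete.)

    #include <stdbool.h>

    /* All helpers below are invented stand-ins for toolstack operations. */
    static unsigned long free_pages(void)                   { return 0; }
    static bool build_domain(unsigned long pages)           { (void)pages; return true; }
    static void squeeze_running_guests(unsigned long need)  { (void)need; }

    /* Optimistic build: check, start building, and only squeeze the other
     * guests if the build actually runs out of memory part-way through. */
    static bool build_optimistically(unsigned long pages)
    {
        unsigned long avail = free_pages();

        if (avail < pages)
            squeeze_running_guests(pages - avail);  /* make room up front */

        if (build_domain(pages))
            return true;                            /* common case: we won the race */

        squeeze_running_guests(pages);              /* lost the race: squeeze and retry */
        return build_domain(pages);
    }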

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-17 15:12             ` Tim Deegan
  2013-01-17 15:26               ` Andres Lagar-Cavilla
@ 2013-01-22 19:22               ` Dan Magenheimer
  2013-01-23 12:18                 ` Ian Campbell
  1 sibling, 1 reply; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-22 19:22 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> Hi,

Hi Tim --

It's probably worth correcting a few of your points below,
even if only for the xen-devel archives and posterity...
 
> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect.

Agreed.  These other things were looked at in 2009 when tmem
was added into Xen and prototyped for Linux.  And the delayed
failure was solved poorly with a hack in 2010 and is being
looked at again in 2012/2013 with the intent of solving it correctly.

> For example:
>  - the lack of policy.  If we assume all VMs have the same admin,
>    so we can ignore malicious attackers, a buggy guest or guests
>    can still starve out well-behaved ones.  And because it implicitly
>    relies on all OSes having an equivalent measure of how much they
>    'need' memory, on a host with a mix of guest OSes, the aggressive
>    ones will starve the others.

With tmem, a malicious attacker can never get more memory than
the original maxmem assigned by the host administrator when the
guest is launched.  This is also true of any non-tmem guests
running (e.g. proprietary Windows).

And the architecture of tmem takes into account the difference
between memory a guest "needs" vs memory it "wants".  Though this
is a basic OS concept that exists in some form in all OS's,
AFAIK it has never been exposed outside of the OS (e.g. to
a hypervisor) because, in a physical system, RAM is RAM and
the only limit is the total amount of physical RAM in the system.
Tmem changes in the guest kernel expose the needs/wants information
and tmem in the hypervisor defines very simple carrots and
sticks to keep guests in line by offering, under well-defined
constraints, to keep and manage certain pages of data for the guest.

While it is true of any resource sharing mechanism (including CPU
and I/O scheduling under Xen) that the "must" demand for the resource
may exceed the total available resource, just as with CPU
scheduling, resource demand can be controlled by a few simple policy
variables that default to reasonable values and are enforced,
as necessary, in the hypervisor.  Just as with CPU schedulers
and I/O schedulers, different workloads may over time expose
weaknesses, but that doesn't mean we throw away our CPU and
I/O schedulers and partition those resources instead.  Nor should
we do so with RAM.

All this has been implemented in Xen for years and the Linux-side
is now shipping.  I would very much welcome input and improvements.
But it is very frustrating when people say, on the one hand,
that "it can't be done" or "it won't work" or "it's too hard",
while on the other hand those same people are saying "I don't
have time to understand tmem".

> For example:
>  - the lack of fairness: when a storm of activity hits an idle system,
>    whichever VMs get busy first will get all the memory.

True, but only up to the policy limits built into tmem (i.e
not "all").  Also true of CPU scheduling up to the policy
limits built into the CPU scheduler.

(BTW, tmem optionally supports caps and weights too.)

> For example:
>  - allocating _all_ memory with no slack makes the system more vulnerable
>    to any bugs in the rest of xen where allocation failure isn't handled
>    cleanly.  There shouldn't be any, but I bet there are.

Once tmem has been running for a while, it works in an eternal
state of "no slack".  IIRC there was a bug or two worked through
years ago.  The real issue has always been fragmentation and
non-resilience of failed allocation of higher-order pages, but
Jan (as of 4.1?) has removed all of those issues from Xen.

So tmem is using ALL the memory in the system.  Keir (and Jan) wrote
a very solid memory manager and it works very well even under stress.

> For example:
>  - there's no way of forcing a new VM into a 'full' system; the admin must
>    wait and hope for the existing VMs to shrink.  (If there were such
>    a system, it would solve the delayed-failure problem because you'd
>    just use it to enforce the

Not true at all.  With tmem, the "want" pages of all the guests (plus
any "fallow" pages that might be truly free at the moment for various
reasons) is the source of pages for adding a new VM.  By definition,
the hypervisor can "free" any or all of these pages when the toolstack
tells the hypervisor to allocate memory for a new guest. No waiting
necessary.  That's how the claim_pages hypercall works so cleanly
and quickly.

(And, sorry to sound like a broken record, but I think it's worth
emphasizing and re-emphasizing, this is not a blue sky proposal.
All of this code is already working in the Xen hypervisor today.)
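
(To make that concrete: stripped of Xen detail, the claim is a check-and-record
done under the allocator's own lock.  A toy userspace sketch with invented
names, not the actual patch:)

    #include <errno.h>
    #include <pthread.h>

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;  /* stands in for Xen's heap lock */
    static unsigned long free_heap_pages;      /* truly free pages */
    static unsigned long freeable_tmem_pages;  /* ephemeral tmem pages that can be reclaimed */
    static unsigned long outstanding_claims;   /* pages promised to in-progress builds */

    int claim_pages(unsigned long request)
    {
        int rc = -ENOMEM;

        pthread_mutex_lock(&heap_lock);
        if (free_heap_pages + freeable_tmem_pages >=
            outstanding_claims + request) {
            outstanding_claims += request;     /* later allocations for the new domain
                                                  draw this down; everyone else sees
                                                  the reduced headroom immediately */
            rc = 0;
        }
        pthread_mutex_unlock(&heap_lock);
        return rc;
    }

Because the check and the record happen under the same lock the allocator
itself takes, there is no window in which another allocation can consume the
memory between the "yes" answer and the build.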

> > > Or, how about actually moving towards a memory scheduler like you
> > > suggested -- for example by integrating memory allocation more tightly
> > > with tmem.  There could be an xsm-style hook in the allocator for
> > > tmem-enabled domains.  That way tmem would have complete control over
> > > all memory allocations for the guests under its control, and it could
> > > implement a shared upper limit.  Potentially in future the tmem
> > > interface could be extended to allow it to force guests to give back
> > > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > > two VMs are busy, why should the one that spiked first get to keep all
> > > the RAM?) or other nice scheduler-like properties.
> >
> > Tmem (plus selfballooning), unchanged, already does some of this.
> > While I would be interested in discussing better solutions, the
> > now four-year odyssey of pushing what I thought were relatively
> > simple changes upstream into Linux has left a rather sour taste
> > in my mouth, so rather than consider any solution that requires
> > more guest kernel changes [...]
> 
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.

While you are using different words, you are describing what
tmem does today.  Tmem does have control and uses the existing
hypervisor mechanisms and the existing hypervisor lock for memory
allocation.  That's why it's so clean to solve the "delayed-failure
race" using the same lock.

Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 18:28         ` George Dunlap
@ 2013-01-22 21:57           ` Konrad Rzeszutek Wilk
  2013-01-23 18:36             ` Dave Scott
  0 siblings, 1 reply; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-22 21:57 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

Hey George,

Sorry for taking so long to answer.

On Mon, Jan 14, 2013 at 06:28:48PM +0000, George Dunlap wrote:
> On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
> >Thanks for the clarification. I am not that fluent in the OCaml code.
> 
> I'm not fluent in OCaml either, I'm mainly going from memory based
> on the discussions I had with the author when it was being designed,
> as well as discussions with the xapi team when dealing with bugs at
> later points.

I was looking at xen-api/ocaml/xenops/squeeze.ml and just reading the
comments and feebly trying to understand how the OCaml code works.
Best I could understand, it does various measurements, makes the appropriate
hypercalls and waits for everything to stabilize before allowing the
guest to start.

N.B: With tmem, the 'stabilization' might never happen.
> 
> >>When a request comes in for a certain amount of memory, it will go
> >>and set each VM's max_pages, and the max tmem pool size.  It can
> >>then check whether there is enough free memory to complete the
> >>allocation or not (since there's a race between checking how much
> >>memory a guest is using and setting max_pages).  If that succeeds,
> >>it can return "success".  If, while that VM is being built, another
> >>request comes in, it can again go around and set the max sizes
> >>lower.  It has to know how much of the memory is "reserved" for the
> >>first guest being built, but if there's enough left after that, it
> >>can return "success" and allow the second VM to start being built.
> >>
> >>After the VMs are built, the toolstack can remove the limits again
> >>if it wants, again allowing the free flow of memory.
> >This sounds to me like what Xapi does?
> 
> No, AFAIK xapi always sets the max_pages to what it wants the guest
> to be using at any given time.  I talked about removing the limits
> (and about operating without limits in the normal case) because it
> seems like something that Oracle wants (having to do with tmem).

We still (and we do want them as much as possible) have the limits
in the hypervisor. The guest can't go above max_pages, which is absolutely
fine. We don't want guests going above max_pages. Conversely, we also
do not want to reduce max_pages. It is risky to do so.

> >>Do you see any problems with this scheme?  All it requires is for
> >>the toolstack to be able to temporarliy set limits on both guests
> >>ballooning up and on tmem allocating more than a certain amount of
> >>memory.  We already have mechanisms for the first, so if we had a
> >>"max_pages" for tmem, then you'd have all the tools you need to
> >>implement it.
> >Of the top of my hat the thing that come in my mind are:
> >  - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
> >    looks to solve the launching in parallel of guests.
> >    It will allow us to launch multiple guests - but it will also
> >    suppressing the tmem asynchronous calls and having to balloon up/down
> >    the guests. The claim hypercall does not do any of those and
> >    gives a definite 'yes' or 'no'.
> 
> So when you say, "tmem freeze", are you specifically talking about
> not allowing tmem to allocate more memory (what I called a
> "max_pages" for tmem)?  Or is there more to it?

I think I am going to confuse you here a bit. 
> 
> Secondly, just to clarify: when a guest is using memory from the
> tmem pool, is that added to tot_pages?

Yes and no. It depends on what type of tmem page it is (there are
only two). Pages that must persist (such as swap pages) are accounted
in the d->tot_pages. Pages that are cache type are not accounted in the
tot_pages. These are called ephemeral or temporary pages. Note that
they are utilizing the balloon system - so the content of them
could be thrown out, but the pages themselves might need to be
put back in the guest (and increase the d->tot_pages).

The tmem_freeze is basically putting a plug on the current activity
of a guest trying to put more pages into the ephemeral pool and
into the pool of pages that is accounted for using d->tot_pages.
It has a similar bad effect to setting d->max_pages == d->tot_pages.
The hypercall would replace this bandaid.

As said, there are two types. The temporary pages are subtracted from
d->tot_pages and end up in the heap memory (which, if there is memory
pressure in the hypervisor, it can happily usurp). In essence the "pages"
move from domain accounting to this "pool". If the guest needs them
back, the pool size decreases and d->tot_pages increases.

N.B: The pool can be usurped by the Xen hypervisor - so the pages
are not locked in and can be re-used for launch of a new guest.

The persistent ones do not end up in that pool. Rather they are
accounted for in the d->tot_pages.

The amount of memory that is "flowing" for a guest remains
constant - it is just that it can be in a pool or in the d->tot_pages.
(I am ignoring the de-duplication or compression that tmem can do)

The idea behind the claim call is that we do not want to put pressure
on this "flow" as the guest might suddenly need that memory back - as
much as it can.  Putting pressure means altering the d->max_pages.
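
(A toy model of that flow, with invented names and ignoring deduplication and
compression, just to make the accounting concrete:)

    #include <stdbool.h>

    /* Toy model only; field and function names are invented. */
    struct dom {
        unsigned long tot_pages;   /* pages currently accounted to the domain */
        unsigned long max_pages;   /* ceiling fixed when the guest was created */
    };

    static unsigned long ephemeral_pool;   /* host-wide pool of reclaimable cache pages */

    /* Ephemeral (cache-type) put: the page leaves the domain's accounting
     * and joins the pool, which Xen may usurp at any time for new guests. */
    static void put_ephemeral(struct dom *d)
    {
        d->tot_pages--;
        ephemeral_pool++;
    }

    /* The guest wants a page back: the flow reverses, bounded by max_pages.
     * Persistent (swap-type) pages, by contrast, stay in tot_pages throughout. */
    static bool get_page_back(struct dom *d)
    {
        if (d->tot_pages >= d->max_pages)
            return false;          /* the boot-time ceiling still applies */
        if (ephemeral_pool > 0)
            ephemeral_pool--;      /* the cached copy may already have been dropped */
        d->tot_pages++;
        return true;
    }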

> 
> I'm not sure what "gives a definite yes or no" is supposed to mean
> -- the scheme I described also gives a definite yes or no.
> 
> In any case, your point about ballooning is taken: if we set
> max_pages for a VM and just leave it there while VMs are being
> built, then VMs cannot balloon up, even if there is "free" memory
> (i.e., memory that will not be used for the currently-building VM),
> and cannot be moved *bewteen* VMs either (i.e., by ballooning down
> one and ballooning the other up).  Both of these be done by
> extending the toolstack with a memory model (see below), but that
> adds an extra level of complication.
> 
> >  - Complex code that has to keep track of this in the user-space.
> >    It also has to know of the extra 'reserved' space that is associated
> >    with a guest. I am not entirely sure how that would couple with
> >    PCI passthrough. The claim hypercall is fairly simple - albeit
> >    having it extended to do Super pages and 32-bit guests could make this
> >    longer.
> 
> What do you mean by the extra 'reserved' space?  And what potential
> issues are there with PCI passthrough?

I was thinking about space for VIRQs, VCPUs, IOMMU entries to cover a PCI
device's permissions, and grant tables. I think the IOMMU entries consume
the most bulk - but maybe all of this is under 1MB.

> 
> To be accepted, the reservation hypercall will certainly have to be
> extended to do superpages and 32-bit guests, so that's the case we
> should be considering.

OK. That sounds to me like you are OK with the idea - you would like
to make the claim hypercall take into account the lesser-used
cases. The reason Dan stopped looking at expanding it is b/c it seemed
that folks would like to understand the usage scenarios in depth - and
that has taken a bit of time to explain.

I believe the corner cases in the claim hypercall are mostly tied in with
PV (specifically the super-pages and 32-bit guests with more than a
certain amount of memory). 

> 
> >  - I am not sure whether the toolstack can manage all the memory
> >    allocation. It sounds like it could but I am just wondering if there
> >    are some extra corners that we hadn't thought off.
> 
> Wouldn't the same argument apply to the reservation hypercall?
> Suppose that there was enough domain memory but not enough Xen heap
> memory, or enough of some other resource -- the hypercall might
> succeed, but then the domain build still fail at some later point
> when the other resource allocation failed.

This is referring to the 1MB that I mentioned above.

Anyhow, if the hypercall fails and the domain build fails then we are
back at the toolstack making a choice whether it wants to allocate the
guest on a different node. Or for that matter balloon the existing
guests.

> 
> >  - Latency. With the locks being placed on the pools of memory the
> >    existing workload can be negatively affected. Say that this means we
> >    need to balloon down a couple hundred guests, then launch the new
> >    guest. This process of 'lower all of them by X', lets check the
> >    'free amount'. Oh nope - not enougth - lets do this again. That would
> >    delay the creation process.
> >
> >    The claim hypercall will avoid all of that by just declaring:
> >    "This is how much you will get." without having to balloon the rest
> >    of the guests.
> >
> >    Here is how I see what your toolstack would do:
> >
> >      [serial]
> >	1). Figure out how much memory we need for X guests.
> >	2). round-robin existing guests to decrease their memory
> >	    consumption (if they can be ballooned down). Or this
> >	    can be exectued in parallel for the guests.
> >	3). check if the amount of free memory is at least X
> >	    [this check has to be done in serial]
> >      [parallel]
> >	4). launch multiple guests at the same time.
> >
> >    The claim hypercall would avoid the '3' part b/c it is inherently
> >    part of the Xen's MM bureaucracy. It would allow:
> >
> >      [parallel]
> >	1). claim hypercall for X guest.
> >	2). if any of the claim's return 0 (so success), then launch guest
> >	3). if the errno was -ENOMEM then:
> >      [serial]
> >         3a). round-robin existing guests to decrease their memory
> >              consumption if allowed. Goto 1).

and here I forgot about the other way of fixing this - that is, launching
the guest on another node altogether, since at least in our product we
don't want to change the initial d->max_pages. This is due in part
to the issues that were pointed out - the guest might suddenly need that
memory back, or otherwise it will OOM.
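
(Roughly, with an invented per-node claim wrapper, that placement flow would
look like the sketch below; the stubs stand in for real toolstack calls.)

    #include <errno.h>

    struct node;   /* opaque handle for one host/node */

    /* Invented stubs: the real wrappers would issue the claim hypercall and
     * drive the domain builder on the chosen node. */
    static int node_claim_pages(struct node *n, unsigned long pages)
    { (void)n; (void)pages; return 0; }
    static int node_build_guest(struct node *n, unsigned long pages)
    { (void)n; (void)pages; return 0; }

    /* Try each node's claim in turn and only build where the claim stuck,
     * so no running guest ever has its max_pages touched. */
    static int place_guest(struct node **nodes, int n_nodes, unsigned long pages)
    {
        for (int i = 0; i < n_nodes; i++)
            if (node_claim_pages(nodes[i], pages) == 0)
                return node_build_guest(nodes[i], pages);
        return -ENOMEM;    /* no node can make the guarantee right now */
    }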

> >
> >    So the 'error-case' only has to run in the slow-serial case.
> Hmm, I don't think what you wrote about mine is quite right.  Here's
> what I had in mind for mine (let me call it "limit-and-check"):
> 
> [serial]
> 1). Set limits on all guests, and tmem, and see how much memory is left.
> 2) Read free memory
> [parallel]
> 2a) Claim memory for each guest from freshly-calculated pool of free memory.
> 3) For each claim that can be satisfied, launch a guest
> 4) If there are guests that can't be satisfied with the current free
> memory, then:
> [serial]
> 4a) round-robin existing guests to decrease their memory consumption
> if allowed. Goto 2.
> 5) Remove limits on guests.
> 
> Note that 1 would only be done for the first such "request", and 5
> would only be done after all such requests have succeeded or failed.
> Also note that steps 1 and 5 are only necessary if you want to go
> without such limits -- xapi doesn't do them, because it always keeps
> max_pages set to what it wants the guest to be using.
> 
> Also, note that the "claiming" (2a for mine above and 1 for yours)
> has to be serialized with other "claims" in both cases (in the
> reservation hypercall case, with a lock inside the hypervisor), but
> that the building can begin in parallel with the "claiming" in both
> cases.

Sure. The claim call has a very short duration as it has to take a lock
in the hypervisor. It would be a bunch of super-fast calls. Heck, you
could even use the multicall for this to batch it up.

The problem we are trying to fix is that launching a guest can take
minutes. During that time other guests are artificially blocked from
growing and might OOM.

> 
> But I think I do see what you're getting at.  The "free memory"
> measurement has to be taken when the system is in a "quiescent"
> state -- or at least a "grow only" state -- otherwise it's
> meaningless.  So #4a should really be:

Exactly! With tmem running the quiescent state might never happen.
> 
> 4a) Round-robin existing guests to decrease their memory consumption
> if allowed.

I believe this is what Xapi does. The question is: how does the toolstack
decide that properly and on the spot 100% of the time?

I believe that the source of that knowledge lies with the guest kernel - and
it can determine when it needs more or less. We have set the boundaries
(d->max_pages) which haven't changed since the bootup and we let the guest
decide where it wants to be within that spectrum.

> 4b) Wait for currently-building guests to finish building (if any),
> then go to #2.
> 
> So suppose the following cases, in which several requests for guest
> creation come in over a short period of time (not necessarily all at
> once):
> A. There is enough memory for all requested VMs to be built without
> ballooning / something else
> B. There is enough for some, but not all of the VMs to be built
> without ballooning / something else
> 
> In case A, then I think "limit-and-check" and "reservation
> hypercall" should perform the same.  For each new request that comes
> in, the toolstack can say, "Well, when I checked I had 64GiB free;
> then I started to build a 16GiB VM.  So I should have 48GiB left,
> enough to build this 32GiB VM."  "Well, when I checked I had 64GiB
> free; then I started to build a 16GiB VM and a 32GiB VM, so I should
> have 16GiB left, enough to be able to build this 16GiB VM."

For case A, I assume all the guests are launched with mem=maxmem and
there is no PoD, no PCI passthrough and no tmem. Then yes.

For case B, "limit-and-check" requires "limiting" one or more of the
guests. Which one is chosen and what criteria are used means more
heuristics (or just take the shotgun approach and limit all of the
guests by some number).
In other words: d->max_pages -= some X value.

The other way is limiting the total growth of all guests
(so d->tot_pages can't reach d->max_pages). We don't set the d->max_pages
and let the guests balloon up. Note that with tmem in here you can
"move" the temporary pages back in the guest so that the d->tot_pages
can increase by some Y, and the total free amount of heap space increases
by Y as well - b/c the Y value has moved.

Now back to your question:
Accounting for this in user-space is possible, but there are latency
issues, and the toolstack would struggle to keep up, as there might be
millions of these updates on a heavily used machine.  There might not be any "quiescent"
state ever.

> 
> The main difference comes in case B.  The "reservation hypercall"
> method will not have to wait until all existing guests have finished
> building to be able to start subsequent guests; but
> "limit-and-check" would have to wait until the currently-building
> guests are finished before doing another check.

Correct. And the check is imprecise b/c the moment it gets the value
the system might have changed dramatically. The right time to get the
value is when the host is in "quiescent" state, but who knows when
that is going to happen. Perhaps never, at which point you might be
spinning for a long time trying to get that value.
> 
> This limitation doesn't apply to xapi, because it doesn't use the
> hypervisor's free memory as a measure of the memory it has available
> to it.  Instead, it keeps an internal model of the free memory the
> hypervisor has available.  This is based on MAX(current_target,
> tot_pages) of each guest (where "current_target" for a domain in the
> process of being built is the amount of memory it will have
> eventually).  We might call this the "model" approach.
> 

OK. I think it actually checks how much memory the guest has consumed.
This is what one of the comments says:

   (* Some VMs are considered by us (but not by xen) to have an "initial-reservation". For VMs which have never 
       run (eg which are still being built or restored) we take the difference between memory_actual_kib and the
       reservation and subtract this manually from the host's free memory. Note that we don't get an atomic snapshot
       of system state so there is a natural race between the hypercalls. Hopefully the memory is being consumed
       fairly slowly and so the error is small. *)
  

So that would imply that a check against "current" memory consumption
is done. But you know comments - sometimes they do not match
what the code is doing. But if they do match then it looks like
this system would hit issues with self-ballooning and tmem. I believe
that the claim hypercall would fix that easily. It probably would
also make the OCaml code much much simpler.
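
(For readers following along, the "model" approach described above boils down
to something like the following sketch; the field names are invented.)

    /* Toy version of the xapi-style model: charge each domain the larger of
     * what it currently has and what it is heading towards. */
    struct dominfo { unsigned long tot_pages, current_target; };

    static unsigned long model_free_pages(unsigned long host_pages,
                                          const struct dominfo *doms, int n)
    {
        unsigned long used = 0;

        for (int i = 0; i < n; i++) {
            unsigned long charge = doms[i].current_target > doms[i].tot_pages
                                 ? doms[i].current_target : doms[i].tot_pages;
            used += charge;   /* a domain still being built is charged its eventual size */
        }
        return host_pages > used ? host_pages - used : 0;
    }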

> We could extend "limit-and-check" to "limit-check-and-model" (i.e.,
> estimate how much memory is really free after ballooning based on
> how much the guests' tot_pages), or "limit-model" (basically, fully
> switch to a xapi-style "model" approach while you're doing domain
> creation).  That would be significantly more complicated.  On the
> other hand, a lot of the work has already been done by the XenServer
> team, and (I believe) the code in question is all GPL'ed, so Oracle
> could just take the algorithms and adapt them with just a bit if
> tweaking (and a bit of code translation).  It seems to me that he
> "model" approach brings a lot of other benefits as well.

It is hard for me to be convinced by that since the code is in OCaml
and I am having a hard time understanding it. If it was in C, it would
have been much easier to get it and make that evaluation.

The other part of this that I am not sure if I am explaining well is
that the kernel with self-balloon and tmem is very self-adaptive.
It seems to me that having the toolstack be minutely aware of the guests'
memory changes so that it can know exactly how much free memory there is
- is duplicating efforts.

> 
> But at any rate -- without debating the value or cost of the "model"
> approach, would you agree with my analysis and conclusions?  Namely:
> 
> a. "limit-and-check" and "reservation hypercall" are similar wrt
> guest creation when there is enough memory currently free to build
> all requested guests

Not 100%. When there is enough memory free "for the entire period of time
that it takes to build all the requested guests", then yes.

> b. "limit-and-check" may be slower if some guests can succeed in
> being built but others must wait for memory to be freed up, since
> the "check" has to wait for current guests to finish building

No. The check also races with the amount of memory that the hypervisor
reports as free - and that might be altered by the existing
guests (so not the guests that are being built).

> c. (From further back) One downside of a pure "limit-and-check"
> approach is that while VMs are being built, VMs cannot increase in
> size, even if there is "free" memory (not being used to build the
> currently-building domain(s)) or if another VM can be ballooned
> down.

Ah, yes. We really want to avoid that.

> d. "model"-based approaches can mitigate b and c, at the cost of a
> more complicated algorithm

Correct. And also more work done in the userspace to track this.
> 
> >  - This still has the race issue - how much memory you see vs the
> >    moment you launch it. Granted you can avoid it by having a "fudge"
> >    factor (so when a guest says it wants 1G you know it actually
> >    needs an extra 100MB on top of the 1GB or so). The claim hypercall
> >    would count all of that for you so you don't have to race.
> I'm sorry, what race / fudge factor are you talking about?

The scenario where the host is not in a "quiescent" state.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-14 23:14                                 ` Dan Magenheimer
@ 2013-01-23 12:18                                   ` Ian Campbell
  2013-01-23 17:34                                     ` Dan Magenheimer
  2013-02-12 16:18                                     ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 53+ messages in thread
From: Ian Campbell @ 2013-01-23 12:18 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Mon, 2013-01-14 at 23:14 +0000, Dan Magenheimer wrote:
> For the public record, I _partially_ believe #3.  I would restate it
> as: You (and others with the same point-of-view) have a very fixed
> idea of how memory-management should work in the Xen stack.  This
> idea is not really implemented, AFAICT you haven't thought through
> the policy issues, and you haven't yet realized the challenges
> I believe it will present in the context of Oracle's dynamic model
> (since AFAIK you have not understood tmem and selfballooning though
> it is all open source upstream in Xen and Linux).

Putting aside any bias or fixed-mindedness, the maintainers are not
especially happy with the proposed fix, even within the constraints of
the dynamic model. (It omits to cover various use cases and I think
strikes many as something of a sticking plaster).

Given that, I've been trying to suggest an alternative solution which
works within the constraints of your model and happens to have the nice
property of not requiring hypervisor changes. I genuinely think there is
a workable solution to your problem in there, although you are correct
that it is mostly just an idea right now.

That said the best suggestion for a solution I've seen so far was Tim's
suggestion that tmem be more tightly integrated with memory allocation
as another step towards the "memory scheduler" idea. So I wouldn't
bother pursuing the maxmem approach further unless the tmem-integration
idea doesn't pan out for some reason.

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-22 19:22               ` Dan Magenheimer
@ 2013-01-23 12:18                 ` Ian Campbell
  2013-01-23 16:05                   ` Dan Magenheimer
  0 siblings, 1 reply; 53+ messages in thread
From: Ian Campbell @ 2013-01-23 12:18 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org), George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich,
	Andres Lagar-Cavilla

On Tue, 2013-01-22 at 19:22 +0000, Dan Magenheimer wrote:
> > I don't mean that you'd have to do all of that now, but if you were
> > considering moving in that direction, an easy first step would be to add
> > a hook allowing tmem to veto allocations for VMs under its control.
> > That would let tmem have proper control over its client VMs (so it can
> > solve the delayed-failure race for you), while at the same time being a
> > constructive step towards a more complete memory scheduler.
> 
> While you are using different words, you are describing what
> tmem does today.  Tmem does have control and uses the existing
> hypervisor mechanisms and the existing hypervisor lock for memory
> allocation.  That's why it's so clean to solve the "delayed-failure
> race" using the same lock.
> 

So it sounds like it would easily be possible to solve this issue via a
tmem hook as Tim suggests?

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-23 12:18                 ` Ian Campbell
@ 2013-01-23 16:05                   ` Dan Magenheimer
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-23 16:05 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich,
	Andres Lagar-Cavilla

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> On Tue, 2013-01-22 at 19:22 +0000, Dan Magenheimer wrote:
> > > I don't mean that you'd have to do all of that now, but if you were
> > > considering moving in that direction, an easy first step would be to add
> > > a hook allowing tmem to veto allocations for VMs under its control.
> > > That would let tmem have proper control over its client VMs (so it can
> > > solve the delayed-failure race for you), while at the same time being a
> > > constructive step towards a more complete memory scheduler.
> >
> > While you are using different words, you are describing what
> > tmem does today.  Tmem does have control and uses the existing
> > hypervisor mechanisms and the existing hypervisor lock for memory
> > allocation.  That's why it's so clean to solve the "delayed-failure
> > race" using the same lock.
> 
> So it sounds like it would easily be possible to solve this issue via a
> tmem hook as Tim suggests?

Hmmm... I see how my reply might be interpreted that way,
so let me rephrase and add some different emphasis:

Tmem already has "proper" control over its client VMs:
The only constraints tmem needs to enforce are the
d->max_pages value which was set when the guest launched,
and total physical RAM.  It's no coincidence that these
are the same constraints enforced by the existing
hypervisor allocator mechanisms inside existing hypervisor
locks.  And tmem is already a very large step towards
a complete memory scheduler.

But tmem is just a user of the existing hypervisor
allocator and locks.  It doesn't pretend to be able to
supervise or control all allocations; that's the job of
the hypervisor allocator.  Tmem only provides services
to guests, some of which require allocating memory
to store data on behalf of the guest.  And some of
those allocations do not increase d->tot_pages and some
do. (I can further explain why if you wish.)

So a clean solution to the "delayed-failure race" is
to use the same hypervisor allocator locks used by
all other allocations (including tmem and in-guest
ballooning).  That's exactly what XENMEM_claim_pages does.

Heh, I suppose you could rename XENMEM_claim_pages to be
XENMEM_tmem_claim_pages without changing the semantics
or any other code in the patch, and then this issue
would indeed be solved by a "tmem hook".

Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-23 12:18                                   ` Ian Campbell
@ 2013-01-23 17:34                                     ` Dan Magenheimer
  2013-02-12 16:18                                     ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2013-01-23 17:34 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate
> solutions
> 
> Putting aside any bias or fixed mindedness the maintainers are not
> especially happy with the proposed fix, even within the constraints of
> the dynamic model. (It omits to cover various use cases and I think
> strikes many as something of a sticking plaster).

Sticking plaster: FWIW I agree.  But the wound it is covering
is that a buddy allocator is not well suited to atomically allocate
large quantities of potentially discontiguous memory, which is what
we need Xen to do to allocate all the memory to create a domain without
a race.  The whole concept of capacity allocation is a hack to work
around that limitation.  Maybe we could overhaul the allocator to
handle this better or maybe we could replace the whole allocator,
but IMHO, compared to those alternatives (especially considering a
likely bug tail), a plaster is far preferable.

Omits use cases:  I've stated my opinion on this several times
("prefer to fix 98% of a bug and not make the other 2%[1] worse
than fix 0%") and nobody has argued the point.  It's not uncommon
for a proposed Xen fix to solve a HVM problem and not a similar PV
problem, or vice-versa.  Maybe one should think of claim_pages as
a complete solution to the _HVM_ "delayed failure race problem"
that, coincidentally, also solves nearly all of the _PV_ "delayed
failure race problem".  Does that help? ;-)  So I see this not
as a reason to block the proposed hypercall, but as an indication
that some corner cases need to be put on someone's "to-do" list.
And, IMHO, prioritized very low on that person's to-do list.

[1] BTW, to clarify, the "2%" is PV domains (not HVM) with superpages=1
manually set in vm.cfg, plus 32-bit PV domains _only_ on systems
with >64GB physical RAM. So 2% is probably way too high.

> Given that I've been trying to suggest an alternative solution which
> works within the constraints of you model and happens to have the nice
> property of not requiring hypervisor changes. I genuinely think there is
> a workable solution to your problem in there, although you are correct
> that it mostly just an idea right now.

Let me also summarize my argument:

It's very hard to argue against ideas and I certainly don't
want to challenge anyone's genuineness (or extremely hard work as
a maintainer), but the hypervisor really does have very easy
atomic access to certain information and locks, and the toolstack
simply doesn't.  So the toolstack has to guess and/or create
unnecessary (and potentially dangerous) constraints to ensure
the information it collects doesn't race against changes to that
information (TOCTOU races).  And even if the toolstack could safely
create those constraints, it must create them severally against
multiple domains whereas the hypervisor can choose to enforce
only the total system constraint (i.e. max-of-sums is better than
sum-of-maxes).
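
(To put some made-up numbers on that point: on a 64GB host with four guests
that may each want to burst to 32GB, enforcing only the total -- max-of-sums --
lets any one guest actually reach 32GB whenever the sum across all four stays
under 64GB.  Enforcing per-guest caps whose sum fits the host -- sum-of-maxes
-- forces each cap down to 16GB, so a guest is refused memory even when the
other three are idle and most of the host is free.)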

So, I think we all agree with the goal:

"Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

Ian, I believe I have clearly proven that providing the claim
functionality outside the hypervisor can be done only by
taking away other functionality (e.g. unnecessarily constraining
guests which are doing dynamic allocation and requiring sum-of-maxes
rather than max-of-sums).

I hope you can finally agree and ack the hypervisor patch.

But first...

> That said the best suggestion for a solution I've seen so far was Tim's
> suggestion that tmem be more tightly integrated with memory allocation
> as another step towards the "memory scheduler" idea. So I wouldn't
> bother pursuing the maxmem approach further unless the tmem-integration
> idea doesn't pan out for some reason.

Please excuse my frustration and if I sound like a broken record,
but tmem, as it sits today (and has sat in the hypervisor for nearly
four years now) _is_ already a huge step towards the memory scheduler
idea, and _is_ already tightly integrated with the hypervisor
memory allocator.  In fact, one could say it is exactly because
this tight integration already exists that claim_pages needs to
be implemented as a hypercall.

I've repeatedly invited all Xen maintainers [2] to take some time to
truly understand how tmem works and why, but still have had no takers.
It's a very clever solution to a very hard problem, and it's all open
source and all shipping today; but it is not simple so unfortunately
can't be explained in a couple of paragraphs or a 10-minute call.
Please let me know if you want to know more.

[2] I think Jan and Keir fully understand the Xen mechanisms
  but perhaps not the guest-side or how tmem all works together
  and why.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-22 21:57           ` Konrad Rzeszutek Wilk
@ 2013-01-23 18:36             ` Dave Scott
  2013-02-12 15:38               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Scott @ 2013-01-23 18:36 UTC (permalink / raw)
  To: 'Konrad Rzeszutek Wilk', George Dunlap
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

Hi,

> On Mon, Jan 14, 2013 at 06:28:48PM +0000, George Dunlap wrote:
> > I'm not fluent in OCaml either, I'm mainly going from memory based on
> > the discussions I had with the author when it was being designed, as
> > well as discussions with the xapi team when dealing with bugs at later
> > points.

Konrad Rzeszutek Wilk replied:

> I was looking at xen-api/ocaml/xenops/squeeze.ml and just reading the
> comments and feebly trying to understand how the OCaml code works.
> Best I could understand, it does various measurements, makes the
> appropriate hypercalls and waits for everything to stabilize before allowing
> the guest to start.
> 
> N.B: With tmem, the 'stabilization' might never happen.

In case it's useful I re-uploaded the squeezed design doc to the xen wiki:

http://wiki.xen.org/wiki/File:Squeezed.pdf

I think it got lost during the conversion from the old wiki to the new wiki.

Hopefully the doc gives a better "big picture" view than the code itself :-)

The quick summary is that squeezed tries to "balance" memory between the VMs on the host by manipulating their balloon targets. When a VM is to be started, xapi will ask it to "reserve" memory, squeezed will lower the balloon targets (and set maxmem as an absolute limit on allocation), wait for something to happen, possibly conclude some guests are being "uncooperative" and ask the "cooperative" ones to balloon down some more etc. It works but the problems I would highlight are:

0. since a VM which refuses to balloon down causes other VMs to be ballooned harder, we needed a good way to signal misbehavior to the user

1. freeing memory by ballooning can be quite slow (especially on windows)

2. to actually free 'x' MiB we have to know what number to set the memory/target to. Experimentally it seems that, when an HVM guest has finished ballooning, the domain's total_pages will equal the memory/target + a constant offset. Squeezed performs an initial calibration but it is potentially quite fragile. (A sketch of this calibration follows the list below.)

3. we want to keep as much memory in-use as possible (ie allocated to guests) but allocating domain structures often failed due to lack of (low? contiguous?) memory. To work around this we balloon first and domain create second, but this required us to track memory 'reservations' independently of the domains so that we wouldn't leak over a crash. This is a bit complicated but ok because all memory allocations are handled by squeezed.

4. squeezed's memory management will clearly not work very well if some degree of page sharing is in-use :-)

5. (more of a bug) the code for "balancing" would occasionally oscillate, moving pages between VMs every few seconds. This caused quite a lot of log spam.
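
(A minimal sketch of the calibration mentioned in point 2, with invented names
and not the actual squeezed code: measure the constant gap between a settled
guest's total_pages and its memory/target once, then reuse it when computing
new targets.)

    struct calib { long offset_kib; };

    static void calibrate(struct calib *c, long total_kib, long target_kib)
    {
        c->offset_kib = total_kib - target_kib;   /* observed constant offset */
    }

    /* Target to set if we want the guest to occupy want_kib of host RAM. */
    static long target_for(const struct calib *c, long want_kib)
    {
        return want_kib - c->offset_kib;
    }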

HTH,

Dave

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-23 18:36             ` Dave Scott
@ 2013-02-12 15:38               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-02-12 15:38 UTC (permalink / raw)
  To: Dave Scott
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Wed, Jan 23, 2013 at 06:36:06PM +0000, Dave Scott wrote:
> Hi,
> 
> > On Mon, Jan 14, 2013 at 06:28:48PM +0000, George Dunlap wrote:
> > > I'm not fluent in OCaml either, I'm mainly going from memory based on
> > > the discussions I had with the author when it was being designed, as
> > > well as discussions with the xapi team when dealing with bugs at later
> > > points.
> 
> Konrad Rzeszutek Wilk replied:
> 
> > I was looking at xen-api/ocaml/xenops/squeeze.ml and just reading the
> > comments and feebly trying to understand how the OCaml code works.
> > Best I could understand, it does various measurements, makes the
> > appropriate hypercalls and waits for everything to stabilize before allowing
> > the guest to start.
> > 
> > N.B: With tmem, the 'stabilization' might never happen.
> 
> In case it's useful I re-uploaded the squeezed design doc to the xen wiki:
> 
> http://wiki.xen.org/wiki/File:Squeezed.pdf
> 
> I think it got lost during the conversion from the old wiki to the new wiki.
> 
> Hopefully the doc gives a better "big picture" view than the code itself :-)
> 
> The quick summary is that squeezed tries to "balance" memory between the VMs on the host by manipulating their balloon targets. When a VM is to be started, xapi will ask it to "reserve" memory, squeezed will lower the balloon targets (and set maxmem as an absolute limit on allocation), wait for something to happen, possibly conclude some guests are being "uncooperative" and ask the "cooperative" ones to balloon down some more etc. It works but the problems I would highlight are:
> 

How do you know whether the cooperative guests _can_ balloon further down? As in, what if they are OK doing it but end
up OOM-ing? That can happen right now with Linux if you set the memory target too low.

> 0. since a VM which refuses to balloon down causes other VMs to be ballooned harder, we needed a good way to signal misbehavior to the user
> 
> 1. freeing memory by ballooning can be quite slow (especially on windows)
> 
> 2. to actually free 'x' MiB we have to know what number to set the memory/target to. Experimentally it seems that, when an HVM guest has finished ballooning, the domain's total_pages will equal the memory/target + a constant offset. Squeezed performs an initial calibration but it is potentially quite fragile.
> 
> 3. we want to keep as much memory in-use as possible (ie allocated to guests) but allocating domain structures often failed due to lack of (low? contiguous?) memory. To work around this we balloon first and domain create second, but this required us to track memory 'reservations' independently of the domains so that we wouldn't leak over a crash. This is a bit complicated but ok because all memory allocations are handled by squeezed.
> 
> 4. squeezed's memory management will clearly not work very well if some degree of page sharing is in-use :-)

Right, and 'tmem' is in the same "boat", so to speak.
> 
> 5. (more of a bug) the code for "balancing" would occasionally oscillate, moving pages between VMs every few seconds. This caused quite a lot of log spam.

Thank you for writing it up. I think I now have a better understanding of it.
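
To check my own understanding of the balancing idea, here is a toy sketch in
C (all helper names, struct fields and numbers below are made up for
illustration; none of this is derived from the actual squeeze.ml code): lower
the balloon targets proportionally to free room for a new VM, then flag
guests that stay well above their new target as "uncooperative".

/* Toy model of the "squeeze" step: lower balloon targets to free
 * need_kib for a new VM, then flag guests that do not follow their
 * new target as uncooperative.  Purely illustrative. */
#include <stdio.h>

struct guest {
    const char *name;
    long target_kib;   /* memory/target we will ask the balloon driver for */
    long actual_kib;   /* what the guest currently has                     */
    long min_kib;      /* dynamic-min: never squeeze below this            */
    int  uncooperative;
};

static void squeeze(struct guest *g, int n, long need_kib)
{
    long slack = 0;

    /* How much can be reclaimed above each guest's dynamic-min? */
    for (int i = 0; i < n; i++)
        slack += g[i].actual_kib - g[i].min_kib;

    /* Take the needed amount proportionally from each guest. */
    for (int i = 0; i < n; i++) {
        long share = slack ? (long)((long long)need_kib *
                             (g[i].actual_kib - g[i].min_kib) / slack) : 0;
        g[i].target_kib = g[i].actual_kib - share;
    }
}

static void check_cooperation(struct guest *g, int n, long tolerance_kib)
{
    /* A guest that stays well above its target is flagged, so the
     * remaining need can be re-spread over the cooperative ones. */
    for (int i = 0; i < n; i++)
        g[i].uncooperative = (g[i].actual_kib - g[i].target_kib) > tolerance_kib;
}

int main(void)
{
    struct guest g[] = {
        { "vm1", 0, 2048 * 1024, 512 * 1024, 0 },
        { "vm2", 0, 1024 * 1024, 512 * 1024, 0 },
    };

    squeeze(g, 2, 512 * 1024);          /* need 512 MiB for a new VM     */
    g[0].actual_kib = g[0].target_kib;  /* vm1 follows its new target... */
                                        /* ...vm2 ignores it             */
    check_cooperation(g, 2, 16 * 1024); /* 16 MiB tolerance              */
    for (int i = 0; i < 2; i++)
        printf("%s: target=%ld KiB%s\n", g[i].name, g[i].target_kib,
               g[i].uncooperative ? " (uncooperative)" : "");
    return 0;
}
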
> 
> HTH,
> 
> Dave
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-21 10:29                 ` Tim Deegan
@ 2013-02-12 15:54                   ` Konrad Rzeszutek Wilk
  2013-02-14 13:32                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-02-12 15:54 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

On Mon, Jan 21, 2013 at 10:29:23AM +0000, Tim Deegan wrote:
> At 16:45 -0500 on 18 Jan (1358527542), Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 17, 2013 at 03:16:31PM +0000, Tim Deegan wrote:
> > > At 14:08 -0500 on 11 Jan (1357913294), Konrad Rzeszutek Wilk wrote:
> > > > But the solution to the hypercall failing are multiple - one is to 
> > > > try to "squeeze" all the guests to make space
> > > 
> > > AFAICT if the toolstack can squeeze guests up to make room then the
> > > reservation hypercall isn't necessary -- just use the squeezing
> > > mechanism to make sure that running VMs don't use up the memory you want
> > > for building new ones.
> > 
> > We might want to not do that until we have run out of options (this would
> > be a toolstack option to select the right choice). The other option is
> > to just launch the guest on another node.
> 
> Sure, I see that; but what I meant was: the reservation hypercall only
> makes any kind of sense if the toolstack can't squeeze the existing guests. 

OK. I am going to take the liberty here to assume that squeeze is setting
d->max_pages and kicking the guest to balloon down to some number.

> 
> If it can squeeze VMs, as part of that it must have some mechanism to
> stop them from immediately re-allocating all the memory as it frees it.
> So in the case where enough memory is already free, you just use that
> mechanism to protect it while you build the new VM.

Sure.
> 
> Or (since I get the impression that losing this allocation race is a
> rare event) you can take the optimistic route: after you've checked that
> enough memory is free, just start building the VM.  If you run out of
> memory part-way through, you can squeeze the other VMs back out so you can
> finish the job.

All of this revolves around 'squeezing' the existing guests from the
tool-stack side. As such, the options you enumerated are the right
way of fixing it, and the way Xapi does it is pretty good.

However, that is not the problem we are trying to address. We do _not_ want
to squeeze the guest at all. We want to leave it up to the guest to
go up and down as it sees fit. We just need to set the ceiling (at start
time, and this is d->max_pages), and let the guest increment/decrement
d->tot_pages as it sees fit. And while that is going on, still be able
to create new guests in parallel.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-01-23 12:18                                   ` Ian Campbell
  2013-01-23 17:34                                     ` Dan Magenheimer
@ 2013-02-12 16:18                                     ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-02-12 16:18 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Dan Magenheimer, Keir (Xen.org),
	George Dunlap, Andres Lagar-Cavilla, Tim (Xen.org),
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich, Ian Jackson

On Wed, Jan 23, 2013 at 12:18:40PM +0000, Ian Campbell wrote:
> On Mon, 2013-01-14 at 23:14 +0000, Dan Magenheimer wrote:
> > For the public record, I _partially_ believe #3.  I would restate it
> > as: You (and others with the same point-of-view) have a very fixed
> > idea of how memory-management should work in the Xen stack.  This
> > idea is not really implemented, AFAICT you haven't thought through
> > the policy issues, and you haven't yet realized the challenges
> > I believe it will present in the context of Oracle's dynamic model
> > (since AFAIK you have not understood tmem and selfballooning though
> > it is all open source upstream in Xen and Linux).
> 
> Putting aside any bias or fixed-mindedness, the maintainers are not
> especially happy with the proposed fix, even within the constraints of
> the dynamic model. (It omits to cover various use cases and I think
> strikes many as something of a sticking plaster).

Could you excuse my ignorance of idioms and explain what 'sticking plaster'
is in this context? Is it akin to 'duct-tape'?

> 
> Given that, I've been trying to suggest an alternative solution which
> works within the constraints of your model and happens to have the nice
> property of not requiring hypervisor changes. I genuinely think there is
> a workable solution to your problem in there, although you are correct
> that it is mostly just an idea right now.

This is mid.gmane.org/20130121102923.GA72616@ocelot.phlegethon.org,
right? Dan had some questions about it and some clarifications about
the premises of it. And in:
http://mid.gmane.org/1357743524.7989.266.camel@zakaz.uk.xensource.com

you mentioned that you would take another look at it. Perhaps I am missing an
email?
> 
> That said, the best suggestion for a solution I've seen so far was Tim's
> suggestion that tmem be more tightly integrated with memory allocation
> as another step towards the "memory scheduler" idea. So I wouldn't

Is this the mid.gmane.org/20130121102923.GA72616@ocelot.phlegethon.org ?

> bother pursuing the maxmem approach further unless the tmem-integration
> idea doesn't pan out for some reason.

Which one is maxmem? Is that the one that Xapi is using, wherein
d->max_pages is set via the XEN_DOMCTL_max_mem hypercall?

> 
> Ian.
> 
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
  2013-02-12 15:54                   ` Konrad Rzeszutek Wilk
@ 2013-02-14 13:32                     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 53+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-02-14 13:32 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Andres Lagar-Cavilla, Ian Jackson,
	xen-devel, Konrad Rzeszutek Wilk, Jan Beulich

On Tue, Feb 12, 2013 at 10:54:10AM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 21, 2013 at 10:29:23AM +0000, Tim Deegan wrote:
> > At 16:45 -0500 on 18 Jan (1358527542), Konrad Rzeszutek Wilk wrote:
> > > On Thu, Jan 17, 2013 at 03:16:31PM +0000, Tim Deegan wrote:
> > > > At 14:08 -0500 on 11 Jan (1357913294), Konrad Rzeszutek Wilk wrote:
> > > > > But the solution to the hypercall failing are multiple - one is to 
> > > > > try to "squeeze" all the guests to make space
> > > > 
> > > > AFAICT if the toolstack can squeeze guests up to make room then the
> > > > reservation hypercall isn't necessary -- just use the squeezing
> > > > mechanism to make sure that running VMs don't use up the memory you want
> > > > for building new ones.
> > > 
> > > We might want to not do that until we have run out of options (this would
> > > be a toolstack option to select the right choice). The other option is
> > > to just launch the guest on another node.
> > 
> > Sure, I see that; but what I meant was: the reservation hypercall only
> > makes any kind of sense if the toolstack can't squeeze the existing guests. 
> 
> OK. I am going to take the liberty here to assume that squeeze is setting
> d->max_pages and kicking the guest to balloon down to some number.
> 
> > 
> > If it can squeeze VMs, as part of that it must have some mechanism to
> > stop them from immediately re-allocating all the memory as it frees it.
> > So in the case where enough memory is already free, you just use that
> > mechanism to protect it while you build the new VM.
> 
> Sure.
> > 
> > Or (since I get the impression that losing this allocation race is a
> > rare event) you can take the optimistic route: after you've checked that
> > enough memory is free, just start building the VM.  If you run out of
> > memory part-way through, you can squeeze the other VMs back out so you can
> > finish the job.
> 
> All of this revolves around 'squeezing' the existing guests from the
> tool-stack side. As such, the options you enumerated are the right
> way of fixing it, and the way Xapi does it is pretty good.
> 
> However, that is not the problem we are trying to address. We do _not_ want
> to squeeze the guest at all. We want to leave it up to the guest to
> go up and down as it sees fit. We just need to set the ceiling (at start
> time, and this is d->max_pages), and let the guest increment/decrement
> d->tot_pages as it sees fit. And while that is going on, still be able
> to create new guests in parallel.

When I was mulling this over today it dawned on me that I think you
(and Ian) are saying something along these lines: the claim hypercall
is one piece of this - the fallback mechanism of properly ballooning
("squeezing") should also be implemented - so that this becomes a
full-fledged solution.

In other words, the hypervisor patch _and_ the toolstack logic ought to
be done/considered together.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
@ 2012-12-03 20:54 Dan Magenheimer
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Magenheimer @ 2012-12-03 20:54 UTC (permalink / raw)
  To: Ian Jackson, Ian Campbell, George Dunlap, Tim (Xen.org)
  Cc: Keir (Xen.org), Jan Beulich, xen-devel

I earlier promised a complete analysis of the problem
addressed by the proposed claim hypercall as well as
an analysis of the alternate solutions.  I had not
yet provided these analyses when I asked for approval
to commit the hypervisor patch, so there was still
a good amount of misunderstanding, and I am trying
to fix that here.

I had hoped this essay could be both concise and complete
but quickly found it to be impossible to be both at the
same time.  So I have erred on the side of verbosity,
but also have attempted to ensure that the analysis
flows smoothly and is understandable to anyone interested
in learning more about memory allocation in Xen.
I'd appreciate feedback from other developers to understand
if I've also achieved that goal.

Ian, Ian, George, and Tim -- I have tagged a few
out-of-flow questions to you with [IIGT].  If I lose
you at any point, I'd especially appreciate your feedback
at those points.  I trust that, first, you will read
this completely.  As I've said, I understand that
Oracle's paradigm may differ in many ways from your
own, so I also trust that you will read it completely
with an open mind.

Thanks,
Dan

PROBLEM STATEMENT OVERVIEW

The fundamental problem is a race; two entities are
competing for part or all of a shared resource: in this case,
physical system RAM.  Normally, a lock is used to mediate
a race.

For memory allocation in Xen, there are two significant
entities, the toolstack and the hypervisor.  And, in
general terms, there are currently two important locks:
one used in the toolstack for domain creation;
and one in the hypervisor used for the buddy allocator.

Considering first only domain creation, the toolstack
lock is taken to ensure that domain creation is serialized.
The lock is taken when domain creation starts, and released
when domain creation is complete.

As system and domain memory requirements grow, the amount
of time to allocate all necessary memory to launch a large
domain is growing and may now exceed several minutes, so
this serialization is increasingly problematic.  The result
is a customer reported problem:  If a customer wants to
launch two or more very large domains, the "wait time"
required by the serialization is unacceptable.

Oracle would like to solve this problem.  And Oracle
would like to solve this problem not just for a single
customer sitting in front of a single machine console, but
for the very complex case of a large number of machines,
with the "agent" on each machine taking independent
actions including automatic load balancing and power
management via migration.  (This complex environment
is sold by Oracle today; it is not a "future vision".)

[IIGT] Completely ignoring any possible solutions to this
problem, is everyone in agreement that this _is_ a problem
that _needs_ to be solved with _some_ change in the Xen
ecosystem?

SOME IMPORTANT BACKGROUND INFORMATION

In the subsequent discussion, it is important to
understand a few things:

While the toolstack lock is held, allocating memory for
the domain creation process is done as a sequence of one
or more hypercalls, each asking the hypervisor to allocate
one or more -- "X" -- slabs of physical RAM, where a slab
is 2**N contiguous aligned pages, also known as an
"order N" allocation.  While the hypercall is defined
to work with any value of N, common values are N=0
(individual pages), N=9 ("hugepages" or "superpages"),
and N=18 ("1GiB pages").  So, for example, if the toolstack
requires 201MiB of memory, it will make two hypercalls:
One with X=100 and N=9 (200MiB in superpages), and one with X=256
and N=0 (the remaining 1MiB as individual pages).

While the toolstack may ask for a smaller number X of
order==9 slabs, system fragmentation may unpredictably
cause the hypervisor to fail the request, in which case
the toolstack will fall back to a request for 512*X
individual pages.  If there is sufficient RAM in the system,
this request for order==0 pages is guaranteed to succeed.
Thus for a 1TiB domain, the hypervisor must be prepared
to allocate up to 256Mi individual pages.
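
To make the fallback concrete, here is a small self-contained C sketch of
the logic just described.  The populate_physmap() below is only a stand-in
for the real XENMEM_populate_physmap hypercall (its behaviour is hard-coded
to simulate a fragmented system); nothing in it is actual toolstack code.

#include <stdio.h>
#include <stdbool.h>

#define ORDER_2M 9   /* an order-9 slab is 512 contiguous 4KiB pages (2MiB) */

/* Stand-in for the hypercall: pretend the system is too fragmented to
 * satisfy any order-9 request, so only order-0 allocations succeed. */
static bool populate_physmap(unsigned long count, unsigned int order)
{
    (void)count;
    return order == 0;
}

/* Allocate 'pages' 4KiB pages for a new domain, preferring superpages
 * and falling back to individual pages when fragmentation bites. */
static int allocate_domain_memory(unsigned long pages)
{
    unsigned long slabs     = pages >> ORDER_2M;
    unsigned long remainder = pages & ((1UL << ORDER_2M) - 1);

    if (slabs && !populate_physmap(slabs, ORDER_2M)) {
        /* Fall back: 512 order-0 pages per failed order-9 slab. */
        if (!populate_physmap(slabs << ORDER_2M, 0))
            return -1;                 /* genuinely out of memory */
    }
    if (remainder && !populate_physmap(remainder, 0))
        return -1;

    return 0;
}

int main(void)
{
    /* 201MiB = 100 order-9 slabs plus 256 order-0 pages. */
    unsigned long pages = 201UL * 1024 * 1024 / 4096;

    printf("allocation %s\n",
           allocate_domain_memory(pages) ? "failed" : "succeeded");
    return 0;
}
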

Note carefully that when the toolstack hypercall asks for
100 slabs, the hypervisor "heaplock" is currently taken
and released 100 times.  Similarly, for 256M individual
pages... 256 million spin_lock-alloc_page-spin_unlocks.
This means that domain creation is not "atomic" inside
the hypervisor, which means that races can and will still
occur.

RULING OUT SOME SIMPLE SOLUTIONS

Is there an elegant simple solution here?

Let's first consider the possibility of removing the toolstack
serialization entirely and/or the possibility that two
independent toolstack threads (or "agents") can simultaneously
request a very large domain creation in parallel.  As described
above, the hypervisor's heaplock is insufficient to serialize RAM
allocation, so the two domain creation processes race.  If there
is sufficient resource for either one to launch, but insufficient
resource for both to launch, the winner of the race is indeterminate,
and one or both launches will fail, possibly after one or both 
domain creation threads have been working for several minutes.
This is a classic "TOCTOU" (time-of-check-time-of-use) race.
If a customer is unhappy waiting several minutes to launch
a domain, they will be even more unhappy waiting for several
minutes to be told that one or both of the launches has failed.
Multi-minute failure is even more unacceptable for an automated
agent trying to, for example, evacuate a machine that the
data center administrator needs to powercycle.

[IIGT: Please hold your objections for a moment... the paragraph
above is discussing the simple solution of removing the serialization;
your suggested solution will be discussed soon.]
 
Next, let's consider the possibility of changing the heaplock
strategy in the hypervisor so that the lock is held not
for one slab but for the entire request of N slabs.  As with
any core hypervisor lock, holding the heaplock for a "long time"
is unacceptable.  To a hypervisor, several minutes is an eternity.
And, in any case, by serializing domain creation in the hypervisor,
we have really only moved the problem from the toolstack into
the hypervisor, not solved the problem.

[IIGT] Are we in agreement that these simple solutions can be
safely ruled out?

CAPACITY ALLOCATION VS RAM ALLOCATION

Looking for a creative solution, one may realize that it is the
page allocation -- especially in large quantities -- that is very
time-consuming.  But, thinking outside of the box, it is not
the actual pages of RAM that we are racing on, but the quantity
of pages required to launch a domain!  If we instead have a way to
"claim" a quantity of pages cheaply now and then allocate the actual
physical RAM pages later, we have changed the race to require only
serialization of the claiming process!  In other words, if some entity
knows the number of pages available in the system, and can "claim"
N pages for the benefit of a domain being launched, the successful
launch of the domain can be ensured.  Well... the domain launch may
still fail for an unrelated reason, but not due to a memory TOCTOU
race.  But, in this case, if the cost (in time) of the claiming
process is very small compared to the cost of the domain launch,
we have solved the memory TOCTOU race with hardly any delay added
to a non-memory-related failure that would have occurred anyway.

This "claim" sounds promising.  But we have made an assumption that
an "entity" has certain knowledge.  In the Xen system, that entity
must be either the toolstack or the hypervisor.  Or, in the Oracle
environment, an "agent"... but an agent and a toolstack are similar
enough for our purposes that we will just use the more broadly-used
term "toolstack".  In using this term, however, it's important to
remember it is necessary to consider the existence of multiple
threads within this toolstack.

Now I quote Ian Jackson: "It is a key design principle of a system
like Xen that the hypervisor should provide only those facilities
which are strictly necessary.  Any functionality which can be
reasonably provided outside the hypervisor should be excluded
from it."

So let's examine the toolstack first.

[IIGT] Still all on the same page (pun intended)?

TOOLSTACK-BASED CAPACITY ALLOCATION

Does the toolstack know how many physical pages of RAM are available?
Yes, it can use a hypercall to find out this information after Xen and
dom0 launch, but before it launches any domain.  Then if it subtracts
the number of pages used when it launches a domain and is aware of
when any domain dies, and adds them back, the toolstack has a pretty
good estimate.  In actuality, the toolstack doesn't _really_ know the
exact number of pages used when a domain is launched, but there
is a poorly-documented "fuzz factor"... the toolstack knows the
number of pages within a few megabytes, which is probably close enough.

This is a fairly good description of how the toolstack works today
and the accounting seems simple enough, so does toolstack-based
capacity allocation solve our original problem?  It would seem so.
Even if there are multiple threads, the accounting -- not the extended
sequence of page allocation for the domain creation -- can be
serialized by a lock in the toolstack.  But note carefully, either
the toolstack and the hypervisor must always be in sync on the
number of available pages (within an acceptable margin of error);
or any query to the hypervisor _and_ the toolstack-based claim must
be paired atomically, i.e. the toolstack lock must be held across
both.  Otherwise we again have another TOCTOU race. Interesting,
but probably not really a problem.
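
As a purely illustrative sketch (none of this is existing xend/libxl code,
and the "fuzz" margin is an invented value), the toolstack-side accounting
and its lock might look something like this:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t capacity_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages;              /* toolstack's running estimate */
static const unsigned long fuzz_pages = 1024; /* margin for the "fuzz factor" */

/* Initialise the estimate once from the hypervisor's report of free
 * memory, before any domain is built. */
void capacity_init(unsigned long pages_reported_by_xen)
{
    free_pages = pages_reported_by_xen;
}

/* Returns true if 'pages' could be set aside for a new domain; the lock
 * is held only for this cheap arithmetic, not for the minutes of actual
 * page allocation. */
bool capacity_claim(unsigned long pages)
{
    bool ok = false;

    pthread_mutex_lock(&capacity_lock);
    if (free_pages >= pages + fuzz_pages) {
        free_pages -= pages;
        ok = true;
    }
    pthread_mutex_unlock(&capacity_lock);
    return ok;
}

/* Called when a domain dies, or when the toolstack itself balloons a
 * domain down and knows the pages have been returned. */
void capacity_release(unsigned long pages)
{
    pthread_mutex_lock(&capacity_lock);
    free_pages += pages;
    pthread_mutex_unlock(&capacity_lock);
}

The catch is that this bookkeeping is only correct if nothing changes the
number of free pages behind the toolstack's back, which is exactly the
question taken up next.
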

Wait, isn't it possible for the toolstack to dynamically change the
number of pages assigned to a domain?  Yes, this is often called
ballooning and the toolstack can do this via a hypercall.  But
that's still OK because each call goes through the toolstack and
it simply needs to add more accounting for when it uses ballooning
to adjust the domain's memory footprint.  So we are still OK.

But wait again... that brings up an interesting point.  Are there
any significant allocations that are done in the hypervisor without
the knowledge and/or permission of the toolstack?  If so, the
toolstack may be missing important information.

So are there any such allocations?  Well... yes. There are a few.
Let's take a moment to enumerate them:

A) In Linux, a privileged user can write to a sysfs file which writes
to the balloon driver which makes hypercalls from the guest kernel to
the hypervisor, which adjusts the domain memory footprint, which
changes the number of free pages _without_ the toolstack's knowledge.
The toolstack controls constraints (essentially a minimum and maximum)
which the hypervisor enforces.  The toolstack can ensure that the
minimum and maximum are identical to essentially disallow Linux from
using this functionality.  Indeed, this is precisely what Citrix's
Dynamic Memory Controller (DMC) does: enforce min==max so that DMC
always has complete control and, so, knowledge of any domain memory
footprint changes.  But DMC is not prescribed by the toolstack,
and some real Oracle Linux customers use and depend on the flexibility
provided by in-guest ballooning.  So guest-privileged-user-driven
ballooning is a potential issue for toolstack-based capacity allocation.

[IIGT: This is why I have brought up DMC several times and have
called this the "Citrix model"... I'm not trying to be snippy
or impugn your morals as maintainers.]

B) Xen's page sharing feature has slowly been completed over a number
of recent Xen releases.  It takes advantage of the fact that many
pages often contain identical data; the hypervisor merges them to save
physical RAM.  When any "shared" page is written, the hypervisor
"splits" the page (aka, copy-on-write) by allocating a new physical
page.  There is a long history of this feature in other virtualization
products and it is known to be possible that, under many circumstances,
thousands of splits may occur in any fraction of a second.  The
hypervisor does not notify or ask permission of the toolstack.
So, page-splitting is an issue for toolstack-based capacity
allocation, at least as currently coded in Xen.

[Andres: Please hold your objection here until you read further.]

C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
toolstack for over three years.  It depends on an in-guest-kernel
adaptive technique to constantly adjust the domain memory footprint as
well as hooks in the in-guest-kernel to move data to and from the
hypervisor.  While the data is in the hypervisor's care, interesting
memory-load balancing between guests is done, including optional
compression and deduplication.  All of this has been in Xen since 2009
and has been awaiting changes in the (guest-side) Linux kernel. Those
changes are now merged into the mainstream kernel and are fully
functional in shipping distros.

While a complete description of tmem's guest<->hypervisor interaction
is beyond the scope of this document, it is important to understand
that any tmem-enabled guest kernel may unpredictably request thousands
or even millions of pages directly via hypercalls from the hypervisor
in a fraction of a second with absolutely no interaction with the
toolstack.  Further, the guest-side hypercalls that allocate pages
via the hypervisor are done in "atomic" code deep in the Linux mm
subsystem.

Indeed, if one truly understands tmem, it should become clear that
tmem is fundamentally incompatible with toolstack-based capacity
allocation. But let's stop discussing tmem for now and move on.

OK.  So with existing code both in Xen and Linux guests, there are
three challenges to toolstack-based capacity allocation.  We'd
really still like to do capacity allocation in the toolstack.  Can
something be done in the toolstack to "fix" these three cases?

Possibly.  But let's first look at hypervisor-based capacity
allocation: the proposed "XENMEM_claim_pages" hypercall.

HYPERVISOR-BASED CAPACITY ALLOCATION

The posted patch for the claim hypercall is quite simple, but let's
look at it in detail.  The claim hypercall is actually a subop
of an existing hypercall.  After checking parameters for validity,
a new function is called in the core Xen memory management code.
This function takes the hypervisor heaplock, checks for a few
special cases, does some arithmetic to ensure a valid claim, stakes
the claim, releases the hypervisor heaplock, and then returns.  To
review from earlier, the hypervisor heaplock protects _all_ page/slab
allocations, so we can be absolutely certain that there are no other
page allocation races.  This new function is about 35 lines of code,
not counting comments.

The patch includes two other significant changes to the hypervisor:
First, when any adjustment to a domain's memory footprint is made
(either through a toolstack-aware hypercall or one of the three
toolstack-unaware methods described above), the heaplock is
taken, arithmetic is done, and the heaplock is released.  This
is 12 lines of code.  Second, when any memory is allocated within
Xen, a check must be made (with the heaplock already held) to
determine if, given a previous claim, the domain has exceeded
its upper bound, maxmem.  This code is a single conditional test.

With some declarations, but not counting the copious comments,
all told, the new code provided by the patch is well under 100 lines.
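
For readers who want something more concrete than the prose above, here is
a standalone illustration of the shape of those two changes.  It is not the
actual patch: the types are simplified, the locking is stubbed out, and
names such as outstanding_pages, outstanding_claims and total_avail_pages
are chosen to match the description rather than copied from the real code.

#include <errno.h>

typedef int spinlock_t;                     /* stand-in for Xen's spinlock   */
static void spin_lock(spinlock_t *l)   { (void)l; }
static void spin_unlock(spinlock_t *l) { (void)l; }

static spinlock_t heap_lock;                /* protects all page allocations */
static unsigned long total_avail_pages;     /* free pages in the buddy heap  */
static unsigned long outstanding_claims;    /* sum of all staked claims      */

struct domain {
    unsigned long tot_pages;                /* pages currently allocated     */
    unsigned long outstanding_pages;        /* unfulfilled part of its claim */
};

/* XENMEM_claim_pages-style operation: stake a claim for a total of
 * 'pages' pages (or cancel it with pages == 0). */
int claim_pages(struct domain *d, unsigned long pages)
{
    int rc = -ENOMEM;

    spin_lock(&heap_lock);

    /* Drop any previous claim before doing the arithmetic. */
    outstanding_claims -= d->outstanding_pages;
    d->outstanding_pages = 0;

    /* Only free pages that nobody else has claimed can back this claim. */
    if (pages <= d->tot_pages ||
        pages - d->tot_pages <= total_avail_pages - outstanding_claims) {
        d->outstanding_pages = pages > d->tot_pages ? pages - d->tot_pages : 0;
        outstanding_claims += d->outstanding_pages;
        rc = 0;
    }

    spin_unlock(&heap_lock);
    return rc;
}

/* Called (with heap_lock already held) whenever 'pages' pages are
 * allocated to 'd': the claim is consumed as the footprint grows. */
void claim_consume(struct domain *d, unsigned long pages)
{
    unsigned long used = pages < d->outstanding_pages ? pages
                                                      : d->outstanding_pages;

    d->outstanding_pages -= used;
    outstanding_claims   -= used;
    d->tot_pages         += pages;
    total_avail_pages    -= pages;
}

int main(void)
{
    struct domain d = { 0, 0 };

    total_avail_pages = 1UL << 20;          /* pretend 4GiB of free memory   */
    if (claim_pages(&d, 512 * 1024) == 0)   /* claim 2GiB for the new domain */
        claim_consume(&d, 512);             /* then allocate an order-9 slab */
    return 0;
}
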

What about the toolstack side?  First, it's important to note that
the toolstack changes are entirely optional.  If any toolstack
wishes either to not fix the original problem, or avoid toolstack-
unaware allocation completely by ignoring the functionality provided
by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
not use the new hypercall.  Second, it's very relevant to note that
the Oracle product uses a combination of a proprietary "manager"
which oversees many machines, and the older open-source xm/xend
toolstack, for which the current Xen toolstack maintainers are no
longer accepting patches.

The preface of the published patch does suggest, however, some
straightforward pseudo-code, as follows:

Current toolstack domain creation memory allocation code fragment:

1. call populate_physmap repeatedly to achieve mem=N memory
2. if any populate_physmap call fails, report -ENOMEM up the stack
3. memory is held until domain dies or the toolstack decreases it

Proposed toolstack domain creation memory allocation code fragment
(new code marked with "+"):

+  call claim for mem=N amount of memory
+  if claim succeeds:
1.  call populate_physmap repeatedly to achieve mem=N memory (failsafe)
+  else
2.  report -ENOMEM up the stack
+  claim is held until mem=N is achieved, or the domain dies, or it is
    forced to 0 by a second hypercall
3. memory is held until domain dies or the toolstack decreases it

Reviewing the pseudo-code, one can readily see that the toolstack
changes required to make use of the new hypercall are quite small.
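
In C, the "+"-marked lines might reduce to something like the sketch below.
xc_claim() and xc_populate() are placeholders standing in for whatever
hypercall wrappers a given toolstack uses; they are not actual libxc
function names.

#include <errno.h>

/* Placeholder: stake a claim for 'pages' (or drop it with pages == 0). */
static int xc_claim(unsigned int domid, unsigned long pages)
{
    (void)domid; (void)pages;
    return 0;
}

/* Placeholder for the existing populate_physmap loop (the slow part). */
static int xc_populate(unsigned int domid, unsigned long pages)
{
    (void)domid; (void)pages;
    return 0;
}

int build_domain_memory(unsigned int domid, unsigned long pages)
{
    int rc;

    if (xc_claim(domid, pages))     /* cheap: one heaplock round trip     */
        return -ENOMEM;             /* fail fast, before minutes of work  */

    rc = xc_populate(domid, pages); /* now failsafe with respect to RAM   */

    xc_claim(domid, 0);             /* the claim expires as pages arrive,
                                     * but drop any remainder explicitly  */
    return rc;
}

int main(void)
{
    return build_domain_memory(1, 256 * 1024) ? 1 : 0;  /* e.g. a 1GiB guest */
}
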

To complete this discussion, it has been pointed out that
the proposed hypercall doesn't solve the original problem
for certain classes of legacy domains... but also neither
does it make the problem worse.  It has also been pointed
out that the proposed patch is not (yet) NUMA-aware.

Now let's return to the earlier question:  There are three 
challenges to toolstack-based capacity allocation, which are
all handled easily by in-hypervisor capacity allocation. But we'd
really still like to do capacity allocation in the toolstack.
Can something be done in the toolstack to "fix" these three cases?

The answer is, of course, certainly... anything can be done in
software.  So, recalling Ian Jackson's stated requirement:

 "Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

we are now left to evaluate the subjective term "reasonably".

CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?

In earlier discussion on this topic, when page-splitting was raised
as a concern, some of the authors of Xen's page-sharing feature
pointed out that a mechanism could be designed such that "batches"
of pages were pre-allocated by the toolstack and provided to the
hypervisor to be utilized as needed for page-splitting.  Should the
batch run dry, the hypervisor could stop the domain that was provoking
the page-split until the toolstack could be consulted and the
toolstack, at its leisure, could request the hypervisor to refill
the batch, which then allows the page-split-causing domain to proceed.
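
Reduced to its essentials, the batch idea amounts to something like the
sketch below; the structure and names are purely hypothetical.

#include <errno.h>

#define BATCH_SIZE 256                 /* pages pre-allocated by the toolstack */

struct cow_batch {
    unsigned long mfn[BATCH_SIZE];     /* machine frames handed to Xen         */
    unsigned int  count;               /* how many are still unused            */
};

/* Hypothetical hypervisor-side handler for a write to a shared page:
 * take a pre-allocated page for the copy, or signal that the batch
 * needs refilling. */
static long split_shared_page(struct cow_batch *b)
{
    if (b->count == 0)
        return -EAGAIN;                /* pause the faulting domain, notify the
                                        * toolstack, wait for a refill          */
    return (long)b->mfn[--b->count];
}

int main(void)
{
    struct cow_batch batch = { { 0 }, 0 };
    return split_shared_page(&batch) == -EAGAIN ? 0 : 1;
}
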

But this batch page-allocation isn't implemented in Xen today.

Andres Lagar-Cavilla says "... this is because of shortcomings in the
[Xen] mm layer and its interaction with wait queues, documented
elsewhere."  In other words, this batching proposal requires
significant changes to the hypervisor, which I think we
all agreed we were trying to avoid.

[Note to Andres: I'm not objecting to the need for this functionality
for page-sharing to work with proprietary kernels and DMC; just
pointing out that it, too, is dependent on further hypervisor changes.]

Such an approach makes sense in the min==max model enforced by
DMC but, again, DMC is not prescribed by the toolstack.

Further, this waitqueue solution for page-splitting only awkwardly
works around in-guest ballooning (probably only with more hypervisor
changes, TBD) and would be useless for tmem.  [IIGT: Please argue
this last point only if you feel confident you truly understand how
tmem works.]

So this as-yet-unimplemented solution only really solves a part
of the problem.

Are there any other possibilities proposed?  Ian Jackson has
suggested a somewhat different approach:

Let me quote Ian Jackson again:

"Of course if it is really desired to have each guest make its own
decisions and simply for them to somehow agree to divvy up the
available resources, then even so a new hypervisor mechanism is
not needed.  All that is needed is a way for those guests to
synchronise their accesses and updates to shared records of the
available and in-use memory."

Ian then goes on to say:  "I don't have a detailed counter-proposal
design of course..."

This proposal is certainly possible, but I think most would agree that
it would require some fairly massive changes in OS memory management
design that would run contrary to many years of computing history.
It requires guest OS's to cooperate with each other about basic memory
management decisions.  And to work for tmem, it would require
communication from atomic code in the kernel to user-space, then
communication from user-space in a guest to user-space in domain0,
and then (presumably... I don't have a design either) back again.
One must also wonder what the performance impact would be.

CONCLUDING REMARKS

"Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

I think this document has described a real customer problem and
a good solution that could be implemented either in the toolstack
or in the hypervisor.  Memory allocation in existing Xen functionality
has been shown to interfere significantly with the toolstack-based
solution and suggested partial solutions to those issues either
require even more hypervisor work, or are completely undesigned and,
at least, call into question the definition of "reasonably".

The hypervisor-based solution has been shown to be extremely
simple, fits very logically with existing Xen memory management
mechanisms/code, and has been reviewed through several iterations
by Xen hypervisor experts.

While I understand completely the Xen maintainers' desire to
fend off unnecessary additions to the hypervisor, I believe
XENMEM_claim_pages is a reasonable and natural hypervisor feature
and I hope you will now Ack the patch.

Acknowledgements: Thanks very much to Konrad for his thorough
read-through and for suggestions on how to soften my combative
style which may have alienated the maintainers more than the
proposal itself.

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2013-02-14 13:32 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <mailman.18000.1354568068.1399.xen-devel@lists.xen.org>
2012-12-04  3:24 ` Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions Andres Lagar-Cavilla
2012-12-18 22:17   ` Konrad Rzeszutek Wilk
2012-12-19 12:53     ` George Dunlap
2012-12-19 13:48       ` George Dunlap
2013-01-03 20:38         ` Dan Magenheimer
2013-01-02 21:59       ` Konrad Rzeszutek Wilk
2013-01-14 18:28         ` George Dunlap
2013-01-22 21:57           ` Konrad Rzeszutek Wilk
2013-01-23 18:36             ` Dave Scott
2013-02-12 15:38               ` Konrad Rzeszutek Wilk
2012-12-20 16:04     ` Tim Deegan
2013-01-02 15:31       ` Andres Lagar-Cavilla
2013-01-02 21:43         ` Dan Magenheimer
2013-01-03 16:25           ` Andres Lagar-Cavilla
2013-01-03 18:49             ` Dan Magenheimer
2013-01-07 14:43               ` Ian Campbell
2013-01-07 18:41                 ` Dan Magenheimer
2013-01-08  9:03                   ` Ian Campbell
2013-01-08 19:41                     ` Dan Magenheimer
2013-01-09 10:41                       ` Ian Campbell
2013-01-09 14:44                         ` Dan Magenheimer
2013-01-09 14:58                           ` Ian Campbell
2013-01-14 15:45                           ` George Dunlap
2013-01-14 18:18                             ` Dan Magenheimer
2013-01-14 19:42                               ` George Dunlap
2013-01-14 23:14                                 ` Dan Magenheimer
2013-01-23 12:18                                   ` Ian Campbell
2013-01-23 17:34                                     ` Dan Magenheimer
2013-02-12 16:18                                     ` Konrad Rzeszutek Wilk
2013-01-10 10:31                       ` Ian Campbell
2013-01-10 18:42                         ` Dan Magenheimer
2013-01-02 21:38       ` Dan Magenheimer
2013-01-03 16:24         ` Andres Lagar-Cavilla
2013-01-03 18:33           ` Dan Magenheimer
2013-01-10 17:13         ` Tim Deegan
2013-01-10 21:43           ` Dan Magenheimer
2013-01-17 15:12             ` Tim Deegan
2013-01-17 15:26               ` Andres Lagar-Cavilla
2013-01-22 19:22               ` Dan Magenheimer
2013-01-23 12:18                 ` Ian Campbell
2013-01-23 16:05                   ` Dan Magenheimer
2013-01-02 15:29     ` Andres Lagar-Cavilla
2013-01-11 16:03       ` Konrad Rzeszutek Wilk
2013-01-11 16:13         ` Andres Lagar-Cavilla
2013-01-11 19:08           ` Konrad Rzeszutek Wilk
2013-01-14 16:00             ` George Dunlap
2013-01-14 16:11               ` Andres Lagar-Cavilla
2013-01-17 15:16             ` Tim Deegan
2013-01-18 21:45               ` Konrad Rzeszutek Wilk
2013-01-21 10:29                 ` Tim Deegan
2013-02-12 15:54                   ` Konrad Rzeszutek Wilk
2013-02-14 13:32                     ` Konrad Rzeszutek Wilk
2012-12-03 20:54 Dan Magenheimer
