Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

* Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
       [not found] <mailman.18000.1354568068.1399.xen-devel@lists.xen.org>
@ 2012-12-04  3:24 ` Andres Lagar-Cavilla
  2012-12-18 22:17   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 53+ messages in thread
From: Andres Lagar-Cavilla @ 2012-12-04  3:24 UTC (permalink / raw)
  To: xen-devel
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, George Dunlap, Ian Jackson, Tim Deegan,
	Jan Beulich

> I earlier promised a complete analysis of the problem
> addressed by the proposed claim hypercall as well as
> an analysis of the alternate solutions.  I had not
> yet provided these analyses when I asked for approval
> to commit the hypervisor patch, so there was still
> a good amount of misunderstanding, and I am trying
> to fix that here.
> 
> I had hoped this essay could be both concise and complete
> but quickly found it to be impossible to be both at the
> same time.  So I have erred on the side of verbosity,
> but also have attempted to ensure that the analysis
> flows smoothly and is understandable to anyone interested
> in learning more about memory allocation in Xen.
> I'd appreciate feedback from other developers to understand
> if I've also achieved that goal.
> 
> Ian, Ian, George, and Tim -- I have tagged a few
> out-of-flow questions to you with [IIGF].  If I lose
> you at any point, I'd especially appreciate your feedback
> at those points.  I trust that, first, you will read
> this completely.  As I've said, I understand that
> Oracle's paradigm may differ in many ways from your
> own, so I also trust that you will read it completely
> with an open mind.
> 
> Thanks,
> Dan
> 
> PROBLEM STATEMENT OVERVIEW
> 
> The fundamental problem is a race; two entities are
> competing for part or all of a shared resource: in this case,
> physical system RAM.  Normally, a lock is used to mediate
> a race.
> 
> For memory allocation in Xen, there are two significant
> entities, the toolstack and the hypervisor.  And, in
> general terms, there are currently two important locks:
> one used in the toolstack for domain creation;
> and one in the hypervisor used for the buddy allocator.
> 
> Considering first only domain creation, the toolstack
> lock is taken to ensure that domain creation is serialized.
> The lock is taken when domain creation starts, and released
> when domain creation is complete.
> 
> As system and domain memory requirements grow, the amount
> of time to allocate all necessary memory to launch a large
> domain is growing and may now exceed several minutes, so
> this serialization is increasingly problematic.  The result
> is a customer reported problem:  If a customer wants to
> launch two or more very large domains, the "wait time"
> required by the serialization is unacceptable.
> 
> Oracle would like to solve this problem.  And Oracle
> would like to solve this problem not just for a single
> customer sitting in front of a single machine console, but
> for the very complex case of a large number of machines,
> with the "agent" on each machine taking independent
> actions including automatic load balancing and power
> management via migration.
Hi Dan,
an issue with your reasoning throughout has been the constant invocation of the multi host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi host VM management).

>  (This complex environment
> is sold by Oracle today; it is not a "future vision".)
> 
> [IIGT] Completely ignoring any possible solutions to this
> problem, is everyone in agreement that this _is_ a problem
> that _needs_ to be solved with _some_ change in the Xen
> ecosystem?
> 
> SOME IMPORTANT BACKGROUND INFORMATION
> 
> In the subsequent discussion, it is important to
> understand a few things:
> 
> While the toolstack lock is held, allocating memory for
> the domain creation process is done as a sequence of one
> or more hypercalls, each asking the hypervisor to allocate
> one or more -- "X" -- slabs of physical RAM, where a slab
> is 2**N contiguous aligned pages, also known as an
> "order N" allocation.  While the hypercall is defined
> to work with any value of N, common values are N=0
> (individual pages), N=9 ("hugepages" or "superpages"),
> and N=18 ("1GiB pages").  So, for example, if the toolstack
> requires 201MiB of memory, it will make two hypercalls:
> One with X=100 and N=9, and one with X=1 and N=0.
> 
> While the toolstack may ask for a smaller number X of
> order==9 slabs, system fragmentation may unpredictably
> cause the hypervisor to fail the request, in which case
> the toolstack will fall back to a request for 512*X
> individual pages.  If there is sufficient RAM in the system,
> this request for order==0 pages is guaranteed to succeed.
> Thus for a 1TiB domain, the hypervisor must be prepared
> to allocate up to 256Mi individual pages.
> 
> Note carefully that when the toolstack hypercall asks for
> 100 slabs, the hypervisor "heaplock" is currently taken
> and released 100 times.  Similarly, for 256M individual
> pages... 256 million spin_lock-alloc_page-spin_unlocks.
> This means that domain creation is not "atomic" inside
> the hypervisor, which means that races can and will still
> occur.
> 
> RULING OUT SOME SIMPLE SOLUTIONS
> 
> Is there an elegant simple solution here?
> 
> Let's first consider the possibility of removing the toolstack
> serialization entirely and/or the possibility that two
> independent toolstack threads (or "agents") can simultaneously
> request a very large domain creation in parallel.  As described
> above, the hypervisor's heaplock is insufficient to serialize RAM
> allocation, so the two domain creation processes race.  If there
> is sufficient resource for either one to launch, but insufficient
> resource for both to launch, the winner of the race is indeterminate,
> and one or both launches will fail, possibly after one or both 
> domain creation threads have been working for several minutes.
> This is a classic "TOCTOU" (time-of-check-time-of-use) race.
> If a customer is unhappy waiting several minutes to launch
> a domain, they will be even more unhappy waiting for several
> minutes to be told that one or both of the launches has failed.
> Multi-minute failure is even more unacceptable for an automated
> agent trying to, for example, evacuate a machine that the
> data center administrator needs to powercycle.
> 
> [IIGT: Please hold your objections for a moment... the paragraph
> above is discussing the simple solution of removing the serialization;
> your suggested solution will be discussed soon.]
> 
> Next, let's consider the possibility of changing the heaplock
> strategy in the hypervisor so that the lock is held not
> for one slab but for the entire request of N slabs.  As with
> any core hypervisor lock, holding the heaplock for a "long time"
> is unacceptable.  To a hypervisor, several minutes is an eternity.
> And, in any case, by serializing domain creation in the hypervisor,
> we have really only moved the problem from the toolstack into
> the hypervisor, not solved the problem.
> 
> [IIGT] Are we in agreement that these simple solutions can be
> safely ruled out?
> 
> CAPACITY ALLOCATION VS RAM ALLOCATION
> 
> Looking for a creative solution, one may realize that it is the
> page allocation -- especially in large quantities -- that is very
> time-consuming.  But, thinking outside of the box, it is not
> the actual pages of RAM that we are racing on, but the quantity of pages required to launch a domain!  If we instead have a way to
> "claim" a quantity of pages cheaply now and then allocate the actual
> physical RAM pages later, we have changed the race to require only serialization of the claiming process!  In other words, if some entity
> knows the number of pages available in the system, and can "claim"
> N pages for the benefit of a domain being launched, the successful launch of the domain can be ensured.  Well... the domain launch may
> still fail for an unrelated reason, but not due to a memory TOCTOU
> race.  But, in this case, if the cost (in time) of the claiming
> process is very small compared to the cost of the domain launch,
> we have solved the memory TOCTOU race with hardly any delay added
> to a non-memory-related failure that would have occurred anyway.
> 
> This "claim" sounds promising.  But we have made an assumption that
> an "entity" has certain knowledge.  In the Xen system, that entity
> must be either the toolstack or the hypervisor.  Or, in the Oracle
> environment, an "agent"... but an agent and a toolstack are similar
> enough for our purposes that we will just use the more broadly-used
> term "toolstack".  In using this term, however, it's important to
> remember it is necessary to consider the existence of multiple
> threads within this toolstack.
> 
> Now I quote Ian Jackson: "It is a key design principle of a system
> like Xen that the hypervisor should provide only those facilities
> which are strictly necessary.  Any functionality which can be
> reasonably provided outside the hypervisor should be excluded
> from it."
> 
> So let's examine the toolstack first.
> 
> [IIGT] Still all on the same page (pun intended)?
> 
> TOOLSTACK-BASED CAPACITY ALLOCATION
> 
> Does the toolstack know how many physical pages of RAM are available?
> Yes, it can use a hypercall to find out this information after Xen and
> dom0 launch, but before it launches any domain.  Then if it subtracts
> the number of pages used when it launches a domain and is aware of
> when any domain dies, and adds them back, the toolstack has a pretty
> good estimate.  In actuality, the toolstack doesn't _really_ know the
> exact number of pages used when a domain is launched, but there
> is a poorly-documented "fuzz factor"... the toolstack knows the
> number of pages within a few megabytes, which is probably close enough.
> 
> This is a fairly good description of how the toolstack works today
> and the accounting seems simple enough, so does toolstack-based
> capacity allocation solve our original problem?  It would seem so.
> Even if there are multiple threads, the accounting -- not the extended
> sequence of page allocation for the domain creation -- can be
> serialized by a lock in the toolstack.  But note carefully, either
> the toolstack and the hypervisor must always be in sync on the
> number of available pages (within an acceptable margin of error);
> or any query to the hypervisor _and_ the toolstack-based claim must
> be paired atomically, i.e. the toolstack lock must be held across
> both.  Otherwise we again have another TOCTOU race. Interesting,
> but probably not really a problem.
> 
> Wait, isn't it possible for the toolstack to dynamically change the
> number of pages assigned to a domain?  Yes, this is often called
> ballooning and the toolstack can do this via a hypercall.  But

> that's still OK because each call goes through the toolstack and
> it simply needs to add more accounting for when it uses ballooning
> to adjust the domain's memory footprint.  So we are still OK.
> 
> But wait again... that brings up an interesting point.  Are there
> any significant allocations that are done in the hypervisor without
> the knowledge and/or permission of the toolstack?  If so, the
> toolstack may be missing important information.
> 
> So are there any such allocations?  Well... yes. There are a few.
> Let's take a moment to enumerate them:
> 
> A) In Linux, a privileged user can write to a sysfs file which writes
> to the balloon driver which makes hypercalls from the guest kernel to

A fairly bizarre limitation of a balloon-based approach to memory management. Why on earth should the guest be allowed to change the size of its balloon, and therefore its footprint on the host. This may be justified with arguments pertaining to the stability of the in-guest workload. What they really reveal are limitations of ballooning. But the inadequacy of the balloon in itself doesn't automatically translate into justifying the need for a new hyper call.

> the hypervisor, which adjusts the domain memory footprint, which changes the number of free pages _without_ the toolstack knowledge.
> The toolstack controls constraints (essentially a minimum and maximum)
> which the hypervisor enforces.  The toolstack can ensure that the
> minimum and maximum are identical to essentially disallow Linux from
> using this functionality.  Indeed, this is precisely what Citrix's
> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has complete control and, so, knowledge of any domain memory
> footprint changes.  But DMC is not prescribed by the toolstack,

Neither is enforcing min==max. This was my argument when previously commenting on this thread. The fact that you have enforcement of a maximum domain allocation gives you an excellent tool to keep a domain's unsupervised growth at bay. The toolstack can choose how fine-grained, how often to be alerted and stall the domain.

> and some real Oracle Linux customers use and depend on the flexibility
> provided by in-guest ballooning.   So guest-privileged-user-driven-
> ballooning is a potential issue for toolstack-based capacity allocation.
> 
> [IIGT: This is why I have brought up DMC several times and have
> called this the "Citrix model,".. I'm not trying to be snippy
> or impugn your morals as maintainers.]
> 
> B) Xen's page sharing feature has slowly been completed over a number
> of recent Xen releases.  It takes advantage of the fact that many
> pages often contain identical data; the hypervisor merges them to save

Great care has been taken for this statement to not be exactly true. The hypervisor discards one of two pages that the toolstack tells it to (and patches the physmap of the VM previously pointing to the discard page). It doesn't merge, nor does it look into contents. The hypervisor doesn't care about the page contents. This is deliberate, so as to avoid spurious claims of "you are using technique X!"

> physical RAM.  When any "shared" page is written, the hypervisor
> "splits" the page (aka, copy-on-write) by allocating a new physical
> page.  There is a long history of this feature in other virtualization
> products and it is known to be possible that, under many circumstances, thousands of splits may occur in any fraction of a second.  The
> hypervisor does not notify or ask permission of the toolstack.
> So, page-splitting is an issue for toolstack-based capacity
> allocation, at least as currently coded in Xen.
> 
> [Andre: Please hold your objection here until you read further.]

Name is Andres. And please cc me if you'll be addressing me directly!

Note that I don't disagree with your previous statement in itself. Although "page-splitting" is fairly unique terminology, and confusing (at least to me). CoW works.

> 
> C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
> toolstack for over three years.  It depends on an in-guest-kernel
> adaptive technique to constantly adjust the domain memory footprint as
> well as hooks in the in-guest-kernel to move data to and from the
> hypervisor.  While the data is in the hypervisor's care, interesting
> memory-load balancing between guests is done, including optional
> compression and deduplication.  All of this has been in Xen since 2009
> and has been awaiting changes in the (guest-side) Linux kernel. Those
> changes are now merged into the mainstream kernel and are fully
> functional in shipping distros.
> 
> While a complete description of tmem's guest<->hypervisor interaction
> is beyond the scope of this document, it is important to understand
> that any tmem-enabled guest kernel may unpredictably request thousands
> or even millions of pages directly via hypercalls from the hypervisor in a fraction of a second with absolutely no interaction with the toolstack.  Further, the guest-side hypercalls that allocate pages
> via the hypervisor are done in "atomic" code deep in the Linux mm
> subsystem.
> 
> Indeed, if one truly understands tmem, it should become clear that
> tmem is fundamentally incompatible with toolstack-based capacity
> allocation. But let's stop discussing tmem for now and move on.

You have not discussed tmem pool thaw and freeze in this proposal.

> 
> OK.  So with existing code both in Xen and Linux guests, there are
> three challenges to toolstack-based capacity allocation.  We'd
> really still like to do capacity allocation in the toolstack.  Can
> something be done in the toolstack to "fix" these three cases?
> 
> Possibly.  But let's first look at hypervisor-based capacity
> allocation: the proposed "XENMEM_claim_pages" hypercall.
> 
> HYPERVISOR-BASED CAPACITY ALLOCATION
> 
> The posted patch for the claim hypercall is quite simple, but let's
> look at it in detail.  The claim hypercall is actually a subop
> of an existing hypercall.  After checking parameters for validity,
> a new function is called in the core Xen memory management code.
> This function takes the hypervisor heaplock, checks for a few
> special cases, does some arithmetic to ensure a valid claim, stakes
> the claim, releases the hypervisor heaplock, and then returns.  To
> review from earlier, the hypervisor heaplock protects _all_ page/slab
> allocations, so we can be absolutely certain that there are no other
> page allocation races.  This new function is about 35 lines of code,
> not counting comments.
> 
> The patch includes two other significant changes to the hypervisor:
> First, when any adjustment to a domain's memory footprint is made
> (either through a toolstack-aware hypercall or one of the three
> toolstack-unaware methods described above), the heaplock is
> taken, arithmetic is done, and the heaplock is released.  This
> is 12 lines of code.  Second, when any memory is allocated within
> Xen, a check must be made (with the heaplock already held) to
> determine if, given a previous claim, the domain has exceeded
> its upper bound, maxmem.  This code is a single conditional test.
> 
> With some declarations, but not counting the copious comments,
> all told, the new code provided by the patch is well under 100 lines.
> 
> What about the toolstack side?  First, it's important to note that
> the toolstack changes are entirely optional.  If any toolstack
> wishes either to not fix the original problem, or avoid toolstack-
> unaware allocation completely by ignoring the functionality provided
> by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
> not use the new hyper call.

You are ruling out any other possibility here. In particular, but not limited to, use of max_pages.

>  Second, it's very relevant to note that the Oracle product uses a combination of a proprietary "manager"
> which oversees many machines, and the older open-source xm/xend
> toolstack, for which the current Xen toolstack maintainers are no
> longer accepting patches.
> 
> The preface of the published patch does suggest, however, some
> straightforward pseudo-code, as follows:
> 
> Current toolstack domain creation memory allocation code fragment:
> 
> 1. call populate_physmap repeatedly to achieve mem=N memory
> 2. if any populate_physmap call fails, report -ENOMEM up the stack
> 3. memory is held until domain dies or the toolstack decreases it
> 
> Proposed toolstack domain creation memory allocation code fragment
> (new code marked with "+"):
> 
> +  call claim for mem=N amount of memory
> +. if claim succeeds:
> 1.  call populate_physmap repeatedly to achieve mem=N memory (failsafe)
> +  else
> 2.  report -ENOMEM up the stack
> +  claim is held until mem=N is achieved or the domain dies or
>    forced to 0 by a second hypercall
> 3. memory is held until domain dies or the toolstack decreases it
> 
> Reviewing the pseudo-code, one can readily see that the toolstack
> changes required to implement the hypercall are quite small.
> 
> To complete this discussion, it has been pointed out that
> the proposed hypercall doesn't solve the original problem
> for certain classes of legacy domains... but also neither
> does it make the problem worse.  It has also been pointed
> out that the proposed patch is not (yet) NUMA-aware.
> 
> Now let's return to the earlier question:  There are three 
> challenges to toolstack-based capacity allocation, which are
> all handled easily by in-hypervisor capacity allocation. But we'd
> really still like to do capacity allocation in the toolstack.
> Can something be done in the toolstack to "fix" these three cases?
> 
> The answer is, of course, certainly... anything can be done in
> software.  So, recalling Ian Jackson's stated requirement:
> 
> "Any functionality which can be reasonably provided outside the
>  hypervisor should be excluded from it."
> 
> we are now left to evaluate the subjective term "reasonably".
> 
> CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?
> 
> In earlier discussion on this topic, when page-splitting was raised
> as a concern, some of the authors of Xen's page-sharing feature
> pointed out that a mechanism could be designed such that "batches"
> of pages were pre-allocated by the toolstack and provided to the
> hypervisor to be utilized as needed for page-splitting.  Should the
> batch run dry, the hypervisor could stop the domain that was provoking
> the page-split until the toolstack could be consulted and the toolstack, at its leisure, could request the hypervisor to refill
> the batch, which then allows the page-split-causing domain to proceed.
> 
> But this batch page-allocation isn't implemented in Xen today.
> 
> Andres Lagar-Cavilla says "... this is because of shortcomings in the
> [Xen] mm layer and its interaction with wait queues, documented
> elsewhere."  In other words, this batching proposal requires
> significant changes to the hypervisor, which I think we
> all agreed we were trying to avoid.

This is a misunderstanding. There is no connection between the batching proposal and what I was referring to in the quote. Certainly I never advocated for pre-allocations.

The "significant changes to the hypervisor" statement is FUD. Everyone you've addressed on this email makes significant changes to the hypervisor, under the proviso that they are necessary/useful changes.

The interactions between the mm layer and wait queues need fixing, sooner or later, claim hyper call or not. But they are not a blocker, they are essentially a race that may trigger under certain circumstances. That is why they remain a low priority fix.

> 
> [Note to Andre: I'm not objecting to the need for this functionality
> for page-sharing to work with proprietary kernels and DMC; just

Let me nip this at the bud. I use page sharing and other techniques in an environment that doesn't use Citrix's DMC, nor is focused only on proprietary kernels...

> pointing out that it, too, is dependent on further hypervisor changes.]

… with 4.2 Xen. It is not perfect and has limitations that I am trying to fix. But our product ships, and page sharing works for anyone who would want to consume it, independently of further hypervisor changes.

> 
> Such an approach makes sense in the min==max model enforced by
> DMC but, again, DMC is not prescribed by the toolstack.
> 
> Further, this waitqueue solution for page-splitting only awkwardly
> works around in-guest ballooning (probably only with more hypervisor
> changes, TBD) and would be useless for tmem.  [IIGT: Please argue
> this last point only if you feel confident you truly understand how
> tmem works.]

I will argue though that "waitqueue solution … ballooning" is not true. Ballooning has never needed nor it suddenly needs now hypervisor wait queues.

> 
> So this as-yet-unimplemented solution only really solves a part
> of the problem.

As per the previous comments, I don't see your characterization as accurate.

Andres
> 
> Are there any other possibilities proposed?  Ian Jackson has
> suggested a somewhat different approach:
> 
> Let me quote Ian Jackson again:
> 
> "Of course if it is really desired to have each guest make its own
> decisions and simply for them to somehow agree to divvy up the
> available resources, then even so a new hypervisor mechanism is
> not needed.  All that is needed is a way for those guests to
> synchronise their accesses and updates to shared records of the
> available and in-use memory."
> 
> Ian then goes on to say:  "I don't have a detailed counter-proposal
> design of course..."
> 
> This proposal is certainly possible, but I think most would agree that
> it would require some fairly massive changes in OS memory management
> design that would run contrary to many years of computing history.
> It requires guest OS's to cooperate with each other about basic memory
> management decisions.  And to work for tmem, it would require
> communication from atomic code in the kernel to user-space, then communication from user-space in a guest to user-space-in-domain0
> and then (presumably... I don't have a design either) back again.
> One must also wonder what the performance impact would be.
> 
> CONCLUDING REMARKS
> 
> "Any functionality which can be reasonably provided outside the
>  hypervisor should be excluded from it."
> 
> I think this document has described a real customer problem and
> a good solution that could be implemented either in the toolstack
> or in the hypervisor.  Memory allocation in existing Xen functionality
> has been shown to interfere significantly with the toolstack-based
> solution and suggested partial solutions to those issues either
> require even more hypervisor work, or are completely undesigned and,
> at least, call into question the definition of "reasonably".
> 
> The hypervisor-based solution has been shown to be extremely
> simple, fits very logically with existing Xen memory management
> mechanisms/code, and has been reviewed through several iterations
> by Xen hypervisor experts.
> 
> While I understand completely the Xen maintainers' desire to
> fend off unnecessary additions to the hypervisor, I believe
> XENMEM_claim_pages is a reasonable and natural hypervisor feature
> and I hope you will now Ack the patch.
> 
> Acknowledgements: Thanks very much to Konrad for his thorough
> read-through and for suggestions on how to soften my combative
> style which may have alienated the maintainers more than the
> proposal itself.
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 
> 
> End of Xen-devel Digest, Vol 94, Issue 22
> *****************************************

^ permalink raw reply	[flat|nested] 53+ messages in thread