All of lore.kernel.org
 help / color / mirror / Atom feed
* Proposed new "memory capacity claim" hypercall/feature
@ 2012-10-29 17:06 Dan Magenheimer
  2012-10-29 18:24 ` Keir Fraser
  2012-10-29 22:35 ` Tim Deegan
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-29 17:06 UTC (permalink / raw)
  To: Keir (Xen.org), Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

Keir, Jan (et al) --

In a recent long thread [1], there was a great deal of discussion
about the possible need for a "memory reservation" hypercall.
While there was some confusion due to the two worldviews of static
vs dynamic management of physical memory capacity, one worldview
definitely has a requirement for this new capability.  It is still
uncertain whether the other worldview will benefit as well, though
I believe it eventually will, especially when page sharing is
fully deployed.

Note that to avoid confusion with existing usages of various
terms (such as "reservation"), I am now using the distinct
word "claim" as in a "land claim" or "mining claim":
http://dictionary.cambridge.org/dictionary/british/stake-a-claim 
When a toolstack creates a domain, it can first "stake a claim"
to the amount of memory capacity necessary to ensure the domain
launch will succeed.

In order to explore feasibility, I wanted to propose a possible
hypervisor design and would very much appreciate feedback!

The objective of the design is to ensure that a multi-threaded
toolstack can atomically claim a specific amount of RAM capacity for a
domain, especially in the presence of independent dynamic memory demand
(such as tmem and selfballooning) which the toolstack is not able to track.
"Claim X 50G" means that, on completion of the call, either (A) 50G of
capacity has been claimed for use by domain X and the call returns
success or (B) the call returns failure.  Note that in the above,
"claim" explicitly does NOT mean that specific physical RAM pages have
been assigned, only that the 50G of RAM capacity is not available either
to a subsequent "claim" or for most[2] independent dynamic memory demands.

I think the underlying hypervisor issue is that the current process
of "reserving" memory capacity (which currently does assign specific
physical RAM pages) is, by necessity when used for large quantities of RAM,
batched and slow and, consequently, can NOT be atomic.  One way to think
of the newly proposed "claim" is as "lazy reserving":  The capacity is
set aside even though specific physical RAM pages have not been assigned.
In another way, claiming is really just an accounting illusion, similar
to how an accountant must "accrue" future liabilities.

Hypervisor design/implementation overview:

A domain currently does RAM accounting with two primary counters
"tot_pages" and "max_pages".  (For now, let's ignore shr_pages,
paged_pages, and xenheap_pages, and I hope Olaf/Andre/others can
provide further expertise and input.)

Tot_pages is a struct_domain element in the hypervisor that tracks
the number of physical RAM pageframes "owned" by the domain.  The
hypervisor enforces that tot_pages is never allowed to exceed another
struct_domain element called max_pages.

I would like to introduce a new counter, which records how
much capacity is claimed for a domain which may or may not yet be
mapped to physical RAM pageframes.  To do so, I'd like to split
the concept of tot_pages into two variables, tot_phys_pages and
tot_claimed_pages and require the hypervisor to also enforce:

d.tot_phys_pages <= d.tot_claimed_pages[3] <= d.max_pages

I'd also split the hypervisor global "total_avail_pages" into
"total_free_pages" and "total_unclaimed_pages".  (I'm definitely
going to need to study more the two-dimensional array "avail"...)
The hypervisor must now do additional accounting to keep track
of the sum of claims across all domains and also enforce the
global:

total_unclaimed_pages <= total_free_pages

I think the memory_op hypercall can be extended to add two
additional subops, XENMEM_claim and XENMEM_release.  (Note: To
support tmem, there will need to be two variations of XEN_claim,
"hard claim" and "soft claim" [3].)  The XEN_claim subop atomically
evaluates total_unclaimed_pages against the new claim, claims
the pages for the domain if possible and returns success or failure.
The XEN_release "unsets" the domain's tot_claimed_pages (to an
"illegal" value such as zero or MINUS_ONE).

The hypervisor must also enforce some semantics:  If an allocation
occurs such that a domain's tot_phys_pages would equal or exceed
d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
This enforces the temporary nature of a claim:  Once a domain
fully "occupies" its claim, the claim silently expires.

In the case of a dying domain, a XENMEM_release operation
is implied and must be executed by the hypervisor.

Ideally, the quantity of unclaimed memory for each domain and
for the system should be query-able.  This may require additional
memory_op hypercalls.

I'd very much appreciate feedback on this proposed design!

Thanks,
Dan

[1] http://lists.xen.org/archives/html/xen-devel/2012-09/msg02229.html
    and continued in October (the archives don't thread across months)
    http://lists.xen.org/archives/html/xen-devel/2012-10/msg00080.html 
[2] Pages used to store tmem "ephemeral" data may be an exception
    because those pages are "free-on-demand".
[3] I'd be happy to explain the minor additional work necessary to
    support tmem but have mostly left it out of the proposal for clarity.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 17:06 Proposed new "memory capacity claim" hypercall/feature Dan Magenheimer
@ 2012-10-29 18:24 ` Keir Fraser
  2012-10-29 21:08   ` Dan Magenheimer
  2012-10-29 22:35 ` Tim Deegan
  1 sibling, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-10-29 18:24 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

On 29/10/2012 18:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> The objective of the design is to ensure that a multi-threaded
> toolstack can atomically claim a specific amount of RAM capacity for a
> domain, especially in the presence of independent dynamic memory demand
> (such as tmem and selfballooning) which the toolstack is not able to track.
> "Claim X 50G" means that, on completion of the call, either (A) 50G of
> capacity has been claimed for use by domain X and the call returns
> success or (B) the call returns failure.  Note that in the above,
> "claim" explicitly does NOT mean that specific physical RAM pages have
> been assigned, only that the 50G of RAM capacity is not available either
> to a subsequent "claim" or for most[2] independent dynamic memory demands.

I don't really understand the problem it solves, to be honest. Why would you
not just allocate the RAM pages, rather than merely making that amount of
memory unallocatable for any other purpose?

 -- Keir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 18:24 ` Keir Fraser
@ 2012-10-29 21:08   ` Dan Magenheimer
  2012-10-29 22:22     ` Keir Fraser
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-29 21:08 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 29/10/2012 18:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 
> > The objective of the design is to ensure that a multi-threaded
> > toolstack can atomically claim a specific amount of RAM capacity for a
> > domain, especially in the presence of independent dynamic memory demand
> > (such as tmem and selfballooning) which the toolstack is not able to track.
> > "Claim X 50G" means that, on completion of the call, either (A) 50G of
> > capacity has been claimed for use by domain X and the call returns
> > success or (B) the call returns failure.  Note that in the above,
> > "claim" explicitly does NOT mean that specific physical RAM pages have
> > been assigned, only that the 50G of RAM capacity is not available either
> > to a subsequent "claim" or for most[2] independent dynamic memory demands.
> 
> I don't really understand the problem it solves, to be honest. Why would you
> not just allocate the RAM pages, rather than merely making that amount of
> memory unallocatable for any other purpose?

Hi Keir --

Thanks for the response!

Sorry, I guess the answer to your question is buried in the
thread referenced (as [1]) plus a vague mention in this proposal.

The core issue is that, in the hypervisor, every current method of
"allocating RAM" is slow enough that if you want to allocate millions
of pages (e.g. for a large domain), the total RAM can't be allocated
atomically.  In fact, it may even take minutes, so currently a large
allocation is explicitly preemptible, not atomic.

The problems the proposal solves are (1) some toolstacks (including
Oracle's "cloud orchestration layer") want to launch domains in parallel;
currently xl/xapi require launches to be serialized which isn't very
scalable in a large data center; and (2) tmem and/or other dynamic
memory mechanisms may be asynchronously absorbing small-but-significant
portions of RAM for other purposes during an attempted domain launch.
In either case, this is a classic race, and a large allocation may
unexpectedly fail, possibly even after several minutes, which is
unacceptable for a data center operator or for automated tools trying
to launch any very large domain.

Does that make sense?  I'm very open to other solutions, but the
only one I've heard so far was essentially "disallow independent
dynamic memory allocations" plus keep track of all "claiming" in the
toolstack.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 21:08   ` Dan Magenheimer
@ 2012-10-29 22:22     ` Keir Fraser
  2012-10-29 23:03       ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-10-29 22:22 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> The core issue is that, in the hypervisor, every current method of
> "allocating RAM" is slow enough that if you want to allocate millions
> of pages (e.g. for a large domain), the total RAM can't be allocated
> atomically.  In fact, it may even take minutes, so currently a large
> allocation is explicitly preemptible, not atomic.
>
> The problems the proposal solves are (1) some toolstacks (including
> Oracle's "cloud orchestration layer") want to launch domains in parallel;
> currently xl/xapi require launches to be serialized which isn't very
> scalable in a large data center;

Well it does depend how scalable domain creation actually is as an
operation. If it is spending most of its time allocating memory then it is
quite likely that parallel creations will spend a lot of time competing for
the heap spinlock, and actually there will be little/no speedup compared
with serialising the creations. Further, if domain creation can take
minutes, it may be that we simply need to go optimise that -- we already
found one stupid thing in the heap allocator recently that was burining
loads of time during large-memory domain creations, and fixed it for a
massive speedup in that particular case.

> and (2) tmem and/or other dynamic
> memory mechanisms may be asynchronously absorbing small-but-significant
> portions of RAM for other purposes during an attempted domain launch.

This is an argument against allocate-rather-than-reserve? I don't think that
makes sense -- so is this instead an argument against
reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
need reservations *at all*, before we get down to where it should be
implemented.

 -- Keir

> In either case, this is a classic race, and a large allocation may
> unexpectedly fail, possibly even after several minutes, which is
> unacceptable for a data center operator or for automated tools trying
> to launch any very large domain.
> 
> Does that make sense?  I'm very open to other solutions, but the
> only one I've heard so far was essentially "disallow independent
> dynamic memory allocations" plus keep track of all "claiming" in the
> toolstack.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 17:06 Proposed new "memory capacity claim" hypercall/feature Dan Magenheimer
  2012-10-29 18:24 ` Keir Fraser
@ 2012-10-29 22:35 ` Tim Deegan
  2012-10-29 23:21   ` Dan Magenheimer
  2012-11-01  2:13   ` Dario Faggioli
  1 sibling, 2 replies; 58+ messages in thread
From: Tim Deegan @ 2012-10-29 22:35 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, xen-devel, Dario Faggioli, Zhigang Wang,
	Ian Jackson

At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:
> Hypervisor design/implementation overview:
> 
> A domain currently does RAM accounting with two primary counters
> "tot_pages" and "max_pages".  (For now, let's ignore shr_pages,
> paged_pages, and xenheap_pages, and I hope Olaf/Andre/others can
> provide further expertise and input.)
> 
> Tot_pages is a struct_domain element in the hypervisor that tracks
> the number of physical RAM pageframes "owned" by the domain.  The
> hypervisor enforces that tot_pages is never allowed to exceed another
> struct_domain element called max_pages.
> 
> I would like to introduce a new counter, which records how
> much capacity is claimed for a domain which may or may not yet be
> mapped to physical RAM pageframes.  To do so, I'd like to split
> the concept of tot_pages into two variables, tot_phys_pages and
> tot_claimed_pages and require the hypervisor to also enforce:
> 
> d.tot_phys_pages <= d.tot_claimed_pages[3] <= d.max_pages
> 
> I'd also split the hypervisor global "total_avail_pages" into
> "total_free_pages" and "total_unclaimed_pages".  (I'm definitely
> going to need to study more the two-dimensional array "avail"...)
> The hypervisor must now do additional accounting to keep track
> of the sum of claims across all domains and also enforce the
> global:
> 
> total_unclaimed_pages <= total_free_pages
> 
> I think the memory_op hypercall can be extended to add two
> additional subops, XENMEM_claim and XENMEM_release.  (Note: To
> support tmem, there will need to be two variations of XEN_claim,
> "hard claim" and "soft claim" [3].)  The XEN_claim subop atomically
> evaluates total_unclaimed_pages against the new claim, claims
> the pages for the domain if possible and returns success or failure.
> The XEN_release "unsets" the domain's tot_claimed_pages (to an
> "illegal" value such as zero or MINUS_ONE).
> 
> The hypervisor must also enforce some semantics:  If an allocation
> occurs such that a domain's tot_phys_pages would equal or exceed
> d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> This enforces the temporary nature of a claim:  Once a domain
> fully "occupies" its claim, the claim silently expires.

Why does that happen?  If I understand you correctly, releasing the
claim is something the toolstack should do once it knows it's no longer
needed.

> In the case of a dying domain, a XENMEM_release operation
> is implied and must be executed by the hypervisor.
> 
> Ideally, the quantity of unclaimed memory for each domain and
> for the system should be query-able.  This may require additional
> memory_op hypercalls.
> 
> I'd very much appreciate feedback on this proposed design!

As I said, I'm not opposed to this, though even after reading through
the other thread I'm not convinced that it's necessary (except in cases
where guest-controlled operations are allowed to consume unbounded
memory, which frankly gives me the heebie-jeebies).

I think it needs a plan for handling restricted memory allocations.
For example, some PV guests need their memory to come below a
certain machine address, or entirely in superpages, and certain
build-time allocations come from xenheap.  How would you handle that
sort of thing?

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 22:22     ` Keir Fraser
@ 2012-10-29 23:03       ` Dan Magenheimer
  2012-10-29 23:17         ` Keir Fraser
  2012-10-30  9:11         ` George Dunlap
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-29 23:03 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

> From: Keir Fraser [mailto:keir@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 
> > The core issue is that, in the hypervisor, every current method of
> > "allocating RAM" is slow enough that if you want to allocate millions
> > of pages (e.g. for a large domain), the total RAM can't be allocated
> > atomically.  In fact, it may even take minutes, so currently a large
> > allocation is explicitly preemptible, not atomic.
> >
> > The problems the proposal solves are (1) some toolstacks (including
> > Oracle's "cloud orchestration layer") want to launch domains in parallel;
> > currently xl/xapi require launches to be serialized which isn't very
> > scalable in a large data center;
> 
> Well it does depend how scalable domain creation actually is as an
> operation. If it is spending most of its time allocating memory then it is
> quite likely that parallel creations will spend a lot of time competing for
> the heap spinlock, and actually there will be little/no speedup compared
> with serialising the creations. Further, if domain creation can take
> minutes, it may be that we simply need to go optimise that -- we already
> found one stupid thing in the heap allocator recently that was burining
> loads of time during large-memory domain creations, and fixed it for a
> massive speedup in that particular case.

I suppose ultimately it is a scalability question.  But Oracle's
measure of success here is based on how long a human or a tool
has to wait for confirmation to ensure that a domain will
successfully launch.  If two domains are launched in parallel
AND an indication is given that both will succeed, spinning on
the heaplock a bit just makes for a longer "boot" time, which is
just a cost of virtualization.  If they are launched in parallel
and, minutes later (or maybe even 20 seconds later), one or
both say "oops, I was wrong, there wasn't enough memory, so
try again", that's not OK for data center operations, especially if
there really was enough RAM for one, but not for both. Remember,
in the Oracle environment, we are talking about an administrator/automation
overseeing possibly hundreds of physical servers, not just a single
user/server.

Does that make more sense?

The "claim" approach immediately guarantees success or failure.
Unless there are enough "stupid things/optimisations" found that
you would be comfortable putting memory allocation for a domain
creation in a hypervisor spinlock, there will be a race unless
an atomic mechanism exists such as "claiming" where
only simple arithmetic must be done within a hypervisor lock.

Do you disagree?

> > and (2) tmem and/or other dynamic
> > memory mechanisms may be asynchronously absorbing small-but-significant
> > portions of RAM for other purposes during an attempted domain launch.
> 
> This is an argument against allocate-rather-than-reserve? I don't think that
> makes sense -- so is this instead an argument against
> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
> need reservations *at all*, before we get down to where it should be
> implemented.

I'm not sure if we are defining terms the same, so that's hard
to answer.  If you define "allocation" as "a physical RAM page frame
number is selected (and possibly the physical page is zeroed)",
then I'm not sure how your definition of "reservation" differs
(because that's how increase/decrease_reservation are implemented
in the hypervisor, right?).

Or did you mean "allocate-rather-than-claim" (where "allocate" is
select a specific physical pageframe and "claim" means do accounting
only?  If so, see the atomicity argument above.

I'm not just arguing against reservation-as-a-toolstack-mechanism,
I'm stating I believe unequivocally that reservation-as-a-toolstack-
only-mechanism and tmem are incompatible.  (Well, not _totally_
incompatible... the existing workaround, tmem freeze/thaw, works
but is also single-threaded and has fairly severe unnecessary
performance repercussions.  So I'd like to solve both problems
at the same time.)

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 23:03       ` Dan Magenheimer
@ 2012-10-29 23:17         ` Keir Fraser
  2012-10-30 15:13           ` Dan Magenheimer
  2012-10-30  9:11         ` George Dunlap
  1 sibling, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-10-29 23:17 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

On 30/10/2012 00:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> From: Keir Fraser [mailto:keir@xen.org]
>> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>> 
>> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> 
>> Well it does depend how scalable domain creation actually is as an
>> operation. If it is spending most of its time allocating memory then it is
>> quite likely that parallel creations will spend a lot of time competing for
>> the heap spinlock, and actually there will be little/no speedup compared
>> with serialising the creations. Further, if domain creation can take
>> minutes, it may be that we simply need to go optimise that -- we already
>> found one stupid thing in the heap allocator recently that was burining
>> loads of time during large-memory domain creations, and fixed it for a
>> massive speedup in that particular case.
> 
> I suppose ultimately it is a scalability question.  But Oracle's
> measure of success here is based on how long a human or a tool
> has to wait for confirmation to ensure that a domain will
> successfully launch.  If two domains are launched in parallel
> AND an indication is given that both will succeed, spinning on
> the heaplock a bit just makes for a longer "boot" time, which is
> just a cost of virtualization.  If they are launched in parallel
> and, minutes later (or maybe even 20 seconds later), one or
> both say "oops, I was wrong, there wasn't enough memory, so
> try again", that's not OK for data center operations, especially if
> there really was enough RAM for one, but not for both. Remember,
> in the Oracle environment, we are talking about an administrator/automation
> overseeing possibly hundreds of physical servers, not just a single
> user/server.
> 
> Does that make more sense?

Yes, that makes sense.

> The "claim" approach immediately guarantees success or failure.
> Unless there are enough "stupid things/optimisations" found that
> you would be comfortable putting memory allocation for a domain
> creation in a hypervisor spinlock, there will be a race unless
> an atomic mechanism exists such as "claiming" where
> only simple arithmetic must be done within a hypervisor lock.
> 
> Do you disagree?
> 
>>> and (2) tmem and/or other dynamic
>>> memory mechanisms may be asynchronously absorbing small-but-significant
>>> portions of RAM for other purposes during an attempted domain launch.
>> 
>> This is an argument against allocate-rather-than-reserve? I don't think that
>> makes sense -- so is this instead an argument against
>> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
>> need reservations *at all*, before we get down to where it should be
>> implemented.
> 
> I'm not sure if we are defining terms the same, so that's hard
> to answer.  If you define "allocation" as "a physical RAM page frame
> number is selected (and possibly the physical page is zeroed)",
> then I'm not sure how your definition of "reservation" differs
> (because that's how increase/decrease_reservation are implemented
> in the hypervisor, right?).
> 
> Or did you mean "allocate-rather-than-claim" (where "allocate" is
> select a specific physical pageframe and "claim" means do accounting
> only?  If so, see the atomicity argument above.
> 
> I'm not just arguing against reservation-as-a-toolstack-mechanism,
> I'm stating I believe unequivocally that reservation-as-a-toolstack-
> only-mechanism and tmem are incompatible.  (Well, not _totally_
> incompatible... the existing workaround, tmem freeze/thaw, works
> but is also single-threaded and has fairly severe unnecessary
> performance repercussions.  So I'd like to solve both problems
> at the same time.)

Okay, so why is tmem incompatible with implementing claims in the toolstack?

 -- Keir

> Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 22:35 ` Tim Deegan
@ 2012-10-29 23:21   ` Dan Magenheimer
  2012-10-30  8:13     ` Tim Deegan
  2012-10-30  8:29     ` Jan Beulich
  2012-11-01  2:13   ` Dario Faggioli
  1 sibling, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-29 23:21 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, xen-devel, Dario Faggioli, Zhigang Wang,
	Ian Jackson

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Monday, October 29, 2012 4:36 PM
> To: Dan Magenheimer
> Cc: Keir (Xen.org); Jan Beulich; George Dunlap; Olaf Hering; Ian Campbell; Konrad Wilk; xen-
> devel@lists.xen.org; George Shuklin; Dario Faggioli; Kurt Hackel; Ian Jackson; Zhigang Wang; Mukesh
> Rathor
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> > The hypervisor must also enforce some semantics:  If an allocation
> > occurs such that a domain's tot_phys_pages would equal or exceed
> > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > This enforces the temporary nature of a claim:  Once a domain
> > fully "occupies" its claim, the claim silently expires.
> 
> Why does that happen?  If I understand you correctly, releasing the
> claim is something the toolstack should do once it knows it's no longer
> needed.

Hi Tim --

Thanks for the feedback!

I haven't thought this all the way through yet, but I think this
part of the design allows the toolstack to avoid monitoring the
domain until "total_phys_pages" reaches "total_claimed" pages,
which should make the implementation of claims in the toolstack
simpler, especially in many-server environments.
 
> > In the case of a dying domain, a XENMEM_release operation
> > is implied and must be executed by the hypervisor.
> >
> > Ideally, the quantity of unclaimed memory for each domain and
> > for the system should be query-able.  This may require additional
> > memory_op hypercalls.
> >
> > I'd very much appreciate feedback on this proposed design!
> 
> As I said, I'm not opposed to this, though even after reading through
> the other thread I'm not convinced that it's necessary (except in cases
> where guest-controlled operations are allowed to consume unbounded
> memory, which frankly gives me the heebie-jeebies).

A really detailed discussion of tmem would probably be good but,
yes, with tmem, guest-controlled* operations can and frequently will
absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
memory used by the hypervisor on behalf of domains, not domain-owned
memory.

* "guest-controlled" I suspect is the heebie-jeebie word... in
  tmem, a better description might be "guest-controls-which-data-
  and-hypervisor-controls-how-many-pages"
 
> I think it needs a plan for handling restricted memory allocations.
> For example, some PV guests need their memory to come below a
> certain machine address, or entirely in superpages, and certain
> build-time allocations come from xenheap.  How would you handle that
> sort of thing?

Good point.  I think there's always been some uncertainty about
how to account for different zones and xenheap... are they part of the
domain's memory or not?  Deserves some more thought...  if
you can enumerate all such cases, that would be very helpful
(and probably valuable long-term documentation as well).

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 23:21   ` Dan Magenheimer
@ 2012-10-30  8:13     ` Tim Deegan
  2012-10-30 15:26       ` Dan Magenheimer
  2012-10-30  8:29     ` Jan Beulich
  1 sibling, 1 reply; 58+ messages in thread
From: Tim Deegan @ 2012-10-30  8:13 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, xen-devel, Dario Faggioli, Zhigang Wang,
	Ian Jackson

Hi, 

At 16:21 -0700 on 29 Oct (1351527686), Dan Magenheimer wrote:
> > > The hypervisor must also enforce some semantics:  If an allocation
> > > occurs such that a domain's tot_phys_pages would equal or exceed
> > > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > > This enforces the temporary nature of a claim:  Once a domain
> > > fully "occupies" its claim, the claim silently expires.
> > 
> > Why does that happen?  If I understand you correctly, releasing the
> > claim is something the toolstack should do once it knows it's no longer
> > needed.
> 
> I haven't thought this all the way through yet, but I think this
> part of the design allows the toolstack to avoid monitoring the
> domain until "total_phys_pages" reaches "total_claimed" pages,
> which should make the implementation of claims in the toolstack
> simpler, especially in many-server environments.

I think the toolstack has to monitor the domain for that long anyway,
since it will have to unpause it once it's built.  Relying on an
implicit release seems fragile -- if the builder ends up using only
(total_claimed - 1) pages, or temporarily allocating total_claimed and
then releasing some memory, things could break.

> > I think it needs a plan for handling restricted memory allocations.
> > For example, some PV guests need their memory to come below a
> > certain machine address, or entirely in superpages, and certain
> > build-time allocations come from xenheap.  How would you handle that
> > sort of thing?
> 
> Good point.  I think there's always been some uncertainty about
> how to account for different zones and xenheap... are they part of the
> domain's memory or not?

Xenheap pages are not part of the domain memory for accounting purposes;
likewise other 'anonymous' allocations (that is, anywhere that
alloc_domheap_pages() & friends are called with a NULL domain pointer).
Pages with restricted addresses are just accounted like any other
memory, except when they're on the free lists.

Today, toolstacks use a rule of thumb of how much extra space to leave
to cover those things -- if you want to pre-allocate them, you'll have
to go through the hypervisor making sure _all_ memory allocations are
accounted to the right domain somehow (maybe by generalizing the
shadow-allocation pool to cover all per-domain overheads).  That seems
like a useful side-effect of adding your new feature.

> Deserves some more thought...  if you can enumerate all such cases,
> that would be very helpful (and probably valuable long-term
> documentation as well).

I'm afraid I can't, not without re-reading all the domain-builder code
and a fair chunk of the hypervisor, so it's up to you to figure it out.

Tim.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 23:21   ` Dan Magenheimer
  2012-10-30  8:13     ` Tim Deegan
@ 2012-10-30  8:29     ` Jan Beulich
  2012-10-30 15:43       ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-10-30  8:29 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim Deegan, Olaf Hering, Keir (Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

>>> On 30.10.12 at 00:21, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Tim Deegan [mailto:tim@xen.org]
>> As I said, I'm not opposed to this, though even after reading through
>> the other thread I'm not convinced that it's necessary (except in cases
>> where guest-controlled operations are allowed to consume unbounded
>> memory, which frankly gives me the heebie-jeebies).
> 
> A really detailed discussion of tmem would probably be good but,
> yes, with tmem, guest-controlled* operations can and frequently will
> absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
> memory used by the hypervisor on behalf of domains, not domain-owned
> memory.
> 
> * "guest-controlled" I suspect is the heebie-jeebie word... in
>   tmem, a better description might be "guest-controls-which-data-
>   and-hypervisor-controls-how-many-pages"

But isn't tmem use supposed to be transparent in this respect, i.e.
if a "normal" allocation cannot be satisfied, tmem would jump in
and free sufficient space? In which case there's no need to do
any accounting outside of the control tools (leaving aside the
smaller hypervisor internal allocations, which the tool stack needs
to provide room for anyway).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 23:03       ` Dan Magenheimer
  2012-10-29 23:17         ` Keir Fraser
@ 2012-10-30  9:11         ` George Dunlap
  2012-10-30 16:13           ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: George Dunlap @ 2012-10-30  9:11 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir Fraser, Ian Campbell, Konrad Wilk,
	Tim (Xen.org),
	George Shuklin, xen-devel, Dario Faggioli, Kurt Hackel,
	Zhigang Wang, Ian Jackson

On Mon, Oct 29, 2012 at 6:06 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> Keir, Jan (et al) --
>
> In a recent long thread [1], there was a great deal of discussion
> about the possible need for a "memory reservation" hypercall.
> While there was some confusion due to the two worldviews of static
> vs dynamic management of physical memory capacity, one worldview
> definitely has a requirement for this new capability.

No, it does not.


> I'm not just arguing against reservation-as-a-toolstack-mechanism,
> I'm stating I believe unequivocally that reservation-as-a-toolstack-
> only-mechanism and tmem are incompatible.  (Well, not _totally_
> incompatible... the existing workaround, tmem freeze/thaw, works
> but is also single-threaded and has fairly severe unnecessary
> performance repercussions.  So I'd like to solve both problems
> at the same time.)

No, it is not.

Look, the *only* reason you have this problem is that *you yourselves*
programmed in two incompatible assumptions:

1. You have a toolstack that assumes it can ask "how much free memory
is there" from the HV and have that be an accurate answer, rather than
keeping track of this itself

2. You wrote the tmem code to do "self-ballooning", which for no good
reason, gives memory back to the hypervisor, rather than just keeping
it itself.

Basically #2 breaks the assumption of #1.  It has absolutely nothing
at all to do with tmem.  It's just a quirk of your particular
implementation of self-ballooning.

This new hypercall you're introducing is just a hack to fix the fact
that you've baked in incompatible assumptions.  It's completely
unnecessary.  All of the functionality you're describing can be
implemented outside of the hypervisor in the toolstack -- this would
fix #1.  Doing that would have no effect on tmem whatsoever.

Alternately, you could fix #2 -- have the "self-ballooning" mechanism
just allocate the memory to force the swapping to happen, but *not
hand it back to the hypervisor*.

We don't need this new hypercall.  You should just fix your own bugs
rather than introducing new hacks to work around them.

 -George

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 15:13           ` Dan Magenheimer
@ 2012-10-30 14:43             ` Keir Fraser
  2012-10-30 16:33               ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-10-30 14:43 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Rzeszutek Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

On 30/10/2012 16:13, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Okay, so why is tmem incompatible with implementing claims in the toolstack?
> 
> (Hmmm... maybe I could schedule the equivalent of a PhD qual exam
> for tmem with all the core Xen developers as examiners?)
> 
> The short answer is tmem moves memory capacity around far too
> frequently to be managed by a userland toolstack, especially if
> the "controller" lives on a central "manager machine" in a
> data center (Oracle's model).  The ebb and flow of memory supply
> and demand for each guest is instead managed entirely dynamically.

I don't know. I agree that fine-grained memory management is the duty of the
hypervisor, but it seems to me that the toolstack should be able to handle
admission control. It knows how much memory each existing guest is allowed
to consume at max, how much memory the new guest requires, how much memory
the system has total... Isn't the decision then simple? Tmem should be
fairly invisible to the toolstack, right?

 -- Keir

> The somewhat longer answer (and remember all of this is
> implemented and upstream in Xen and Linux today):
> 
> First, in the tmem model, each guest is responsible for driving
> its memory utilization (what Xen tools calls "current" and Xen
> hypervisor calls "tot_pages") as low as it can.  This is done
> in Linux with selfballooning.  At 50Hz (default), the guest
> kernel will attempt to expand or contract the balloon to match
> the guest kernel's current demand for memory.  Agreed, one guest
> requesting changes at 50Hz could probably be handled by
> a userland toolstack, but what about 100 guests?  Maybe...
> but there's more.
> 
> Second, in the tmem model, each guest is making tmem hypercalls
> at a rate of perhaps thousands per second, driven by the kernel
> memory management internals.  Each call deals with a single
> page of memory and each possibly may remove a page from (or
> return a page to) Xen's free list.  Interacting with a userland
> toolstack for each page is simply not feasible for this high
> of a frequency, even in a single guest.
> 
> Third, tmem in Xen implements both compression and deduplication
> so each attempt to put a page of data from the guest into
> the hypervisor may or may not require a new physical page.
> Only the hypervisor knows.
> 
> So, even on a single machine, tmem is tossing memory capacity
> about at a very very high frequency.  A userland toolstack can't
> possibly keep track, let alone hope to control it; that would
> entirely defeat the value of tmem.  It would be like requiring
> the toolstack to participate in every vcpu->pcpu transition
> in the Xen cpu scheduler.
> 
> Does that make sense and answer your question?
> 
> Anyway, I think the proposed "claim" hypercall/subop neatly
> solves the problem of races between large-chunk memory demands
> (i.e. large domain launches) and small-chunk memory demands
> (i.e. small domain launches and single-page tmem allocations).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 23:17         ` Keir Fraser
@ 2012-10-30 15:13           ` Dan Magenheimer
  2012-10-30 14:43             ` Keir Fraser
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 15:13 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 30/10/2012 00:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 
> >> From: Keir Fraser [mailto:keir@xen.org]
> >> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> >>
> >> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> >>
> >> Well it does depend how scalable domain creation actually is as an
> >> operation. If it is spending most of its time allocating memory then it is
> >> quite likely that parallel creations will spend a lot of time competing for
> >> the heap spinlock, and actually there will be little/no speedup compared
> >> with serialising the creations. Further, if domain creation can take
> >> minutes, it may be that we simply need to go optimise that -- we already
> >> found one stupid thing in the heap allocator recently that was burining
> >> loads of time during large-memory domain creations, and fixed it for a
> >> massive speedup in that particular case.
> >
> > I suppose ultimately it is a scalability question.  But Oracle's
> > measure of success here is based on how long a human or a tool
> > has to wait for confirmation to ensure that a domain will
> > successfully launch.  If two domains are launched in parallel
> > AND an indication is given that both will succeed, spinning on
> > the heaplock a bit just makes for a longer "boot" time, which is
> > just a cost of virtualization.  If they are launched in parallel
> > and, minutes later (or maybe even 20 seconds later), one or
> > both say "oops, I was wrong, there wasn't enough memory, so
> > try again", that's not OK for data center operations, especially if
> > there really was enough RAM for one, but not for both. Remember,
> > in the Oracle environment, we are talking about an administrator/automation
> > overseeing possibly hundreds of physical servers, not just a single
> > user/server.
> >
> > Does that make more sense?
> 
> Yes, that makes sense.

:)

So, not to beat a dead horse, but let me re-emphasize that the problem
exists even without considering tmem.  I wish to solve the problem,
but would like to do it in a way which also resolves a similar problem
for tmem.  I think the "claim" approach does that.
 
> > The "claim" approach immediately guarantees success or failure.
> > Unless there are enough "stupid things/optimisations" found that
> > you would be comfortable putting memory allocation for a domain
> > creation in a hypervisor spinlock, there will be a race unless
> > an atomic mechanism exists such as "claiming" where
> > only simple arithmetic must be done within a hypervisor lock.
> >
> > Do you disagree?
> >
> >>> and (2) tmem and/or other dynamic
> >>> memory mechanisms may be asynchronously absorbing small-but-significant
> >>> portions of RAM for other purposes during an attempted domain launch.
> >>
> >> This is an argument against allocate-rather-than-reserve? I don't think that
> >> makes sense -- so is this instead an argument against
> >> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
> >> need reservations *at all*, before we get down to where it should be
> >> implemented.
> >
> > I'm not sure if we are defining terms the same, so that's hard
> > to answer.  If you define "allocation" as "a physical RAM page frame
> > number is selected (and possibly the physical page is zeroed)",
> > then I'm not sure how your definition of "reservation" differs
> > (because that's how increase/decrease_reservation are implemented
> > in the hypervisor, right?).
> >
> > Or did you mean "allocate-rather-than-claim" (where "allocate" is
> > select a specific physical pageframe and "claim" means do accounting
> > only?  If so, see the atomicity argument above.
> >
> > I'm not just arguing against reservation-as-a-toolstack-mechanism,
> > I'm stating I believe unequivocally that reservation-as-a-toolstack-
> > only-mechanism and tmem are incompatible.  (Well, not _totally_
> > incompatible... the existing workaround, tmem freeze/thaw, works
> > but is also single-threaded and has fairly severe unnecessary
> > performance repercussions.  So I'd like to solve both problems
> > at the same time.)
> 
> Okay, so why is tmem incompatible with implementing claims in the toolstack?

(Hmmm... maybe I could schedule the equivalent of a PhD qual exam
for tmem with all the core Xen developers as examiners?)

The short answer is tmem moves memory capacity around far too
frequently to be managed by a userland toolstack, especially if
the "controller" lives on a central "manager machine" in a
data center (Oracle's model).  The ebb and flow of memory supply
and demand for each guest is instead managed entirely dynamically.

The somewhat longer answer (and remember all of this is
implemented and upstream in Xen and Linux today):

First, in the tmem model, each guest is responsible for driving
its memory utilization (what Xen tools calls "current" and Xen
hypervisor calls "tot_pages") as low as it can.  This is done
in Linux with selfballooning.  At 50Hz (default), the guest
kernel will attempt to expand or contract the balloon to match
the guest kernel's current demand for memory.  Agreed, one guest
requesting changes at 50Hz could probably be handled by
a userland toolstack, but what about 100 guests?  Maybe...
but there's more.

Second, in the tmem model, each guest is making tmem hypercalls
at a rate of perhaps thousands per second, driven by the kernel
memory management internals.  Each call deals with a single
page of memory and each possibly may remove a page from (or
return a page to) Xen's free list.  Interacting with a userland
toolstack for each page is simply not feasible for this high
of a frequency, even in a single guest.

Third, tmem in Xen implements both compression and deduplication
so each attempt to put a page of data from the guest into
the hypervisor may or may not require a new physical page.
Only the hypervisor knows.

So, even on a single machine, tmem is tossing memory capacity
about at a very very high frequency.  A userland toolstack can't
possibly keep track, let alone hope to control it; that would
entirely defeat the value of tmem.  It would be like requiring
the toolstack to participate in every vcpu->pcpu transition
in the Xen cpu scheduler.

Does that make sense and answer your question?

Anyway, I think the proposed "claim" hypercall/subop neatly
solves the problem of races between large-chunk memory demands
(i.e. large domain launches) and small-chunk memory demands
(i.e. small domain launches and single-page tmem allocations).

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30  8:13     ` Tim Deegan
@ 2012-10-30 15:26       ` Dan Magenheimer
  0 siblings, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 15:26 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, xen-devel, Dario Faggioli, Zhigang Wang,
	Ian Jackson

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> Hi,

Hi Tim!

> At 16:21 -0700 on 29 Oct (1351527686), Dan Magenheimer wrote:
> > > > The hypervisor must also enforce some semantics:  If an allocation
> > > > occurs such that a domain's tot_phys_pages would equal or exceed
> > > > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > > > This enforces the temporary nature of a claim:  Once a domain
> > > > fully "occupies" its claim, the claim silently expires.
> > >
> > > Why does that happen?  If I understand you correctly, releasing the
> > > claim is something the toolstack should do once it knows it's no longer
> > > needed.
> >
> > I haven't thought this all the way through yet, but I think this
> > part of the design allows the toolstack to avoid monitoring the
> > domain until "total_phys_pages" reaches "total_claimed" pages,
> > which should make the implementation of claims in the toolstack
> > simpler, especially in many-server environments.
> 
> I think the toolstack has to monitor the domain for that long anyway,
> since it will have to unpause it once it's built.

Could be.  This "claim auto-expire" feature is certainly not a
requirement but I thought it might be useful, especially for
multi-server toolstacks (such as Oracle's).  I may take a look at
implementing it anyway since it is probably only a few lines of code,
but will ensure I do so as a separately reviewable/rejectable patch.

> Relying on an
> implicit release seems fragile -- if the builder ends up using only
> (total_claimed - 1) pages, or temporarily allocating total_claimed and
> then releasing some memory, things could break.

I agree its fragile, though I don't see how things could actually
"break".  But, let's drop claim-auto-expire for now as I fear it is
detracting from the larger discussion.
 
> > > I think it needs a plan for handling restricted memory allocations.
> > > For example, some PV guests need their memory to come below a
> > > certain machine address, or entirely in superpages, and certain
> > > build-time allocations come from xenheap.  How would you handle that
> > > sort of thing?
> >
> > Good point.  I think there's always been some uncertainty about
> > how to account for different zones and xenheap... are they part of the
> > domain's memory or not?
> 
> Xenheap pages are not part of the domain memory for accounting purposes;
> likewise other 'anonymous' allocations (that is, anywhere that
> alloc_domheap_pages() & friends are called with a NULL domain pointer).
> Pages with restricted addresses are just accounted like any other
> memory, except when they're on the free lists.
> 
> Today, toolstacks use a rule of thumb of how much extra space to leave
> to cover those things -- if you want to pre-allocate them, you'll have
> to go through the hypervisor making sure _all_ memory allocations are
> accounted to the right domain somehow (maybe by generalizing the
> shadow-allocation pool to cover all per-domain overheads).  That seems
> like a useful side-effect of adding your new feature.

Hmmm... then I'm not quite sure how adding a simple "claim" changes
the need for accounting of these anonymous allocations.  I guess
it depends on the implementation... maybe the simple implementation
I have in mind can't co-exist with anonymous allocations but I think
it will.

> > Deserves some more thought...  if you can enumerate all such cases,
> > that would be very helpful (and probably valuable long-term
> > documentation as well).
> 
> I'm afraid I can't, not without re-reading all the domain-builder code
> and a fair chunk of the hypervisor, so it's up to you to figure it out.

Well, or at least to ensure that I haven't made it any worse ;-)

me adds "world peace" to the requirements list for the new claim
hypercall ;-)

Thanks much for the feedback!
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30  8:29     ` Jan Beulich
@ 2012-10-30 15:43       ` Dan Magenheimer
  2012-10-30 16:04         ` Jan Beulich
  2012-11-05 17:14         ` George Dunlap
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 15:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Olaf Hering, Keir (Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, October 30, 2012 2:29 AM
> To: Dan Magenheimer
> Cc: Olaf Hering; IanCampbell; GeorgeDunlap; IanJackson; George Shuklin; DarioFaggioli; xen-
> devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; Keir (Xen.org); Tim Deegan
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 30.10.12 at 00:21, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Tim Deegan [mailto:tim@xen.org]
> >> As I said, I'm not opposed to this, though even after reading through
> >> the other thread I'm not convinced that it's necessary (except in cases
> >> where guest-controlled operations are allowed to consume unbounded
> >> memory, which frankly gives me the heebie-jeebies).
> >
> > A really detailed discussion of tmem would probably be good but,
> > yes, with tmem, guest-controlled* operations can and frequently will
> > absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
> > memory used by the hypervisor on behalf of domains, not domain-owned
> > memory.
> >
> > * "guest-controlled" I suspect is the heebie-jeebie word... in
> >   tmem, a better description might be "guest-controls-which-data-
> >   and-hypervisor-controls-how-many-pages"
> 
> But isn't tmem use supposed to be transparent in this respect, i.e.
> if a "normal" allocation cannot be satisfied, tmem would jump in
> and free sufficient space? In which case there's no need to do
> any accounting outside of the control tools (leaving aside the
> smaller hypervisor internal allocations, which the tool stack needs
> to provide room for anyway).

Hi Jan --

Tmem can only "free sufficient space" up to the total amount
of ephemeral space of which it has control (ie. all "freeable"
memory).

Let me explain further:  Let's oversimplify a bit and say that
there are three types of pages:

a) Truly free memory (each free page is on the hypervisor free list)
b) Freeable memory ("ephmeral" memory managed by tmem)
c) Owned memory (pages allocated by the hypervisor or for a domain)

The sum of these three is always a constant: The total number of
RAM pages in the system.  However, when tmem is active, the values
of all _three_ of these change constantly.  So if at the start of a
domain launch, the sum of free+freeable exceeds the intended size
of the domain, the domain allocation/launch can start.  But then
if "owned" increases enough, there may no longer be enough memory
and the domain launch will fail.

With tmem, memory "owned" by domain (d.tot_pages) increases dynamically
in two ways: selfballooning and persistent puts (aka frontswap),
but is always capped by d.max_pages.  Neither of these communicate
to the toolstack.

Similarly, tmem (or selfballooning) may be dynamically freeing up lots
of memory without communicating to the toolstack, which could result in
the toolstack rejecting a domain launch believing there is insufficient
memory.

I am thinking the "claim" hypercall/subop eliminates these problems
and hope you agree!

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 15:43       ` Dan Magenheimer
@ 2012-10-30 16:04         ` Jan Beulich
  2012-10-30 17:13           ` Dan Magenheimer
  2012-11-05 17:14         ` George Dunlap
  1 sibling, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-10-30 16:04 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim Deegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

>>> On 30.10.12 at 16:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> With tmem, memory "owned" by domain (d.tot_pages) increases dynamically
> in two ways: selfballooning and persistent puts (aka frontswap),
> but is always capped by d.max_pages.  Neither of these communicate
> to the toolstack.
> 
> Similarly, tmem (or selfballooning) may be dynamically freeing up lots
> of memory without communicating to the toolstack, which could result in
> the toolstack rejecting a domain launch believing there is insufficient
> memory.
> 
> I am thinking the "claim" hypercall/subop eliminates these problems
> and hope you agree!

With tmem being the odd one here, wouldn't it make more sense
to force it into no-alloc mode (apparently not exactly the same as
freezing all pools) for the (infrequent?) time periods of domain
creation, thus not allowing the amount of free memory to drop
unexpectedly? Tmem could, during these time periods, still itself
internally recycle pages (e.g. fulfill a persistent put by discarding
an ephemeral page).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30  9:11         ` George Dunlap
@ 2012-10-30 16:13           ` Dan Magenheimer
  0 siblings, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Olaf Hering, Keir Fraser, Ian Campbell, Konrad Wilk,
	Tim (Xen.org),
	George Shuklin, xen-devel, Dario Faggioli, Kurt Hackel,
	Zhigang Wang, Ian Jackson

> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com]
>    :
> No, it does not.
>    :
> No, it does not.
>    :
> We don't need this new hypercall.  You should just fix your own bugs
> rather than introducing new hacks to work around them.

Ouch.  I'm sorry if the previous discussion on this made you angry.
I wasn't sure if you were just absorbing the new information or
rejecting it or just too busy to reply, so decided to proceed with
a more specific proposal.  I wasn't intending to cut off the
discussion.

New paradigms and paradigm shifts always encounter resistance,
especially from those with a lot of investment in the old paradigm.

This "new" paradigm, tmem, has been in Xen for years now and the
final piece is now in upstream Linux as well.  Tmem is in many ways
a breakthrough in virtualized memory management, though admittedly
it is far from perfect (and, notably, will not help proprietary
or legacy guests).

I would hope you, as release manager, would either try to understand
the different paradigm or at least accept that there are different
paradigms than yours that can co-exist in an open source project.

To answer some of your points:

Dynamic handling of memory management is not a bug.  And selfballooning
is only a small (though important) part of the tmem story.  And
the Oracle "toolstack" manages hundreds of physical machines and
thousands of virtual machines across a physical network, not one
physical machine with a handful of virtual machines across Xenbus.
So we come from different perspectives.

As repeatedly pointed out (and confirmed by others), variations
of the memory "race" problem exist even without tmem.  I do agree
that if a toolstack insists that only it, the toolstack, can ever
allocate or free memory, the problem goes away.  You think that
restriction is reasonable, and I think it is not.

The "claim" proposal is very simple and (as far as I can tell so far)
shouldn't interfere with your paradigm.  Reinforcing your paradigm
by rejecting the proposal only cripples my paradigm.  Please ensure
you don't reject a proposal simply because you have a different
worldview.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 14:43             ` Keir Fraser
@ 2012-10-30 16:33               ` Dan Magenheimer
  0 siblings, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 16:33 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	George Shuklin, Tim (Xen.org),
	xen-devel, Dario Faggioli, Kurt Hackel, Zhigang Wang,
	Ian Jackson

> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 30/10/2012 16:13, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> 
> >> Okay, so why is tmem incompatible with implementing claims in the toolstack?
> >
> > (Hmmm... maybe I could schedule the equivalent of a PhD qual exam
> > for tmem with all the core Xen developers as examiners?)
> >
> > The short answer is tmem moves memory capacity around far too
> > frequently to be managed by a userland toolstack, especially if
> > the "controller" lives on a central "manager machine" in a
> > data center (Oracle's model).  The ebb and flow of memory supply
> > and demand for each guest is instead managed entirely dynamically.
> 
> I don't know. I agree that fine-grained memory management is the duty of the
> hypervisor, but it seems to me that the toolstack should be able to handle
> admission control. It knows how much memory each existing guest is allowed
> to consume at max,
>   !!!!!!!!!!!how much memory the new guest requires!!!!!!!!!!
> how much memory
> the system has total... Isn't the decision then simple?

A fundamental assumption of tmem is that _nobody_ knows how much memory
a guest requires, not even the OS kernel running in the guest.  If you
have a toolstack that does know, please submit a paper to OSDI. ;-)
If you have a toolstack that can do it for thousands of guests across
hundreds of machines, please start up a company and allow me to invest. ;-)

One way to think of tmem is as a huge co-feedback loop that estimates
memory demand and deals effectively with the consequences of the (always
wrong) estimate using very fine-grained adjustments AND mechanisms that
allow maximum flexibility between guest memory demands while minimizing
impact on the running guests.

> Tmem should be fairly invisible to the toolstack, right?

It can be invisible, as long as the toolstack doesn't either make
the assumption that it controls every page allocated/freed by the
hypervisor or make the assumption that a large allocation can be
completed atomically.  The first of those assumptions is what is
generating all the controversy (George's worldview) and the second
is the problem I am trying to solve with the "claim" hypercall/subop.
And I'd like to solve it in a way that handles both tmem and non-tmem.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 16:04         ` Jan Beulich
@ 2012-10-30 17:13           ` Dan Magenheimer
  2012-10-31  8:14             ` Jan Beulich
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-30 17:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 30.10.12 at 16:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > With tmem, memory "owned" by domain (d.tot_pages) increases dynamically
> > in two ways: selfballooning and persistent puts (aka frontswap),
> > but is always capped by d.max_pages.  Neither of these communicate
> > to the toolstack.
> >
> > Similarly, tmem (or selfballooning) may be dynamically freeing up lots
> > of memory without communicating to the toolstack, which could result in
> > the toolstack rejecting a domain launch believing there is insufficient
> > memory.
> >
> > I am thinking the "claim" hypercall/subop eliminates these problems
> > and hope you agree!
> 
> With tmem being the odd one here, wouldn't it make more sense
> to force it into no-alloc mode (apparently not exactly the same as
> freezing all pools) for the (infrequent?) time periods of domain
> creation, thus not allowing the amount of free memory to drop
> unexpectedly? Tmem could, during these time periods, still itself
> internally recycle pages (e.g. fulfill a persistent put by discarding
> an ephemeral page).

Hi Jan --

Freeze has some unattractive issues that "claim" would solve
(see below) and freeze (whether ephemeral pages are used or not)
blocks allocations due to tmem, but doesn't block allocations due
to selfballooning (or manual ballooning attempts by a guest user
with root access).  I suppose the tmem freeze implementation could
be extended to also block all non-domain-creation ballooning
attempts but I'm not sure if that's what you are proposing.

To digress for a moment first, the original problem exists both in
non-tmem systems AND tmem systems.  It has been seen in the wild on
non-tmem systems.  I am involved with proposing a solution primarily
because, if the solution is designed correctly, it _also_ solves a
tmem problem.  (And as long as we have digressed, I believe it _also_
solves a page-sharing problem on non-tmem systems.)  That said,
here's the unattractive tmem freeze/thaw issue, first with
the existing freeze implementation.

Suppose you have a huge 256GB machine and you have already launched
a 64GB tmem guest "A".  The guest is idle for now, so slowly
selfballoons down to maybe 4GB.  You start to launch another 64GB
guest "B" which, as we know, is going to take some time to complete.
In the middle of launching "B", "A" suddenly gets very active and
needs to balloon up as quickly as possible or it can't balloon fast
enough (or at all if "frozen" as suggested) so starts swapping (and,
thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
memory).  But ballooning and tmem are both blocked and so the
guest swaps its poor little butt off even though there's >100GB
of free physical memory available.

Let's add in your suggestion, that a persistent put can be fulfilled
by discarding an ephemeral page.  I see two issues:  First, it
requires the number of ephemeral pages available to be larger
than the number of persistent pages required; this may not always
be true, though most of the time it will be true.  Second, the second
domain creation activity may have been assuming that it could use
some (or all) of the freeable pages, which have now been absorbed by
the first guest's persistent puts.  So I think "claim" is still
needed anyway.

Comments?

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 17:13           ` Dan Magenheimer
@ 2012-10-31  8:14             ` Jan Beulich
  2012-10-31 16:04               ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-10-31  8:14 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim Deegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

>>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> With tmem being the odd one here, wouldn't it make more sense
>> to force it into no-alloc mode (apparently not exactly the same as
>> freezing all pools) for the (infrequent?) time periods of domain
>> creation, thus not allowing the amount of free memory to drop
>> unexpectedly? Tmem could, during these time periods, still itself
>> internally recycle pages (e.g. fulfill a persistent put by discarding
>> an ephemeral page).
> 
> Freeze has some unattractive issues that "claim" would solve
> (see below) and freeze (whether ephemeral pages are used or not)
> blocks allocations due to tmem, but doesn't block allocations due
> to selfballooning (or manual ballooning attempts by a guest user
> with root access).  I suppose the tmem freeze implementation could
> be extended to also block all non-domain-creation ballooning
> attempts but I'm not sure if that's what you are proposing.
> 
> To digress for a moment first, the original problem exists both in
> non-tmem systems AND tmem systems.  It has been seen in the wild on
> non-tmem systems.  I am involved with proposing a solution primarily
> because, if the solution is designed correctly, it _also_ solves a
> tmem problem.  (And as long as we have digressed, I believe it _also_
> solves a page-sharing problem on non-tmem systems.)  That said,
> here's the unattractive tmem freeze/thaw issue, first with
> the existing freeze implementation.
> 
> Suppose you have a huge 256GB machine and you have already launched
> a 64GB tmem guest "A".  The guest is idle for now, so slowly
> selfballoons down to maybe 4GB.  You start to launch another 64GB
> guest "B" which, as we know, is going to take some time to complete.
> In the middle of launching "B", "A" suddenly gets very active and
> needs to balloon up as quickly as possible or it can't balloon fast
> enough (or at all if "frozen" as suggested) so starts swapping (and,
> thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> memory).  But ballooning and tmem are both blocked and so the
> guest swaps its poor little butt off even though there's >100GB
> of free physical memory available.

That's only one side of the overcommit situation you're striving
to get work right here: That same self ballooning guest, after
sufficiently more guest got started so that the rest of the memory
got absorbed by them would suffer the very same problems in
the described situation, so it has to be prepared for this case
anyway.

As long as the allocation times can get brought down to an
acceptable level, I continue to not see a need for the extra
"claim" approach you're proposing. So working on that one (or
showing that without unreasonable effort this cannot be
further improved) would be a higher priority thing from my pov
(without anyone arguing about its usefulness).

But yes, with all the factors you mention brought in, there is
certainly some improvement needed (whether your "claim"
proposal is a the right thing is another question, not to mention
that I currently don't see how this would get implemented in
a consistent way taking several orders of magnitude less time
to carry out).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-31  8:14             ` Jan Beulich
@ 2012-10-31 16:04               ` Dan Magenheimer
  2012-10-31 16:19                 ` Jan Beulich
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-31 16:04 UTC (permalink / raw)
  To: Jan Beulich, Keir(Xen.org)
  Cc: Tim Deegan, Olaf Hering, IanCampbell, Konrad Wilk, GeorgeDunlap,
	IanJackson, George Shuklin, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!

I've slightly re-ordered the email to focus on the problem
(moved tmem-specific discussion to the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing. So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough.  I will do some measurement and analysis of this
code.  However, let me ask something of you and Keir as well:
Please estimate how long (in usec) you think it is acceptable
to hold the heap_lock.  If your limit is very small (as I expect),
doing anything "N" times in a loop with the lock held (for N==2^26,
which is a 256GB domain) may make the analysis moot.

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is a the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... proof-of-concept.
I'm envisioning simple arithmetic, but maybe you are
right and arithmetic will not be sufficient.

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A".  The guest is idle for now, so slowly
> > selfballoons down to maybe 4GB.  You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible or it can't balloon fast
> > enough (or at all if "frozen" as suggested) so starts swapping (and,
> > thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> > memory).  But ballooning and tmem are both blocked and so the
> > guest swaps its poor little butt off even though there's >100GB
> > of free physical memory available.
> 
> That's only one side of the overcommit situation you're striving
> to get work right here: That same self ballooning guest, after
> sufficiently more guest got started so that the rest of the memory
> got absorbed by them would suffer the very same problems in
> the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case
anyway... the guest swaps.  And, unlike page-sharing, the guest
determines which pages to swap, not the host, and there is no
possibility of double-paging.

In your scenario, the host memory is truly oversubscribed.  This
scenario is ultimately a weakness of virtualization in general;
trying to statistically-share an oversubscribed fixed resource
among a number of guests will sometimes cause a performance
degradation, whether the resource is CPU or LAN bandwidth or,
in this case, physical memory.  That very generic problem
is I think not one any of us can solve.  Toolstacks need to
be able to recognize the problem (whether CPU, LAN, or memory)
and act accordingly (report, or auto-migrate).

In my scenario, guest performance is hammered only because of
the unfortunate deficiency in the existing hypervisor memory
allocation mechanisms, namely that small allocations must
be artificially "frozen" until a large allocation can complete.
That specific problem is one I am trying to solve.

BTW, with tmem, some future toolstack might monitor various
available tmem statistics and predict/avoid your scenario.

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-31 16:04               ` Dan Magenheimer
@ 2012-10-31 16:19                 ` Jan Beulich
  2012-10-31 16:51                   ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-10-31 16:19 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

>>> On 31.10.12 at 17:04, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>> 
>> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> 
> (NOTE TO KEIR: Input from you requested in first stanza below.)
> 
> Hi Jan --
> 
> Thanks for the continued feedback!
> 
> I've slightly re-ordered the email to focus on the problem
> (moved tmem-specific discussion to the end).
> 
>> As long as the allocation times can get brought down to an
>> acceptable level, I continue to not see a need for the extra
>> "claim" approach you're proposing. So working on that one (or
>> showing that without unreasonable effort this cannot be
>> further improved) would be a higher priority thing from my pov
>> (without anyone arguing about its usefulness).
> 
> Fair enough.  I will do some measurement and analysis of this
> code.  However, let me ask something of you and Keir as well:
> Please estimate how long (in usec) you think it is acceptable
> to hold the heap_lock.  If your limit is very small (as I expect),
> doing anything "N" times in a loop with the lock held (for N==2^26,
> which is a 256GB domain) may make the analysis moot.

I think your thoughts here simply go a different route than mine:
Of course it is wrong to hold _any_ lock for extended periods of
time. But extending what was done by c/s 26056:177fdda0be56
might, considering the effect that change had, buy you quite a
bit of allocation efficiency.

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-31 16:19                 ` Jan Beulich
@ 2012-10-31 16:51                   ` Dan Magenheimer
  2012-11-02  9:01                     ` Jan Beulich
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-10-31 16:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: TimDeegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 31.10.12 at 17:04, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> >>
> >> As long as the allocation times can get brought down to an
> >> acceptable level, I continue to not see a need for the extra
> >> "claim" approach you're proposing. So working on that one (or
> >> showing that without unreasonable effort this cannot be
> >> further improved) would be a higher priority thing from my pov
> >> (without anyone arguing about its usefulness).
> >
> > Fair enough.  I will do some measurement and analysis of this
> > code.  However, let me ask something of you and Keir as well:
> > Please estimate how long (in usec) you think it is acceptable
> > to hold the heap_lock.  If your limit is very small (as I expect),
> > doing anything "N" times in a loop with the lock held (for N==2^26,
> > which is a 256GB domain) may make the analysis moot.
> 
> I think your thoughts here simply go a different route than mine:
> Of course it is wrong to hold _any_ lock for extended periods of
> time. But extending what was done by c/s 26056:177fdda0be56
> might, considering the effect that change had, buy you quite a
> bit of allocation efficiency.

No, I think we are on the same route, except that maybe I
am trying to take a shortcut to the end. :-)

I did follow the discussion that led to that changeset
and highly recommended to the Oracle product folks that
we integrate it asap.

But reducing the domain allocation time "massively" from
30 sec to 3 sec doesn't help solve my issue because, in
essence, my issue says that the heap_lock must still be
held for most of that 3 sec.  Even reducing it by _another_
factor of 10 to 0.3 sec or a factor of 100 to 30msec
doesn't solve my problem.

To look at it another way, the code in alloc_heap_page()
contained within the loop:

	for ( i = 0; i < (1 << order); i++ )

may be already unacceptable, even _after_ the patch, if
order==26 (a fictional page size just for this illustration)
because the heap_lock will be held for a very very long time.
(In fact for order==20, 1GB pages, it could already be a
problem.)

The claim hypercall/subop would allocate _capacity_ only,
and then the actual physical pages are "lazily" allocated
from that pre-allocated capacity.

Anyway, I am still planning on proceeding with some
of the measurement/analysis _and_ proof-of-concept.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-29 22:35 ` Tim Deegan
  2012-10-29 23:21   ` Dan Magenheimer
@ 2012-11-01  2:13   ` Dario Faggioli
  2012-11-01 15:51     ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: Dario Faggioli @ 2012-11-01  2:13 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Dan Magenheimer, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, Olaf Hering, xen-devel, Zhigang Wang,
	Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1741 bytes --]

On Mon, 2012-10-29 at 22:35 +0000, Tim Deegan wrote:
> At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:
> > In the case of a dying domain, a XENMEM_release operation
> > is implied and must be executed by the hypervisor.
> > 
> > Ideally, the quantity of unclaimed memory for each domain and
> > for the system should be query-able.  This may require additional
> > memory_op hypercalls.
> > 
> > I'd very much appreciate feedback on this proposed design!
> 
> As I said, I'm not opposed to this, though even after reading through
> the other thread I'm not convinced that it's necessary (except in cases
> where guest-controlled operations are allowed to consume unbounded
> memory, which frankly gives me the heebie-jeebies).
> 
Let me also ask something.

Playing with NUMA systems I've been in the situation where it would be
nice to know not only how much free memory we have in general, but how
much free memory there is in a specific (set of) node(s), and that in
many places, from the hypervisor, to libxc, to top level toolstack.

Right now I ask this to Xen, but that is indeed prone to races and
TOCTOU issues if we allow for domain creation and ballooning
(tmem/paging/...) to happen concurrently between themselves and between
each other (as noted in the long thread that preceded this one).

Question is, the "claim" mechanism you're proposing is by no means NUMA
node-aware, right?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-01  2:13   ` Dario Faggioli
@ 2012-11-01 15:51     ` Dan Magenheimer
  0 siblings, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-01 15:51 UTC (permalink / raw)
  To: Dario Faggioli, Tim Deegan
  Cc: Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, George Dunlap, Kurt Hackel,
	George Shuklin, xen-devel, Zhigang Wang, Ian Jackson

> From: Dario Faggioli [mailto:raistlin@linux.it]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On Mon, 2012-10-29 at 22:35 +0000, Tim Deegan wrote:
> > At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:
> > > In the case of a dying domain, a XENMEM_release operation
> > > is implied and must be executed by the hypervisor.
> > >
> > > Ideally, the quantity of unclaimed memory for each domain and
> > > for the system should be query-able.  This may require additional
> > > memory_op hypercalls.
> > >
> > > I'd very much appreciate feedback on this proposed design!
> >
> > As I said, I'm not opposed to this, though even after reading through
> > the other thread I'm not convinced that it's necessary (except in cases
> > where guest-controlled operations are allowed to consume unbounded
> > memory, which frankly gives me the heebie-jeebies).
> >
> Let me also ask something.
> 
> Playing with NUMA systems I've been in the situation where it would be
> nice to know not only how much free memory we have in general, but how
> much free memory there is in a specific (set of) node(s), and that in
> many places, from the hypervisor, to libxc, to top level toolstack.
> 
> Right now I ask this to Xen, but that is indeed prone to races and
> TOCTOU issues if we allow for domain creation and ballooning

TOCTOU... hadn't seen that term before, but I agree it describes
the problem succinctly.  Thanks, I will begin using that now!

> (tmem/paging/...) to happen concurrently between themselves and between
> each other (as noted in the long thread that preceded this one).
> 
> Question is, the "claim" mechanism you're proposing is by no means NUMA
> node-aware, right?

I hadn't thought about NUMA, but I think the claim mechanism
could be augmented to attempt to stake a claim on a specified
node, or on any node that has sufficient memory.  AFAICT
this might complicate the arithmetic a bit but should work.
Let me prototype the NUMA-ignorant mechanism first though...

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-31 16:51                   ` Dan Magenheimer
@ 2012-11-02  9:01                     ` Jan Beulich
  2012-11-02  9:30                       ` Keir Fraser
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-11-02  9:01 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, Keir(Xen.org),
	IanCampbell, Konrad Wilk, GeorgeDunlap, IanJackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

>>> On 31.10.12 at 17:51, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> To look at it another way, the code in alloc_heap_page()
> contained within the loop:
> 
> 	for ( i = 0; i < (1 << order); i++ )
> 
> may be already unacceptable, even _after_ the patch, if
> order==26 (a fictional page size just for this illustration)
> because the heap_lock will be held for a very very long time.
> (In fact for order==20, 1GB pages, it could already be a
> problem.)

A million iterations doing just a few memory reads and writes
(not even atomic ones afaics) doesn't sound that bad. And
order-18 allocations (which is what 1Gb pages really amount
to) are the biggest ever happening (post-boot, if that matters).

You'll get much worse behavior if these large order allocations
fail, and the callers have to fall back to smaller ones.

Plus, if necessary, that loop could be broken up so that only the
initial part of it gets run with the lock held (see c/s
22135:69e8bb164683 for why the unlock was moved past the
loop). That would make for a shorter lock hold time, but for a
higher allocation latency on large oder allocations (due to worse
cache locality).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-02  9:01                     ` Jan Beulich
@ 2012-11-02  9:30                       ` Keir Fraser
  2012-11-04 19:43                         ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-11-02  9:30 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:

> Plus, if necessary, that loop could be broken up so that only the
> initial part of it gets run with the lock held (see c/s
> 22135:69e8bb164683 for why the unlock was moved past the
> loop). That would make for a shorter lock hold time, but for a
> higher allocation latency on large oder allocations (due to worse
> cache locality).

In fact I believe only the first page needs to have its count_info set to !=
PGC_state_free, while the lock is held. That is sufficient to defeat the
buddy merging in free_heap_pages(). Similarly, we could hoist most of the
first loop in free_heap_pages() outside the lock. There's a lot of scope for
optimisation here.

 -- Keir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-02  9:30                       ` Keir Fraser
@ 2012-11-04 19:43                         ` Dan Magenheimer
  2012-11-04 20:35                           ` Tim Deegan
  2012-11-05  9:16                           ` Jan Beulich
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-04 19:43 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Wilk, George Dunlap,
	Ian Jackson, George Shuklin, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

> From: Keir Fraser [mailto:keir@xen.org]
> Sent: Friday, November 02, 2012 3:30 AM
> To: Jan Beulich; Dan Magenheimer
> Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-
> devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
> > Plus, if necessary, that loop could be broken up so that only the
> > initial part of it gets run with the lock held (see c/s
> > 22135:69e8bb164683 for why the unlock was moved past the
> > loop). That would make for a shorter lock hold time, but for a
> > higher allocation latency on large oder allocations (due to worse
> > cache locality).
> 
> In fact I believe only the first page needs to have its count_info set to !=
> PGC_state_free, while the lock is held. That is sufficient to defeat the
> buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> first loop in free_heap_pages() outside the lock. There's a lot of scope for
> optimisation here.

(sorry for the delayed response)

Aren't we getting a little sidetracked here?  (Maybe my fault for
looking at whether this specific loop is fast enough...)

This loop handles only order=N chunks of RAM.  Speeding up this
loop and holding the heap_lock here for a shorter period only helps
the TOCTOU race if the entire domain can be allocated as a
single order-N allocation.

Domain creation is supposed to succeed as long as there is
sufficient RAM, _regardless_ of the state of memory fragmentation,
correct?

So unless the code for the _entire_ memory allocation path can
be optimized so that the heap_lock can be held across _all_ the
allocations necessary to create an arbitrary-sized domain, for
any arbitrary state of memory fragmentation, the original
problem has not been solved.

Or am I misunderstanding?

I _think_ the claim hypercall/subop should resolve this, though
admittedly I have yet to prove (and code) it.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-04 19:43                         ` Dan Magenheimer
@ 2012-11-04 20:35                           ` Tim Deegan
  2012-11-05  0:23                             ` Dan Magenheimer
  2012-11-05 22:33                             ` Dan Magenheimer
  2012-11-05  9:16                           ` Jan Beulich
  1 sibling, 2 replies; 58+ messages in thread
From: Tim Deegan @ 2012-11-04 20:35 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, George Shuklin, Ian Jackson, xen-devel,
	DarioFaggioli, Jan Beulich, Kurt Hackel, Zhigang Wang

At 11:43 -0800 on 04 Nov (1352029386), Dan Magenheimer wrote:
> > From: Keir Fraser [mailto:keir@xen.org]
> > Sent: Friday, November 02, 2012 3:30 AM
> > To: Jan Beulich; Dan Magenheimer
> > Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-
> > devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > 
> > On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
> > 
> > > Plus, if necessary, that loop could be broken up so that only the
> > > initial part of it gets run with the lock held (see c/s
> > > 22135:69e8bb164683 for why the unlock was moved past the
> > > loop). That would make for a shorter lock hold time, but for a
> > > higher allocation latency on large oder allocations (due to worse
> > > cache locality).
> > 
> > In fact I believe only the first page needs to have its count_info set to !=
> > PGC_state_free, while the lock is held. That is sufficient to defeat the
> > buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> > first loop in free_heap_pages() outside the lock. There's a lot of scope for
> > optimisation here.
> 
> (sorry for the delayed response)
> 
> Aren't we getting a little sidetracked here?  (Maybe my fault for
> looking at whether this specific loop is fast enough...)
> 
> This loop handles only order=N chunks of RAM.  Speeding up this
> loop and holding the heap_lock here for a shorter period only helps
> the TOCTOU race if the entire domain can be allocated as a
> single order-N allocation.

I think the idea is to speed up allocation so that, even for a large VM,
you can just allocate memory instead of needing a reservation hypercall
(whose only purpose, AIUI, is to give you an immediate answer).

> So unless the code for the _entire_ memory allocation path can
> be optimized so that the heap_lock can be held across _all_ the
> allocations necessary to create an arbitrary-sized domain, for
> any arbitrary state of memory fragmentation, the original
> problem has not been solved.
> 
> Or am I misunderstanding?
> 
> I _think_ the claim hypercall/subop should resolve this, though
> admittedly I have yet to prove (and code) it.

I don't think it solves it - or rather it might solve this _particular_
instance of it but it doesn't solve the bigger problem.  If you have a
set of overcommitted hosts and you want to start a new VM, you need to:

 - (a) decide which of your hosts is the least overcommitted;
 - (b) free up enough memory on that host to build the VM; and
 - (c) build the VM.

The claim hypercall _might_ fix (c) (if it could handle allocations that
need address-width limits or contiguous pages).  But (b) and (a) have
exactly the same problem, unless there is a central arbiter of memory
allocation (or equivalent distributed system).  If you try to start 2
VMs at once,

 - (a) the toolstack will choose to start them both on the same machine,
       even if that's not optimal, or in the case where one creation is
       _bound_ to fail after some delay.
 - (b) the other VMs (and perhaps tmem) start ballooning out enough
       memory to start the new VM.  This can take even longer than
       allocating it since it depends on guest behaviour.  It can fail
       after an arbitrary delay (ditto).

If you have a toolstack with enough knowledge and control over memory
allocation to sort out stages (a) and (b) in such a way that there are
no delayed failures, (c) should be trivial.

Tim.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-04 20:35                           ` Tim Deegan
@ 2012-11-05  0:23                             ` Dan Magenheimer
  2012-11-05 10:29                               ` Ian Campbell
  2012-11-05 22:33                             ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-05  0:23 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, George Shuklin, Ian Jackson, xen-devel,
	DarioFaggioli, Jan Beulich, Kurt Hackel, Zhigang Wang

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Hi Tim --

> At 11:43 -0800 on 04 Nov (1352029386), Dan Magenheimer wrote:
> > > From: Keir Fraser [mailto:keir@xen.org]
> > > Sent: Friday, November 02, 2012 3:30 AM
> > > To: Jan Beulich; Dan Magenheimer
> > > Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-
> > > devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> > > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > >
> > > On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
> > >
> > > > Plus, if necessary, that loop could be broken up so that only the
> > > > initial part of it gets run with the lock held (see c/s
> > > > 22135:69e8bb164683 for why the unlock was moved past the
> > > > loop). That would make for a shorter lock hold time, but for a
> > > > higher allocation latency on large oder allocations (due to worse
> > > > cache locality).
> > >
> > > In fact I believe only the first page needs to have its count_info set to !=
> > > PGC_state_free, while the lock is held. That is sufficient to defeat the
> > > buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> > > first loop in free_heap_pages() outside the lock. There's a lot of scope for
> > > optimisation here.
> >
> > (sorry for the delayed response)
> >
> > Aren't we getting a little sidetracked here?  (Maybe my fault for
> > looking at whether this specific loop is fast enough...)
> >
> > This loop handles only order=N chunks of RAM.  Speeding up this
> > loop and holding the heap_lock here for a shorter period only helps
> > the TOCTOU race if the entire domain can be allocated as a
> > single order-N allocation.
> 
> I think the idea is to speed up allocation so that, even for a large VM,
> you can just allocate memory instead of needing a reservation hypercall
> (whose only purpose, AIUI, is to give you an immediate answer).

Its purpose is to give an immediate answer on whether sufficient
space is available for allocation AND (atomically) claim it so
no other call to the allocator can race and steal some or all of
it away. So unless the allocation is sped up enough (given an arbitrary
size domain and arbitrary state of memory fragmentation) so that
the heap_lock can be held for that length of time, speeding
up allocation doesn't solve the problem.
 
> > So unless the code for the _entire_ memory allocation path can
> > be optimized so that the heap_lock can be held across _all_ the
> > allocations necessary to create an arbitrary-sized domain, for
> > any arbitrary state of memory fragmentation, the original
> > problem has not been solved.
> >
> > Or am I misunderstanding?
> >
> > I _think_ the claim hypercall/subop should resolve this, though
> > admittedly I have yet to prove (and code) it.
> 
> I don't think it solves it - or rather it might solve this _particular_
> instance of it but it doesn't solve the bigger problem.  If you have a
> set of overcommitted hosts and you want to start a new VM, you need to:
> 
>  - (a) decide which of your hosts is the least overcommitted;
>  - (b) free up enough memory on that host to build the VM; and
>  - (c) build the VM.
>
> The claim hypercall _might_ fix (c) (if it could handle allocations that
> need address-width limits or contiguous pages).  But (b) and (a) have
> exactly the same problem, unless there is a central arbiter of memory
> allocation (or equivalent distributed system).  If you try to start 2
> VMs at once,
> 
>  - (a) the toolstack will choose to start them both on the same machine,
>        even if that's not optimal, or in the case where one creation is
>        _bound_ to fail after some delay.
>  - (b) the other VMs (and perhaps tmem) start ballooning out enough
>        memory to start the new VM.  This can take even longer than
>        allocating it since it depends on guest behaviour.  It can fail
>        after an arbitrary delay (ditto).
> 
> If you have a toolstack with enough knowledge and control over memory
> allocation to sort out stages (a) and (b) in such a way that there are
> no delayed failures, (c) should be trivial.

(You've used the labels (a) and (b) twice so I'm not quite sure
I understand... but in any case)

Sigh.  No, you are missing the beauty of tmem and dynamic allocation;
you are thinking from the old static paradigm where the toolstack
controls how much memory is available.  There is no central arbiter
of memory anymore than there is a central toolstack (other than the
hypervisor on a one server Xen environment) that decides exactly
when to assign vcpus to pcpus.  There is no "free up enough memory
on that host".  Tmem doesn't start ballooning out enough memory
to start the VM... the guests are responsible for doing the ballooning
and it is _already done_.  The machine either has sufficient free+freeable
memory or it does not; and it is _that_ determination that needs
to be done atomically because many threads are micro-allocating, and
possibly multiple toolstack threads are macro-allocating,
simultaneously.

Everything is handled dynamically.  And just like a CPU scheduler
built into a hypervisor that dynamically allocates vcpu->pcpus
has proven more effective than partitioning pcpus to different
domains, dynamic memory management should prove more effective
than some bossy toolstack trying to control memory statically.

I understand that you can solve "my" problem in your paradigm
without a claim hypercall and/or by speeding up allocations.
I _don't_ see that you can solve "my" problem in _my_ paradigm
without a claim hypercall... speeding up allocations doesn't
solve the TOCTOU race so allocating sufficient space for a
domain must be atomic.

Sigh.

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-04 19:43                         ` Dan Magenheimer
  2012-11-04 20:35                           ` Tim Deegan
@ 2012-11-05  9:16                           ` Jan Beulich
  2012-11-07 22:17                             ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-11-05  9:16 UTC (permalink / raw)
  To: Dan Magenheimer, Keir Fraser
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Wilk, George Dunlap,
	Ian Jackson, George Shuklin, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

>>> On 04.11.12 at 20:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Keir Fraser [mailto:keir@xen.org]
>> Sent: Friday, November 02, 2012 3:30 AM
>> To: Jan Beulich; Dan Magenheimer
>> Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; 
> DarioFaggioli; xen-
>> devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; 
> Zhigang Wang; TimDeegan
>> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>> 
>> On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>> > Plus, if necessary, that loop could be broken up so that only the
>> > initial part of it gets run with the lock held (see c/s
>> > 22135:69e8bb164683 for why the unlock was moved past the
>> > loop). That would make for a shorter lock hold time, but for a
>> > higher allocation latency on large oder allocations (due to worse
>> > cache locality).
>> 
>> In fact I believe only the first page needs to have its count_info set to !=
>> PGC_state_free, while the lock is held. That is sufficient to defeat the
>> buddy merging in free_heap_pages(). Similarly, we could hoist most of the
>> first loop in free_heap_pages() outside the lock. There's a lot of scope for
>> optimisation here.
> 
> (sorry for the delayed response)
> 
> Aren't we getting a little sidetracked here?  (Maybe my fault for
> looking at whether this specific loop is fast enough...)
> 
> This loop handles only order=N chunks of RAM.  Speeding up this
> loop and holding the heap_lock here for a shorter period only helps
> the TOCTOU race if the entire domain can be allocated as a
> single order-N allocation.
> 
> Domain creation is supposed to succeed as long as there is
> sufficient RAM, _regardless_ of the state of memory fragmentation,
> correct?
> 
> So unless the code for the _entire_ memory allocation path can
> be optimized so that the heap_lock can be held across _all_ the
> allocations necessary to create an arbitrary-sized domain, for
> any arbitrary state of memory fragmentation, the original
> problem has not been solved.
> 
> Or am I misunderstanding?

I think we got here via questioning whether suppressing certain
activities (like tmem causing the allocator visible amount of
available memory) for a brief period of time would be acceptable,
and while that indeed depends on the overall latency of memory
allocation for the domain as a whole, I would be somewhat
tolerant for it to involve a longer suspension period on a highly
fragmented system.

But of course, if this can be made work uniformly, that would be
preferred.

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05  0:23                             ` Dan Magenheimer
@ 2012-11-05 10:29                               ` Ian Campbell
  2012-11-05 14:54                                 ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Ian Campbell @ 2012-11-05 10:29 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, George Shuklin, DarioFaggioli, Jan Beulich,
	Kurt Hackel, Zhigang Wang

On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> There is no "free up enough memory on that host". Tmem doesn't start
> ballooning out enough memory to start the VM... the guests are
> responsible for doing the ballooning and it is _already done_.  The
> machine either has sufficient free+freeable memory or it does not;

How does one go about deciding which host in a multi thousand host
deployment to try the claim hypercall on?

Ian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 10:29                               ` Ian Campbell
@ 2012-11-05 14:54                                 ` Dan Magenheimer
  2012-11-05 22:24                                   ` Ian Campbell
  0 siblings, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-05 14:54 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Olaf Hering, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, George Shuklin, DarioFaggioli, Jan Beulich,
	Kurt Hackel, Zhigang Wang

> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Sent: Monday, November 05, 2012 3:30 AM
> To: Dan Magenheimer
> Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George
> Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > There is no "free up enough memory on that host". Tmem doesn't start
> > ballooning out enough memory to start the VM... the guests are
> > responsible for doing the ballooning and it is _already done_.  The
> > machine either has sufficient free+freeable memory or it does not;
> 
> How does one go about deciding which host in a multi thousand host
> deployment to try the claim hypercall on?

I don't get paid enough to solve that problem :-)

VM placement (both for new domains and migration due to
load-balancing and power-management) is dependent on a
number of factors currently involving CPU utilization,
SAN utilization, and LAN utilization, I think using
historical trends on streams of sampled statistics.  This
is very non-deterministic as all of these factors may
vary dramatically within a sampling interval.

Adding free+freeable memory to this just adds one more
such statistic.  Actually two, as it is probably best to
track free separately from freeable since a candidate
host that has enough free memory should have preference
over one with freeable memory.

Sorry if that's not very satisfying but anything beyond that
meager description is outside of my area of expertise.

Dan

P.S. I don't think I've ever said _thousands_ of physical
hosts, just hundreds (with thousands of VMs).  Honestly
I don't know the upper support bound for an Oracle VM
"server pool" (which is what we call the collection of
hundreds of physical machines)... it may be thousands.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-10-30 15:43       ` Dan Magenheimer
  2012-10-30 16:04         ` Jan Beulich
@ 2012-11-05 17:14         ` George Dunlap
  2012-11-05 18:21           ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: George Dunlap @ 2012-11-05 17:14 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim (Xen.org), Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Ian Jackson, George Shuklin,
	xen-devel, DarioFaggioli, Jan Beulich, Kurt Hackel, Zhigang Wang

On 30/10/12 15:43, Dan Magenheimer wrote:
> a) Truly free memory (each free page is on the hypervisor free list)
> b) Freeable memory ("ephmeral" memory managed by tmem)
> c) Owned memory (pages allocated by the hypervisor or for a domain)
>
> The sum of these three is always a constant: The total number of
> RAM pages in the system.  However, when tmem is active, the values
> of all _three_ of these change constantly.  So if at the start of a
> domain launch, the sum of free+freeable exceeds the intended size
> of the domain, the domain allocation/launch can start.

Why free+freeable, rather than just "free"?

>   But then
> if "owned" increases enough, there may no longer be enough memory
> and the domain launch will fail.

Again, "owned" would not increase at all if the guest weren't handing 
memory back to Xen.  Why is that necessary, or even helpful?

(And please don't start another rant about the bold new world of peace 
and love.  Give me a freaking *technical* answer.)

  -George

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 17:14         ` George Dunlap
@ 2012-11-05 18:21           ` Dan Magenheimer
  0 siblings, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-05 18:21 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim (Xen.org), Olaf Hering, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Ian Jackson, George Shuklin,
	xen-devel, DarioFaggioli, Jan Beulich, Kurt Hackel, Zhigang Wang

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 30/10/12 15:43, Dan Magenheimer wrote:
> > a) Truly free memory (each free page is on the hypervisor free list)
> > b) Freeable memory ("ephmeral" memory managed by tmem)
> > c) Owned memory (pages allocated by the hypervisor or for a domain)
> >
> > The sum of these three is always a constant: The total number of
> > RAM pages in the system.  However, when tmem is active, the values
> > of all _three_ of these change constantly.  So if at the start of a
> > domain launch, the sum of free+freeable exceeds the intended size
> > of the domain, the domain allocation/launch can start.

> (And please don't start another rant about the bold new world of peace
> and love.  Give me a freaking *technical* answer.)

<grin> /Me removes seventies-style tie-dye tshirt with peace logo
and sadly withdraws single daisy previously extended to George.

> Why free+freeable, rather than just "free"?

A free page is a page that is not used for anything at all.
It is on the hypervisor's free list.  A freeable page contains tmem
ephemeral data stored on behalf of a domain (or, if dedup'ing
is enabled, on behalf of one or more domains).  More specifically
for a tmem-enabled Linux guest, a freeable page contains a clean
page cache page that the Linux guest OS has asked the hypervisor
(via the tmem ABI) to hold if it can for as long as it can.
The specific clean page cache pages are chosen and the call is
done on the Linux side via "cleancache".

So, when tmem is working optimally, there are few or no free
pages and many many freeable pages (perhaps half of physical
RAM or more).

Freeable pages across all tmem-enabled guests are kept in a single
LRU queue.  When a request is made to the hypervisor allocator for
a free page and its free list is empty, the allocator will force
tmem to relinquish an ephemeral page (in LRU order).  Because
this is entirely up to the hypervisor and can happen at any
time, freeable pages are not counted as "owned" by a domain but
still have some value to a domain.

So, in essence, a "free" page has zero value and a "freeable"
page has a small, but non-zero value that decays over time.
So it's useful for a toolstack to know both quantities.

(And, since this thread has gone in many directions, let me
reiterate that all of this has been working in the hypervisor
since 4.0 in 2009, and cleancache in Linux since mid-2011.)
 
> >   But then
> > if "owned" increases enough, there may no longer be enough memory
> > and the domain launch will fail.
> 
> Again, "owned" would not increase at all if the guest weren't handing
> memory back to Xen.  Why is that necessary, or even helpful?

The guest _is_ handing memory back to Xen.  This is the other half
of the tmem functionality, persistent pages.

Answering your second question is going to require a little more
background.

Since nobody, not even the guest kernel, can guess the future
needs of its workload, there are two choices: (1) allocate enough
RAM so that the supply always exceeds max-demand, or (2) aggressively
reduce RAM to a reasonable guess for a target and prepare for the
probability that, sometimes, available RAM won't be enough.  Tmem does
choice #2; self-ballooning aggressively drives RAM (or "current memory"
as the hypervisor sees it) to a target level: in Linux, to Committed_AS
modified by a formula similar to the one Novell derived for a minimum
ballooning safety level.  The target level changes constantly, but the
selfballooning code samples and adjusts only periodically.  If, during
the time interval between samples, memory demand spikes, Linux
has a memory shortage and responds as it must, namely by swapping.

The frontswap code in Linux "intercepts" this swapping so that,
in most cases, it goes to a Xen tmem persistent pool instead of
to a (virtual or physical) swap disk.  Data in persistent pools,
unlike ephemeral pools, are guaranteed to be maintained by the
hypervisor until the guest invalidates it or until the guest dies.
As a result, pages allocated for persistent pools increase the count
of pages "owned" by the domain that requested the pages, until the guest
explicitly invalidates them (or dies).  The accounting also ensures
that malicious domains can't absorb memory beyond the toolset-specified
limit ("maxmem").

Note that, if compression is enabled, a domain _may_ "logically"
exceed maxmem, as long as it does not physically exceed it.

(And, again, all of this too has been in Xen since 4.0 in 2009,
and selfballooning has been in Linux since mid-2011, but frontswap
finally was accepted into Linux earlier in 2012.)

Ok, George, does that answer your questions, _technically_?  I'll
be happy to answer any others.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 14:54                                 ` Dan Magenheimer
@ 2012-11-05 22:24                                   ` Ian Campbell
  2012-11-05 22:58                                     ` Zhigang Wang
  2012-11-05 22:58                                     ` Dan Magenheimer
  0 siblings, 2 replies; 58+ messages in thread
From: Ian Campbell @ 2012-11-05 22:24 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, George Shuklin, DarioFaggioli, Jan Beulich,
	Kurt Hackel, Zhigang Wang

On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
> > From: Ian Campbell [mailto:ian.campbell@citrix.com]
> > Sent: Monday, November 05, 2012 3:30 AM
> > To: Dan Magenheimer
> > Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George
> > Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang
> > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > 
> > On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > > There is no "free up enough memory on that host". Tmem doesn't start
> > > ballooning out enough memory to start the VM... the guests are
> > > responsible for doing the ballooning and it is _already done_.  The
> > > machine either has sufficient free+freeable memory or it does not;
> > 
> > How does one go about deciding which host in a multi thousand host
> > deployment to try the claim hypercall on?
> 
> I don't get paid enough to solve that problem :-)
> 
> VM placement (both for new domains and migration due to
> load-balancing and power-management) is dependent on a
> number of factors currently involving CPU utilization,
> SAN utilization, and LAN utilization, I think using
> historical trends on streams of sampled statistics.  This
> is very non-deterministic as all of these factors may
> vary dramatically within a sampling interval.
> 
> Adding free+freeable memory to this just adds one more
> such statistic.  Actually two, as it is probably best to
> track free separately from freeable since a candidate
> host that has enough free memory should have preference
> over one with freeable memory.
> 
> Sorry if that's not very satisfying but anything beyond that
> meager description is outside of my area of expertise.

I guess I don't see how your proposed claim hypercall is useful if you
can't decide which machine you should call it on, whether it's 10s, 100s
or 1000s of hosts. Surely you aren't suggesting that the toolstack try
it on all (or even a subset) of them and see which sticks?

By ignoring this part of the problem I think you are ignoring one of the
most important bits of the story, without which it is very hard to make
a useful and informed determination about the validity of the use cases
you are describing for the new call.

Ian.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-04 20:35                           ` Tim Deegan
  2012-11-05  0:23                             ` Dan Magenheimer
@ 2012-11-05 22:33                             ` Dan Magenheimer
  2012-11-06 10:49                               ` Jan Beulich
  1 sibling, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-05 22:33 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, George Shuklin, Ian Jackson, xen-devel,
	DarioFaggioli, Jan Beulich, Kurt Hackel, Zhigang Wang

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Oops, missed an important part of your response... I'm glad
I went back and reread it...
 
> The claim hypercall _might_ fix (c) (if it could handle allocations that
> need address-width limits or contiguous pages).

I'm still looking into this part.

It's my understanding (from Jan) that, post-dom0-launch, there are
no known memory allocation paths that _require_ order>0 allocations.
All of them attempt a larger allocation and gracefully fallback
to (eventually) order==0 allocations.  I've hacked some code
in to the allocator to confirm this, though I'm not sure how
to test the hypothesis exhaustively.

For address-width limits, I suspect we are talking mostly or
entirely about DMA in 32-bit PV domains?  And/or PCI-passthrough?
I'll look into it further, but if those are the principal cases,
I'd have no problem documenting that the claim hypercall doesn't
handle them and attempts to build such a domain might still
fail slowly.  At least unless/until someone decided to add
any necessary special corner cases to the claim hypercall.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 22:24                                   ` Ian Campbell
@ 2012-11-05 22:58                                     ` Zhigang Wang
  2012-11-05 22:58                                     ` Dan Magenheimer
  1 sibling, 0 replies; 58+ messages in thread
From: Zhigang Wang @ 2012-11-05 22:58 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Dan Magenheimer, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	Olaf Hering, xen-devel, George Shuklin, DarioFaggioli,
	Jan Beulich, Kurt Hackel

On 11/05/2012 05:24 PM, Ian Campbell wrote:
> On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
>>> On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
>>>> There is no "free up enough memory on that host". Tmem doesn't start
>>>> ballooning out enough memory to start the VM... the guests are
>>>> responsible for doing the ballooning and it is _already done_.  The
>>>> machine either has sufficient free+freeable memory or it does not;
>>> How does one go about deciding which host in a multi thousand host
>>> deployment to try the claim hypercall on?
> I guess I don't see how your proposed claim hypercall is useful if you
> can't decide which machine you should call it on, whether it's 10s, 100s
> or 1000s of hosts. Surely you aren't suggesting that the toolstack try
> it on all (or even a subset) of them and see which sticks?
>
> By ignoring this part of the problem I think you are ignoring one of the
> most important bits of the story, without which it is very hard to make
> a useful and informed determination about the validity of the use cases
> you are describing for the new call.
Planned implement:

1. Every Server (dom0) sends memory statistics to Manager every 20 seconds
(tunable).
2. At one time, Manager selects a Server to run VM based on the snapshot of
Server memory. Selected server should have: enough free memory for the VM or
have free + freeable memory > VM memory.

Two ways to handle failures:

1. Try start_vm on the first selected Server. If failed, try the second one.

2. Try reserve memory on the first Server. If failed, try the second one. If
success, start_vm on the Server.

>From high level, Dan's proposal could help with 2). If memory allocation is fast
enough (VM start failed/success very fast), then 1) is preferred.

Thanks,

Zhigang

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 22:24                                   ` Ian Campbell
  2012-11-05 22:58                                     ` Zhigang Wang
@ 2012-11-05 22:58                                     ` Dan Magenheimer
  2012-11-06 13:23                                       ` Ian Campbell
  1 sibling, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-05 22:58 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Olaf Hering, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, George Shuklin, DarioFaggioli, Jan Beulich,
	Kurt Hackel, Zhigang Wang

> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Hi Ian --

> On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:ian.campbell@citrix.com]
> > > Sent: Monday, November 05, 2012 3:30 AM
> > > To: Dan Magenheimer
> > > Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George
> > > Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang
> Wang
> > > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > >
> > > On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > > > There is no "free up enough memory on that host". Tmem doesn't start
> > > > ballooning out enough memory to start the VM... the guests are
> > > > responsible for doing the ballooning and it is _already done_.  The
> > > > machine either has sufficient free+freeable memory or it does not;
> > >
> > > How does one go about deciding which host in a multi thousand host
> > > deployment to try the claim hypercall on?
> >
> > I don't get paid enough to solve that problem :-)
> >
> > VM placement (both for new domains and migration due to
> > load-balancing and power-management) is dependent on a
> > number of factors currently involving CPU utilization,
> > SAN utilization, and LAN utilization, I think using
> > historical trends on streams of sampled statistics.  This
> > is very non-deterministic as all of these factors may
> > vary dramatically within a sampling interval.
> >
> > Adding free+freeable memory to this just adds one more
> > such statistic.  Actually two, as it is probably best to
> > track free separately from freeable since a candidate
> > host that has enough free memory should have preference
> > over one with freeable memory.
> >
> > Sorry if that's not very satisfying but anything beyond that
> > meager description is outside of my area of expertise.
> 
> I guess I don't see how your proposed claim hypercall is useful if you
> can't decide which machine you should call it on, whether it's 10s, 100s
> or 1000s of hosts. Surely you aren't suggesting that the toolstack try
> it on all (or even a subset) of them and see which sticks?
> 
> By ignoring this part of the problem I think you are ignoring one of the
> most important bits of the story, without which it is very hard to make
> a useful and informed determination about the validity of the use cases
> you are describing for the new call.

I'm not ignoring it at all.  One only needs to choose a machine and
be prepared that the machine will (immediately) answer "sorry, won't fit".
It's not necessary to choose the _optimal_ fit, only a probable one.
Since failure is immediate, trying more than one machine (which should
happen only rarely) is not particularly problematic, though I completely
agree that trying _all_ of them might be.

The existing OracleVM Manager already chooses domain launch candidates
and load balancing candidates based on sampled CPU/SAN/LAN data, which
is always stale but still sufficient as a rough estimate of the best
machine to choose.

Beyond that, I'm not particularly knowledgeable about the details and,
even if I were, I'm not sure if the details are suitable for a public
forum.  But I can tell you that it has been shipping for over a year
and here's some of what's published... look for DRS and DPM.

http://www.oracle.com/us/technologies/virtualization/ovm3-whats-new-459313.pdf 

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 22:33                             ` Dan Magenheimer
@ 2012-11-06 10:49                               ` Jan Beulich
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Beulich @ 2012-11-06 10:49 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim Deegan, Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, ZhigangWang

>>> On 05.11.12 at 23:33, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> For address-width limits, I suspect we are talking mostly or
> entirely about DMA in 32-bit PV domains?  And/or PCI-passthrough?
> I'll look into it further, but if those are the principal cases,
> I'd have no problem documenting that the claim hypercall doesn't
> handle them and attempts to build such a domain might still
> fail slowly.  At least unless/until someone decided to add
> any necessary special corner cases to the claim hypercall.

DMA (also for 64-bit PV) is one aspect, and the fundamental
address restriction of 32-bit guests is the perhaps more
important one (for they can't access the full M2P map, and
hence can't ever be handed pages not covered by the
portion they have access to).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05 22:58                                     ` Dan Magenheimer
@ 2012-11-06 13:23                                       ` Ian Campbell
  0 siblings, 0 replies; 58+ messages in thread
From: Ian Campbell @ 2012-11-06 13:23 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Olaf Hering, Keir (Xen.org),
	Konrad Wilk, George Dunlap, Ian Jackson, Tim (Xen.org),
	xen-devel, George Shuklin, DarioFaggioli, Jan Beulich,
	Kurt Hackel, Zhigang Wang

On Mon, 2012-11-05 at 22:58 +0000, Dan Magenheimer wrote:
> It's not necessary to choose the _optimal_ fit, only a probable one. 

I think this is the key point which I was missing i.e. that it doesn't
need to be a totally accurate answer. Without that piece it seemed to me
that you must already have the more knowledgeable toolstack part which
others have mentioned.

Ian.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-05  9:16                           ` Jan Beulich
@ 2012-11-07 22:17                             ` Dan Magenheimer
  2012-11-08  7:36                               ` Keir Fraser
  2012-11-08  8:00                               ` Jan Beulich
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-07 22:17 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Wilk, George Dunlap,
	Ian Jackson, George Shuklin, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> > Aren't we getting a little sidetracked here?  (Maybe my fault for
> > looking at whether this specific loop is fast enough...)
> >
> > This loop handles only order=N chunks of RAM.  Speeding up this
> > loop and holding the heap_lock here for a shorter period only helps
> > the TOCTOU race if the entire domain can be allocated as a
> > single order-N allocation.
> >
> > Domain creation is supposed to succeed as long as there is
> > sufficient RAM, _regardless_ of the state of memory fragmentation,
> > correct?
> >
> > So unless the code for the _entire_ memory allocation path can
> > be optimized so that the heap_lock can be held across _all_ the
> > allocations necessary to create an arbitrary-sized domain, for
> > any arbitrary state of memory fragmentation, the original
> > problem has not been solved.
> >
> > Or am I misunderstanding?
> 
> I think we got here via questioning whether suppressing certain
> activities (like tmem causing the allocator visible amount of
> available memory) for a brief period of time would be acceptable,
> and while that indeed depends on the overall latency of memory
> allocation for the domain as a whole, I would be somewhat
> tolerant for it to involve a longer suspension period on a highly
> fragmented system.
> 
> But of course, if this can be made work uniformly, that would be
> preferred.

Hi Jan and Keir --

OK, here's a status update.  Sorry for the delay but it took awhile
for me to refamiliarize myself with the code paths.

It appears that the attempt to use 2MB and 1GB pages is done in
the toolstack, and if the hypervisor rejects it, toolstack tries
smaller pages.  Thus, if physical memory is highly fragmented
(few or no order>=9 allocations available), this will result
in one hypercall per 4k page so a 256GB domain would require
64 million hypercalls.  And, since AFAICT, there is no sane
way to hold the heap_lock across even two hypercalls, speeding
up the in-hypervisor allocation path, by itself, will not solve
the TOCTOU race.

One option to avoid the 64M hypercalls is to change the Xen ABI to
add a new memory hypercall/subop to populate_physmap an arbitrary
amount of physical RAM, and have Xen (optionally) try order==18, then
order==9, then order==0.  I suspect that, even with the overhead
of hypercalls removed, the steps required to allocate 64 million pages
(including, for example, removing a page from a xen list
and adding it to the domain's page list) will consume enough time
that holding the heap_lock and/or suppressing micro-allocations for
the entire macro-allocation on a fragmented system will still be
unacceptable (e.g. at least tens of seconds).  However, I am
speculating, and I think I can measure it if you (Jan or Keir)
feel a measurement is necessary to fully convince you.

I think this brings us back to the proposed "claim" hypercall/subop.
Unless there are further objections or suggestions for different
approaches, I'll commence prototyping it, OK?

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-07 22:17                             ` Dan Magenheimer
@ 2012-11-08  7:36                               ` Keir Fraser
  2012-11-08 10:11                                 ` Ian Jackson
  2012-11-08  8:00                               ` Jan Beulich
  1 sibling, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-11-08  7:36 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I think this brings us back to the proposed "claim" hypercall/subop.
> Unless there are further objections or suggestions for different
> approaches, I'll commence prototyping it, OK?

Yes, in fact I thought you'd started already!

 K.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-07 22:17                             ` Dan Magenheimer
  2012-11-08  7:36                               ` Keir Fraser
@ 2012-11-08  8:00                               ` Jan Beulich
  2012-11-08  8:18                                 ` Keir Fraser
  2012-11-08 18:38                                 ` Dan Magenheimer
  1 sibling, 2 replies; 58+ messages in thread
From: Jan Beulich @ 2012-11-08  8:00 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> It appears that the attempt to use 2MB and 1GB pages is done in
> the toolstack, and if the hypervisor rejects it, toolstack tries
> smaller pages.  Thus, if physical memory is highly fragmented
> (few or no order>=9 allocations available), this will result
> in one hypercall per 4k page so a 256GB domain would require
> 64 million hypercalls.  And, since AFAICT, there is no sane
> way to hold the heap_lock across even two hypercalls, speeding
> up the in-hypervisor allocation path, by itself, will not solve
> the TOCTOU race.

No, even in the absence of large pages, the tool stack will do 8M
allocations, just without requesting them to be contiguous.
Whether 8M is a suitable value is another aspect; that value may
predate hypercall preemption, and I don't immediately see why
the tool stack shouldn't be able to request larger chunks (up to
the whole amount at once).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  8:00                               ` Jan Beulich
@ 2012-11-08  8:18                                 ` Keir Fraser
  2012-11-08  8:54                                   ` Jan Beulich
  2012-11-08 18:38                                 ` Dan Magenheimer
  1 sibling, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-11-08  8:18 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, Keir Fraser, IanCampbell,
	Konrad Rzeszutek Wilk, George Dunlap, Ian Jackson,
	George Shuklin, xen-devel, DarioFaggioli, Kurt Hackel,
	Zhigang Wang

On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> It appears that the attempt to use 2MB and 1GB pages is done in
>> the toolstack, and if the hypervisor rejects it, toolstack tries
>> smaller pages.  Thus, if physical memory is highly fragmented
>> (few or no order>=9 allocations available), this will result
>> in one hypercall per 4k page so a 256GB domain would require
>> 64 million hypercalls.  And, since AFAICT, there is no sane
>> way to hold the heap_lock across even two hypercalls, speeding
>> up the in-hypervisor allocation path, by itself, will not solve
>> the TOCTOU race.
> 
> No, even in the absence of large pages, the tool stack will do 8M
> allocations, just without requesting them to be contiguous.
> Whether 8M is a suitable value is another aspect; that value may
> predate hypercall preemption, and I don't immediately see why
> the tool stack shouldn't be able to request larger chunks (up to
> the whole amount at once).

It is probably to allow other dom0 processing (including softirqs) to
preempt the toolstack task, in the case that the kernel was not built with
involuntary preemption enabled (having it disabled is the common case I
believe?). 8M batches may provide enough returns to user space to allow
other work to get a look-in.



> Jan
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  8:18                                 ` Keir Fraser
@ 2012-11-08  8:54                                   ` Jan Beulich
  2012-11-08  9:12                                     ` Keir Fraser
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-11-08  8:54 UTC (permalink / raw)
  To: Keir Fraser
  Cc: TimDeegan, Olaf Hering, Keir Fraser, IanCampbell,
	Konrad Rzeszutek Wilk, George Dunlap, Ian Jackson,
	George Shuklin, Dan Magenheimer, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

>>> On 08.11.12 at 09:18, Keir Fraser <keir.xen@gmail.com> wrote:
> On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>> It appears that the attempt to use 2MB and 1GB pages is done in
>>> the toolstack, and if the hypervisor rejects it, toolstack tries
>>> smaller pages.  Thus, if physical memory is highly fragmented
>>> (few or no order>=9 allocations available), this will result
>>> in one hypercall per 4k page so a 256GB domain would require
>>> 64 million hypercalls.  And, since AFAICT, there is no sane
>>> way to hold the heap_lock across even two hypercalls, speeding
>>> up the in-hypervisor allocation path, by itself, will not solve
>>> the TOCTOU race.
>> 
>> No, even in the absence of large pages, the tool stack will do 8M
>> allocations, just without requesting them to be contiguous.
>> Whether 8M is a suitable value is another aspect; that value may
>> predate hypercall preemption, and I don't immediately see why
>> the tool stack shouldn't be able to request larger chunks (up to
>> the whole amount at once).
> 
> It is probably to allow other dom0 processing (including softirqs) to
> preempt the toolstack task, in the case that the kernel was not built with
> involuntary preemption enabled (having it disabled is the common case I
> believe?). 8M batches may provide enough returns to user space to allow
> other work to get a look-in.

That may have mattered when ioctl-s were run with the big kernel
lock held, but even 2.6.18 didn't do that anymore (using the
.unlocked_ioctl field of struct file_operations), which means
that even softirqs will get serviced in Dom0 since the preempted
hypercall gets restarted via exiting to the guest (i.e. events get
delivered). Scheduling is what indeed wouldn't happen, but if
allocation latency can be brought down, 8M might turn out pretty
small a chunk size.

If we do care about Dom0-s running even older kernels (assuming
there ever was a privcmd implementation that didn't use the
unlocked path), or if we have to assume non-Linux Dom0-s might
have issues here, making the tool stack behavior kernel kind/
version dependent without strong need of course wouldn't sound
very attractive.

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  8:54                                   ` Jan Beulich
@ 2012-11-08  9:12                                     ` Keir Fraser
  2012-11-08  9:47                                       ` Jan Beulich
  0 siblings, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-11-08  9:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, Dan Magenheimer,
	xen-devel, DarioFaggioli, Kurt Hackel, Zhigang Wang

On 08/11/2012 08:54, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 08.11.12 at 09:18, Keir Fraser <keir.xen@gmail.com> wrote:
>> On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>>>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>>> It appears that the attempt to use 2MB and 1GB pages is done in
>>>> the toolstack, and if the hypervisor rejects it, toolstack tries
>>>> smaller pages.  Thus, if physical memory is highly fragmented
>>>> (few or no order>=9 allocations available), this will result
>>>> in one hypercall per 4k page so a 256GB domain would require
>>>> 64 million hypercalls.  And, since AFAICT, there is no sane
>>>> way to hold the heap_lock across even two hypercalls, speeding
>>>> up the in-hypervisor allocation path, by itself, will not solve
>>>> the TOCTOU race.
>>> 
>>> No, even in the absence of large pages, the tool stack will do 8M
>>> allocations, just without requesting them to be contiguous.
>>> Whether 8M is a suitable value is another aspect; that value may
>>> predate hypercall preemption, and I don't immediately see why
>>> the tool stack shouldn't be able to request larger chunks (up to
>>> the whole amount at once).
>> 
>> It is probably to allow other dom0 processing (including softirqs) to
>> preempt the toolstack task, in the case that the kernel was not built with
>> involuntary preemption enabled (having it disabled is the common case I
>> believe?). 8M batches may provide enough returns to user space to allow
>> other work to get a look-in.
> 
> That may have mattered when ioctl-s were run with the big kernel
> lock held, but even 2.6.18 didn't do that anymore (using the
> .unlocked_ioctl field of struct file_operations), which means
> that even softirqs will get serviced in Dom0 since the preempted
> hypercall gets restarted via exiting to the guest (i.e. events get
> delivered). Scheduling is what indeed wouldn't happen, but if
> allocation latency can be brought down, 8M might turn out pretty
> small a chunk size.

Ah, then I am out of date on how Linux services softirqs and preemption? Can
softirqs/preemption occur any time, even in kernel mode, so long as no locks
are held?

I thought softirq-type work only happened during event servicing, only if
the event servicing had interrupted user context (ie, would not happen if
started from within kernel mode). So the restart of the hypercall trap
instruction would be an opportunity to service hardirqs, but not softirqs or
scheduler...

 -- Keir

> If we do care about Dom0-s running even older kernels (assuming
> there ever was a privcmd implementation that didn't use the
> unlocked path), or if we have to assume non-Linux Dom0-s might
> have issues here, making the tool stack behavior kernel kind/
> version dependent without strong need of course wouldn't sound
> very attractive.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  9:12                                     ` Keir Fraser
@ 2012-11-08  9:47                                       ` Jan Beulich
  2012-11-08 10:50                                         ` Keir Fraser
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-11-08  9:47 UTC (permalink / raw)
  To: Keir Fraser
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, Dan Magenheimer,
	xen-devel, DarioFaggioli, Kurt Hackel, Zhigang Wang

>>> On 08.11.12 at 10:12, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 08:54, "Jan Beulich" <JBeulich@suse.com> wrote:
>> That may have mattered when ioctl-s were run with the big kernel
>> lock held, but even 2.6.18 didn't do that anymore (using the
>> .unlocked_ioctl field of struct file_operations), which means
>> that even softirqs will get serviced in Dom0 since the preempted
>> hypercall gets restarted via exiting to the guest (i.e. events get
>> delivered). Scheduling is what indeed wouldn't happen, but if
>> allocation latency can be brought down, 8M might turn out pretty
>> small a chunk size.
> 
> Ah, then I am out of date on how Linux services softirqs and preemption? Can
> softirqs/preemption occur any time, even in kernel mode, so long as no locks
> are held?
> 
> I thought softirq-type work only happened during event servicing, only if
> the event servicing had interrupted user context (ie, would not happen if
> started from within kernel mode). So the restart of the hypercall trap
> instruction would be an opportunity to service hardirqs, but not softirqs or
> scheduler...

No, irq_exit() can invoke softirqs, provided this isn't a nested IRQ
(soft as well as hard) or softirqs weren't disabled in the interrupted
context.

The only thing that indeed is - on non-preemptible kernels - done
only on exit to user mode is the eventual entering of the scheduler.

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  7:36                               ` Keir Fraser
@ 2012-11-08 10:11                                 ` Ian Jackson
  2012-11-08 10:57                                   ` Keir Fraser
  2012-11-08 21:45                                   ` Dan Magenheimer
  0 siblings, 2 replies; 58+ messages in thread
From: Ian Jackson @ 2012-11-08 10:11 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Tim (Xen.org),
	Dan Magenheimer, Ian Campbell, Konrad Rzeszutek Wilk,
	George Dunlap, Kurt Hackel, George Shuklin, Olaf Hering,
	xen-devel, DarioFaggioli, Jan Beulich, Zhigang Wang

Keir Fraser writes ("Re: Proposed new "memory capacity claim" hypercall/feature"):
> On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> > I think this brings us back to the proposed "claim" hypercall/subop.
> > Unless there are further objections or suggestions for different
> > approaches, I'll commence prototyping it, OK?
> 
> Yes, in fact I thought you'd started already!

Sorry to play bad cop here but I am still far from convinced that a
new hypercall is necessary or desirable.

A lot of words have been written but the concrete, detailed, technical
argument remains to be made IMO.

Ian.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  9:47                                       ` Jan Beulich
@ 2012-11-08 10:50                                         ` Keir Fraser
  2012-11-08 13:48                                           ` Jan Beulich
  0 siblings, 1 reply; 58+ messages in thread
From: Keir Fraser @ 2012-11-08 10:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, Dan Magenheimer,
	xen-devel, DarioFaggioli, Kurt Hackel, Zhigang Wang

On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:

>> Ah, then I am out of date on how Linux services softirqs and preemption? Can
>> softirqs/preemption occur any time, even in kernel mode, so long as no locks
>> are held?
>> 
>> I thought softirq-type work only happened during event servicing, only if
>> the event servicing had interrupted user context (ie, would not happen if
>> started from within kernel mode). So the restart of the hypercall trap
>> instruction would be an opportunity to service hardirqs, but not softirqs or
>> scheduler...
> 
> No, irq_exit() can invoke softirqs, provided this isn't a nested IRQ
> (soft as well as hard) or softirqs weren't disabled in the interrupted
> context.

Ah, okay. In fact maybe that's always been the case and I have misremembered
this detail, since condition for softirq entry in Xen has always been more
strict than this.

> The only thing that indeed is - on non-preemptible kernels - done
> only on exit to user mode is the eventual entering of the scheduler.

That alone may still be an argument for restricting the batch size from the
toolstack?

 -- Keir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 10:11                                 ` Ian Jackson
@ 2012-11-08 10:57                                   ` Keir Fraser
  2012-11-08 21:45                                   ` Dan Magenheimer
  1 sibling, 0 replies; 58+ messages in thread
From: Keir Fraser @ 2012-11-08 10:57 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Tim (Xen.org),
	Dan Magenheimer, Ian Campbell, Konrad Rzeszutek Wilk,
	George Dunlap, Kurt Hackel, George Shuklin, Olaf Hering,
	xen-devel, DarioFaggioli, Jan Beulich, Zhigang Wang

On 08/11/2012 10:11, "Ian Jackson" <Ian.Jackson@eu.citrix.com> wrote:

> Keir Fraser writes ("Re: Proposed new "memory capacity claim"
> hypercall/feature"):
>> On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>>> I think this brings us back to the proposed "claim" hypercall/subop.
>>> Unless there are further objections or suggestions for different
>>> approaches, I'll commence prototyping it, OK?
>> 
>> Yes, in fact I thought you'd started already!
> 
> Sorry to play bad cop here but I am still far from convinced that a
> new hypercall is necessary or desirable.
> 
> A lot of words have been written but the concrete, detailed, technical
> argument remains to be made IMO.

I agree but prototyping != acceptance, and at least it gives something
concrete to hang the discussion on. Otherwise this longwinded thread is
going nowhere.

> Ian.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 10:50                                         ` Keir Fraser
@ 2012-11-08 13:48                                           ` Jan Beulich
  2012-11-08 19:16                                             ` Dan Magenheimer
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Beulich @ 2012-11-08 13:48 UTC (permalink / raw)
  To: Keir Fraser
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, Dan Magenheimer,
	xen-devel, DarioFaggioli, Kurt Hackel, Zhigang Wang

>>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
>> The only thing that indeed is - on non-preemptible kernels - done
>> only on exit to user mode is the eventual entering of the scheduler.
> 
> That alone may still be an argument for restricting the batch size from the
> toolstack?

Yes, this clearly prohibits unlimited batches. But not being able to
schedule should be less restrictive than not being able to run
softirqs, so I'd still put under question whether the limit shouldn't
be bumped.

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08  8:00                               ` Jan Beulich
  2012-11-08  8:18                                 ` Keir Fraser
@ 2012-11-08 18:38                                 ` Dan Magenheimer
  1 sibling, 0 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-08 18:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: TimDeegan, Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > It appears that the attempt to use 2MB and 1GB pages is done in
> > the toolstack, and if the hypervisor rejects it, toolstack tries
> > smaller pages.  Thus, if physical memory is highly fragmented
> > (few or no order>=9 allocations available), this will result
> > in one hypercall per 4k page so a 256GB domain would require
> > 64 million hypercalls.  And, since AFAICT, there is no sane
> > way to hold the heap_lock across even two hypercalls, speeding
> > up the in-hypervisor allocation path, by itself, will not solve
> > the TOCTOU race.
> 
> No, even in the absence of large pages, the tool stack will do 8M
> allocations, just without requesting them to be contiguous.

Rats, you are right (as usual).  My debug code was poorly
placed and missed this important point.

So ignore the huge-number-of-hypercalls point and I think we
return to:  What is an upper time bound for holding the heap_lock
and, for an arbitrary-sized domain in an arbitrarily-fragmented
system, can the page allocation code be made fast enough to
fit within that bound?

I am in agreement that if the page allocation code can be
fast enough so that the heap_lock can be held, this is a better
solution than "claim".  I am just skeptical that, in
the presence of those two "arbitraries", it is possible.

So I will proceed with more measurements before prototyping
the "claim" stuff.

Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 13:48                                           ` Jan Beulich
@ 2012-11-08 19:16                                             ` Dan Magenheimer
  2012-11-08 22:32                                               ` Keir Fraser
  2012-11-09  8:47                                               ` Jan Beulich
  0 siblings, 2 replies; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-08 19:16 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Wilk, George Dunlap,
	Ian Jackson, George Shuklin, xen-devel, DarioFaggioli,
	Kurt Hackel, Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 08, 2012 6:49 AM
> To: Keir Fraser
> Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-
> devel@lists.xen.org; Dan Magenheimer; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang;
> TimDeegan
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
> > On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
> >> The only thing that indeed is - on non-preemptible kernels - done
> >> only on exit to user mode is the eventual entering of the scheduler.
> >
> > That alone may still be an argument for restricting the batch size from the
> > toolstack?
> 
> Yes, this clearly prohibits unlimited batches. But not being able to
> schedule should be less restrictive than not being able to run
> softirqs, so I'd still put under question whether the limit shouldn't
> be bumped.

Wait, please define unlimited.

I think we are in agreement from previous discussion that, to solve
the TOCTOU race, the heap_lock must be held for the entire allocation
for a domain creation.  True?

So unless the limit is "bumped" to handle the largest supported
physical memory size for a domain AND the allocation code in
the hypervisor is rewritten to hold the heap_lock while allocating
the entire extent, bumping the limit doesn't help the TOCTOU race,
correct?

Further, holding the heap_lock not only stops scheduling of
this pcpu, but also blocks other domains/pcpus from doing
any micro-allocations at all.  True?

Sorry if I am restating the obvious, but I am red-faced about
the huge-number-of-hypercalls mistake, so want to ensure if
I am understanding.

Dan

P.S. For PV domains, doesn't the toolstack already use a batch of up
to 2^20 pages?  (Or maybe I am misunderstanding/misreading the code
in arch_setup_meminit() in xc_dom_x86.c?)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 10:11                                 ` Ian Jackson
  2012-11-08 10:57                                   ` Keir Fraser
@ 2012-11-08 21:45                                   ` Dan Magenheimer
  2012-11-12 11:03                                     ` Ian Jackson
  1 sibling, 1 reply; 58+ messages in thread
From: Dan Magenheimer @ 2012-11-08 21:45 UTC (permalink / raw)
  To: Ian Jackson, Keir Fraser
  Cc: Tim (Xen.org),
	Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	Kurt Hackel, George Shuklin, xen-devel, DarioFaggioli,
	Jan Beulich, Zhigang Wang

> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> Keir Fraser writes ("Re: Proposed new "memory capacity claim" hypercall/feature"):
> > On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> > > I think this brings us back to the proposed "claim" hypercall/subop.
> > > Unless there are further objections or suggestions for different
> > > approaches, I'll commence prototyping it, OK?
> >
> > Yes, in fact I thought you'd started already!
> 
> Sorry to play bad cop here but I am still far from convinced that a
> new hypercall is necessary or desirable.
> 
> A lot of words have been written but the concrete, detailed, technical
> argument remains to be made IMO.

Hi Ian --

I agree, a _lot_ of words have been written and this discussion
has had a lot of side conversations so has gone back and forth into
a lot of weed patches.

I agree it would be worthwhile to restate the problem clearly,
along with some of the proposed solutions/pros/cons.  When I
have a chance I will do that, but prototyping may either clarify
some things or bring out some new unforeseen issues, so I think
I will do some more coding first (and this may take a week or two
due to some other constraints).

But to ensure that any summary/restatement touches on your
concerns, could you be more specific as to about what you are
unconvinced?

I.e. I still think the toolstack can manage all memory
allocation; or, holding the heap_lock for a longer period
should solve the problem; or I don't understand what the
original problem is that you are trying to solve, etc.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 19:16                                             ` Dan Magenheimer
@ 2012-11-08 22:32                                               ` Keir Fraser
  2012-11-09  8:47                                               ` Jan Beulich
  1 sibling, 0 replies; 58+ messages in thread
From: Keir Fraser @ 2012-11-08 22:32 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: TimDeegan, Olaf Hering, IanCampbell, Konrad Rzeszutek Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

On 08/11/2012 19:16, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Yes, this clearly prohibits unlimited batches. But not being able to
>> schedule should be less restrictive than not being able to run
>> softirqs, so I'd still put under question whether the limit shouldn't
>> be bumped.
> 
> Wait, please define unlimited.
> 
> I think we are in agreement from previous discussion that, to solve
> the TOCTOU race, the heap_lock must be held for the entire allocation
> for a domain creation.  True?

It's pretty obvious that this isn't going to be possible in the general
case. E.g., a 40G domain being created out of 4k pages (eg. Because memory
is fragmented) is going to be at least 40G/4k == 10M heap operations. Say
each takes 10ns, which would be quick, we're talking 100ms of cpu work.
Holding a lock that long can't be recommended really.

 -- Keir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 19:16                                             ` Dan Magenheimer
  2012-11-08 22:32                                               ` Keir Fraser
@ 2012-11-09  8:47                                               ` Jan Beulich
  1 sibling, 0 replies; 58+ messages in thread
From: Jan Beulich @ 2012-11-09  8:47 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: TimDeegan, Olaf Hering, Keir Fraser, IanCampbell, Konrad Wilk,
	George Dunlap, Ian Jackson, George Shuklin, xen-devel,
	DarioFaggioli, Kurt Hackel, Zhigang Wang

>>> On 08.11.12 at 20:16, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
>> > On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
>> >> The only thing that indeed is - on non-preemptible kernels - done
>> >> only on exit to user mode is the eventual entering of the scheduler.
>> >
>> > That alone may still be an argument for restricting the batch size from the
>> > toolstack?
>> 
>> Yes, this clearly prohibits unlimited batches. But not being able to
>> schedule should be less restrictive than not being able to run
>> softirqs, so I'd still put under question whether the limit shouldn't
>> be bumped.
> 
> Wait, please define unlimited.

Unlimited as in unlimited.

> I think we are in agreement from previous discussion that, to solve
> the TOCTOU race, the heap_lock must be held for the entire allocation
> for a domain creation.  True?

That's only one way (and as Keir already responded, not one
that we should actually pursue).

The point about being fast enough was rather made to allow
a decision towards the feasibility of intermediately disabling
tmem (or at least allocations originating from it) in particular
(I'm not worried about micro-allocations - the tool stack has to
provide some slack in its calculations for this anyway).

Jan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Proposed new "memory capacity claim" hypercall/feature
  2012-11-08 21:45                                   ` Dan Magenheimer
@ 2012-11-12 11:03                                     ` Ian Jackson
  0 siblings, 0 replies; 58+ messages in thread
From: Ian Jackson @ 2012-11-12 11:03 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Tim (Xen.org),
	Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
	Kurt Hackel, George Shuklin, xen-devel, Keir Fraser,
	DarioFaggioli, Jan Beulich, Zhigang Wang

Dan Magenheimer writes ("RE: Proposed new "memory capacity claim" hypercall/feature"):
> But to ensure that any summary/restatement touches on your
> concerns, could you be more specific as to about what you are
> unconvinced?
> 
> I.e. I still think the toolstack can manage all memory
> allocation;

I'm still unconvinced that this is false.  I think it's probably true.

Ian.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2012-11-12 11:03 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-29 17:06 Proposed new "memory capacity claim" hypercall/feature Dan Magenheimer
2012-10-29 18:24 ` Keir Fraser
2012-10-29 21:08   ` Dan Magenheimer
2012-10-29 22:22     ` Keir Fraser
2012-10-29 23:03       ` Dan Magenheimer
2012-10-29 23:17         ` Keir Fraser
2012-10-30 15:13           ` Dan Magenheimer
2012-10-30 14:43             ` Keir Fraser
2012-10-30 16:33               ` Dan Magenheimer
2012-10-30  9:11         ` George Dunlap
2012-10-30 16:13           ` Dan Magenheimer
2012-10-29 22:35 ` Tim Deegan
2012-10-29 23:21   ` Dan Magenheimer
2012-10-30  8:13     ` Tim Deegan
2012-10-30 15:26       ` Dan Magenheimer
2012-10-30  8:29     ` Jan Beulich
2012-10-30 15:43       ` Dan Magenheimer
2012-10-30 16:04         ` Jan Beulich
2012-10-30 17:13           ` Dan Magenheimer
2012-10-31  8:14             ` Jan Beulich
2012-10-31 16:04               ` Dan Magenheimer
2012-10-31 16:19                 ` Jan Beulich
2012-10-31 16:51                   ` Dan Magenheimer
2012-11-02  9:01                     ` Jan Beulich
2012-11-02  9:30                       ` Keir Fraser
2012-11-04 19:43                         ` Dan Magenheimer
2012-11-04 20:35                           ` Tim Deegan
2012-11-05  0:23                             ` Dan Magenheimer
2012-11-05 10:29                               ` Ian Campbell
2012-11-05 14:54                                 ` Dan Magenheimer
2012-11-05 22:24                                   ` Ian Campbell
2012-11-05 22:58                                     ` Zhigang Wang
2012-11-05 22:58                                     ` Dan Magenheimer
2012-11-06 13:23                                       ` Ian Campbell
2012-11-05 22:33                             ` Dan Magenheimer
2012-11-06 10:49                               ` Jan Beulich
2012-11-05  9:16                           ` Jan Beulich
2012-11-07 22:17                             ` Dan Magenheimer
2012-11-08  7:36                               ` Keir Fraser
2012-11-08 10:11                                 ` Ian Jackson
2012-11-08 10:57                                   ` Keir Fraser
2012-11-08 21:45                                   ` Dan Magenheimer
2012-11-12 11:03                                     ` Ian Jackson
2012-11-08  8:00                               ` Jan Beulich
2012-11-08  8:18                                 ` Keir Fraser
2012-11-08  8:54                                   ` Jan Beulich
2012-11-08  9:12                                     ` Keir Fraser
2012-11-08  9:47                                       ` Jan Beulich
2012-11-08 10:50                                         ` Keir Fraser
2012-11-08 13:48                                           ` Jan Beulich
2012-11-08 19:16                                             ` Dan Magenheimer
2012-11-08 22:32                                               ` Keir Fraser
2012-11-09  8:47                                               ` Jan Beulich
2012-11-08 18:38                                 ` Dan Magenheimer
2012-11-05 17:14         ` George Dunlap
2012-11-05 18:21           ` Dan Magenheimer
2012-11-01  2:13   ` Dario Faggioli
2012-11-01 15:51     ` Dan Magenheimer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.