* [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
@ 2012-11-14 23:55 Dan Magenheimer
  2012-11-15 10:25 ` Jan Beulich
  2012-11-15 12:25 ` Ian Campbell
  0 siblings, 2 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-14 23:55 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, TimDeegan; +Cc: Zhigang Wang, Konrad Wilk, xen-devel


This is a second cut of the hypervisor patch of the proposed
XENMEM_claim_pages hypercall/subop, taking into account
feedback from Jan and Keir, plus some fixes found via runtime
debugging (using privcmd only) and some comments/cleanup.

v1->v2:
- Add reset-to-zero page claim in domain_kill [JBeulich]
- Proper handling of struct passed to hypercall [JBeulich]
- Fix alloc_heap_pages when a domain has a claim [JBeulich]
- Need not hold heap_lock if !d->unclaimed_pages [Keir]
- Fix missed tot_pages call in donate_page [djm]
- Remove domain_reset_unclaimed_pages; use set with zero [djm]
- Bugfixes found through testing in set_unclaimed [djm]
- More comments in code [djm]
- Code formatting fixes [djm]

===

Motivation:

The goal of this hypercall is to atomically and very quickly
determine whether sufficient pages are available in the system
and, if so, "set aside" that quantity of pages for future
allocations by that domain.  Unlike existing hypercalls such as
increase_reservation or populate_physmap, no specific physical
pageframes are assigned to the domain, because that cannot be
done sufficiently quickly (especially for very large allocations
in an arbitrarily fragmented system); as a result, the existing
mechanisms suffer from classic time-of-check-time-of-use
(TOCTOU) races.  One can think of claiming as a "lazy"
allocation: subsequent hypercalls are still required to do the
actual physical pageframe allocation.

I don't have a patch for the toolstack side, but I envision
a "xl create --claim" option to maximize backwards
compatibility while minimizing impact on existing toolstacks.
As a result, testing has (so far) only been done via privcmd.
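
Purely for illustration (none of this exists yet and it is not
part of this patch), the toolstack-side flow might look roughly
like the sketch below.  claim_pages_for_domain() and the
memory_op() callback are hypothetical stand-ins for whatever
plumbing (libxc, privcmd, ...) the real toolstack patch ends up
using; only struct xen_memory_reservation and the new
XENMEM_claim_pages/XENMEM_CLAIMF_free_only definitions are real:

/*
 * HYPOTHETICAL toolstack-side sketch -- not an existing libxc or xl
 * interface.  memory_op() stands in for whatever issues the hypercall
 * (e.g. via privcmd).
 */
#include <string.h>
#include <xen/memory.h>

typedef int (*memory_op_fn)(unsigned int cmd, void *arg);

static int claim_pages_for_domain(memory_op_fn memory_op, domid_t domid,
                                  unsigned long nr_pages, int free_only)
{
    struct xen_memory_reservation r;

    memset(&r, 0, sizeof(r));      /* extent_start stays NULL, as required */
    r.nr_extents   = nr_pages;     /* quantity of order==0 pages to claim  */
    r.extent_order = 0;
    r.mem_flags    = free_only ? XENMEM_CLAIMF_free_only : 0;
    r.domid        = domid;

    return memory_op(XENMEM_claim_pages, &r);  /* 0 == claim staked */
}

A toolstack would stake the claim before building the domain; a
zero return means the subsequent populate_physmap or
increase_reservation calls are guaranteed to find enough
order==0 memory.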

Code overview:

Though the hypercall simply does arithmetic within locks, some
of the semantics in the code are a bit subtle, so let me explain.

The key variables (d->unclaimed_pages and total_unclaimed_pages)
start at zero if no claim has yet been staked for any domain.
(Perhaps a better name is "claimed_but_not_yet_possessed" but that's
a bit unwieldy.)  If no claim hypercalls are executed, there
should be no impact on existing usage.

When a claim is successfully staked by a domain, it acts like a
watermark; no record is kept of the size of the claim itself.
Instead, d->unclaimed_pages is set to the claim minus d->tot_pages
(d->tot_pages is normally zero at this point, but see Note 1).
When d->tot_pages increases or decreases, d->unclaimed_pages
atomically decreases or increases correspondingly.  Once
d->unclaimed_pages reaches zero, the claim is satisfied and
d->unclaimed_pages stays at zero -- unless a new claim is
subsequently staked.  See Note 2.

The systemwide variable total_unclaimed_pages is always the sum
of d->unclaimed_pages across all domains.  A non-domain-specific
heap allocation will fail if total_unclaimed_pages exceeds the
number of free (plus, on tmem-enabled systems, freeable) pages.
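
To make the arithmetic concrete, here is a toy userspace model
of the accounting described above (illustration only: plain C,
no locking, none of Xen's types, numbers picked arbitrarily):

/*
 * Toy model of the claim accounting -- illustration only.
 */
#include <stdio.h>

static long total_avail_pages = 1000;  /* free pages in the "system" */
static long total_unclaimed_pages;     /* sum of all domains' claims */

struct dom { long tot_pages, unclaimed_pages; };

/* Stake a claim: succeed only if it fits in unclaimed free memory. */
static int set_claim(struct dom *d, long pages)
{
    long claim = pages - d->tot_pages;

    if ( claim > total_avail_pages - total_unclaimed_pages )
        return -1;
    d->unclaimed_pages = claim;
    total_unclaimed_pages += claim;
    return 0;
}

/* Give pages to a domain: its claim shrinks as tot_pages grows. */
static void alloc_pages(struct dom *d, long pages)
{
    long debit = pages < d->unclaimed_pages ? pages : d->unclaimed_pages;

    total_avail_pages -= pages;
    d->tot_pages += pages;
    d->unclaimed_pages -= debit;
    total_unclaimed_pages -= debit;
}

int main(void)
{
    struct dom d = { 0, 0 };

    set_claim(&d, 600);   /* claim 600 of the 1000 free pages             */
    alloc_pages(&d, 200); /* tot_pages -> 200, claim drops from 600 to 400 */
    printf("unclaimed=%ld total_unclaimed=%ld\n",
           d.unclaimed_pages, total_unclaimed_pages);
    return 0;
}

Running it prints unclaimed=400 total_unclaimed=400: the 200
allocated pages have been debited from the claim, and any other
allocation sees only 400 of the remaining 800 free pages as
available.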

Claim semantics may be modified by flags.  The initial
implementation has one flag, which discerns whether the
caller would like tmem freeable pages to be considered
in determining whether or not the claim can be successfully
staked. Future flags may, for example, specify that the
caller would like the claim to be successful only if there
are sufficient pages available on a single node (per Dario's
suggestion).

Note 1: Tim: I'm thinking this may resolve your concern that
the claim mechanism must be more complicated to handle
restricted memory allocations and order>0 allocations.
The proposed claim mechanism only guarantees a quantity of
order==0 pages; if restricted allocations are required, these
are done first by the toolstack, and followed by the claim.
Order>0 allocations still work if memory is not fragmented,
but the claim mechanism doesn't guarantee anything but
a quantity of order==0 pages.

Note 2: Tim: This arithmetic also indirectly implements the
"claim auto-expire" discussed earlier.  We certainly don't
want a negative claim.  It's possible that the toolstack need
never call "unclaim" (and so such a hypercall/subop need not
even exist) as long as the claim is reset to zero when a domain
dies, which domain_kill() now does via
domain_set_unclaimed_pages(d, 0, 0).  This draft doesn't provide
unclaim... though if a reason for its existence is determined,
it should be easy to add.

Note 3: There is currently no way to observe a staked claim,
so a staked claim will not survive a save/restore/migrate.
Not clear yet if this is needed but could be added.

Thanks for any feedback!

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

 arch/x86/mm.c             |    2 
 arch/x86/mm/mem_sharing.c |    4 -
 common/domain.c           |    1 
 common/grant_table.c      |    2 
 common/memory.c           |   31 ++++++++++++-
 common/page_alloc.c       |  107 +++++++++++++++++++++++++++++++++++++++++++++-
 include/public/memory.h   |   31 +++++++++++++
 include/xen/mm.h          |    6 ++
 include/xen/sched.h       |    1 
 9 files changed, 178 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fad3d33..7e55908 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3841,7 +3841,7 @@ int donate_page(
     {
         if ( d->tot_pages >= d->max_pages )
             goto fail;
-        d->tot_pages++;
+        domain_increase_tot_pages(d, 1);
     }
 
     page->count_info = PGC_allocated | 1;
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 5103285..943a3b5 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -639,7 +639,7 @@ static int page_make_sharable(struct domain *d,
     }
 
     page_set_owner(page, dom_cow);
-    d->tot_pages--;
+    domain_decrease_tot_pages(d, 1);
     drop_dom_ref = (d->tot_pages == 0);
     page_list_del(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
@@ -680,7 +680,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( d->tot_pages++ == 0 )
+    if ( domain_increase_tot_pages(d, 1) == 0 )
         get_domain(d);
     page_list_add_tail(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 0e3e36a..95509e2 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -492,6 +492,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
+        domain_set_unclaimed_pages(d, 0, 0);
         d->tmem = NULL;
         /* fallthrough */
     case DOMDYING_dying:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 7912769..10ce78f 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1656,7 +1656,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(e->tot_pages++ == 0) )
+        if ( unlikely(domain_increase_tot_pages(e, 1) == 0) )
             get_knownalive_domain(e);
         page_list_add_tail(page, &e->page_list);
         page_set_owner(page, e);
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 83e2666..38f2cb9 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -454,7 +454,7 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
                              (j * (1UL << exch.out.extent_order)));
 
                 spin_lock(&d->page_alloc_lock);
-                d->tot_pages -= dec_count;
+                domain_decrease_tot_pages(d, dec_count);
                 drop_dom_ref = (dec_count && !d->tot_pages);
                 spin_unlock(&d->page_alloc_lock);
 
@@ -685,6 +685,35 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case XENMEM_claim_pages:
+    {
+        unsigned long request;
+
+        start_extent = cmd >> MEMOP_EXTENT_SHIFT;
+
+        if ( copy_from_guest(&reservation, arg, 1) )
+            return start_extent;
+
+        if ( !(guest_handle_is_null(reservation.extent_start)) )
+            return start_extent;
+
+        rc = rcu_lock_target_domain_by_id(reservation.domid, &d);
+        if ( rc )
+            return rc;
+
+        /*
+         * extent_order may be non-zero, but is only a multiplier and
+         * does not currently claim any order>0 slabs, though this is
+         * a possible future feature
+         */
+        request = reservation.nr_extents << reservation.extent_order;
+        rc = domain_set_unclaimed_pages(d, request, reservation.mem_flags);
+
+        rcu_unlock_domain(d);
+
+        break;
+    }
+
     default:
         rc = arch_memory_op(op, arg);
         break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 15ebc66..178d816 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -238,6 +238,100 @@ static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
 
 static DEFINE_SPINLOCK(heap_lock);
+static long total_unclaimed_pages; /* total outstanding claims by all domains */
+
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages)
+{
+    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
+
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    if ( !d->unclaimed_pages )
+        return d->tot_pages += pages;
+    spin_lock(&heap_lock);
+    dom_before = d->unclaimed_pages;
+    dom_after = dom_before - pages;
+    if ( (dom_before > 0) && (dom_after < 0) )
+        dom_claimed = 0;
+    else
+        dom_claimed = dom_after;
+    sys_before = total_unclaimed_pages;
+    sys_after = sys_before - (dom_before - dom_claimed);
+    BUG_ON( (sys_before > 0) && (sys_after < 0) );
+    total_unclaimed_pages = sys_after;
+    d->unclaimed_pages = dom_claimed;
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages)
+{
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+    d->tot_pages -= pages;
+    if ( d->unclaimed_pages )
+    {
+        d->unclaimed_pages += pages;
+        total_unclaimed_pages += pages;
+    }
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
+                                unsigned long flags)
+{
+    int ret = -ENOMEM;
+    unsigned long claim = 0, avail_pages = 0;
+
+    /*
+     * take the domain's page_alloc_lock, else all increases/decreases
+     * must always take the global heap_lock rather than only in the much
+     * rarer case that d->unclaimed_pages is non-zero
+     */
+    ASSERT(!spin_is_locked(&d->page_alloc_lock));
+    spin_lock(&d->page_alloc_lock);
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+
+    /* pages==0 means "unset" the claim (and flags is ignored) */
+    if (pages == 0)
+    {
+        total_unclaimed_pages -= d->unclaimed_pages;
+        d->unclaimed_pages = 0;
+        ret = 0;
+        goto out;
+    }
+
+    /* only one active claim per domain please */
+    if ( d->unclaimed_pages)
+        goto out;
+
+    /* how much memory is available? */
+    avail_pages = total_avail_pages;
+    if ( !(flags & XENMEM_CLAIMF_free_only) )
+        avail_pages += tmem_freeable_pages();
+    avail_pages -= total_unclaimed_pages;
+
+    /*
+     * note, if domain has already allocated memory before making a claim 
+     * then the claim must take tot_pages into account
+     */
+    claim = pages - d->tot_pages;
+    if ( claim > avail_pages )
+        goto out;
+
+    /* yay, claim fits in available memory, stake the claim, success! */
+    d->unclaimed_pages = claim;
+    total_unclaimed_pages += d->unclaimed_pages;
+    ret = 0;
+
+out:
+    spin_unlock(&heap_lock);
+    spin_unlock(&d->page_alloc_lock);
+    return ret;
+}
 
 static unsigned long init_node_heap(int node, unsigned long mfn,
                                     unsigned long nr, bool_t *use_tail)
@@ -443,6 +537,15 @@ static struct page_info *alloc_heap_pages(
     spin_lock(&heap_lock);
 
     /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (total_unclaimed_pages + request >
+           total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->unclaimed_pages < request) )
+        goto not_found;
+
+    /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
      * only mid-size allocations to avoid worst of fragmentation issues.
      * Others try tmem pools then fail.  This is a workaround until all
@@ -1291,7 +1394,7 @@ int assign_pages(
         if ( unlikely(d->tot_pages == 0) )
             get_knownalive_domain(d);
 
-        d->tot_pages += 1 << order;
+        domain_increase_tot_pages(d, 1 << order);
     }
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1375,7 +1478,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
         }
 
-        d->tot_pages -= 1 << order;
+        domain_decrease_tot_pages(d, 1 << order);
         drop_dom_ref = (d->tot_pages == 0);
 
         spin_unlock_recursive(&d->page_alloc_lock);
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f1ddbc0..c4de5c1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -68,6 +68,8 @@ struct xen_memory_reservation {
      *   IN:  GPFN bases of extents to populate with memory
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
      */
     XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
 
@@ -421,6 +423,35 @@ struct xen_mem_sharing_op {
 typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
 
+/*
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes.  Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation.  Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim.  When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain.  In this case, the claim is not incremental,
+ * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
+ * only 7 additional pages are claimed.
+ */
+#define XENMEM_claim_pages                  24
+/*
+ * XENMEM_claim_pages flags:
+ *  free_only: claim is successful only if sufficient free pages
+ *    are available.  If not set and tmem is enabled, hypervisor
+ *    may also consider tmem "freeable" pages to satisfy the claim.
+ */
+#define _XENMEM_CLAIMF_free_only            0
+#define XENMEM_CLAIMF_free_only             (1U<<_XENMEM_CLAIMF_free_only)
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 64a0cc1..5c63581 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -48,6 +48,12 @@ void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
+/* Claim handling */
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages);
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages);
+int domain_set_unclaimed_pages(
+    struct domain *d, unsigned long pages, unsigned long flags);
+
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6c55039..480ef39 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -242,6 +242,7 @@ struct domain
     struct page_list_head page_list;  /* linked list */
     struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
     unsigned int     tot_pages;       /* number of pages currently possesed */
+    unsigned int     unclaimed_pages; /* pages claimed but not possessed    */
     unsigned int     max_pages;       /* maximum value for tot_pages        */
     atomic_t         shr_pages;       /* number of shared pages             */
     atomic_t         paged_pages;     /* number of paged-out pages          */



* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-14 23:55 [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall Dan Magenheimer
@ 2012-11-15 10:25 ` Jan Beulich
  2012-11-15 18:00   ` Dan Magenheimer
  2012-11-15 12:25 ` Ian Campbell
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2012-11-15 10:25 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir Fraser, TimDeegan, KonradWilk, Zhigang Wang, xen-devel

>>> On 15.11.12 at 00:55, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> @@ -685,6 +685,35 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          break;
>      }
>  
> +    case XENMEM_claim_pages:
> +    {
> +        unsigned long request;
> +
> +        start_extent = cmd >> MEMOP_EXTENT_SHIFT;

What's this? You don't do continuations (you explicitly avoid
them in order to be fast), and hence this ought to be ignored
just like e.g. XENMEM_remove_from_physmap ignores it.

> +
> +        if ( copy_from_guest(&reservation, arg, 1) )
> +            return start_extent;

-EFAULT

> +
> +        if ( !(guest_handle_is_null(reservation.extent_start)) )
> +            return start_extent;

-EINVAL

> +
> +        rc = rcu_lock_target_domain_by_id(reservation.domid, &d);
> +        if ( rc )
> +            return rc;
> +
> +        /*
> +         * extent_order may be non-zero, but is only a multiplier and
> +         * does not currently claim any order>0 slabs, though this is
> +         * a possible future feature
> +         */

If extent_order doesn't do what it promises to do, just don't
allow non-zero values.

> +        request = reservation.nr_extents << reservation.extent_order;
> +        rc = domain_set_unclaimed_pages(d, request, reservation.mem_flags);
> +
> +        rcu_unlock_domain(d);
> +
> +        break;
> +    }
> +
>...
> +unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages)
> +{
> +    ASSERT(spin_is_locked(&d->page_alloc_lock));
> +    ASSERT(!spin_is_locked(&heap_lock));

There are numerous of these throughout the code, but while
asserting a certain lock to be held is valid, asserting that one
isn't being held (by _anyone_) clearly isn't. You're presumably
mixing this up with something like spin_is_owned()...

> +    spin_lock(&heap_lock);
> +    d->tot_pages -= pages;

Shouldn't you do the same lockless approach for the common case
here as you did on Keir's advice in domain_increase_tot_pages()?

> +    if ( d->unclaimed_pages )
> +    {
> +        d->unclaimed_pages += pages;
> +        total_unclaimed_pages += pages;
> +    }
> +    spin_unlock(&heap_lock);
> +    return d->tot_pages;
> +}
> +
> +int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
> +                                unsigned long flags)
> +{
> +    int ret = -ENOMEM;
> +    unsigned long claim = 0, avail_pages = 0;

Pointless initializers.

> +
> +    /*
> +     * take the domain's page_alloc_lock, else all increases/decreases
> +     * must always take the global heap_lock rather than only in the much
> +     * rarer case that d->unclaimed_pages is non-zero
> +     */
> +    ASSERT(!spin_is_locked(&d->page_alloc_lock));
> +    spin_lock(&d->page_alloc_lock);
> +    ASSERT(!spin_is_locked(&heap_lock));
> +    spin_lock(&heap_lock);
> +
> +    /* pages==0 means "unset" the claim (and flags is ignored) */
> +    if (pages == 0)
> +    {
> +        total_unclaimed_pages -= d->unclaimed_pages;
> +        d->unclaimed_pages = 0;
> +        ret = 0;
> +        goto out;
> +    }
> +
> +    /* only one active claim per domain please */
> +    if ( d->unclaimed_pages)
> +        goto out;

-ENOMEM for this case seems wrong.

> +
> +    /* how much memory is available? */
> +    avail_pages = total_avail_pages;
> +    if ( !(flags & XENMEM_CLAIMF_free_only) )
> +        avail_pages += tmem_freeable_pages();
> +    avail_pages -= total_unclaimed_pages;
> +
> +    /*
> +     * note, if domain has already allocated memory before making a claim 
> +     * then the claim must take tot_pages into account
> +     */
> +    claim = pages - d->tot_pages;
> +    if ( claim > avail_pages )
> +        goto out;

What if pages < d->tot_pages?

> +
> +    /* yay, claim fits in available memory, stake the claim, success! */
> +    d->unclaimed_pages = claim;
> +    total_unclaimed_pages += d->unclaimed_pages;
> +    ret = 0;
> +
> +out:
> +    spin_unlock(&heap_lock);
> +    spin_unlock(&d->page_alloc_lock);
> +    return ret;
> +}
>  
>  static unsigned long init_node_heap(int node, unsigned long mfn,
>                                      unsigned long nr, bool_t *use_tail)
> @@ -443,6 +537,15 @@ static struct page_info *alloc_heap_pages(
>      spin_lock(&heap_lock);
>  
>      /*
> +     * Claimed memory is considered unavailable unless the request
> +     * is made by a domain with sufficient unclaimed pages.
> +     */
> +    if ( (total_unclaimed_pages + request >
> +           total_avail_pages + tmem_freeable_pages()) &&
> +          (d == NULL || d->unclaimed_pages < request) )
> +        goto not_found;

The treatment of d being NULL certainly needs further thought:
Is it really better to fulfill the claim and fail some (perhaps
important) _xmalloc()?

Also, I'm missing a mechanism by which the tools could find out
how much unclaimed memory is available, in order to determine
(if in use) how much memory needs to be ballooned out of Dom0.

Similarly, but perhaps of lower priority, there is no integration
with the low-mem handling.

Finally, there still are a number of formatting issues.

Jan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-14 23:55 [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall Dan Magenheimer
  2012-11-15 10:25 ` Jan Beulich
@ 2012-11-15 12:25 ` Ian Campbell
  2012-11-15 12:39   ` Jan Beulich
  2012-11-15 19:15   ` Dan Magenheimer
  1 sibling, 2 replies; 26+ messages in thread
From: Ian Campbell @ 2012-11-15 12:25 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org), Dave McCracken, Konrad Wilk, Tim (Xen.org),
	xen-devel, Jan Beulich, Zhigang Wang

On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> 
> Note 1: Tim: I'm thinking this may resolve your concern that
> the claim mechanism must be more complicated to handle
> restricted memory allocations and order>0 allocations.
> The proposed claim mechanism only guarantees a quantity of
> order==0 pages; if restricted allocations are required, these
> are done first by the toolstack, and followed by the claim.
> Order>0 allocations still work if memory is not fragmented,
> but the claim mechanism doesn't guarantee anything but
> a quantity of order==0 pages.

How does this interact with the feature which lets you create PV guests
using only superpages? I believe this is something Oracle added and still
maintains (Dave added to the CC).

Also doesn't this fail to make any sort of guarantee if you are building
a 32 bit PV guest, since they require memory under a certain host
address limit (160GB IIRC)?

Basically neither of those cases benefit from this hypercall at all? I
don't know what the usecase for the superpage PV guests is (but I
suppose it is important to Oracle, at least). The 32 bit PV guest use
case is still a pretty significant one which I think ought to be
handled, otherwise any system built on top of this functionality will
only work reliably in a subset of cases.

Ian.


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 12:25 ` Ian Campbell
@ 2012-11-15 12:39   ` Jan Beulich
  2012-11-15 19:19     ` Dan Magenheimer
  2012-11-15 19:15   ` Dan Magenheimer
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2012-11-15 12:39 UTC (permalink / raw)
  To: Ian Campbell, Dan Magenheimer
  Cc: xen-devel, Keir (Xen.org), KonradWilk, Tim(Xen.org),
	Dave McCracken, Zhigang Wang

>>> On 15.11.12 at 13:25, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> Also doesn't this fail to make any sort of guarantee if you are building
> a 32 bit PV guest, since they require memory under a certain host
> address limit (160GB IIRC)?

This case is unreliable already, and has always been (I think we
have a tools side hack in some of our trees in an attempt to deal
with that), when ballooning is used to get at the memory, or
when trying to start a 32-bit guest after having run 64-bit ones
exhausting most of memory, and having terminated an early
created one (as allocation is top down, ones created close to
exhaustion, i.e. later, would eat up that "special" memory at
lower addresses).

So this new functionality "only" makes a bad situation worse
(which isn't meant to say I wouldn't prefer to see it get fixed).

Jan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 10:25 ` Jan Beulich
@ 2012-11-15 18:00   ` Dan Magenheimer
  2012-11-16 10:36     ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-15 18:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Keir Fraser, TimDeegan, Konrad Wilk, Zhigang Wang, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 15, 2012 3:25 AM
> To: Dan Magenheimer
> Cc: xen-devel@lists.xen.org; KonradWilk; Zhigang Wang; Keir Fraser; TimDeegan
> Subject: Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall

Hi Jan --

Thanks for the quick review!

Mostly fixed for v3 but a couple things:
 
> > +     * Claimed memory is considered unavailable unless the request
> > +     * is made by a domain with sufficient unclaimed pages.
> > +     */
> > +    if ( (total_unclaimed_pages + request >
> > +           total_avail_pages + tmem_freeable_pages()) &&
> > +          (d == NULL || d->unclaimed_pages < request) )
> > +        goto not_found;
> 
> The treatment of d being NULL certainly needs further thought:
> Is it really better to fulfill the claim and fail some (perhaps
> important) _xmalloc()?

Ideally, allocation in the presence of existing claims should
behave as if the claiming domains had actually already allocated
the unclaimed-amount-of-memory.  So I'd argue that enforcing
the claim should be sacrosanct here.
 
> Also, I'm missing a mechanism by which the tools could find out
> how much unclaimed memory is available, in order to determine
> (if in use) how much memory needs to be ballooned out of Dom0.

OK.  I'm not certain if this will be useful on a per-domain
basis as well but, for completeness, I will also add
unclaimed_pages into xc_dominfo etc (which causes a bump
in XEN_DOMCTL_INTERFACE_VERSION).

> Similarly, but perhaps of lower priority, there is no integration
> with the low-mem handling.

I'd also consider this lower priority as Olaf and Andre
have argued that the claim mechanism is not needed for
sharing/paging so the two mechanisms may not
be used together, at least for the foreseeable future.
So I plan to skip this, unless you change your mind and
consider it a showstopper for acceptance.

> Finally, there still are a number of formatting issues.

Hmmm... I found one I think.  Is there an equivalent to
checkpatch for hypervisor code?  If you see any formatting
issues in v3, please call them out explicitly as I am
sincerely trying to avoid them.

Thanks,
Dan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 12:25 ` Ian Campbell
  2012-11-15 12:39   ` Jan Beulich
@ 2012-11-15 19:15   ` Dan Magenheimer
  2012-11-16 10:39     ` Jan Beulich
  2012-11-16 10:48     ` Ian Campbell
  1 sibling, 2 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-15 19:15 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), Dave Mccracken, Konrad Wilk, Tim (Xen.org),
	xen-devel, Jan Beulich, Zhigang Wang

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> >
> > Note 1: Tim: I'm thinking this may resolve your concern that
> > the claim mechanism must be more complicated to handle
> > restricted memory allocations and order>0 allocations.
> > The proposed claim mechanism only guarantees a quantity of
> > order==0 pages; if restricted allocations are required, these
> > are done first by the toolstack, and followed by the claim.
> > Order>0 allocations still work if memory is not fragmented,
> > but the claim mechanism doesn't guarantee anything but
> > a quantity of order==0 pages.
> 
> How does this interact with the feature which lets you create PV guests
> using only superpages? I believe is something Oracle added and still
> maintains (Dave added to the CC).
> 
> Also doesn't this fail to make any sort of guarantee if you are building
> a 32 bit PV guest, since they require memory under a certain host
> address limit (160GB IIRC)?
> 
> Basically neither of those cases benefit from this hypercall at all? I
> don't know what the usecase for the superpage PV guests is (but I
> suppose it is important to Oracle, at least). The 32 bit PV guest use
> case is still a pretty significant one which I think ought to be
> handled, otherwise any system built on top of this functionality will
> only work reliably in a subset of cases.

The claim mechanism will not benefit PV superpages.  IIUC, the
design of the PV superpages will cause a domain launch to fail
if it requests 10000 superpages but Xen can only successfully
allocate 9999.  That's already very fragile.  Since the only
way currently to find out if there are 10000 superpages available
is to start allocating them, claim can't really help.

For 32 bit PV guests, note that a claim does NOT have to be
staked prior to any allocation.  So if a toolstack needs
to enforce that some portion of allocated memory is under
a host address limit, it can (attempt to) allocate those pages,
then stake a claim for the rest.  Or just not use the claim
mechanism at all.

Dan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 12:39   ` Jan Beulich
@ 2012-11-15 19:19     ` Dan Magenheimer
  2012-11-16 10:41       ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-15 19:19 UTC (permalink / raw)
  To: Jan Beulich, Ian Campbell
  Cc: xen-devel, Keir (Xen.org), Konrad Wilk, Tim(Xen.org),
	Dave Mccracken, Zhigang Wang

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 15, 2012 5:39 AM
> To: Ian Campbell; Dan Magenheimer
> Cc: xen-devel@lists.xen.org; Dave McCracken; KonradWilk; Zhigang Wang; Keir (Xen.org); Tim(Xen.org)
> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> >>> On 15.11.12 at 13:25, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > Also doesn't this fail to make any sort of guarantee if you are building
> > a 32 bit PV guest, since they require memory under a certain host
> > address limit (160GB IIRC)?
> 
> This case is unreliable already, and has always been (I think we
> have a tools side hack in some of our trees in an attempt to deal
> with that), when ballooning is used to get at the memory, or
> when trying to start a 32-bit guest after having run 64-bit ones
> exhausting most of memory, and having terminated an early
> created one (as allocation is top down, ones created close to
> exhaustion, i.e. later, would eat up that "special" memory at
> lower addresses).
> 
> So this new functionality "only" makes a bad situation worse
> (which isn't meant to say I wouldn't prefer to see it get fixed).

Hmmm... I guess I don't see how claim makes the situation worse.
Well maybe a few microseconds worse.

Old model:
(1) Allocate a huge number of pages

New model:
(1) Claim a huge number of pages.  If successful...
(2) Allocate that huge number of pages

In either case, the failure conditions are the same
except that the claim mechanism checks one of the
failure conditions sooner.

Or am I misunderstanding?


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 18:00   ` Dan Magenheimer
@ 2012-11-16 10:36     ` Jan Beulich
  2012-11-19 20:04       ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2012-11-16 10:36 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir Fraser, TimDeegan, Konrad Wilk, ZhigangWang, xen-devel

>>> On 15.11.12 at 19:00, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Similarly, but perhaps of lower priority, there is no integration
>> with the low-mem handling.
> 
> I'd also consider this lower priority as Olaf and Andre
> have argued that the claim mechanism is not needed for
> sharing/paging so the two mechanisms may not
> be used together, at least for the foreseeable future.
> So I plan to skip this, unless you change your mind and
> consider it a showstopper for acceptance.

Skipping for the initial implementation is likely fine, but that
shouldn't mean deferring the integration indefinitely. Also,
I see no close connection between the low-mem feature
and sharing/paging (apart from Andres working on both).

Jan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 19:15   ` Dan Magenheimer
@ 2012-11-16 10:39     ` Jan Beulich
  2012-11-16 10:48     ` Ian Campbell
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2012-11-16 10:39 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Keir (Xen.org), Ian Campbell, KonradWilk, Tim(Xen.org),
	Dave Mccracken, Zhigang Wang

>>> On 15.11.12 at 20:15, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> For 32 bit PV guests, note that a claim does NOT have to be
> staked prior to any allocation.  So if a toolstack needs
> to enforce that some portion of allocated memory is under
> a host address limit, it can (attempt to) allocate those pages,
> then stake a claim for the rest.  Or just not use the claim
> mechanism at all.

I don't get what you're trying to tell us here.

Jan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 19:19     ` Dan Magenheimer
@ 2012-11-16 10:41       ` Jan Beulich
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Beulich @ 2012-11-16 10:41 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Tim(Xen.org),
	Dave Mccracken, Zhigang Wang

>>> On 15.11.12 at 20:19, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, November 15, 2012 5:39 AM
>> To: Ian Campbell; Dan Magenheimer
>> Cc: xen-devel@lists.xen.org; Dave McCracken; KonradWilk; Zhigang Wang; Keir 
> (Xen.org); Tim(Xen.org)
>> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> existing) hypercall
>> 
>> >>> On 15.11.12 at 13:25, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > Also doesn't this fail to make any sort of guarantee if you are building
>> > a 32 bit PV guest, since they require memory under a certain host
>> > address limit (160GB IIRC)?
>> 
>> This case is unreliable already, and has always been (I think we
>> have a tools side hack in some of our trees in an attempt to deal
>> with that), when ballooning is used to get at the memory, or
>> when trying to start a 32-bit guest after having run 64-bit ones
>> exhausting most of memory, and having terminated an early
>> created one (as allocation is top down, ones created close to
>> exhaustion, i.e. later, would eat up that "special" memory at
>> lower addresses).
>> 
>> So this new functionality "only" makes a bad situation worse
>> (which isn't meant to say I wouldn't prefer to see it get fixed).
> 
> Hmmm... I guess I don't see how claim makes the situation worse.
> Well maybe a few microseconds worse.
> 
> Old model:
> (1) Allocate a huge number of pages
> 
> New model:
> (1) Claim a huge number of pages.  If successful...
> (2) Allocate that huge number of pages
> 
> In either case, the failure conditions are the same
> except that the claim mechanism checks one of the
> failure conditions sooner.
> 
> Or am I misunderstanding?

I think you are: Your new code adds one more thing that isn't
properly integrated with the creation of such guests. I.e. it's
not the user experience getting worse, but the state of the
code.

Jan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 19:15   ` Dan Magenheimer
  2012-11-16 10:39     ` Jan Beulich
@ 2012-11-16 10:48     ` Ian Campbell
  2012-11-19 20:53       ` Dan Magenheimer
  1 sibling, 1 reply; 26+ messages in thread
From: Ian Campbell @ 2012-11-16 10:48 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org), Dave Mccracken, Konrad Wilk, Tim (Xen.org),
	xen-devel, Jan Beulich, Zhigang Wang

On Thu, 2012-11-15 at 19:15 +0000, Dan Magenheimer wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> > 
> > On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> > >
> > > Note 1: Tim: I'm thinking this may resolve your concern that
> > > the claim mechanism must be more complicated to handle
> > > restricted memory allocations and order>0 allocations.
> > > The proposed claim mechanism only guarantees a quantity of
> > > order==0 pages; if restricted allocations are required, these
> > > are done first by the toolstack, and followed by the claim.
> > > Order>0 allocations still work if memory is not fragmented,
> > > but the claim mechanism doesn't guarantee anything but
> > > a quantity of order==0 pages.
> > 
> > How does this interact with the feature which lets you create PV guests
> > using only superpages? I believe is something Oracle added and still
> > maintains (Dave added to the CC).
> > 
> > Also doesn't this fail to make any sort of guarantee if you are building
> > a 32 bit PV guest, since they require memory under a certain host
> > address limit (160GB IIRC)?
> > 
> > Basically neither of those cases benefit from this hypercall at all? I
> > don't know what the usecase for the superpage PV guests is (but I
> > suppose it is important to Oracle, at least). The 32 bit PV guest use
> > case is still a pretty significant one which I think ought to be
> > handled, otherwise any system built on top of this functionality will
> > only work reliably in a subset of cases.
> 
> The claim mechanism will not benefit PV superpages.  IIUC, the
> design of the PV superpages will cause a domain launch to fail
> if it requests 10000 superpages but Xen can only successfully
> allocate 9999.  That's already very fragile.  Since the only
> way currently to find out if there are 10000 superpages available
> is to start allocating them, claim can't really help.

Well, you could always account the number of free superpages in the
system, which would allow you to cover this case too.

> For 32 bit PV guests, note that a claim does NOT have to be
> staked prior to any allocation.  So if a toolstack needs
> to enforce that some portion of allocated memory is under
> a host address limit, it can (attempt to) allocate those pages,
> then stake a claim for the rest.

For 32 bit PV guests this is *all* of the pages needed to build any 32
bit PV guest.

> Or just not use the claim mechanism at all.

So your use case has no requirement to be able to start 32 bit domains?
Or does whatever requirement you have that leads to the claim mechanism
somehow not apply to those sorts of guests? If not then why not?

As it stands it seems that any toolstack which wants to use claim would
still have to cope with the fact that a potentially significant
proportion of guests may still fail to build even after a claim has been
successfully staked.

Even if these shortcomings are acceptable in your specific scenario I
don't see why we should be satisfied with a solution which is not more
generally applicable.

Ian.


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-16 10:36     ` Jan Beulich
@ 2012-11-19 20:04       ` Dan Magenheimer
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-19 20:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Keir Fraser, TimDeegan, Konrad Wilk, Zhigang Wang, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, November 16, 2012 3:36 AM
> To: Dan Magenheimer
> Cc: xen-devel@lists.xen.org; Konrad Wilk; ZhigangWang; Keir Fraser; TimDeegan
> Subject: RE: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> >>> On 15.11.12 at 19:00, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Similarly, but perhaps of lower priority, there is no integration
> >> with the low-mem handling.
> >
> > I'd also consider this lower priority as Olaf and Andre
> > have argued that the claim mechanism is not needed for
> > sharing/paging so the two mechanisms may not
> > be used together, at least for the foreseeable future.
> > So I plan to skip this, unless you change your mind and
> > consider it a showstopper for acceptance.
> 
> Skipping for the initial implementation is likely fine, but that
> shouldn't mean deferring the integration indefinitely. Also,
> I see no close connection between the low-mem feature
> and sharing/paging (apart from Andres working on both).

Fair enough.

After reviewing the thread where low_mem was submitted, I have to admit
that I am a bit baffled as to when the low_mem handling would ever be
necessary.   I suspect it is because the author and I are approaching
memory management from a completely different paradigm (per discussion
in an earlier thread where "claim" was first proposed), so that
is probably better left for the deferred discussion of the
integration.

So since you (Jan) do not consider this (lack of integration with
low_mem) a showstopper for claim, I will set myself a reminder
to initiate a new thread about this later.

Thanks,
Dan


* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-16 10:48     ` Ian Campbell
@ 2012-11-19 20:53       ` Dan Magenheimer
  2012-11-20  8:16         ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-19 20:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), Dave Mccracken, Konrad Wilk, Tim (Xen.org),
	xen-devel, Jan Beulich, Zhigang Wang

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall

Hi Ian --

> On Thu, 2012-11-15 at 19:15 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > > Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> > >
> > > On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> > >
> > > How does this interact with the feature which lets you create PV guests
> > > using only superpages? I believe is something Oracle added and still
> > > maintains (Dave added to the CC).
> > >
> > > Also doesn't this fail to make any sort of guarantee if you are building
> > > a 32 bit PV guest, since they require memory under a certain host
> > > address limit (160GB IIRC)?
> > >
> > > Basically neither of those cases benefit from this hypercall at all? I
> > > don't know what the usecase for the superpage PV guests is (but I
> > > suppose it is important to Oracle, at least). The 32 bit PV guest use
> > > case is still a pretty significant one which I think ought to be
> > > handled, otherwise any system built on top of this functionality will
> > > only work reliably in a subset of cases.
> >
> > The claim mechanism will not benefit PV superpages.  IIUC, the
> > design of the PV superpages will cause a domain launch to fail
> > if it requests 10000 superpages but Xen can only successfully
> > allocate 9999.  That's already very fragile.  Since the only
> > way currently to find out if there are 10000 superpages available
> > is to start allocating them, claim can't really help.
> 
> Well, you could always account the number of free superpages in the
> system, which would allow you to cover this case too.

Because of the nature of the buddy allocator (i.e. 4MB chunks are
kept separately from 2MB chunks even though a 4MB page contains
two 2MB pages), I don't think this is trivial.  Also, once
memory is fragmented, creation of a PV-superpages domain will fail
(relatively quickly) whether claiming first or not.  The situation
could be improved by adding defrag code to Xen (or "compaction"
as Linux calls it), which I'd be interested in pursuing, but that
is also non-trivial.  Last, once PVH is an available option,
community (and Oracle's) interest in PV+superpages may fade away
quickly.

So I agree that this is worth adding superpages to a to-do list
for future claim work, but I see it as a "would be nice to add later"
feature, not a showstopper.

> > For 32 bit PV guests, note that a claim does NOT have to be
> > staked prior to any allocation.  So if a toolstack needs
> > to enforce that some portion of allocated memory is under
> > a host address limit, it can (attempt to) allocate those pages,
> > then stake a claim for the rest.
> 
> For 32 bit PV guests this is *all* of the pages needed to build any 32
> bit PV guest.

I am ignorant of the details here, but this is _all_ of the
pages only on machines with >160GiB of physical memory, correct?
(If incorrect, just ignore the 160GB parts below :-)

> > Or just not use the claim mechanism at all.
> 
> So your use case has no requirement to be able to start 32 bit domains?
> Or does whatever requirement you have that leads to the claim mechanism
> somehow not apply to those sorts of guests? If not, then why not?

Claim doesn't fail to start 32-bit PV domains, it just doesn't
help them avoid "failing slowly" [on machines with >160GiB of RAM].

32-bit PV domains are highly likely to be legacy domains with
much smaller RAM requirements [always <= 64GiB?] so "failing slowly"
is much less of a concern.

32-bit PV domains are on their way to obsolescence so I am (and
I believe Oracle would be) quite happy documenting that launching
them might result in conditions not seen when creating other domain
types... especially when other failure conditions already exist for
32-bit PV domains.

> As it stands it seems that any toolstack which wants to use claim would
> still have to cope with the fact that a potentially significant
> proportion of guests may still fail to build even after a claim has been
> successfully staked.

The problem being addressed here is "slow failure" when creating
a domain and this can currently occur with 100% of domain creations.
The proposed claim hypercall solves this problem for:

- All HVM domains and all future PVH domains
- All 64-bit PV domains [when system RAM < 5TiB]
- [All 32-bit PV domains when system RAM < 160GiB]

and, as you've observed, currently doesn't solve the problem for

- All 32-bit PV domains [except when system RAM < 160 GiB]
- PV domains with superpages=1 that "almost succeed" but fail

I see that as a pretty good start and, with the flags argument,
the proposal even has room for future extensions should they
be needed.

> Even if these shortcomings are acceptable in your specific scenario I
> don't see why we should be satisfied with a solution which is not more
> generally applicable.

I would welcome a solution which is more generally applicable
if you've got one.  Otherwise, I am very satisfied with this one.
If you'd like confirmation from Oracle management that they
are satisfied as well, I can encourage them to reply.

Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-19 20:53       ` Dan Magenheimer
@ 2012-11-20  8:16         ` Jan Beulich
  2012-11-20 10:16           ` Ian Campbell
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2012-11-20  8:16 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Keir (Xen.org), Ian Campbell, Konrad Wilk, Tim (Xen.org),
	Dave Mccracken, Zhigang Wang

>>> On 19.11.12 at 21:53, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
>> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> existing) hypercall
> 
> Hi Ian --
> 
>> On Thu, 2012-11-15 at 19:15 +0000, Dan Magenheimer wrote:
>> > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
>> > > Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> existing) hypercall
>> > >
>> > > On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
>> > >
>> > > How does this interact with the feature which lets you create PV guests
>> > > using only superpages? I believe is something Oracle added and still
>> > > maintains (Dave added to the CC).
>> > >
>> > > Also doesn't this fail to make any sort of guarantee if you are building
>> > > a 32 bit PV guest, since they require memory under a certain host
>> > > address limit (160GB IIRC)?
>> > >
>> > > Basically neither of those cases benefit from this hypercall at all? I
>> > > don't know what the usecase for the superpage PV guests is (but I
>> > > suppose it is important to Oracle, at least). The 32 bit PV guest use
>> > > case is still a pretty significant one which I think ought to be
>> > > handled, otherwise any system built on top of this functionality will
>> > > only work reliably in a subset of cases.
>> >
>> > The claim mechanism will not benefit PV superpages.  IIUC, the
>> > design of the PV superpages will cause a domain launch to fail
>> > if it requests 10000 superpages but Xen can only successfully
>> > allocate 9999.  That's already very fragile.  Since the only
>> > way currently to find out if there are 10000 superpages available
>> > is to start allocating them, claim can't really help.
>> 
>> Well, you could always account the number of free superpages in the
>> system, which would allow you to cover this case too.
> 
> Because of the nature of the buddy allocator (i.e. 4MB chunks are
> kept separately from 2MB chunks even though a 4MB page contains
> two 2MB pages), I don't think this is trivial.

This ought to be as simple as adding respective accounting
when
- splitting a chunk in alloc_heap_pages(), crossing the super page
  size boundary, and
- merging chunks in free_heap_pages(), crossing the super page
  size boundary.
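
A minimal sketch of that boundary-crossing accounting (the counter and
helper names below are hypothetical, not actual Xen code, and chunks
allocated or freed whole at or above the superpage order would need
equivalent adjustments):

    #define SUPERPAGE_ORDER 9                 /* 2MB superpages on x86 */

    static unsigned long free_superpages;     /* hypothetical counter */

    /* Call from alloc_heap_pages() each time a free chunk of 'order'
     * is split into two buddies of order-1. */
    static void account_split(unsigned int order)
    {
        if ( order == SUPERPAGE_ORDER )
            free_superpages--;   /* split crossed the superpage boundary */
    }

    /* Call from free_heap_pages() each time two buddies of 'order'
     * merge into one chunk of order+1. */
    static void account_merge(unsigned int order)
    {
        if ( order == SUPERPAGE_ORDER - 1 )
            free_superpages++;   /* merge crossed the superpage boundary */
    }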

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-20  8:16         ` Jan Beulich
@ 2012-11-20 10:16           ` Ian Campbell
  2012-11-20 16:33             ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Campbell @ 2012-11-20 10:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Dan Magenheimer, xen-devel, Keir (Xen.org),
	Konrad Wilk, Tim (Xen.org),
	Dave Mccracken, Zhigang Wang

On Tue, 2012-11-20 at 08:16 +0000, Jan Beulich wrote:
> >>> On 19.11.12 at 21:53, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> >> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> > existing) hypercall
> > 
> > Hi Ian --
> > 
> >> On Thu, 2012-11-15 at 19:15 +0000, Dan Magenheimer wrote:
> >> > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> >> > > Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> > existing) hypercall
> >> > >
> >> > > On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> >> > >
> >> > > How does this interact with the feature which lets you create PV guests
> >> > > using only superpages? I believe is something Oracle added and still
> >> > > maintains (Dave added to the CC).
> >> > >
> >> > > Also doesn't this fail to make any sort of guarantee if you are building
> >> > > a 32 bit PV guest, since they require memory under a certain host
> >> > > address limit (160GB IIRC)?
> >> > >
> >> > > Basically neither of those cases benefit from this hypercall at all? I
> >> > > don't know what the usecase for the superpage PV guests is (but I
> >> > > suppose it is important to Oracle, at least). The 32 bit PV guest use
> >> > > case is still a pretty significant one which I think ought to be
> >> > > handled, otherwise any system built on top of this functionality will
> >> > > only work reliably in a subset of cases.
> >> >
> >> > The claim mechanism will not benefit PV superpages.  IIUC, the
> >> > design of the PV superpages will cause a domain launch to fail
> >> > if it requests 10000 superpages but Xen can only successfully
> >> > allocate 9999.  That's already very fragile.  Since the only
> >> > way currently to find out if there are 10000 superpages available
> >> > is to start allocating them, claim can't really help.
> >> 
> >> Well, you could always account the number of free superpages in the
> >> system, which would allow you to cover this case too.
> > 
> > Because of the nature of the buddy allocator (i.e. 4MB chunks are
> > kept separately from 2MB chunks even though a 4MB page contains
> > two 2MB pages), I don't think this is trivial.
> 
> This ought to be as simple as adding respective accounting
> when
> - splitting a chunk in alloc_heap_pages(), crossing the super page
>   size boundary, and
> - merging chunks in free_heap_pages(), crossing the super page
>   size boundary.

That was my first thought too.

Ian.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-20 10:16           ` Ian Campbell
@ 2012-11-20 16:33             ` Dan Magenheimer
  2012-11-20 16:47               ` Tim Deegan
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-20 16:33 UTC (permalink / raw)
  To: Ian Campbell, Jan Beulich
  Cc: xen-devel, Keir (Xen.org), Konrad Wilk, Tim (Xen.org),
	Dave Mccracken, Zhigang Wang

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> On Tue, 2012-11-20 at 08:16 +0000, Jan Beulich wrote:
> > >>> On 19.11.12 at 21:53, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > >>  From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > >> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of
> > > existing) hypercall
> > >
> > > Hi Ian --
> > >
> > >> On Thu, 2012-11-15 at 19:15 +0000, Dan Magenheimer wrote:
> > >> > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > >> > > Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of
> > > existing) hypercall
> > >> > >
> > >> > > On Wed, 2012-11-14 at 23:55 +0000, Dan Magenheimer wrote:
> > >> > >
> > >> > > How does this interact with the feature which lets you create PV guests
> > >> > > using only superpages? I believe is something Oracle added and still
> > >> > > maintains (Dave added to the CC).
> > >> >
> > >> > The claim mechanism will not benefit PV superpages.  IIUC, the
> > >> > design of the PV superpages will cause a domain launch to fail
> > >> > if it requests 10000 superpages but Xen can only successfully
> > >> > allocate 9999.  That's already very fragile.  Since the only
> > >> > way currently to find out if there are 10000 superpages available
> > >> > is to start allocating them, claim can't really help.
> > >>
> > >> Well, you could always account the number of free superpages in the
> > >> system, which would allow you to cover this case too.
> > >
> > > Because of the nature of the buddy allocator (i.e. 4MB chunks are
> > > kept separately from 2MB chunks even though a 4MB page contains
> > > two 2MB pages), I don't think this is trivial.
> >
> > This ought to be as simple as adding respective accounting
> > when
> > - splitting a chunk in alloc_heap_pages(), crossing the super page
> >   size boundary, and
> > - merging chunks in free_heap_pages(), crossing the super page
> >   size boundary.
> 
> That was my first thought too.

Hi Ian and Jan --

I agree it's possible, just saying it's not trivial... one has to
account not only for superpages system-wide (which isn't currently
done) but a mix of unclaimed superpages and unclaimed order==0 pages
per-domain.  Especially since that would improve launch of only a small
and shrinking class of domains (PV && superpages=1 && mem="huge"),
can we please consider it a possible future enhancement, not a showstopper?
The proposed claim hypercall doesn't _break_ PV-superpage domains,
it just doesn't pre-fail domain launch when there is sufficient physical
RAM but insufficient quantity of superpages.  And the toolstack doesn't
need to special-case... always doing a claim first is worthwhile because
it still pre-fails the case where there is insufficient physical RAM.
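
To illustrate that claim-first flow, a rough sketch (assumes libxc's
<xenctrl.h>; xc_domain_claim_pages() and build_domain_memory() are
hypothetical names here, not part of the patch under discussion, and it
assumes staking a claim of zero cancels an outstanding claim):

    /* Stake a claim before building; fail fast if the host lacks
     * nr_pages of free (or tmem-freeable) memory. */
    int build_domain_with_claim(xc_interface *xch, uint32_t domid,
                                unsigned long nr_pages)
    {
        if ( xc_domain_claim_pages(xch, domid, nr_pages) < 0 )
            return -1;                  /* fail fast, nothing to undo */

        /* Build as usual; each allocation draws down the claim. */
        if ( build_domain_memory(xch, domid, nr_pages) < 0 )
        {
            xc_domain_claim_pages(xch, domid, 0);  /* drop leftover claim */
            return -1;
        }
        return 0;
    }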

(I'd ask Konrad or Kurt to chime in here, but I know both are
away from the office due to the holidays... and I should be too.)

Thanks,
Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-20 16:33             ` Dan Magenheimer
@ 2012-11-20 16:47               ` Tim Deegan
  2012-11-20 17:52                 ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Tim Deegan @ 2012-11-20 16:47 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Zhigang Wang, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Dave Mccracken, Jan Beulich,
	xen-devel

At 08:33 -0800 on 20 Nov (1353400380), Dan Magenheimer wrote:
> I agree it's possible, just saying it's not trivial... one has to
> account not only for superpages system-wide (which isn't currently
> done) but a mix of unclaimed superpages and unclaimed order==0 pages
> per-domain.  Especially since that would improve launch of only a small
> and shrinking class of domains (PV && superpages=1 && mem="huge"),
> can we please consider it a possible future enhancement, not a showstopper?

Please, no.  Either you need this benighted hypercall, or you don't.  
If you really need it, do it properly.

Tim.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-20 16:47               ` Tim Deegan
@ 2012-11-20 17:52                 ` Dan Magenheimer
  2012-11-21  8:31                   ` Jan Beulich
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-20 17:52 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Zhigang Wang, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Dave Mccracken, Jan Beulich,
	xen-devel

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> At 08:33 -0800 on 20 Nov (1353400380), Dan Magenheimer wrote:
> > I agree it's possible, just saying it's not trivial... one has to
> > account not only for superpages system-wide (which isn't currently
> > done) but a mix of unclaimed superpages and unclaimed order==0 pages
> > per-domain.  Especially since that would improve launch of only a small
> > and shrinking class of domains (PV && superpages=1 && mem="huge"),
> > can we please consider it a possible future enhancement, not a showstopper?
> 
> Please, no.  Either you need this benighted hypercall, or you don't.
> If you really need it, do it properly.

Hi Tim --

I must respectfully disagree.

For years, Xen has been accepting features that work on a 64-bit
hypervisor but not on a 32-bit hypervisor.  And new features such
as memory-sharing/paging _could_ be designed to help PV domains and
have completely ignored PV but have still been accepted.  There is
clearly precedent for new features that don't enhance every
possible case.

The claim feature dramatically decreases a real customer problem in
the vast majority of our customer environments with no loss of
functionality in the small remaining percentage.  This real problem
is getting continually worse as system physical RAM and domain memory
requirements increase.  So, yes, _we_ do need it.

Considering maintenance, adding 100% claim support for remaining
near-obsolescent domains will (IMHO) unnecessarily clutter both
the hypervisor and the toolstack, with very little net value
to real customers.

So, in my opinion, it _is_ done properly already, just not
obsessively.

Dan

P.S. Did you mean "beknighted"?  Or are you referring to your
omniscient toolstack as enlightened and Oracle's toolstack as
ignorant?  As a fan of etymology and wordplay, I'm curious...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-20 17:52                 ` Dan Magenheimer
@ 2012-11-21  8:31                   ` Jan Beulich
  2012-11-21 15:59                     ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2012-11-21  8:31 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Tim Deegan, Dave Mccracken,
	Zhigang Wang

>>> On 20.11.12 at 18:52, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Tim Deegan [mailto:tim@xen.org]
>> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of 
> existing) hypercall
>> 
>> At 08:33 -0800 on 20 Nov (1353400380), Dan Magenheimer wrote:
>> > I agree it's possible, just saying it's not trivial... one has to
>> > account not only for superpages system-wide (which isn't currently
>> > done) but a mix of unclaimed superpages and unclaimed order==0 pages
>> > per-domain.  Especially since that would improve launch of only a small
>> > and shrinking class of domains (PV && superpages=1 && mem="huge"),
>> > can we please consider it a possible future enhancement, not a showstopper?
>> 
>> Please, no.  Either you need this benighted hypercall, or you don't.
>> If you really need it, do it properly.
> 
> Hi Tim --
> 
> I must respectfully disagree.
> 
> For years, Xen has been accepting features that work on a 64-bit
> hypervisor but not on a 32-bit hypervisor.  And new features such
> as memory-sharing/paging _could_ be designed to help PV domains and
> have completely ignored PV but have still been accepted.  There is
> clearly precedent for new features that don't enhance every
> possible case.
> 
> The claim feature dramatically decreases a real customer problem in
> the vast majority of our customer environments with no loss of
> functionality in the small remaining percentage.  This real problem
> is getting continually worse as system physical RAM and domain memory
> requirements increase.  So, yes, _we_ do need it.

A meaningful difference is that those other features have tools
side users, while you add (from the perspective of the Xen tree)
dead code. That is, it wouldn't have any chance of getting checked
for correctness when committed (other than for not breaking
existing code), and it will bit rot pretty quickly. Or did you mean
to supply tools side integration before expecting this to be
considered for applying?

In any case, while the hypervisor side changes look acceptable,
I'm afraid that without (mostly) convincing (almost) all of the
maintainers, there's no prospect of this getting committed.

Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-21  8:31                   ` Jan Beulich
@ 2012-11-21 15:59                     ` Dan Magenheimer
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-21 15:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Keir (Xen.org),
	Ian Campbell, Konrad Wilk, Tim Deegan, Dave Mccracken,
	Zhigang Wang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, November 21, 2012 1:32 AM
> To: Dan Magenheimer
> Cc: Ian Campbell; xen-devel@lists.xen.org; Dave Mccracken; Konrad Wilk; ZhigangWang; Keir (Xen.org);
> Tim Deegan
> Subject: RE: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> >>> On 20.11.12 at 18:52, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Tim Deegan [mailto:tim@xen.org]
> >> Subject: Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of
> > existing) hypercall
> >>
> >> At 08:33 -0800 on 20 Nov (1353400380), Dan Magenheimer wrote:
> >> > per-domain.  Especially since that would improve launch of only a small
> >> > and shrinking class of domains (PV && superpages=1 && mem="huge"),
> >> > can we please consider it a possible future enhancement, not a showstopper?
> >>
> >> Please, no.  Either you need this benighted hypercall, or you don't.
> >> If you really need it, do it properly.
> >
> > Hi Tim --
> >
> > I must respectfully disagree.
> >
> > For years, Xen has been accepting features that work on a 64-bit
> > hypervisor but not on a 32-bit hypervisor.  And new features such
> > as memory-sharing/paging _could_ be designed to help PV domains and
> > have completely ignored PV but have still been accepted.  There is
> > clearly precedent for new features that don't enhance every
> > possible case.
> >
> > The claim feature dramatically decreases a real customer problem in
> > the vast majority of our customer environments with no loss of
> > functionality in the small remaining percentage.  This real problem
> > is getting continually worse as system physical RAM and domain memory
> > requirements increase.  So, yes, _we_ do need it.
> 
> A meaningful difference is that those other features have tools
> side users, while you add (from the perspective of the Xen tree)
> dead code. That is, it wouldn't have any chance of getting checked
> for correctness when committed (other than for not breaking
> existing code), and it will bit rot pretty quickly. Or did you mean
> to supply tools side integration before expecting this to be
> considered for applying?

Oracle uses xm and a proprietary toolstack.  I don't normally
work on the toolstack (and other Oracle folk that do are tied up
with a release right now)... I can probably come up with an
xm/xend patch but I expect it won't be well-received.

> In any case, while the hypervisor side changes look acceptable,
> I'm afraid that without (mostly) convincing (almost) all of the
> maintainers, there's no prospect of this getting committed.

Thanks very much, Jan, for your repeated reviews and I'm
very pleased that you think the hypervisor patch is acceptable.

I was under the impression that the hypervisor was supposed
to be toolstack-independent.  I think I've thoroughly explained
Oracle's real customer need here... if advocates of other
toolstacks don't understand it, it's not from lack of my
trying.  (More words have been spilled on this topic now than
possibly any other 200 lines in Xen!)

If other maintainers wish to impose their toolstack requirements
on other vendors' toolstacks and/or block hypervisor enhancements
that don't fit with their idea of a toolstack's needs,  I
suppose that is a question that will need to be raised to the
Xen Advisory Board, and that is above my pay grade.

Happy (U.S.) Thanksgiving!
Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-26 18:01     ` Dan Magenheimer
@ 2012-11-26 18:05       ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 26+ messages in thread
From: Andres Lagar-Cavilla @ 2012-11-26 18:05 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Keir (Xen.org),
	Konrad Wilk, Tim Deegan, xen-devel, Jan Beulich, Zhigang Wang

On Nov 26, 2012, at 1:01 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

>> From: Dan Magenheimer
>> Subject: RE: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
>> 
>>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
>>> Subject: Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
>>>> Fair enough.
>>>> 
>>>> After reviewing the thread where low_mem was submitted, I have to admit
>>>> that I am a bit baffled as to when the low_mem handling would ever be
>>>> necessary.   I suspect it is because the author and I are approaching
>>> 
>>> Little to be baffled at, as per above explanation. And probably a good idea to cc the author if so.
>>> 
>>> Andres
>>> 
>>>> memory management from a completely different paradigm (per discussion
>>>> in an earlier thread where "claim" was first proposed), so that
>>>> is probably better left for the deferred discussion of the
>>>> integration.
>>>> 
>>>> So since you (Jan) do not consider this (lack of integration with
>>>> low_mem) a showstopper for claim, I will set myself a reminder
>>>> to initiate a new thread about this later.
>> 
>>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 
>> Hi Andres (and sorry for the typo in your name earlier in the thread) --
>> 
>>> ...And probably a good idea to cc the author if so.
>> 
>> No offense intended, I certainly intended for you to be not just on
>> the "Cc" list but on the "To" list of the new thread, but you are too
>> quick for me and, due to time constraints, I may not get to that
>> new thread until next week (it's a holiday week in the US).  But until
>> then... a quick clarification:
>> 
>>>> After reviewing the thread where low_mem was submitted, I have to admit
>>>> that I am a bit baffled as to when the low_mem handling would ever be
>>>> necessary.   I suspect it is because the author and I are approaching
>> 
>> I meant "ever be necessary in the dynamic memory (e.g. tmem) paradigm",
>> not the squeezed (or MS -memory-balancing-engine) paradigm, where I can
>> at least fathom it.
> 
> Hmmm... it appears that, while it might be fun and illuminating,
> a new thread is probably not worth our time, as I think the fix
> to allow co-existence of XENMEM_claim_pages and the low_mem_virq
> code is one additional line.

Agreed, just factor in the global unclaimed count into the arithmetic the virq code performs to decide whether to fire.
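
Roughly, the check becomes something like the following (a simplified
sketch only: the real low-mem handler also applies hysteresis around the
threshold, and the exact variable names may differ):

    /* Fire VIRQ_ENOMEM when free memory, minus pages already promised
     * to staked claims, drops to the configured threshold. */
    static void check_low_mem_virq(void)
    {
        unsigned long free  = total_avail_pages + tmem_freeable_pages();
        /* Pages promised to staked claims are no longer really available. */
        unsigned long avail = (free > total_unclaimed_pages) ?
                              free - total_unclaimed_pages : 0;

        if ( unlikely(avail <= low_mem_virq_th) )
            send_global_virq(VIRQ_ENOMEM);
    }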

Andres
> 
> I'll include it in v7.
> 
> Dan
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-19 21:41   ` Dan Magenheimer
@ 2012-11-26 18:01     ` Dan Magenheimer
  2012-11-26 18:05       ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-26 18:01 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, xen-devel, Jan Beulich
  Cc: Keir (Xen.org), Zhigang Wang, Tim Deegan, Konrad Wilk

> From: Dan Magenheimer
> Subject: RE: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> > From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> > Subject: Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> > > Fair enough.
> > >
> > > After reviewing the thread where low_mem was submitted, I have to admit
> > > that I am a bit baffled as to when the low_mem handling would ever be
> > > necessary.   I suspect it is because the author and I are approaching
> >
> > Little to be baffled at, as per above explanation. And probably a good idea to cc the author if so.
> >
> > Andres
> >
> > > memory management from a completely different paradigm (per discussion
> > > in an earlier thread where "claim" was first proposed), so that
> > > is probably better left for the deferred discussion of the
> > > integration.
> > >
> > > So since you (Jan) do not consider this (lack of integration with
> > > low_mem) a showstopper for claim, I will set myself a reminder
> > > to initiate a new thread about this later.
> 
>      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Hi Andres (and sorry for the typo in your name earlier in the thread) --
> 
> > ...And probably a good idea to cc the author if so.
> 
> No offense intended, I certainly intended for you to be not just on
> the "Cc" list but on the "To" list of the new thread, but you are too
> quick for me and, due to time constraints, I may not get to that
> new thread until next week (it's a holiday week in the US).  But until
> then... a quick clarification:
> 
> > > After reviewing the thread where low_mem was submitted, I have to admit
> > > that I am a bit baffled as to when the low_mem handling would ever be
> > > necessary.   I suspect it is because the author and I are approaching
> 
> I meant "ever be necessary in the dynamic memory (e.g. tmem) paradigm",
> not the squeezed (or MS -memory-balancing-engine) paradigm, where I can
> at least fathom it.

Hmmm... it appears that, while it might be fun and illuminating,
a new thread is probably not worth our time, as I think the fix
to allow co-existence of XENMEM_claim_pages and the low_mem_virq
code is one additional line.

I'll include it in v7.

Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-19 20:25 ` Andres Lagar-Cavilla
@ 2012-11-19 21:41   ` Dan Magenheimer
  2012-11-26 18:01     ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-19 21:41 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, xen-devel
  Cc: Keir (Xen.org), Zhigang Wang, Tim Deegan, Jan Beulich, Konrad Wilk

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> > Fair enough.
> >
> > After reviewing the thread where low_mem was submitted, I have to admit
> > that I am a bit baffled as to when the low_mem handling would ever be
> > necessary.   I suspect it is because the author and I are approaching
> 
> Little to be baffled at, as per above explanation. And probably a good idea to cc the author if so.
> 
> Andres
>
> > memory management from a completely different paradigm (per discussion
> > in an earlier thread where "claim" was first proposed), so that
> > is probably better left for the deferred discussion of the
> > integration.
> >
> > So since you (Jan) do not consider this (lack of integration with
> > low_mem) a showstopper for claim, I will set myself a reminder
> > to initiate a new thread about this later.

     ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hi Andres (and sorry for the typo in your name earlier in the thread) --

> ...And probably a good idea to cc the author if so.

No offense intended, I certainly intended for you to be not just on
the "Cc" list but on the "To" list of the new thread, but you are too
quick for me and, due to time constraints, I may not get to that
new thread until next week (it's a holiday week in the US).  But until
then... a quick clarification:

> > After reviewing the thread where low_mem was submitted, I have to admit
> > that I am a bit baffled as to when the low_mem handling would ever be
> > necessary.   I suspect it is because the author and I are approaching

I meant "ever be necessary in the dynamic memory (e.g. tmem) paradigm",
not the squeezed (or MS memory-balancing-engine) paradigm, where I can
at least fathom it.

Feel free to start the new thread if you like, but I may not be able
to answer for a few days.

Happy Thanksgiving-week-in-the-US,
Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
       [not found] <mailman.16877.1353355647.1399.xen-devel@lists.xen.org>
@ 2012-11-19 20:25 ` Andres Lagar-Cavilla
  2012-11-19 21:41   ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Andres Lagar-Cavilla @ 2012-11-19 20:25 UTC (permalink / raw)
  To: xen-devel
  Cc: Dan Magenheimer, Keir (Xen.org),
	Konrad Wilk, Tim Deegan, Jan Beulich, Zhigang Wang

>   
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, November 16, 2012 3:36 AM
>> To: Dan Magenheimer
>> Cc: xen-devel@lists.xen.org; Konrad Wilk; ZhigangWang; Keir Fraser; TimDeegan
>> Subject: RE: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
>> 
>>>>> On 15.11.12 at 19:00, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>> Similarly, but perhaps of lower priority, there is no integration
>>>> with the low-mem handling.
>>> 
>>> I'd also consider this lower priority as Olaf and Andre
>>> have argued that the claim mechanism is not needed for
>>> sharing/paging so the two mechanisms may not
>>> be used together, at least for the foreseeable future.
>>> So I plan to skip this, unless you change your mind and
>>> consider it a showstopper for acceptance.
>> 
>> Skipping for the initial implementation is likely fine, but that
>> shouldn't mean deferring the integration indefinitely. Also,
>> I see no close connection between the low-mem feature
>> and sharing/paging (apart from Andres working on both).

It's simple. Right now one can't reliably set d->max_pages to the watermark at which a VM's allocation is temporarily allowed to grow -- by unsharing or paging in. This is because of (hopefully fixable!) short-comings in the mm layer and its interaction with wait queues, documented elsewhere.

So the best next approach (in place) is to get a synchronous kick as watermarks in available memory are reached. Otherwise a host memory manager is down to polling.

It could be the case that once d->max_pages can be set reliably, one would not need the low_mem virq any further. But it's just such a useful feature at infinitesimal cost that I would not want to see it gone. And it would still remain useful in any scenario in which the total sum of d->max_pages exceeds available memory (for whatever reason).

> 
> Fair enough.
> 
> After reviewing the thread where low_mem was submitted, I have to admit
> that I am a bit baffled as to when the low_mem handling would ever be
> necessary.   I suspect it is because the author and I are approaching

Little to be baffled at, as per above explanation. And probably a good idea to cc the author if so.

Andres

> memory management from a completely different paradigm (per discussion
> in an earlier thread where "claim" was first proposed), so that
> is probably better left for the deferred discussion of the
> integration.
> 
> So since you (Jan) do not consider this (lack of integration with
> low_mem) a showstopper for claim, I will set myself a reminder
> to initiate a new thread about this later.
> 
> Thanks,
> Dan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
  2012-11-15 18:51 ` Andres Lagar-Cavilla
@ 2012-11-15 19:35   ` Dan Magenheimer
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-15 19:35 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, xen-devel
  Cc: Keir (Xen.org), Zhigang Wang, Tim Deegan, Jan Beulich, Konrad Wilk

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
> 
> > Ideally, allocation in the presence of existing claims should
> > behave as if the claiming domains had actually already allocated
> > the unclaimed-amount-of-memory.  So I'd argue that enforcing
> > the claim should be sacrosanct here.
> 
> Well, are we sure that failing an "anonymous" allocation is not going to trigger a BUG_ON? That's a
> lot of code review. If you get this wrong, now Xen suddenly crashes if allocating domains close to the
> max. It doesn't, today, afaict.

I'm not sure I understand your concern.  What I said is
that the behavior is the same whether the memory is
allocated for a domain (the pre-claim way) or claimed
and then allocated later.  Claim is neither improving
a low-memory situation nor making it worse.  If Xen
crashes when a claim has been staked (but not yet
satisfied), it would crash the same if the memory
had already been fully allocated.

> >> Similarly, but perhaps of lower priority, there is no integration
> >> with the low-mem handling.
> >
> > I'd also consider this lower priority as Olaf and Andre
> > have argued that the claim mechanism is not needed for
> > sharing/paging so the two mechanisms may not
> > be used together, at least for the foreseeable future.
> > So I plan to skip this, unless you change your mind and
> > consider it a showstopper for acceptance.
> 
> This is a slippery slope. Let's not work out the interactions with existing subsystems before adding
> code to the tree. What could go wrong?

Indeed.

> As a data point for everyone, I've found the low mem virq extremely useful as a sync interrupt
> signaling to our toolstack that it needs to get its act together and start rebalancing memory, if it
> hasn't yet. I don't see how that cannot be useful to any other toolstack.

Um... very different toolstack paradigm.

Besides (can't resist), in your toolstack, the toolstack is
omniscient about all hypervisor memory allocations, so why
would it _need_ to be notified? ;-) ;-) ;-) ;-) /me ducks

Dan "who holds out single daisy to Andre, previously withdrawn from George
in http://lists.xen.org/archives/html/xen-devel/2012-11/msg00150.html"

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
       [not found] <mailman.16651.1353004500.1399.xen-devel@lists.xen.org>
@ 2012-11-15 18:51 ` Andres Lagar-Cavilla
  2012-11-15 19:35   ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: Andres Lagar-Cavilla @ 2012-11-15 18:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Dan Magenheimer, Keir (Xen.org),
	Konrad Wilk, Tim Deegan, Jan Beulich, Zhigang Wang

> Hi Jan --
> 
> Thanks for the quick review!
> 
> Mostly fixed for v3 but a couple things:
> 
>>> +     * Claimed memory is considered unavailable unless the request
>>> +     * is made by a domain with sufficient unclaimed pages.
>>> +     */
>>> +    if ( (total_unclaimed_pages + request >
>>> +           total_avail_pages + tmem_freeable_pages()) &&
>>> +          (d == NULL || d->unclaimed_pages < request) )
>>> +        goto not_found;
>> 
>> The treatment of d being NULL certainly needs further thought:
>> Is it really better to fulfill the claim and fail some (perhaps
>> important) _xmalloc()?
> 
> Ideally, allocation in the presence of existing claims should
> behave as if the claiming domains had actually already allocated
> the unclaimed-amount-of-memory.  So I'd argue that enforcing
> the claim should be sacrosanct here.

Well, are we sure that failing an "anonymous" allocation is not going to trigger a BUG_ON? That's a lot of code review. If you get this wrong, now Xen suddenly crashes when allocating domains close to the max. It doesn't, today, afaict.

> 
>> Also, I'm missing a mechanism by which the tools could find out
>> how much unclaimed memory is available, in order to determine
>> (if in use) how much memory needs to be ballooned out of Dom0.
> 
> OK.  I'm not certain if this will be useful on a per-domain
> basis as well but, for completeness, I will also add
> unclaimed_pages into xc_dominfo etc (which causes a bump
> in XEN_DOMCTL_INTERFACE_VERSION).
> 
>> Similarly, but perhaps of lower priority, there is no integration
>> with the low-mem handling.
> 
> I'd also consider this lower priority as Olaf and Andre
> have argued that the claim mechanism is not needed for
> sharing/paging so the two mechanisms may not
> be used together, at least for the foreseeable future.
> So I plan to skip this, unless you change your mind and
> consider it a showstopper for acceptance.

This is a slippery slope. Let's not work out the interactions with existing subsystems before adding code to the tree. What could go wrong?

As a data point for everyone, I've found the low mem virq extremely useful as a sync interrupt signaling to our toolstack that it needs to get its act together and start rebalancing memory, if it hasn't yet. I don't see how that cannot be useful to any other toolstack.

Andres

> 
>> Finally, there still are a number of formatting issues.
> 
> Hmmm... I found one I think.  Is there an equivalent to
> checkpatch for hypervisor code?  If you see any formatting
> issues in v3, please call them out explicitly as I am
> sincerely trying to avoid them.
> 
> Thanks,
> Dan
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-11-26 18:05 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-14 23:55 [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall Dan Magenheimer
2012-11-15 10:25 ` Jan Beulich
2012-11-15 18:00   ` Dan Magenheimer
2012-11-16 10:36     ` Jan Beulich
2012-11-19 20:04       ` Dan Magenheimer
2012-11-15 12:25 ` Ian Campbell
2012-11-15 12:39   ` Jan Beulich
2012-11-15 19:19     ` Dan Magenheimer
2012-11-16 10:41       ` Jan Beulich
2012-11-15 19:15   ` Dan Magenheimer
2012-11-16 10:39     ` Jan Beulich
2012-11-16 10:48     ` Ian Campbell
2012-11-19 20:53       ` Dan Magenheimer
2012-11-20  8:16         ` Jan Beulich
2012-11-20 10:16           ` Ian Campbell
2012-11-20 16:33             ` Dan Magenheimer
2012-11-20 16:47               ` Tim Deegan
2012-11-20 17:52                 ` Dan Magenheimer
2012-11-21  8:31                   ` Jan Beulich
2012-11-21 15:59                     ` Dan Magenheimer
     [not found] <mailman.16651.1353004500.1399.xen-devel@lists.xen.org>
2012-11-15 18:51 ` Andres Lagar-Cavilla
2012-11-15 19:35   ` Dan Magenheimer
     [not found] <mailman.16877.1353355647.1399.xen-devel@lists.xen.org>
2012-11-19 20:25 ` Andres Lagar-Cavilla
2012-11-19 21:41   ` Dan Magenheimer
2012-11-26 18:01     ` Dan Magenheimer
2012-11-26 18:05       ` Andres Lagar-Cavilla

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.