[RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall

* [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall
@ 2012-11-14 23:55 Dan Magenheimer
  2012-11-15 10:25 ` Jan Beulich
  2012-11-15 12:25 ` Ian Campbell
  0 siblings, 2 replies; 26+ messages in thread
From: Dan Magenheimer @ 2012-11-14 23:55 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, TimDeegan; +Cc: Zhigang Wang, Konrad Wilk, xen-devel

[-- Attachment #1: Type: text/plain, Size: 17405 bytes --]

This is a second cut of the hypervisor patch of the proposed
XENMEM_claim_pages hypercall/subop, taking into account
feedback from Jan and Keir, plus some fixes found via runtime
debugging (using privcmd only) and some comments/cleanup.

v1->v2:
- Add reset-to-zero page claim in domain_kill [JBeulich]
- Proper handling of struct passed to hypercall [JBeulich]
- Fix alloc_heap_pages when a domain has a claim [JBeulich]
- Need not hold heap_lock if !d->unclaimed_pages [keir]
- Fix missed tot_pages call in donate_page [djm]
- Remove domain_reset_unclaimed_pages; use set with zero [djm]
- Bugfixes found through testing in set_unclaimed [djm]
- More comments in code [djm]
- Code formatting fixes [djm]

===

Motivation:

The goal of this hypercall
is to attempt to atomically and very quickly determine if
there are sufficient pages available in the system and, if
so, "set aside" that quantity of pages for future allocations
by that domain.  Unlike an existing hypercall such as
increase_reservation or populate_physmap, specific physical
pageframes are not assigned to the domain because this
cannot be done sufficiently quickly (especially for very large
allocations in an arbitrarily fragmented system) and so the
existing mechanisms result in classic time-of-check-time-of-use
(TOCTOU) races.  One can think of claiming as similar to a
"lazy" allocation, but subsequent hypercalls are required
to do the actual physical pageframe allocation.

I don't have a patch for the toolstack side, but I envision
a "xl create --claim" option to maximize backwards
compatibility while minimizing impact on existing toolstacks.
As a result, testing has (so far) only been done via privcmd.

Code overview:

Though the hypercall simply does arithmetic within locks,
some of the semantics in the code may be a bit subtle
so let me explain a bit.

The key variables (d->unclaimed_pages and total_unclaimed_pages)
start at zero if no claim has yet been staked for any domain.
(Perhaps a better name is "claimed_but_not_yet_possessed" but that's
a bit unwieldy.)  If no claim hypercalls are executed, there
should be no impact on existing usage.

When a claim is successfully staked by a domain, it is like a
watermark but there is no record kept of the size of the claim.
Instead, d->unclaimed_pages is set to the difference between
d->tot_pages and the claim (d->tot_pages may normally be zero
but see Note 1).  When d->tot_pages increases or decreases,
d->unclaimed_pages atomically decreases or increases.  Once
d->unclaimed_pages reaches zero, the claim is satisfied and
d->unclaimed pages stays at zero -- unless a new claim is
subsequently staked.  See Note 2.

The systemwide variable total_unclaimed_pages is always the sum
of d->unclaimed_pages, across all domains.  A non-domain-
specific heap allocation will fail if total_unclaimed_pages
exceeds free (plus, on tmem enabled systems, freeable) pages.

Claim semantics may be modified by flags.  The initial
implementation has one flag, which discerns whether the
caller would like tmem freeable pages to be considered
in determining whether or not the claim can be successfully
staked. Future flags may, for example, specify that the
caller would like the claim to be successful only if there
are sufficient pages available on a single node (per Dario's
suggestion).

Note 1: Tim: I'm thinking this may resolve your concern that
the claim mechanism must be more complicated to handle
restricted memory allocations and order>0 allocations.
The proposed claim mechanism only guarantees a quantity of
order==0 pages; if restricted allocations are required, these
are done first by the toolstack, and followed by the claim.
Order>0 allocations still work if memory is not fragmented,
but the claim mechanism doesn't guarantee anything but
a quantity of order==0 pages.

Note 2: Tim: This arithmetic also indirectly implements the
"claim auto-expire" discussed earlier.  We certainly don't
want a negative claim.  It's possible that the toolstack need
never call "unclaim" (and so such a hypercall/subop need
not even exist) as long as domain_reset_unclaimed_pages()
is called when a domain dies.  This draft doesn't even
provide unclaim... though if a reason for its existence
is determined, it should be easy to add.

Note 3: There is currently no way to observe a staked claim,
so a staked claim will not survive a save/restore/migrate.
Not clear yet if this is needed but could be added.

Thanks for any feedback!

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com

 arch/x86/mm.c             |    2 
 arch/x86/mm/mem_sharing.c |    4 -
 common/domain.c           |    1 
 common/grant_table.c      |    2 
 common/memory.c           |   31 ++++++++++++-
 common/page_alloc.c       |  107 +++++++++++++++++++++++++++++++++++++++++++++-
 include/public/memory.h   |   31 +++++++++++++
 include/xen/mm.h          |    6 ++
 include/xen/sched.h       |    1 
 9 files changed, 178 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fad3d33..7e55908 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3841,7 +3841,7 @@ int donate_page(
     {
         if ( d->tot_pages >= d->max_pages )
             goto fail;
-        d->tot_pages++;
+        domain_increase_tot_pages(d, 1);
     }
 
     page->count_info = PGC_allocated | 1;
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 5103285..943a3b5 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -639,7 +639,7 @@ static int page_make_sharable(struct domain *d,
     }
 
     page_set_owner(page, dom_cow);
-    d->tot_pages--;
+    domain_decrease_tot_pages(d, 1);
     drop_dom_ref = (d->tot_pages == 0);
     page_list_del(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
@@ -680,7 +680,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( d->tot_pages++ == 0 )
+    if ( domain_increase_tot_pages(d, 1) == 0 )
         get_domain(d);
     page_list_add_tail(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 0e3e36a..95509e2 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -492,6 +492,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
+        domain_set_unclaimed_pages(d, 0, 0);
         d->tmem = NULL;
         /* fallthrough */
     case DOMDYING_dying:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 7912769..10ce78f 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1656,7 +1656,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(e->tot_pages++ == 0) )
+        if ( unlikely(domain_increase_tot_pages(e, 1) == 0) )
             get_knownalive_domain(e);
         page_list_add_tail(page, &e->page_list);
         page_set_owner(page, e);
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 83e2666..38f2cb9 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -454,7 +454,7 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
                              (j * (1UL << exch.out.extent_order)));
 
                 spin_lock(&d->page_alloc_lock);
-                d->tot_pages -= dec_count;
+                domain_decrease_tot_pages(d, dec_count);
                 drop_dom_ref = (dec_count && !d->tot_pages);
                 spin_unlock(&d->page_alloc_lock);
 
@@ -685,6 +685,35 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case XENMEM_claim_pages:
+    {
+        unsigned long request;
+
+        start_extent = cmd >> MEMOP_EXTENT_SHIFT;
+
+        if ( copy_from_guest(&reservation, arg, 1) )
+            return start_extent;
+
+        if ( !(guest_handle_is_null(reservation.extent_start)) )
+            return start_extent;
+
+        rc = rcu_lock_target_domain_by_id(reservation.domid, &d);
+        if ( rc )
+            return rc;
+
+        /*
+         * extent_order may be non-zero, but is only a multiplier and
+         * does not currently claim any order>0 slabs, though this is
+         * a possible future feature
+         */
+        request = reservation.nr_extents << reservation.extent_order;
+        rc = domain_set_unclaimed_pages(d, request, reservation.mem_flags);
+
+        rcu_unlock_domain(d);
+
+        break;
+    }
+
     default:
         rc = arch_memory_op(op, arg);
         break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 15ebc66..178d816 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -238,6 +238,100 @@ static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
 
 static DEFINE_SPINLOCK(heap_lock);
+static long total_unclaimed_pages; /* total outstanding claims by all domains */
+
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages)
+{
+    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
+
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    if ( !d->unclaimed_pages )
+        return d->tot_pages += pages;
+    spin_lock(&heap_lock);
+    dom_before = d->unclaimed_pages;
+    dom_after = dom_before - pages;
+    if ( (dom_before > 0) && (dom_after < 0) )
+        dom_claimed = 0;
+    else
+        dom_claimed = dom_after;
+    sys_before = total_unclaimed_pages;
+    sys_after = sys_before - (dom_before - dom_claimed);
+    BUG_ON( (sys_before > 0) && (sys_after < 0) );
+    total_unclaimed_pages = sys_after;
+    d->unclaimed_pages = dom_claimed;
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages)
+{
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+    d->tot_pages -= pages;
+    if ( d->unclaimed_pages )
+    {
+        d->unclaimed_pages += pages;
+        total_unclaimed_pages += pages;
+    }
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
+                                unsigned long flags)
+{
+    int ret = -ENOMEM;
+    unsigned long claim = 0, avail_pages = 0;
+
+    /*
+     * take the domain's page_alloc_lock, else all increases/decreases
+     * must always take the global heap_lock rather than only in the much
+     * rarer case that d->unclaimed_pages is non-zero
+     */
+    ASSERT(!spin_is_locked(&d->page_alloc_lock));
+    spin_lock(&d->page_alloc_lock);
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+
+    /* pages==0 means "unset" the claim (and flags is ignored) */
+    if (pages == 0)
+    {
+        total_unclaimed_pages -= d->unclaimed_pages;
+        d->unclaimed_pages = 0;
+        ret = 0;
+        goto out;
+    }
+
+    /* only one active claim per domain please */
+    if ( d->unclaimed_pages)
+        goto out;
+
+    /* how much memory is available? */
+    avail_pages = total_avail_pages;
+    if ( !(flags & XENMEM_CLAIMF_free_only) )
+        avail_pages += tmem_freeable_pages();
+    avail_pages -= total_unclaimed_pages;
+
+    /*
+     * note, if domain has already allocated memory before making a claim 
+     * then the claim must take tot_pages into account
+     */
+    claim = pages - d->tot_pages;
+    if ( claim > avail_pages )
+        goto out;
+
+    /* yay, claim fits in available memory, stake the claim, success! */
+    d->unclaimed_pages = claim;
+    total_unclaimed_pages += d->unclaimed_pages;
+    ret = 0;
+
+out:
+    spin_unlock(&heap_lock);
+    spin_unlock(&d->page_alloc_lock);
+    return ret;
+}
 
 static unsigned long init_node_heap(int node, unsigned long mfn,
                                     unsigned long nr, bool_t *use_tail)
@@ -443,6 +537,15 @@ static struct page_info *alloc_heap_pages(
     spin_lock(&heap_lock);
 
     /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (total_unclaimed_pages + request >
+           total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->unclaimed_pages < request) )
+        goto not_found;
+
+    /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
      * only mid-size allocations to avoid worst of fragmentation issues.
      * Others try tmem pools then fail.  This is a workaround until all
@@ -1291,7 +1394,7 @@ int assign_pages(
         if ( unlikely(d->tot_pages == 0) )
             get_knownalive_domain(d);
 
-        d->tot_pages += 1 << order;
+        domain_increase_tot_pages(d, 1 << order);
     }
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1375,7 +1478,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
         }
 
-        d->tot_pages -= 1 << order;
+        domain_decrease_tot_pages(d, 1 << order);
         drop_dom_ref = (d->tot_pages == 0);
 
         spin_unlock_recursive(&d->page_alloc_lock);
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f1ddbc0..c4de5c1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -68,6 +68,8 @@ struct xen_memory_reservation {
      *   IN:  GPFN bases of extents to populate with memory
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
      */
     XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
 
@@ -421,6 +423,35 @@ struct xen_mem_sharing_op {
 typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
 
+/*
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes.  Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation.  Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim.  When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain.  In this case, the claim is not incremental,
+ * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
+ * only 7 additional pages are claimed.
+ */
+#define XENMEM_claim_pages                  24
+/*
+ * XENMEM_claim_pages flags:
+ *  free_only: claim is successful only if sufficient free pages
+ *    are available.  If not set and tmem is enabled, hypervisor
+ *    may also consider tmem "freeable" pages to satisfy the claim.
+ */
+#define _XENMEM_CLAIMF_free_only            0
+#define XENMEM_CLAIMF_free_only             (1U<<_XENMEM_CLAIMF_free_only)
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 64a0cc1..5c63581 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -48,6 +48,12 @@ void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
+/* Claim handling */
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages);
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages);
+int domain_set_unclaimed_pages(
+    struct domain *d, unsigned long pages, unsigned long flags);
+
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6c55039..480ef39 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -242,6 +242,7 @@ struct domain
     struct page_list_head page_list;  /* linked list */
     struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
     unsigned int     tot_pages;       /* number of pages currently possesed */
+    unsigned int     unclaimed_pages; /* pages claimed but not possessed    */
     unsigned int     max_pages;       /* maximum value for tot_pages        */
     atomic_t         shr_pages;       /* number of shared pages             */
     atomic_t         paged_pages;     /* number of paged-out pages          */

[-- Attachment #2: claim-121114.patch --]
[-- Type: application/octet-stream, Size: 11994 bytes --]

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fad3d33..7e55908 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3841,7 +3841,7 @@ int donate_page(
     {
         if ( d->tot_pages >= d->max_pages )
             goto fail;
-        d->tot_pages++;
+        domain_increase_tot_pages(d, 1);
     }
 
     page->count_info = PGC_allocated | 1;
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 5103285..943a3b5 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -639,7 +639,7 @@ static int page_make_sharable(struct domain *d,
     }
 
     page_set_owner(page, dom_cow);
-    d->tot_pages--;
+    domain_decrease_tot_pages(d, 1);
     drop_dom_ref = (d->tot_pages == 0);
     page_list_del(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
@@ -680,7 +680,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( d->tot_pages++ == 0 )
+    if ( domain_increase_tot_pages(d, 1) == 0 )
         get_domain(d);
     page_list_add_tail(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 0e3e36a..95509e2 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -492,6 +492,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
+        domain_set_unclaimed_pages(d, 0, 0);
         d->tmem = NULL;
         /* fallthrough */
     case DOMDYING_dying:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 7912769..10ce78f 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1656,7 +1656,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(e->tot_pages++ == 0) )
+        if ( unlikely(domain_increase_tot_pages(e, 1) == 0) )
             get_knownalive_domain(e);
         page_list_add_tail(page, &e->page_list);
         page_set_owner(page, e);
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 83e2666..38f2cb9 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -454,7 +454,7 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
                              (j * (1UL << exch.out.extent_order)));
 
                 spin_lock(&d->page_alloc_lock);
-                d->tot_pages -= dec_count;
+                domain_decrease_tot_pages(d, dec_count);
                 drop_dom_ref = (dec_count && !d->tot_pages);
                 spin_unlock(&d->page_alloc_lock);
 
@@ -685,6 +685,35 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case XENMEM_claim_pages:
+    {
+        unsigned long request;
+
+        start_extent = cmd >> MEMOP_EXTENT_SHIFT;
+
+        if ( copy_from_guest(&reservation, arg, 1) )
+            return start_extent;
+
+        if ( !(guest_handle_is_null(reservation.extent_start)) )
+            return start_extent;
+
+        rc = rcu_lock_target_domain_by_id(reservation.domid, &d);
+        if ( rc )
+            return rc;
+
+        /*
+         * extent_order may be non-zero, but is only a multiplier and
+         * does not currently claim any order>0 slabs, though this is
+         * a possible future feature
+         */
+        request = reservation.nr_extents << reservation.extent_order;
+        rc = domain_set_unclaimed_pages(d, request, reservation.mem_flags);
+
+        rcu_unlock_domain(d);
+
+        break;
+    }
+
     default:
         rc = arch_memory_op(op, arg);
         break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 15ebc66..178d816 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -238,6 +238,100 @@ static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
 
 static DEFINE_SPINLOCK(heap_lock);
+static long total_unclaimed_pages; /* total outstanding claims by all domains */
+
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages)
+{
+    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
+
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    if ( !d->unclaimed_pages )
+        return d->tot_pages += pages;
+    spin_lock(&heap_lock);
+    dom_before = d->unclaimed_pages;
+    dom_after = dom_before - pages;
+    if ( (dom_before > 0) && (dom_after < 0) )
+        dom_claimed = 0;
+    else
+        dom_claimed = dom_after;
+    sys_before = total_unclaimed_pages;
+    sys_after = sys_before - (dom_before - dom_claimed);
+    BUG_ON( (sys_before > 0) && (sys_after < 0) );
+    total_unclaimed_pages = sys_after;
+    d->unclaimed_pages = dom_claimed;
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages)
+{
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+    d->tot_pages -= pages;
+    if ( d->unclaimed_pages )
+    {
+        d->unclaimed_pages += pages;
+        total_unclaimed_pages += pages;
+    }
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
+                                unsigned long flags)
+{
+    int ret = -ENOMEM;
+    unsigned long claim = 0, avail_pages = 0;
+
+    /*
+     * take the domain's page_alloc_lock, else all increases/decreases
+     * must always take the global heap_lock rather than only in the much
+     * rarer case that d->unclaimed_pages is non-zero
+     */
+    ASSERT(!spin_is_locked(&d->page_alloc_lock));
+    spin_lock(&d->page_alloc_lock);
+    ASSERT(!spin_is_locked(&heap_lock));
+    spin_lock(&heap_lock);
+
+    /* pages==0 means "unset" the claim (and flags is ignored) */
+    if (pages == 0)
+    {
+        total_unclaimed_pages -= d->unclaimed_pages;
+        d->unclaimed_pages = 0;
+        ret = 0;
+        goto out;
+    }
+
+    /* only one active claim per domain please */
+    if ( d->unclaimed_pages)
+        goto out;
+
+    /* how much memory is available? */
+    avail_pages = total_avail_pages;
+    if ( !(flags & XENMEM_CLAIMF_free_only) )
+        avail_pages += tmem_freeable_pages();
+    avail_pages -= total_unclaimed_pages;
+
+    /*
+     * note, if domain has already allocated memory before making a claim 
+     * then the claim must take tot_pages into account
+     */
+    claim = pages - d->tot_pages;
+    if ( claim > avail_pages )
+        goto out;
+
+    /* yay, claim fits in available memory, stake the claim, success! */
+    d->unclaimed_pages = claim;
+    total_unclaimed_pages += d->unclaimed_pages;
+    ret = 0;
+
+out:
+    spin_unlock(&heap_lock);
+    spin_unlock(&d->page_alloc_lock);
+    return ret;
+}
 
 static unsigned long init_node_heap(int node, unsigned long mfn,
                                     unsigned long nr, bool_t *use_tail)
@@ -443,6 +537,15 @@ static struct page_info *alloc_heap_pages(
     spin_lock(&heap_lock);
 
     /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (total_unclaimed_pages + request >
+           total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->unclaimed_pages < request) )
+        goto not_found;
+
+    /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
      * only mid-size allocations to avoid worst of fragmentation issues.
      * Others try tmem pools then fail.  This is a workaround until all
@@ -1291,7 +1394,7 @@ int assign_pages(
         if ( unlikely(d->tot_pages == 0) )
             get_knownalive_domain(d);
 
-        d->tot_pages += 1 << order;
+        domain_increase_tot_pages(d, 1 << order);
     }
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1375,7 +1478,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
         }
 
-        d->tot_pages -= 1 << order;
+        domain_decrease_tot_pages(d, 1 << order);
         drop_dom_ref = (d->tot_pages == 0);
 
         spin_unlock_recursive(&d->page_alloc_lock);
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f1ddbc0..c4de5c1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -68,6 +68,8 @@ struct xen_memory_reservation {
      *   IN:  GPFN bases of extents to populate with memory
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
      */
     XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
 
@@ -421,6 +423,35 @@ struct xen_mem_sharing_op {
 typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
 
+/*
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes.  Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation.  Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim.  When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain.  In this case, the claim is not incremental,
+ * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
+ * only 7 additional pages are claimed.
+ */
+#define XENMEM_claim_pages                  24
+/*
+ * XENMEM_claim_pages flags:
+ *  free_only: claim is successful only if sufficient free pages
+ *    are available.  If not set and tmem is enabled, hypervisor
+ *    may also consider tmem "freeable" pages to satisfy the claim.
+ */
+#define _XENMEM_CLAIMF_free_only            0
+#define XENMEM_CLAIMF_free_only             (1U<<_XENMEM_CLAIMF_free_only)
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 64a0cc1..5c63581 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -48,6 +48,12 @@ void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
+/* Claim handling */
+unsigned long domain_increase_tot_pages(struct domain *d, unsigned long pages);
+unsigned long domain_decrease_tot_pages(struct domain *d, unsigned long pages);
+int domain_set_unclaimed_pages(
+    struct domain *d, unsigned long pages, unsigned long flags);
+
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6c55039..480ef39 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -242,6 +242,7 @@ struct domain
     struct page_list_head page_list;  /* linked list */
     struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
     unsigned int     tot_pages;       /* number of pages currently possesed */
+    unsigned int     unclaimed_pages; /* pages claimed but not possessed    */
     unsigned int     max_pages;       /* maximum value for tot_pages        */
     atomic_t         shr_pages;       /* number of shared pages             */
     atomic_t         paged_pages;     /* number of paged-out pages          */

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread