* [RFC][PATCH] Basic support for page offline
@ 2009-02-09  8:54 Jiang, Yunhong
  2009-02-10  9:15 ` Tim Deegan
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-09  8:54 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

Hi Tim, this patchset tries to add support for page offline requests. I want to get some initial feedback before doing more testing.

Page offline can be used in multiple usage models; below are some examples:
a) If too many correctable errors happen on one page, management tools may try to offline the page to avoid more severe errors in the future;
b) When a page has an ECC error that can't be recovered by hardware, Xen's MCA handler may try to offline the page so that it will not be accessed anymore;
c) Offlining a whole DIMM for power management etc. (of course, this involves far more than simple page offline).

The basic idea for offlining a page is:
1) If the page is free, it is removed from the page allocator.
2) If the page is in use, its owner is checked:
  2.1) If it is owned by Xen/dom0, the offline request fails.
  2.2) If it is owned by a PV guest with no device assigned, user space tools will try to replace the page with a new one.
  2.3) If it is owned by an HVM guest with no device assigned, user space tools will try to live-migrate it.
  2.4) If it is owned by a guest with a device assigned, user space tools can do live migration if needed.

This patchset includes support for cases 2.1 and 2.2; the intended decision flow is sketched below.
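
To make that flow concrete, here is a minimal, self-contained sketch; the enum values and the classify_page() helper are illustrative only and are not part of the patchset:

    enum offline_action { OFFLINE_NOW, OFFLINE_FAIL, REPLACE_PAGE, LIVE_MIGRATE };

    /* Illustrative only: how a page-offline request is expected to be handled. */
    static enum offline_action classify_page(int page_free, int owned_by_xen_or_dom0,
                                             int is_hvm, int has_device_assigned)
    {
        if (page_free)
            return OFFLINE_NOW;        /* case 1: remove from the page allocator */
        if (owned_by_xen_or_dom0)
            return OFFLINE_FAIL;       /* case 2.1: the request fails            */
        if (has_device_assigned)
            return LIVE_MIGRATE;       /* case 2.4: tools may live-migrate       */
        return is_hvm ? LIVE_MIGRATE   /* case 2.3: live-migrate the HVM guest   */
                      : REPLACE_PAGE;  /* case 2.2: swap in a new page (PV)      */
    }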

page_offline_xen.patch gives the basic support. The new hypercall (XEN_SYSCTL_page_offline) marks a page as offlining if the page is in use; otherwise it removes the page from the page allocator. It also changes free_heap_pages(), so that when an offlining page is freed, that page is marked as offlined and will not be allocated anymore. One tricky thing is that the offlined page may not be buddy-aligned (i.e., it may sit in the middle of a 2^order block), so we have to re-arrange the buddy system (i.e., heap[][][]) carefully.
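
To illustrate the alignment problem, the following standalone sketch splits a free 2^order block so that one page in its middle can be withheld while the remaining aligned sub-blocks go back to the free lists. It only shows the idea behind reserve_heap_pages_order() in the patch below, not the actual heap[][][] code:

    #include <stdio.h>

    /* Sketch: split the free block [base, base + 2^order) so that page 'bad'
     * is isolated; every other aligned sub-block would be returned to the
     * buddy free lists. */
    static void split_around(unsigned long base, unsigned int order, unsigned long bad)
    {
        while (order > 0) {
            unsigned long half = 1UL << (order - 1);
            if (bad < base + half) {
                printf("free [%lx, %lx) order %u\n", base + half, base + 2 * half, order - 1);
            } else {
                printf("free [%lx, %lx) order %u\n", base, base + half, order - 1);
                base += half;
            }
            order--;
        }
        printf("keep %lx (offlined)\n", bad);
    }

    int main(void)
    {
        split_around(0x1000, 4, 0x1006); /* 16-page block, offline page 0x1006 */
        return 0;
    }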

page_offline_xen_memory.patch adds support for PV guests: a new hypercall (XENMEM_page_offline) tries to replace the old page with a new one. This happens only when the guest has been suspended, to avoid complex page-sharing situations. I'm still checking whether more situations need to be considered, like LDT pages and CR3 pages, so any suggestion would be a great help.
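
The caller side, following the _pages_offline() helper in the tools patch below, looks roughly like this sketch (a single-page variant; error handling and buffer locking are omitted):

    /* Sketch of driving XENMEM_page_offline from dom0 user space. */
    static int offline_one_pv_page(int xc_handle, domid_t domid,
                                   xen_pfn_t old_mfn, xen_pfn_t *new_mfn)
    {
        struct xen_memory_page_offline offline;
        uint32_t mem_flags = 0;

        offline.num = 1;
        offline.domid = domid;
        offline.nr_offlined = 0;
        set_xen_guest_handle(offline.start_mfn, &old_mfn);
        set_xen_guest_handle(offline.mem_flags, &mem_flags);
        set_xen_guest_handle(offline.out, new_mfn);

        /* The guest must already be suspended; Xen rejects running domains. */
        return xc_memory_op(xc_handle, XENMEM_page_offline, &offline);
    }

After the hypercall succeeds, the tool still has to patch the guest's p2m entry for the replaced frame (see update_p2m_table() in the tools patch) before resuming the domain.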

page_offline_tools.patch is an example user space tool based on libxc/xc_domain_save.c. It first tries to mark a page offline and then checks the result; if the page is owned by a PV guest, it tries to replace the page.
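
The per-page status word returned by XEN_SYSCTL_page_offline drives the tool's main loop. Decoding it looks roughly like the sketch below; the flag names come from the sysctl.h hunk in the Xen patch, while report_status() itself is illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: interpret one status word returned by XEN_SYSCTL_page_offline.
     * Bits 0-15 are result flags, bits 16-31 hold the owning domid. */
    static void report_status(unsigned long pfn, uint32_t status)
    {
        unsigned int owner = status >> PG_OFFLINE_OWNER_SHIFT;

        if (status & PG_OFFLINE_OFFLINED)
            printf("pfn %lx: offlined\n", pfn);
        else if (status & PG_OFFLINE_PENDING)
            printf("pfn %lx: in use by domain %u, needs replace/migrate\n", pfn, owner);
        else if (status & PG_OFFLINE_FAILED)
            printf("pfn %lx: offline failed (misc flags %x)\n", pfn,
                   (unsigned int)(status & PG_OFFLINE_MISC_MASK));
    }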

I did some basic testing with free pages and PV guest pages, and it works. Of course, more testing is needed, as well as more robust error handling.

Any suggestion is welcome.

Thanks
Yunhong Jiang

[-- Attachment #2: page_offline_xen_memory.patch --]
[-- Type: application/octet-stream, Size: 8494 bytes --]

memory exchange for PV domain

diff -r b736475df064 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/arch/x86/mm.c	Mon Feb 09 01:39:03 2009 +0800
@@ -4812,6 +4812,122 @@ void memguard_guard_stack(void *p)
     memguard_guard_range(p, PAGE_SIZE);
 }
 
+int update_pgtable_entry(struct domain *d, xen_pfn_t in_frame,
+                         xen_pfn_t out_frame, int ref)
+{
+    struct page_info *page;
+    int changed = 0;
+    struct page_info *pg = mfn_to_page(in_frame);
+
+    page_list_for_each ( page, &d->page_list )
+    {
+        unsigned long scan_type;
+
+        scan_type = page->u.inuse.type_info & PGT_type_mask;
+        switch (scan_type)
+        {
+#define REPLACE_L(x)    \
+    case  PGT_l##x##_page_table:    \
+    {   \
+        int i; \
+        unsigned long flags, mfn;    \
+        l##x##_pgentry_t *entry;     \
+        entry = map_domain_page(page_to_mfn(page)); \
+        for (i = 0; i < L##x##_PAGETABLE_ENTRIES; i++)  \
+        {                                                   \
+            mfn = l##x##e_get_pfn(entry[i]);    \
+            flags = l##x##e_get_flags(entry[i]);    \
+            if (mfn == in_frame)            \
+            {\
+                entry[i] = l##x##e_from_pfn(out_frame, flags);\
+                if (ref)    \
+                    put_page_and_type(pg);  \
+                changed = 1;    \
+                printk("update one entry here\n");  \
+            }\
+        }   \
+        unmap_domain_page(entry);   \
+        break;  \
+    }
+        REPLACE_L(1)
+        REPLACE_L(2)
+        REPLACE_L(3)
+        REPLACE_L(4)
+
+        default:
+        break;
+        }
+    }
+
+    return changed;
+}
+
+int replace_page(struct domain *d, xen_pfn_t in_frame,
+                 xen_pfn_t *out_frame, unsigned int memflags)
+{
+    xen_pfn_t out_mfn;
+    unsigned long type_info, count_info;
+    struct page_info *pg = mfn_to_page(in_frame), *out;
+
+    if (d != page_get_owner(pg))
+    {
+        dprintk(XENLOG_ERR, "replace page %lx not owned by domain %x\n",
+                 in_frame, d->domain_id);
+        return -EINVAL;
+    }
+
+    out = alloc_domheap_page(NULL, memflags);
+    if (!out)
+    {
+        put_page(mfn_to_page(in_frame));
+        return -ENOMEM;
+    }
+    out_mfn = page_to_mfn(out);
+
+    spin_lock(&d->page_alloc_lock);
+
+    /* Copy the page_info to keep all count/typecount info */
+    type_info = pg->u.inuse.type_info & PGT_type_mask;
+    count_info = pg->count_info;
+
+    /* Take a temporary reference so the page cannot be freed during this process */
+    if ( unlikely(!get_page(mfn_to_page(in_frame), d)) )
+    {
+        dprintk(XENLOG_INFO, "Fail to get in_frame %lx when replace page\n",
+                             in_frame);
+        return -EINVAL;
+    }
+
+    update_pgtable_entry(d, in_frame, out_mfn, 1);
+    if ( (pg->count_info & PGC_count_mask) != 2 )
+    {
+        dprintk(XENLOG_WARNING, "page is granted to others? count_info %lx\n", 
+                pg->count_info);
+        dprintk(XENLOG_WARNING, "type info %lx\n", pg->u.inuse.type_info);
+        update_pgtable_entry(d, out_mfn, in_frame, 0);
+        free_domheap_page(out);
+        put_page(mfn_to_page(in_frame));
+        spin_unlock(&d->page_alloc_lock);
+        return -EINVAL;
+    }
+
+    guest_remove_page(d, in_frame);
+
+    out->count_info = count_info;
+    out->u.inuse.type_info = type_info;
+    page_set_owner(out, d);
+    wmb(); /* Domain pointer must be visible before updating refcnt. */
+    page_list_add_tail(out, &d->page_list);
+
+    spin_unlock(&d->page_alloc_lock);
+
+    put_page(mfn_to_page(in_frame));
+
+    *out_frame = out_mfn;
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff -r b736475df064 xen/common/memory.c
--- a/xen/common/memory.c	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/common/memory.c	Mon Feb 09 01:34:38 2009 +0800
@@ -22,6 +22,7 @@
 #include <asm/current.h>
 #include <asm/hardirq.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include <public/memory.h>
 #include <xsm/xsm.h>
 
@@ -214,6 +215,91 @@ static void decrease_reservation(struct 
  out:
     a->nr_done = i;
 }
+
+static long memory_page_offline(XEN_GUEST_HANDLE(xen_memory_page_offline_t) arg)
+{
+    struct xen_memory_page_offline offline;
+    unsigned long i;
+    long          rc = 0;
+    struct domain *d;
+    struct page_info *page;
+
+    if ( copy_from_guest(&offline, arg, 1) )
+        return -EFAULT;
+
+    /* only privileged domain can ask for this */
+    if (!IS_PRIV(current->domain))
+        return -EPERM;
+
+    d = get_domain_by_id(offline.domid);
+
+    if (!d)
+        return -EINVAL;
+    /* Domain must be shutdown in advance */
+    if (!(d->is_shut_down && (d->shutdown_code == SHUTDOWN_suspend)))
+    {
+        dprintk(XENLOG_WARNING, "Try to offline page for running domain \n");
+        put_domain(d);
+        return -EINVAL;
+    }
+
+    for ( i = offline.nr_offlined;
+          i < offline.num;
+          i++ )
+    {
+        unsigned int  node, mem_flags = 0;
+        xen_pfn_t in_frame, out_frame;
+
+        if ( hypercall_preempt_check() )
+        {
+            offline.nr_offlined = i;
+            if ( copy_field_to_guest(arg, &offline, nr_offlined) )
+                return -EFAULT;
+            return hypercall_create_continuation(
+                __HYPERVISOR_memory_op, "lh", XENMEM_page_offline, arg);
+        }
+
+        if (unlikely(__copy_from_guest_offset(
+                  &mem_flags, offline.mem_flags, i, 1)))
+        {
+            rc = -EFAULT;
+            break;
+        }
+
+        if (unlikely(__copy_from_guest_offset(
+                  &in_frame, offline.start_mfn, i, 1)))
+        {
+            rc = -EFAULT;
+            break;
+        }
+
+        page = mfn_to_page(in_frame);
+        node = XENMEMF_get_node(mem_flags);
+        mem_flags |= MEMF_bits(domain_clamp_alloc_bitsize(
+                              d,
+                              XENMEMF_get_address_bits(mem_flags) ? :
+                              (BITS_PER_LONG+PAGE_SHIFT)));
+        if ( node == NUMA_NO_NODE )
+            node = domain_to_node(d);
+        mem_flags |= MEMF_node(node);
+
+        rc = replace_page(d, in_frame,&out_frame, mem_flags);
+
+        /*
+         * No need for rollback since the replace is harmless
+         */
+        if (rc)
+            break;
+
+        __copy_to_guest_offset(
+          offline.out, i, &out_frame, 1);
+    }
+
+    offline.nr_offlined = i;
+    if ( copy_field_to_guest(arg, &offline, nr_offlined) )
+        rc = -EFAULT;
+    return rc;
+} 
 
 static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
 {
@@ -513,6 +599,10 @@ long do_memory_op(unsigned long cmd, XEN
         rc = memory_exchange(guest_handle_cast(arg, xen_memory_exchange_t));
         break;
 
+    case XENMEM_page_offline:
+        rc = memory_page_offline(guest_handle_cast(arg, xen_memory_page_offline_t));
+        break;
+
     case XENMEM_maximum_ram_page:
         rc = max_page;
         break;
diff -r b736475df064 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/include/asm-x86/mm.h	Sun Feb 08 23:20:19 2009 +0800
@@ -492,6 +492,8 @@ unsigned int domain_clamp_alloc_bitsize(
 # define domain_clamp_alloc_bitsize(d, b) (b)
 #endif
 
+int replace_page(struct domain *d, xen_pfn_t in_frame, xen_pfn_t *out_frame, unsigned int mem_flags);
+
 unsigned long domain_get_maximum_gpfn(struct domain *d);
 
 extern struct domain *dom_xen, *dom_io;	/* for vmcoreinfo */
diff -r b736475df064 xen/include/public/memory.h
--- a/xen/include/public/memory.h	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/include/public/memory.h	Sun Feb 08 23:20:19 2009 +0800
@@ -129,6 +129,21 @@ typedef struct xen_memory_exchange xen_m
 typedef struct xen_memory_exchange xen_memory_exchange_t;
 DEFINE_XEN_GUEST_HANDLE(xen_memory_exchange_t);
 
+#define XENMEM_page_offline         18
+struct xen_memory_page_offline {
+    uint32_t num;
+    domid_t  domid;
+
+    XEN_GUEST_HANDLE(xen_pfn_t) start_mfn;
+    XEN_GUEST_HANDLE(uint32)  mem_flags;
+
+    XEN_GUEST_HANDLE(xen_pfn_t) out;
+
+    xen_ulong_t nr_offlined;
+};
+typedef struct xen_memory_page_offline xen_memory_page_offline_t;
+DEFINE_XEN_GUEST_HANDLE(xen_memory_page_offline_t);
+
 /*
  * Returns the maximum machine frame number of mapped RAM in this system.
  * This command always succeeds (it never returns an error code).

[-- Attachment #3: page_offline_tools.patch --]
[-- Type: application/octet-stream, Size: 20534 bytes --]

the tools changes

diff -r f1756e5c1203 tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/libxc/xc_misc.c	Sun Feb 08 23:20:21 2009 +0800
@@ -80,6 +80,23 @@ int xc_physinfo(int xc_handle,
     return 0;
 }
 
+int xc_mark_pages_offline(int xc_handle,
+                          int start, int end,
+                          uint32_t *status)
+{
+    int ret;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_page_offline;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.end = end;
+
+    ret = do_sysctl(xc_handle, &sysctl);
+    
+    return ret;
+ }
+ 
 int xc_sched_id(int xc_handle,
                 int *sched_id)
 {
diff -r f1756e5c1203 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/libxc/xenctrl.h	Sun Feb 08 23:20:21 2009 +0800
@@ -608,6 +608,10 @@ int xc_physinfo(int xc_handle,
 int xc_physinfo(int xc_handle,
                 xc_physinfo_t *info);
 
+int xc_mark_pages_offline(int xc_handle,
+                          int start, int end,
+                          uint32_t *status);
+
 int xc_sched_id(int xc_handle,
                 int *sched_id);
 
diff -r f1756e5c1203 tools/xcutils/Makefile
--- a/tools/xcutils/Makefile	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/xcutils/Makefile	Sun Feb 08 23:20:21 2009 +0800
@@ -14,7 +14,7 @@ CFLAGS += -Werror
 CFLAGS += -Werror
 CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore)
 
-PROGRAMS = xc_restore xc_save readnotes lsevtchn
+PROGRAMS = xc_restore xc_save readnotes lsevtchn xc_page
 
 LDLIBS   = $(LDFLAGS_libxenctrl) $(LDFLAGS_libxenguest) $(LDFLAGS_libxenstore)
 
diff -r f1756e5c1203 tools/xcutils/xc_page.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/xcutils/xc_page.c	Mon Feb 09 01:36:33 2009 +0800
@@ -0,0 +1,698 @@
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+
+#include <xs.h>
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+#define GET_FIELD(_p, _f) ((guest_width==8) ? ((_p)->x64._f) : ((_p)->x32._f))
+#define M2P_SHIFT       L2_PAGETABLE_SHIFT_PAE
+#define M2P_CHUNK_SIZE  (1 << M2P_SHIFT)
+#define M2P_SIZE(_m)    ROUNDUP(((_m) * sizeof(xen_pfn_t)), M2P_SHIFT)
+#define M2P_CHUNKS(_m)  (M2P_SIZE((_m)) >> M2P_SHIFT)
+
+inline int is_hvm_domain(xc_dominfo_t *info)
+{
+    return info->hvm;
+}
+
+xen_pfn_t *live_m2p = NULL;
+#define mfn_to_pfn(_mfn)  (live_m2p[(_mfn)])
+
+static xen_pfn_t *xc_map_m2p(int xc_handle,
+                                 unsigned long max_mfn,
+                                 int prot)
+{
+    struct xen_machphys_mfn_list xmml;
+    privcmd_mmap_entry_t *entries;
+    unsigned long m2p_chunks, m2p_size;
+    xen_pfn_t *m2p;
+    xen_pfn_t *extent_start;
+    int i;
+
+    m2p = NULL;
+    m2p_size   = M2P_SIZE(max_mfn);
+    m2p_chunks = M2P_CHUNKS(max_mfn);
+
+    xmml.max_extents = m2p_chunks;
+
+    extent_start = calloc(m2p_chunks, sizeof(xen_pfn_t));
+    if ( !extent_start )
+    {
+        ERROR("failed to allocate space for m2p mfns");
+        goto err0;
+    }
+    set_xen_guest_handle(xmml.extent_start, extent_start);
+
+    if ( xc_memory_op(xc_handle, XENMEM_machphys_mfn_list, &xmml) ||
+         (xmml.nr_extents != m2p_chunks) )
+    {
+        ERROR("xc_get_m2p_mfns");
+        goto err1;
+    }
+
+    entries = calloc(m2p_chunks, sizeof(privcmd_mmap_entry_t));
+    if (entries == NULL)
+    {
+        ERROR("failed to allocate space for mmap entries");
+        goto err1;
+    }
+
+    for ( i = 0; i < m2p_chunks; i++ )
+        entries[i].mfn = extent_start[i];
+
+    m2p = xc_map_foreign_ranges(xc_handle, DOMID_XEN,
+			m2p_size, prot, M2P_CHUNK_SIZE,
+			entries, m2p_chunks);
+    if (m2p == NULL)
+    {
+        ERROR("xc_mmap_foreign_ranges failed");
+        goto err2;
+    }
+
+err2:
+    free(entries);
+err1:
+    free(extent_start);
+
+err0:
+    return m2p;
+}
+
+static void *map_frame_list_list(int xc_handle, uint32_t dom,
+                                 shared_info_any_t *shinfo,
+                                 int guest_width)
+{
+    int count = 100;
+    void *p;
+    uint64_t fll = GET_FIELD(shinfo, arch.pfn_to_mfn_frame_list_list);
+
+    while ( count-- && (fll == 0) )
+    {
+        usleep(10000);
+        fll = GET_FIELD(shinfo, arch.pfn_to_mfn_frame_list_list);
+    }
+
+    if ( fll == 0 )
+    {
+        ERROR("Timed out waiting for frame list updated.");
+        return NULL;
+    }
+
+    p = xc_map_foreign_range(xc_handle, dom, PAGE_SIZE, PROT_READ, fll);
+    if ( p == NULL )
+        ERROR("Couldn't map p2m_frame_list_list (errno %d)", errno);
+
+    return p;
+}
+
+static void *map_p2m_table(int xc_handle, uint32_t domid, int guest_width)
+{
+    xc_dominfo_t info;
+    static unsigned long p2m_size;
+    void *live_p2m_frame_list_list = NULL;
+    void *live_p2m_frame_list = NULL;
+    /* Copies of the above. */
+    xen_pfn_t *p2m_frame_list_list = NULL;
+    xen_pfn_t *p2m_frame_list = NULL;
+    unsigned long shared_info_frame;
+    shared_info_any_t *live_shinfo = NULL;
+
+    /* The mapping of the live p2m table itself */
+    xen_pfn_t *p2m = NULL;
+    int i = 0;
+
+    /* Get the size of the P2M table */
+    p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &domid) + 1;
+
+
+    if ( xc_domain_getinfo(xc_handle, domid, 1, &info) != 1 )
+    {
+        fprintf(stderr, "Could not get domain info");
+        return NULL;
+    }
+    shared_info_frame = info.shared_info_frame;
+
+    live_shinfo = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                      PROT_READ, shared_info_frame);
+    if ( !live_shinfo )
+    {
+        fprintf(stderr, "Couldn't map live_shinfo");
+        goto out;
+    }
+
+    live_p2m_frame_list_list = map_frame_list_list(xc_handle, domid,
+                                                   live_shinfo, guest_width);
+
+    munmap(live_shinfo, PAGE_SIZE);
+    live_shinfo = NULL;
+
+    if ( !live_p2m_frame_list_list )
+    {
+        fprintf(stderr, "Could not get live p2m_frame_list_list\n");
+        goto out;
+    }
+
+    /* Get a local copy of the live_P2M_frame_list_list */
+    if ( !(p2m_frame_list_list = malloc(PAGE_SIZE)) )
+    {
+        fprintf(stderr, "Couldn't allocate p2m_frame_list_list array");
+        goto out;
+    }
+    memcpy(p2m_frame_list_list, live_p2m_frame_list_list, PAGE_SIZE);
+    munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    live_p2m_frame_list_list = NULL;
+
+    /* Canonicalize guest's unsigned long vs ours */
+    if ( guest_width > sizeof(unsigned long) )
+        for ( i = 0; i < PAGE_SIZE/sizeof(unsigned long); i++ )
+            if ( i < PAGE_SIZE/guest_width )
+                p2m_frame_list_list[i] = ((uint64_t *)p2m_frame_list_list)[i];
+            else
+                p2m_frame_list_list[i] = 0;
+    else if ( guest_width < sizeof(unsigned long) )
+        for ( i = PAGE_SIZE/sizeof(unsigned long) - 1; i >= 0; i-- )
+            p2m_frame_list_list[i] = ((uint32_t *)p2m_frame_list_list)[i];
+
+    live_p2m_frame_list =
+        xc_map_foreign_batch(xc_handle, domid, PROT_READ,
+                             p2m_frame_list_list,
+                             P2M_FLL_ENTRIES);
+    if ( !live_p2m_frame_list )
+    {
+        fprintf(stderr, "Couldn't map p2m_frame_list");
+        goto out;
+    }
+    free(p2m_frame_list_list);
+    p2m_frame_list_list = NULL;
+
+    /* Get a local copy of the live_P2M_frame_list */
+    if ( !(p2m_frame_list = malloc(P2M_TOOLS_FL_SIZE)) )
+    {
+        ERROR("Couldn't allocate p2m_frame_list array");
+        goto out;
+    }
+
+    memset(p2m_frame_list, 0, P2M_TOOLS_FL_SIZE);
+    memcpy(p2m_frame_list, live_p2m_frame_list, P2M_GUEST_FL_SIZE);
+    munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    live_p2m_frame_list = NULL;
+
+    /* Canonicalize guest's unsigned long vs ours */
+    if ( guest_width > sizeof(unsigned long) )
+        for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+            p2m_frame_list[i] = ((uint64_t *)p2m_frame_list)[i];
+    else if ( guest_width < sizeof(unsigned long) )
+        for ( i = P2M_FL_ENTRIES - 1; i >= 0; i-- )
+            p2m_frame_list[i] = ((uint32_t *)p2m_frame_list)[i];
+
+
+    /* Map all the frames of the pfn->mfn table. For migrate to succeed,
+       the guest must not change which frames are used for this purpose.
+       (its not clear why it would want to change them, and we'll be OK
+       from a safety POV anyhow. */
+
+    p2m = xc_map_foreign_batch(xc_handle, domid, PROT_READ,
+                               p2m_frame_list,
+                               P2M_FL_ENTRIES);
+    if ( !p2m )
+    {
+        fprintf(stderr, "Couldn't map p2m table");
+        goto out;
+    }
+    free(p2m_frame_list);
+    p2m_frame_list = NULL;
+
+    return p2m;
+
+out:
+    if (live_p2m_frame_list_list)
+        munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    if (live_p2m_frame_list)
+        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    if (p2m_frame_list_list)
+        free(p2m_frame_list_list);
+    if (p2m_frame_list)
+        free(p2m_frame_list);
+    if (live_shinfo)
+        munmap(live_shinfo, PAGE_SIZE);
+
+    return NULL;
+}
+
+int update_p2m_table(void *p2m, xen_pfn_t *old, xen_pfn_t *new,
+                     int num, int guest_width)
+{
+    int i;
+
+    if (!p2m)
+        return -1;
+
+    for (i = 0; i < num; i++)
+    {
+        if (guest_width == 4)
+            ((unsigned int*)p2m)[mfn_to_pfn(old[i])] = new[i];
+        else
+            ((unsigned long *)p2m)[mfn_to_pfn(old[i])] = 
+                        (unsigned long)(new[i]);
+    }
+
+    return 0;
+}
+
+struct suspendinfo {
+    int xc_fd; /* libxc handle */
+    int xce; /* event channel handle */
+    int suspend_evtchn;
+    int domid;
+    unsigned int flags;
+};
+
+static int suspend_evtchn_release(struct suspendinfo *si)
+{
+    if (si->suspend_evtchn >= 0) {
+        xc_evtchn_unbind(si->xce, si->suspend_evtchn);
+        si->suspend_evtchn = -1;
+    }
+    if (si->xce >= 0) {
+        xc_evtchn_close(si->xce);
+        si->xce = -1;
+    }
+
+    return 0;
+}
+
+static int await_suspend(struct suspendinfo *si)
+{
+    int rc;
+
+    do {
+        rc = xc_evtchn_pending(si->xce);
+        if (rc < 0) {
+            ERROR("error polling suspend notification channel: %d", rc);
+            return -1;
+        }
+    } while (rc != si->suspend_evtchn);
+
+    /* harmless for one-off suspend */
+    if (xc_evtchn_unmask(si->xce, si->suspend_evtchn) < 0)
+        ERROR("failed to unmask suspend notification channel: %d", rc);
+
+    return 0;
+}
+
+static int suspend_evtchn_init(int xc, int domid, struct suspendinfo *si)
+{
+    struct xs_handle *xs;
+    char path[128];
+    char *portstr;
+    unsigned int plen;
+    int port;
+    int rc;
+
+    si->xce = -1;
+    si->suspend_evtchn = -1;
+
+    xs = xs_daemon_open();
+    if (!xs) {
+        ERROR("failed to get xenstore handle");
+        return -1;
+    }
+    sprintf(path, "/local/domain/%d/device/suspend/event-channel", domid);
+    portstr = xs_read(xs, XBT_NULL, path, &plen);
+    xs_daemon_close(xs);
+
+    if (!portstr || !plen) {
+        ERROR("could not read suspend event channel");
+        return -1;
+    }
+
+    port = atoi(portstr);
+    free(portstr);
+
+    si->xce = xc_evtchn_open();
+    if (si->xce < 0) {
+        ERROR("failed to open event channel handle");
+        goto cleanup;
+    }
+
+    si->suspend_evtchn = xc_evtchn_bind_interdomain(si->xce, domid, port);
+    if (si->suspend_evtchn < 0) {
+        ERROR("failed to bind suspend event channel: %d", si->suspend_evtchn);
+        goto cleanup;
+    }
+
+    rc = xc_domain_subscribe_for_suspend(xc, domid, port);
+    if (rc < 0) {
+        ERROR("failed to subscribe to domain: %d", rc);
+        goto cleanup;
+    }
+
+    /* event channel is pending immediately after binding */
+    await_suspend(si);
+
+    return 0;
+
+  cleanup:
+    suspend_evtchn_release(si);
+
+    return -1;
+}
+
+/**
+ * Issue a suspend request to a dedicated event channel in the guest, and
+ * receive the acknowledgement from the subscribe event channel.
+ */
+static int evtchn_suspend(struct suspendinfo *si)
+{
+    int rc;
+
+    rc = xc_evtchn_notify(si->xce, si->suspend_evtchn);
+    if (rc < 0) {
+        ERROR("failed to notify suspend request channel: %d", rc);
+        return 0;
+    }
+
+    if (await_suspend(si) < 0) {
+        ERROR("suspend failed");
+        return 0;
+    }
+
+    /* notify xend that it can do device migration */
+    printf("suspended\n");
+    fflush(stdout);
+
+    return 1;
+}
+
+/* More consideration here like CR3 etc */
+int _pages_offline(int xc_handle, int domid, xen_pfn_t *old_mfn, xen_pfn_t *new_mfn, int num, int *done )
+{
+    struct xen_memory_page_offline offline;
+    xen_pfn_t *in_frames, *out_frames;
+    uint32_t *mem_flags;
+    int i, err;
+
+    in_frames = malloc(num * sizeof(xen_pfn_t));
+    if (!in_frames)
+        return -ENOMEM;
+
+    out_frames = malloc(num * sizeof(xen_pfn_t));
+    if (!out_frames)
+    {
+        free(in_frames);
+        return -ENOMEM;
+    }
+
+    mem_flags = malloc(num * sizeof(uint32_t));
+
+    if (!mem_flags)
+    {
+        free(in_frames);
+        free(out_frames);
+        return -ENOMEM;
+    }
+    memset(mem_flags, 0, num * sizeof(uint32_t));
+
+    for (i = 0; i < num; i++)
+        in_frames[i] = old_mfn[i];
+
+    offline.num = num;
+
+    offline.domid = domid;
+
+    offline.nr_offlined = 0;
+
+    set_xen_guest_handle(offline.start_mfn, in_frames);
+    set_xen_guest_handle(offline.out, out_frames);
+    set_xen_guest_handle(offline.mem_flags, mem_flags);
+
+    err = xc_memory_op(xc_handle, XENMEM_page_offline, &offline);
+
+    if (err)
+    {
+        ERROR("failed to get the memory exchange done \n");
+        return -1;
+    }
+    
+    for (i = 0; i < num; i++)
+        new_mfn[i] = out_frames[i];
+    *done = num;
+
+    free(in_frames);
+    free(out_frames);
+
+    return 0;
+}
+
+int domain_page_offline(int xc_handle, int domid,
+                        xen_pfn_t *mfn, int num,
+                        int *done)
+{
+    xc_dominfo_t info;
+    int ret = 0, guest_width;
+    struct suspendinfo si;
+    xen_pfn_t *new_mfn = NULL;
+    DECLARE_DOMCTL;
+    void *p2m = NULL;
+    unsigned long max_mfn = 0;
+
+    if (xc_domain_getinfo(xc_handle, domid, 1, &info) != 1)
+    {
+        fprintf(stderr, "Domain get info failed\n");
+        goto error;
+    }
+
+    if (is_hvm_domain(&info))
+    {
+        fprintf(stderr, "we need to utilize live migration for HVM domains\n");
+        ret = -EINVAL;
+        goto error;
+    }
+
+    *done = 0;
+
+    new_mfn = malloc(num * sizeof(xen_pfn_t));
+
+    if (!new_mfn)
+        return -ENOMEM;
+
+    if ((ret = suspend_evtchn_init(xc_handle, domid, &si)))
+    {
+        fprintf(stderr, "suspend_evtchn init failed\n");
+        goto error;
+    }
+
+    evtchn_suspend(&si);
+
+    *done = 0;
+    /* We pass mfn to Xen HV, instead of gpfn */
+    ret = _pages_offline(xc_handle, domid, mfn, new_mfn, num, done);
+
+    if (ret)
+    {
+        fprintf(stderr, "%x page offline request failed with %x done\n",
+                         num, *done);
+        goto error;
+    }
+
+    fprintf(stderr, "now we have offlined the pages\n");
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.domain = domid;
+    domctl.cmd = XEN_DOMCTL_get_address_size;
+    if ( do_domctl(xc_handle, &domctl) != 0 )
+    {
+        fprintf(stderr, "Could not get guest width\n");
+        goto error;
+    }
+
+    guest_width = domctl.u.address_size.size / 8;
+    p2m = map_p2m_table(xc_handle, domid, guest_width);
+
+    max_mfn = xc_memory_op(xc_handle, XENMEM_maximum_ram_page, NULL);
+
+    live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ);
+
+    if (!live_m2p)
+        goto error;
+
+    fprintf(stderr, "have mapped the p2m table\n");
+    /* update guest's P2M table here */
+    update_p2m_table(p2m, mfn, new_mfn, num, guest_width); 
+
+error:
+    if (new_mfn)
+        free(new_mfn);
+
+    if (live_m2p)
+        munmap(live_m2p, M2P_SIZE(max_mfn));
+
+    suspend_evtchn_release(&si);
+    /* resume guest now */
+    xc_domain_resume(xc_handle, domid, 1);
+
+    return ret;
+}
+
+static int xc_mark_page_offline(int xc, unsigned long start,
+                                unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if (end < start)
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+#define PAGE_OFFLINE_HANDLED (0x1UL << 8)
+#define page_offline_owner(x)   \
+        (x >> PG_OFFLINE_OWNER_SHIFT)
+
+static int xc_page_offline(unsigned long start, unsigned long end)
+{
+    int rc = 0, i, num, check_count, xc_handle;
+    uint32_t *status = NULL;
+    xen_pfn_t *pfns = NULL;
+
+    if (end < start)
+    {
+        fprintf(stderr, "End %lx is smaller than start %lx\n", end, start);
+        return -EINVAL;
+    }
+
+    xc_handle = xc_interface_open();
+
+    if (!xc_handle)
+        return -1;
+
+    num = end - start + 1;
+
+    rc = -ENOMEM;
+    pfns = malloc(num * sizeof(xen_pfn_t));
+    if (!pfns)
+        goto fail;
+    memset(pfns, 0, sizeof(xen_pfn_t) * num);
+
+    status  = malloc(num * sizeof(uint32_t));
+    if (!status)
+        goto fail;
+    memset(status, 0, sizeof(uint32_t) * num);
+
+    rc = xc_mark_page_offline(xc_handle, start, end, status);
+
+    if (rc)
+    {
+        fprintf(stderr, "fail to mark pages offline\n");
+        goto fail;
+    }
+
+    rc = 0;
+    check_count = 0;
+    while (check_count != num)
+    {
+        uint32_t pstat = status[check_count];
+
+        fprintf(stderr, "check_count %x pstat %x\n",
+                check_count, pstat);
+        if (pstat & PAGE_OFFLINE_HANDLED)
+        {
+            check_count ++;
+            continue;
+        }
+
+        switch (pstat & PG_OFFLINE_STATUS_MASK)
+        {
+        case PG_OFFLINE_OFFLINED:
+            check_count ++;
+            break;
+        case PG_OFFLINE_PENDING:
+        {
+            domid_t owner = page_offline_owner(pstat);
+            int j = 0, done;
+
+            /* Should HV present such information ?? */
+            if (owner >= DOMID_FIRST_RESERVED)
+            {
+                fprintf(stderr, "special domain ownership\n");
+                check_count++;
+                continue;
+            }
+
+            /* get all pages with the same owner */
+            memset(pfns, 0, sizeof(xen_pfn_t) * num);
+            for (i = check_count; i < num; i++)
+            {
+                if (page_offline_owner(status[i]) == owner)
+                {
+                    status[i] |= PAGE_OFFLINE_HANDLED;
+                    pfns[j++] = start + i;
+                }
+            }
+
+            /* offline the pages */
+            rc = domain_page_offline(xc_handle, owner, pfns, j, &done);
+            if (rc)
+            {
+                /* XXX need take recovery if can't offline all pages? */ 
+                fprintf(stderr, "failed to offline domain %x's page\n"
+                        "total page %x done %x\n", owner, j, done);
+                goto fail;
+            }
+            check_count ++;
+            break;
+        }
+
+        default:
+            fprintf(stderr, "Error page offline status %x\n", pstat);
+            goto fail;
+        }
+    }
+    xc_interface_close(xc_handle);
+fail:
+    if (status)
+        free(status);
+    if (pfns)
+        free(pfns);
+    return rc;    
+}
+
+int
+main(int argc, char **argv)
+{
+    unsigned long start, end;
+
+    if (argc != 3)
+        fprintf(stderr, "usage: %s start end", argv[0]);
+
+    errno = 0;
+    start = strtoul(argv[1], NULL, 0);
+    end = strtoul(argv[2], NULL, 0);
+
+    if (errno){
+        fprintf(stderr, "usage: %s start end", argv[0]);
+    }
+
+    xc_page_offline(start, end);
+
+    return 1;
+}

[-- Attachment #4: page_offline_xen.patch --]
[-- Type: application/octet-stream, Size: 25865 bytes --]

new version of page offline

diff -r fb35bb57bba6 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/common/page_alloc.c	Sun Feb 08 23:19:58 2009 +0800
@@ -35,9 +35,13 @@
 #include <xen/perfc.h>
 #include <xen/numa.h>
 #include <xen/nodemask.h>
+#include <public/sysctl.h>
 #include <asm/page.h>
 #include <asm/numa.h>
 #include <asm/flushtlb.h>
+
+#define dbg_offpage(_f, _a...)    \
+    printk(XENLOG_DEBUG "PAGE_OFFLINE %s:%d: " _f,  __FILE__ , __LINE__ , ## _a)
 
 /*
  * Comma-separated list of hexadecimal page numbers containing bad bytes.
@@ -73,6 +77,12 @@ static DEFINE_SPINLOCK(page_scrub_lock);
 static DEFINE_SPINLOCK(page_scrub_lock);
 PAGE_LIST_HEAD(page_scrub_list);
 static unsigned long scrub_pages;
+
+/* Offlined page list, protected by heap_lock */
+PAGE_LIST_HEAD(page_offlined_list);
+
+/* Broken page list, protected by heap_lock */
+PAGE_LIST_HEAD(page_broken_list);
 
 /*********************
  * ALLOCATION BITMAP
@@ -421,19 +431,93 @@ static struct page_info *alloc_heap_page
     return pg;
 }
 
+static inline int is_page_allocated(struct page_info *pg)
+{
+    return allocated_in_map(page_to_mfn(pg)) && 
+            !(pg->count_info & PGC_offlined);
+}
+
+/* Add the pages into heap[][][], and merge chunks as far as possible */
+static void merge_heap_pages(unsigned int zone, struct page_info *pg,
+                             unsigned int order, int prev, int next)
+{
+    unsigned long mask;
+    unsigned int node = phys_to_nid(page_to_maddr(pg));
+
+    while ( order < MAX_ORDER )
+    {
+        mask = 1UL << order;
+
+        if ( page_to_mfn(pg) & mask )
+        {
+            /* Merge with predecessor block? */
+            if ( allocated_in_map(page_to_mfn(pg)-mask) ||
+                 (PFN_ORDER(pg-mask) != order) )
+                break;
+            if (prev)
+            {
+                pg -= mask;
+                page_list_del(pg, &heap(node, zone, order));
+            } else
+                break;
+        }
+        else
+        {
+            /* Merge with successor block? */
+            if ( allocated_in_map(page_to_mfn(pg)+mask) ||
+                 (PFN_ORDER(pg+mask) != order) )
+                break;
+            if (next)
+                page_list_del(pg+mask, &heap(node, zone, order));
+            else
+                break;
+        }
+
+        order++;
+
+        /* After merging, pg should remain in the same node. */
+        ASSERT(phys_to_nid(page_to_maddr(pg)) == node);
+    }
+
+    PFN_ORDER(pg) = order;
+    page_list_add_tail(pg, &heap(node, zone, order));
+}
+
+/* Mark 2^@order set of pages freed in heap[][][], avail[], and bitmap.
+ * This assumes all pages are from the same zone
+ */
+static void recycle_heap_pages(
+    unsigned int zone, struct page_info *start, struct page_info *end)
+{
+    unsigned int node = phys_to_nid(page_to_maddr(start));
+    struct page_info *pg = start;
+
+    ASSERT(zone < NR_ZONES);
+    ASSERT(node >= 0);
+    ASSERT(node < num_online_nodes());
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    /* XXX enhance for performance */
+    for ( ; pg <= end; pg++ )
+    {
+        merge_heap_pages(zone, pg, 0, 1, 1);
+        map_free(page_to_mfn(pg), 1);
+    }
+    avail[node][zone] += page_to_mfn(end) - page_to_mfn(start) + 1;
+}
+
 /* Free 2^@order set of pages. */
 static void free_heap_pages(
     struct page_info *pg, unsigned int order)
 {
-    unsigned long mask;
-    unsigned int i, node = phys_to_nid(page_to_maddr(pg));
+    unsigned long count_info;
+    unsigned int i, nr_pages;
     unsigned int zone = page_to_zone(pg);
-
-    ASSERT(order <= MAX_ORDER);
-    ASSERT(node >= 0);
-    ASSERT(node < num_online_nodes());
-
-    for ( i = 0; i < (1 << order); i++ )
+    int offlined = 0, clean = 1;
+
+    spin_lock(&heap_lock);
+    for ( i = 0, nr_pages = 0; i < (1 << order); i++, nr_pages++)
     {
         /*
          * Cannot assume that count_info == 0, as there are some corner cases
@@ -446,52 +530,543 @@ static void free_heap_pages(
          *     in its pseudophysical address space).
          * In all the above cases there can be no guest mappings of this page.
          */
+        count_info = pg[i].count_info;
         pg[i].count_info = 0;
+
+        if ( count_info & PGC_offlining )
+            pg[i].count_info |= PGC_offlining;
+        if ( count_info & PGC_broken )
+            pg[i].count_info |= PGC_broken;
 
         /* If a page has no owner it will need no safety TLB flush. */
         pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL);
         if ( pg[i].u.free.need_tlbflush )
             pg[i].tlbflush_timestamp = tlbflush_current_time();
-    }
+        ASSERT(is_page_allocated(&pg[i]));
+
+        /* If the page is in "offline pending and broken", then set it to be
+         * "offlined and broken" and put it to the broken list; if the page is
+         * in "offline pending", then set it to be "offline" and put it to
+         * the offline list; otherwise, free it and put it to heap[][][]
+         */
+        if ( is_page_broken(&pg[i]) )
+        {
+            page_list_add_tail(&pg[i], &page_broken_list);
+            pg->count_info |= PGC_offlined;
+            offlined = 1;
+            clean = 0;
+        }
+        else if ( is_page_offlining(&pg[i]) )
+        {
+            page_list_add_tail(&pg[i], &page_offlined_list);
+            pg->count_info |= PGC_offlined;
+            pg->count_info &= ~PGC_offlining;
+            offlined = 1;
+            clean = 0;
+        }
+
+        if ( unlikely(offlined) )
+        {
+            offlined = 0;
+            /* recycle those freed pages except offlined pages */
+            if ( nr_pages > 0 )
+            {
+                recycle_heap_pages(zone, pg + i - nr_pages, pg + i - 1);
+                nr_pages = 0;
+            }
+        }
+    }
+
+    if (clean)
+    {
+        unsigned int node = phys_to_nid(page_to_maddr(pg));
+        map_free(page_to_mfn(pg), 1UL << order);
+        avail[node][zone] += (1UL << order);
+        merge_heap_pages(zone, pg, order, 1, 1);
+    } else if ( nr_pages > 0 )
+    /* handle the rest */
+    {
+        recycle_heap_pages(zone, pg + i - nr_pages, pg + i - 1);
+    }
+
+    spin_unlock(&heap_lock);
+}
+
+/*
+ * Reserve pages that is in the same order list in the buddy system
+ * head: the head page in the buddy contains the range
+ */
+int reserve_heap_pages_order(struct page_info *head,
+                                     unsigned long start_mfn,
+                                     unsigned long end_mfn,
+                                     int order)
+{
+    unsigned int node = phys_to_nid(page_to_maddr(head));
+    int zone = page_to_zone(head), cur_order;
+    struct page_info *start, *end, *cur_head, *cur_end;
+
+
+    ASSERT(order <= MAX_ORDER);
+    ASSERT(PFN_ORDER(head) == order);
+
+    start = mfn_to_page(start_mfn);
+    end = mfn_to_page(end_mfn);
+    if (end >= (head + (1UL << order)))
+        return -EINVAL;
+
+    /* sanity checking */
+    if ( (end < start) || (start < head) || (end >= (head + (1UL << order))) )
+        return -EINVAL;
+
+    page_list_del(head, &heap(node, zone, order));
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    cur_head = head;
+
+    cur_order = PFN_ORDER(head);
+
+    while (cur_head < start)
+    {
+       while ( (cur_order >= 0) && (cur_head + (1UL << cur_order ) > start) )
+           cur_order--; 
+
+       if (cur_head + (1UL << cur_order ) <= start)
+       {
+           merge_heap_pages(zone, cur_head, cur_order, 1, 0); 
+           cur_head += (1UL << cur_order);
+       }
+    }
+
+    cur_end = head + (1UL << order) - 1;
+    cur_order = order;
+
+    while (cur_end > end)
+    {
+        while ( (cur_order >= 0) && (cur_end - (1UL << cur_order) < end))
+            cur_order --;
+
+       if ((cur_end - (1UL << cur_order) >= end))
+       {
+           merge_heap_pages(zone, cur_end - (1UL << cur_order) + 1, cur_order, 0, 1);
+           cur_end -= (1UL << cur_order);
+       }
+    }
+
+    avail[node][zone] -= (end_mfn - start_mfn);
+
+    return 0;
+}
+
+/*
+ * Reserve pages that is in the same zone
+ */
+int reserve_heap_pages_zone(unsigned long start_mfn,
+                                   unsigned long end_mfn)
+{
+    int node = phys_to_nid(pfn_to_paddr(start_mfn));
+    int zone = page_to_zone(mfn_to_page(start_mfn)), ret = 0;
+
+    unsigned long cur_start, cur_end;
+    int i;
+
+    ASSERT(spin_is_locked(&heap_lock));
+    ASSERT( page_to_zone(mfn_to_page(start_mfn)) ==
+            page_to_zone(mfn_to_page((end_mfn))) );
+
+    if (end_mfn < start_mfn)
+        return 0;
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    cur_start = cur_end = start_mfn;
+
+    for ( i = 0; i <= MAX_ORDER; i++ )
+    {
+        struct page_info *head, *tmp;
+        unsigned int heap_mask;
+
+        if ( page_list_empty(&heap(node, zone, i)) )
+            continue;
+
+        if (cur_start > end_mfn)
+            break;
+
+        heap_mask = 1UL << i;
+        page_list_for_each_safe(head, tmp, &heap(node, zone, i))
+        {
+            if ( (head <= mfn_to_page(cur_start)) &&
+              ( (head + (1UL << i)) > mfn_to_page(cur_start)))
+            {
+                cur_end = min(page_to_mfn(head ) + (1UL << i) - 1, end_mfn);
+
+                ret = reserve_heap_pages_order(head, cur_start, cur_end, i);
+                cur_start = cur_end + 1;
+                if (ret)
+                {
+                    dprintk(XENLOG_ERR, "fail to reserve page %lx to %lx\n",
+                      cur_start, cur_end);
+                    return ret;
+                }
+                break;
+            }
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Reserve pages that is in the same node
+ */
+int reserve_heap_pages_node(
+    unsigned long start_mfn, unsigned long end_mfn)
+{
+    unsigned long cur_start, cur_end;
+    int ret = 0;
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    ASSERT(spin_is_locked(&heap_lock));
+    ASSERT(phys_to_nid(pfn_to_paddr(start_mfn)) ==
+           phys_to_nid(pfn_to_paddr(end_mfn)) );
+
+    cur_start = cur_end = start_mfn;
+    while (cur_start <= end_mfn)
+    {
+        while ( (cur_end < end_mfn) &&
+                ( page_to_zone(mfn_to_page(cur_end + 1)) == 
+                  page_to_zone(mfn_to_page(cur_start)) ) ) 
+                cur_end ++;
+        ret = reserve_heap_pages_zone(cur_start, cur_end);
+
+        if (ret)
+        {
+            dprintk(XENLOG_ERR, "fail to reserve page %lx %lx\n", 
+                                cur_start, cur_end);
+            break;
+        }
+        cur_start = cur_end + 1;
+    }
+
+    return ret;
+}
+
+/*
+ * reserve page from buddy system
+ */
+int reserve_heap_pages(
+    unsigned long start_mfn, unsigned long end_mfn, int broken)
+{
+    unsigned int i;
+    unsigned long cur_start, cur_end;
+    int ret = 0;
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    /* sanity checking */
+    for (cur_start = start_mfn; cur_start <= end_mfn; cur_start++)
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(cur_start);
+        if (allocated_in_map(cur_start) && !(pg->count_info & PGC_offlined) )
+        {
+            dprintk(XENLOG_WARNING,
+              "pg %lx is not free, can't reserve\n", cur_start);
+            return -EINVAL;
+        }
+    }
+
+    map_alloc(start_mfn, end_mfn - start_mfn + 1);
+
+    cur_start = cur_end = start_mfn;
+    while (cur_start <= end_mfn)
+    {
+        while ( (cur_end < end_mfn ) &&
+                (phys_to_nid(pfn_to_paddr(cur_end + 1))
+                    == phys_to_nid(pfn_to_paddr(cur_start))) )
+                    cur_end ++;
+        ret = reserve_heap_pages_node(cur_start, cur_end);
+
+        if (ret)
+        {
+            dprintk(XENLOG_ERR, "fail to reserve page %lx %lx\n", 
+                                cur_start, cur_end);
+            break;
+        }
+        cur_start = cur_end + 1;
+    }
+
+    if (cur_start <= end_mfn)
+        return ret;
+
+    for ( i = start_mfn; i <= end_mfn; i++ )
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(i);
+        pg->count_info |= PGC_offlined;
+        if (broken)
+            page_list_add_tail(pg, &page_broken_list);
+        else
+            page_list_add_tail(pg, &page_offlined_list);
+    }
+
+    return ret;
+}
+
+void online_heap_page(struct page_info *pg)
+{
+    unsigned int zone;
+
+    if ( is_xen_heap_page(pg) )
+        zone = MEMZONE_XEN;
+    else
+        zone = page_to_zone(pg);
+
+    /* Cannot online broken page or assigned page */
+    ASSERT(!is_page_broken(pg) && !is_page_allocated(pg));
+
+    pg->count_info &= ~(PGC_offlining | PGC_offlined);
+    page_list_del(pg, &page_offlined_list);
+
+    recycle_heap_pages(zone, pg, pg);
+}
+
+/*
+ * Offline the memory, 0 if succeed
+ */
+unsigned int do_offline_pages(unsigned long start_pfn, unsigned long end_pfn,
+                              uint32_t *status, int broken)
+{
+    unsigned long mfn = start_pfn;
+    struct domain *owner;
+    int i = 0, ret = 0;
+    unsigned long * updated;
+
+    if ( start_pfn > end_pfn )
+        return 0;
+
+    if (end_pfn > max_page)
+    {
+        dprintk(XENLOG_WARNING,
+                "try to offline page out of range %lx\n", end_pfn);
+        return -EINVAL;
+    }
+    updated = xmalloc_bytes( BITS_TO_LONGS(end_pfn - start_pfn) * sizeof(long));
+
+    if (!updated)
+        return -ENOMEM;
 
     spin_lock(&heap_lock);
 
-    map_free(page_to_mfn(pg), 1 << order);
-    avail[node][zone] += 1 << order;
-
-    /* Merge chunks as far as possible. */
-    while ( order < MAX_ORDER )
-    {
-        mask = 1UL << order;
-
-        if ( (page_to_mfn(pg) & mask) )
-        {
-            /* Merge with predecessor block? */
-            if ( allocated_in_map(page_to_mfn(pg)-mask) ||
-                 (PFN_ORDER(pg-mask) != order) )
+    while ( mfn <= end_pfn )
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(mfn);
+        /* init the result value */
+        status[i] = 0;
+
+#if defined(__i386__)
+        if ( is_xen_heap_mfn(mfn) )
+        {
+            status[i] |= PG_OFFLINE_XENPAGE | PG_OFFLINE_FAILED;
+            status[i] |= (DOMID_XEN << PG_OFFLINE_OWNER_SHIFT);
+            ret = -EPERM;
+            break;
+        }
+        else
+#endif
+        if ( is_page_allocated(pg) && !is_page_offlined(pg) )
+        {
+            owner = page_get_owner(pg);
+
+            if (!owner)
+            { 
+                /* anonymous page, shadow page, or Xen heap page for x86_64 */
+#if !defined(__i386__)
+                if ( is_xen_heap_mfn(mfn) )
+                    status[i] |= ( (DOMID_XEN << PG_OFFLINE_OWNER_SHIFT) |
+                      PG_OFFLINE_XENPAGE );
+                else
+#endif
+                    status[i] |= ( PG_OFFLINE_ANONYMOUS |
+                      (DOMID_INVALID) << PG_OFFLINE_OWNER_SHIFT );
+
+                status[i] |= PG_OFFLINE_FAILED;
+                ret = -EPERM;
                 break;
-            pg -= mask;
-            page_list_del(pg, &heap(node, zone, order));
-        }
-        else
-        {
-            /* Merge with successor block? */
-            if ( allocated_in_map(page_to_mfn(pg)+mask) ||
-                 (PFN_ORDER(pg+mask) != order) )
+            }
+            else if ( owner == dom0 )
+            {
+                status[i] |= PG_OFFLINE_DOM0PAGE | PG_OFFLINE_FAILED;
+                ret = -EPERM;
                 break;
-            page_list_del(pg + mask, &heap(node, zone, order));
-        }
-        
-        order++;
-
-        /* After merging, pg should remain in the same node. */
-        ASSERT(phys_to_nid(page_to_maddr(pg)) == node);
-    }
-
-    PFN_ORDER(pg) = order;
-    page_list_add_tail(pg, &heap(node, zone, order));
-
+            }
+            /* Set the bit only */
+            else 
+            {
+                status[i] |= PG_OFFLINE_OWNED | PG_OFFLINE_PENDING;
+                status[i] |= (owner->domain_id << PG_OFFLINE_OWNER_SHIFT);
+                if (!(pg->count_info & PGC_offlining))
+                {
+                    updated[(mfn - start_pfn)/PAGES_PER_MAPWORD] |=
+                      (1UL << ((mfn - start_pfn) & (PAGES_PER_MAPWORD -1 )));
+                    pg->count_info |= PGC_offlining;
+                }
+            }
+        }
+        else if (is_page_offlined(pg))
+            status[i] |= PG_OFFLINE_OFFLINED;
+        else {
+            unsigned long last_mfn = mfn;
+            int j;
+            struct page_info *last_pg = mfn_to_page(mfn);
+
+            if (!(pg->count_info & PGC_offlined))
+            {
+                /* Free pages */
+
+                updated[(mfn - start_pfn)/PAGES_PER_MAPWORD] |=
+                        (1UL << ((mfn - start_pfn) & (PAGES_PER_MAPWORD -1 )));
+                /* Try as much free pages as possible */
+                last_mfn = mfn + 1;
+                last_pg = mfn_to_page(last_mfn);
+                while ( (last_mfn <= end_pfn) && 
+                         !is_page_allocated(last_pg) &&
+                         !(last_pg->count_info & PGC_offlined) &&
+                         !is_xen_heap_page(last_pg) )
+                {
+                    last_mfn ++;
+                    last_pg = mfn_to_page(last_mfn);
+                }
+                reserve_heap_pages(mfn, last_mfn - 1, broken);
+            }
+
+            for (j = mfn; j < last_mfn; j++)
+                status[j - start_pfn] = PG_OFFLINE_OFFLINED;
+
+            i += (last_mfn - 1 - mfn);
+            mfn = last_mfn - 1;
+        }
+
+        i++;
+        mfn++;
+    }
     spin_unlock(&heap_lock);
+
+    /* revert if failed */
+    if (mfn <= end_pfn)
+    {
+        int i = 0;
+        struct page_info *revert;
+
+        for ( i = find_first_bit(updated,  end_pfn - start_pfn);
+              i < (end_pfn - start_pfn);
+              i = find_next_bit(updated, end_pfn - start_pfn, i+1) )
+        {
+            revert = mfn_to_page(start_pfn + i);
+            revert->count_info &= ~(PGC_offlining | PGC_offlined);
+
+            /* Put the offlined page back to buddy system */
+            if (revert->count_info & PGC_offlined)
+                online_heap_page(revert);
+        }
+    }
+
+    dprintk(XENLOG_INFO, "Offline pages %lx ~ %lx, last page is %lx",
+                          start_pfn, end_pfn, mfn);
+
+    xfree(updated);        
+    return ret;
+}
+
+/* Online the memory.
+ *   The caller should make sure end_pfn <= max_page,
+ *   if not, expand_pages() should be called prior to online_pages().
+ *   Succeed if it returns PG_ONLINE_SUCCESS.
+ *   Fail if it returns PG_ONLINE_FAILURE.
+ */ 
+unsigned int do_online_pages(unsigned long start_pfn,
+                             unsigned long end_pfn,
+                             uint32_t *status)
+{
+    unsigned long mfn;
+    int i;
+    struct page_info *pg;
+    int ret = 0;
+
+    if ( start_pfn >= end_pfn )
+        return 0;
+
+    if ( end_pfn > max_page )
+    {
+        dprintk(XENLOG_WARNING, "call expand_pages() first\n");
+        dprintk(XENLOG_WARNING, "memory onlining %lx to %lx failed\n", 
+                     start_pfn, end_pfn);
+        return -EINVAL;
+    }
+
+    for ( mfn = start_pfn, i = 0; mfn < end_pfn; mfn++, i++ )
+    {
+        pg = mfn_to_page(mfn);
+
+        if ( unlikely(is_page_broken(pg)) )
+        {
+            ret = -EINVAL;
+            status[i] |= PG_ONLINE_FAILED |PG_ONLINE_BROKEN;
+            break;
+        }
+        else if (pg->count_info & PGC_offlined)
+        {
+            pg->count_info &= ~PGC_offlined;
+            free_heap_pages(pg, 0);
+            status[i] |= PG_ONLINE_ONLINED;
+        }
+        else if (pg->count_info & PGC_offlining)
+        {
+            pg->count_info &= ~PGC_offlining;
+            status[i] |= PG_ONLINE_ONLINED;
+        }
+    }
+
+    return ret;
+}
+
+unsigned int do_kill_pages(unsigned long start_pfn,
+                           unsigned long end_pfn,
+                           uint32_t *status)
+{
+    unsigned int ret = 0, i;
+
+    if ( start_pfn > end_pfn )
+        return -EINVAL;
+
+    dbg_offpage("do_kill_page start_pfn %lx end_pfn %lx\n",
+                 start_pfn, end_pfn);
+
+    BUG_ON(end_pfn > max_page);
+
+    ret = do_offline_pages(start_pfn, end_pfn, status, 1);
+
+    if (ret)
+        return ret;
+
+    for ( i = start_pfn; i <= end_pfn; i++ )
+    {
+        struct page_info *pg = mfn_to_page(i);
+
+        pg->count_info |= PGC_broken;
+    }
+
+    return ret;
 }
 
 /*
diff -r fb35bb57bba6 xen/common/sysctl.c
--- a/xen/common/sysctl.c	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/common/sysctl.c	Sun Feb 08 18:25:37 2009 +0800
@@ -233,6 +233,34 @@ long do_sysctl(XEN_GUEST_HANDLE(xen_sysc
     }
     break;
 
+    case XEN_SYSCTL_page_offline:
+    {
+        uint32_t *status;
+
+        status = xmalloc_bytes( sizeof(uint32_t) *
+                                (op->u.page_offline.end -
+                                  op->u.page_offline.start + 1));
+        if (!status)
+        {
+            ret = -ENOMEM;
+            break;
+        }
+        ret = do_offline_pages(op->u.page_offline.start,
+                               op->u.page_offline.end,
+                               status, 0);
+
+        if (ret)
+            break;
+        if (copy_to_guest(op->u.page_offline.status, status,
+                          op->u.page_offline.end - op->u.page_offline.start + 1))
+        {   
+            ret = -EFAULT; 
+            break;
+        }
+        xfree(status);
+    }
+    break;
+
     default:
         ret = arch_do_sysctl(op, u_sysctl);
         break;
diff -r fb35bb57bba6 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/asm-x86/mm.h	Sun Feb 08 23:12:59 2009 +0800
@@ -198,8 +198,23 @@ struct page_info
  /* 3-bit PAT/PCD/PWT cache-attribute hint. */
 #define PGC_cacheattr_base PG_shift(6)
 #define PGC_cacheattr_mask PG_mask(7, 6)
+ /* Page is broken? */
+#define _PGC_broken         PG_shift(7)
+#define PGC_broken          PG_mask(1, 7)
+ /* Page is offline pending ? */
+#define _PGC_offlining      PG_shift(8)
+#define PGC_offlining       PG_mask(1, 8)
+ /* Page is offlined */
+#define _PGC_offlined       PG_shift(9)
+#define PGC_offlined        PG_mask(1, 9)
+
+#define is_page_offlining(page)          ((page)->count_info & PGC_offlining)
+#define is_page_offlined(page)          ((page)->count_info & PGC_offlined)
+#define is_page_broken(page)           ((page)->count_info & PGC_broken)
+#define is_page_online(page)           (!is_page_offlined(page))
+
  /* Count of references to this frame. */
-#define PGC_count_width   PG_shift(6)
+#define PGC_count_width   PG_shift(10)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
 #if defined(__i386__)
diff -r fb35bb57bba6 xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/public/sysctl.h	Sun Feb 08 18:25:37 2009 +0800
@@ -359,6 +359,39 @@ struct xen_sysctl_pm_op {
     };
 };
 
+#define XEN_SYSCTL_page_offline        14
+struct xen_sysctl_page_offline {
+    /* IN: range of page to be offlined */
+    uint32_t start;
+    uint32_t end;
+    /* OUT: result of page offline request */
+    /*
+     * bit 0~15: result flags
+     * bit 16~31: owner
+     */
+    XEN_GUEST_HANDLE(uint32) status;
+};
+
+#define PG_OFFLINE_STATUS_MASK    (0xFUL)
+#define PG_OFFLINE_OFFLINED  (0x1UL << 0)
+#define PG_OFFLINE_PENDING   (0x1UL << 1)
+#define PG_OFFLINE_FAILED    (0x1UL << 2)
+#define PG_ONLINE_FAILED     (0x1UL << 2)
+#define PG_ONLINE_ONLINED    (0x1UL << 0)
+
+#define PG_OFFLINE_MISC_MASK    (0xFUL << 4)
+/* only valid when PG_OFFLINE_FAILED */
+#define PG_OFFLINE_XENPAGE   (0x1UL << 4)
+#define PG_OFFLINE_DOM0PAGE  (0x1UL << 5)
+#define PG_OFFLINE_ANONYMOUS (0x1UL << 6)
+
+#define PG_ONLINE_BROKEN      (0x1UL << 4)
+
+#define PG_OFFLINE_OWNED     (0x1UL << 7)
+
+#define PG_OFFLINE_OWNER_SHIFT 16
+
+
 struct xen_sysctl {
     uint32_t cmd;
     uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
@@ -375,6 +408,7 @@ struct xen_sysctl {
         struct xen_sysctl_get_pmstat        get_pmstat;
         struct xen_sysctl_cpu_hotplug       cpu_hotplug;
         struct xen_sysctl_pm_op             pm_op;
+        struct xen_sysctl_page_offline      page_offline;
         uint8_t                             pad[128];
     } u;
 };
diff -r fb35bb57bba6 xen/include/public/xen.h
--- a/xen/include/public/xen.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/public/xen.h	Sun Feb 08 18:25:37 2009 +0800
@@ -354,6 +354,9 @@ typedef uint16_t domid_t;
  */
 #define DOMID_XEN  (0x7FF2U)
 
+/* DOMID_INVALID is used to identify an invalid domid */
+#define DOMID_INVALID (0x7FFFU)
+
 /*
  * Send an array of these to HYPERVISOR_mmu_update().
  * NB. The fields are natural pointer/address size for this architecture.
diff -r fb35bb57bba6 xen/include/xen/mm.h
--- a/xen/include/xen/mm.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/xen/mm.h	Sun Feb 08 18:25:37 2009 +0800
@@ -47,6 +47,8 @@ void init_xenheap_pages(paddr_t ps, padd
 void init_xenheap_pages(paddr_t ps, paddr_t pe);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
+unsigned int do_offline_pages(unsigned long start_pfn, unsigned long end_pfn, uint32_t *status, int broken);
+unsigned int do_online_pages(unsigned long start_pfn, unsigned long end_pfn, uint32_t *status);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
