All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] Basic support for page offline
@ 2009-02-09  8:54 Jiang, Yunhong
  2009-02-10  9:15 ` Tim Deegan
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-09  8:54 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

Hi, Tim, this patchset try to support page offline request. I want to get some initial feedback before more testing.

Page offline can be used by multiple usage model, belows are some examples:
a) If too many correctable error happen to one page, management tools may try to offline the page to avoid more server error in future;
b) When page is ECC error and can't be recoverd by hardware, Xen's MCA handler may try to offline the page, so that it will not be accessed anymore.
c) Offline some DIMM for power management etc (Of course, this is far more than simple page offline)

The basic idea to offline a page is:
1) If a page is free, it will be removed from page allocator
2) If page is in use, the owner will be checked
  2.1) if it is owned by xen/dom0, the offline will be failed
  2.2) If it is owned by a PV guest with no device assigned, user space tools will try to replace the page with new one.
  2.3) It it is owned by a HVM guest with no device assigned, user space tools will try to live migration it.
  2.4) If it is owned by a guest with device assigned, user space tools can do live migration if needed.

This patchset includes support for type 2.1/2.2. 

page_offfline_xen.patch gives basic support. The new hypercall (XEN_SYSCTL_page_offline) will mark a page offlining if the page is in-use, otherwise, it will remove the page from the page allocator. It also changes the free_heap_pages(), so that if a page_offlining page is freed, that page will be marked as page_offlined and will not be allocated anymore. One tricky thing is, the offlined page may not be buddy-aligned (i.e., it may be in the middle of a 2^order pages), so that we have to re-arrange the buddy system (i.e. &heap[][][]) carefully.

page_offline_xen_memory.patch add support to PV guest, a new hypercall (XENMEM_page_offline) try to replace the old page with the new one. This will happen only when the guest has been suspeneded, to avoid complex page sharing situation. I'm still checking if more situation need be considered, like LDT pages and CR3 pages, so any suggestion is really great help.

page_offline_tools.patch is an example user space tools based on libxc/xc_domain_save.c, it will try to firstly mark a page offline, and checking the result. If a page is owned by a PV guest, it will try to replace the pages.

I did some basic testing, tried free pages and PV guest pages and is ok. Of course, I need more test on it. And more robust error handling is needed.

Any suggestion is welcome.

Thanks
Yunhong Jiang

[-- Attachment #2: page_offline_xen_memory.patch --]
[-- Type: application/octet-stream, Size: 8494 bytes --]

memory exchange for PV domain

diff -r b736475df064 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/arch/x86/mm.c	Mon Feb 09 01:39:03 2009 +0800
@@ -4812,6 +4812,122 @@ void memguard_guard_stack(void *p)
     memguard_guard_range(p, PAGE_SIZE);
 }
 
+int update_pgtable_entry(struct domain *d, xen_pfn_t in_frame,
+                         xen_pfn_t out_frame, int ref)
+{
+    struct page_info *page;
+    int changed = 0;
+    struct page_info *pg = mfn_to_page(in_frame);
+
+    page_list_for_each ( page, &d->page_list )
+    {
+        unsigned long scan_type;
+
+        scan_type = page->u.inuse.type_info & PGT_type_mask;
+        switch (scan_type)
+        {
+#define REPLACE_L(x)    \
+    case  PGT_l##x##_page_table:    \
+    {   \
+        int i; \
+        unsigned long flags, mfn;    \
+        l##x##_pgentry_t *entry;     \
+        entry = map_domain_page(page_to_mfn(page)); \
+        for (i = 0; i < L##x##_PAGETABLE_ENTRIES; i++)  \
+        {                                                   \
+            mfn = l##x##e_get_pfn(entry[i]);    \
+            flags = l##x##e_get_flags(entry[i]);    \
+            if (mfn == in_frame)            \
+            {\
+                entry[i] = l##x##e_from_pfn(out_frame, flags);\
+                if (ref)    \
+                    put_page_and_type(pg);  \
+                changed = 1;    \
+                printk("update one entry here\n");  \
+            }\
+        }   \
+        unmap_domain_page(entry);   \
+        break;  \
+    }
+        REPLACE_L(1)
+        REPLACE_L(2)
+        REPLACE_L(3)
+        REPLACE_L(4)
+
+        default:
+        break;
+        }
+    }
+
+    return changed;
+}
+
+int replace_page(struct domain *d, xen_pfn_t in_frame,
+                 xen_pfn_t *out_frame, unsigned int memflags)
+{
+    xen_pfn_t out_mfn;
+    unsigned long type_info, count_info;
+    struct page_info *pg = mfn_to_page(in_frame), *out;
+
+    if (d != page_get_owner(pg))
+    {
+        dprintk(XENLOG_ERR, "replace page %lx not owned by domain %x\n",
+                 in_frame, d->domain_id);
+        return -EINVAL;
+    }
+
+    out = alloc_domheap_page(NULL, memflags);
+    if (!out)
+    {
+        put_page(mfn_to_page(in_frame));
+        return -ENOMEM;
+    }
+    out_mfn = page_to_mfn(out);
+
+    spin_lock(&d->page_alloc_lock);
+
+    /* Copy the page_info to keep all count/typecount info */
+    type_info = pg->u.inuse.type_info & PGT_type_mask;
+    count_info = pg->count_info;
+
+    /* get page temp to avoid page be freed in this process */
+    if ( unlikely(!get_page(mfn_to_page(in_frame), d)) )
+    {
+        dprintk(XENLOG_INFO, "Fail to get in_frame %lx when replace page\n",
+                             in_frame);
+        return -EINVAL;
+    }
+
+    update_pgtable_entry(d, in_frame, out_mfn, 1);
+    if ( (pg->count_info & PGC_count_mask) != 2 )
+    {
+        dprintk(XENLOG_WARNING, "page is granted to others? count_info %lx\n", 
+                pg->count_info);
+        dprintk(XENLOG_WARNING, "type info %lx\n", pg->u.inuse.type_info);
+        update_pgtable_entry(d, out_mfn, in_frame, 0);
+        free_domheap_page(out);
+        put_page(mfn_to_page(in_frame));
+        spin_unlock(&d->page_alloc_lock);
+        return -EINVAL;
+    }
+
+    guest_remove_page(d, in_frame);
+
+    out->count_info = count_info;
+    out->u.inuse.type_info = type_info;
+    page_set_owner(out, d);
+    wmb(); /* Domain pointer must be visible before updating refcnt. */
+    page_list_add_tail(out, &d->page_list);
+
+    spin_unlock(&d->page_alloc_lock);
+
+    put_page(mfn_to_page(in_frame));
+
+    *out_frame = out_mfn;
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff -r b736475df064 xen/common/memory.c
--- a/xen/common/memory.c	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/common/memory.c	Mon Feb 09 01:34:38 2009 +0800
@@ -22,6 +22,7 @@
 #include <asm/current.h>
 #include <asm/hardirq.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include <public/memory.h>
 #include <xsm/xsm.h>
 
@@ -214,6 +215,91 @@ static void decrease_reservation(struct 
  out:
     a->nr_done = i;
 }
+
+static long memory_page_offline(XEN_GUEST_HANDLE(xen_memory_page_offline_t) arg)
+{
+    struct xen_memory_page_offline offline;
+    unsigned long i;
+    long          rc = 0;
+    struct domain *d;
+    struct page_info *page;
+
+    if ( copy_from_guest(&offline, arg, 1) )
+        return -EFAULT;
+
+    /* only privileged domain can ask for this */
+    if (!IS_PRIV(current->domain))
+        return -EPERM;
+
+    d = get_domain_by_id(offline.domid);
+
+    if (!d)
+        return -EINVAL;
+    /* Domain must be shutdown in advance */
+    if (!(d->is_shut_down && (d->shutdown_code == SHUTDOWN_suspend)))
+    {
+        dprintk(XENLOG_WARNING, "Try to offline page for running domain \n");
+        put_domain(d);
+        return -EINVAL;
+    }
+
+    for ( i = offline.nr_offlined;
+          i < offline.num;
+          i++ )
+    {
+        unsigned int  node, mem_flags = 0;
+        xen_pfn_t in_frame, out_frame;
+
+        if ( hypercall_preempt_check() )
+        {
+            offline.nr_offlined = i;
+            if ( copy_field_to_guest(arg, &offline, nr_offlined) )
+                return -EFAULT;
+            return hypercall_create_continuation(
+                __HYPERVISOR_memory_op, "lh", XENMEM_page_offline, arg);
+        }
+
+        if (unlikely(__copy_from_guest_offset(
+                  &mem_flags, offline.mem_flags, i, 1)))
+        {
+            rc = -EFAULT;
+            break;
+        }
+
+        if (unlikely(__copy_from_guest_offset(
+                  &in_frame, offline.start_mfn, i, 1)))
+        {
+            rc = -EFAULT;
+            break;
+        }
+
+        page = mfn_to_page(in_frame);
+        node = XENMEMF_get_node(mem_flags);
+        mem_flags |= MEMF_bits(domain_clamp_alloc_bitsize(
+                              d,
+                              XENMEMF_get_address_bits(mem_flags) ? :
+                              (BITS_PER_LONG+PAGE_SHIFT)));
+        if ( node == NUMA_NO_NODE )
+            node = domain_to_node(d);
+        mem_flags |= MEMF_node(node);
+
+        rc = replace_page(d, in_frame,&out_frame, mem_flags);
+
+        /*
+         * No need for rollback since the replace is harmless
+         */
+        if (rc)
+            break;
+
+        __copy_to_guest_offset(
+          offline.out, i, &out_frame, 1);
+    }
+
+    offline.nr_offlined = i;
+    if ( copy_field_to_guest(arg, &offline, nr_offlined) )
+        rc = -EFAULT;
+    return rc;
+} 
 
 static long memory_exchange(XEN_GUEST_HANDLE(xen_memory_exchange_t) arg)
 {
@@ -513,6 +599,10 @@ long do_memory_op(unsigned long cmd, XEN
         rc = memory_exchange(guest_handle_cast(arg, xen_memory_exchange_t));
         break;
 
+    case XENMEM_page_offline:
+        rc = memory_page_offline(guest_handle_cast(arg, xen_memory_page_offline_t));
+        break;
+
     case XENMEM_maximum_ram_page:
         rc = max_page;
         break;
diff -r b736475df064 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/include/asm-x86/mm.h	Sun Feb 08 23:20:19 2009 +0800
@@ -492,6 +492,8 @@ unsigned int domain_clamp_alloc_bitsize(
 # define domain_clamp_alloc_bitsize(d, b) (b)
 #endif
 
+int replace_page(struct domain *d, xen_pfn_t in_frame, xen_pfn_t *out_frame, unsigned int mem_flags);
+
 unsigned long domain_get_maximum_gpfn(struct domain *d);
 
 extern struct domain *dom_xen, *dom_io;	/* for vmcoreinfo */
diff -r b736475df064 xen/include/public/memory.h
--- a/xen/include/public/memory.h	Sun Feb 08 23:20:16 2009 +0800
+++ b/xen/include/public/memory.h	Sun Feb 08 23:20:19 2009 +0800
@@ -129,6 +129,21 @@ typedef struct xen_memory_exchange xen_m
 typedef struct xen_memory_exchange xen_memory_exchange_t;
 DEFINE_XEN_GUEST_HANDLE(xen_memory_exchange_t);
 
+#define XENMEM_page_offline         18
+struct xen_memory_page_offline {
+    uint32_t num;
+    domid_t  domid;
+
+    XEN_GUEST_HANDLE(xen_pfn_t) start_mfn;
+    XEN_GUEST_HANDLE(uint32)  mem_flags;
+
+    XEN_GUEST_HANDLE(xen_pfn_t) out;
+
+    xen_ulong_t nr_offlined;
+};
+typedef struct xen_memory_page_offline xen_memory_page_offline_t;
+DEFINE_XEN_GUEST_HANDLE(xen_memory_page_offline_t);
+
 /*
  * Returns the maximum machine frame number of mapped RAM in this system.
  * This command always succeeds (it never returns an error code).

[-- Attachment #3: page_offline_tools.patch --]
[-- Type: application/octet-stream, Size: 20534 bytes --]

the tools changes

diff -r f1756e5c1203 tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/libxc/xc_misc.c	Sun Feb 08 23:20:21 2009 +0800
@@ -80,6 +80,23 @@ int xc_physinfo(int xc_handle,
     return 0;
 }
 
+int xc_mark_pages_offline(int xc_handle,
+                          int start, int end,
+                          uint32_t *status)
+{
+    int ret;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_page_offline;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.end = end;
+
+    ret = do_sysctl(xc_handle, &sysctl);
+    
+    return ret;
+ }
+ 
 int xc_sched_id(int xc_handle,
                 int *sched_id)
 {
diff -r f1756e5c1203 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/libxc/xenctrl.h	Sun Feb 08 23:20:21 2009 +0800
@@ -608,6 +608,10 @@ int xc_physinfo(int xc_handle,
 int xc_physinfo(int xc_handle,
                 xc_physinfo_t *info);
 
+int xc_mark_pages_offline(int xc_handle,
+                          int start, int end,
+                          uint32_t *status);
+
 int xc_sched_id(int xc_handle,
                 int *sched_id);
 
diff -r f1756e5c1203 tools/xcutils/Makefile
--- a/tools/xcutils/Makefile	Sun Feb 08 23:20:19 2009 +0800
+++ b/tools/xcutils/Makefile	Sun Feb 08 23:20:21 2009 +0800
@@ -14,7 +14,7 @@ CFLAGS += -Werror
 CFLAGS += -Werror
 CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore)
 
-PROGRAMS = xc_restore xc_save readnotes lsevtchn
+PROGRAMS = xc_restore xc_save readnotes lsevtchn xc_page
 
 LDLIBS   = $(LDFLAGS_libxenctrl) $(LDFLAGS_libxenguest) $(LDFLAGS_libxenstore)
 
diff -r f1756e5c1203 tools/xcutils/xc_page.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/xcutils/xc_page.c	Mon Feb 09 01:36:33 2009 +0800
@@ -0,0 +1,698 @@
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+
+#include <xs.h>
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+#define GET_FIELD(_p, _f) ((guest_width==8) ? ((_p)->x64._f) : ((_p)->x32._f))
+#define M2P_SHIFT       L2_PAGETABLE_SHIFT_PAE
+#define M2P_CHUNK_SIZE  (1 << M2P_SHIFT)
+#define M2P_SIZE(_m)    ROUNDUP(((_m) * sizeof(xen_pfn_t)), M2P_SHIFT)
+#define M2P_CHUNKS(_m)  (M2P_SIZE((_m)) >> M2P_SHIFT)
+
+inline int is_hvm_domain(xc_dominfo_t *info)
+{
+    return info->hvm;
+}
+
+xen_pfn_t *live_m2p = NULL;
+#define mfn_to_pfn(_mfn)  (live_m2p[(_mfn)])
+
+static xen_pfn_t *xc_map_m2p(int xc_handle,
+                                 unsigned long max_mfn,
+                                 int prot)
+{
+    struct xen_machphys_mfn_list xmml;
+    privcmd_mmap_entry_t *entries;
+    unsigned long m2p_chunks, m2p_size;
+    xen_pfn_t *m2p;
+    xen_pfn_t *extent_start;
+    int i;
+
+    m2p = NULL;
+    m2p_size   = M2P_SIZE(max_mfn);
+    m2p_chunks = M2P_CHUNKS(max_mfn);
+
+    xmml.max_extents = m2p_chunks;
+
+    extent_start = calloc(m2p_chunks, sizeof(xen_pfn_t));
+    if ( !extent_start )
+    {
+        ERROR("failed to allocate space for m2p mfns");
+        goto err0;
+    }
+    set_xen_guest_handle(xmml.extent_start, extent_start);
+
+    if ( xc_memory_op(xc_handle, XENMEM_machphys_mfn_list, &xmml) ||
+         (xmml.nr_extents != m2p_chunks) )
+    {
+        ERROR("xc_get_m2p_mfns");
+        goto err1;
+    }
+
+    entries = calloc(m2p_chunks, sizeof(privcmd_mmap_entry_t));
+    if (entries == NULL)
+    {
+        ERROR("failed to allocate space for mmap entries");
+        goto err1;
+    }
+
+    for ( i = 0; i < m2p_chunks; i++ )
+        entries[i].mfn = extent_start[i];
+
+    m2p = xc_map_foreign_ranges(xc_handle, DOMID_XEN,
+			m2p_size, prot, M2P_CHUNK_SIZE,
+			entries, m2p_chunks);
+    if (m2p == NULL)
+    {
+        ERROR("xc_mmap_foreign_ranges failed");
+        goto err2;
+    }
+
+err2:
+    free(entries);
+err1:
+    free(extent_start);
+
+err0:
+    return m2p;
+}
+
+static void *map_frame_list_list(int xc_handle, uint32_t dom,
+                                 shared_info_any_t *shinfo,
+                                 int guest_width)
+{
+    int count = 100;
+    void *p;
+    uint64_t fll = GET_FIELD(shinfo, arch.pfn_to_mfn_frame_list_list);
+
+    while ( count-- && (fll == 0) )
+    {
+        usleep(10000);
+        fll = GET_FIELD(shinfo, arch.pfn_to_mfn_frame_list_list);
+    }
+
+    if ( fll == 0 )
+    {
+        ERROR("Timed out waiting for frame list updated.");
+        return NULL;
+    }
+
+    p = xc_map_foreign_range(xc_handle, dom, PAGE_SIZE, PROT_READ, fll);
+    if ( p == NULL )
+        ERROR("Couldn't map p2m_frame_list_list (errno %d)", errno);
+
+    return p;
+}
+
+static void *map_p2m_table(int xc_handle, uint32_t domid, int guest_width)
+{
+    xc_dominfo_t info;
+    static unsigned long p2m_size;
+    void *live_p2m_frame_list_list = NULL;
+    void *live_p2m_frame_list = NULL;
+    /* Copies of the above. */
+    xen_pfn_t *p2m_frame_list_list = NULL;
+    xen_pfn_t *p2m_frame_list = NULL;
+    unsigned long shared_info_frame;
+    shared_info_any_t *live_shinfo = NULL;
+
+    /* The mapping of the live p2m table itself */
+    xen_pfn_t *p2m = NULL;
+    int i = 0;
+
+    /* Get the size of the P2M table */
+    p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &domid) + 1;
+
+
+    if ( xc_domain_getinfo(xc_handle, domid, 1, &info) != 1 )
+    {
+        fprintf(stderr, "Could not get domain info");
+        return NULL;
+    }
+    shared_info_frame = info.shared_info_frame;
+
+    live_shinfo = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                      PROT_READ, shared_info_frame);
+    if ( !live_shinfo )
+    {
+        fprintf(stderr, "Couldn't map live_shinfo");
+        goto out;
+    }
+
+    live_p2m_frame_list_list = map_frame_list_list(xc_handle, domid,
+                                                   live_shinfo, guest_width);
+
+    munmap(live_shinfo, PAGE_SIZE);
+    live_shinfo = NULL;
+
+    if ( !live_p2m_frame_list_list )
+    {
+        fprintf(stderr, "Could get live p2m_frame_list_list\n");
+        goto out;
+    }
+
+    /* Get a local copy of the live_P2M_frame_list_list */
+    if ( !(p2m_frame_list_list = malloc(PAGE_SIZE)) )
+    {
+        fprintf(stderr, "Couldn't allocate p2m_frame_list_list array");
+        goto out;
+    }
+    memcpy(p2m_frame_list_list, live_p2m_frame_list_list, PAGE_SIZE);
+    munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    live_p2m_frame_list_list = NULL;
+
+    /* Canonicalize guest's unsigned long vs ours */
+    if ( guest_width > sizeof(unsigned long) )
+        for ( i = 0; i < PAGE_SIZE/sizeof(unsigned long); i++ )
+            if ( i < PAGE_SIZE/guest_width )
+                p2m_frame_list_list[i] = ((uint64_t *)p2m_frame_list_list)[i];
+            else
+                p2m_frame_list_list[i] = 0;
+    else if ( guest_width < sizeof(unsigned long) )
+        for ( i = PAGE_SIZE/sizeof(unsigned long) - 1; i >= 0; i-- )
+            p2m_frame_list_list[i] = ((uint32_t *)p2m_frame_list_list)[i];
+
+    live_p2m_frame_list =
+        xc_map_foreign_batch(xc_handle, domid, PROT_READ,
+                             p2m_frame_list_list,
+                             P2M_FLL_ENTRIES);
+    if ( !live_p2m_frame_list )
+    {
+        fprintf(stderr, "Couldn't map p2m_frame_list");
+        goto out;
+    }
+    free(p2m_frame_list_list);
+    p2m_frame_list_list = NULL;
+
+    /* Get a local copy of the live_P2M_frame_list */
+    if ( !(p2m_frame_list = malloc(P2M_TOOLS_FL_SIZE)) )
+    {
+        ERROR("Couldn't allocate p2m_frame_list array");
+        goto out;
+    }
+
+    memset(p2m_frame_list, 0, P2M_TOOLS_FL_SIZE);
+    memcpy(p2m_frame_list, live_p2m_frame_list, P2M_GUEST_FL_SIZE);
+    munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    live_p2m_frame_list = NULL;
+
+    /* Canonicalize guest's unsigned long vs ours */
+    if ( guest_width > sizeof(unsigned long) )
+        for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+            p2m_frame_list[i] = ((uint64_t *)p2m_frame_list)[i];
+    else if ( guest_width < sizeof(unsigned long) )
+        for ( i = P2M_FL_ENTRIES - 1; i >= 0; i-- )
+            p2m_frame_list[i] = ((uint32_t *)p2m_frame_list)[i];
+
+
+    /* Map all the frames of the pfn->mfn table. For migrate to succeed,
+       the guest must not change which frames are used for this purpose.
+       (its not clear why it would want to change them, and we'll be OK
+       from a safety POV anyhow. */
+
+    p2m = xc_map_foreign_batch(xc_handle, domid, PROT_READ,
+                               p2m_frame_list,
+                               P2M_FL_ENTRIES);
+    if ( !p2m )
+    {
+        fprintf(stderr, "Couldn't map p2m table");
+        goto out;
+    }
+    free(p2m_frame_list);
+    p2m_frame_list = NULL;
+
+    return p2m;
+
+out:
+    if (live_p2m_frame_list_list)
+        munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    if (live_p2m_frame_list)
+        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    if (p2m_frame_list_list)
+        free(p2m_frame_list_list);
+    if (p2m_frame_list)
+        free(p2m_frame_list);
+    if (live_shinfo)
+        munmap(live_shinfo, PAGE_SIZE);
+
+    return NULL;
+}
+
+int update_p2m_table(void *p2m, xen_pfn_t *old, xen_pfn_t *new,
+                     int num, int guest_width)
+{
+    int i;
+
+    if (!p2m)
+        return -1;
+
+    for (i = 0; i < num; i++)
+    {
+        if (guest_width == 4)
+            ((unsigned int*)p2m)[mfn_to_pfn(old[i])] = new[i];
+        else
+            ((unsigned long *)p2m)[mfn_to_pfn(old[i])] = 
+                        (unsigned long)(new[i]);
+    }
+
+    return 0;
+}
+
+struct suspendinfo {
+    int xc_fd; /* libxc handle */
+    int xce; /* event channel handle */
+    int suspend_evtchn;
+    int domid;
+    unsigned int flags;
+};
+
+static int suspend_evtchn_release(struct suspendinfo *si)
+{
+    if (si->suspend_evtchn >= 0) {
+        xc_evtchn_unbind(si->xce, si->suspend_evtchn);
+        si->suspend_evtchn = -1;
+    }
+    if (si->xce >= 0) {
+        xc_evtchn_close(si->xce);
+        si->xce = -1;
+    }
+
+    return 0;
+}
+
+static int await_suspend(struct suspendinfo *si)
+{
+    int rc;
+
+    do {
+        rc = xc_evtchn_pending(si->xce);
+        if (rc < 0) {
+            ERROR("error polling suspend notification channel: %d", rc);
+            return -1;
+        }
+    } while (rc != si->suspend_evtchn);
+
+    /* harmless for one-off suspend */
+    if (xc_evtchn_unmask(si->xce, si->suspend_evtchn) < 0)
+        ERROR("failed to unmask suspend notification channel: %d", rc);
+
+    return 0;
+}
+
+static int suspend_evtchn_init(int xc, int domid, struct suspendinfo *si)
+{
+    struct xs_handle *xs;
+    char path[128];
+    char *portstr;
+    unsigned int plen;
+    int port;
+    int rc;
+
+    si->xce = -1;
+    si->suspend_evtchn = -1;
+
+    xs = xs_daemon_open();
+    if (!xs) {
+        ERROR("failed to get xenstore handle");
+        return -1;
+    }
+    sprintf(path, "/local/domain/%d/device/suspend/event-channel", domid);
+    portstr = xs_read(xs, XBT_NULL, path, &plen);
+    xs_daemon_close(xs);
+
+    if (!portstr || !plen) {
+        ERROR("could not read suspend event channel");
+        return -1;
+    }
+
+    port = atoi(portstr);
+    free(portstr);
+
+    si->xce = xc_evtchn_open();
+    if (si->xce < 0) {
+        ERROR("failed to open event channel handle");
+        goto cleanup;
+    }
+
+    si->suspend_evtchn = xc_evtchn_bind_interdomain(si->xce, domid, port);
+    if (si->suspend_evtchn < 0) {
+        ERROR("failed to bind suspend event channel: %d", si->suspend_evtchn);
+        goto cleanup;
+    }
+
+    rc = xc_domain_subscribe_for_suspend(xc, domid, port);
+    if (rc < 0) {
+        ERROR("failed to subscribe to domain: %d", rc);
+        goto cleanup;
+    }
+
+    /* event channel is pending immediately after binding */
+    await_suspend(si);
+
+    return 0;
+
+  cleanup:
+    suspend_evtchn_release(si);
+
+    return -1;
+}
+
+/**
+ * Issue a suspend request to a dedicated event channel in the guest, and
+ * receive the acknowledgement from the subscribe event channel.
+ */
+static int evtchn_suspend(struct suspendinfo *si)
+{
+    int rc;
+
+    rc = xc_evtchn_notify(si->xce, si->suspend_evtchn);
+    if (rc < 0) {
+        ERROR("failed to notify suspend request channel: %d", rc);
+        return 0;
+    }
+
+    if (await_suspend(si) < 0) {
+        ERROR("suspend failed");
+        return 0;
+    }
+
+    /* notify xend that it can do device migration */
+    printf("suspended\n");
+    fflush(stdout);
+
+    return 1;
+}
+
+/* More consideration here like CR3 etc */
+int _pages_offline(int xc_handle, int domid, xen_pfn_t *old_mfn, xen_pfn_t *new_mfn, int num, int *done )
+{
+    struct xen_memory_page_offline offline;
+    xen_pfn_t *in_frames, *out_frames;
+    uint32_t *mem_flags;
+    int i, err;
+
+    in_frames = malloc(num * sizeof(xen_pfn_t));
+    if (!in_frames)
+        return -ENOMEM;
+
+    out_frames = malloc(num * sizeof(xen_pfn_t));
+    if (!out_frames)
+    {
+        free(in_frames);
+        return -ENOMEM;
+    }
+
+    mem_flags = malloc(num * sizeof(uint32_t));
+
+    if (!mem_flags)
+    {
+        free(in_frames);
+        free(out_frames);
+        return -ENOMEM;
+    }
+    memset(mem_flags, 0, num);
+
+    for (i = 0; i < num; i++)
+        in_frames[i] = old_mfn[i];
+
+    offline.num = num;
+
+    offline.domid = domid;
+
+    offline.nr_offlined = 0;
+
+    set_xen_guest_handle(offline.start_mfn, in_frames);
+    set_xen_guest_handle(offline.out, out_frames);
+    set_xen_guest_handle(offline.mem_flags, mem_flags);
+
+    err = xc_memory_op(xc_handle, XENMEM_page_offline, &offline);
+
+    if (err)
+    {
+        ERROR("failed to get the memory exchange done \n");
+        return -1;
+    }
+    
+    for (i = 0; i < num; i++)
+        new_mfn[i] = out_frames[i];
+    *done = num;
+
+    free(in_frames);
+    free(out_frames);
+
+    return 0;
+}
+
+int domain_page_offline(int xc_handle, int domid,
+                        xen_pfn_t *mfn, int num,
+                        int *done)
+{
+    xc_dominfo_t info;
+    int ret = 0, guest_width;
+    struct suspendinfo si;
+    xen_pfn_t *new_mfn = NULL;
+    DECLARE_DOMCTL;
+    void *p2m = NULL;
+    unsigned long max_mfn = 0;
+
+    if (xc_domain_getinfo(xc_handle, domid, 1, &info) != 1)
+    {
+        fprintf(stderr, "Domain get info failed\n");
+        goto error;
+    }
+
+    if (is_hvm_domain(&info))
+    {
+        fprintf(stderr, "we need utilize live migration for hvm domain\n");
+        ret = -EINVAL;
+        goto error;
+    }
+
+    *done = 0;
+
+    new_mfn = malloc(num * sizeof(xen_pfn_t));
+
+    if (!new_mfn)
+        return -ENOMEM;
+
+    if ((ret = suspend_evtchn_init(xc_handle, domid, &si)))
+    {
+        fprintf(stderr, "suspend_evtchn init failed\n");
+        goto error;
+    }
+
+    evtchn_suspend(&si);
+
+    *done = 0;
+    /* We pass mfn to Xen HV, instead of gpfn */
+    ret = _pages_offline(xc_handle, domid, mfn, new_mfn, num, done);
+
+    if (ret)
+    {
+        fprintf(stderr, "%x page offline request failed with %x done\n",
+                         num, *done);
+        goto error;
+    }
+
+    fprintf(stderr, "now we have offline the pages\n");
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.domain = domid;
+    domctl.cmd = XEN_DOMCTL_get_address_size;
+    if ( do_domctl(xc_handle, &domctl) != 0 )
+    {
+        fprintf(stderr, "Could not get guest width\n");
+        goto error;
+    }
+
+    guest_width = domctl.u.address_size.size / 8;
+    p2m = map_p2m_table(xc_handle, domid, guest_width);
+
+    max_mfn = xc_memory_op(xc_handle, XENMEM_maximum_ram_page, NULL);
+
+    live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ);
+
+    if (!live_m2p)
+        goto error;
+
+    fprintf(stderr, "have mapped the p2m table\n");
+    /* update guest's P2M table here */
+    update_p2m_table(p2m, mfn, new_mfn, num, guest_width); 
+
+error:
+    if (new_mfn)
+        free(new_mfn);
+
+    if (live_m2p)
+        munmap(live_m2p, M2P_SIZE(max_mfn));
+
+    suspend_evtchn_release(&si);
+    /* resume guest now */
+    xc_domain_resume(xc_handle, domid, 1);
+
+    return ret;
+}
+
+static int xc_mark_page_offline(int xc, unsigned long start,
+                                unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if (end < start)
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+#define PAGE_OFFLINE_HANDLED (0x1UL << 8)
+#define page_offline_owner(x)   \
+        (x >> PG_OFFLINE_OWNER_SHIFT)
+
+static int xc_page_offline(unsigned long start, unsigned long end)
+{
+    int rc = 0, i, num, check_count, xc_handle;
+    uint32_t *status = NULL;
+    xen_pfn_t *pfns = NULL;
+
+    if (end < start)
+    {
+        fprintf(stderr, "End %lx is smaller than start %lx\n", end, start);
+        return -EINVAL;
+    }
+
+    xc_handle = xc_interface_open();
+
+    if (!xc_handle)
+        return -1;
+
+    num = end - start + 1;
+
+    rc = -ENOMEM;
+    pfns = malloc(num * sizeof(xen_pfn_t));
+    if (!pfns)
+        goto fail;
+    memset(pfns, sizeof(xen_pfn_t)* num, 0);
+
+    status  = malloc(num * sizeof(uint32_t));
+    if (!status)
+        goto fail;
+    memset(status, sizeof(uint32_t)*num, 0);
+
+    rc = xc_mark_page_offline(xc_handle, start, end, status);
+
+    if (rc)
+    {
+        fprintf(stderr, "fail to mark pages offline\n");
+        goto fail;
+    }
+
+    rc = 0;
+    check_count = 0;
+    while (check_count != num)
+    {
+        uint32_t pstat = status[check_count];
+
+        fprintf(stderr, "check_count %x pstat %x\n",
+                check_count, pstat);
+        if (pstat & PAGE_OFFLINE_HANDLED)
+        {
+            check_count ++;
+            continue;
+        }
+
+        switch (pstat & PG_OFFLINE_STATUS_MASK)
+        {
+        case PG_OFFLINE_OFFLINED:
+            check_count ++;
+            break;
+        case PG_OFFLINE_PENDING:
+        {
+            domid_t owner = page_offline_owner(pstat);
+            int j = 0, done;
+
+            /* Should HV present such information ?? */
+            if (owner >= DOMID_FIRST_RESERVED)
+            {
+                fprintf(stderr, "special domain ownership\n");
+                check_count++;
+                continue;
+            }
+
+            /* get all pages with the same owner */
+            memset(pfns, sizeof(uint32_t)*num, 0);
+            for (i = check_count; i < num; i++)
+            {
+                if (page_offline_owner(status[i]) == owner)
+                {
+                    status[i] |= PAGE_OFFLINE_HANDLED;
+                    pfns[j++] = start + i;
+                }
+            }
+
+            /* offline the pages */
+            rc = domain_page_offline(xc_handle, owner, pfns, j, &done);
+            if (rc)
+            {
+                /* XXX need take recovery if can't offline all pages? */ 
+                fprintf(stderr, "failed to offline domain %x's page\n"
+                        "total page %x done %x\n", owner, j, done);
+                goto fail;
+            }
+            check_count ++;
+            break;
+        }
+
+        default:
+            fprintf(stderr, "Error page offline status %x\n", pstat);
+            goto fail;
+        }
+    }
+    xc_interface_close(xc_handle);
+fail:
+    if (status)
+        free(status);
+    if (pfns)
+        free(pfns);
+    return rc;    
+}
+
+int
+main(int argc, char **argv)
+{
+    unsigned long start, end;
+
+    if (argc != 3)
+        fprintf(stderr, "usage: %s start end", argv[0]);
+
+    errno = 0;
+    start = strtoul(argv[1], NULL, 0);
+    end = strtoul(argv[2], NULL, 0);
+
+    if (errno){
+        fprintf(stderr, "usage: %s start end", argv[0]);
+    }
+
+    xc_page_offline(start, end);
+
+    return 1;
+}

[-- Attachment #4: page_offline_xen.patch --]
[-- Type: application/octet-stream, Size: 25865 bytes --]

new version of page offline

diff -r fb35bb57bba6 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/common/page_alloc.c	Sun Feb 08 23:19:58 2009 +0800
@@ -35,9 +35,13 @@
 #include <xen/perfc.h>
 #include <xen/numa.h>
 #include <xen/nodemask.h>
+#include <public/sysctl.h>
 #include <asm/page.h>
 #include <asm/numa.h>
 #include <asm/flushtlb.h>
+
+#define dbg_offpage(_f, _a...)    \
+    printk(XENLOG_DEBUG "PAGE_OFFLINE %s:%d: " _f,  __FILE__ , __LINE__ , ## _a)
 
 /*
  * Comma-separated list of hexadecimal page numbers containing bad bytes.
@@ -73,6 +77,12 @@ static DEFINE_SPINLOCK(page_scrub_lock);
 static DEFINE_SPINLOCK(page_scrub_lock);
 PAGE_LIST_HEAD(page_scrub_list);
 static unsigned long scrub_pages;
+
+/* Offlined page list, protected by heap_lock */
+PAGE_LIST_HEAD(page_offlined_list);
+
+/* Broken page list, protected by heap_lock */
+PAGE_LIST_HEAD(page_broken_list);
 
 /*********************
  * ALLOCATION BITMAP
@@ -421,19 +431,93 @@ static struct page_info *alloc_heap_page
     return pg;
 }
 
+static inline int is_page_allocated(struct page_info *pg)
+{
+    return allocated_in_map(page_to_mfn(pg)) && 
+            !(pg->count_info & PGC_offlined);
+}
+
+/* Add the pages into heap[][][], and merge chunks as far as possible */
+static void merge_heap_pages(unsigned int zone, struct page_info *pg,
+                             unsigned int order, int prev, int next)
+{
+    unsigned long mask;
+    unsigned int node = phys_to_nid(page_to_maddr(pg));
+
+    while ( order < MAX_ORDER )
+    {
+        mask = 1UL << order;
+
+        if ( page_to_mfn(pg) & mask )
+        {
+            /* Merge with predecessor block? */
+            if ( allocated_in_map(page_to_mfn(pg)-mask) ||
+                 (PFN_ORDER(pg-mask) != order) )
+                break;
+            if (prev)
+            {
+                pg -= mask;
+                page_list_del(pg, &heap(node, zone, order));
+            } else
+                break;
+        }
+        else
+        {
+            /* Merge with successor block? */
+            if ( allocated_in_map(page_to_mfn(pg)+mask) ||
+                 (PFN_ORDER(pg+mask) != order) )
+                break;
+            if (next)
+                page_list_del(pg+mask, &heap(node, zone, order));
+            else
+                break;
+        }
+
+        order++;
+
+        /* After merging, pg should remain in the same node. */
+        ASSERT(phys_to_nid(page_to_maddr(pg)) == node);
+    }
+
+    PFN_ORDER(pg) = order;
+    page_list_add_tail(pg, &heap(node, zone, order));
+}
+
+/* Mark 2^@order set of pages freed in heap[][][], avail[], and bitmap.
+ * This assumes all pages are from the same zone
+ */
+static void recycle_heap_pages(
+    unsigned int zone, struct page_info *start, struct page_info *end)
+{
+    unsigned int node = phys_to_nid(page_to_maddr(start));
+    struct page_info *pg = start;
+
+    ASSERT(zone < NR_ZONES);
+    ASSERT(node >= 0);
+    ASSERT(node < num_online_nodes());
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    /* XXX enhance for performance */
+    while (pg++ <= end)
+    {
+        merge_heap_pages(zone, pg, 0, 1, 1);
+        map_free(page_to_mfn(pg), 1);
+    }
+    avail[node][zone] += page_to_mfn(end) - page_to_mfn(start);
+}
+
 /* Free 2^@order set of pages. */
 static void free_heap_pages(
     struct page_info *pg, unsigned int order)
 {
-    unsigned long mask;
-    unsigned int i, node = phys_to_nid(page_to_maddr(pg));
+    unsigned long count_info;
+    unsigned int i, nr_pages;
     unsigned int zone = page_to_zone(pg);
-
-    ASSERT(order <= MAX_ORDER);
-    ASSERT(node >= 0);
-    ASSERT(node < num_online_nodes());
-
-    for ( i = 0; i < (1 << order); i++ )
+    int offlined = 0, clean = 1;
+
+    spin_lock(&heap_lock);
+    for ( i = 0, nr_pages = 0; i < (1 << order); i++, nr_pages++)
     {
         /*
          * Cannot assume that count_info == 0, as there are some corner cases
@@ -446,52 +530,543 @@ static void free_heap_pages(
          *     in its pseudophysical address space).
          * In all the above cases there can be no guest mappings of this page.
          */
+        count_info = pg[i].count_info;
         pg[i].count_info = 0;
+
+        if ( count_info & PGC_offlining )
+            pg[i].count_info |= PGC_offlining;
+        if ( count_info & PGC_broken )
+            pg[i].count_info |= PGC_broken;
 
         /* If a page has no owner it will need no safety TLB flush. */
         pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL);
         if ( pg[i].u.free.need_tlbflush )
             pg[i].tlbflush_timestamp = tlbflush_current_time();
-    }
+        ASSERT(is_page_allocated(&pg[i]));
+
+        /* If the page is in "offline pending and broken", then set it to be
+         * "offlined and broken" and put it to the broken list; if the page is
+         * in "offline pending", then set it to be "offline" and put it to
+         * the offline list; otherwise, free it and put it to heap[][][]
+         */
+        if ( is_page_broken(&pg[i]) )
+        {
+            page_list_add_tail(&pg[i], &page_broken_list);
+            pg->count_info |= PGC_offlined;
+            offlined = 1;
+            clean = 0;
+        }
+        else if ( is_page_offlining(&pg[i]) )
+        {
+            page_list_add_tail(&pg[i], &page_offlined_list);
+            pg->count_info |= PGC_offlined;
+            pg->count_info &= ~PGC_offlining;
+            offlined = 1;
+            clean = 0;
+        }
+
+        if ( unlikely(offlined) )
+        {
+            offlined = 0;
+            /* recycle those freed pages except offlined pages */
+            if ( nr_pages > 0 )
+            {
+                recycle_heap_pages(zone, pg + i - nr_pages, pg + i - 1);
+                nr_pages = 0;
+            }
+        }
+    }
+
+    if (clean)
+    {
+        unsigned int node = phys_to_nid(page_to_maddr(pg));
+        map_free(page_to_mfn(pg), 1UL << order);
+        avail[node][zone] += (1UL << order);
+        merge_heap_pages(zone, pg, order, 1, 1);
+    } else if ( nr_pages > 0 )
+    /* handle the rest */
+    {
+        recycle_heap_pages(zone, pg + i - nr_pages, pg + i - 1);
+    }
+
+    spin_unlock(&heap_lock);
+}
+
+/*
+ * Reserve pages that is in the same order list in the buddy system
+ * head: the head page in the buddy contains the range
+ */
+int reserve_heap_pages_order(struct page_info *head,
+                                     unsigned long start_mfn,
+                                     unsigned long end_mfn,
+                                     int order)
+{
+    unsigned int node = phys_to_nid(page_to_maddr(head));
+    int zone = page_to_zone(head), cur_order;
+    struct page_info *start, *end, *cur_head, *cur_end;
+
+
+    ASSERT(order <= MAX_ORDER);
+    ASSERT(PFN_ORDER(head) == order);
+
+    start = mfn_to_page(start_mfn);
+    end = mfn_to_page(end_mfn);
+    if (end >= (head + (1UL << order)))
+        return -EINVAL;
+
+    /* sanity checking */
+    if ( (end < start) || (start < head) || (end >= (head + (1UL << order))) )
+        return -EINVAL;
+
+    page_list_del(head, &heap(node, zone, order));
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    cur_head = head;
+
+    cur_order = PFN_ORDER(head);
+
+    while (cur_head < start)
+    {
+       while ( (cur_order >= 0) && (cur_head + (1UL << cur_order ) > start) )
+           cur_order--; 
+
+       if (cur_head + (1UL << cur_order ) <= start)
+       {
+           merge_heap_pages(zone, cur_head, cur_order, 1, 0); 
+           cur_head += (1UL << cur_order);
+       }
+    }
+
+    cur_end = head + (1UL << order) - 1;
+    cur_order = order;
+
+    while (cur_end > end)
+    {
+        while ( (cur_order >= 0) && (cur_end - (1UL << cur_order) < end))
+            cur_order --;
+
+       if ((cur_end - (1UL << cur_order) >= end))
+       {
+           merge_heap_pages(zone, cur_end - (1UL << cur_order) + 1, cur_order, 0, 1);
+           cur_end -= (1UL << cur_order);
+       }
+    }
+
+    avail[node][zone] -= (end_mfn - start_mfn);
+
+    return 0;
+}
+
+/*
+ * Reserve pages that is in the same zone
+ */
+int reserve_heap_pages_zone(unsigned long start_mfn,
+                                   unsigned long end_mfn)
+{
+    int node = phys_to_nid(pfn_to_paddr(start_mfn));
+    int zone = page_to_zone(mfn_to_page(start_mfn)), ret = 0;
+
+    unsigned long cur_start, cur_end;
+    int i;
+
+    ASSERT(spin_is_locked(&heap_lock));
+    ASSERT( page_to_zone(mfn_to_page(start_mfn)) ==
+            page_to_zone(mfn_to_page((end_mfn))) );
+
+    if (end_mfn < start_mfn)
+        return 0;
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    cur_start = cur_end = start_mfn;
+
+    for ( i = 0; i <= MAX_ORDER; i++ )
+    {
+        struct page_info *head, *tmp;
+        unsigned int heap_mask;
+
+        if ( page_list_empty(&heap(node, zone, i)) )
+            continue;
+
+        if (cur_start > end_mfn)
+            break;
+
+        heap_mask = 1UL << i;
+        page_list_for_each_safe(head, tmp, &heap(node, zone, i))
+        {
+            if ( (head <= mfn_to_page(cur_start)) &&
+              ( (head + (1UL << i)) > mfn_to_page(cur_start)))
+            {
+                cur_end = min(page_to_mfn(head ) + (1UL << i) - 1, end_mfn);
+
+                ret = reserve_heap_pages_order(head, cur_start, cur_end, i);
+                cur_start = cur_end + 1;
+                if (ret)
+                {
+                    dprintk(XENLOG_ERR, "fail to reserve page %lx to %lx\n",
+                      cur_start, cur_end);
+                    return ret;
+                }
+                break;
+            }
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Reserve pages that is in the same node
+ */
+int reserve_heap_pages_node(
+    unsigned long start_mfn, unsigned long end_mfn)
+{
+    unsigned long cur_start, cur_end;
+    int ret = 0;
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    ASSERT(spin_is_locked(&heap_lock));
+    ASSERT(phys_to_nid(pfn_to_paddr(start_mfn)) ==
+           phys_to_nid(pfn_to_paddr(end_mfn)) );
+
+    cur_start = cur_end = start_mfn;
+    while (cur_start <= end_mfn)
+    {
+        while ( (cur_end < end_mfn) &&
+                ( page_to_zone(mfn_to_page(cur_end + 1)) == 
+                  page_to_zone(mfn_to_page(cur_start)) ) ) 
+                cur_end ++;
+        ret = reserve_heap_pages_zone(cur_start, cur_end);
+
+        if (ret)
+        {
+            dprintk(XENLOG_ERR, "fail to reserve page %lx %lx\n", 
+                                cur_start, cur_end);
+            break;
+        }
+        cur_start = cur_end + 1;
+    }
+
+    return ret;
+}
+
+/*
+ * reserve page from buddy system
+ */
+int reserve_heap_pages(
+    unsigned long start_mfn, unsigned long end_mfn, int broken)
+{
+    unsigned int i;
+    unsigned long cur_start, cur_end;
+    int ret = 0;
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    if (end_mfn > max_page)
+        return -EINVAL;
+
+    /* sanity checking */
+    for (cur_start = start_mfn; cur_start <= end_mfn; cur_start++)
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(cur_start);
+        if (allocated_in_map(cur_start) && !(pg->count_info & PGC_offlined) )
+        {
+            dprintk(XENLOG_WARNING,
+              "pg %lx is not free, can't reserve\n", cur_start);
+            return -EINVAL;
+        }
+    }
+
+    map_alloc(start_mfn, end_mfn - start_mfn + 1);
+
+    cur_start = cur_end = start_mfn;
+    while (cur_start <= end_mfn)
+    {
+        while ( (cur_end < end_mfn ) &&
+                (phys_to_nid(pfn_to_paddr(cur_end + 1))
+                    == phys_to_nid(pfn_to_paddr(cur_start))) )
+                    cur_end ++;
+        ret = reserve_heap_pages_node(cur_start, cur_end);
+
+        if (ret)
+        {
+            dprintk(XENLOG_ERR, "fail to reserve page %lx %lx\n", 
+                                cur_start, cur_end);
+            break;
+        }
+        cur_start = cur_end + 1;
+    }
+
+    if (cur_start <= end_mfn)
+        return ret;
+
+    for ( i = start_mfn; i <= end_mfn; i++ )
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(i);
+        pg->count_info |= PGC_offlined;
+        if (broken)
+            page_list_add_tail(&pg[i], &page_broken_list);
+        else
+            page_list_add_tail(&pg[i], &page_offlined_list);
+    }
+
+    return ret;
+}
+
+void online_heap_page(struct page_info *pg)
+{
+    unsigned int zone;
+
+    if ( is_xen_heap_page(pg) )
+        zone = MEMZONE_XEN;
+    else
+        zone = page_to_zone(pg);
+
+    /* Cannot online broken page or assigned page */
+    ASSERT(!is_page_broken(pg) && !is_page_allocated(pg));
+
+    pg->count_info &= ~(PGC_offlining | PGC_offlined);
+    page_list_del(pg, &page_offlined_list);
+
+    recycle_heap_pages(zone, pg, pg);
+}
+
+/*
+ * Offline the memory, 0 if succeed
+ */
+unsigned int do_offline_pages(unsigned long start_pfn, unsigned long end_pfn,
+                              uint32_t *status, int broken)
+{
+    unsigned long mfn = start_pfn;
+    struct domain *owner;
+    int i = 0, ret = 0;
+    unsigned long * updated;
+
+    if ( start_pfn > end_pfn )
+        return 0;
+
+    if (end_pfn > max_page)
+    {
+        dprintk(XENLOG_WARNING,
+                "try to offline page out of range %lx\n", end_pfn);
+        return -EINVAL;
+    }
+    updated = xmalloc_bytes( BITS_TO_LONGS(end_pfn - start_pfn) * sizeof(long));
+
+    if (!updated)
+        return -ENOMEM;
 
     spin_lock(&heap_lock);
 
-    map_free(page_to_mfn(pg), 1 << order);
-    avail[node][zone] += 1 << order;
-
-    /* Merge chunks as far as possible. */
-    while ( order < MAX_ORDER )
-    {
-        mask = 1UL << order;
-
-        if ( (page_to_mfn(pg) & mask) )
-        {
-            /* Merge with predecessor block? */
-            if ( allocated_in_map(page_to_mfn(pg)-mask) ||
-                 (PFN_ORDER(pg-mask) != order) )
+    while ( mfn <= end_pfn )
+    {
+        struct page_info *pg;
+
+        pg = mfn_to_page(mfn);
+        /* init the result value */
+        status[i] = 0;
+
+#if defined(__i386__)
+        if ( is_xen_heap_mfn(mfn) )
+        {
+            status[i] |= PG_OFFLINE_XENPAGE | PG_OFFLINE_FAILED;
+            status[i] |= (DOMID_XEN << PG_OFFLINE_OWNER_SHIFT);
+            ret = -EPERM;
+            break;
+        }
+        else
+#endif
+        if ( is_page_allocated(pg) && !is_page_offlined(pg) )
+        {
+            owner = page_get_owner(pg);
+
+            if (!owner)
+            { 
+                /* anounymous page, shadow page, or Xen heap page for x86_64 */
+#if !defined(__i386__)
+                if ( is_xen_heap_mfn(mfn) )
+                    status[i] |= ( (DOMID_XEN << PG_OFFLINE_OWNER_SHIFT) |
+                      PG_OFFLINE_XENPAGE );
+                else
+#endif
+                    status[i] |= ( PG_OFFLINE_ANONYMOUS |
+                      (DOMID_INVALID) << PG_OFFLINE_OWNER_SHIFT );
+
+                status[i] |= PG_OFFLINE_FAILED;
+                ret = -EPERM;
                 break;
-            pg -= mask;
-            page_list_del(pg, &heap(node, zone, order));
-        }
-        else
-        {
-            /* Merge with successor block? */
-            if ( allocated_in_map(page_to_mfn(pg)+mask) ||
-                 (PFN_ORDER(pg+mask) != order) )
+            }
+            else if ( owner == dom0 )
+            {
+                status[i] |= PG_OFFLINE_DOM0PAGE | PG_OFFLINE_FAILED;
+                ret = -EPERM;
                 break;
-            page_list_del(pg + mask, &heap(node, zone, order));
-        }
-        
-        order++;
-
-        /* After merging, pg should remain in the same node. */
-        ASSERT(phys_to_nid(page_to_maddr(pg)) == node);
-    }
-
-    PFN_ORDER(pg) = order;
-    page_list_add_tail(pg, &heap(node, zone, order));
-
+            }
+            /* Set the bit only */
+            else 
+            {
+                status[i] |= PG_OFFLINE_OWNED | PG_OFFLINE_PENDING;
+                status[i] |= (owner->domain_id << PG_OFFLINE_OWNER_SHIFT);
+                if (!(pg->count_info & PGC_offlining))
+                {
+                    updated[(mfn - start_pfn)/PAGES_PER_MAPWORD] |=
+                      (1UL << ((mfn - start_pfn) & (PAGES_PER_MAPWORD -1 )));
+                    pg->count_info |= PGC_offlining;
+                }
+            }
+        }
+        else if (is_page_offlined(pg))
+            status[i] |= PG_OFFLINE_OFFLINED;
+        else {
+            unsigned long last_mfn = mfn;
+            int j;
+            struct page_info *last_pg = mfn_to_page(mfn);
+
+            if (!(pg->count_info & PGC_offlined))
+            {
+                /* Free pages */
+
+                updated[(mfn - start_pfn)/PAGES_PER_MAPWORD] |=
+                        (1UL << ((mfn - start_pfn) & (PAGES_PER_MAPWORD -1 )));
+                /* Try as much free pages as possible */
+                last_mfn = mfn + 1;
+                last_pg = mfn_to_page(last_mfn);
+                while ( (last_mfn <= end_pfn) && 
+                         !is_page_allocated(last_pg) &&
+                         !(last_pg->count_info & PGC_offlined) &&
+                         !is_xen_heap_page(last_pg) )
+                {
+                    last_mfn ++;
+                    last_pg = mfn_to_page(last_mfn);
+                }
+                reserve_heap_pages(mfn, last_mfn - 1, broken);
+            }
+
+            for (j = mfn; j < last_mfn; j++)
+                status[j - start_pfn] = PG_OFFLINE_OFFLINED;
+
+            i += (last_mfn - 1 - mfn);
+            mfn = last_mfn - 1;
+        }
+
+        i++;
+        mfn++;
+    }
     spin_unlock(&heap_lock);
+
+    /* revert if failed */
+    if (mfn <= end_pfn)
+    {
+        int i = 0;
+        struct page_info *revert;
+
+        for ( i = find_first_bit(updated,  end_pfn - start_pfn);
+              i < (end_pfn - start_pfn);
+              i = find_next_bit(updated, end_pfn - start_pfn, i+1) )
+        {
+            revert = mfn_to_page(start_pfn + i);
+            revert->count_info &= ~(PGC_offlining | PGC_offlined);
+
+            /* Put the offlined page back to buddy system */
+            if (revert->count_info & PGC_offlined)
+                online_heap_page(revert);
+        }
+    }
+
+    dprintk(XENLOG_INFO, "Offlin Page %lx ~ %lx, last page is %lx",
+                          start_pfn, end_pfn, mfn);
+
+    xfree(updated);        
+    return ret;
+}
+
+/* Online the memory.
+ *   The caller should make sure end_pfn <= max_page,
+ *   if not, expand_pages() should be called prior to online_pages().
+ *   Succeed if it returns PG_ONLINE_SUCCESS.
+ *   Fail if it returns PG_ONLINE_FAILURE.
+ */ 
+unsigned int do_online_pages(unsigned long start_pfn,
+                             unsigned long end_pfn,
+                             uint32_t *status)
+{
+    unsigned long mfn;
+    int i;
+    struct page_info *pg;
+    int ret = 0;
+
+    if ( start_pfn >= end_pfn )
+        return 0;
+
+    if ( end_pfn > max_page )
+    {
+        dprintk(XENLOG_WARNING, "call expand_pages() first\n");
+        dprintk(XENLOG_WARNING, "memory onlining %lx to %lx failed\n", 
+                     start_pfn, end_pfn);
+        return -EINVAL;
+    }
+
+    for ( mfn = start_pfn, i = 0; mfn < end_pfn; mfn++, i++ )
+    {
+        pg = mfn_to_page(mfn);
+
+        if ( unlikely(is_page_broken(pg)) )
+        {
+            ret = -EINVAL;
+            status[i] |= PG_ONLINE_FAILED |PG_ONLINE_BROKEN;
+            break;
+        }
+        else if (pg->count_info & PGC_offlined)
+        {
+            pg->count_info &= PGC_offlined;
+            free_heap_pages(pg, 0);
+            status[i] |= PG_ONLINE_ONLINED;
+        }
+        else if (pg->count_info & PGC_offlining)
+        {
+            pg->count_info &= PGC_offlining;
+            status[i] |= PG_ONLINE_ONLINED;
+        }
+    }
+
+    return ret;
+}
+
+unsigned int do_kill_pages(unsigned long start_pfn,
+                           unsigned long end_pfn,
+                           uint32_t *status)
+{
+    unsigned int ret = 0, i;
+
+    if ( start_pfn > end_pfn )
+        return -EINVAL;
+
+    dbg_offpage("do_kill_page start_pfn %lx end_pfn %lx\n",
+                 start_pfn, end_pfn);
+
+    BUG_ON(end_pfn > max_page);
+
+    ret = do_offline_pages(start_pfn, end_pfn, status, 1);
+
+    if (ret)
+        return ret;
+
+    for ( i = start_pfn; i <= end_pfn; i++ )
+    {
+        struct page_info *pg = mfn_to_page(i);
+
+        pg[i].count_info |= PGC_broken;
+    }
+
+    return ret;
 }
 
 /*
diff -r fb35bb57bba6 xen/common/sysctl.c
--- a/xen/common/sysctl.c	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/common/sysctl.c	Sun Feb 08 18:25:37 2009 +0800
@@ -233,6 +233,34 @@ long do_sysctl(XEN_GUEST_HANDLE(xen_sysc
     }
     break;
 
+    case XEN_SYSCTL_page_offline:
+    {
+        uint32_t *status;
+
+        status = xmalloc_bytes( sizeof(uint32_t) *
+                                (op->u.page_offline.end -
+                                  op->u.page_offline.start + 1));
+        if (!status)
+        {
+            ret = -ENOMEM;
+            break;
+        }
+        ret = do_offline_pages(op->u.page_offline.start,
+                               op->u.page_offline.end,
+                               status, 0);
+
+        if (ret)
+            break;
+        if (copy_to_guest(op->u.page_offline.status, status,
+                          op->u.page_offline.end - op->u.page_offline.start + 1))
+        {   
+            ret = -EFAULT; 
+            break;
+        }
+        xfree(status);
+    }
+    break;
+
     default:
         ret = arch_do_sysctl(op, u_sysctl);
         break;
diff -r fb35bb57bba6 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/asm-x86/mm.h	Sun Feb 08 23:12:59 2009 +0800
@@ -198,8 +198,23 @@ struct page_info
  /* 3-bit PAT/PCD/PWT cache-attribute hint. */
 #define PGC_cacheattr_base PG_shift(6)
 #define PGC_cacheattr_mask PG_mask(7, 6)
+ /* Page is broken? */
+#define _PGC_broken         PG_shift(7)
+#define PGC_broken          PG_mask(1, 7)
+ /* Page is offline pending ? */
+#define _PGC_offlining      PG_shift(8)
+#define PGC_offlining       PG_mask(1, 8)
+ /* Page is offlined */
+#define _PGC_offlined       PG_shift(9)
+#define PGC_offlined        PG_mask(1, 9)
+
+#define is_page_offlining(page)          ((page)->count_info & PGC_offlining)
+#define is_page_offlined(page)          ((page)->count_info & PGC_offlined)
+#define is_page_broken(page)           ((page)->count_info & PGC_broken)
+#define is_page_online(page)           (!is_page_offlined(page))
+
  /* Count of references to this frame. */
-#define PGC_count_width   PG_shift(6)
+#define PGC_count_width   PG_shift(10)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
 #if defined(__i386__)
diff -r fb35bb57bba6 xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/public/sysctl.h	Sun Feb 08 18:25:37 2009 +0800
@@ -359,6 +359,39 @@ struct xen_sysctl_pm_op {
     };
 };
 
+#define XEN_SYSCTL_page_offline        14
+struct xen_sysctl_page_offline {
+    /* IN: range of page to be offlined */
+    uint32_t start;
+    uint32_t end;
+    /* OUT: result of page offline request */
+    /*
+     * bit 0~15: result flags
+     * bit 16~31: owner
+     */
+    XEN_GUEST_HANDLE(uint32) status;
+};
+
+#define PG_OFFLINE_STATUS_MASK    (0xFUL)
+#define PG_OFFLINE_OFFLINED  (0x1UL << 0)
+#define PG_OFFLINE_PENDING   (0x1UL << 1)
+#define PG_OFFLINE_FAILED    (0x1UL << 2)
+#define PG_ONLINE_FAILED     (0x1UL << 2)
+#define PG_ONLINE_ONLINED    (0x1UL << 0)
+
+#define PG_OFFLINE_MISC_MASK    (0xFUL << 4)
+/* only valid when PG_OFFLINE_FAILED */
+#define PG_OFFLINE_XENPAGE   (0x1UL << 4)
+#define PG_OFFLINE_DOM0PAGE  (0x1UL << 5)
+#define PG_OFFLINE_ANONYMOUS (0x1UL << 6)
+
+#define PG_ONLINE_BROKEN      (0x1UL << 4)
+
+#define PG_OFFLINE_OWNED     (0x1UL << 7)
+
+#define PG_OFFLINE_OWNER_SHIFT 16
+
+
 struct xen_sysctl {
     uint32_t cmd;
     uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
@@ -375,6 +408,7 @@ struct xen_sysctl {
         struct xen_sysctl_get_pmstat        get_pmstat;
         struct xen_sysctl_cpu_hotplug       cpu_hotplug;
         struct xen_sysctl_pm_op             pm_op;
+        struct xen_sysctl_page_offline      page_offline;
         uint8_t                             pad[128];
     } u;
 };
diff -r fb35bb57bba6 xen/include/public/xen.h
--- a/xen/include/public/xen.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/public/xen.h	Sun Feb 08 18:25:37 2009 +0800
@@ -354,6 +354,9 @@ typedef uint16_t domid_t;
  */
 #define DOMID_XEN  (0x7FF2U)
 
+/* DOMID_INVALID is used to identity invalid domid */
+#define DOMID_INVALID (0x7FFFU)
+
 /*
  * Send an array of these to HYPERVISOR_mmu_update().
  * NB. The fields are natural pointer/address size for this architecture.
diff -r fb35bb57bba6 xen/include/xen/mm.h
--- a/xen/include/xen/mm.h	Sun Feb 08 18:25:36 2009 +0800
+++ b/xen/include/xen/mm.h	Sun Feb 08 18:25:37 2009 +0800
@@ -47,6 +47,8 @@ void init_xenheap_pages(paddr_t ps, padd
 void init_xenheap_pages(paddr_t ps, paddr_t pe);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
+unsigned int do_offline_pages(unsigned long start_pfn, unsigned long end_pfn, uint32_t *status, int broken);
+unsigned int do_online_pages(unsigned long start_pfn, unsigned long end_pfn, uint32_t *status);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-09  8:54 [RFC][PATCH] Basic support for page offline Jiang, Yunhong
@ 2009-02-10  9:15 ` Tim Deegan
  2009-02-10  9:29   ` Jiang, Yunhong
  2009-02-10 21:09 ` Frank van der Linden
  2009-02-13 17:03 ` Tim Deegan
  2 siblings, 1 reply; 61+ messages in thread
From: Tim Deegan @ 2009-02-10  9:15 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

At 03:54 -0500 on 09 Feb (1234151686), Jiang, Yunhong wrote:
> Hi, Tim, this patchset try to support page offline request. I want to get some initial feedback before more testing.

I haven't had a chance to read the patches in detail yet, but my initial
impression is that:

 - The general approach so far seems good (I suspect that your 2.3 stage
   below could also be done like 2.2 without a full live migration but 
   since that's not implemented yet that's fine).
 - It seems like a lot of code for what it does.  On the Xen side that's
   just a general impression since I'm not familiar with the bits of the 
   heap allocators that you're changing.  In libxc you seem to have 
   duplicated parts of the save/restore code -- better to make those 
   routines externally visible to the rest of libxc and call them 
   from your new function.
 - Like all systems code everywhere, it needs more comments. :)  You've
   introduced some generic-sounding functions (adjust_pte &c) without
   describing what they do.

I'll have more detailed comments later in the week, I hope. 

Cheers,

Tim.

> Page offline can be used by multiple usage model, belows are some examples:
> a) If too many correctable error happen to one page, management tools may try to offline the page to avoid more server error in future;
> b) When page is ECC error and can't be recoverd by hardware, Xen's MCA handler may try to offline the page, so that it will not be accessed anymore.
> c) Offline some DIMM for power management etc (Of course, this is far more than simple page offline)
> 
> The basic idea to offline a page is:
> 1) If a page is free, it will be removed from page allocator
> 2) If page is in use, the owner will be checked
>   2.1) if it is owned by xen/dom0, the offline will be failed
>   2.2) If it is owned by a PV guest with no device assigned, user space tools will try to replace the page with new one.
>   2.3) It it is owned by a HVM guest with no device assigned, user space tools will try to live migration it.
>   2.4) If it is owned by a guest with device assigned, user space tools can do live migration if needed.
> 
> This patchset includes support for type 2.1/2.2. 
> 
> page_offfline_xen.patch gives basic support. The new hypercall (XEN_SYSCTL_page_offline) will mark a page offlining if the page is in-use, otherwise, it will remove the page from the page allocator. It also changes the free_heap_pages(), so that if a page_offlining page is freed, that page will be marked as page_offlined and will not be allocated anymore. One tricky thing is, the offlined page may not be buddy-aligned (i.e., it may be in the middle of a 2^order pages), so that we have to re-arrange the buddy system (i.e. &heap[][][]) carefully.
> 
> page_offline_xen_memory.patch add support to PV guest, a new hypercall (XENMEM_page_offline) try to replace the old page with the new one. This will happen only when the guest has been suspeneded, to avoid complex page sharing situation. I'm still checking if more situation need be considered, like LDT pages and CR3 pages, so any suggestion is really great help.
> 
> page_offline_tools.patch is an example user space tools based on libxc/xc_domain_save.c, it will try to firstly mark a page offline, and checking the result. If a page is owned by a PV guest, it will try to replace the pages.
> 
> I did some basic testing, tried free pages and PV guest pages and is ok. Of course, I need more test on it. And more robust error handling is needed.
> 
> Any suggestion is welcome.
> 
> Thanks
> Yunhong Jiang





-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-10  9:15 ` Tim Deegan
@ 2009-02-10  9:29   ` Jiang, Yunhong
  2009-02-10  9:42     ` Tim Deegan
  2009-02-10 10:29     ` Keir Fraser
  0 siblings, 2 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-10  9:29 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel


Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> At 03:54 -0500 on 09 Feb (1234151686), Jiang, Yunhong wrote:
>> Hi, Tim, this patchset try to support page offline request.
> I want to get some initial feedback before more testing.
> 
> I haven't had a chance to read the patches in detail yet, but
> my initial
> impression is that:

Thanks for your comments very much! 

> 
> - The general approach so far seems good (I suspect that your
> 2.3 stage
>   below could also be done like 2.2 without a full live migration but
>   since that's not implemented yet that's fine).

Yes, for HVM guest, if it has stub domain support, we can do it through two step:
a) first do the update for the stub domain as 2.2
b) update the shadow/hap page table for the domain itself.
But as stub domain support is not official enabled, so we will wait till stub domain is by default enabled.

> - It seems like a lot of code for what it does.  On the Xen
> side that's
>   just a general impression since I'm not familiar with the
> bits of the

For Xen part, the reason is because the currently heap allocator is based on buddy system, while the offlined page may be in the middle of the buddy, so we have to split the original buddy into several smaller one. 
And do you mean I need turn to Keir for the heap allocator changes?

>   heap allocators that you're changing.  In libxc you seem to have
>   duplicated parts of the save/restore code -- better to make those
>   routines externally visible to the rest of libxc and call them   from
> your new function. - Like all systems code everywhere, it needs more
>   comments. :)  You've introduced some generic-sounding functions
> (adjust_pte &c) without   describing what they do. 

Yes, I will add it in next round.

-- Yunhong 

> 
> I'll have more detailed comments later in the week, I hope.
> 
> Cheers,
> 
> Tim.
> 
>> Page offline can be used by multiple usage model, belows are some examples:
>> a) If too many correctable error happen to one page,
> management tools may try to offline the page to avoid more
> server error in future;
>> b) When page is ECC error and can't be recoverd by hardware,
> Xen's MCA handler may try to offline the page, so that it will
> not be accessed anymore.
>> c) Offline some DIMM for power management etc (Of course,
> this is far more than simple page offline)
>> 
>> The basic idea to offline a page is:
>> 1) If a page is free, it will be removed from page allocator
>> 2) If page is in use, the owner will be checked
>>   2.1) if it is owned by xen/dom0, the offline will be failed
>>   2.2) If it is owned by a PV guest with no device assigned,
> user space tools will try to replace the page with new one.
>>   2.3) It it is owned by a HVM guest with no device
> assigned, user space tools will try to live migration it.
>>   2.4) If it is owned by a guest with device assigned, user
> space tools can do live migration if needed.
>> 
>> This patchset includes support for type 2.1/2.2.
>> 
>> page_offfline_xen.patch gives basic support. The new
> hypercall (XEN_SYSCTL_page_offline) will mark a page offlining
> if the page is in-use, otherwise, it will remove the page from
> the page allocator. It also changes the free_heap_pages(), so
> that if a page_offlining page is freed, that page will be
> marked as page_offlined and will not be allocated anymore. One
> tricky thing is, the offlined page may not be buddy-aligned
> (i.e., it may be in the middle of a 2^order pages), so that we
> have to re-arrange the buddy system (i.e. &heap[][][]) carefully.
>> 
>> page_offline_xen_memory.patch add support to PV guest, a new
> hypercall (XENMEM_page_offline) try to replace the old page
> with the new one. This will happen only when the guest has
> been suspeneded, to avoid complex page sharing situation. I'm
> still checking if more situation need be considered, like LDT
> pages and CR3 pages, so any suggestion is really great help.
>> 
>> page_offline_tools.patch is an example user space tools
> based on libxc/xc_domain_save.c, it will try to firstly mark a
> page offline, and checking the result. If a page is owned by a
> PV guest, it will try to replace the pages.
>> 
>> I did some basic testing, tried free pages and PV guest
> pages and is ok. Of course, I need more test on it. And more
> robust error handling is needed.
>> 
>> Any suggestion is welcome.
>> 
>> Thanks
>> Yunhong Jiang
> 
> 
> 
> 
> 
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, Citrix Systems (R&D) Ltd.
> [Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-10  9:29   ` Jiang, Yunhong
@ 2009-02-10  9:42     ` Tim Deegan
  2009-02-10 10:29     ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Tim Deegan @ 2009-02-10  9:42 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel, Keir Fraser

Hi, 

At 04:29 -0500 on 10 Feb (1234240157), Jiang, Yunhong wrote:
> For Xen part, the reason is because the currently heap allocator is
> based on buddy system, while the offlined page may be in the middle of
> the buddy, so we have to split the original buddy into several smaller
> one.

Right, I see.  It still seems like a lot of code but as I said I haven't
worked through the detail.

>  And do you mean I need turn to Keir for the heap allocator
> changes?

Well, Keir will have to ack/nack the patches eventually. :) But I'm
happy to have ago at them first.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-10  9:29   ` Jiang, Yunhong
  2009-02-10  9:42     ` Tim Deegan
@ 2009-02-10 10:29     ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-02-10 10:29 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 10/02/2009 09:29, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> - It seems like a lot of code for what it does.  On the Xen
>> side that's
>>   just a general impression since I'm not familiar with the
>> bits of the
> 
> For Xen part, the reason is because the currently heap allocator is based on
> buddy system, while the offlined page may be in the middle of the buddy, so we
> have to split the original buddy into several smaller one.
> And do you mean I need turn to Keir for the heap allocator changes?

I will look at them, yes. It's probably more my area than Tim's.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-09  8:54 [RFC][PATCH] Basic support for page offline Jiang, Yunhong
  2009-02-10  9:15 ` Tim Deegan
@ 2009-02-10 21:09 ` Frank van der Linden
  2009-02-11  0:16   ` Jiang, Yunhong
  2009-02-13 17:03 ` Tim Deegan
  2 siblings, 1 reply; 61+ messages in thread
From: Frank van der Linden @ 2009-02-10 21:09 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

Would it be possible to add page status query to the hypercalls? E.g. 
return the status and owner for just one page, without trying to 
on/offline it?

The same goes for the CPU hotplug code, but that's another issue.

The Solaris FMA code likes to have the option of querying the status, so 
that's why we'd like to have such an option.

- Frank

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-10 21:09 ` Frank van der Linden
@ 2009-02-11  0:16   ` Jiang, Yunhong
  2009-02-11  0:39     ` Frank van der Linden
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-11  0:16 UTC (permalink / raw)
  To: Frank.Vanderlinden; +Cc: xen-devel


Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
> Would it be possible to add page status query to the hypercalls? E.g.
> return the status and owner for just one page, without trying to on/offline

Frank, can you please elabrate what's the status mean? What's the usage by Solaris FMA code? Do you mean how the page is used (i.e. information in the page_info->count_info)? 

As for the ownership of the page, I'm not sure if we can do that without trying online/offline it, since the page may be assigned to another guest after the query if it is not marked offline-pending (mark offline pending make sure it will not be used anymore if freed).

-- Yunhong Jiang

> it? 
> 
> The same goes for the CPU hotplug code, but that's another issue.
> 
> The Solaris FMA code likes to have the option of querying the status, so
> that's why we'd like to have such an option.
> 
> - Frank

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-11  0:16   ` Jiang, Yunhong
@ 2009-02-11  0:39     ` Frank van der Linden
  2009-02-11  1:08       ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Frank van der Linden @ 2009-02-11  0:39 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

Jiang, Yunhong wrote:
> 
> Frank, can you please elabrate what's the status mean? What's the usage by Solaris FMA code? Do you mean how the page is used (i.e. information in the page_info->count_info)? 
> 
> As for the ownership of the page, I'm not sure if we can do that without trying online/offline it, since the page may be assigned to another guest after the query if it is not marked offline-pending (mark offline pending make sure it will not be used anymore if freed).

The usage is pretty simple: there is an interface to:

* retire a page
* unretire a page
* check on the status

The status doesn't have to have much info. The Solaris interface returns 
one of:

* retired
* not retired
* pending
* invalid page

I understand that the owner may be harder to return, and it isn't really 
important, so it doesn't matter if that information is not available 
from the hypercall.

- Frank

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-11  0:39     ` Frank van der Linden
@ 2009-02-11  1:08       ` Jiang, Yunhong
  2009-02-11  4:08         ` Frank Van Der Linden
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-11  1:08 UTC (permalink / raw)
  To: Frank van der Linden; +Cc: xen-devel

So you mean just the online/offline status? That make sense. I will do that.

Thanks
Yunhong Jiang

xen-devel-bounces@lists.xensource.com <> wrote:
> Jiang, Yunhong wrote:
>> 
>> Frank, can you please elabrate what's the status mean?
> What's the usage by Solaris FMA code? Do you mean how the page
> is used (i.e. information in the page_info->count_info)?
>> 
>> As for the ownership of the page, I'm not sure if we can do
> that without trying online/offline it, since the page may be
> assigned to another guest after the query if it is not marked
> offline-pending (mark offline pending make sure it will not be
> used anymore if freed).
> 
> The usage is pretty simple: there is an interface to:
> 
> * retire a page
> * unretire a page
> * check on the status
> 
> The status doesn't have to have much info. The Solaris
> interface returns
> one of:
> 
> * retired
> * not retired
> * pending
> * invalid page
> 
> I understand that the owner may be harder to return, and it
> isn't really
> important, so it doesn't matter if that information is not available from
> the hypercall. 
> 
> - Frank
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-11  1:08       ` Jiang, Yunhong
@ 2009-02-11  4:08         ` Frank Van Der Linden
  0 siblings, 0 replies; 61+ messages in thread
From: Frank Van Der Linden @ 2009-02-11  4:08 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

Jiang, Yunhong wrote:
> So you mean just the online/offline status? That make sense. I will do that.
>
>   
Cool, thanks!

- Frank

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-09  8:54 [RFC][PATCH] Basic support for page offline Jiang, Yunhong
  2009-02-10  9:15 ` Tim Deegan
  2009-02-10 21:09 ` Frank van der Linden
@ 2009-02-13 17:03 ` Tim Deegan
  2009-02-13 17:36   ` Keir Fraser
  2009-02-15  9:48   ` Jiang, Yunhong
  2 siblings, 2 replies; 61+ messages in thread
From: Tim Deegan @ 2009-02-13 17:03 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

Hi, 

So a few more comments on the detail of those patches.  

I had imagined that you would suspend the domain, then update the p2m
and pagetables in the guest memory from the _tools_.  That would involve
less code (possibly none) in Xen, and is how I'd prefer it.  But your
current approach probably catches more of the corner cases (grant tables
&c) than the tools could, so that's OK.

update_pgtable_entry() needs a more descriptive name!  It updates
potentially very many pagetable entries, and in a particular way. 
Also it probably ought to be static.

The reference counting in update_pgtable_entry() is confusing -- it
should probably always do reference counting for both the old and new
entries; that seems more robust than only doing the decrements there and
manually setting count_info and type_info on the new page in
replace_page.

In replace_page(), your error paths are confused: the ENOMEM error case
drops a ref that wasn't taken and if get_page() fails you don't free the
allocated page.

Both of those functions need comments describing what they do and what
their arguments are.

memory_page_offline(): again, check your error and exit paths; I'm
pretty sure you leak references to the domain.  Why does this take a
domain, by the way?  can't it just take a range of MFNs and figure out
the owning domain for each one as it goes?

Also, isn't the returned nr_offlined value always one less than was
requested?  You write back the _index_ of the highest-numbered frame
that you _attempted_ to offline, which is a pretty confusing number.

Other than that, the xen mm patch just needs a good scattering of
comments.

The tools patch is enormous, and seems to copy big chunks of
xc_domain_save into a new file.  And since Xen is now doing the hard
work of pagetable manipulation, I don't think you even need to suspend
the guest -- just pausing it should be enough and is much easier.

If you do need to use the suspend/resume code in later stages of
development, please don't copy it out; just make a libxc function that
calls the existing functions appropriately.

I'll leave page_offline_xen.patch to Keir since he's said he'll do it,
but 700 new lines of code seems like quite a lot -- surely some subsets
of he existing buddy splitting and merging code could be split out and
reused?

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [RFC][PATCH] Basic support for page offline
  2009-02-13 17:03 ` Tim Deegan
@ 2009-02-13 17:36   ` Keir Fraser
  2009-02-15  9:39     ` Jiang, Yunhong
  2009-02-15  9:48   ` Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-02-13 17:36 UTC (permalink / raw)
  To: Tim Deegan, Jiang, Yunhong; +Cc: xen-devel

On 13/02/2009 17:03, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:

> I'll leave page_offline_xen.patch to Keir since he's said he'll do it,
> but 700 new lines of code seems like quite a lot -- surely some subsets
> of he existing buddy splitting and merging code could be split out and
> reused?

It could indeed surely be half the size. It's split for some reason into a
chain of about five functions, each of which does a little bit of the work.
Just merge them all together. And unless you have a good reason for
currently expecting to offline large ranges of pages, and can measure a
substantial performance difference, I would actually just offline one page a
at a time -- implement that in a function and call it repeatedly from a for
loop. It will be far less complex and for bad-page offlining should perform
just fine.

Also some comments about what the difference is between offlining, offlined,
and broken would be nice. The change in free_heap_pages() to preserve
PGC_offlining|PGC_broken struck me as particularly worrying -- no sane C
programmer should write it like that, which just makes me more worried about
the verbosity of the rest.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [RFC][PATCH] Basic support for page offline
  2009-02-13 17:36   ` Keir Fraser
@ 2009-02-15  9:39     ` Jiang, Yunhong
  0 siblings, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-15  9:39 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1614 bytes --]

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 13/02/2009 17:03, "Tim Deegan" <Tim.Deegan@citrix.com> wrote:
> 
>> I'll leave page_offline_xen.patch to Keir since he's said he'll do it,
>> but 700 new lines of code seems like quite a lot -- surely some subsets
>> of he existing buddy splitting and merging code could be split out and
>> reused?
> 
> It could indeed surely be half the size. It's split for some
> reason into a
> chain of about five functions, each of which does a little bit
> of the work.
> Just merge them all together. And unless you have a good reason for
> currently expecting to offline large ranges of pages, and can measure a
> substantial performance difference, I would actually just
> offline one page a
> at a time -- implement that in a function and call it
> repeatedly from a for
> loop. It will be far less complex and for bad-page offlining
> should perform
> just fine.

The reasonto offline large ranges of pages is to offline a DIMM, but yes that may not important currently, also that is not performance critical. We only support one page each time, that is sure to be much simpler. 

> 
> Also some comments about what the difference is between
> offlining, offlined,
> and broken would be nice. The change in free_heap_pages() to preserve
> PGC_offlining|PGC_broken struck me as particularly worrying --
> no sane C
> programmer should write it like that, which just makes me more
> worried about
> the verbosity of the rest.

Sure, will add comments for how the page status change.

Thanks
Yunhong Jiang

> 
> -- Keir

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-13 17:03 ` Tim Deegan
  2009-02-13 17:36   ` Keir Fraser
@ 2009-02-15  9:48   ` Jiang, Yunhong
  2009-02-16 14:31     ` Tim Deegan
  1 sibling, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-15  9:48 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> Hi,
> 
> So a few more comments on the detail of those patches.
> 
> I had imagined that you would suspend the domain, then update the p2m
> and pagetables in the guest memory from the _tools_.  That
> would involve
> less code (possibly none) in Xen, and is how I'd prefer it.  But your
> current approach probably catches more of the corner cases (grant tables
> &c) than the tools could, so that's OK.
> 
> update_pgtable_entry() needs a more descriptive name!  It updates
> potentially very many pagetable entries, and in a particular way.
> Also it probably ought to be static.

Tim, thanks for your feedback very much. Yes, the update_pgtable_entry() will update potentially very much page table entries, I'm not sure that's the right method to achieve it.

> 
> The reference counting in update_pgtable_entry() is confusing -- it
> should probably always do reference counting for both the old and new
> entries; that seems more robust than only doing the decrements
> there and
> manually setting count_info and type_info on the new page in replace_page.

Sure, I will do like this.

> 
> In replace_page(), your error paths are confused: the ENOMEM error case
> drops a ref that wasn't taken and if get_page() fails you
> don't free the
> allocated page.
> 
> Both of those functions need comments describing what they do and what
> their arguments are. 
> 
> memory_page_offline(): again, check your error and exit paths; I'm
> pretty sure you leak references to the domain.  Why does this take a
> domain, by the way?  can't it just take a range of MFNs and figure out
> the owning domain for each one as it goes?
> 
> Also, isn't the returned nr_offlined value always one less than was
> requested?  You write back the _index_ of the highest-numbered frame
> that you _attempted_ to offline, which is a pretty confusing number.
> 
> Other than that, the xen mm patch just needs a good scattering of comments.

Sure, I will update it.

> 
> The tools patch is enormous, and seems to copy big chunks of
> xc_domain_save into a new file.  And since Xen is now doing the hard
> work of pagetable manipulation, I don't think you even need to suspend
> the guest -- just pausing it should be enough and is much easier.

But I'm not sure if we can update the P2M table from Xen side, that's the reason I did the it in the user space.

-- Yunhong Jiang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-15  9:48   ` Jiang, Yunhong
@ 2009-02-16 14:31     ` Tim Deegan
  2009-02-16 15:25       ` Jiang, Yunhong
  2009-02-18 14:51       ` Jiang, Yunhong
  0 siblings, 2 replies; 61+ messages in thread
From: Tim Deegan @ 2009-02-16 14:31 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

At 04:48 -0500 on 15 Feb (1234673293), Jiang, Yunhong wrote:
> > The reference counting in update_pgtable_entry() is confusing -- it
> > should probably always do reference counting for both the old and new
> > entries; that seems more robust than only doing the decrements
> > there and
> > manually setting count_info and type_info on the new page in replace_page.
> 
> Sure, I will do like this.

In fact, it should use the existing PTE-updating code -- I suspect that,
for example, your code won't work at all on a guest that has shadow
pagetables enabled.

> > The tools patch is enormous, and seems to copy big chunks of
> > xc_domain_save into a new file.  And since Xen is now doing the hard
> > work of pagetable manipulation, I don't think you even need to suspend
> > the guest -- just pausing it should be enough and is much easier.
> 
> But I'm not sure if we can update the P2M table from Xen side, that's
> the reason I did the it in the user space.

In that case, why don't you update the pagetables from the tools as
well?  That way you'd avoid walking the guest pagetables in Xen.  You
could make all the PTE changes, try to free the page, and if it still
doesn't work (because there's some other refcount held), put things back
the way they were.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-16 14:31     ` Tim Deegan
@ 2009-02-16 15:25       ` Jiang, Yunhong
  2009-02-18 14:51       ` Jiang, Yunhong
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-16 15:25 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1954 bytes --]

 

>-----Original Message-----
>From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
>Sent: 2009年2月16日 22:31
>To: Jiang, Yunhong
>Cc: xen-devel@lists.xensource.com
>Subject: Re: [RFC][PATCH] Basic support for page offline
>
>At 04:48 -0500 on 15 Feb (1234673293), Jiang, Yunhong wrote:
>> > The reference counting in update_pgtable_entry() is confusing -- it
>> > should probably always do reference counting for both the 
>old and new
>> > entries; that seems more robust than only doing the decrements
>> > there and
>> > manually setting count_info and type_info on the new page 
>in replace_page.
>> 
>> Sure, I will do like this.
>
>In fact, it should use the existing PTE-updating code -- I 
>suspect that,
>for example, your code won't work at all on a guest that has shadow
>pagetables enabled.

Yes, we need work differently depends on guest's paging mode. I forgot PV guest will use shadow mode for log dirty.
Just as you said, doing this in user space tools will be much simpler, I will consider more on that option.


>
>> > The tools patch is enormous, and seems to copy big chunks of
>> > xc_domain_save into a new file.  And since Xen is now 
>doing the hard
>> > work of pagetable manipulation, I don't think you even 
>need to suspend
>> > the guest -- just pausing it should be enough and is much easier.
>> 
>> But I'm not sure if we can update the P2M table from Xen side, that's
>> the reason I did the it in the user space.
>
>In that case, why don't you update the pagetables from the tools as
>well?  That way you'd avoid walking the guest pagetables in Xen.  You
>could make all the PTE changes, try to free the page, and if it still
>doesn't work (because there's some other refcount held), put 
>things back
>the way they were.


>
>Tim.
>
>-- 
>Tim Deegan <Tim.Deegan@citrix.com>
>Principal Software Engineer, Citrix Systems (R&D) Ltd.
>[Company #02300071, SL9 0DZ, UK.]
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-16 14:31     ` Tim Deegan
  2009-02-16 15:25       ` Jiang, Yunhong
@ 2009-02-18 14:51       ` Jiang, Yunhong
  2009-02-18 15:20         ` Tim Deegan
  1 sibling, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-18 14:51 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2945 bytes --]


>-----Original Message-----
>From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
>Sent: 2009年2月16日 22:31
>To: Jiang, Yunhong
>Cc: xen-devel@lists.xensource.com
>Subject: Re: [RFC][PATCH] Basic support for page offline
>
>At 04:48 -0500 on 15 Feb (1234673293), Jiang, Yunhong wrote:
>> > The reference counting in update_pgtable_entry() is confusing -- it
>> > should probably always do reference counting for both the 
>old and new
>> > entries; that seems more robust than only doing the decrements
>> > there and
>> > manually setting count_info and type_info on the new page 
>in replace_page.
>> 
>> Sure, I will do like this.
>
>In fact, it should use the existing PTE-updating code -- I 
>suspect that,
>for example, your code won't work at all on a guest that has shadow
>pagetables enabled.

Can you please share me which existing PET-updating code? I browsed the code and didn't find approprate function, especially considering we need update all level page tables.

>
>> > The tools patch is enormous, and seems to copy big chunks of
>> > xc_domain_save into a new file.  And since Xen is now 
>doing the hard
>> > work of pagetable manipulation, I don't think you even 
>need to suspend
>> > the guest -- just pausing it should be enough and is much easier.
>> 
>> But I'm not sure if we can update the P2M table from Xen side, that's
>> the reason I did the it in the user space.
>
>In that case, why don't you update the pagetables from the tools as
>well?  That way you'd avoid walking the guest pagetables in Xen.  You
>could make all the PTE changes, try to free the page, and if it still
>doesn't work (because there's some other refcount held), put 
>things back
>the way they were.

Just as you stated before, there may have some corner case need considered, grant table etc (some is missed in my previous patch). For example, if guest has been granted, but the remote domain has not map it (i.e. it is in grant_table->shared, but not been mapped still), there should have no reference added, but if we don't hold the grant table lock, then the backend may mapped a wrong mfn. Such situation is difficult to be solved in tools. BTW, I suspect I may missed more reference count, for example, the page may be pinned etc, I will consider that in my next patch.

Also, as to your suggestion in previous mail of "since Xen is now doing the hard work of pagetable manipulation, I don't think you even need to suspend the guest -- just pausing it should be enough and is much easier", I suppose suspend guest will make thing much simpler. For example, we need consider the preempted hypercall that handle page table page (for example, the page table may be partially validated still when we do the offline).

Thanks
Yunhong Jiang

>
>Tim.
>
>-- 
>Tim Deegan <Tim.Deegan@citrix.com>
>Principal Software Engineer, Citrix Systems (R&D) Ltd.
>[Company #02300071, SL9 0DZ, UK.]
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-18 14:51       ` Jiang, Yunhong
@ 2009-02-18 15:20         ` Tim Deegan
  2009-02-19  8:44           ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Tim Deegan @ 2009-02-18 15:20 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel

Hi,

At 14:51 +0000 on 18 Feb (1234968681), Jiang, Yunhong wrote: 
> Can you please share me which existing PET-updating code? I browsed
> the code and didn't find approprate function, especially considering
> we need update all level page tables.

xen/arch/x86/mm.c: mod_l1_entry() mod_l2_entry(), etc.

> Just as you stated before, there may have some corner case need
> considered, grant table etc (some is missed in my previous patch). For
> example, if guest has been granted, but the remote domain has not map
> it (i.e. it is in grant_table->shared, but not been mapped still),
> there should have no reference added, but if we don't hold the grant
> table lock, then the backend may mapped a wrong mfn. Such situation is
> difficult to be solved in tools. BTW, I suspect I may missed more
> reference count, for example, the page may be pinned etc, I will
> consider that in my next patch.

It should be possible to have the tools do all the PTE manipulations
with MMU update hypercalls (I think -- Keir may correct me here). Then 
the final hypercall to surrender the page will fail if the grant tables
are wrong; if it does, put the PTEs back and fall back to a live
migration.  Isn't that what your in-xen code does anyway?

> Also, as to your suggestion in previous mail of "since Xen is now
> doing the hard work of pagetable manipulation, I don't think you even
> need to suspend the guest -- just pausing it should be enough and is
> much easier"

As you already pointed out, I was wrong about that -- you need to
suspend the guest to do the p2m manipulations. 

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-18 15:20         ` Tim Deegan
@ 2009-02-19  8:44           ` Jiang, Yunhong
  2009-02-19 14:37             ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-19  8:44 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> Hi,
> 
> At 14:51 +0000 on 18 Feb (1234968681), Jiang, Yunhong wrote:
>> Can you please share me which existing PET-updating code? I browsed
>> the code and didn't find approprate function, especially considering
>> we need update all level page tables.
> 
> xen/arch/x86/mm.c: mod_l1_entry() mod_l2_entry(), etc.

Yes, I considered this also, I give it up because these function will do more than simply swap the entry, for example, it may try to re-validate the page table again. But yes, I should use them because they are common function and performance is not important for us.

> It should be possible to have the tools do all the PTE manipulations
> with MMU update hypercalls (I think -- Keir may correct me here). Then
> the final hypercall to surrender the page will fail if the grant tables
> are wrong; if it does, put the PTEs back and fall back to a live
> migration.  Isn't that what your in-xen code does anyway?

Yes, although we may need a very big multicall to achieve it. I didn't see any issue with this method and will work this way.

BTW, I think what we are doing is in fact something like memory_exchange(), although memory_exchange() requires the source memory is not referenced anymore.

Thanks
Yunhong Jiang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-02-19  8:44           ` Jiang, Yunhong
@ 2009-02-19 14:37             ` Jiang, Yunhong
  2009-03-02 11:56               ` Tim Deegan
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-02-19 14:37 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel


>> It should be possible to have the tools do all the PTE manipulations
>> with MMU update hypercalls (I think -- Keir may correct me here). Then
>> the final hypercall to surrender the page will fail if the grant tables
>> are wrong; if it does, put the PTEs back and fall back to a live
>> migration.  Isn't that what your in-xen code does anyway?

Tim, after checking the code more carefully, seems currently the MMU update hypercalls (including mod_lx_entry ) assume it is for current domain, while in our usage model, it will update MMU for other domain, so I will try to do following changes: 1) change mod_lx_entry() to get a domain parameter 2) Add a new hypercall (or a new command to do_mmu_update ) to update the MMU for other domain. I'm not sure if there are other usage model for such requirement, and if such changes acceptable? Any feedback is welcome.

Thanks
Yunhong Jiang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] Basic support for page offline
  2009-02-19 14:37             ` Jiang, Yunhong
@ 2009-03-02 11:56               ` Tim Deegan
  2009-03-04  8:23                 ` Jiang, Yunhong
  2009-03-18 10:24                 ` [PATCH] Support swap a page from user space tools -- Was " Jiang, Yunhong
  0 siblings, 2 replies; 61+ messages in thread
From: Tim Deegan @ 2009-03-02 11:56 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel, Keir Fraser

At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
> 
> >> It should be possible to have the tools do all the PTE manipulations
> >> with MMU update hypercalls (I think -- Keir may correct me here). Then
> >> the final hypercall to surrender the page will fail if the grant tables
> >> are wrong; if it does, put the PTEs back and fall back to a live
> >> migration.  Isn't that what your in-xen code does anyway?
> 
> Tim, after checking the code more carefully, seems currently the MMU update hypercalls (including mod_lx_entry ) assume it is for current domain, while in our usage model, it will update MMU for other domain, so I will try to do following changes: 1) change mod_lx_entry() to get a domain parameter 2) Add a new hypercall (or a new command to do_mmu_update ) to update the MMU for other domain. I'm not sure if there are other usage model for such requirement, and if such changes acceptable? Any feedback is welcome.
> 

Sorry for the delay -- I was travelling around the summit and this got
lost.  Yes, I think this is an OK approach.  I doubt there will be many
other users of such a hypercall since most OSes will get upset by their
PTEs changing under their feet, but I prefer it to the current patch.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [RFC][PATCH] Basic support for page offline
  2009-03-02 11:56               ` Tim Deegan
@ 2009-03-04  8:23                 ` Jiang, Yunhong
  2009-03-18 10:24                 ` [PATCH] Support swap a page from user space tools -- Was " Jiang, Yunhong
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-04  8:23 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser

Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
>> 
>>>> It should be possible to have the tools do all the PTE manipulations
>>>> with MMU update hypercalls (I think -- Keir may correct me here). Then
>>>> the final hypercall to surrender the page will fail if the grant tables
>>>> are wrong; if it does, put the PTEs back and fall back to a live
>>>> migration.  Isn't that what your in-xen code does anyway?
>> 
>> Tim, after checking the code more carefully, seems currently
> the MMU update hypercalls (including mod_lx_entry ) assume it
> is for current domain, while in our usage model, it will
> update MMU for other domain, so I will try to do following
> changes: 1) change mod_lx_entry() to get a domain parameter 2)
> Add a new hypercall (or a new command to do_mmu_update ) to
> update the MMU for other domain. I'm not sure if there are
> other usage model for such requirement, and if such changes
> acceptable? Any feedback is welcome.
>> 
> 
> Sorry for the delay -- I was travelling around the summit and this got
> lost.  Yes, I think this is an OK approach.  I doubt there will be many
> other users of such a hypercall since most OSes will get upset by their
> PTEs changing under their feet, but I prefer it to the current patch.

Sure, I will do this way.

Thanks
Yunhong Jiang

> 
> Cheers,
> 
> Tim.
> 
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, Citrix Systems (R&D) Ltd.
> [Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-02 11:56               ` Tim Deegan
  2009-03-04  8:23                 ` Jiang, Yunhong
@ 2009-03-18 10:24                 ` Jiang, Yunhong
  2009-03-18 10:32                   ` Jiang, Yunhong
  2009-03-18 17:34                   ` Tim Deegan
  1 sibling, 2 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-18 10:24 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 1889 bytes --]

Tim, this is the implementation as discussed.
The swap_page.patch is for HV side change, basically it call the mod_lx_entry to update the table.
The free_page.patch is the function from user space tools to offlien a page.

Thanks
Yunhong Jiang

Jiang, Yunhong <> wrote:
> Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
>> At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
>>> 
>>>>> It should be possible to have the tools do all the PTE manipulations
>>>>> with MMU update hypercalls (I think -- Keir may correct me here). Then
>>>>> the final hypercall to surrender the page will fail if the grant tables
>>>>> are wrong; if it does, put the PTEs back and fall back to a live
>>>>> migration.  Isn't that what your in-xen code does anyway?
>>> 
>>> Tim, after checking the code more carefully, seems currently
>> the MMU update hypercalls (including mod_lx_entry ) assume it
>> is for current domain, while in our usage model, it will
>> update MMU for other domain, so I will try to do following
>> changes: 1) change mod_lx_entry() to get a domain parameter 2)
>> Add a new hypercall (or a new command to do_mmu_update ) to
>> update the MMU for other domain. I'm not sure if there are
>> other usage model for such requirement, and if such changes
>> acceptable? Any feedback is welcome.
>>> 
>> 
>> Sorry for the delay -- I was travelling around the summit and this got
>> lost.  Yes, I think this is an OK approach.  I doubt there will be many
>> other users of such a hypercall since most OSes will get upset by their
>> PTEs changing under their feet, but I prefer it to the current patch.
> 
> Sure, I will do this way.
> 
> Thanks
> Yunhong Jiang
> 
>> 
>> Cheers,
>> 
>> Tim.
>> 
>> --
>> Tim Deegan <Tim.Deegan@citrix.com>
>> Principal Software Engineer, Citrix Systems (R&D) Ltd.
>> [Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: swap_page.patch --]
[-- Type: application/octet-stream, Size: 7373 bytes --]

Add a hypercall to swap a page in a page table entry.

Add a new hypercall to update a page table entry from a old page to a new page. When the old page has no referecen anymore, it will be released.
This is mainly used for page offline. When a page owned by a guest is marked offline pending, user space tools can scan the domain's page table and then update the entry one by one. When the page is freed, the page status will become offlined.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r a45117dc9168 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Tue Mar 17 21:11:35 2009 +0800
+++ b/xen/arch/x86/mm.c	Wed Mar 18 01:46:26 2009 +0800
@@ -2924,6 +2924,154 @@ int do_mmuext_op(
             break;
         }
 
+#define get_new_entry(x)  \
+            x##_pgentry_t old_entry, new_entry; \
+            if ( unlikely(__copy_from_user( \
+                      &old_entry, va, sizeof(new_entry)) != 0) )   \
+            {   \
+                goto error_swap;    \
+            }   \
+            old_mfn = x##e_get_pfn(old_entry);  \
+            new_entry = x##e_from_page(mfn_to_page(new_mfn),\
+                        x##e_get_flags(old_entry));
+
+        case MMUEXT_SWAP_PAGE:
+        {
+            struct domain *d = rcu_lock_domain_by_id(foreigndom);
+            struct vcpu *v = d->vcpu[0];
+            void *va = NULL;
+            xen_pfn_t new_mfn, pte_mfn, old_mfn = 0;
+
+            pte_mfn = (op.arg1.mfn >> PAGE_SHIFT);
+            new_mfn = op.arg2.new_mfn;
+
+            rc = -EINVAL;
+
+            if ( !IS_PRIV(current->domain) )
+                return -EPERM;
+
+            if ( !mfn_valid(new_mfn) || !get_page_from_pagenr(new_mfn, d) ||
+                 !mfn_valid(pte_mfn) || !get_page_from_pagenr(pte_mfn, d) )
+            {
+                dprintk(XENLOG_WARNING,
+                        "SWAP_PAGE with invalid new mfn %lx or pte_mfn %lx \n",
+                         new_mfn, pte_mfn);
+                break;
+            }
+
+            page = mfn_to_page(pte_mfn);
+
+            va = map_domain_page(pte_mfn);
+            va = (void *)((unsigned long)va +
+                          (unsigned long)(op.arg1.mfn & ~PAGE_MASK));
+
+            if ( page_lock(page) )
+            {
+                switch ( page->u.inuse.type_info & PGT_type_mask )
+                {
+                    case PGT_l1_page_table:
+                        {
+                            get_new_entry(l1);
+                            if ( !mfn_valid(old_mfn) ||
+                                 !get_page_from_pagenr(old_mfn, d) )
+                                 break;
+                            rc = 0;
+                            okay = mod_l1_entry(va, new_entry, pte_mfn, 1, v);
+                            put_page(mfn_to_page(old_mfn));
+                        }
+                        break;
+                    case PGT_l2_page_table:
+                        {
+                            get_new_entry(l2);
+                            if ( !mfn_valid(old_mfn) ||
+                                 !get_page_from_pagenr(old_mfn, d) )
+                                 break;
+                            rc = 0;
+                            okay = mod_l2_entry(va, new_entry, pte_mfn, 1, v);
+                            put_page(mfn_to_page(old_mfn));
+                        }
+                        break;
+                    case PGT_l3_page_table:
+                        {
+                            get_new_entry(l3);
+                            if ( !mfn_valid(old_mfn) ||
+                                 !get_page_from_pagenr(old_mfn, d) )
+                                 break;
+                            rc = mod_l3_entry(va, new_entry, pte_mfn, 1, 1, v);
+                            okay = !rc;
+                            put_page(mfn_to_page(old_mfn));
+                        }
+                        break;
+#if CONFIG_PAGING_LEVELS >= 4
+                    case PGT_l4_page_table:
+                        {
+                            get_new_entry(l4);
+                            if ( !mfn_valid(old_mfn) ||
+                                 !get_page_from_pagenr(old_mfn, d) )
+                                 break;
+                            rc = mod_l4_entry(va, new_entry, pte_mfn, 1, 1, v);
+                            okay = !rc;
+                            put_page(mfn_to_page(old_mfn));
+                        }
+                        break;
+#endif
+                    default:
+                        perfc_incr(writable_mmu_updates);
+                        dprintk(XENLOG_WARNING,
+                                "pte_mfn %lx is not page table page\n",
+                                pte_mfn );
+                        break;
+                }
+
+                if ( (mfn_to_page(old_mfn)->count_info &
+                      (PGC_count_mask|PGC_allocated)) ==
+                     (1 | PGC_allocated) )
+                {
+                    /* No referece to this page from page table anymore */
+                    xen_pfn_t old_pfn;
+
+                    old_pfn = mfn_to_gmfn(d, old_mfn);
+
+                    /* release original one */
+                    steal_page(d,  mfn_to_page(old_mfn), 0);
+
+                    if ( !test_and_clear_bit(_PGC_allocated,
+                          &(mfn_to_page(old_mfn)->count_info)) )
+                        BUG();
+
+                    guest_physmap_remove_page(d, mfn_to_gmfn(d, mfn), mfn, 0);
+
+                    put_page(mfn_to_page(old_mfn));
+
+                    /* Setup the new page */
+                    guest_physmap_add_page(d, old_pfn, new_mfn, 0);
+
+                    if ( !paging_mode_translate(d) )
+                    {
+                        set_gpfn_from_mfn(new_mfn, old_pfn);
+                    }
+                }
+
+                if ( rc == -EINTR )
+                    rc = -EAGAIN;
+error_swap:
+                page_unlock(page);
+
+            }
+            else
+            {
+                dprintk(XENLOG_WARNING, "fail to lock the page in swap page\n");
+                rc = -EAGAIN;
+            }
+
+            put_page(mfn_to_page(pte_mfn));
+            put_page(mfn_to_page(new_mfn));
+
+            if (va)
+                unmap_domain_page(va);
+            break;
+        }
+
         default:
             MEM_LOG("Invalid extended pt command 0x%x", op.cmd);
             rc = -ENOSYS;
diff -r a45117dc9168 xen/include/public/xen.h
--- a/xen/include/public/xen.h	Tue Mar 17 21:11:35 2009 +0800
+++ b/xen/include/public/xen.h	Wed Mar 18 01:45:11 2009 +0800
@@ -256,13 +256,14 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define MMUEXT_NEW_USER_BASEPTR 15
 #define MMUEXT_CLEAR_PAGE       16
 #define MMUEXT_COPY_PAGE        17
+#define MMUEXT_SWAP_PAGE        18
 
 #ifndef __ASSEMBLY__
 struct mmuext_op {
     unsigned int cmd;
     union {
         /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
-         * CLEAR_PAGE, COPY_PAGE */
+         * CLEAR_PAGE, COPY_PAGE, SWAP_PAGE */
         xen_pfn_t     mfn;
         /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
         unsigned long linear_addr;
@@ -278,6 +279,9 @@ struct mmuext_op {
 #endif
         /* COPY_PAGE */
         xen_pfn_t src_mfn;
+
+        /* SWAP_PAGE */
+        xen_pfn_t new_mfn;
     } arg2;
 };
 typedef struct mmuext_op mmuext_op_t;

[-- Attachment #3: free_page.patch --]
[-- Type: application/octet-stream, Size: 26658 bytes --]

This is the user space tools support to offline a page. 

Changeset 19286:dd489125a2e7 implement the HV side page offline logic. This patch add support to use space tools.

When a page owned by a PV domain is marked "offline pending", user space tools will try to suspend the guest, scan the whole page table and update the page table entry one by one. If no more reference to the page, the page become "offlined".

xc_offline_page() is added will update all reference to the offline pending pages.
xc_map_m2p() is exported from xc_domain_save.c so that we don't need implement it again.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r fa6b2732778f tools/libxc/Makefile
--- a/tools/libxc/Makefile	Wed Mar 18 00:19:25 2009 +0800
+++ b/tools/libxc/Makefile	Wed Mar 18 00:19:28 2009 +0800
@@ -31,6 +31,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c
 GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += xc_offline_page.c
 GUEST_SRCS-$(CONFIG_HVM) += xc_hvm_build.c
 
 vpath %.c ../../xen/common/libelf
diff -r fa6b2732778f tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c	Wed Mar 18 00:19:25 2009 +0800
+++ b/tools/libxc/xc_domain_save.c	Wed Mar 18 00:19:28 2009 +0800
@@ -510,9 +510,10 @@ static int canonicalize_pagetable(unsign
     return race;
 }
 
-static xen_pfn_t *xc_map_m2p(int xc_handle,
-                                 unsigned long max_mfn,
-                                 int prot)
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0)
 {
     struct xen_machphys_mfn_list xmml;
     privcmd_mmap_entry_t *entries;
@@ -561,7 +562,8 @@ static xen_pfn_t *xc_map_m2p(int xc_hand
         goto err2;
     }
 
-    m2p_mfn0 = entries[0].mfn;
+    if (mfn0)
+        *mfn0 = entries[0].mfn;
 
 err2:
     free(entries);
@@ -1058,7 +1060,7 @@ int xc_domain_save(int xc_handle, int io
     }
 
     /* Setup the mfn_to_pfn table mapping */
-    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ)) )
+    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0)) )
     {
         ERROR("Failed to map live M2P table");
         goto out;
diff -r fa6b2732778f tools/libxc/xc_offline_page.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_offline_page.c	Wed Mar 18 01:42:25 2009 +0800
@@ -0,0 +1,388 @@
+/******************************************************************************
+ * xc_offline_page.c
+ *
+ * Helper functions to replace a offlining page
+ *
+ * Copyright (c) 2003, K A Fraser.
+ * Copyright (c) 2009, Intel Corporation.
+ */
+
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+#undef DEBUG
+#define DEBUG(_f, _a...) fprintf(stderr, _f , ## _a)
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_online;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_query_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+#define MAX_OFFLINE_BATCH 1024
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int swap_page(int xc_handle, xen_pfn_t mfn, xen_pfn_t new_mfn,
+                    int domid, xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    int nr_swaped = 0, pte_num;
+    uint64_t i;
+    struct mmuext_op swap[MAX_OFFLINE_BATCH];
+    void *content = NULL;
+
+    pte_num = PAGE_SIZE / ((pt_levels == 2) ? 4 : 8);
+
+    for (i = 0; i < p2m_size; i++)
+    {
+        xen_pfn_t table_mfn = pfn_to_mfn(i, p2m, guest_width);
+        uint64_t pte;
+        int j;
+
+        if ( table_mfn == INVALID_P2M_ENTRY )
+            continue;
+
+        if ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            content = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                            PROT_READ, table_mfn);
+            if (!content)
+                goto out;
+
+            for (j = 0; j < pte_num; j++)
+            {
+                if ( pt_levels == 2 )
+                    pte = ((const uint32_t*)content)[j];
+                else
+                    pte = ((const uint64_t*)content)[j];
+
+                if (!(pte & _PAGE_PRESENT))
+                    continue;
+
+                /* Hit one entry */
+                if ( ((pte >> PAGE_SHIFT) & MFN_MASK_X86) == mfn )
+                {
+                    swap[nr_swaped].cmd = MMUEXT_SWAP_PAGE;
+                    swap[nr_swaped].arg1.mfn = table_mfn << PAGE_SHIFT;
+                    swap[nr_swaped].arg1.mfn |= j * ( (pt_levels == 2) ?
+                                        sizeof(uint32_t): sizeof(uint64_t) );
+                    swap[nr_swaped].arg2.new_mfn = new_mfn;
+                    nr_swaped ++;
+                }
+
+                if ( nr_swaped == MAX_OFFLINE_BATCH )
+                {
+                    if ( xc_mmuext_op(xc_handle, swap, nr_swaped, domid) < 0 )
+                    {
+                        ERROR("Failed to swap");
+                        goto out;
+                    }
+                    nr_swaped = 0;
+                }
+            }
+
+            if ( nr_swaped )
+            {
+                if ( xc_mmuext_op(xc_handle, swap, nr_swaped, domid) < 0 )
+                {
+                    ERROR("Failed to swap");
+                    goto out;
+                }
+                nr_swaped = 0;
+            }
+
+            munmap(content, PAGE_SIZE);
+            content = NULL;
+        }
+    }
+    return 0;
+out:
+    /* XXX Shall we take action if we have fail to swap? */
+    if (content)
+        munmap(content, PAGE_SIZE);
+
+    return -1;
+}
+
+int xc_replace_page(int xc_handle, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    uint32_t status;
+    xc_dominfo_t info;
+    xen_pfn_t mfn, new_mfn;
+    struct mmuext_op unpin;
+    int rc, max_pages, broken = 0;
+    void *old_p, *tmp_p = NULL, *new_p = NULL;
+
+    if (!p2m || !pfn_type)
+        return -EINVAL;
+
+    mfn = pfn_to_mfn(pfn, p2m, guest_width);
+
+    /* This page has no mfn established?? */
+    if (mfn == INVALID_P2M_ENTRY)
+        return -EINVAL;
+
+    /* Target domain should be suspended already */
+    if ( (xc_domain_getinfo(xc_handle, domid, 1, &info) != 1) ||
+         !info.shutdown || (info.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain not in suspended state");
+        return -1;
+    }
+
+    /* Check if pages are offline pending or not */
+    rc = xc_query_page_offline_status(xc_handle, mfn, mfn, &status);
+
+    if (rc)
+        return rc;
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINE_PENDING) )
+    {
+        ERROR("Page %lx(mfn %lx) is not offline pending %x\n",
+               pfn, mfn, status);
+        return -EINVAL;
+    }
+
+    /* Unpin the page if it is pined */
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        unpin.cmd = MMUEXT_UNPIN_TABLE;
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("Failed to unpin page %lx", pfn);
+            return -EINVAL;
+        }
+    }
+
+    /* We increase the page limitation temp */
+    max_pages = xc_memory_op(xc_handle, XENMEM_maximum_reservation , &domid);
+    if (max_pages < 0)
+    {
+        ERROR("Failed to get max mfn\n");
+        goto undo_unpin;
+    }
+
+    max_pages ++;
+    xc_domain_setmaxmem(xc_handle, domid, max_pages << 2);
+
+    if (lock_pages(&new_mfn, sizeof(xen_pfn_t)))
+    {
+        ERROR("Could not lock new_mfn\n");
+        goto undo_maxmem;
+    }
+    rc = xc_domain_memory_increase_reservation(xc_handle, domid, 1, 0,
+                                     0x0, &new_mfn);
+
+    if (rc < 0)
+    {
+        ERROR("Failed to increase reservation \n");
+        goto undo_maxmem;
+    }
+
+    unlock_pages(&new_mfn, sizeof(xen_pfn_t));
+
+    /* Copy content from old page to new one */
+    old_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                PROT_READ, mfn);
+    tmp_p = malloc(PAGE_SIZE);
+
+    new_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                PROT_READ, new_mfn);
+
+    if (!old_p || !tmp_p || !new_p)
+        goto undo_increase;
+
+    memcpy(tmp_p, old_p, PAGE_SIZE);
+    munmap(old_p, PAGE_SIZE);
+
+    rc = swap_page(xc_handle, mfn, new_mfn, domid,
+                    p2m, pfn_type, p2m_size, pt_levels, guest_width);
+
+    if (rc)
+    {
+        ERROR("swap page failed\n");
+        /* No recover action now for swap fail */
+        broken = 1;
+        goto unmap_tmp_p;
+    }
+
+    /* Check if pages are offlined already */
+    rc = xc_query_page_offline_status(xc_handle,
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            &status);
+
+    if (rc)
+    {
+        ERROR("Fail to query offline status\n");
+        goto unmap_tmp_p;
+    }
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINED) )
+    {
+        ERROR("page is still not offlined, is it in granted to others? \n");
+        goto offline_error;
+    }
+    else
+    {
+        DEBUG("Now page is offlined %lx\n", pfn);
+        /* Update the p2m table */
+        p2m[pfn] = new_mfn;
+        memcpy(new_p, tmp_p, PAGE_SIZE);
+        munmap(new_p, PAGE_SIZE);
+        xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+    }
+
+    return 0;
+
+offline_error:
+    if (swap_page(xc_handle, new_mfn, mfn, domid,
+                  p2m, pfn_type, p2m_size, pt_levels, guest_width) < 0)
+        goto broken;
+
+unmap_tmp_p:
+    free(tmp_p);
+    munmap(new_p, PAGE_SIZE);
+
+
+undo_increase:
+    if (xc_domain_memory_decrease_reservation(xc_handle, domid, 1, 0,
+                                              &new_mfn))
+        goto broken;
+
+undo_maxmem:
+    xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+
+undo_unpin:
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        switch ( pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            case XEN_DOMCTL_PFINFO_L1TAB:
+                unpin.cmd = MMUEXT_PIN_L1_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L2TAB:
+                unpin.cmd = MMUEXT_PIN_L2_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L3TAB:
+                unpin.cmd = MMUEXT_PIN_L3_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L4TAB:
+                unpin.cmd = MMUEXT_PIN_L4_TABLE;
+                break;
+
+            default:
+                ERROR("Unpined for non pate table page\n");
+                break;
+        }
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("failed to pin the mfn again\n");
+            goto broken;
+        }
+    }
+    if (!broken)
+        return -1;
+broken:
+    return -2;
+}
diff -r fa6b2732778f tools/libxc/xenguest.h
--- a/tools/libxc/xenguest.h	Wed Mar 18 00:19:25 2009 +0800
+++ b/tools/libxc/xenguest.h	Wed Mar 18 00:19:28 2009 +0800
@@ -29,6 +29,31 @@ int xc_domain_save(int xc_handle, int io
                    void *(*init_qemu_maps)(int, unsigned),  /* HVM only */
                    void (*qemu_flip_buffer)(int, int));     /* HVM only */
 
+int xc_replace_page(int xc, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width);
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status);
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status);
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status);
+
+/**
+ * This function map msp table
+ * @parm xc_handle a handle to an open hypervisor interface
+ * @parm max_mfn the max pfn
+ * @parm prot the flags to map, such as read/write etc
+ * @parm mfn0 return the first mfn, can be NULL
+ * @return mapped m2p table on success, NULL on failure
+ */
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0);
 
 /**
  * This function will restore a saved domain.
diff -r fa6b2732778f tools/xcutils/Makefile
--- a/tools/xcutils/Makefile	Wed Mar 18 00:19:25 2009 +0800
+++ b/tools/xcutils/Makefile	Wed Mar 18 00:19:28 2009 +0800
@@ -14,7 +14,7 @@ CFLAGS += -Werror
 CFLAGS += -Werror
 CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore)
 
-PROGRAMS = xc_restore xc_save readnotes lsevtchn
+PROGRAMS = xc_restore xc_save readnotes lsevtchn xc_offline
 
 LDLIBS   = $(LDFLAGS_libxenctrl) $(LDFLAGS_libxenguest) $(LDFLAGS_libxenstore)
 
diff -r fa6b2732778f tools/xcutils/xc_offline.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/xcutils/xc_offline.c	Wed Mar 18 01:42:44 2009 +0800
@@ -0,0 +1,327 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General
+ * Public License.  See the file "COPYING" in the main directory of
+ * this archive for more details.
+ *
+ * Copyright (C) 2005 by Christian Limpach
+ *
+ */
+
+#include <err.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include <sys/ipc.h>
+#include <sys/shm.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <err.h>
+
+#include <xs.h>
+#include <xenctrl.h>
+#include <xenguest.h>
+#include <xc_private.h>
+#include <xc_core.h>
+
+#undef ERROR
+#undef DEBUG
+#define ERROR(fmr, args...) do { fprintf(stderr, "ERROR: " fmr , ##args); } while (0)
+#define DEBUG(fmr, args...) do { fprintf(stderr, "DEBUG: " fmr , ##args); } while (0)
+
+#define FPP             (PAGE_SIZE/(guest_width))
+
+/* Number of entries in the pfn_to_mfn_frame_list_list */
+#define P2M_FLL_ENTRIES (((p2m_size)+(FPP*FPP)-1)/(FPP*FPP))
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int get_pt_level(int xc_handle, uint32_t domid,
+                        unsigned int *pt_level,
+                        unsigned int *guest_width)
+{
+    DECLARE_DOMCTL;
+    xen_capabilities_info_t xen_caps = "";
+
+    if (xc_version(xc_handle, XENVER_capabilities, &xen_caps) != 0)
+        return -1;
+
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.domain = domid;
+    domctl.cmd = XEN_DOMCTL_get_address_size;
+
+    if ( do_domctl(xc_handle, &domctl) != 0 )
+        return -1;
+
+    *guest_width = domctl.u.address_size.size / 8;
+
+    if (strstr(xen_caps, "xen-3.0-x86_64"))
+        /* Depends on whether it's a compat 32-on-64 guest */
+        *pt_level = ( (*guest_width == 8) ? 4 : 3 );
+    else if (strstr(xen_caps, "xen-3.0-x86_32p"))
+        *pt_level = 3;
+    else if (strstr(xen_caps, "xen-3.0-x86_32"))
+        *pt_level = 2;
+    else
+        return -1;
+
+    return 0;
+}
+
+#define PG_OFFLINE_STATUS_HANDLED (1UL << 14)
+int
+main(int argc, char **argv)
+{
+    unsigned long start, end;
+    int xc_handle = 0, i, num, rc;
+    uint32_t *status = NULL;
+
+    if (argc != 3)
+        fprintf(stderr, "usage: %s start end", argv[0]);
+
+    start = strtoul(argv[1], NULL, 0);
+    end = strtoul(argv[2], NULL, 0);
+
+    xc_handle = xc_interface_open();
+
+    if (!xc_handle)
+        return -1;
+
+    num = end - start + 1;
+
+    rc = -ENOMEM;
+    status  = malloc(num * sizeof(uint32_t));
+    if (!status)
+        return -EINVAL;
+    memset(status, 0, sizeof(uint32_t)*num);
+
+    rc = xc_mark_page_offline(xc_handle, start, end, status);
+
+    if (rc)
+    {
+        ERROR("fail to mark pages offline %x\n", rc);
+        return -EINVAL;
+    }
+
+    for (i = 0; i < num; i++)
+    {
+        DEBUG("pfn %lx status %x\n", start + i, status[i]);
+
+        if (status[i] & PG_OFFLINE_STATUS_HANDLED)
+            continue;
+
+        switch (status[i] & PG_OFFLINE_STATUS_MASK)
+        {
+            case PG_OFFLINE_OFFLINED:
+                DEBUG("offlined page %lx\n", start + i);
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+                break;
+            break;
+
+            case PG_OFFLINE_PENDING:
+            {
+                uint32_t domid, j;
+                int port, xce = -1, rc;
+                unsigned long p2m_size;
+                xen_pfn_t *p2m_table = NULL;
+                xen_pfn_t *m2p_table = NULL;
+                xc_dominfo_t info;
+                uint64_aligned_t shared_info_frame;
+                shared_info_any_t *live_shinfo = NULL;
+                uint32_t *pfn_type = NULL;
+                unsigned long *pfn_real = NULL, max_mfn;
+                int suspend_evtchn = -1, suspended = 0;
+                uint32_t pt_level = 0, guest_width = 0;
+
+                domid = status[i] >> PG_OFFLINE_OWNER_SHIFT;
+
+                if ( !domid || (domid > DOMID_FIRST_RESERVED) )
+                {
+                    DEBUG("Dom0's page can't be LM");
+                    goto failed;
+                }
+
+                if ( get_pt_level(xc_handle, domid, &pt_level, &guest_width) )
+                {
+                    ERROR("Unable to get PT level info.");
+                    goto failed;
+                }
+
+                if ( xc_domain_getinfo(xc_handle, domid, 1, &info) != 1 )
+                {
+                    ERROR("Could not get domain info");
+                    goto failed;
+                }
+
+                if (info.hvm)
+                {
+                    DEBUG("please Live migrate dom %x\n", domid);
+                    goto failed;
+                }
+
+                /* Map the p2m table and M2P table */
+                shared_info_frame = info.shared_info_frame;
+
+                live_shinfo = xc_map_foreign_range(xc_handle, domid,
+                  PAGE_SIZE, PROT_READ, shared_info_frame);
+                if ( !live_shinfo )
+                {
+                    ERROR("Couldn't map live_shinfo");
+                    goto failed;
+                }
+
+                if ( (rc = xc_core_arch_map_p2m(xc_handle, guest_width, &info,
+                                          live_shinfo, &p2m_table,  &p2m_size)) )
+                {
+                    ERROR("Couldn't map p2m table %x\n", rc);
+                    goto failed;
+                }
+
+                max_mfn = xc_memory_op(xc_handle, XENMEM_maximum_ram_page, NULL);
+                if ( !(m2p_table = xc_map_m2p(xc_handle, max_mfn, PROT_READ, NULL)) )
+                {
+                    ERROR("Failed to map live M2P table");
+                    goto failed;
+                }
+
+                /* Suspend the guest */
+                port = xs_suspend_evtchn_port(domid);
+                if (port < 0)
+                {
+                    ERROR("Dom %x: No suspsend port, try live migration\n",
+                            domid);
+                    goto failed;
+                }
+
+                xce = xc_evtchn_open();
+                if (xce < 0)
+                {
+                    ERROR("Dom %x: fail to open evtchn\n",
+                            domid);
+                            goto failed;
+                }
+
+                suspend_evtchn =
+                  xc_suspend_evtchn_init(xc_handle, xce, domid, port);
+                if (suspend_evtchn < 0)
+                {
+                    ERROR("suspend event channel initialization failed\n");
+                    goto failed;
+                }
+
+                rc = xc_evtchn_notify(xce, suspend_evtchn);
+                if (rc < 0) {
+                    ERROR("failed to notify suspend channel: %d", rc);
+                    goto failed;
+                }
+                if (xc_await_suspend(xce, suspend_evtchn) < 0) {
+                    ERROR("suspend failed");\
+                    goto failed;
+                }
+                suspended = 1;
+
+                /* Get pfn type */
+                pfn_type = malloc(sizeof(uint32_t) * p2m_size);
+                if (!pfn_type)
+                {
+                    ERROR("Failed to malloc pfn_type\n");
+                    goto failed;
+                }
+                memset(pfn_type, 0, sizeof(uint32_t) * p2m_size);
+
+                pfn_real = malloc(sizeof(unsigned long) * p2m_size);
+                if (!pfn_real)
+                {
+                    ERROR("Failed to malloc pfn_real\n");
+                    goto failed;
+                }
+                memset(pfn_real, 0, sizeof(unsigned long) * p2m_size);
+
+                for (j = 0; j < p2m_size; j++)
+                    pfn_type[j] = pfn_to_mfn(j, p2m_table, guest_width);
+
+                if ( lock_pages(pfn_type, p2m_size * sizeof(*pfn_type)) )
+                {
+                    ERROR("Unable to lock pfn_type array");
+                    goto failed;
+                }
+                if ( lock_pages(pfn_real, p2m_size * sizeof(*pfn_real)) )
+                {
+                    ERROR("Unable to lock pfn_real array");
+                    goto failed;
+                }
+
+                for (j = 0; j < p2m_size ; j+=1024)
+                {
+                    int count = ((p2m_size - j ) > 1024 ) ? 1024: (p2m_size - j);
+                    if ( ( rc = xc_get_pfn_type_batch(xc_handle, domid, count,
+                              pfn_type + j)) )
+                    {
+                        ERROR("Failed to get pfn_type %x\n", rc);
+                         goto failed;
+                    }
+                }
+
+                /* Now replace the page */
+                for (j = 0; j < p2m_size; j++)
+                    pfn_real[j] = pfn_type[j];
+
+                for (j = i ; j < num; j++)
+                {
+                    if ( ((status[j]& PG_OFFLINE_STATUS_MASK) == PG_OFFLINE_PENDING) &&
+                         ((status[j] >> PG_OFFLINE_OWNER_SHIFT) == domid) )
+                    {
+#define mfn_to_pfn(_mfn)  (m2p_table[(_mfn)])
+                             rc = xc_replace_page(xc_handle, mfn_to_pfn(start + j),
+                               domid, p2m_table,
+                               pfn_real, p2m_size, pt_level,
+                               guest_width);
+                             if (rc)
+                             {
+                                 ERROR("Failed to replace page %x\n", j);
+                                 goto failed;
+                             }
+
+                             status[j] |= PG_OFFLINE_STATUS_HANDLED;
+                    }
+                }
+
+failed:
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+
+                if (p2m_table)
+                    munmap(p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE);
+                if (pfn_type)
+                    free(pfn_type);
+                if (live_shinfo)
+                    munmap(live_shinfo, PAGE_SIZE);
+                if (suspend_evtchn > 0)
+                    xc_suspend_evtchn_release(xc_handle, suspend_evtchn);
+                if (suspended)
+                    xc_domain_resume(xc_handle, domid, 1);
+                if (xce > 0)
+                    xc_evtchn_close(xce);
+                break;
+            }
+            default:
+            {
+                ERROR("Error status result %x\n", status[i]);
+                break;
+            }
+        }
+
+    }
+    if (xc_handle)
+        xc_interface_close(xc_handle);
+    if (status)
+        free(status);
+    return 0;
+}

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-18 10:24                 ` [PATCH] Support swap a page from user space tools -- Was " Jiang, Yunhong
@ 2009-03-18 10:32                   ` Jiang, Yunhong
  2009-03-18 10:42                     ` Keir Fraser
  2009-03-18 17:34                   ` Tim Deegan
  1 sibling, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-18 10:32 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel, Keir Fraser

BTW, this patch depends on several patches I sent out earlier, i.e the change to suspend event channel and the changes to mod_lx_entry, although those patches don't rely on this one.

Keir, as the page offline support is in HV already, so can this target for 3.4 if it pass the review?

Thanks
Yunhong Jiang

Jiang, Yunhong <> wrote:
> Tim, this is the implementation as discussed.
> The swap_page.patch is for HV side change, basically it call
> the mod_lx_entry to update the table.
> The free_page.patch is the function from user space tools to offlien a page.
> 
> Thanks
> Yunhong Jiang
> 
> Jiang, Yunhong <> wrote:
>> Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
>>> At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
>>>> 
>>>>>> It should be possible to have the tools do all the PTE manipulations
>>>>>> with MMU update hypercalls (I think -- Keir may correct me here). Then
>>>>>> the final hypercall to surrender the page will fail if the grant tables
>>>>>> are wrong; if it does, put the PTEs back and fall back to a live
>>>>>> migration.  Isn't that what your in-xen code does anyway?
>>>> 
>>>> Tim, after checking the code more carefully, seems currently
>>> the MMU update hypercalls (including mod_lx_entry ) assume it
>>> is for current domain, while in our usage model, it will
>>> update MMU for other domain, so I will try to do following
>>> changes: 1) change mod_lx_entry() to get a domain parameter 2)
>>> Add a new hypercall (or a new command to do_mmu_update ) to
>>> update the MMU for other domain. I'm not sure if there are
>>> other usage model for such requirement, and if such changes
>>> acceptable? Any feedback is welcome.
>>>> 
>>> 
>>> Sorry for the delay -- I was travelling around the summit and this got
>>> lost.  Yes, I think this is an OK approach.  I doubt there will be many
>>> other users of such a hypercall since most OSes will get upset by their
>>> PTEs changing under their feet, but I prefer it to the current patch.
>> 
>> Sure, I will do this way.
>> 
>> Thanks
>> Yunhong Jiang
>> 
>>> 
>>> Cheers,
>>> 
>>> Tim.
>>> 
>>> --
>>> Tim Deegan <Tim.Deegan@citrix.com>
>>> Principal Software Engineer, Citrix Systems (R&D) Ltd.
>>> [Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-18 10:32                   ` Jiang, Yunhong
@ 2009-03-18 10:42                     ` Keir Fraser
  0 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-18 10:42 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 18/03/2009 10:32, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> BTW, this patch depends on several patches I sent out earlier, i.e the change
> to suspend event channel and the changes to mod_lx_entry, although those
> patches don't rely on this one.
> 
> Keir, as the page offline support is in HV already, so can this target for 3.4
> if it pass the review?

Yes, I'd be happy to check this in this week.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-18 10:24                 ` [PATCH] Support swap a page from user space tools -- Was " Jiang, Yunhong
  2009-03-18 10:32                   ` Jiang, Yunhong
@ 2009-03-18 17:34                   ` Tim Deegan
  2009-03-19  5:12                     ` Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Tim Deegan @ 2009-03-18 17:34 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel, Keir Fraser

Hi, 

At 10:24 +0000 on 18 Mar (1237371884), Jiang, Yunhong wrote:
> Tim, this is the implementation as discussed.
> The swap_page.patch is for HV side change, basically it call the mod_lx_entry to update the table.

That seems good.  One or two nits with that patch:
 
 - You're passing a physical address (of the PTE to update) in an MFN
   field.  That's not going to be big enough on all platforms.  Also
   it's pretty confusing.

 - The "swap" operation takes the physical address of a PTE and a new
   MFN.  Why not a new PTE?  And if it's going to be called "swap", why
   not take the old PTE too and do an atomic update?  (or just call it
   something else and only take the new PTE?)

 - Why the checks for the validity of the old MFN before the change?
   What are you guarding against?

And please document the hypercall, specially since its side-effects
could be quite surprising. 

> The free_page.patch is the function from user space tools to offlien a page.

This is much better than the previous version, thanks.

Cheers,

Tim.

> 
> Thanks
> Yunhong Jiang
> 
> Jiang, Yunhong <> wrote:
> > Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> >> At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
> >>> 
> >>>>> It should be possible to have the tools do all the PTE manipulations
> >>>>> with MMU update hypercalls (I think -- Keir may correct me here). Then
> >>>>> the final hypercall to surrender the page will fail if the grant tables
> >>>>> are wrong; if it does, put the PTEs back and fall back to a live
> >>>>> migration.  Isn't that what your in-xen code does anyway?
> >>> 
> >>> Tim, after checking the code more carefully, seems currently
> >> the MMU update hypercalls (including mod_lx_entry ) assume it
> >> is for current domain, while in our usage model, it will
> >> update MMU for other domain, so I will try to do following
> >> changes: 1) change mod_lx_entry() to get a domain parameter 2)
> >> Add a new hypercall (or a new command to do_mmu_update ) to
> >> update the MMU for other domain. I'm not sure if there are
> >> other usage model for such requirement, and if such changes
> >> acceptable? Any feedback is welcome.
> >>> 
> >> 
> >> Sorry for the delay -- I was travelling around the summit and this got
> >> lost.  Yes, I think this is an OK approach.  I doubt there will be many
> >> other users of such a hypercall since most OSes will get upset by their
> >> PTEs changing under their feet, but I prefer it to the current patch.
> > 
> > Sure, I will do this way.
> > 
> > Thanks
> > Yunhong Jiang
> > 
> >> 
> >> Cheers,
> >> 
> >> Tim.
> >> 
> >> --
> >> Tim Deegan <Tim.Deegan@citrix.com>
> >> Principal Software Engineer, Citrix Systems (R&D) Ltd.
> >> [Company #02300071, SL9 0DZ, UK.]



-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-18 17:34                   ` Tim Deegan
@ 2009-03-19  5:12                     ` Jiang, Yunhong
  2009-03-19  9:32                       ` Tim Deegan
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19  5:12 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 3884 bytes --]

Tim, thanks for review very much. Attached is the new version. See below for comments.

Thanks
Yunhong Jiang

Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> Hi,
> 
> At 10:24 +0000 on 18 Mar (1237371884), Jiang, Yunhong wrote:
>> Tim, this is the implementation as discussed.
>> The swap_page.patch is for HV side change, basically it call
> the mod_lx_entry to update the table.
> 
> That seems good.  One or two nits with that patch:
> 
> - You're passing a physical address (of the PTE to update) in an MFN
>   field.  That's not going to be big enough on all platforms.  Also   it's
> pretty confusing. 

Yes, fixed and now named pte_addr as a uint64.

> 
> - The "swap" operation takes the physical address of a PTE and a new
>   MFN.  Why not a new PTE?  And if it's going to be called "swap", why
>   not take the old PTE too and do an atomic update?  (or just call it
>   something else and only take the new PTE?)

Yes, I change the name to update_pte, and pass only the new PTE.

> 
> - Why the checks for the validity of the old MFN before the change?
>   What are you guarding against?

Yes, seems it is meaningless.

> 
> And please document the hypercall, specially since its side-effects could
> be quite surprising. 
> 
>> The free_page.patch is the function from user space tools to offlien a
>> page. 
> 
> This is much better than the previous version, thanks.

I missed one thing in previous patch, i.e. the changes to xc_core_arch_map_p2m(). 
Originally I change that function to map the p2m table as rw (it is forgoted in previous mail). Now I add a new function xc_core_arch_map_p2m_writable() so that not break the original API.

But I'm a bit confused why the xc_domain_save.c will not use this function to map p2m table also? Instead,  I noticed a lot of duplicate on these two files, I can send out a clean patch in future if it is ok.

Thanks
-- Yunhong Jiang

> 
> Cheers,
> 
> Tim.
> 
>> 
>> Thanks
>> Yunhong Jiang
>> 
>> Jiang, Yunhong <> wrote:
>>> Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
>>>> At 14:37 +0000 on 19 Feb (1235054276), Jiang, Yunhong wrote:
>>>>> 
>>>>>>> It should be possible to have the tools do all the PTE manipulations
>>>>>>> with MMU update hypercalls (I think -- Keir may correct me here). Then
>>>>>>> the final hypercall to surrender the page will fail if the grant
>>>>>>> tables are wrong; if it does, put the PTEs back and fall back to a
>>>>>>> live migration.  Isn't that what your in-xen code does anyway?
>>>>> 
>>>>> Tim, after checking the code more carefully, seems currently
>>>> the MMU update hypercalls (including mod_lx_entry ) assume it
>>>> is for current domain, while in our usage model, it will
>>>> update MMU for other domain, so I will try to do following
>>>> changes: 1) change mod_lx_entry() to get a domain parameter 2)
>>>> Add a new hypercall (or a new command to do_mmu_update ) to
>>>> update the MMU for other domain. I'm not sure if there are
>>>> other usage model for such requirement, and if such changes
>>>> acceptable? Any feedback is welcome.
>>>>> 
>>>> 
>>>> Sorry for the delay -- I was travelling around the summit and this got
>>>> lost.  Yes, I think this is an OK approach.  I doubt there will be many
>>>> other users of such a hypercall since most OSes will get upset by their
>>>> PTEs changing under their feet, but I prefer it to the current patch.
>>> 
>>> Sure, I will do this way.
>>> 
>>> Thanks
>>> Yunhong Jiang
>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Tim.
>>>> 
>>>> --
>>>> Tim Deegan <Tim.Deegan@citrix.com>
>>>> Principal Software Engineer, Citrix Systems (R&D) Ltd.
>>>> [Company #02300071, SL9 0DZ, UK.]
> 
> 
> 
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, Citrix Systems (R&D) Ltd.
> [Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: swap_page.patch --]
[-- Type: application/octet-stream, Size: 6879 bytes --]

Add a hypercall to update one page table entry.

Add a new hypercall to update page table entry to point to a new page. When the old page has no referecen anymore, it will be released.
This is mainly used for page offline. When a page owned by a guest is marked offline pending, user space tools can scan the domain's page table and then update the entries one by one. When the page is freed, the page status will become offlined.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r 2039e8271051 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Wed Mar 18 17:30:13 2009 +0000
+++ b/xen/arch/x86/mm.c	Wed Mar 18 20:14:37 2009 +0800
@@ -2929,6 +2929,141 @@ int do_mmuext_op(
             break;
         }
 
+#define get_mfns(x)  \
+         x##_pgentry_t old_entry; \
+         if ( unlikely(__copy_from_user( \
+                     &old_entry, va, sizeof(old_entry)) != 0) )   \
+                old_mfn = INVALID_MFN;  \
+         else    \
+            old_mfn = x##e_get_pfn(old_entry);  \
+        new_mfn = x##e_get_pfn(new_entry);  \
+
+        case MMUEXT_UPDATE_PTE:
+        {
+            struct domain *d = rcu_lock_domain_by_id(foreigndom);
+            struct vcpu *v = d->vcpu[0];
+            void *va = NULL;
+            xen_pfn_t new_mfn, pte_mfn, old_mfn = INVALID_MFN;
+
+            pte_mfn = (op.arg1.pte_addr >> PAGE_SHIFT);
+
+            rc = -EINVAL;
+
+            if ( !IS_PRIV(current->domain) )
+                return -EPERM;
+
+            if ( !mfn_valid(pte_mfn) || !get_page_from_pagenr(pte_mfn, d) )
+            {
+                dprintk(XENLOG_WARNING,
+                        "UPDATE_PET with invalid pte_mfn %lx \n", pte_mfn);
+                break;
+            }
+
+            page = mfn_to_page(pte_mfn);
+
+            va = map_domain_page(pte_mfn);
+            va = (void *)((unsigned long)va +
+                          (unsigned long)(op.arg1.pte_addr & ~PAGE_MASK));
+
+            if ( page_lock(page) )
+            {
+                switch ( page->u.inuse.type_info & PGT_type_mask )
+                {
+                    case PGT_l1_page_table:
+                        {
+                            l1_pgentry_t new_entry =
+                                    l1e_from_intpte(op.arg2.new_pte);
+                            get_mfns(l1);
+                            rc = 0;
+                            okay = mod_l1_entry(va, new_entry, pte_mfn, 1, v);
+                        }
+                        break;
+                    case PGT_l2_page_table:
+                        {
+                            l2_pgentry_t new_entry =
+                                    l2e_from_intpte(op.arg2.new_pte);
+                            get_mfns(l2);
+                            rc = 0;
+                            okay = mod_l2_entry(va, new_entry, pte_mfn, 1, v);
+                        }
+                        break;
+                    case PGT_l3_page_table:
+                        {
+                            l3_pgentry_t new_entry =
+                                    l3e_from_intpte(op.arg2.new_pte);
+                            get_mfns(l3);
+                            rc = mod_l3_entry(va, new_entry, pte_mfn, 1, 1, v);
+                            okay = !rc;
+                        }
+                        break;
+#if CONFIG_PAGING_LEVELS >= 4
+                    case PGT_l4_page_table:
+                        {
+                            l4_pgentry_t new_entry =
+                                    l4e_from_intpte(op.arg2.new_pte);
+                            get_mfns(l4);
+                            rc = mod_l4_entry(va, new_entry, pte_mfn, 1, 1, v);
+                            okay = !rc;
+                        }
+                        break;
+#endif
+                    default:
+                        perfc_incr(writable_mmu_updates);
+                        dprintk(XENLOG_WARNING,
+                                "pte_mfn %lx is not page table page\n",
+                                pte_mfn );
+                        okay = 0;
+                        break;
+                }
+
+                if ( okay && ( ( mfn_to_page(old_mfn)->count_info &
+                                 (PGC_count_mask|PGC_allocated) ) ==
+                               (1 | PGC_allocated)) )
+                {
+                    /* No referece to this page from page table anymore */
+                    xen_pfn_t old_pfn;
+
+                    old_pfn = mfn_to_gmfn(d, old_mfn);
+
+                    /* release original one */
+                    steal_page(d,  mfn_to_page(old_mfn), 0);
+
+                    if ( !test_and_clear_bit(_PGC_allocated,
+                          &(mfn_to_page(old_mfn)->count_info)) )
+                        BUG();
+
+                    guest_physmap_remove_page(d, mfn_to_gmfn(d, mfn), mfn, 0);
+
+                    put_page(mfn_to_page(old_mfn));
+
+                    /* Setup the new page */
+                    guest_physmap_add_page(d, old_pfn, new_mfn, 0);
+
+                    if ( !paging_mode_translate(d) )
+                    {
+                        set_gpfn_from_mfn(new_mfn, old_pfn);
+                    }
+                }
+
+                if ( rc == -EINTR )
+                    rc = -EAGAIN;
+
+                page_unlock(page);
+
+            }
+            else
+            {
+                dprintk(XENLOG_WARNING, "fail to lock the page in swap page\n");
+                rc = -EAGAIN;
+            }
+
+            put_page(mfn_to_page(pte_mfn));
+
+            if (va)
+                unmap_domain_page(va);
+            break;
+        }
+
         default:
             MEM_LOG("Invalid extended pt command 0x%x", op.cmd);
             rc = -ENOSYS;
diff -r 2039e8271051 xen/include/public/xen.h
--- a/xen/include/public/xen.h	Wed Mar 18 17:30:13 2009 +0000
+++ b/xen/include/public/xen.h	Wed Mar 18 18:58:12 2009 +0800
@@ -256,6 +256,12 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define MMUEXT_NEW_USER_BASEPTR 15
 #define MMUEXT_CLEAR_PAGE       16
 #define MMUEXT_COPY_PAGE        17
+/*
+ * Replace the MFN in one PTE to a new page.It is mainly used to replace a
+ * offline pending page to a new mfn, so that the previously page will be freed
+ * if it is not granted to other domain
+ */
+#define MMUEXT_UPDATE_PTE       18
 
 #ifndef __ASSEMBLY__
 struct mmuext_op {
@@ -266,6 +272,8 @@ struct mmuext_op {
         xen_pfn_t     mfn;
         /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
         unsigned long linear_addr;
+        /* MMUEXT_UPDATE_PTE */
+        unsigned long pte_addr;
     } arg1;
     union {
         /* SET_LDT */
@@ -278,6 +286,9 @@ struct mmuext_op {
 #endif
         /* COPY_PAGE */
         xen_pfn_t src_mfn;
+
+        /* MMUEXT_UPDATE_PTE */
+        unsigned long new_pte;
     } arg2;
 };
 typedef struct mmuext_op mmuext_op_t;

[-- Attachment #3: free_page.patch --]
[-- Type: application/octet-stream, Size: 26830 bytes --]

This is the user space tools support to offline a page. 

Changeset 19286:dd489125a2e7 implement the HV side page offline logic. This patch add support to use space tools.

When a page owned by a PV domain is marked "offline pending", user space tools will try to suspend the guest, scan the whole page table and update the page table entry one by one. If no more reference to the page, the page become "offlined".

xc_offline_page() is added to update all reference to the offline pending pages.
xc_map_m2p() is exported from xc_domain_save.c so that we don't need implement it again.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r d914e26d5df5 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Wed Mar 18 21:39:21 2009 +0800
+++ b/tools/libxc/Makefile	Wed Mar 18 21:39:23 2009 +0800
@@ -31,6 +31,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c
 GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += xc_offline_page.c
 GUEST_SRCS-$(CONFIG_HVM) += xc_hvm_build.c
 
 vpath %.c ../../xen/common/libelf
diff -r d914e26d5df5 tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c	Wed Mar 18 21:39:21 2009 +0800
+++ b/tools/libxc/xc_domain_save.c	Wed Mar 18 21:39:23 2009 +0800
@@ -510,9 +510,10 @@ static int canonicalize_pagetable(unsign
     return race;
 }
 
-static xen_pfn_t *xc_map_m2p(int xc_handle,
-                                 unsigned long max_mfn,
-                                 int prot)
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0)
 {
     struct xen_machphys_mfn_list xmml;
     privcmd_mmap_entry_t *entries;
@@ -561,7 +562,8 @@ static xen_pfn_t *xc_map_m2p(int xc_hand
         goto err2;
     }
 
-    m2p_mfn0 = entries[0].mfn;
+    if (mfn0)
+        *mfn0 = entries[0].mfn;
 
 err2:
     free(entries);
@@ -1058,7 +1060,7 @@ int xc_domain_save(int xc_handle, int io
     }
 
     /* Setup the mfn_to_pfn table mapping */
-    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ)) )
+    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0)) )
     {
         ERROR("Failed to map live M2P table");
         goto out;
diff -r d914e26d5df5 tools/libxc/xc_offline_page.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_offline_page.c	Wed Mar 18 21:58:53 2009 +0800
@@ -0,0 +1,390 @@
+/******************************************************************************
+ * xc_offline_page.c
+ *
+ * Helper functions to replace a offlining page
+ *
+ * Copyright (c) 2003, K A Fraser.
+ * Copyright (c) 2009, Intel Corporation.
+ */
+
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+#undef DEBUG
+#define DEBUG(_f, _a...) fprintf(stderr, _f , ## _a)
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_online;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_query_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+#define MAX_OFFLINE_BATCH 1024
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int swap_page(int xc_handle, xen_pfn_t mfn, xen_pfn_t new_mfn,
+                    int domid, xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    int nr_swaped = 0, pte_num;
+    uint64_t i;
+    struct mmuext_op swap[MAX_OFFLINE_BATCH];
+    void *content = NULL;
+
+    pte_num = PAGE_SIZE / ((pt_levels == 2) ? 4 : 8);
+
+    for (i = 0; i < p2m_size; i++)
+    {
+        xen_pfn_t table_mfn = pfn_to_mfn(i, p2m, guest_width);
+        uint64_t pte, new_pte;
+        int j;
+
+        if ( table_mfn == INVALID_P2M_ENTRY )
+            continue;
+
+        if ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            content = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                            PROT_READ, table_mfn);
+            if (!content)
+                goto out;
+
+            for (j = 0; j < pte_num; j++)
+            {
+                if ( pt_levels == 2 )
+                    pte = ((const uint32_t*)content)[j];
+                else
+                    pte = ((const uint64_t*)content)[j];
+
+                if (!(pte & _PAGE_PRESENT))
+                    continue;
+
+                /* Hit one entry */
+                if ( ((pte >> PAGE_SHIFT) & MFN_MASK_X86) == mfn )
+                {
+                    new_pte = (pte & ~MADDR_MASK_X86) |
+                               (new_mfn << PAGE_SHIFT_X86);
+                    swap[nr_swaped].cmd = MMUEXT_UPDATE_PTE;
+                    swap[nr_swaped].arg1.pte_addr = table_mfn << PAGE_SHIFT;
+                    swap[nr_swaped].arg1.pte_addr |= j * ( (pt_levels == 2) ?
+                                        sizeof(uint32_t): sizeof(uint64_t) );
+                    swap[nr_swaped].arg2.new_pte = new_pte;
+                    nr_swaped ++;
+                }
+
+                if ( nr_swaped == MAX_OFFLINE_BATCH )
+                {
+                    if ( xc_mmuext_op(xc_handle, swap, nr_swaped, domid) < 0 )
+                    {
+                        ERROR("Failed to swap");
+                        goto out;
+                    }
+                    nr_swaped = 0;
+                }
+            }
+
+            if ( nr_swaped )
+            {
+                if ( xc_mmuext_op(xc_handle, swap, nr_swaped, domid) < 0 )
+                {
+                    ERROR("Failed to swap");
+                    goto out;
+                }
+                nr_swaped = 0;
+            }
+
+            munmap(content, PAGE_SIZE);
+            content = NULL;
+        }
+    }
+    return 0;
+out:
+    /* XXX Shall we take action if we have fail to swap? */
+    if (content)
+        munmap(content, PAGE_SIZE);
+
+    return -1;
+}
+
+int xc_replace_page(int xc_handle, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    uint32_t status;
+    xc_dominfo_t info;
+    xen_pfn_t mfn, new_mfn;
+    struct mmuext_op unpin;
+    int rc, max_pages, broken = 0;
+    void *old_p, *tmp_p = NULL, *new_p = NULL;
+
+    if (!p2m || !pfn_type)
+        return -EINVAL;
+
+    mfn = pfn_to_mfn(pfn, p2m, guest_width);
+
+    /* This page has no mfn established?? */
+    if (mfn == INVALID_P2M_ENTRY)
+        return -EINVAL;
+
+    /* Target domain should be suspended already */
+    if ( (xc_domain_getinfo(xc_handle, domid, 1, &info) != 1) ||
+         !info.shutdown || (info.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain not in suspended state");
+        return -1;
+    }
+
+    /* Check if pages are offline pending or not */
+    rc = xc_query_page_offline_status(xc_handle, mfn, mfn, &status);
+
+    if (rc)
+        return rc;
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINE_PENDING) )
+    {
+        ERROR("Page %lx(mfn %lx) is not offline pending %x\n",
+               pfn, mfn, status);
+        return -EINVAL;
+    }
+
+    /* Unpin the page if it is pined */
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        unpin.cmd = MMUEXT_UNPIN_TABLE;
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("Failed to unpin page %lx", pfn);
+            return -EINVAL;
+        }
+    }
+
+    /* We increase the page limitation temp */
+    max_pages = xc_memory_op(xc_handle, XENMEM_maximum_reservation , &domid);
+    if (max_pages < 0)
+    {
+        ERROR("Failed to get max mfn\n");
+        goto undo_unpin;
+    }
+
+    max_pages ++;
+    xc_domain_setmaxmem(xc_handle, domid, max_pages << 2);
+
+    if (lock_pages(&new_mfn, sizeof(xen_pfn_t)))
+    {
+        ERROR("Could not lock new_mfn\n");
+        goto undo_maxmem;
+    }
+    rc = xc_domain_memory_increase_reservation(xc_handle, domid, 1, 0,
+                                     0x0, &new_mfn);
+
+    if (rc < 0)
+    {
+        ERROR("Failed to increase reservation \n");
+        goto undo_maxmem;
+    }
+
+    unlock_pages(&new_mfn, sizeof(xen_pfn_t));
+
+    /* Copy content from old page to new one */
+    old_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                PROT_READ, mfn);
+    tmp_p = malloc(PAGE_SIZE);
+
+    if ( !old_p || !tmp_p )
+        goto undo_increase;
+
+    memcpy(tmp_p, old_p, PAGE_SIZE);
+    munmap(old_p, PAGE_SIZE);
+
+    rc = swap_page(xc_handle, mfn, new_mfn, domid,
+                    p2m, pfn_type, p2m_size, pt_levels, guest_width);
+
+    if (rc)
+    {
+        ERROR("swap page failed\n");
+        /* No recover action now for swap fail */
+        broken = 1;
+        goto unmap_tmp_p;
+    }
+
+    /* Check if pages are offlined already */
+    rc = xc_query_page_offline_status(xc_handle,
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            &status);
+
+    if (rc)
+    {
+        ERROR("Fail to query offline status\n");
+        goto unmap_tmp_p;
+    }
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINED) )
+    {
+        ERROR("page is still not offlined, is it in granted to others? \n");
+        goto offline_error;
+    }
+    else
+    {
+        DEBUG("Now page is offlined %lx\n", pfn);
+        /* Update the p2m table */
+        p2m[pfn] = new_mfn;
+
+        new_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                     PROT_READ|PROT_WRITE, new_mfn);
+        memcpy(new_p, tmp_p, PAGE_SIZE);
+        munmap(new_p, PAGE_SIZE);
+        xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+    }
+
+    return 0;
+
+offline_error:
+    if (swap_page(xc_handle, new_mfn, mfn, domid,
+                  p2m, pfn_type, p2m_size, pt_levels, guest_width) < 0)
+        goto broken;
+
+unmap_tmp_p:
+    free(tmp_p);
+    munmap(new_p, PAGE_SIZE);
+
+
+undo_increase:
+    if (xc_domain_memory_decrease_reservation(xc_handle, domid, 1, 0,
+                                              &new_mfn))
+        goto broken;
+
+undo_maxmem:
+    xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+
+undo_unpin:
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        switch ( pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            case XEN_DOMCTL_PFINFO_L1TAB:
+                unpin.cmd = MMUEXT_PIN_L1_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L2TAB:
+                unpin.cmd = MMUEXT_PIN_L2_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L3TAB:
+                unpin.cmd = MMUEXT_PIN_L3_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L4TAB:
+                unpin.cmd = MMUEXT_PIN_L4_TABLE;
+                break;
+
+            default:
+                ERROR("Unpined for non pate table page\n");
+                break;
+        }
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("failed to pin the mfn again\n");
+            goto broken;
+        }
+    }
+    if (!broken)
+        return -1;
+broken:
+    return -2;
+}
diff -r d914e26d5df5 tools/libxc/xenguest.h
--- a/tools/libxc/xenguest.h	Wed Mar 18 21:39:21 2009 +0800
+++ b/tools/libxc/xenguest.h	Wed Mar 18 21:39:23 2009 +0800
@@ -29,6 +29,31 @@ int xc_domain_save(int xc_handle, int io
                    void *(*init_qemu_maps)(int, unsigned),  /* HVM only */
                    void (*qemu_flip_buffer)(int, int));     /* HVM only */
 
+int xc_replace_page(int xc, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width);
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status);
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status);
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status);
+
+/**
+ * This function map msp table
+ * @parm xc_handle a handle to an open hypervisor interface
+ * @parm max_mfn the max pfn
+ * @parm prot the flags to map, such as read/write etc
+ * @parm mfn0 return the first mfn, can be NULL
+ * @return mapped m2p table on success, NULL on failure
+ */
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0);
 
 /**
  * This function will restore a saved domain.
diff -r d914e26d5df5 tools/xcutils/Makefile
--- a/tools/xcutils/Makefile	Wed Mar 18 21:39:21 2009 +0800
+++ b/tools/xcutils/Makefile	Wed Mar 18 21:39:23 2009 +0800
@@ -14,7 +14,7 @@ CFLAGS += -Werror
 CFLAGS += -Werror
 CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore)
 
-PROGRAMS = xc_restore xc_save readnotes lsevtchn
+PROGRAMS = xc_restore xc_save readnotes lsevtchn xc_offline
 
 LDLIBS   = $(LDFLAGS_libxenctrl) $(LDFLAGS_libxenguest) $(LDFLAGS_libxenstore)
 
diff -r d914e26d5df5 tools/xcutils/xc_offline.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/xcutils/xc_offline.c	Wed Mar 18 21:39:53 2009 +0800
@@ -0,0 +1,327 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General
+ * Public License.  See the file "COPYING" in the main directory of
+ * this archive for more details.
+ *
+ * Copyright (C) 2005 by Christian Limpach
+ *
+ */
+
+#include <err.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include <sys/ipc.h>
+#include <sys/shm.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <err.h>
+
+#include <xs.h>
+#include <xenctrl.h>
+#include <xenguest.h>
+#include <xc_private.h>
+#include <xc_core.h>
+
+#undef ERROR
+#undef DEBUG
+#define ERROR(fmr, args...) do { fprintf(stderr, "ERROR: " fmr , ##args); } while (0)
+#define DEBUG(fmr, args...) do { fprintf(stderr, "DEBUG: " fmr , ##args); } while (0)
+
+#define FPP             (PAGE_SIZE/(guest_width))
+
+/* Number of entries in the pfn_to_mfn_frame_list_list */
+#define P2M_FLL_ENTRIES (((p2m_size)+(FPP*FPP)-1)/(FPP*FPP))
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int get_pt_level(int xc_handle, uint32_t domid,
+                        unsigned int *pt_level,
+                        unsigned int *guest_width)
+{
+    DECLARE_DOMCTL;
+    xen_capabilities_info_t xen_caps = "";
+
+    if (xc_version(xc_handle, XENVER_capabilities, &xen_caps) != 0)
+        return -1;
+
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.domain = domid;
+    domctl.cmd = XEN_DOMCTL_get_address_size;
+
+    if ( do_domctl(xc_handle, &domctl) != 0 )
+        return -1;
+
+    *guest_width = domctl.u.address_size.size / 8;
+
+    if (strstr(xen_caps, "xen-3.0-x86_64"))
+        /* Depends on whether it's a compat 32-on-64 guest */
+        *pt_level = ( (*guest_width == 8) ? 4 : 3 );
+    else if (strstr(xen_caps, "xen-3.0-x86_32p"))
+        *pt_level = 3;
+    else if (strstr(xen_caps, "xen-3.0-x86_32"))
+        *pt_level = 2;
+    else
+        return -1;
+
+    return 0;
+}
+
+#define PG_OFFLINE_STATUS_HANDLED (1UL << 14)
+int
+main(int argc, char **argv)
+{
+    unsigned long start, end;
+    int xc_handle = 0, i, num, rc;
+    uint32_t *status = NULL;
+
+    if (argc != 3)
+        fprintf(stderr, "usage: %s start end", argv[0]);
+
+    start = strtoul(argv[1], NULL, 0);
+    end = strtoul(argv[2], NULL, 0);
+
+    xc_handle = xc_interface_open();
+
+    if (!xc_handle)
+        return -1;
+
+    num = end - start + 1;
+
+    rc = -ENOMEM;
+    status  = malloc(num * sizeof(uint32_t));
+    if (!status)
+        return -EINVAL;
+    memset(status, 0, sizeof(uint32_t)*num);
+
+    rc = xc_mark_page_offline(xc_handle, start, end, status);
+
+    if (rc)
+    {
+        ERROR("fail to mark pages offline %x\n", rc);
+        return -EINVAL;
+    }
+
+    for (i = 0; i < num; i++)
+    {
+        DEBUG("pfn %lx status %x\n", start + i, status[i]);
+
+        if (status[i] & PG_OFFLINE_STATUS_HANDLED)
+            continue;
+
+        switch (status[i] & PG_OFFLINE_STATUS_MASK)
+        {
+            case PG_OFFLINE_OFFLINED:
+                DEBUG("offlined page %lx\n", start + i);
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+                break;
+            break;
+
+            case PG_OFFLINE_PENDING:
+            {
+                uint32_t domid, j;
+                int port, xce = -1, rc;
+                unsigned long p2m_size;
+                xen_pfn_t *p2m_table = NULL;
+                xen_pfn_t *m2p_table = NULL;
+                xc_dominfo_t info;
+                uint64_aligned_t shared_info_frame;
+                shared_info_any_t *live_shinfo = NULL;
+                uint32_t *pfn_type = NULL;
+                unsigned long *pfn_real = NULL, max_mfn;
+                int suspend_evtchn = -1, suspended = 0;
+                uint32_t pt_level = 0, guest_width = 0;
+
+                domid = status[i] >> PG_OFFLINE_OWNER_SHIFT;
+
+                if ( !domid || (domid > DOMID_FIRST_RESERVED) )
+                {
+                    DEBUG("Dom0's page can't be LM");
+                    goto failed;
+                }
+
+                if ( get_pt_level(xc_handle, domid, &pt_level, &guest_width) )
+                {
+                    ERROR("Unable to get PT level info.");
+                    goto failed;
+                }
+
+                if ( xc_domain_getinfo(xc_handle, domid, 1, &info) != 1 )
+                {
+                    ERROR("Could not get domain info");
+                    goto failed;
+                }
+
+                if (info.hvm)
+                {
+                    DEBUG("please Live migrate dom %x\n", domid);
+                    goto failed;
+                }
+
+                /* Map the p2m table and M2P table */
+                shared_info_frame = info.shared_info_frame;
+
+                live_shinfo = xc_map_foreign_range(xc_handle, domid,
+                  PAGE_SIZE, PROT_READ, shared_info_frame);
+                if ( !live_shinfo )
+                {
+                    ERROR("Couldn't map live_shinfo");
+                    goto failed;
+                }
+
+                if ( (rc = xc_core_arch_map_p2m_writable(xc_handle, guest_width, &info,
+                                                         live_shinfo, &p2m_table,  &p2m_size)) )
+                {
+                    ERROR("Couldn't map p2m table %x\n", rc);
+                    goto failed;
+                }
+
+                max_mfn = xc_memory_op(xc_handle, XENMEM_maximum_ram_page, NULL);
+                if ( !(m2p_table = xc_map_m2p(xc_handle, max_mfn, PROT_READ, NULL)) )
+                {
+                    ERROR("Failed to map live M2P table");
+                    goto failed;
+                }
+
+                /* Suspend the guest */
+                port = xs_suspend_evtchn_port(domid);
+                if (port < 0)
+                {
+                    ERROR("Dom %x: No suspsend port, try live migration\n",
+                            domid);
+                    goto failed;
+                }
+
+                xce = xc_evtchn_open();
+                if (xce < 0)
+                {
+                    ERROR("Dom %x: fail to open evtchn\n",
+                            domid);
+                            goto failed;
+                }
+
+                suspend_evtchn =
+                  xc_suspend_evtchn_init(xc_handle, xce, domid, port);
+                if (suspend_evtchn < 0)
+                {
+                    ERROR("suspend event channel initialization failed\n");
+                    goto failed;
+                }
+
+                rc = xc_evtchn_notify(xce, suspend_evtchn);
+                if (rc < 0) {
+                    ERROR("failed to notify suspend channel: %d", rc);
+                    goto failed;
+                }
+                if (xc_await_suspend(xce, suspend_evtchn) < 0) {
+                    ERROR("suspend failed");\
+                    goto failed;
+                }
+                suspended = 1;
+
+                /* Get pfn type */
+                pfn_type = malloc(sizeof(uint32_t) * p2m_size);
+                if (!pfn_type)
+                {
+                    ERROR("Failed to malloc pfn_type\n");
+                    goto failed;
+                }
+                memset(pfn_type, 0, sizeof(uint32_t) * p2m_size);
+
+                pfn_real = malloc(sizeof(unsigned long) * p2m_size);
+                if (!pfn_real)
+                {
+                    ERROR("Failed to malloc pfn_real\n");
+                    goto failed;
+                }
+                memset(pfn_real, 0, sizeof(unsigned long) * p2m_size);
+
+                for (j = 0; j < p2m_size; j++)
+                    pfn_type[j] = pfn_to_mfn(j, p2m_table, guest_width);
+
+                if ( lock_pages(pfn_type, p2m_size * sizeof(*pfn_type)) )
+                {
+                    ERROR("Unable to lock pfn_type array");
+                    goto failed;
+                }
+                if ( lock_pages(pfn_real, p2m_size * sizeof(*pfn_real)) )
+                {
+                    ERROR("Unable to lock pfn_real array");
+                    goto failed;
+                }
+
+                for (j = 0; j < p2m_size ; j+=1024)
+                {
+                    int count = ((p2m_size - j ) > 1024 ) ? 1024: (p2m_size - j);
+                    if ( ( rc = xc_get_pfn_type_batch(xc_handle, domid, count,
+                              pfn_type + j)) )
+                    {
+                        ERROR("Failed to get pfn_type %x\n", rc);
+                         goto failed;
+                    }
+                }
+
+                /* Now replace the page */
+                for (j = 0; j < p2m_size; j++)
+                    pfn_real[j] = pfn_type[j];
+
+                for (j = i ; j < num; j++)
+                {
+                    if ( ((status[j]& PG_OFFLINE_STATUS_MASK) == PG_OFFLINE_PENDING) &&
+                         ((status[j] >> PG_OFFLINE_OWNER_SHIFT) == domid) )
+                    {
+#define mfn_to_pfn(_mfn)  (m2p_table[(_mfn)])
+                             rc = xc_replace_page(xc_handle, mfn_to_pfn(start + j),
+                               domid, p2m_table,
+                               pfn_real, p2m_size, pt_level,
+                               guest_width);
+                             if (rc)
+                             {
+                                 ERROR("Failed to replace page %x\n", j);
+                                 goto failed;
+                             }
+
+                             status[j] |= PG_OFFLINE_STATUS_HANDLED;
+                    }
+                }
+
+failed:
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+
+                if (p2m_table)
+                    munmap(p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE);
+                if (pfn_type)
+                    free(pfn_type);
+                if (live_shinfo)
+                    munmap(live_shinfo, PAGE_SIZE);
+                if (suspend_evtchn > 0)
+                    xc_suspend_evtchn_release(xc_handle, suspend_evtchn);
+                if (suspended)
+                    xc_domain_resume(xc_handle, domid, 1);
+                if (xce > 0)
+                    xc_evtchn_close(xce);
+                break;
+            }
+            default:
+            {
+                ERROR("Error status result %x\n", status[i]);
+                break;
+            }
+        }
+
+    }
+    if (xc_handle)
+        xc_interface_close(xc_handle);
+    if (status)
+        free(status);
+    return 0;
+}

[-- Attachment #4: writable_p2m.patch --]
[-- Type: application/octet-stream, Size: 2895 bytes --]

Update the p2m mapping table to be writable

diff -r df5c0b078d8d tools/libxc/xc_core.h
--- a/tools/libxc/xc_core.h	Wed Mar 18 20:14:41 2009 +0800
+++ b/tools/libxc/xc_core.h	Wed Mar 18 21:38:41 2009 +0800
@@ -143,6 +143,11 @@ int xc_core_arch_map_p2m(int xc_handle, 
                          xc_dominfo_t *info, shared_info_any_t *live_shinfo,
                          xen_pfn_t **live_p2m, unsigned long *pfnp);
 
+int xc_core_arch_map_p2m_writable(int xc_handle, unsigned int guest_width,
+                                  xc_dominfo_t *info,
+                                  shared_info_any_t *live_shinfo,
+                                  xen_pfn_t **live_p2m, unsigned long *pfnp);
+
 
 #if defined (__i386__) || defined (__x86_64__)
 # include "xc_core_x86.h"
diff -r df5c0b078d8d tools/libxc/xc_core_x86.c
--- a/tools/libxc/xc_core_x86.c	Wed Mar 18 20:14:41 2009 +0800
+++ b/tools/libxc/xc_core_x86.c	Wed Mar 18 21:39:14 2009 +0800
@@ -75,10 +75,10 @@ xc_core_arch_memory_map_get(int xc_handl
     return 0;
 }
 
-int
-xc_core_arch_map_p2m(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
-                     shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
-                     unsigned long *pfnp)
+static int
+xc_core_arch_map_p2m_rw(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                        shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                        unsigned long *pfnp, int rw)
 {
     /* Double and single indirect references to the live P2M table */
     xen_pfn_t *live_p2m_frame_list_list = NULL;
@@ -156,7 +156,8 @@ xc_core_arch_map_p2m(int xc_handle, unsi
         for ( i = P2M_FL_ENTRIES - 1; i >= 0; i-- )
             p2m_frame_list[i] = ((uint32_t *)p2m_frame_list)[i];
 
-    *live_p2m = xc_map_foreign_pages(xc_handle, dom, PROT_READ,
+    *live_p2m = xc_map_foreign_pages(xc_handle, dom,
+                                    rw ? (PROT_READ | PROT_WRITE) : PROT_READ,
                                     p2m_frame_list,
                                     P2M_FL_ENTRIES);
 
@@ -189,6 +190,23 @@ out:
     return ret;
 }
 
+int
+xc_core_arch_map_p2m(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                        shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                        unsigned long *pfnp)
+{
+    return xc_core_arch_map_p2m_rw(xc_handle, guest_width, info,
+                                   live_shinfo, live_p2m, pfnp, 0);
+}
+
+int
+xc_core_arch_map_p2m_writable(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                              shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                              unsigned long *pfnp)
+{
+    return xc_core_arch_map_p2m_rw(xc_handle, guest_width, info,
+                                   live_shinfo, live_p2m, pfnp, 1);
+}
 /*
  * Local variables:
  * mode: C

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19  5:12                     ` Jiang, Yunhong
@ 2009-03-19  9:32                       ` Tim Deegan
  2009-03-19  9:45                         ` Keir Fraser
  2009-03-19  9:48                         ` [PATCH] Support swap a page from user space tools " Jiang, Yunhong
  0 siblings, 2 replies; 61+ messages in thread
From: Tim Deegan @ 2009-03-19  9:32 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel, Keir Fraser

Hi, 

At 05:12 +0000 on 19 Mar (1237439530), Jiang, Yunhong wrote:
> > - You're passing a physical address (of the PTE to update) in an MFN
> >   field.  That's not going to be big enough on all platforms.  Also   it's
> > pretty confusing. 
> 
> Yes, fixed and now named pte_addr as a uint64.

You made it an unsigned long, which is still smaller than a paddr_t on
PAE builds.  And you can't just make it 64 bits in that union without
breaking the ABI; you'll need to add a new interface somewhere.  Maybe
Keir can suggest a better place.

> I missed one thing in previous patch, i.e. the changes to
> xc_core_arch_map_p2m().  Originally I change that function to map the
> p2m table as rw (it is forgoted in previous mail). Now I add a new
> function xc_core_arch_map_p2m_writable() so that not break the
> original API.

OK.  Are there any callers of the xc_core_arch_map_p2m() that would care
if it gave a writable mapping?

> But I'm a bit confused why the xc_domain_save.c will not use this
> function to map p2m table also? Instead, I noticed a lot of duplicate
> on these two files, I can send out a clean patch in future if it is
> ok.

I think that was just carelessness at the time the xc_core stuff went in
(and possibly also distaste at the rather scruffy state of the
xc_domain_save version).  They should probably be unified at some point
if anyone has the energy. :)

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19  9:32                       ` Tim Deegan
@ 2009-03-19  9:45                         ` Keir Fraser
  2009-03-19  9:57                           ` Jiang, Yunhong
  2009-03-19  9:48                         ` [PATCH] Support swap a page from user space tools " Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-19  9:45 UTC (permalink / raw)
  To: Tim Deegan, Jiang, Yunhong; +Cc: xen-devel

On 19/03/2009 09:32, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:

>>> - You're passing a physical address (of the PTE to update) in an MFN
>>>   field.  That's not going to be big enough on all platforms.  Also   it's
>>> pretty confusing.
>> 
>> Yes, fixed and now named pte_addr as a uint64.
> 
> You made it an unsigned long, which is still smaller than a paddr_t on
> PAE builds.  And you can't just make it 64 bits in that union without
> breaking the ABI; you'll need to add a new interface somewhere.  Maybe
> Keir can suggest a better place.

Firstly, the comment added to the header file is pretty rubbish. The
description fits existing update methods such as MMU_NORMAL_PT_UPDATE, so
based on that comment I could quite reasonably reject your patch on grounds
that it is redundant.

Secondly, the patch name says swap_page and a printk the patch adds refers
to 'swap page'. What's being swapped? That's not the name of the operation,
nor is swapping referred to in the description comment I mention above.

Thirdly, perhaps this makes more sense as a MMU_* op hanging off
mmu_update()? That call takes pairs of u64 values, which could give you the
space you require. Then you can add a nice comment explaining how your new
command differs from MMU_NORMAL_PT_UPDATE.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19  9:32                       ` Tim Deegan
  2009-03-19  9:45                         ` Keir Fraser
@ 2009-03-19  9:48                         ` Jiang, Yunhong
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19  9:48 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser

Tim Deegan <mailto:Tim.Deegan@citrix.com> wrote:
> Hi,
> 
> At 05:12 +0000 on 19 Mar (1237439530), Jiang, Yunhong wrote:
>> I missed one thing in previous patch, i.e. the changes to
>> xc_core_arch_map_p2m().  Originally I change that function to map the
>> p2m table as rw (it is forgoted in previous mail). Now I add a new
>> function xc_core_arch_map_p2m_writable() so that not break the
>> original API.
> 
> OK.  Are there any callers of the xc_core_arch_map_p2m() that
> would care
> if it gave a writable mapping?

It is exported by libxc, so we can't make assumption here.

> 
>> But I'm a bit confused why the xc_domain_save.c will not use this
>> function to map p2m table also? Instead, I noticed a lot of duplicate
>> on these two files, I can send out a clean patch in future if it is
>> ok.
> 
> I think that was just carelessness at the time the xc_core
> stuff went in
> (and possibly also distaste at the rather scruffy state of the
> xc_domain_save version).  They should probably be unified at some point if
> anyone has the energy. :) 
> 
> Cheers,
> 
> Tim.
> 
> --
> Tim Deegan <Tim.Deegan@citrix.com>
> Principal Software Engineer, Citrix Systems (R&D) Ltd.
> [Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19  9:45                         ` Keir Fraser
@ 2009-03-19  9:57                           ` Jiang, Yunhong
  2009-03-19 10:13                             ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19  9:57 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 19/03/2009 09:32, "Tim Deegan" <Tim.Deegan@eu.citrix.com> wrote:
> 
>>>> - You're passing a physical address (of the PTE to update) in an MFN
>>>>   field.  That's not going to be big enough on all platforms.  Also  
>>>> it's pretty confusing.
>>> 
>>> Yes, fixed and now named pte_addr as a uint64.
>> 
>> You made it an unsigned long, which is still smaller than a paddr_t on
>> PAE builds.  And you can't just make it 64 bits in that union without
>> breaking the ABI; you'll need to add a new interface somewhere.  Maybe
>> Keir can suggest a better place.
> 
> Firstly, the comment added to the header file is pretty rubbish. The
> description fits existing update methods such as
> MMU_NORMAL_PT_UPDATE, so
> based on that comment I could quite reasonably reject your
> patch on grounds
> that it is redundant.

I will update it. In fact, I didn't find a proper name for it. Maybe something like MMU_FOREIGN_PT_UPDATE? But it may still be confused as update pt to point memory belongs to other domain.

> 
> Secondly, the patch name says swap_page and a printk the patch
> adds refers
> to 'swap page'. What's being swapped? That's not the name of
> the operation,
> nor is swapping referred to in the description comment I mention above.
> 
> Thirdly, perhaps this makes more sense as a MMU_* op hanging off
> mmu_update()? That call takes pairs of u64 values, which could
> give you the
> space you require. Then you can add a nice comment explaining
> how your new
> command differs from MMU_NORMAL_PT_UPDATE.

I turn to the mmu_ext_op because it has only 1 entry left. So do you mean it is ok to be there?

- yhj

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19  9:57                           ` Jiang, Yunhong
@ 2009-03-19 10:13                             ` Keir Fraser
  2009-03-19 13:01                               ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-19 10:13 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 19/03/2009 09:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Thirdly, perhaps this makes more sense as a MMU_* op hanging off
>> mmu_update()? That call takes pairs of u64 values, which could
>> give you the
>> space you require. Then you can add a nice comment explaining
>> how your new
>> command differs from MMU_NORMAL_PT_UPDATE.
> 
> I turn to the mmu_ext_op because it has only 1 entry left. So do you mean it
> is ok to be there?

Yes, I think it makes most sense there. It's close in behaviour to
MMU_NORMAL_PT_UPDATE, except for the 'foreigness'.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 10:13                             ` Keir Fraser
@ 2009-03-19 13:01                               ` Jiang, Yunhong
  2009-03-19 13:22                                 ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19 13:01 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1558 bytes --]

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 19/03/2009 09:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>> Thirdly, perhaps this makes more sense as a MMU_* op hanging off
>>> mmu_update()? That call takes pairs of u64 values, which could give you
>>> the space you require. Then you can add a nice comment explaining how
>>> your new command differs from MMU_NORMAL_PT_UPDATE.
>> 
>> I turn to the mmu_ext_op because it has only 1 entry left. So do you mean
>> it is ok to be there?
> 
> Yes, I think it makes most sense there. It's close in behaviour to
> MMU_NORMAL_PT_UPDATE, except for the 'foreigness'.
> 
> -- Keir

This is updated version.
Instead of add a new mmu_*op, I try to change the  MMU_NORMAL_PT_UPDATE to support update foreign page table. There will be some restirction for it, including the pagetable and the new mfn pointed should be owned by same domain, and the domain should be suspeneded already. The update_foregin_pt.patch is for this. I'm not sure if this method is feasible.

The exchange_page.patch is to replace a mfn with a new one. Although there is already a memory_exchange hypercall, but that hypercall will not accept new page, add that support will break current ABI. Also it does not support foreign domain. This function add those support

The other two have not much changes. The only change is the user space tools need two hypercall, one is to update the page table, while the second is to exchange the pages.

Please have a look on it.

Thanks
Yunhong Jiang


[-- Attachment #2: writable_p2m.patch --]
[-- Type: application/octet-stream, Size: 2895 bytes --]

Update the p2m mapping table to be writable

diff -r df5c0b078d8d tools/libxc/xc_core.h
--- a/tools/libxc/xc_core.h	Wed Mar 18 20:14:41 2009 +0800
+++ b/tools/libxc/xc_core.h	Wed Mar 18 21:38:41 2009 +0800
@@ -143,6 +143,11 @@ int xc_core_arch_map_p2m(int xc_handle, 
                          xc_dominfo_t *info, shared_info_any_t *live_shinfo,
                          xen_pfn_t **live_p2m, unsigned long *pfnp);
 
+int xc_core_arch_map_p2m_writable(int xc_handle, unsigned int guest_width,
+                                  xc_dominfo_t *info,
+                                  shared_info_any_t *live_shinfo,
+                                  xen_pfn_t **live_p2m, unsigned long *pfnp);
+
 
 #if defined (__i386__) || defined (__x86_64__)
 # include "xc_core_x86.h"
diff -r df5c0b078d8d tools/libxc/xc_core_x86.c
--- a/tools/libxc/xc_core_x86.c	Wed Mar 18 20:14:41 2009 +0800
+++ b/tools/libxc/xc_core_x86.c	Wed Mar 18 21:39:14 2009 +0800
@@ -75,10 +75,10 @@ xc_core_arch_memory_map_get(int xc_handl
     return 0;
 }
 
-int
-xc_core_arch_map_p2m(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
-                     shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
-                     unsigned long *pfnp)
+static int
+xc_core_arch_map_p2m_rw(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                        shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                        unsigned long *pfnp, int rw)
 {
     /* Double and single indirect references to the live P2M table */
     xen_pfn_t *live_p2m_frame_list_list = NULL;
@@ -156,7 +156,8 @@ xc_core_arch_map_p2m(int xc_handle, unsi
         for ( i = P2M_FL_ENTRIES - 1; i >= 0; i-- )
             p2m_frame_list[i] = ((uint32_t *)p2m_frame_list)[i];
 
-    *live_p2m = xc_map_foreign_pages(xc_handle, dom, PROT_READ,
+    *live_p2m = xc_map_foreign_pages(xc_handle, dom,
+                                    rw ? (PROT_READ | PROT_WRITE) : PROT_READ,
                                     p2m_frame_list,
                                     P2M_FL_ENTRIES);
 
@@ -189,6 +190,23 @@ out:
     return ret;
 }
 
+int
+xc_core_arch_map_p2m(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                        shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                        unsigned long *pfnp)
+{
+    return xc_core_arch_map_p2m_rw(xc_handle, guest_width, info,
+                                   live_shinfo, live_p2m, pfnp, 0);
+}
+
+int
+xc_core_arch_map_p2m_writable(int xc_handle, unsigned int guest_width, xc_dominfo_t *info,
+                              shared_info_any_t *live_shinfo, xen_pfn_t **live_p2m,
+                              unsigned long *pfnp)
+{
+    return xc_core_arch_map_p2m_rw(xc_handle, guest_width, info,
+                                   live_shinfo, live_p2m, pfnp, 1);
+}
 /*
  * Local variables:
  * mode: C

[-- Attachment #3: update_foriegn_pt.patch --]
[-- Type: application/octet-stream, Size: 4710 bytes --]

Enhance update_pt to support update foreign domain's page table.

This is mainly for page offline, so we want the domain is suspended when we try to update the page table.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r 2039e8271051 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Wed Mar 18 17:30:13 2009 +0000
+++ b/xen/arch/x86/mm.c	Thu Mar 19 07:00:23 2009 +0800
@@ -110,6 +110,7 @@
 #include <asm/hypercall.h>
 #include <asm/shared.h>
 #include <public/memory.h>
+#include <public/sched.h>
 #include <xsm/xsm.h>
 #include <xen/trace.h>
 
@@ -2977,7 +2978,8 @@ int do_mmu_update(
     struct page_info *page;
     int rc = 0, okay = 1, i = 0;
     unsigned int cmd, done = 0;
-    struct domain *d = current->domain;
+    struct domain *d = current->domain, *pt_owner = NULL;
+    struct vcpu *v = current;
     struct domain_mmap_cache mapcache;
 
     if ( unlikely(count & MMU_UPDATE_PREEMPTED) )
@@ -3020,6 +3022,7 @@ int do_mmu_update(
 
         cmd = req.ptr & (sizeof(l1_pgentry_t)-1);
         okay = 0;
+
 
         switch ( cmd )
         {
@@ -3038,10 +3041,30 @@ int do_mmu_update(
             gmfn = req.ptr >> PAGE_SHIFT;
             mfn = gmfn_to_mfn(d, gmfn);
 
-            if ( unlikely(!get_page_from_pagenr(mfn, d)) )
+            if ( mfn == INVALID_MFN )
+                mfn = gmfn_to_mfn(FOREIGNDOM, gmfn);
+
+            if ( !mfn_valid(mfn) )
+                break;
+            pt_owner = page_get_owner_and_reference(mfn_to_page(mfn));
+
+            if ( pt_owner != d )
             {
-                MEM_LOG("Could not get page for normal update");
-                break;
+                if ( pt_owner == FOREIGNDOM )
+                {
+                    if ( !IS_PRIV(d) ||
+                         !FOREIGNDOM->is_shut_down ||
+                          (FOREIGNDOM->shutdown_code != SHUTDOWN_suspend) )
+                    {
+                        rc = -EPERM;
+                        break;
+                    }
+                    v = FOREIGNDOM->vcpu[0];
+                }else
+                {
+                    rc = -EPERM;
+                    break;
+                }
             }
 
             va = map_domain_page_with_cache(mfn, &mapcache);
@@ -3057,24 +3080,21 @@ int do_mmu_update(
                 {
                     l1_pgentry_t l1e = l1e_from_intpte(req.val);
                     okay = mod_l1_entry(va, l1e, mfn,
-                                        cmd == MMU_PT_UPDATE_PRESERVE_AD,
-                                        current);
+                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                 }
                 break;
                 case PGT_l2_page_table:
                 {
                     l2_pgentry_t l2e = l2e_from_intpte(req.val);
                     okay = mod_l2_entry(va, l2e, mfn,
-                                        cmd == MMU_PT_UPDATE_PRESERVE_AD,
-                                        current);
+                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                 }
                 break;
                 case PGT_l3_page_table:
                 {
                     l3_pgentry_t l3e = l3e_from_intpte(req.val);
                     rc = mod_l3_entry(va, l3e, mfn,
-                                      cmd == MMU_PT_UPDATE_PRESERVE_AD, 1,
-                                      current);
+                                      cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
                     okay = !rc;
                 }
                 break;
@@ -3083,8 +3103,7 @@ int do_mmu_update(
                 {
                     l4_pgentry_t l4e = l4e_from_intpte(req.val);
                     rc = mod_l4_entry(va, l4e, mfn,
-                                      cmd == MMU_PT_UPDATE_PRESERVE_AD, 1,
-                                      current);
+                                      cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v);
                     okay = !rc;
                 }
                 break;
@@ -3092,7 +3111,7 @@ int do_mmu_update(
                 case PGT_writable_page:
                     perfc_incr(writable_mmu_updates);
                     okay = paging_write_guest_entry(
-                        current, va, req.val, _mfn(mfn));
+                        v, va, req.val, _mfn(mfn));
                     break;
                 }
                 page_unlock(page);
@@ -3103,7 +3122,7 @@ int do_mmu_update(
             {
                 perfc_incr(writable_mmu_updates);
                 okay = paging_write_guest_entry(
-                    current, va, req.val, _mfn(mfn));
+                    v, va, req.val, _mfn(mfn));
                 put_page_type(page);
             }
 

[-- Attachment #4: free_page.patch --]
[-- Type: application/octet-stream, Size: 26290 bytes --]

Add a hypercall to free one page

diff -r 547beb2f4fd0 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Thu Mar 19 06:16:22 2009 +0800
+++ b/tools/libxc/Makefile	Thu Mar 19 06:17:07 2009 +0800
@@ -31,6 +31,7 @@ GUEST_SRCS-y :=
 GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c
 GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-y += xc_offline_page.c
 GUEST_SRCS-$(CONFIG_HVM) += xc_hvm_build.c
 
 vpath %.c ../../xen/common/libelf
diff -r 547beb2f4fd0 tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c	Thu Mar 19 06:16:22 2009 +0800
+++ b/tools/libxc/xc_domain_save.c	Thu Mar 19 06:17:07 2009 +0800
@@ -510,9 +510,10 @@ static int canonicalize_pagetable(unsign
     return race;
 }
 
-static xen_pfn_t *xc_map_m2p(int xc_handle,
-                                 unsigned long max_mfn,
-                                 int prot)
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0)
 {
     struct xen_machphys_mfn_list xmml;
     privcmd_mmap_entry_t *entries;
@@ -561,7 +562,8 @@ static xen_pfn_t *xc_map_m2p(int xc_hand
         goto err2;
     }
 
-    m2p_mfn0 = entries[0].mfn;
+    if (mfn0)
+        *mfn0 = entries[0].mfn;
 
 err2:
     free(entries);
@@ -1058,7 +1060,7 @@ int xc_domain_save(int xc_handle, int io
     }
 
     /* Setup the mfn_to_pfn table mapping */
-    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ)) )
+    if ( !(live_m2p = xc_map_m2p(xc_handle, max_mfn, PROT_READ, &m2p_mfn0)) )
     {
         ERROR("Failed to map live M2P table");
         goto out;
diff -r 547beb2f4fd0 tools/libxc/xc_offline_page.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_offline_page.c	Thu Mar 19 06:56:37 2009 +0800
@@ -0,0 +1,402 @@
+/******************************************************************************
+ * xc_offline_page.c
+ *
+ * Helper functions to replace a offlining page
+ *
+ * Copyright (c) 2003, K A Fraser.
+ * Copyright (c) 2009, Intel Corporation.
+ */
+
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+#undef DEBUG
+#define DEBUG(_f, _a...) fprintf(stderr, _f , ## _a)
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_online;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status)
+{
+    DECLARE_SYSCTL;
+    int ret = -1;
+
+    if ( !status || (end < start) )
+        return -EINVAL;
+
+    if (lock_pages(status, sizeof(uint32_t)*(end - start + 1)))
+    {
+        fprintf(stderr, "Could not lock memory for Xen hypercall");
+        return -EINVAL;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_page_offline_op;
+    sysctl.u.page_offline.start = start;
+    sysctl.u.page_offline.cmd = sysctl_query_page_offline;
+    sysctl.u.page_offline.end = end;
+    set_xen_guest_handle(sysctl.u.page_offline.status, status);
+    ret = xc_sysctl(xc, &sysctl);
+
+    unlock_pages(status, sizeof(uint32_t)*(end - start + 1));
+
+    return ret;
+}
+
+#define MAX_OFFLINE_BATCH 1024
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int exchange_page(int xc_handle, xen_pfn_t mfn,
+                     xen_pfn_t new_mfn, int domid)
+{
+    struct mmuext_op op;
+
+    op.cmd = MMUEXT_EXCHANGE_PAGE;
+    op.arg1.mfn = new_mfn;
+    op.arg2.src_mfn = mfn;
+
+    return xc_mmuext_op(xc_handle, &op, 1, domid);
+}
+
+static int update_pte(int xc_handle, xen_pfn_t mfn, xen_pfn_t new_mfn,
+                    int domid, xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    int pte_num;
+    uint64_t i;
+    void *content = NULL;
+    struct xc_mmu *mmu = NULL;
+
+    mmu = xc_alloc_mmu_updates(xc_handle, domid);
+    if ( mmu == NULL )
+    {
+        xc_dom_printf("%s: failed at %d\n", __FUNCTION__, __LINE__);
+        return -1;
+    }
+
+    pte_num = PAGE_SIZE / ((pt_levels == 2) ? 4 : 8);
+
+    for (i = 0; i < p2m_size; i++)
+    {
+        xen_pfn_t table_mfn = pfn_to_mfn(i, p2m, guest_width);
+        uint64_t pte, new_pte;
+        int j;
+
+        if ( table_mfn == INVALID_P2M_ENTRY )
+            continue;
+
+        if ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            content = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                            PROT_READ, table_mfn);
+            if (!content)
+                goto out;
+
+            for (j = 0; j < pte_num; j++)
+            {
+                if ( pt_levels == 2 )
+                    pte = ((const uint32_t*)content)[j];
+                else
+                    pte = ((const uint64_t*)content)[j];
+
+                if (!(pte & _PAGE_PRESENT))
+                    continue;
+
+                /* Hit one entry */
+                if ( ((pte >> PAGE_SHIFT) & MFN_MASK_X86) == mfn )
+                {
+                    new_pte = (pte & ~MADDR_MASK_X86) |
+                               (new_mfn << PAGE_SHIFT_X86);
+                    if ( xc_add_mmu_update(xc_handle, mmu,
+                           table_mfn << PAGE_SHIFT |
+                           j * ( (pt_levels == 2) ?
+                                   sizeof(uint32_t): sizeof(uint64_t)) |
+                           MMU_PT_UPDATE_PRESERVE_AD,
+                           new_pte) )
+                        goto out;
+                }
+            }
+            if ( xc_flush_mmu_updates(xc_handle, mmu) )
+                goto out;
+
+            munmap(content, PAGE_SIZE);
+            content = NULL;
+        }
+    }
+    return 0;
+out:
+    /* XXX Shall we take action if we have fail to update? */
+    if (content)
+        munmap(content, PAGE_SIZE);
+    free(mmu);
+
+    return -1;
+}
+
+int xc_replace_page(int xc_handle, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width)
+{
+    uint32_t status;
+    xc_dominfo_t info;
+    xen_pfn_t mfn, new_mfn;
+    struct mmuext_op unpin;
+    int rc, max_pages, broken = 0;
+    void *old_p, *tmp_p = NULL, *new_p = NULL;
+
+    if (!p2m || !pfn_type)
+        return -EINVAL;
+
+    mfn = pfn_to_mfn(pfn, p2m, guest_width);
+
+    /* This page has no mfn established?? */
+    if (mfn == INVALID_P2M_ENTRY)
+        return -EINVAL;
+
+    /* Target domain should be suspended already */
+    if ( (xc_domain_getinfo(xc_handle, domid, 1, &info) != 1) ||
+         !info.shutdown || (info.shutdown_reason != SHUTDOWN_suspend) )
+    {
+        ERROR("Domain not in suspended state");
+        return -1;
+    }
+
+    /* Check if pages are offline pending or not */
+    rc = xc_query_page_offline_status(xc_handle, mfn, mfn, &status);
+
+    if (rc)
+        return rc;
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINE_PENDING) )
+    {
+        ERROR("Page %lx(mfn %lx) is not offline pending %x\n",
+               pfn, mfn, status);
+        return -EINVAL;
+    }
+
+    /* Unpin the page if it is pined */
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        unpin.cmd = MMUEXT_UNPIN_TABLE;
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("Failed to unpin page %lx", pfn);
+            return -EINVAL;
+        }
+    }
+
+    /* We increase the page limitation temp */
+    max_pages = xc_memory_op(xc_handle, XENMEM_maximum_reservation , &domid);
+    if (max_pages < 0)
+    {
+        ERROR("Failed to get max mfn\n");
+        goto undo_unpin;
+    }
+
+    max_pages ++;
+    xc_domain_setmaxmem(xc_handle, domid, max_pages << 2);
+
+    if (lock_pages(&new_mfn, sizeof(xen_pfn_t)))
+    {
+        ERROR("Could not lock new_mfn\n");
+        goto undo_maxmem;
+    }
+    rc = xc_domain_memory_increase_reservation(xc_handle, domid, 1, 0,
+                                     0x0, &new_mfn);
+
+    if (rc < 0)
+    {
+        ERROR("Failed to increase reservation \n");
+        goto undo_maxmem;
+    }
+
+    unlock_pages(&new_mfn, sizeof(xen_pfn_t));
+
+    /* Copy content from old page to new one */
+    old_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                PROT_READ, mfn);
+    tmp_p = malloc(PAGE_SIZE);
+
+    if ( !old_p || !tmp_p )
+        goto undo_increase;
+
+    memcpy(tmp_p, old_p, PAGE_SIZE);
+    munmap(old_p, PAGE_SIZE);
+
+    rc = update_pte(xc_handle, mfn, new_mfn, domid,
+                    p2m, pfn_type, p2m_size, pt_levels, guest_width);
+
+    if (rc)
+    {
+        ERROR("update failed\n");
+        /* No recover action now for update fail */
+        broken = 1;
+        goto unmap_tmp_p;
+    }
+
+    rc = exchange_page(xc_handle, mfn, new_mfn, domid);
+
+    if (rc)
+    {
+        ERROR("Exchange the page failed\n");
+        /* We will try to take action here */
+        goto offline_error;
+    }
+
+    /* Check if pages are offlined already */
+    rc = xc_query_page_offline_status(xc_handle,
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            pfn_to_mfn(pfn, p2m, guest_width),
+                            &status);
+
+    if (rc)
+    {
+        ERROR("Fail to query offline status\n");
+        goto unmap_tmp_p;
+    }
+
+    if ( !(status & PG_OFFLINE_STATUS_OFFLINED) )
+    {
+        /* Should should not comes here */
+        goto offline_error;
+    }
+    else
+    {
+        DEBUG("Now page is offlined %lx\n", pfn);
+        /* Update the p2m table */
+        p2m[pfn] = new_mfn;
+
+        new_p = xc_map_foreign_range(xc_handle, domid, PAGE_SIZE,
+                                     PROT_READ|PROT_WRITE, new_mfn);
+        memcpy(new_p, tmp_p, PAGE_SIZE);
+        munmap(new_p, PAGE_SIZE);
+        xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+    }
+
+    return 0;
+
+offline_error:
+    if (update_pte(xc_handle, new_mfn, mfn, domid,
+                  p2m, pfn_type, p2m_size, pt_levels, guest_width) < 0)
+        goto broken;
+
+unmap_tmp_p:
+    free(tmp_p);
+    munmap(new_p, PAGE_SIZE);
+
+
+undo_increase:
+    if (xc_domain_memory_decrease_reservation(xc_handle, domid, 1, 0,
+                                              &new_mfn))
+        goto broken;
+
+undo_maxmem:
+    xc_domain_setmaxmem(xc_handle, domid, (max_pages - 1 ) << 2);
+
+undo_unpin:
+    if (pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB)
+    {
+        switch ( pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+            case XEN_DOMCTL_PFINFO_L1TAB:
+                unpin.cmd = MMUEXT_PIN_L1_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L2TAB:
+                unpin.cmd = MMUEXT_PIN_L2_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L3TAB:
+                unpin.cmd = MMUEXT_PIN_L3_TABLE;
+                break;
+
+            case XEN_DOMCTL_PFINFO_L4TAB:
+                unpin.cmd = MMUEXT_PIN_L4_TABLE;
+                break;
+
+            default:
+                ERROR("Unpined for non pate table page\n");
+                break;
+        }
+        unpin.arg1.mfn = mfn;
+
+        if ( xc_mmuext_op(xc_handle, &unpin, 1, domid) < 0 )
+        {
+            ERROR("failed to pin the mfn again\n");
+            goto broken;
+        }
+    }
+    if (!broken)
+        return -1;
+broken:
+    return -2;
+}
diff -r 547beb2f4fd0 tools/libxc/xenguest.h
--- a/tools/libxc/xenguest.h	Thu Mar 19 06:16:22 2009 +0800
+++ b/tools/libxc/xenguest.h	Thu Mar 19 06:17:07 2009 +0800
@@ -29,6 +29,31 @@ int xc_domain_save(int xc_handle, int io
                    void *(*init_qemu_maps)(int, unsigned),  /* HVM only */
                    void (*qemu_flip_buffer)(int, int));     /* HVM only */
 
+int xc_replace_page(int xc, xen_pfn_t pfn, int domid,
+                    xen_pfn_t *p2m, unsigned long *pfn_type,
+                    int p2m_size, int pt_levels, int guest_width);
+
+int xc_mark_page_online(int xc, unsigned long start,
+                        unsigned long end, uint32_t *status);
+
+int xc_mark_page_offline(int xc, unsigned long start,
+                          unsigned long end, uint32_t *status);
+
+int xc_query_page_offline_status(int xc, unsigned long start,
+                                 unsigned long end, uint32_t *status);
+
+/**
+ * This function map msp table
+ * @parm xc_handle a handle to an open hypervisor interface
+ * @parm max_mfn the max pfn
+ * @parm prot the flags to map, such as read/write etc
+ * @parm mfn0 return the first mfn, can be NULL
+ * @return mapped m2p table on success, NULL on failure
+ */
+xen_pfn_t *xc_map_m2p(int xc_handle,
+                      unsigned long max_mfn,
+                      int prot,
+                      unsigned long *mfn0);
 
 /**
  * This function will restore a saved domain.
diff -r 547beb2f4fd0 tools/xcutils/Makefile
--- a/tools/xcutils/Makefile	Thu Mar 19 06:16:22 2009 +0800
+++ b/tools/xcutils/Makefile	Thu Mar 19 06:17:07 2009 +0800
@@ -14,7 +14,7 @@ CFLAGS += -Werror
 CFLAGS += -Werror
 CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore)
 
-PROGRAMS = xc_restore xc_save readnotes lsevtchn
+PROGRAMS = xc_restore xc_save readnotes lsevtchn xc_offline
 
 LDLIBS   = $(LDFLAGS_libxenctrl) $(LDFLAGS_libxenguest) $(LDFLAGS_libxenstore)
 
diff -r 547beb2f4fd0 tools/xcutils/xc_offline.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/xcutils/xc_offline.c	Thu Mar 19 06:17:07 2009 +0800
@@ -0,0 +1,327 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General
+ * Public License.  See the file "COPYING" in the main directory of
+ * this archive for more details.
+ *
+ * Copyright (C) 2005 by Christian Limpach
+ *
+ */
+
+#include <err.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include <sys/ipc.h>
+#include <sys/shm.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <err.h>
+
+#include <xs.h>
+#include <xenctrl.h>
+#include <xenguest.h>
+#include <xc_private.h>
+#include <xc_core.h>
+
+#undef ERROR
+#undef DEBUG
+#define ERROR(fmr, args...) do { fprintf(stderr, "ERROR: " fmr , ##args); } while (0)
+#define DEBUG(fmr, args...) do { fprintf(stderr, "DEBUG: " fmr , ##args); } while (0)
+
+#define FPP             (PAGE_SIZE/(guest_width))
+
+/* Number of entries in the pfn_to_mfn_frame_list_list */
+#define P2M_FLL_ENTRIES (((p2m_size)+(FPP*FPP)-1)/(FPP*FPP))
+
+static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int guest_width)
+{
+  return ((xen_pfn_t) ((guest_width==8)?
+                       (((uint64_t *)p2m)[(pfn)]):
+                       ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ?
+                            (-1UL) :
+                            (((uint32_t *)p2m)[(pfn)]))));
+}
+
+static int get_pt_level(int xc_handle, uint32_t domid,
+                        unsigned int *pt_level,
+                        unsigned int *guest_width)
+{
+    DECLARE_DOMCTL;
+    xen_capabilities_info_t xen_caps = "";
+
+    if (xc_version(xc_handle, XENVER_capabilities, &xen_caps) != 0)
+        return -1;
+
+    memset(&domctl, 0, sizeof(domctl));
+    domctl.domain = domid;
+    domctl.cmd = XEN_DOMCTL_get_address_size;
+
+    if ( do_domctl(xc_handle, &domctl) != 0 )
+        return -1;
+
+    *guest_width = domctl.u.address_size.size / 8;
+
+    if (strstr(xen_caps, "xen-3.0-x86_64"))
+        /* Depends on whether it's a compat 32-on-64 guest */
+        *pt_level = ( (*guest_width == 8) ? 4 : 3 );
+    else if (strstr(xen_caps, "xen-3.0-x86_32p"))
+        *pt_level = 3;
+    else if (strstr(xen_caps, "xen-3.0-x86_32"))
+        *pt_level = 2;
+    else
+        return -1;
+
+    return 0;
+}
+
+#define PG_OFFLINE_STATUS_HANDLED (1UL << 14)
+int
+main(int argc, char **argv)
+{
+    unsigned long start, end;
+    int xc_handle = 0, i, num, rc;
+    uint32_t *status = NULL;
+
+    if (argc != 3)
+        fprintf(stderr, "usage: %s start end", argv[0]);
+
+    start = strtoul(argv[1], NULL, 0);
+    end = strtoul(argv[2], NULL, 0);
+
+    xc_handle = xc_interface_open();
+
+    if (!xc_handle)
+        return -1;
+
+    num = end - start + 1;
+
+    rc = -ENOMEM;
+    status  = malloc(num * sizeof(uint32_t));
+    if (!status)
+        return -EINVAL;
+    memset(status, 0, sizeof(uint32_t)*num);
+
+    rc = xc_mark_page_offline(xc_handle, start, end, status);
+
+    if (rc)
+    {
+        ERROR("fail to mark pages offline %x\n", rc);
+        return -EINVAL;
+    }
+
+    for (i = 0; i < num; i++)
+    {
+        DEBUG("pfn %lx status %x\n", start + i, status[i]);
+
+        if (status[i] & PG_OFFLINE_STATUS_HANDLED)
+            continue;
+
+        switch (status[i] & PG_OFFLINE_STATUS_MASK)
+        {
+            case PG_OFFLINE_OFFLINED:
+                DEBUG("offlined page %lx\n", start + i);
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+                break;
+            break;
+
+            case PG_OFFLINE_PENDING:
+            {
+                uint32_t domid, j;
+                int port, xce = -1, rc;
+                unsigned long p2m_size;
+                xen_pfn_t *p2m_table = NULL;
+                xen_pfn_t *m2p_table = NULL;
+                xc_dominfo_t info;
+                uint64_aligned_t shared_info_frame;
+                shared_info_any_t *live_shinfo = NULL;
+                uint32_t *pfn_type = NULL;
+                unsigned long *pfn_real = NULL, max_mfn;
+                int suspend_evtchn = -1, suspended = 0;
+                uint32_t pt_level = 0, guest_width = 0;
+
+                domid = status[i] >> PG_OFFLINE_OWNER_SHIFT;
+
+                if ( !domid || (domid > DOMID_FIRST_RESERVED) )
+                {
+                    DEBUG("Dom0's page can't be LM");
+                    goto failed;
+                }
+
+                if ( get_pt_level(xc_handle, domid, &pt_level, &guest_width) )
+                {
+                    ERROR("Unable to get PT level info.");
+                    goto failed;
+                }
+
+                if ( xc_domain_getinfo(xc_handle, domid, 1, &info) != 1 )
+                {
+                    ERROR("Could not get domain info");
+                    goto failed;
+                }
+
+                if (info.hvm)
+                {
+                    DEBUG("please Live migrate dom %x\n", domid);
+                    goto failed;
+                }
+
+                /* Map the p2m table and M2P table */
+                shared_info_frame = info.shared_info_frame;
+
+                live_shinfo = xc_map_foreign_range(xc_handle, domid,
+                  PAGE_SIZE, PROT_READ, shared_info_frame);
+                if ( !live_shinfo )
+                {
+                    ERROR("Couldn't map live_shinfo");
+                    goto failed;
+                }
+
+                if ( (rc = xc_core_arch_map_p2m_writable(xc_handle, guest_width, &info,
+                                                         live_shinfo, &p2m_table,  &p2m_size)) )
+                {
+                    ERROR("Couldn't map p2m table %x\n", rc);
+                    goto failed;
+                }
+
+                max_mfn = xc_memory_op(xc_handle, XENMEM_maximum_ram_page, NULL);
+                if ( !(m2p_table = xc_map_m2p(xc_handle, max_mfn, PROT_READ, NULL)) )
+                {
+                    ERROR("Failed to map live M2P table");
+                    goto failed;
+                }
+
+                /* Suspend the guest */
+                port = xs_suspend_evtchn_port(domid);
+                if (port < 0)
+                {
+                    ERROR("Dom %x: No suspsend port, try live migration\n",
+                            domid);
+                    goto failed;
+                }
+
+                xce = xc_evtchn_open();
+                if (xce < 0)
+                {
+                    ERROR("Dom %x: fail to open evtchn\n",
+                            domid);
+                            goto failed;
+                }
+
+                suspend_evtchn =
+                  xc_suspend_evtchn_init(xc_handle, xce, domid, port);
+                if (suspend_evtchn < 0)
+                {
+                    ERROR("suspend event channel initialization failed\n");
+                    goto failed;
+                }
+
+                rc = xc_evtchn_notify(xce, suspend_evtchn);
+                if (rc < 0) {
+                    ERROR("failed to notify suspend channel: %d", rc);
+                    goto failed;
+                }
+                if (xc_await_suspend(xce, suspend_evtchn) < 0) {
+                    ERROR("suspend failed");\
+                    goto failed;
+                }
+                suspended = 1;
+
+                /* Get pfn type */
+                pfn_type = malloc(sizeof(uint32_t) * p2m_size);
+                if (!pfn_type)
+                {
+                    ERROR("Failed to malloc pfn_type\n");
+                    goto failed;
+                }
+                memset(pfn_type, 0, sizeof(uint32_t) * p2m_size);
+
+                pfn_real = malloc(sizeof(unsigned long) * p2m_size);
+                if (!pfn_real)
+                {
+                    ERROR("Failed to malloc pfn_real\n");
+                    goto failed;
+                }
+                memset(pfn_real, 0, sizeof(unsigned long) * p2m_size);
+
+                for (j = 0; j < p2m_size; j++)
+                    pfn_type[j] = pfn_to_mfn(j, p2m_table, guest_width);
+
+                if ( lock_pages(pfn_type, p2m_size * sizeof(*pfn_type)) )
+                {
+                    ERROR("Unable to lock pfn_type array");
+                    goto failed;
+                }
+                if ( lock_pages(pfn_real, p2m_size * sizeof(*pfn_real)) )
+                {
+                    ERROR("Unable to lock pfn_real array");
+                    goto failed;
+                }
+
+                for (j = 0; j < p2m_size ; j+=1024)
+                {
+                    int count = ((p2m_size - j ) > 1024 ) ? 1024: (p2m_size - j);
+                    if ( ( rc = xc_get_pfn_type_batch(xc_handle, domid, count,
+                              pfn_type + j)) )
+                    {
+                        ERROR("Failed to get pfn_type %x\n", rc);
+                         goto failed;
+                    }
+                }
+
+                /* Now replace the page */
+                for (j = 0; j < p2m_size; j++)
+                    pfn_real[j] = pfn_type[j];
+
+                for (j = i ; j < num; j++)
+                {
+                    if ( ((status[j]& PG_OFFLINE_STATUS_MASK) == PG_OFFLINE_PENDING) &&
+                         ((status[j] >> PG_OFFLINE_OWNER_SHIFT) == domid) )
+                    {
+#define mfn_to_pfn(_mfn)  (m2p_table[(_mfn)])
+                             rc = xc_replace_page(xc_handle, mfn_to_pfn(start + j),
+                               domid, p2m_table,
+                               pfn_real, p2m_size, pt_level,
+                               guest_width);
+                             if (rc)
+                             {
+                                 ERROR("Failed to replace page %x\n", j);
+                                 goto failed;
+                             }
+
+                             status[j] |= PG_OFFLINE_STATUS_HANDLED;
+                    }
+                }
+
+failed:
+                status[i] |= PG_OFFLINE_STATUS_HANDLED;
+
+                if (p2m_table)
+                    munmap(p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE);
+                if (pfn_type)
+                    free(pfn_type);
+                if (live_shinfo)
+                    munmap(live_shinfo, PAGE_SIZE);
+                if (suspend_evtchn > 0)
+                    xc_suspend_evtchn_release(xc_handle, suspend_evtchn);
+                if (suspended)
+                    xc_domain_resume(xc_handle, domid, 1);
+                if (xce > 0)
+                    xc_evtchn_close(xce);
+                break;
+            }
+            default:
+            {
+                ERROR("Error status result %x\n", status[i]);
+                break;
+            }
+        }
+
+    }
+    if (xc_handle)
+        xc_interface_close(xc_handle);
+    if (status)
+        free(status);
+    return 0;
+}

[-- Attachment #5: exchange_page.patch --]
[-- Type: application/octet-stream, Size: 3048 bytes --]

This patch add a function to exchange a foreign domain's one page to another.

Although there is already has a memory_exchange hypercall, but that hypercall will not accept new page, add that support will break current ABI. Also it does not support foreign domain. This function add those support.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

diff -r 5a3bc2ebbc64 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Thu Mar 19 06:14:15 2009 +0800
+++ b/xen/arch/x86/mm.c	Thu Mar 19 06:58:20 2009 +0800
@@ -2929,6 +2929,57 @@ int do_mmuext_op(
             put_page(mfn_to_page(src_mfn));
             break;
         }
+        case MMUEXT_EXCHANGE_PAGE:
+        {
+            unsigned long old_mfn, new_mfn;
+            xen_pfn_t old_pfn;
+
+            old_mfn = op.arg2.src_mfn;
+            new_mfn = op.arg1.mfn;
+
+            if ( !get_page_from_pagenr(new_mfn, FOREIGNDOM) )
+            {
+                okay = 0;
+                break;
+            }
+
+            if ( !FOREIGNDOM->is_shut_down ||
+                (FOREIGNDOM->shutdown_code != SHUTDOWN_suspend) )
+            {
+                okay = 0;
+                goto exchange_fail;
+            }
+
+            /* No referece to this page from page table anymore */
+
+            old_pfn = mfn_to_gmfn(FOREIGNDOM, old_mfn);
+
+            /* release original one */
+            if ( unlikely(steal_page(FOREIGNDOM, mfn_to_page(old_mfn), 0)) )
+            {
+                okay = 0;
+                goto exchange_fail;
+            }
+
+            if ( !test_and_clear_bit(_PGC_allocated,
+                  &(mfn_to_page(old_mfn)->count_info)) )
+                BUG();
+
+            guest_physmap_remove_page(FOREIGNDOM, old_pfn, old_mfn, 0);
+
+            put_page(mfn_to_page(old_mfn));
+
+            /* Setup the new page */
+            guest_physmap_add_page(FOREIGNDOM, old_pfn, new_mfn, 0);
+
+            if ( !paging_mode_translate(FOREIGNDOM) )
+            {
+                set_gpfn_from_mfn(new_mfn, old_pfn);
+            }
+exchange_fail:
+            put_page(mfn_to_page(new_mfn));
+            break;
+          }
 
         default:
             MEM_LOG("Invalid extended pt command 0x%x", op.cmd);
diff -r 5a3bc2ebbc64 xen/include/public/xen.h
--- a/xen/include/public/xen.h	Thu Mar 19 06:14:15 2009 +0800
+++ b/xen/include/public/xen.h	Thu Mar 19 06:16:22 2009 +0800
@@ -238,6 +238,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
  * cmd: MMUEXT_COPY_PAGE
  * mfn: Machine frame number of the destination page.
  * src_mfn: Machine frame number of the source page.
+ *
+ * cmd: MMUEXT_EXCHANGE_PAGE
+ * src_mfn: Machine frame number of the page that will be replaced by new_mfn
+ * mfn: the mfn of the new page.
  */
 #define MMUEXT_PIN_L1_TABLE      0
 #define MMUEXT_PIN_L2_TABLE      1
@@ -256,6 +260,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define MMUEXT_NEW_USER_BASEPTR 15
 #define MMUEXT_CLEAR_PAGE       16
 #define MMUEXT_COPY_PAGE        17
+#define MMUEXT_EXCHANGE_PAGE        18
 
 #ifndef __ASSEMBLY__
 struct mmuext_op {

[-- Attachment #6: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 13:01                               ` Jiang, Yunhong
@ 2009-03-19 13:22                                 ` Keir Fraser
  2009-03-19 14:26                                   ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-19 13:22 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 19/03/2009 13:01, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> Instead of add a new mmu_*op, I try to change the  MMU_NORMAL_PT_UPDATE to
> support update foreign page table. There will be some restirction for it,
> including the pagetable and the new mfn pointed should be owned by same
> domain, and the domain should be suspeneded already. The
> update_foregin_pt.patch is for this. I'm not sure if this method is feasible.

I'll need to look in more detail, but yes it does appear this approach can
work. This is probably the easiest way to go in that case.

> The exchange_page.patch is to replace a mfn with a new one. Although there is
> already a memory_exchange hypercall, but that hypercall will not accept new
> page, add that support will break current ABI. Also it does not support
> foreign domain. This function add those support

Well, why is the existing hypercall not okay by itself? Why can you not take
an arbitrary new mfn and you feel you have to specify it from the tools
instead?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 13:22                                 ` Keir Fraser
@ 2009-03-19 14:26                                   ` Jiang, Yunhong
  2009-03-19 14:36                                     ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19 14:26 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 19/03/2009 13:01, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>> Instead of add a new mmu_*op, I try to change the MMU_NORMAL_PT_UPDATE to
>> support update foreign page table. There will be some restirction for it,
>> including the pagetable and the new mfn pointed should be owned by same
>> domain, and the domain should be suspeneded already. The
>> update_foregin_pt.patch is for this. I'm not sure if this method is
>> feasible. 
> 
> I'll need to look in more detail, but yes it does appear this
> approach can
> work. This is probably the easiest way to go in that case.
> 
>> The exchange_page.patch is to replace a mfn with a new one. Although there
>> is already a memory_exchange hypercall, but that hypercall will not accept
>> new page, add that support will break current ABI. Also it does not support
>> foreign domain. This function add those support
> 
> Well, why is the existing hypercall not okay by itself? Why
> can you not take
> an arbitrary new mfn and you feel you have to specify it from the tools
> instead? 

It is ok for us to use an arbitrary new mfn, and then do the update_entry. But what happen if this process failed and we want to turn back to the old page? We still need this mechanism at that situation.

Thanks
Yunhong Jiang

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 14:26                                   ` Jiang, Yunhong
@ 2009-03-19 14:36                                     ` Keir Fraser
  2009-03-19 14:42                                       ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-19 14:36 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 19/03/2009 14:26, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Well, why is the existing hypercall not okay by itself? Why
>> can you not take
>> an arbitrary new mfn and you feel you have to specify it from the tools
>> instead? 
> 
> It is ok for us to use an arbitrary new mfn, and then do the update_entry. But
> what happen if this process failed and we want to turn back to the old page?
> We still need this mechanism at that situation.

If what failed? The update_entry? How could that happen?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 14:36                                     ` Keir Fraser
@ 2009-03-19 14:42                                       ` Jiang, Yunhong
  2009-03-19 14:48                                         ` Jiang, Yunhong
  2009-03-19 16:45                                         ` Keir Fraser
  0 siblings, 2 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19 14:42 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

xen-devel-bounces@lists.xensource.com <> wrote:
> On 19/03/2009 14:26, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>> Well, why is the existing hypercall not okay by itself? Why can you not
>>> take an arbitrary new mfn and you feel you have to specify it from the
>>> tools instead?
>> 
>> It is ok for us to use an arbitrary new mfn, and then do the update_entry.
>> But what happen if this process failed and we want to turn back to the old
>> page? We still need this mechanism at that situation.
> 
> If what failed? The update_entry? How could that happen?

Per discussion before, when the page is granted to other domain, then after we update all entry, there will still have reference to left.

-- yhj

> 
> -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 14:42                                       ` Jiang, Yunhong
@ 2009-03-19 14:48                                         ` Jiang, Yunhong
  2009-03-19 16:45                                         ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-19 14:48 UTC (permalink / raw)
  To: Jiang, Yunhong, Keir Fraser, Tim Deegan; +Cc: xen-devel

Or when page is mapped by other domain.


xen-devel-bounces@lists.xensource.com <> wrote:
> xen-devel-bounces@lists.xensource.com <> wrote:
>> On 19/03/2009 14:26, "Jiang, Yunhong"
> <yunhong.jiang@intel.com> wrote:
>> 
>>>> Well, why is the existing hypercall not okay by itself? Why can you not
>>>> take an arbitrary new mfn and you feel you have to specify it from the
>>>> tools instead?
>>> 
>>> It is ok for us to use an arbitrary new mfn, and then do the update_entry.
>>> But what happen if this process failed and we want to turn back to the old
>>> page? We still need this mechanism at that situation.
>> 
>> If what failed? The update_entry? How could that happen?
> 
> Per discussion before, when the page is granted to other
> domain, then after we update all entry, there will still have reference to
> left. 
> 
> -- yhj
> 
>> 
>> -- Keir
>> 
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 14:42                                       ` Jiang, Yunhong
  2009-03-19 14:48                                         ` Jiang, Yunhong
@ 2009-03-19 16:45                                         ` Keir Fraser
  2009-03-20  2:52                                           ` Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-19 16:45 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 19/03/2009 14:42, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>>> It is ok for us to use an arbitrary new mfn, and then do the update_entry.
>>> But what happen if this process failed and we want to turn back to the old
>>> page? We still need this mechanism at that situation.
>> 
>> If what failed? The update_entry? How could that happen?
> 
> Per discussion before, when the page is granted to other domain, then after we
> update all entry, there will still have reference to left.

Hmmm I don't really understand.

 K.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-19 16:45                                         ` Keir Fraser
@ 2009-03-20  2:52                                           ` Jiang, Yunhong
  2009-03-20  9:05                                             ` Keir Fraser
  2009-03-20  9:37                                             ` Re: [PATCH] Support swap a page from user spacetools " Jan Beulich
  0 siblings, 2 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  2:52 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

xen-devel-bounces@lists.xensource.com <> wrote:
> On 19/03/2009 14:42, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>>> It is ok for us to use an arbitrary new mfn, and then do the
>>>> update_entry. But what happen if this process failed and we want to turn
>>>> back to the old page? We still need this mechanism at that situation.
>>> 
>>> If what failed? The update_entry? How could that happen?
>> 
>> Per discussion before, when the page is granted to other domain, then
>> after we update all entry, there will still have reference to left.
> 
> Hmmm I don't really understand.

The basic idea to offline a page is:

1) mark a page offline pending
2) If the page is owned by a HVM domain, user have to live migrate it
3) If the page is owned by a PV domain, we will try to exchange the offline pending page to a new one and free the old page. (This is target of this series patches).

The method to exchange the offline pending page for PV domain is:
1) Suspend the guest.
2) Allocate a new page for the guest
3) Get a copy for the content
4) User space tools will scan all page table page to see if any reference to the offending page, if yes, then it will hypercall to Xen to replace the entry to point to the new one. (Through the mmu_*ops)
5) After update all page tables, user space tools will try to exchange the old page with the new page. If the new mfn has no reference anymore (i.e. count_info & count_mask = 1), the exchange will update the m2p and return success, otherwise it will return fail. (the page may be referenced by other domain, like grant table or foreign mapped).
6) If step 5 is success, user space tools will update the content of the new page and the p2m table, else it will try to undo step 4 to revert page table changes.
7) Resume the guest.

This requires we need to allocate the new page before the exchange call and we have to pass both old_mfn and new_mfn in step 5 to exchange the memory. However, current hypercall will always allocate a new page to replace the old one.

Currently I try to add a new hypercall for this purpose. 

Maybe we can enhance the current XENMEM_exchange to accept a mem_flags, when that flag is set, the exch.out.extent_start will be the new_mfn instead of the gpfn, and the gpfn will be always same as corresponding gpfn in the exch.in.ext_start. But I do think this is a bit tricky, it change the meaning of exch.out.extent_start and how the gpn is pass down. 

Thanks
Yunhong Jiang





> 
> K.


> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  2:52                                           ` Jiang, Yunhong
@ 2009-03-20  9:05                                             ` Keir Fraser
  2009-03-20  9:16                                               ` Jiang, Yunhong
  2009-03-20  9:37                                             ` Re: [PATCH] Support swap a page from user spacetools " Jan Beulich
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:05 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 20/03/2009 02:52, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> This requires we need to allocate the new page before the exchange call and we
> have to pass both old_mfn and new_mfn in step 5 to exchange the memory.
> However, current hypercall will always allocate a new page to replace the old
> one.
> 
> Currently I try to add a new hypercall for this purpose.
> 
> Maybe we can enhance the current XENMEM_exchange to accept a mem_flags, when
> that flag is set, the exch.out.extent_start will be the new_mfn instead of the
> gpfn, and the gpfn will be always same as corresponding gpfn in the
> exch.in.ext_start. But I do think this is a bit tricky, it change the meaning
> of exch.out.extent_start and how the gpn is pass down.

Thanks for the description. I guess I will wait for your next patch, which
should I think at least separate the foreign pagetable update hypercall from
this weird swap hypercall. Then I can comment on the new swap hypercall
perhaps getting less confused by it. I certainly don't promise to get this
in for 3.4 at this point however.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  9:05                                             ` Keir Fraser
@ 2009-03-20  9:16                                               ` Jiang, Yunhong
  2009-03-20  9:28                                                 ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  9:16 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 02:52, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>> This requires we need to allocate the new page before the exchange call
>> and we have to pass both old_mfn and new_mfn in step 5 to exchange the
>> memory. However, current hypercall will always allocate a new page to
>> replace the old one. 
>> 
>> Currently I try to add a new hypercall for this purpose.
>> 
>> Maybe we can enhance the current XENMEM_exchange to accept a mem_flags,
>> when that flag is set, the exch.out.extent_start will be the new_mfn
>> instead of the gpfn, and the gpfn will be always same as corresponding
>> gpfn in the exch.in.ext_start. But I do think this is a bit tricky, it
>> change the meaning of exch.out.extent_start and how the gpn is pass down.
> 
> Thanks for the description. I guess I will wait for your next
> patch, which
> should I think at least separate the foreign pagetable update hypercall from

I think I have split that patch last night. There is no foreign page table hypercall anymore, instead, I just tried to enhance the current mmu_op. And this new "weird" swap is now named mmu_ext_exchange_page.(Is it really so weird to just pass the new mfn down?)

> this weird swap hypercall. Then I can comment on the new swap hypercall
> perhaps getting less confused by it. I certainly don't promise
> to get this
> in for 3.4 at this point however.

That's ok and should depends on our discussion. Just stated before, hope check in IF it pass the review. Now it is on-going still :)

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  9:16                                               ` Jiang, Yunhong
@ 2009-03-20  9:28                                                 ` Keir Fraser
  2009-03-20  9:42                                                   ` Re: [PATCH] Support swap a page from user space tools-- " Jan Beulich
  2009-03-20  9:44                                                   ` Re: [PATCH] Support swap a page from user space tools -- " Jiang, Yunhong
  0 siblings, 2 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:28 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 20/03/2009 09:16, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Thanks for the description. I guess I will wait for your next
>> patch, which
>> should I think at least separate the foreign pagetable update hypercall from
> 
> I think I have split that patch last night. There is no foreign page table
> hypercall anymore, instead, I just tried to enhance the current mmu_op. And
> this new "weird" swap is now named mmu_ext_exchange_page.(Is it really so
> weird to just pass the new mfn down?)

Ah yes, I found the email now. Well I'm still confused as to why it is
needed. It seems to me you could scan for all PTEs mapping old_pfn, stash
them in a list and temporarily make them not-present, and take a copy of
old_pfn. Then do a normal XENMEM_exchange: on failure revert all PTEs, on
success switch over all PTEs and copy old_pfn data into new_pfn. Well it is
more hypercalls (two updates per PTE) I suppose, but I doubt this matters
unless you're offlining a lot of pages, and we don't support offlining
memory banks really at the moment. Also some of this will potentially batch
up into multicalls or MMUOP_ lists nicely anyway.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  2:52                                           ` Jiang, Yunhong
  2009-03-20  9:05                                             ` Keir Fraser
@ 2009-03-20  9:37                                             ` Jan Beulich
  2009-03-20  9:41                                               ` Jiang, Yunhong
  2009-03-20  9:42                                               ` Keir Fraser
  1 sibling, 2 replies; 61+ messages in thread
From: Jan Beulich @ 2009-03-20  9:37 UTC (permalink / raw)
  To: Yunhong Jiang; +Cc: Tim Deegan, xen-devel, Keir Fraser

>>> "Jiang, Yunhong" <yunhong.jiang@intel.com> 20.03.09 03:52 >>>
>The method to exchange the offline pending page for PV domain is:
>1) Suspend the guest.
>2) Allocate a new page for the guest
>3) Get a copy for the content
>4) User space tools will scan all page table page to see if any reference to the offending page, if yes, then it will hypercall to Xen
>to replace the entry to point to the new one. (Through the mmu_*ops)
>5) After update all page tables, user space tools will try to exchange the old page with the new page. If the new mfn has no
>reference anymore (i.e. count_info & count_mask = 1), the exchange will update the m2p and return success, otherwise it will
>return fail. (the page may be referenced by other domain, like grant table or foreign mapped).

Hmm, if you consider the possibility of this case, then you should also consider the possibility of a page still being accessible by another domain at the point where you copy its content, but no longer in use when you do the exchange (which means that the content may have changed between the two points in time).

>6) If step 5 is success, user space tools will update the content of the new page and the p2m table, else it will try to undo step 4
>to revert page table changes.
>7) Resume the guest.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:37                                             ` Re: [PATCH] Support swap a page from user spacetools " Jan Beulich
@ 2009-03-20  9:41                                               ` Jiang, Yunhong
  2009-03-20  9:42                                               ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  9:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tim Deegan, xen-devel, Keir Fraser

Jan Beulich <mailto:jbeulich@novell.com> wrote:
>>>> "Jiang, Yunhong" <yunhong.jiang@intel.com> 20.03.09 03:52 >>>
>> The method to exchange the offline pending page for PV domain is: 1)
>> Suspend the guest. 2) Allocate a new page for the guest
>> 3) Get a copy for the content
>> 4) User space tools will scan all page table page to see if
> any reference to the offending page, if yes, then it will
> hypercall to Xen
>> to replace the entry to point to the new one. (Through the mmu_*ops)
>> 5) After update all page tables, user space tools will try to
> exchange the old page with the new page. If the new mfn has no
>> reference anymore (i.e. count_info & count_mask = 1), the
> exchange will update the m2p and return success, otherwise it will
>> return fail. (the page may be referenced by other domain,
> like grant table or foreign mapped).
> 
> Hmm, if you consider the possibility of this case, then you
> should also consider the possibility of a page still being
> accessible by another domain at the point where you copy its
> content, but no longer in use when you do the exchange (which
> means that the content may have changed between the two points
> in time).

Aha, yes, thanks for pointing this. I considerd this but apparently missed this race condition. 
When the page is freed, we can't map the page from the user space anymore, so we have to do it in the exchange hypercall to gurantee the atomic.

Keir, I checked the  XENMEM_exchange before, and it didn't do the copy, are there any reason for that? Or if we can add the copy to it?

Thanks
Yunhong Jiang

> 
>> 6) If step 5 is success, user space tools will update the
> content of the new page and the p2m table, else it will try to
> undo step 4
>> to revert page table changes.
>> 7) Resume the guest.
> 
> Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:37                                             ` Re: [PATCH] Support swap a page from user spacetools " Jan Beulich
  2009-03-20  9:41                                               ` Jiang, Yunhong
@ 2009-03-20  9:42                                               ` Keir Fraser
  2009-03-20  9:52                                                 ` Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:42 UTC (permalink / raw)
  To: Jan Beulich, Yunhong Jiang; +Cc: Tim Deegan, xen-devel

On 20/03/2009 09:37, "Jan Beulich" <jbeulich@novell.com> wrote:

>> 5) After update all page tables, user space tools will try to exchange the
>> old page with the new page. If the new mfn has no
>> reference anymore (i.e. count_info & count_mask = 1), the exchange will
>> update the m2p and return success, otherwise it will
>> return fail. (the page may be referenced by other domain, like grant table or
>> foreign mapped).
> 
> Hmm, if you consider the possibility of this case, then you should also
> consider the possibility of a page still being accessible by another domain at
> the point where you copy its content, but no longer in use when you do the
> exchange (which means that the content may have changed between the two points
> in time).

Since the guest is suspended, would that matter? Any I/Os that were in
flight on suspend will get resubmitted on resume. So the potential races
might all be benign.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user space tools-- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  9:28                                                 ` Keir Fraser
@ 2009-03-20  9:42                                                   ` Jan Beulich
  2009-03-20  9:48                                                     ` Keir Fraser
  2009-03-20  9:44                                                   ` Re: [PATCH] Support swap a page from user space tools -- " Jiang, Yunhong
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2009-03-20  9:42 UTC (permalink / raw)
  To: Keir Fraser, Yunhong Jiang; +Cc: Tim Deegan, xen-devel

>>> Keir Fraser <keir.fraser@eu.citrix.com> 20.03.09 10:28 >>>
>On 20/03/2009 09:16, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>>> Thanks for the description. I guess I will wait for your next
>>> patch, which
>>> should I think at least separate the foreign pagetable update hypercall from
>> 
>> I think I have split that patch last night. There is no foreign page table
>> hypercall anymore, instead, I just tried to enhance the current mmu_op. And
>> this new "weird" swap is now named mmu_ext_exchange_page.(Is it really so
>> weird to just pass the new mfn down?)
>
>Ah yes, I found the email now. Well I'm still confused as to why it is
>needed. It seems to me you could scan for all PTEs mapping old_pfn, stash
>them in a list and temporarily make them not-present, and take a copy of
>old_pfn. Then do a normal XENMEM_exchange: on failure revert all PTEs, on
>success switch over all PTEs and copy old_pfn data into new_pfn. Well it is
>more hypercalls (two updates per PTE) I suppose, but I doubt this matters
>unless you're offlining a lot of pages, and we don't support offlining
>memory banks really at the moment. Also some of this will potentially batch
>up into multicalls or MMUOP_ lists nicely anyway.

A normal XENMEM_exchange wouldn't work here, would it? The old page
must have no other references for it to succeed, and will go back to the
allocator right afterwards - it's contents won't be recoverable. This would
probably require a new flag to XENMEM_exchange (which I agree would be
much simpler than adding a full-blown new [sub-]hypercall).

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  9:28                                                 ` Keir Fraser
  2009-03-20  9:42                                                   ` Re: [PATCH] Support swap a page from user space tools-- " Jan Beulich
@ 2009-03-20  9:44                                                   ` Jiang, Yunhong
  2009-03-20  9:52                                                     ` Keir Fraser
  1 sibling, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  9:44 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan; +Cc: xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 09:16, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>> Thanks for the description. I guess I will wait for your next patch, which
>>> should I think at least separate the foreign pagetable update hypercall
>>> from 
>> 
>> I think I have split that patch last night. There is no foreign page table
>> hypercall anymore, instead, I just tried to enhance the current mmu_op. And
>> this new "weird" swap is now named mmu_ext_exchange_page.(Is it really so
>> weird to just pass the new mfn down?)
> 
> Ah yes, I found the email now. Well I'm still confused as to why it is
> needed. It seems to me you could scan for all PTEs mapping
> old_pfn, stash
> them in a list and temporarily make them not-present, and take
> a copy of
> old_pfn. Then do a normal XENMEM_exchange: on failure revert
> all PTEs, on
> success switch over all PTEs and copy old_pfn data into
> new_pfn. Well it is
> more hypercalls (two updates per PTE) I suppose, but I doubt
> this matters
> unless you're offlining a lot of pages, and we don't support offlining
> memory banks really at the moment. Also some of this will
> potentially batch
> up into multicalls or MMUOP_ lists nicely anyway.

Yes, seems this method works well. The only change to Xen is to enhance current mmu_xxx_pt_update to support foreign page table.

I will update the patch accordingly to see if it can pass the review  :-)

Thanks
Yunhong Jiang

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user  space tools-- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:42                                                   ` Re: [PATCH] Support swap a page from user space tools-- " Jan Beulich
@ 2009-03-20  9:48                                                     ` Keir Fraser
  0 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:48 UTC (permalink / raw)
  To: Jan Beulich, Yunhong Jiang; +Cc: Tim Deegan, xen-devel

On 20/03/2009 09:42, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Ah yes, I found the email now. Well I'm still confused as to why it is
>> needed. It seems to me you could scan for all PTEs mapping old_pfn, stash
>> them in a list and temporarily make them not-present, and take a copy of
>> old_pfn. Then do a normal XENMEM_exchange: on failure revert all PTEs, on
>> success switch over all PTEs and copy old_pfn data into new_pfn. Well it is
>> more hypercalls (two updates per PTE) I suppose, but I doubt this matters
>> unless you're offlining a lot of pages, and we don't support offlining
>> memory banks really at the moment. Also some of this will potentially batch
>> up into multicalls or MMUOP_ lists nicely anyway.
> 
> A normal XENMEM_exchange wouldn't work here, would it? The old page
> must have no other references for it to succeed, and will go back to the
> allocator right afterwards - it's contents won't be recoverable. This would
> probably require a new flag to XENMEM_exchange (which I agree would be
> much simpler than adding a full-blown new [sub-]hypercall).

If the XENMEM_exchange succeeds then you don't need the old_pfn any more.
You only need recover if the excahnge fails, and in that case the page isn't
released by XENMEM_exchange. This is my understanding at least. :-) I'm
trying to work out whether I'm wrong or whether the extra Xen bits are not
needed.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:42                                               ` Keir Fraser
@ 2009-03-20  9:52                                                 ` Jiang, Yunhong
  2009-03-20  9:58                                                   ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  9:52 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tim Deegan, xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 09:37, "Jan Beulich" <jbeulich@novell.com> wrote:
> 
>>> 5) After update all page tables, user space tools will try to exchange the
>>> old page with the new page. If the new mfn has no
>>> reference anymore (i.e. count_info & count_mask = 1), the exchange will
>>> update the m2p and return success, otherwise it will
>>> return fail. (the page may be referenced by other domain, like grant
>>> table or foreign mapped).
>> 
>> Hmm, if you consider the possibility of this case, then you should also
>> consider the possibility of a page still being accessible by another
>> domain at the point where you copy its content, but no longer in use when
>> you do the exchange (which means that the content may have changed between
>> the two points in time).
> 
> Since the guest is suspended, would that matter? Any I/Os that were in
> flight on suspend will get resubmitted on resume. So the
> potential races
> might all be benign.

I think it may not be resubmited. Backend may have no idea at all if the guest is resume quickly, I remember. Or maybe I miss understand the interface between backend/frontend..

Thanks
Yunhong Jiang

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user space tools -- Was RE: [RFC][PATCH] Basic support for page offline
  2009-03-20  9:44                                                   ` Re: [PATCH] Support swap a page from user space tools -- " Jiang, Yunhong
@ 2009-03-20  9:52                                                     ` Keir Fraser
  0 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:52 UTC (permalink / raw)
  To: Jiang, Yunhong, Tim Deegan; +Cc: xen-devel

On 20/03/2009 09:44, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> Yes, seems this method works well. The only change to Xen is to enhance
> current mmu_xxx_pt_update to support foreign page table.
> 
> I will update the patch accordingly to see if it can pass the review  :-)

Yay! Good news.

The only question, then, is whether we think the races against backends are
benign or not. Since the guest is suspended (in the ready-for-save-restore
meaning of suspended, right?) I think you get away with it. If actually
there are dangerous races we may need Xen support for this operation yet.
But if the guest is really suspended this is no more dangerous than the
stop-and-copy phase of a domain save or live migration, I would say. And
that obviously works!

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:52                                                 ` Jiang, Yunhong
@ 2009-03-20  9:58                                                   ` Keir Fraser
  2009-03-20  9:59                                                     ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-20  9:58 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 09:52, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Since the guest is suspended, would that matter? Any I/Os that were in
>> flight on suspend will get resubmitted on resume. So the
>> potential races
>> might all be benign.
> 
> I think it may not be resubmited. Backend may have no idea at all if the guest
> is resume quickly, I remember. Or maybe I miss understand the interface
> between backend/frontend..

Ah, you do a suspend-cancel/fast-resume?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:58                                                   ` Keir Fraser
@ 2009-03-20  9:59                                                     ` Jiang, Yunhong
  2009-03-20 10:03                                                       ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20  9:59 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tim Deegan, xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 09:52, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>> Since the guest is suspended, would that matter? Any I/Os that were in
>>> flight on suspend will get resubmitted on resume. So the potential races
>>> might all be benign.
>> 
>> I think it may not be resubmited. Backend may have no idea at all if the
>> guest is resume quickly, I remember. Or maybe I miss understand the
>> interface between backend/frontend..
> 
> Ah, you do a suspend-cancel/fast-resume?

Yes, that's suggested by Tim and I think that's meet our purpose quite well.

Thanks
Yunhong Jiang

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20  9:59                                                     ` Jiang, Yunhong
@ 2009-03-20 10:03                                                       ` Keir Fraser
  2009-03-20 10:05                                                         ` Jiang, Yunhong
  2009-03-20 10:07                                                         ` Keir Fraser
  0 siblings, 2 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20 10:03 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 09:59, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>>>> Since the guest is suspended, would that matter? Any I/Os that were in
>>>> flight on suspend will get resubmitted on resume. So the potential races
>>>> might all be benign.
>>> 
>>> I think it may not be resubmited. Backend may have no idea at all if the
>>> guest is resume quickly, I remember. Or maybe I miss understand the
>>> interface between backend/frontend..
>> 
>> Ah, you do a suspend-cancel/fast-resume?
> 
> Yes, that's suggested by Tim and I think that's meet our purpose quite well.

Okay, then I suggest you extend XENMEM_exchange so that in.mem_flags can
tell that hypercall to copy data from old to new pages. XENMEMF_copy_data?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:03                                                       ` Keir Fraser
@ 2009-03-20 10:05                                                         ` Jiang, Yunhong
  2009-03-20 10:07                                                         ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20 10:05 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tim Deegan, xen-devel

Yes, I'm working on that in fact.

-- yhj

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 09:59, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>>>> Since the guest is suspended, would that matter? Any I/Os that were in
>>>>> flight on suspend will get resubmitted on resume. So the potential races
>>>>> might all be benign.
>>>> 
>>>> I think it may not be resubmited. Backend may have no idea at all if the
>>>> guest is resume quickly, I remember. Or maybe I miss understand the
>>>> interface between backend/frontend..
>>> 
>>> Ah, you do a suspend-cancel/fast-resume?
>> 
>> Yes, that's suggested by Tim and I think that's meet our purpose quite
>> well. 
> 
> Okay, then I suggest you extend XENMEM_exchange so that
> in.mem_flags can
> tell that hypercall to copy data from old to new pages. XENMEMF_copy_data?
> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:03                                                       ` Keir Fraser
  2009-03-20 10:05                                                         ` Jiang, Yunhong
@ 2009-03-20 10:07                                                         ` Keir Fraser
  2009-03-20 10:13                                                           ` Jiang, Yunhong
  2009-03-20 10:19                                                           ` Keir Fraser
  1 sibling, 2 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20 10:07 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 10:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>>> Ah, you do a suspend-cancel/fast-resume?
>> 
>> Yes, that's suggested by Tim and I think that's meet our purpose quite well.
> 
> Okay, then I suggest you extend XENMEM_exchange so that in.mem_flags can
> tell that hypercall to copy data from old to new pages. XENMEMF_copy_data?

Even this may not work. Old grants will reference the old page. Subsequent
attempts by a backend to map the grant will fail. And the resulting failed
I/Os will probably make the frontend driver throw a fit. Getting this
working with suspend-cancel seems pretty tricky.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:07                                                         ` Keir Fraser
@ 2009-03-20 10:13                                                           ` Jiang, Yunhong
  2009-03-20 10:21                                                             ` Keir Fraser
  2009-03-20 10:19                                                           ` Keir Fraser
  1 sibling, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20 10:13 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tim Deegan, xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 10:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
> 
>>>> Ah, you do a suspend-cancel/fast-resume?
>>> 
>>> Yes, that's suggested by Tim and I think that's meet our purpose quite
>>> well. 
>> 
>> Okay, then I suggest you extend XENMEM_exchange so that in.mem_flags can
>> tell that hypercall to copy data from old to new pages. XENMEMF_copy_data?
> 
> Even this may not work. Old grants will reference the old
> page. Subsequent
> attempts by a backend to map the grant will fail. And the
> resulting failed
> I/Os will probably make the frontend driver throw a fit. Getting this
> working with suspend-cancel seems pretty tricky.

If there is grant map for it, I think we will fail since the reference is not 1 when XENMEM_exchange.

Or do you mean there is a reference in grant table but is not mapped still? Will the refrence count be added when a page is granted? I'm not quite sure about this, but I think that will be same to original XENMEM_exchange. If yes, we may have to update the grant information?

The normal save-restore method may cause service broken, I think that's the reason of choosing the suspend-cacel method.

Thanks
Yunhong Jiang

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:07                                                         ` Keir Fraser
  2009-03-20 10:13                                                           ` Jiang, Yunhong
@ 2009-03-20 10:19                                                           ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20 10:19 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 10:07, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>>> Yes, that's suggested by Tim and I think that's meet our purpose quite well.
>> 
>> Okay, then I suggest you extend XENMEM_exchange so that in.mem_flags can
>> tell that hypercall to copy data from old to new pages. XENMEMF_copy_data?
> 
> Even this may not work. Old grants will reference the old page. Subsequent
> attempts by a backend to map the grant will fail. And the resulting failed
> I/Os will probably make the frontend driver throw a fit. Getting this
> working with suspend-cancel seems pretty tricky.

In fact do you have *any* handling for grants which have not yet been used
by the backend? It seems that such grants are doomed to fail since you
probably don't rewrite them?

Would it be a good idea to map the guest's grant pages and scan for your
old_pfn? Then fail the exchange if you see a match? That avoids this whole
issue and then also you don't need to worry about racing backends and you
can keep the copy operation where it is, in the tools, I think.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:13                                                           ` Jiang, Yunhong
@ 2009-03-20 10:21                                                             ` Keir Fraser
  2009-03-20 10:36                                                               ` Jiang, Yunhong
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-03-20 10:21 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 10:13, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Even this may not work. Old grants will reference the old
>> page. Subsequent
>> attempts by a backend to map the grant will fail. And the
>> resulting failed
>> I/Os will probably make the frontend driver throw a fit. Getting this
>> working with suspend-cancel seems pretty tricky.
> 
> If there is grant map for it, I think we will fail since the reference is not
> 1 when XENMEM_exchange.
> 
> Or do you mean there is a reference in grant table but is not mapped still?
> Will the refrence count be added when a page is granted? I'm not quite sure
> about this, but I think that will be same to original XENMEM_exchange. If yes,
> we may have to update the grant information?

Reference counts are not adjusted until a grant is mapped. So I'm talking
about grants which have been written by the domU, passed to the backend
driver, but not yet used. Those will be screwed.

Anyway, see my email just now, which suggests you map the guest's grant
table into your toolstack code, and you can go check for existing grants
yourself. As I say there -- if you do that I don't think you need worry
about atomicity of the copy.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:21                                                             ` Keir Fraser
@ 2009-03-20 10:36                                                               ` Jiang, Yunhong
  2009-03-20 10:40                                                                 ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jiang, Yunhong @ 2009-03-20 10:36 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tim Deegan, xen-devel

Keir Fraser <mailto:keir.fraser@eu.citrix.com> wrote:
> On 20/03/2009 10:13, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
> 
>>> Even this may not work. Old grants will reference the old page. Subsequent
>>> attempts by a backend to map the grant will fail. And the resulting failed
>>> I/Os will probably make the frontend driver throw a fit. Getting this
>>> working with suspend-cancel seems pretty tricky.
>> 
>> If there is grant map for it, I think we will fail since the reference is
>> not 1 when XENMEM_exchange. 
>> 
>> Or do you mean there is a reference in grant table but is not mapped still?
>> Will the refrence count be added when a page is granted? I'm not quite sure
>> about this, but I think that will be same to original XENMEM_exchange. If
>> yes, we may have to update the grant information?
> 
> Reference counts are not adjusted until a grant is mapped. So
> I'm talking
> about grants which have been written by the domU, passed to the backend
> driver, but not yet used. Those will be screwed.

Got it. 

> 
> Anyway, see my email just now, which suggests you map the guest's grant
> table into your toolstack code, and you can go check for
> existing grants
> yourself. As I say there -- if you do that I don't think you need worry
> about atomicity of the copy. 

Are there any method for user space to get the grant table information?

> 
> -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Re: [PATCH] Support swap a page from user spacetools -- Was RE: [RFC][PATCH] Basic support for page  offline
  2009-03-20 10:36                                                               ` Jiang, Yunhong
@ 2009-03-20 10:40                                                                 ` Keir Fraser
  0 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-03-20 10:40 UTC (permalink / raw)
  To: Jiang, Yunhong, Jan Beulich; +Cc: Tim Deegan, xen-devel

On 20/03/2009 10:36, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Anyway, see my email just now, which suggests you map the guest's grant
>> table into your toolstack code, and you can go check for
>> existing grants
>> yourself. As I say there -- if you do that I don't think you need worry
>> about atomicity of the copy.
> 
> Are there any method for user space to get the grant table information?

GNTTABOP_query_size to discover size of the grant table.

GNTTABOP_setup_table to get the frame list (pass in nr_frames from the
query_size operation).

Map the frames and away you go.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2009-03-20 10:40 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-09  8:54 [RFC][PATCH] Basic support for page offline Jiang, Yunhong
2009-02-10  9:15 ` Tim Deegan
2009-02-10  9:29   ` Jiang, Yunhong
2009-02-10  9:42     ` Tim Deegan
2009-02-10 10:29     ` Keir Fraser
2009-02-10 21:09 ` Frank van der Linden
2009-02-11  0:16   ` Jiang, Yunhong
2009-02-11  0:39     ` Frank van der Linden
2009-02-11  1:08       ` Jiang, Yunhong
2009-02-11  4:08         ` Frank Van Der Linden
2009-02-13 17:03 ` Tim Deegan
2009-02-13 17:36   ` Keir Fraser
2009-02-15  9:39     ` Jiang, Yunhong
2009-02-15  9:48   ` Jiang, Yunhong
2009-02-16 14:31     ` Tim Deegan
2009-02-16 15:25       ` Jiang, Yunhong
2009-02-18 14:51       ` Jiang, Yunhong
2009-02-18 15:20         ` Tim Deegan
2009-02-19  8:44           ` Jiang, Yunhong
2009-02-19 14:37             ` Jiang, Yunhong
2009-03-02 11:56               ` Tim Deegan
2009-03-04  8:23                 ` Jiang, Yunhong
2009-03-18 10:24                 ` [PATCH] Support swap a page from user space tools -- Was " Jiang, Yunhong
2009-03-18 10:32                   ` Jiang, Yunhong
2009-03-18 10:42                     ` Keir Fraser
2009-03-18 17:34                   ` Tim Deegan
2009-03-19  5:12                     ` Jiang, Yunhong
2009-03-19  9:32                       ` Tim Deegan
2009-03-19  9:45                         ` Keir Fraser
2009-03-19  9:57                           ` Jiang, Yunhong
2009-03-19 10:13                             ` Keir Fraser
2009-03-19 13:01                               ` Jiang, Yunhong
2009-03-19 13:22                                 ` Keir Fraser
2009-03-19 14:26                                   ` Jiang, Yunhong
2009-03-19 14:36                                     ` Keir Fraser
2009-03-19 14:42                                       ` Jiang, Yunhong
2009-03-19 14:48                                         ` Jiang, Yunhong
2009-03-19 16:45                                         ` Keir Fraser
2009-03-20  2:52                                           ` Jiang, Yunhong
2009-03-20  9:05                                             ` Keir Fraser
2009-03-20  9:16                                               ` Jiang, Yunhong
2009-03-20  9:28                                                 ` Keir Fraser
2009-03-20  9:42                                                   ` Re: [PATCH] Support swap a page from user space tools-- " Jan Beulich
2009-03-20  9:48                                                     ` Keir Fraser
2009-03-20  9:44                                                   ` Re: [PATCH] Support swap a page from user space tools -- " Jiang, Yunhong
2009-03-20  9:52                                                     ` Keir Fraser
2009-03-20  9:37                                             ` Re: [PATCH] Support swap a page from user spacetools " Jan Beulich
2009-03-20  9:41                                               ` Jiang, Yunhong
2009-03-20  9:42                                               ` Keir Fraser
2009-03-20  9:52                                                 ` Jiang, Yunhong
2009-03-20  9:58                                                   ` Keir Fraser
2009-03-20  9:59                                                     ` Jiang, Yunhong
2009-03-20 10:03                                                       ` Keir Fraser
2009-03-20 10:05                                                         ` Jiang, Yunhong
2009-03-20 10:07                                                         ` Keir Fraser
2009-03-20 10:13                                                           ` Jiang, Yunhong
2009-03-20 10:21                                                             ` Keir Fraser
2009-03-20 10:36                                                               ` Jiang, Yunhong
2009-03-20 10:40                                                                 ` Keir Fraser
2009-03-20 10:19                                                           ` Keir Fraser
2009-03-19  9:48                         ` [PATCH] Support swap a page from user space tools " Jiang, Yunhong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.