* [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
@ 2013-09-20 13:10 David Vrabel
  2013-09-20 13:10 ` [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry David Vrabel
                   ` (11 more replies)
  0 siblings, 12 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

The series (for Xen 4.4) improves the kexec hypercall by making Xen
responsible for loading and relocating the image.  This allows kexec
to be used by pv-ops kernels and should also allow kexec to be used
from an HVM or PVH privileged domain.

The first patch is a simple clean-up.

Patch 2 introduces the new ABI.

Patches 3 and 4 nearly completely reimplement the kexec load, unload
and exec sub-ops.  The old load_v1 sub-op is then implemented on top
of the new code.

Patch 5 calls the kexec image when dom0 crashes.  This avoids having
to alter dom0 kernels to make an exec sub-op call on crash -- a
SHUTDOWN_crash by dom0 will trigger the kexec.

Patches 6 and 7 add the libxc API for the kexec calls.  These have
already been acked by Ian Campbell.

Patch 8 adds a link time check for the size of the relocate code.

Patch 9 adds myself as the maintainer for kexec in Xen.

The required patch series for kexec-tools will be posted shortly and
is available from the xen-v5 branch of:

http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summary

Changes in v8:

- Use #defines for compat ABI structures.
- Tweak link time check for kexec_reloc.

Changes in v7:

- No longer use GUEST_HANDLE_64(), get a uniform ABI by using unions
  and explicit padding.
- Only map the segments and not all of RAM.
- Add a mechanism to create mappings for use by the exec'd image (a
  segment with a NULL buf handle).
- Fix a bug where a crash image's code page would be placed at machine
  address 0 (instead of inside the crash region).

Changes in v6:

- Fix double free in KEXEC_load_v1 failure path.
- Only copy the relocation code and not the whole page.
- Add myself as the kexec maintainer.

Changes in v5 (not posted to the list):

- _rsvd -> _pad in one of the public ABI structures.
- Fix bug where trailing pages were not zeroed. This fixes loading a
  64-bit Linux kernel using a more recent version of kexec-tools.
- Check the relocation code fits into a page at link time.

Changes in v4:

- Use paddr_t and page_to_maddr() etc. for portability.
- Add explicit padding to hypercall structures where required.
- Minor cleanup of the kexec_reloc assembly.
- Print a message before exec'ing a crash image.
- Style fixes (tabs, trailing whitespace) and typos.
- Fix a bug where using the V1 interface and unloading an image may crash.

Changes in v3:

- Provide old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3
- Adjust new struct xen_kexec_load to avoid unnecessary padding.
- Use domheap pages for the image and control pages.
- Remove the DBG() macros from the reloc code.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops David Vrabel
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

FIX_EFI_MPF was the same as FIX_KEXEC_BASE_0, which is going away, so
give FIX_EFI_MPF its own fixmap entry.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 xen/arch/x86/mpparse.c       |    2 --
 xen/include/asm-x86/fixmap.h |    1 +
 2 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
index 97d34bc..3753704 100644
--- a/xen/arch/x86/mpparse.c
+++ b/xen/arch/x86/mpparse.c
@@ -538,8 +538,6 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
 	}
 }
 
-#define FIX_EFI_MPF FIX_KEXEC_BASE_0
-
 static __init void efi_unmap_mpf(void)
 {
 	if (efi_enabled)
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index d850be4..8b4266d 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -66,6 +66,7 @@ enum fixed_addresses {
     FIX_APEI_RANGE_BASE,
     FIX_APEI_RANGE_END = FIX_APEI_RANGE_BASE + FIX_APEI_RANGE_MAX -1,
     FIX_IGD_MMIO,
+    FIX_EFI_MPF,
     __end_of_fixed_addresses
 };
 
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
  2013-09-20 13:10 ` [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-24 15:23   ` Jan Beulich
  2013-09-20 13:10 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload
sub-ops to the kexec hypercall.  These new sub-ops allow a privileged
guest to provide the image data to be loaded into Xen memory or into
the crash region, instead of guests loading the image data themselves
and providing the relocation code and metadata.

The old interface is still provided to guests requesting an interface
version prior to 4.4.

Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.
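
As an illustration of the new ABI (not part of this patch), a caller
might load a single-segment image along these lines; the hypercall
wrapper name and the image_buf/image_size/dest/dest_size/entry
variables are assumptions for the sketch only:

    xen_kexec_segment_t seg = {
        .buf_size   = image_size,
        .dest_maddr = dest,        /* page-aligned destination */
        .dest_size  = dest_size,   /* >= buf_size, page-aligned */
    };
    xen_kexec_load_t load = {
        .type        = KEXEC_TYPE_DEFAULT,
        .arch        = EM_X86_64,
        .nr_segments = 1,
        .entry_maddr = entry,
    };

    set_xen_guest_handle(seg.buf.h, image_buf);
    set_xen_guest_handle(load.segments.h, &seg);

    /* Whatever kexec_op hypercall wrapper the caller has available. */
    rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load, &load);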

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/kexec.c              |   12 +++---
 xen/include/public/kexec.h      |   78 +++++++++++++++++++++++++++++++++++++--
 xen/include/public/xen-compat.h |    2 +-
 3 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7cd151f..7b23df0 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -734,7 +734,7 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
+static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
 {
     xen_kexec_image_t *image;
     int base, bit, pos;
@@ -781,7 +781,7 @@ static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
 
 static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_load_t load;
+    xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
@@ -793,8 +793,8 @@ static int kexec_load_unload_compat(unsigned long op,
                                     XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
-    compat_kexec_load_t compat_load;
-    xen_kexec_load_t load;
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&compat_load, uarg, 1)) )
         return -EFAULT;
@@ -866,8 +866,8 @@ static int do_kexec_op_internal(unsigned long op,
         else
                 ret = kexec_get_range(uarg);
         break;
-    case KEXEC_CMD_kexec_load:
-    case KEXEC_CMD_kexec_unload:
+    case KEXEC_CMD_kexec_load_v1:
+    case KEXEC_CMD_kexec_unload_v1:
         spin_lock_irqsave(&kexec_lock, flags);
         if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
         {
diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h
index 36409ff..cb36c20 100644
--- a/xen/include/public/kexec.h
+++ b/xen/include/public/kexec.h
@@ -116,12 +116,12 @@ typedef struct xen_kexec_exec {
  * type  == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
  * image == relocation information for kexec (ignored for unload) [in]
  */
-#define KEXEC_CMD_kexec_load            1
-#define KEXEC_CMD_kexec_unload          2
-typedef struct xen_kexec_load {
+#define KEXEC_CMD_kexec_load_v1         1 /* obsolete since 0x00040300 */
+#define KEXEC_CMD_kexec_unload_v1       2 /* obsolete since 0x00040300 */
+typedef struct xen_kexec_load_v1 {
     int type;
     xen_kexec_image_t image;
-} xen_kexec_load_t;
+} xen_kexec_load_v1_t;
 
 #define KEXEC_RANGE_MA_CRASH      0 /* machine address and size of crash area */
 #define KEXEC_RANGE_MA_XEN        1 /* machine address and size of Xen itself */
@@ -152,6 +152,76 @@ typedef struct xen_kexec_range {
     unsigned long start;
 } xen_kexec_range_t;
 
+#if __XEN_INTERFACE_VERSION__ >= 0x00040400
+/*
+ * A contiguous chunk of a kexec image and its destination machine
+ * address.
+ */
+typedef struct xen_kexec_segment {
+    union {
+        XEN_GUEST_HANDLE(const_void) h;
+        uint64_t _pad;
+    } buf;
+    uint64_t buf_size;
+    uint64_t dest_maddr;
+    uint64_t dest_size;
+} xen_kexec_segment_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t);
+
+/*
+ * Load a kexec image into memory.
+ *
+ * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM.
+ * The image is relocated prior to being executed.
+ *
+ * For KEXEC_TYPE_CRASH images, each segment of the image must reside
+ * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and
+ * the entry point must be within the image. The caller is responsible
+ * for ensuring that multiple images do not overlap.
+ *
+ * All image segments will be loaded to their destination machine
+ * addresses prior to being executed.  The trailing portion of any
+ * segments with a source buffer (from dest_maddr + buf_size to
+ * dest_maddr + dest_size) will be zeroed.
+ *
+ * Segments with no source buffer will be accessible to the image when
+ * it is executed.
+ */
+
+#define KEXEC_CMD_kexec_load 4
+typedef struct xen_kexec_load {
+    uint8_t  type;        /* One of KEXEC_TYPE_* */
+    uint8_t  _pad;
+    uint16_t arch;        /* ELF machine type (EM_*). */
+    uint32_t nr_segments;
+    union {
+        XEN_GUEST_HANDLE(xen_kexec_segment_t) h;
+        uint64_t _pad;
+    } segments;
+    uint64_t entry_maddr; /* image entry point machine address. */
+} xen_kexec_load_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t);
+
+/*
+ * Unload a kexec image.
+ *
+ * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH.
+ */
+#define KEXEC_CMD_kexec_unload 5
+typedef struct xen_kexec_unload {
+    uint8_t type;
+} xen_kexec_unload_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t);
+
+#else /* __XEN_INTERFACE_VERSION__ < 0x00040400 */
+
+#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1
+#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1
+#define xen_kexec_load xen_kexec_load_v1
+#define xen_kexec_load_t xen_kexec_load_v1_t
+
+#endif
+
 #endif /* _XEN_PUBLIC_KEXEC_H */
 
 /*
diff --git a/xen/include/public/xen-compat.h b/xen/include/public/xen-compat.h
index 69141c4..3eb80a0 100644
--- a/xen/include/public/xen-compat.h
+++ b/xen/include/public/xen-compat.h
@@ -27,7 +27,7 @@
 #ifndef __XEN_PUBLIC_XEN_COMPAT_H__
 #define __XEN_PUBLIC_XEN_COMPAT_H__
 
-#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040300
+#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400
 
 #if defined(__XEN__) || defined(__XEN_TOOLS__)
 /* Xen is built with matching headers and implements the latest interface. */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/9] kexec: add infrastructure for handling kexec images
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
  2013-09-20 13:10 ` [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry David Vrabel
  2013-09-20 13:10 ` [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 4/9] kexec: extend hypercall with improved load/unload ops David Vrabel
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add the code needed to handle and load kexec images into Xen memory or
into the crash region.  This is needed for the new KEXEC_CMD_kexec_load
and KEXEC_CMD_kexec_unload hypercall sub-ops.

Much of this code is derived from the Linux kernel.
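
For orientation, a minimal sketch of how a hypercall handler (added in
a later patch) would drive this infrastructure; the wrapper function
and its arguments are illustrative only, not code from this series:

    static int example_load(uint8_t type, uint16_t arch, uint64_t entry,
                            uint32_t nr, xen_kexec_segment_t *segments)
    {
        struct kexec_image *image;
        int rc;

        rc = kimage_alloc(&image, type, arch, entry, nr, segments);
        if ( rc < 0 )
            return rc;

        /* Copy the segment data and build the IND_* entry list. */
        rc = kimage_load_segments(image);
        if ( rc < 0 )
            kimage_free(image);   /* frees control pages and segments */

        return rc;
    }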

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/Makefile      |    1 +
 xen/common/kimage.c      |  817 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/kimage.h |   62 ++++
 3 files changed, 880 insertions(+), 0 deletions(-)
 create mode 100644 xen/common/kimage.c
 create mode 100644 xen/include/xen/kimage.h

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 5486140..3cb456a 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -11,6 +11,7 @@ obj-y += irq.o
 obj-y += kernel.o
 obj-y += keyhandler.o
 obj-$(HAS_KEXEC) += kexec.o
+obj-$(HAS_KEXEC) += kimage.o
 obj-y += lib.o
 obj-y += memory.o
 obj-y += multicall.o
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
new file mode 100644
index 0000000..9783e5a
--- /dev/null
+++ b/xen/common/kimage.c
@@ -0,0 +1,817 @@
+/*
+ * Kexec Image
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Derived from kernel/kexec.c from Linux:
+ *
+ *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+#include <xen/errno.h>
+#include <xen/spinlock.h>
+#include <xen/guest_access.h>
+#include <xen/mm.h>
+#include <xen/kexec.h>
+#include <xen/kimage.h>
+
+#include <asm/page.h>
+
+/*
+ * When kexec transitions to the new kernel there is a one-to-one
+ * mapping between physical and virtual addresses.  On processors
+ * where you can disable the MMU this is trivial, and easy.  For
+ * others it is still a simple predictable page table to setup.
+ *
+ * The code for the transition from the current kernel to the new
+ * kernel is placed in the page-size control_code_buffer.  This memory
+ * must be identity mapped in the transition from virtual to physical
+ * addresses.
+ *
+ * The assembly stub in the control code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages.  As this data
+ * structure is not used in the context of the current OS, it must
+ * be self-contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it).  The end product of this is that most of the
+ * physical address space, and most of RAM can be used.
+ *
+ * Future directions include:
+ *  - allocating a page table with the control code buffer identity
+ *    mapped, to simplify machine_kexec and make kexec_on_panic more
+ *    reliable.
+ */
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address..., for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+/*
+ * Offset of the last entry in an indirection page.
+ */
+#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
+
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start, paddr_t end);
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t dest);
+
+static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
+{
+    struct page_info *page;
+
+    page = alloc_domheap_page(NULL, memflags);
+    if ( !page )
+        return NULL;
+
+    clear_domain_page(page_to_mfn(page));
+
+    return page;
+}
+
+static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
+                           unsigned long nr_segments,
+                           xen_kexec_segment_t *segments, uint8_t type)
+{
+    struct kexec_image *image;
+    unsigned long i;
+    int result;
+
+    /* Allocate a controlling structure */
+    result = -ENOMEM;
+    image = xzalloc(typeof(*image));
+    if ( !image )
+        goto out;
+
+    image->entry_maddr = entry;
+    image->type = type;
+    image->nr_segments = nr_segments;
+    image->segments = segments;
+
+    image->next_crash_page = kexec_crash_area.start;
+
+    INIT_PAGE_LIST_HEAD(&image->control_pages);
+    INIT_PAGE_LIST_HEAD(&image->dest_pages);
+    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
+
+    /*
+     * Verify we have good destination addresses.  The caller is
+     * responsible for making certain we don't attempt to load the new
+     * image into invalid or reserved areas of RAM.  This just
+     * verifies it is an address we can use.
+     *
+     * Since the kernel does everything in page size chunks ensure the
+     * destination addresses are page aligned.  Too many special cases
+     * crop up when we don't do this.  The most insidious is getting
+     * overlapping destination addresses simply because addresses are
+     * changed to page size granularity.
+     */
+    result = -EADDRNOTAVAIL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
+            goto out;
+    }
+
+    /*
+     * Verify our destination addresses do not overlap.  If we allowed
+     * overlapping destination addresses through very weird things can
+     * happen with no easy explanation as one segment stops on
+     * another.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+        unsigned long j;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        for (j = 0; j < i; j++ )
+        {
+            paddr_t pstart, pend;
+            pstart = image->segments[j].dest_maddr;
+            pend   = pstart + image->segments[j].dest_size;
+            /* Do the segments overlap? */
+            if ( (mend > pstart) && (mstart < pend) )
+                goto out;
+        }
+    }
+
+    /*
+     * Ensure our buffer sizes are strictly less than our memory
+     * sizes.  This should always be the case, and it is easier to
+     * check up front than to be surprised later on.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        if ( image->segments[i].buf_size > image->segments[i].dest_size )
+            goto out;
+    }
+
+    /* 
+     * Page for the relocation code must still be accessible after the
+     * processor has switched to 32-bit mode.
+     */
+    result = -ENOMEM;
+    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
+    if ( !image->control_code_page )
+        goto out;
+
+    /* Add an empty indirection page. */
+    image->entry_page = kimage_alloc_control_page(image, 0);
+    if ( !image->entry_page )
+        goto out;
+
+    image->head = page_to_maddr(image->entry_page);
+    image->next_entry = 0;
+
+    result = 0;
+out:
+    if ( result == 0 )
+        *rimage = image;
+    else
+        kimage_free(image);
+
+    return result;
+
+}
+
+static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
+                               unsigned long nr_segments,
+                               xen_kexec_segment_t *segments)
+{
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_DEFAULT);
+}
+
+static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
+                              unsigned long nr_segments,
+                              xen_kexec_segment_t *segments)
+{
+    unsigned long i;
+    int result;
+
+    /* Verify we have a valid entry point */
+    if ( (entry < kexec_crash_area.start)
+         || (entry > kexec_crash_area.start + kexec_crash_area.size))
+        return -EADDRNOTAVAIL;
+
+    /*
+     * Verify we have good destination addresses.  Normally
+     * the caller is responsible for making certain we don't
+     * attempt to load the new image into invalid or reserved
+     * areas of RAM.  But crash kernels are preloaded into a
+     * reserved area of ram.  We must ensure the addresses
+     * are in the reserved area otherwise preloading the
+     * kernel could corrupt things.
+     */
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        if ( guest_handle_is_null(segments[i].buf.h) )
+            continue;
+
+        mstart = segments[i].dest_maddr;
+        mend = mstart + segments[i].dest_size - 1;
+        /* Ensure we are within the crash kernel limits. */
+        if ( (mstart < kexec_crash_area.start )
+             || (mend > kexec_crash_area.start + kexec_crash_area.size))
+            return -EADDRNOTAVAIL;
+    }
+
+    /* Allocate and initialize a controlling structure. */
+    result = do_kimage_alloc(rimage, entry, nr_segments, segments,
+                             KEXEC_TYPE_CRASH);
+    if ( result )
+        return result;
+
+    return 0;
+}
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start,
+                                       paddr_t end)
+{
+    unsigned long i;
+
+    for ( i = 0; i < image->nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend = mstart + image->segments[i].dest_size;
+        if ( (end > mstart) && (start < mend) )
+            return 1;
+    }
+
+    return 0;
+}
+
+static void kimage_free_page_list(struct page_list_head *list)
+{
+    struct page_info *page, *next;
+
+    page_list_for_each_safe(page, next, list)
+    {
+        page_list_del(page, list);
+        free_domheap_page(page);
+    }
+}
+
+static struct page_info *kimage_alloc_normal_control_page(
+    struct kexec_image *image, unsigned memflags)
+{
+    /*
+     * Control pages are special, they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * The only case where we really need more than one of these are
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * At worst this runs in O(N) of the image size.
+     */
+    struct page_list_head extra_pages;
+    struct page_info *page = NULL;
+
+    INIT_PAGE_LIST_HEAD(&extra_pages);
+
+    /*
+     * Loop while I can allocate a page and the page allocated is a
+     * destination page.
+     */
+    do {
+        unsigned long mfn, emfn;
+        paddr_t addr, eaddr;
+
+        page = kimage_alloc_zeroed_page(memflags);
+        if ( !page )
+            break;
+        mfn   = page_to_mfn(page);
+        emfn  = mfn + 1;
+        addr  = page_to_maddr(page);
+        eaddr = addr + PAGE_SIZE;
+        if ( kimage_is_destination_range(image, addr, eaddr) )
+        {
+            page_list_add(page, &extra_pages);
+            page = NULL;
+        }
+    } while ( !page );
+
+    if ( page )
+    {
+        /* Remember the allocated page... */
+        page_list_add(page, &image->control_pages);
+
+        /*
+         * Because the page is already in its destination location we
+         * will never allocate another page at that address.
+         * Therefore kimage_alloc_page will not return it (again) and
+         * we don't need to give it an entry in image->segments[].
+         */
+    }
+    /*
+     * Deal with the destination pages I have inadvertently allocated.
+     *
+     * Ideally I would convert multi-page allocations into single page
+     * allocations, and add everything to image->dest_pages.
+     *
+     * For now it is simpler to just free the pages.
+     */
+    kimage_free_page_list(&extra_pages);
+
+    return page;
+}
+
+static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
+{
+    /*
+     * Control pages are special, they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * Control pages are also the only pages we must allocate when
+     * loading a crash kernel.  All of the other pages are specified
+     * by the segments and we just memcpy into them directly.
+     *
+     * The only case where we really need more than one of these are
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * Given the low demand this implements a very simple allocator
+     * that finds the first hole of the appropriate size in the
+     * reserved memory region, and allocates all of the memory up to
+     * and including the hole.
+     */
+    paddr_t hole_start, hole_end;
+    struct page_info *page = NULL;
+
+    hole_start = PAGE_ALIGN(image->next_crash_page);
+    hole_end   = hole_start + PAGE_SIZE;
+    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
+    {
+        unsigned long i;
+
+        /* See if I overlap any of the segments. */
+        for ( i = 0; i < image->nr_segments; i++ )
+        {
+            paddr_t mstart, mend;
+
+            mstart = image->segments[i].dest_maddr;
+            mend   = mstart + image->segments[i].dest_size;
+            if ( (hole_end > mstart) && (hole_start < mend) )
+            {
+                /* Advance the hole to the end of the segment. */
+                hole_start = PAGE_ALIGN(mend);
+                hole_end   = hole_start + PAGE_SIZE;
+                break;
+            }
+        }
+        /* If I don't overlap any segments I have found my hole! */
+        if ( i == image->nr_segments )
+        {
+            page = maddr_to_page(hole_start);
+            break;
+        }
+    }
+    if ( page )
+    {
+        image->next_crash_page = hole_end;
+        clear_domain_page(page_to_mfn(page));
+    }
+
+    return page;
+}
+
+
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags)
+{
+    struct page_info *pages = NULL;
+
+    switch ( image->type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        pages = kimage_alloc_normal_control_page(image, memflags);
+        break;
+    case KEXEC_TYPE_CRASH:
+        pages = kimage_alloc_crash_control_page(image);
+        break;
+    }
+    return pages;
+}
+
+static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
+{
+    kimage_entry_t *entries;
+
+    if ( image->next_entry == KIMAGE_LAST_ENTRY )
+    {
+        struct page_info *page;
+
+        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
+        if ( !page )
+            return -ENOMEM;
+
+        entries = __map_domain_page(image->entry_page);
+        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
+        unmap_domain_page(entries);
+
+        image->entry_page = page;
+        image->next_entry = 0;
+    }
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = entry;
+    image->next_entry++;
+    unmap_domain_page(entries);
+
+    return 0;
+}
+
+static int kimage_set_destination(struct kexec_image *image,
+                                  paddr_t destination)
+{
+    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
+}
+
+
+static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
+{
+    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
+}
+
+
+static void kimage_free_extra_pages(struct kexec_image *image)
+{
+    kimage_free_page_list(&image->dest_pages);
+    kimage_free_page_list(&image->unusable_pages);
+}
+
+static void kimage_terminate(struct kexec_image *image)
+{
+    kimage_entry_t *entries;
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = IND_DONE;
+    unmap_domain_page(entries);
+}
+
+/*
+ * Iterate over all the entries in the indirection pages.
+ *
+ * Call unmap_domain_page(ptr) after the loop exits.
+ */
+#define for_each_kimage_entry(image, ptr, entry)                        \
+    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
+          (entry = *ptr) && !(entry & IND_DONE);                        \
+          ptr = (entry & IND_INDIRECTION) ?                             \
+              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
+              : ptr + 1 )
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+    struct page_info *page;
+
+    page = mfn_to_page(entry >> PAGE_SHIFT);
+    free_domheap_page(page);
+}
+
+void kimage_free(struct kexec_image *image)
+{
+    kimage_entry_t *ptr, entry;
+    kimage_entry_t ind = 0;
+
+    if ( !image )
+        return;
+
+    kimage_free_extra_pages(image);
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_INDIRECTION )
+        {
+            /* Free the previous indirection page */
+            if ( ind & IND_INDIRECTION )
+                kimage_free_entry(ind);
+            /* Save this indirection page until we are done with it. */
+            ind = entry;
+        }
+        else if ( entry & IND_SOURCE )
+            kimage_free_entry(entry);
+    }
+    unmap_domain_page(ptr);
+
+    /* Free the final indirection page. */
+    if ( ind & IND_INDIRECTION )
+        kimage_free_entry(ind);
+
+    /* Free the kexec control pages. */
+    kimage_free_page_list(&image->control_pages);
+    xfree(image->segments);
+    xfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
+                                       paddr_t maddr)
+{
+    kimage_entry_t *ptr, entry;
+    unsigned long destination = 0;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_DESTINATION )
+            destination = entry & PAGE_MASK;
+        else if ( entry & IND_SOURCE )
+        {
+            if ( maddr == destination )
+                return ptr;
+            destination += PAGE_SIZE;
+        }
+    }
+    unmap_domain_page(ptr);
+
+    return NULL;
+}
+
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t destination)
+{
+    /*
+     * Here we implement safeguards to ensure that a source page is
+     * not copied to its destination page before the data on the
+     * destination page is no longer useful.
+     *
+     * To do this we maintain the invariant that a source page is
+     * either its own destination page, or it is not a destination
+     * page at all.
+     *
+     * That is slightly stronger than required, but the proof that no
+     * problems will occur is trivial, and the implementation is
+     * simple to verify.
+     *
+     * When allocating all pages normally this algorithm will run in
+     * O(N) time, but in the worst case it will run in O(N^2) time.
+     * If the runtime is a problem the data structures can be fixed.
+     */
+    struct page_info *page;
+    paddr_t addr;
+
+    /*
+     * Walk through the list of destination pages, and see if I have a
+     * match.
+     */
+    page_list_for_each(page, &image->dest_pages)
+    {
+        addr = page_to_maddr(page);
+        if ( addr == destination )
+        {
+            page_list_del(page, &image->dest_pages);
+            return page;
+        }
+    }
+    page = NULL;
+    for (;;)
+    {
+        kimage_entry_t *old;
+
+        /* Allocate a page, if we run out of memory give up. */
+        page = kimage_alloc_zeroed_page(0);
+        if ( !page )
+            return NULL;
+        addr = page_to_maddr(page);
+
+        /* If it is the destination page we want, use it. */
+        if ( addr == destination )
+            break;
+
+        /* If the page is not a destination page use it. */
+        if ( !kimage_is_destination_range(image, addr,
+                                          addr + PAGE_SIZE) )
+            break;
+
+        /*
+         * I know that the page is someone's destination page.  See if
+         * there is already a source page for this destination page.
+         * And if so swap the source pages.
+         */
+        old = kimage_dst_used(image, addr);
+        if ( old )
+        {
+            /* If so move it. */
+            unsigned long old_mfn = *old >> PAGE_SHIFT;
+            unsigned long mfn = addr >> PAGE_SHIFT;
+
+            copy_domain_page(mfn, old_mfn);
+            clear_domain_page(old_mfn);
+            *old = (addr & ~PAGE_MASK) | IND_SOURCE;
+            unmap_domain_page(old);
+
+            page = mfn_to_page(old_mfn);
+            break;
+        }
+        else
+        {
+            /*
+             * Place the page on the destination list; I will use it
+             * later.
+             */
+            page_list_add(page, &image->dest_pages);
+        }
+    }
+    return page;
+}
+
+static int kimage_load_normal_segment(struct kexec_image *image,
+                                      xen_kexec_segment_t *segment)
+{
+    unsigned long to_copy;
+    unsigned long src_offset;
+    paddr_t dest, end;
+    int ret;
+
+    to_copy = segment->buf_size;
+    src_offset = 0;
+    dest = segment->dest_maddr;
+
+    ret = kimage_set_destination(image, dest);
+    if ( ret < 0 )
+        return ret;
+
+    while ( to_copy )
+    {
+        unsigned long dest_mfn;
+        struct page_info *page;
+        void *dest_va;
+        size_t size;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        size = min_t(unsigned long, PAGE_SIZE, to_copy);
+
+        page = kimage_alloc_page(image, dest);
+        if ( !page )
+            return -ENOMEM;
+        ret = kimage_add_page(image, page_to_maddr(page));
+        if ( ret < 0 )
+            return ret;
+
+        dest_va = __map_domain_page(page);
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        to_copy -= size;
+        src_offset += size;
+        dest += PAGE_SIZE;
+    }
+
+    /* Remainder of the destination should be zeroed. */
+    end = segment->dest_maddr + segment->dest_size;
+    for ( ; dest < end; dest += PAGE_SIZE )
+        kimage_add_entry(image, IND_ZERO);
+
+    return 0;
+}
+
+static int kimage_load_crash_segment(struct kexec_image *image,
+                                     xen_kexec_segment_t *segment)
+{
+    /*
+     * For crash dump kernels we simply copy the data from the guest
+     * to its destination.
+     */
+    paddr_t dest;
+    unsigned long sbytes, dbytes;
+    int ret = 0;
+    unsigned long src_offset = 0;
+
+    sbytes = segment->buf_size;
+    dbytes = segment->dest_size;
+    dest = segment->dest_maddr;
+
+    while ( dbytes )
+    {
+        unsigned long dest_mfn;
+        void *dest_va;
+        size_t schunk, dchunk;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        dchunk = PAGE_SIZE;
+        schunk = min(dchunk, sbytes);
+
+        dest_va = map_domain_page(dest_mfn);
+        if ( !dest_va )
+            return -EINVAL;
+
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
+        memset(dest_va + schunk, 0, dchunk - schunk);
+
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        dbytes -= dchunk;
+        sbytes -= schunk;
+        dest += dchunk;
+        src_offset += schunk;
+    }
+
+    return 0;
+}
+
+static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
+{
+    int result = -ENOMEM;
+
+    if ( !guest_handle_is_null(segment->buf.h) )
+    {
+        switch ( image->type )
+        {
+        case KEXEC_TYPE_DEFAULT:
+            result = kimage_load_normal_segment(image, segment);
+            break;
+        case KEXEC_TYPE_CRASH:
+            result = kimage_load_crash_segment(image, segment);
+            break;
+        }
+    }
+
+    return result;
+}
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment)
+{
+    int result;
+
+    switch( type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    case KEXEC_TYPE_CRASH:
+        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    default:
+        result = -EINVAL;
+        break;
+    }
+    if ( result < 0 )
+        return result;
+
+    (*rimage)->arch = arch;
+
+    return result;
+}
+
+int kimage_load_segments(struct kexec_image *image)
+{
+    int s;
+    int result;
+
+    for ( s = 0; s < image->nr_segments; s++ ) {
+        result = kimage_load_segment(image, &image->segments[s]);
+        if ( result < 0 )
+            return result;
+    }
+    kimage_terminate(image);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
new file mode 100644
index 0000000..0ebd37a
--- /dev/null
+++ b/xen/include/xen/kimage.h
@@ -0,0 +1,62 @@
+#ifndef __XEN_KIMAGE_H__
+#define __XEN_KIMAGE_H__
+
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+#define IND_ZERO        0x10
+
+#ifndef __ASSEMBLY__
+
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <public/kexec.h>
+
+#define KEXEC_SEGMENT_MAX 16
+
+typedef paddr_t kimage_entry_t;
+
+struct kexec_image {
+    uint8_t type;
+    uint16_t arch;
+    uint64_t entry_maddr;
+    uint32_t nr_segments;
+    xen_kexec_segment_t *segments;
+
+    kimage_entry_t head;
+    struct page_info *entry_page;
+    unsigned next_entry;
+
+    struct page_info *control_code_page;
+    struct page_info *aux_page;
+
+    struct page_list_head control_pages;
+    struct page_list_head dest_pages;
+    struct page_list_head unusable_pages;
+
+    /* Address of next control page to allocate for crash kernels. */
+    paddr_t next_crash_page;
+};
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment);
+void kimage_free(struct kexec_image *image);
+int kimage_load_segments(struct kexec_image *image);
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __XEN_KIMAGE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (2 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-10-04 21:23   ` Daniel Kiper
  2013-09-20 13:10 ` [PATCH 5/9] xen: kexec crash image when dom0 crashes David Vrabel
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image, which may be a 32-bit subarch
of the hypervisor's architecture (e.g., an EM_386 image on an
EM_X86_64 hypervisor).

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
implemented on top of the new infrastructure.
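
For explanation only, the relocate_pages loop in kexec_reloc() walks
the IND_* entry list built by kimage_load_segments() roughly like the
following C; next_entry_va(), copy_page() and zero_page() are
placeholder helpers, not functions from this series:

    void relocate_pages_sketch(kimage_entry_t *entry)
    {
        paddr_t dest = 0;
        kimage_entry_t e;

        while ( !((e = *entry) & IND_DONE) )
        {
            if ( e & IND_INDIRECTION )
                /* Continue with the chained indirection page. */
                entry = next_entry_va(e & PAGE_MASK);
            else
            {
                if ( e & IND_DESTINATION )
                    dest = e & PAGE_MASK;           /* new copy target */
                else if ( e & IND_SOURCE )
                {
                    copy_page(dest, e & PAGE_MASK); /* copy one page */
                    dest += PAGE_SIZE;
                }
                else if ( e & IND_ZERO )
                {
                    zero_page(dest);                /* zero trailing page */
                    dest += PAGE_SIZE;
                }
                entry++;
            }
        }
    }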

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++-------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 -----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  208 ++++++++++++++++++
 xen/common/kexec.c                  |  393 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  123 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 811 insertions(+), 335 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..41dd27b
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,208 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr */
+        /* %r8 - flags */
+
+        movq %rdx, %rbx
+
+        /* Setup stack. */
+        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq *%rax
+
+identity_mapped:
+        pushq %rcx
+        pushq %rbx
+        pushq %rsi
+        pushq %rdi
+
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq %rbx, %rdi
+        call relocate_pages
+
+        popq %rdi
+        popq %rsi
+        popq %rbx
+        popq %rcx
+
+        /* Need to switch to 32-bit mode? */
+        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq *%rcx
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax
+        movq %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal (compatibility_mode - kexec_reloc)(%edi), %eax
+        movl %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        cld
+        movq    %rdi, %rcx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+        jmp     is_dest
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl $0x0008, %eax
+        movl %eax, %ds
+        movl %eax, %es
+        movl %eax, %fs
+        movl %eax, %gs
+        movl %eax, %ss
+
+        movl %ecx, %ebp
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl %cr0, %eax
+        andl $~X86_CR0_PG, %eax
+        movl %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..5548c37 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -311,14 +312,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +735,261 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
 {
-    xen_kexec_image_t *image;
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
+
+static int kexec_segments_add_segment(
+    unsigned *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    int n = *nr_segments;
 
-        crash_save_vmcoreinfo();
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            if ( page == NULL )
+                return -ENOMEM;
+            entry = page;
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  This will only result in the
+     * image being potentially unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
 
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1008,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1130,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 9783e5a..6bee9cf 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,14 +175,22 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        return result;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        return result;
 
     image->head = page_to_maddr(image->entry_page);
-    image->next_entry = 0;
 
     result = 0;
 out:
@@ -591,7 +599,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -643,6 +651,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -749,6 +759,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -763,6 +774,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -806,6 +825,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/9] xen: kexec crash image when dom0 crashes
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (3 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 4/9] kexec: extend hypercall with improved load/unload ops David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 6/9] libxc: add hypercall buffer arrays David Vrabel
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 xen/common/kexec.c    |    2 ++
 xen/common/shutdown.c |    3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 5548c37..fe0424e 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -307,6 +307,8 @@ void kexec_crash(void)
     if ( !test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags) )
         return;
 
+    printk("Executing crash image\n");
+
     kexecing = TRUE;
 
     kexec_common_shutdown();
diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c
index 20f04b0..9bccd34 100644
--- a/xen/common/shutdown.c
+++ b/xen/common/shutdown.c
@@ -47,6 +47,9 @@ void dom0_shutdown(u8 reason)
     {
         debugger_trap_immediate();
         printk("Domain 0 crashed: ");
+#ifdef CONFIG_KEXEC
+        kexec_crash();
+#endif
         maybe_reboot();
         break; /* not reached */
     }
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/9] libxc: add hypercall buffer arrays
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (4 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 5/9] xen: kexec crash image when dom0 crashes David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 7/9] libxc: add API for kexec hypercall David Vrabel
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Hypercall buffer arrays are used when a hypercall takes a variable
length array of buffers.
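
As a rough usage sketch only (not taken from this series: example_bounce(),
nr_bufs, buf_size and src are made-up names, and the hypercall itself is
elided), a caller bouncing a variable number of buffers might look like:

    /* Sketch only: this function and its arguments are hypothetical. */
    static int example_bounce(xc_interface *xch, uint8_t **src,
                              unsigned nr_bufs, size_t buf_size)
    {
        xc_hypercall_buffer_array_t *array;
        unsigned i;
        int rc = -1;

        /* One slot per buffer that will be passed to the hypercall. */
        array = xc_hypercall_buffer_array_create(xch, nr_bufs);
        if ( array == NULL )
            return -1;

        for ( i = 0; i < nr_bufs; i++ )
        {
            DECLARE_HYPERCALL_BUFFER(uint8_t, buf);

            /* Allocate slot i and fill it with the caller's data. */
            buf = xc_hypercall_buffer_array_alloc(xch, array, i, buf, buf_size);
            if ( buf == NULL )
                goto out;
            memcpy(buf, src[i], buf_size);

            /* A guest handle for slot i would typically be stashed in the
             * hypercall argument here. */
        }

        /* ... issue the hypercall ... */
        rc = 0;

    out:
        /* Frees every allocated buffer and then the array itself. */
        xc_hypercall_buffer_array_destroy(xch, array);
        return rc;
    }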

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 tools/libxc/xc_hcall_buf.c |   73 ++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h      |   27 ++++++++++++++++
 2 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/tools/libxc/xc_hcall_buf.c b/tools/libxc/xc_hcall_buf.c
index c354677..e762a93 100644
--- a/tools/libxc/xc_hcall_buf.c
+++ b/tools/libxc/xc_hcall_buf.c
@@ -228,6 +228,79 @@ void xc__hypercall_bounce_post(xc_interface *xch, xc_hypercall_buffer_t *b)
     xc__hypercall_buffer_free(xch, b);
 }
 
+struct xc_hypercall_buffer_array {
+    unsigned max_bufs;
+    xc_hypercall_buffer_t *bufs;
+};
+
+xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch,
+                                                              unsigned n)
+{
+    xc_hypercall_buffer_array_t *array;
+    xc_hypercall_buffer_t *bufs = NULL;
+
+    array = malloc(sizeof(*array));
+    if ( array == NULL )
+        goto error;
+
+    bufs = calloc(n, sizeof(*bufs));
+    if ( bufs == NULL )
+        goto error;
+
+    array->max_bufs = n;
+    array->bufs     = bufs;
+
+    return array;
+
+error:
+    free(bufs);
+    free(array);
+    return NULL;
+}
+
+void *xc__hypercall_buffer_array_alloc(xc_interface *xch,
+                                       xc_hypercall_buffer_array_t *array,
+                                       unsigned index,
+                                       xc_hypercall_buffer_t *hbuf,
+                                       size_t size)
+{
+    void *buf;
+
+    if ( index >= array->max_bufs || array->bufs[index].hbuf )
+        abort();
+
+    buf = xc__hypercall_buffer_alloc(xch, hbuf, size);
+    if ( buf )
+        array->bufs[index] = *hbuf;
+    return buf;
+}
+
+void *xc__hypercall_buffer_array_get(xc_interface *xch,
+                                     xc_hypercall_buffer_array_t *array,
+                                     unsigned index,
+                                     xc_hypercall_buffer_t *hbuf)
+{
+    if ( index >= array->max_bufs || array->bufs[index].hbuf == NULL )
+        abort();
+
+    *hbuf = array->bufs[index];
+    return array->bufs[index].hbuf;
+}
+
+void xc_hypercall_buffer_array_destroy(xc_interface *xc,
+                                       xc_hypercall_buffer_array_t *array)
+{
+    unsigned i;
+
+    if ( array == NULL )
+        return;
+
+    for (i = 0; i < array->max_bufs; i++ )
+        xc__hypercall_buffer_free(xc, &array->bufs[i]);
+    free(array->bufs);
+    free(array);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 58d51f3..1119943 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -321,6 +321,33 @@ void xc__hypercall_buffer_free_pages(xc_interface *xch, xc_hypercall_buffer_t *b
 #define xc_hypercall_buffer_free_pages(_xch, _name, _nr) xc__hypercall_buffer_free_pages(_xch, HYPERCALL_BUFFER(_name), _nr)
 
 /*
+ * Array of hypercall buffers.
+ *
+ * Create an array with xc_hypercall_buffer_array_create() and
+ * populate it by declaring one hypercall buffer in a loop and
+ * allocating the buffer with xc_hypercall_buffer_array_alloc().
+ *
+ * To access a previously allocated buffer, declare a new hypercall
+ * buffer and call xc_hypercall_buffer_array_get().
+ *
+ * Destroy the array with xc_hypercall_buffer_array_destroy() to free
+ * the array and all its allocated hypercall buffers.
+ */
+struct xc_hypercall_buffer_array;
+typedef struct xc_hypercall_buffer_array xc_hypercall_buffer_array_t;
+
+xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch, unsigned n);
+void *xc__hypercall_buffer_array_alloc(xc_interface *xch, xc_hypercall_buffer_array_t *array,
+                                       unsigned index, xc_hypercall_buffer_t *hbuf, size_t size);
+#define xc_hypercall_buffer_array_alloc(_xch, _array, _index, _name, _size) \
+    xc__hypercall_buffer_array_alloc(_xch, _array, _index, HYPERCALL_BUFFER(_name), _size)
+void *xc__hypercall_buffer_array_get(xc_interface *xch, xc_hypercall_buffer_array_t *array,
+                                     unsigned index, xc_hypercall_buffer_t *hbuf);
+#define xc_hypercall_buffer_array_get(_xch, _array, _index, _name, _size) \
+    xc__hypercall_buffer_array_get(_xch, _array, _index, HYPERCALL_BUFFER(_name))
+void xc_hypercall_buffer_array_destroy(xc_interface *xc, xc_hypercall_buffer_array_t *array);
+
+/*
  * CPUMAP handling
  */
 typedef uint8_t *xc_cpumap_t;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/9] libxc: add API for kexec hypercall
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (5 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 6/9] libxc: add hypercall buffer arrays David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 8/9] x86: check kexec relocation code fits in a page David Vrabel
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add xc_kexec_exec(), xc_kexec_get_range(), xc_kexec_load(), and
xc_kexec_unload().  The load and unload calls require the v2 load and
unload ops.
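
As a rough sketch of how these calls fit together (not part of this patch:
example_load_crash_image(), image, image_size and entry are made-up names,
and the image is assumed to fit in the crash area):

    /* Sketch only: this function and its arguments are hypothetical. */
    static int example_load_crash_image(xc_interface *xch, const void *image,
                                        uint64_t image_size, uint64_t entry)
    {
        DECLARE_HYPERCALL_BUFFER(void, seg_buf);
        xen_kexec_segment_t seg;
        uint64_t start, size;
        int rc;

        /* Find the crash region reserved by Xen. */
        if ( xc_kexec_get_range(xch, KEXEC_RANGE_MA_CRASH, 0, &size, &start) )
            return -1;

        /* Segment data must be passed in hypercall-safe memory. */
        seg_buf = xc_hypercall_buffer_alloc(xch, seg_buf, image_size);
        if ( seg_buf == NULL )
            return -1;
        memcpy(seg_buf, image, image_size);

        /* A single segment placed at the base of the crash region. */
        set_xen_guest_handle(seg.buf.h, seg_buf);
        seg.buf_size = image_size;
        seg.dest_maddr = start;
        seg.dest_size = image_size;

        rc = xc_kexec_load(xch, KEXEC_TYPE_CRASH, EM_X86_64, entry, 1, &seg);

        xc_hypercall_buffer_free(xch, seg_buf);
        return rc;
    }

A KEXEC_TYPE_DEFAULT image loaded the same way would later be started with
xc_kexec_exec(), and either type can be removed again with
xc_kexec_unload().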

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 tools/libxc/Makefile   |    1 +
 tools/libxc/xc_kexec.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h  |   55 +++++++++++++++++++
 3 files changed, 196 insertions(+), 0 deletions(-)
 create mode 100644 tools/libxc/xc_kexec.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 512a994..528198e 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -31,6 +31,7 @@ CTRL_SRCS-y       += xc_mem_access.c
 CTRL_SRCS-y       += xc_memshr.c
 CTRL_SRCS-y       += xc_hcall_buf.c
 CTRL_SRCS-y       += xc_foreign_memory.c
+CTRL_SRCS-y       += xc_kexec.c
 CTRL_SRCS-y       += xtl_core.c
 CTRL_SRCS-y       += xtl_logger_stdio.c
 CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c
new file mode 100644
index 0000000..a49cffb
--- /dev/null
+++ b/tools/libxc/xc_kexec.c
@@ -0,0 +1,140 @@
+/******************************************************************************
+ * xc_kexec.c
+ *
+ * API for loading and executing kexec images.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ */
+#include "xc_private.h"
+
+int xc_kexec_exec(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec);
+    int ret = -1;
+
+    exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec));
+    if ( exec == NULL )
+    {
+        PERROR("Count not alloc bounce buffer for kexec_exec hypercall");
+        goto out;
+    }
+
+    exec->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, exec);
+
+    return ret;
+}
+
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range);
+    int ret = -1;
+
+    get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range));
+    if ( get_range == NULL )
+    {
+        PERROR("Could not alloc bounce buffer for kexec_get_range hypercall");
+        goto out;
+    }
+
+    get_range->range = range;
+    get_range->nr = nr;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_get_range;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+    *size = get_range->size;
+    *start = get_range->start;
+
+out:
+    xc_hypercall_buffer_free(xch, get_range);
+
+    return ret;
+}
+
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments)
+{
+    int ret = -1;
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments,
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load);
+
+    if ( xc_hypercall_bounce_pre(xch, segments) )
+    {
+        PERROR("Could not allocate bounce buffer for kexec load hypercall");
+        goto out;
+    }
+    load = xc_hypercall_buffer_alloc(xch, load, sizeof(*load));
+    if ( load == NULL )
+    {
+        PERROR("Could not allocate buffer for kexec load hypercall");
+        goto out;
+    }
+
+    load->type = type;
+    load->arch = arch;
+    load->entry_maddr = entry_maddr;
+    load->nr_segments = nr_segments;
+    set_xen_guest_handle(load->segments.h, segments);
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_load;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, load);
+    xc_hypercall_bounce_post(xch, segments);
+
+    return ret;
+}
+
+int xc_kexec_unload(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload);
+    int ret = -1;
+
+    unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload));
+    if ( unload == NULL )
+    {
+        PERROR("Count not alloc buffer for kexec unload hypercall");
+        goto out;
+    }
+
+    unload->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_unload;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, unload);
+
+    return ret;
+}
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 1119943..d77a8b7 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -46,6 +46,7 @@
 #include <xen/hvm/params.h>
 #include <xen/xsm/flask_op.h>
 #include <xen/tmem.h>
+#include <xen/kexec.h>
 
 #include "xentoollog.h"
 
@@ -2328,4 +2329,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf,
 				   unsigned long compbuf_size,
 				   unsigned long *compbuf_pos, char *dest);
 
+/*
+ * Execute an image previously loaded with xc_kexec_load().
+ *
+ * Does not return on success.
+ *
+ * Fails with:
+ *   ENOENT if the specified image has not been loaded.
+ */
+int xc_kexec_exec(xc_interface *xch, int type);
+
+/*
+ * Find the machine address and size of certain memory areas.
+ *
+ *   KEXEC_RANGE_MA_CRASH       crash area
+ *   KEXEC_RANGE_MA_XEN         Xen itself
+ *   KEXEC_RANGE_MA_CPU         CPU note for CPU number 'nr'
+ *   KEXEC_RANGE_MA_XENHEAP     xenheap
+ *   KEXEC_RANGE_MA_EFI_MEMMAP  EFI Memory Map
+ *   KEXEC_RANGE_MA_VMCOREINFO  vmcoreinfo
+ *
+ * Fails with:
+ *   EINVAL if the range or CPU number isn't valid.
+ */
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start);
+
+/*
+ * Load a kexec image into memory.
+ *
+ * The image may be of type KEXEC_TYPE_DEFAULT (executed on request)
+ * or KEXEC_TYPE_CRASH (executed on a crash).
+ *
+ * The image architecture may be a 32-bit variant of the hypervisor
+ * architecture (e.g., EM_386 on an x86-64 hypervisor).
+ *
+ * Fails with:
+ *   ENOMEM if there is insufficient memory for the new image.
+ *   EINVAL if the image does not fit into the crash area or the entry
+ *          point isn't within one of the segments.
+ *   EBUSY  if another image is being executed.
+ */
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments);
+
+/*
+ * Unload a kexec image.
+ *
+ * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from
+ * being executed.  The crash images are not cleared from the crash
+ * region.
+ */
+int xc_kexec_unload(xc_interface *xch, int type);
+
 #endif /* XENCTRL_H */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/9] x86: check kexec relocation code fits in a page
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (6 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 7/9] libxc: add API for kexec hypercall David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-20 13:10 ` [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer David Vrabel
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

The kexec relocation (control) code must fit in a single page, so add a
link time check for this.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/arch/x86/xen.lds.S |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 9600cdf..17db361 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -198,3 +198,5 @@ SECTIONS
   .stab.indexstr 0 : { *(.stab.indexstr) }
   .comment 0 : { *(.comment) }
 }
+
+ASSERT(kexec_reloc_size - kexec_reloc <= PAGE_SIZE, "kexec_reloc is too large")
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (7 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 8/9] x86: check kexec relocation code fits in a page David Vrabel
@ 2013-09-20 13:10 ` David Vrabel
  2013-09-22 20:20 ` [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels Daniel Kiper
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-20 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 MAINTAINERS |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index adacac2..4aac28c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -197,6 +197,14 @@ X:	xen/drivers/passthrough/amd/
 X:	xen/drivers/passthrough/vtd/
 F:	xen/include/xen/iommu.h
 
+KEXEC
+M:      David Vrabel <david.vrabel@citrix.com>
+S:      Supported
+F:      xen/common/{kexec,kimage}.c
+F:      xen/include/{kexec,kimage}.h
+F:      xen/arch/x86/machine_kexec.c
+F:      xen/arch/x86/x86_64/kexec_reloc.S
+
 LINUX (PV_OPS)
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 S:	Supported
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (8 preceding siblings ...)
  2013-09-20 13:10 ` [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer David Vrabel
@ 2013-09-22 20:20 ` Daniel Kiper
  2013-10-04 11:50   ` David Vrabel
  2013-09-22 20:27 ` Daniel Kiper
  2013-10-04 21:40 ` Daniel Kiper
  11 siblings, 1 reply; 31+ messages in thread
From: Daniel Kiper @ 2013-09-22 20:20 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Fri, Sep 20, 2013 at 02:10:46PM +0100, David Vrabel wrote:
> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image.  This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from a HVM or PVH privileged domain.

Currently I am on holiday. I will do the final review
and testing at the beginning of October.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (9 preceding siblings ...)
  2013-09-22 20:20 ` [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels Daniel Kiper
@ 2013-09-22 20:27 ` Daniel Kiper
  2013-10-04 21:40 ` Daniel Kiper
  11 siblings, 0 replies; 31+ messages in thread
From: Daniel Kiper @ 2013-09-22 20:27 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Fri, Sep 20, 2013 at 02:10:46PM +0100, David Vrabel wrote:
> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image.  This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from a HVM or PVH privileged domain.

Please repost both patch series because the kexec list address is incorrect.
It should be kexec@lists.infradead.org.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops
  2013-09-20 13:10 ` [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops David Vrabel
@ 2013-09-24 15:23   ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2013-09-24 15:23 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, kexec, Keir Fraser, Daniel Kiper

>>> On 20.09.13 at 15:10, David Vrabel <david.vrabel@citrix.com> wrote:
> From: David Vrabel <david.vrabel@citrix.com>
> 
> Add replacement KEXEC_CMD_load and KEXEC_CMD_unload sub-ops to the
> kexec hypercall.  These new sub-ops allow a privileged guest to
> provide the image data to be loaded into Xen memory or the crash
> region instead of guests loading the image data themselves and
> providing the relocation code and metadata.
> 
> The old interface is provided to guests requesting an interface
> version prior to 4.3.
> 
> Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.

The paragraph directly preceding this sentence is inconsistent
with this.

> --- a/xen/include/public/kexec.h
> +++ b/xen/include/public/kexec.h
> @@ -116,12 +116,12 @@ typedef struct xen_kexec_exec {
>   * type  == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
>   * image == relocation information for kexec (ignored for unload) [in]
>   */
> -#define KEXEC_CMD_kexec_load            1
> -#define KEXEC_CMD_kexec_unload          2
> -typedef struct xen_kexec_load {
> +#define KEXEC_CMD_kexec_load_v1         1 /* obsolete since 0x00040300 */
> +#define KEXEC_CMD_kexec_unload_v1       2 /* obsolete since 0x00040300 */

And so are these comments.

Jan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-09-22 20:20 ` [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels Daniel Kiper
@ 2013-10-04 11:50   ` David Vrabel
  0 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-10-04 11:50 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: Keir Fraser, Jan Beulich, xen-devel

On 22/09/13 21:20, Daniel Kiper wrote:
> On Fri, Sep 20, 2013 at 02:10:46PM +0100, David Vrabel wrote:
>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>> responsible for loading and relocating the image.  This allows kexec
>> to be usable by pv-ops kernels and should allow kexec to be usable
>> from a HVM or PVH privileged domain.
> 
> Currently I am on holiday. I will do final review
> and test at the beginning of October.

Ping?

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-09-20 13:10 ` [PATCH 4/9] kexec: extend hypercall with improved load/unload ops David Vrabel
@ 2013-10-04 21:23   ` Daniel Kiper
  2013-10-06 18:16     ` Andrew Cooper
  2013-10-07  9:23     ` David Vrabel
  0 siblings, 2 replies; 31+ messages in thread
From: Daniel Kiper @ 2013-10-04 21:23 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> In the existing kexec hypercall, the load and unload ops depend on
> internals of the Linux kernel (the page list and code page provided by
> the kernel).  The code page is used to transition between Xen context
> and the image so using kernel code doesn't make sense and will not
> work for PVH guests.
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> that no longer require a code page to be provided by the guest -- Xen
> now provides the code for calling the image directly.
>
> The new load op looks similar to the Linux kexec_load system call and
> allows the guest to provide the image data to be loaded.  The guest
> specifies the architecture of the image which may be a 32-bit subarch
> of the hypervisor's architecture (i.e., an EM_386 image on an
> EM_X86_64 hypervisor).
>
> The toolstack can now load images without kernel involvement.  This is
> required for supporting kexec when using a dom0 with an upstream
> kernel.
>
> Crash images are copied directly into the crash region on load.
> Default images are copied into domheap pages and a list of source and
> destination machine addresses is created.  This is list is used in
> kexec_reloc() to relocate the image to its destination.
>
> The old load and unload sub-ops are still available (as
> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
> of the new infrastructure.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

[...]

> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> new file mode 100644
> index 0000000..41dd27b
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> @@ -0,0 +1,208 @@
> +/*
> + * Relocate a kexec_image to its destination and call it.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <xen/config.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/asm_defns.h>
> +#include <asm/msr.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
> +
> +        .text
> +        .align PAGE_SIZE
> +        .code64
> +
> +ENTRY(kexec_reloc)
> +        /* %rdi - code page maddr */
> +        /* %rsi - page table maddr */
> +        /* %rdx - indirection page maddr */
> +        /* %rcx - entry maddr */
> +        /* %r8 - flags */
> +
> +        movq %rdx, %rbx

Delete movq %rdx, %rbx

> +        /* Setup stack. */
> +        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
> +
> +        /* Load reloc page table. */
> +        movq %rsi, %cr3
> +
> +        /* Jump to identity mapped code. */
> +        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
> +        jmpq *%rax
> +
> +identity_mapped:
> +        pushq %rcx
> +        pushq %rbx
> +        pushq %rsi
> +        pushq %rdi

Delete pushq %rbx, pushq %rsi, pushq %rdi

> +        /*
> +         * Set cr0 to a known state:
> +         *  - Paging enabled
> +         *  - Alignment check disabled
> +         *  - Write protect disabled
> +         *  - No task switch
> +         *  - Don't do FP software emulation.
> +         *  - Protected mode enabled
> +         */
> +        movq    %cr0, %rax
> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
> +        movq    %rax, %cr0
> +
> +        /*
> +         * Set cr4 to a known state:
> +         *  - physical address extension enabled
> +         */
> +        movl    $X86_CR4_PAE, %eax
> +        movq    %rax, %cr4
> +
> +        movq %rbx, %rdi

movq %rdx, %rdi

> +        call relocate_pages
> +
> +        popq %rdi
> +        popq %rsi
> +        popq %rbx
> +        popq %rcx

Delete popq %rdi, popq %rsi, popq %rbx

> +        /* Need to switch to 32-bit mode? */
> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
> +        jnz call_32_bit
> +
> +call_64_bit:
> +        /* Call the image entry point.  This should never return. */

I think that all general purpose registers (including %rsi, %rdi, %rbp
and %rsp) should be zeroed here. We should leave as little information
about the previous system as possible, especially in the kexec case.
Just in case. Please look into linux/arch/x86/kernel/relocate_kernel_64.S
for more details.

> +        callq *%rcx

Maybe we should use retq to jump to the image entry point. If not,
I think that we should store the image entry point address in %rax
(just for the sake of order).

> +        ud2
> +
> +call_32_bit:
> +        /* Setup IDT. */
> +        lidt compat_mode_idt(%rip)
> +
> +        /* Load compat GDT. */
> +        leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax
> +        movq %rax, (compat_mode_gdt_desc + 2)(%rip)
> +        lgdt compat_mode_gdt_desc(%rip)
> +
> +        /* Relocate compatibility mode entry point address. */
> +        leal (compatibility_mode - kexec_reloc)(%edi), %eax
> +        movl %eax, compatibility_mode_far(%rip)
> +
> +        /* Enter compatibility mode. */
> +        ljmp *compatibility_mode_far(%rip)
> +
> +relocate_pages:
> +        /* %rdi - indirection page maddr */
> +        cld
> +        movq    %rdi, %rcx
> +        xorl    %edi, %edi
> +        xorl    %esi, %esi
> +        jmp     is_dest
> +
> +next_entry: /* top, read another word for the indirection page */
> +
> +        movq    (%rbx), %rcx
> +        addq    $8, %rbx
> +is_dest:
> +        testb   $IND_DESTINATION, %cl
> +        jz      is_ind
> +        movq    %rcx, %rdi
> +        andq    $PAGE_MASK, %rdi
> +        jmp     next_entry
> +is_ind:
> +        testb   $IND_INDIRECTION, %cl
> +        jz      is_done
> +        movq    %rcx, %rbx
> +        andq    $PAGE_MASK, %rbx
> +        jmp     next_entry
> +is_done:
> +        testb   $IND_DONE, %cl
> +        jnz     done
> +is_source:
> +        testb   $IND_SOURCE, %cl
> +        jz      is_zero
> +        movq    %rcx, %rsi      /* For every source page do a copy */
> +        andq    $PAGE_MASK, %rsi
> +        movl    $(PAGE_SIZE / 8), %ecx
> +        rep movsq
> +        jmp     next_entry
> +is_zero:
> +        testb   $IND_ZERO, %cl
> +        jz      next_entry
> +        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
> +        xorl    %eax, %eax
> +        rep stosq
> +        jmp     next_entry
> +done:
> +        ret
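
As an aside, a rough C rendering of the walk above (a sketch only, not
Xen code; the IND_* values are assumed to match Linux's kexec flags,
with IND_ZERO taken to be the next free bit, and machine addresses are
treated as plain pointers for illustration):

#include <string.h>

#define PAGE_SIZE        4096UL
#define PAGE_MASK        (~(PAGE_SIZE - 1))

#define IND_DESTINATION  0x1   /* entry names the next destination page */
#define IND_INDIRECTION  0x2   /* entry names the next indirection page */
#define IND_DONE         0x4   /* end of the list */
#define IND_SOURCE       0x8   /* entry names a page to copy to the destination */
#define IND_ZERO         0x10  /* assumed value: zero the destination page */

static void relocate_pages(unsigned long head)
{
    unsigned long entry = head;  /* image->head, usually an IND_INDIRECTION entry */
    unsigned long *next = NULL;  /* current position in the indirection page */
    char *dest = NULL;           /* current destination page */

    for ( ; ; )
    {
        if ( entry & IND_DESTINATION )
            dest = (char *)(entry & PAGE_MASK);
        else if ( entry & IND_INDIRECTION )
            next = (unsigned long *)(entry & PAGE_MASK);
        else if ( entry & IND_DONE )
            break;
        else if ( entry & IND_SOURCE )
        {
            memcpy(dest, (const void *)(entry & PAGE_MASK), PAGE_SIZE);
            dest += PAGE_SIZE;
        }
        else if ( entry & IND_ZERO )
        {
            memset(dest, 0, PAGE_SIZE);
            dest += PAGE_SIZE;
        }
        entry = *next++;         /* fetch the next word from the indirection page */
    }
}
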
> +
> +        .code32
> +
> +compatibility_mode:
> +        /* Setup some sane segments. */
> +        movl $0x0008, %eax
> +        movl %eax, %ds
> +        movl %eax, %es
> +        movl %eax, %fs
> +        movl %eax, %gs
> +        movl %eax, %ss
> +
> +        movl %ecx, %ebp
> +
> +        /* Disable paging and therefore leave 64 bit mode. */
> +        movl %cr0, %eax
> +        andl $~X86_CR0_PG, %eax
> +        movl %eax, %cr0
> +
> +        /* Disable long mode */
> +        movl    $MSR_EFER, %ecx
> +        rdmsr
> +        andl    $~EFER_LME, %eax
> +        wrmsr
> +
> +        /* Clear cr4 to disable PAE. */
> +        xorl    %eax, %eax
> +        movl    %eax, %cr4
> +
> +        /* Call the image entry point.  This should never return. */

Ditto.

> +        call *%ebp

Ditto.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-09-20 13:10 [PATCHv8 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (10 preceding siblings ...)
  2013-09-22 20:27 ` Daniel Kiper
@ 2013-10-04 21:40 ` Daniel Kiper
  11 siblings, 0 replies; 31+ messages in thread
From: Daniel Kiper @ 2013-10-04 21:40 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Fri, Sep 20, 2013 at 02:10:46PM +0100, David Vrabel wrote:
> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image.  This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from a HVM or PVH privileged domain.

[...]

I am back. I have reviewed the latest patch series. There are some minor
issues which should be fixed before the patches are applied to Xen
unstable. Please also take into account Jan's and Konrad's comments.
Do not forget to repost to the correct kexec list address
(kexec@lists.infradead.org).

You could add Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
to the kexec-tools patch series.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-04 21:23   ` Daniel Kiper
@ 2013-10-06 18:16     ` Andrew Cooper
  2013-10-07  7:47       ` Daniel Kiper
  2013-10-07  9:23     ` David Vrabel
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Cooper @ 2013-10-06 18:16 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Keir Fraser, David Vrabel, Jan Beulich, xen-devel

On 04/10/2013 22:23, Daniel Kiper wrote:
> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> In the existing kexec hypercall, the load and unload ops depend on
>> internals of the Linux kernel (the page list and code page provided by
>> the kernel).  The code page is used to transition between Xen context
>> and the image so using kernel code doesn't make sense and will not
>> work for PVH guests.
>>
>> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
>> that no longer require a code page to be provided by the guest -- Xen
>> now provides the code for calling the image directly.
>>
>> The new load op looks similar to the Linux kexec_load system call and
>> allows the guest to provide the image data to be loaded.  The guest
>> specifies the architecture of the image which may be a 32-bit subarch
>> of the hypervisor's architecture (i.e., an EM_386 image on an
>> EM_X86_64 hypervisor).
>>
>> The toolstack can now load images without kernel involvement.  This is
>> required for supporting kexec when using a dom0 with an upstream
>> kernel.
>>
>> Crash images are copied directly into the crash region on load.
>> Default images are copied into domheap pages and a list of source and
>> destination machine addresses is created.  This is list is used in
>> kexec_reloc() to relocate the image to its destination.
>>
>> The old load and unload sub-ops are still available (as
>> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
>> of the new infrastructure.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> [...]
>
>> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
>> new file mode 100644
>> index 0000000..41dd27b
>> --- /dev/null
>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
>> @@ -0,0 +1,208 @@
>> +/*
>> + * Relocate a kexec_image to its destination and call it.
>> + *
>> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
>> + *
>> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
>> + *
>> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
>> + *
>> + * This source code is licensed under the GNU General Public License,
>> + * Version 2.  See the file COPYING for more details.
>> + */
>> +#include <xen/config.h>
>> +#include <xen/kimage.h>
>> +
>> +#include <asm/asm_defns.h>
>> +#include <asm/msr.h>
>> +#include <asm/page.h>
>> +#include <asm/machine_kexec.h>
>> +
>> +        .text
>> +        .align PAGE_SIZE
>> +        .code64
>> +
>> +ENTRY(kexec_reloc)
>> +        /* %rdi - code page maddr */
>> +        /* %rsi - page table maddr */
>> +        /* %rdx - indirection page maddr */
>> +        /* %rcx - entry maddr */
>> +        /* %r8 - flags */
>> +
>> +        movq %rdx, %rbx
> Delete movq %rdx, %rbx
>
>> +        /* Setup stack. */
>> +        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
>> +
>> +        /* Load reloc page table. */
>> +        movq %rsi, %cr3
>> +
>> +        /* Jump to identity mapped code. */
>> +        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
>> +        jmpq *%rax
>> +
>> +identity_mapped:
>> +        pushq %rcx
>> +        pushq %rbx
>> +        pushq %rsi
>> +        pushq %rdi
> Delete pushq %rbx, pushq %rsi, pushq %rdi
>
>> +        /*
>> +         * Set cr0 to a known state:
>> +         *  - Paging enabled
>> +         *  - Alignment check disabled
>> +         *  - Write protect disabled
>> +         *  - No task switch
>> +         *  - Don't do FP software emulation.
>> +         *  - Proctected mode enabled
>> +         */
>> +        movq    %cr0, %rax
>> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
>> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
>> +        movq    %rax, %cr0
>> +
>> +        /*
>> +         * Set cr4 to a known state:
>> +         *  - physical address extension enabled
>> +         */
>> +        movl    $X86_CR4_PAE, %eax
>> +        movq    %rax, %cr4
>> +
>> +        movq %rbx, %rdi
> movq %rdx, %rdi
>
>> +        call relocate_pages
>> +
>> +        popq %rdi
>> +        popq %rsi
>> +        popq %rbx
>> +        popq %rcx
> Delete popq %rdi, popq %rsi, popq %rbx
>
>> +        /* Need to switch to 32-bit mode? */
>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
>> +        jnz call_32_bit
>> +
>> +call_64_bit:
>> +        /* Call the image entry point.  This should never return. */
> I think that all general purpose registers (including %rsi, %rdi, %rbp
> and %rsp) should be zeroed here. We should leave as little as possible
> info about previous system. Especially in kexec case. Just in case.
> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> for more details.

In the kexec case, we are specifically trying to leak the state of the
entire system outside of the crash area so something can investigate.

Zeroing the registers for the sake of it is pointless.  Any author of
kexec image expecting all GPRs to be 0 on entry deserves the debugging
problems they are asking for.

>
>> +        callq *%rcx
> Maybe we should use retq to jump into image entry point. If not
> I think that we should store image entry point address in %rax
> (just to the order).

retq with a zeroed %rsp ?

This code was derived from the Linux code, which allows a kexec image to
pass parameters (which is also why the registers are pushed/popped above)
and return back to the originator.  (Although I have to admit to being
curious as to how someone would go about trying to justify using this
feature in a real use case.)

It might be reasonable to replace "callq *%rcx; ud2" with "jmp *%rcx",
but all of this does feel like nitpicking on a codepath which
(hopefully) will never be executed.

If we decide here and now that Xen shall never support passing arguments
to the kexec image, and never support the kexec image returning back to
us then perhaps we can go ahead and change things, although I would
argue that consistency with the other implementation is more important.

~Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-06 18:16     ` Andrew Cooper
@ 2013-10-07  7:47       ` Daniel Kiper
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Kiper @ 2013-10-07  7:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: kexec, Keir Fraser, David Vrabel, Jan Beulich, xen-devel

On Sun, Oct 06, 2013 at 07:16:16PM +0100, Andrew Cooper wrote:
> On 04/10/2013 22:23, Daniel Kiper wrote:
> > On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> In the existing kexec hypercall, the load and unload ops depend on
> >> internals of the Linux kernel (the page list and code page provided by
> >> the kernel).  The code page is used to transition between Xen context
> >> and the image so using kernel code doesn't make sense and will not
> >> work for PVH guests.
> >>
> >> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> >> that no longer require a code page to be provided by the guest -- Xen
> >> now provides the code for calling the image directly.
> >>
> >> The new load op looks similar to the Linux kexec_load system call and
> >> allows the guest to provide the image data to be loaded.  The guest
> >> specifies the architecture of the image which may be a 32-bit subarch
> >> of the hypervisor's architecture (i.e., an EM_386 image on an
> >> EM_X86_64 hypervisor).
> >>
> >> The toolstack can now load images without kernel involvement.  This is
> >> required for supporting kexec when using a dom0 with an upstream
> >> kernel.
> >>
> >> Crash images are copied directly into the crash region on load.
> >> Default images are copied into domheap pages and a list of source and
> >> destination machine addresses is created.  This is list is used in
> >> kexec_reloc() to relocate the image to its destination.
> >>
> >> The old load and unload sub-ops are still available (as
> >> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
> >> of the new infrastructure.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> > [...]
> >
> >> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> >> new file mode 100644
> >> index 0000000..41dd27b
> >> --- /dev/null
> >> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> >> @@ -0,0 +1,208 @@
> >> +/*
> >> + * Relocate a kexec_image to its destination and call it.
> >> + *
> >> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> >> + *
> >> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
> >> + *
> >> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> >> + *
> >> + * This source code is licensed under the GNU General Public License,
> >> + * Version 2.  See the file COPYING for more details.
> >> + */
> >> +#include <xen/config.h>
> >> +#include <xen/kimage.h>
> >> +
> >> +#include <asm/asm_defns.h>
> >> +#include <asm/msr.h>
> >> +#include <asm/page.h>
> >> +#include <asm/machine_kexec.h>
> >> +
> >> +        .text
> >> +        .align PAGE_SIZE
> >> +        .code64
> >> +
> >> +ENTRY(kexec_reloc)
> >> +        /* %rdi - code page maddr */
> >> +        /* %rsi - page table maddr */
> >> +        /* %rdx - indirection page maddr */
> >> +        /* %rcx - entry maddr */
> >> +        /* %r8 - flags */
> >> +
> >> +        movq %rdx, %rbx
> > Delete movq %rdx, %rbx
> >
> >> +        /* Setup stack. */
> >> +        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
> >> +
> >> +        /* Load reloc page table. */
> >> +        movq %rsi, %cr3
> >> +
> >> +        /* Jump to identity mapped code. */
> >> +        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
> >> +        jmpq *%rax
> >> +
> >> +identity_mapped:
> >> +        pushq %rcx
> >> +        pushq %rbx
> >> +        pushq %rsi
> >> +        pushq %rdi
> > Delete pushq %rbx, pushq %rsi, pushq %rdi
> >
> >> +        /*
> >> +         * Set cr0 to a known state:
> >> +         *  - Paging enabled
> >> +         *  - Alignment check disabled
> >> +         *  - Write protect disabled
> >> +         *  - No task switch
> >> +         *  - Don't do FP software emulation.
> >> +         *  - Proctected mode enabled
> >> +         */
> >> +        movq    %cr0, %rax
> >> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
> >> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
> >> +        movq    %rax, %cr0
> >> +
> >> +        /*
> >> +         * Set cr4 to a known state:
> >> +         *  - physical address extension enabled
> >> +         */
> >> +        movl    $X86_CR4_PAE, %eax
> >> +        movq    %rax, %cr4
> >> +
> >> +        movq %rbx, %rdi
> > movq %rdx, %rdi
> >
> >> +        call relocate_pages
> >> +
> >> +        popq %rdi
> >> +        popq %rsi
> >> +        popq %rbx
> >> +        popq %rcx
> > Delete popq %rdi, popq %rsi, popq %rbx
> >
> >> +        /* Need to switch to 32-bit mode? */
> >> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
> >> +        jnz call_32_bit
> >> +
> >> +call_64_bit:
> >> +        /* Call the image entry point.  This should never return. */
> > I think that all general purpose registers (including %rsi, %rdi, %rbp
> > and %rsp) should be zeroed here. We should leave as little as possible
> > info about previous system. Especially in kexec case. Just in case.
> > Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> > for more details.
>
> In the kexec case, we are specifically trying to leak the state of the
> entire system outside of the crash area so something can investigate.

You are thinking about kdump, which is a different thing from kexec.

> Zeroing the registers for the sake of it is pointless.  Any author of
> kexec image expecting all GPRs to be 0 on entry deserves the debugging
> problems they are asking for.

I agree that nobody should assume that any given register contains a
specific value in this case. However, we should not leak data (especially
in the kexec, not kdump, case) if we can wipe it at almost no cost. I know
that memory contains more data, but why would we want to make the bad guys'
lives easier?

> >> +        callq *%rcx
> > Maybe we should use retq to jump into image entry point. If not
> > I think that we should store image entry point address in %rax
> > (just to the order).
>
> retq with a zeroed %rsp ?

I hope that you are joking here. I assume that everybody on this list
knows which register should be set in each case.

> This code was derived from the Linux code, which allows a kexec image to
> pass parameters (which is also why the registers are pushed/popped above)
> and return back to the originator.  (Although I have to admit to being
> curious as to how someone would go about trying to justify using this
> feature in a real use case.)

The original Linux code uses ret, not call.

> It might be reasonable to replace "callq *%rcx; ud2" with "jmp *%rcx",
> but all of this does feel like nitpicking on a codepath which
> (hopefully) will never be executed.
>
> If we decide here and now that Xen shall never support passing arguments
> to the kexec image, and never support the kexec image returning back to
> us then perhaps we can go ahead and change things, although I would
> argue that consistency with the other implementation is more important.

Right, so should we use ret? I do not insist on that. The more important thing
for me is the zeroing of registers. The latter is also done in the Linux kernel.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-04 21:23   ` Daniel Kiper
  2013-10-06 18:16     ` Andrew Cooper
@ 2013-10-07  9:23     ` David Vrabel
  2013-10-07 10:39       ` Daniel Kiper
  1 sibling, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-10-07  9:23 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On 04/10/13 22:23, Daniel Kiper wrote:
> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
>> --- /dev/null
>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
>> @@ -0,0 +1,208 @@
[...]
>> +ENTRY(kexec_reloc)
>> +        /* %rdi - code page maddr */
>> +        /* %rsi - page table maddr */
>> +        /* %rdx - indirection page maddr */
>> +        /* %rcx - entry maddr */
>> +        /* %r8 - flags */
>> +
>> +        movq %rdx, %rbx
> 
> Delete movq %rdx, %rbx

We avoid using %rdx in case we need to re-add the UART debugging.

>> +        /* Need to switch to 32-bit mode? */
>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
>> +        jnz call_32_bit
>> +
>> +call_64_bit:
>> +        /* Call the image entry point.  This should never return. */
> 
> I think that all general purpose registers (including %rsi, %rdi, %rbp
> and %rsp) should be zeroed here. We should leave as little as possible
> info about previous system. Especially in kexec case. Just in case.
> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> for more details.

Not initializing the registers is a deliberate design decision so exec'd
images cannot mistakenly rely on the register values.

Clearing a handful of words when all of host memory is accessible by the
exec'd image does nothing for security (as you suggest in a later email).

>> +        callq *%rcx
> 
> Maybe we should use retq to jump into image entry point. If not
> I think that we should store image entry point address in %rax
> (just to the order).

With call if an image does try to return it will fault at a well defined
point (the following ud2) making the failure a bit easier to diagnose.
With your suggestion it will fault somewhere random.

Linux uses call.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-07  9:23     ` David Vrabel
@ 2013-10-07 10:39       ` Daniel Kiper
  2013-10-07 10:55         ` David Vrabel
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Kiper @ 2013-10-07 10:39 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Mon, Oct 07, 2013 at 10:23:09AM +0100, David Vrabel wrote:
> On 04/10/13 22:23, Daniel Kiper wrote:
> > On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
> >> --- /dev/null
> >> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> >> @@ -0,0 +1,208 @@
> [...]
> >> +ENTRY(kexec_reloc)
> >> +        /* %rdi - code page maddr */
> >> +        /* %rsi - page table maddr */
> >> +        /* %rdx - indirection page maddr */
> >> +        /* %rcx - entry maddr */
> >> +        /* %r8 - flags */
> >> +
> >> +        movq %rdx, %rbx
> >
> > Delete movq %rdx, %rbx
>
> We avoid using %rdx in case we need to re-add the UART debugging.

That does not make sense to me. We could re-add it also if we remove this movq.
Now it is not clear why it is here. I think that it should be removed.

> >> +        /* Need to switch to 32-bit mode? */
> >> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
> >> +        jnz call_32_bit
> >> +
> >> +call_64_bit:
> >> +        /* Call the image entry point.  This should never return. */
> >
> > I think that all general purpose registers (including %rsi, %rdi, %rbp
> > and %rsp) should be zeroed here. We should leave as little as possible
> > info about previous system. Especially in kexec case. Just in case.
> > Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> > for more details.
>
> Not initializing the registers is a deliberate design decision so exec'd
> images cannot mistakenly rely on the register values.

Anybody who does this asks for problems. This is not our issue.

> Clearing a handful of words when all of host memory is accessible by the
> exec'd image does nothing for security (as you suggest in a later email).

I am aware that this does not solve all security issues but it could make simple
attacks more difficult. I think that it costs nothing. Linux does this.
Why could we not do that?

> >> +        callq *%rcx
> >
> > Maybe we should use retq to jump into image entry point. If not
> > I think that we should store image entry point address in %rax
> > (just to the order).
>
> With call if an image does try to return it will fault at a well defined
> point (the following ud2) making the failure a bit easier to diagnose.
> With your suggestion it will fault somewhere random.

As I said, I do not insist on changing that. I know why you did it
that way. ret is just an option to consider. That is all.

> Linux uses call.

OK, to be precise in almost all cases it uses ret. It uses call if preserve
context is enabled. However, I do not know who is using that feature (once
someone considered even removal of this feature). So ret is main path to go
to the new system image. Please check the sources.
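
For reference, the ret-based dispatch being described is roughly (a
sketch of the idea only, not the actual Linux or Xen code):

        pushq   %rcx            /* entry maddr becomes the "return" address */
        xorl    %ecx, %ecx      /* ...remaining GPRs can now be cleared... */
        retq                    /* pops the entry point and jumps to it */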

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-07 10:39       ` Daniel Kiper
@ 2013-10-07 10:55         ` David Vrabel
  2013-10-07 14:49           ` Daniel Kiper
  0 siblings, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-10-07 10:55 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On 07/10/13 11:39, Daniel Kiper wrote:
> On Mon, Oct 07, 2013 at 10:23:09AM +0100, David Vrabel wrote:
>> On 04/10/13 22:23, Daniel Kiper wrote:
>>> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
>>>> --- /dev/null
>>>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
>>>> @@ -0,0 +1,208 @@
>> [...]
>>>> +ENTRY(kexec_reloc)
>>>> +        /* %rdi - code page maddr */
>>>> +        /* %rsi - page table maddr */
>>>> +        /* %rdx - indirection page maddr */
>>>> +        /* %rcx - entry maddr */
>>>> +        /* %r8 - flags */
>>>> +
>>>> +        movq %rdx, %rbx
>>>
>>> Delete movq %rdx, %rbx
>>
>> We avoid using %rdx in case we need to re-add the UART debugging.
> 
> That does not make sense to me. We could re-add it also if we remove this movq.
> Now it is not clear why it is here. I think that it should be removed.

outb uses %rdx so avoiding using %rdx means any UART debugging macros
are trivial (since they don't have to save/restore the value in %rdx).
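
For example, the kind of ad-hoc debug helper being referred to might look
like this (a sketch only; it assumes a COM1-style UART at I/O port 0x3f8):

        .macro  DEBUG_CHAR char
        movw    $0x3f8, %dx     /* UART data register (assumed port) */
        movb    $\char, %al
        outb    %al, %dx        /* clobbers only %rax and %rdx */
        .endm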

>>>> +        /* Need to switch to 32-bit mode? */
>>>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
>>>> +        jnz call_32_bit
>>>> +
>>>> +call_64_bit:
>>>> +        /* Call the image entry point.  This should never return. */
>>>
>>> I think that all general purpose registers (including %rsi, %rdi, %rbp
>>> and %rsp) should be zeroed here. We should leave as little as possible
>>> info about previous system. Especially in kexec case. Just in case.
>>> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
>>> for more details.
>>
>> Not initializing the registers is a deliberate design decision so exec'd
>> images cannot mistakenly rely on the register values.
> 
> Anybody who does this asks for problems. This is not our issue.

Zeroing the registers makes that part of the ABI for calling images,
which means it can never be changed.  If the ABI is the register values
are undefined then this can be changed in the future to something that
is defined.

>> Clearing a handful of words when all of host memory is accessible by the
>> exec'd image does nothing for security (as you suggest in a later email).
> 
> I am aware that this does not solve all security issues but it could make simple
> attacks more difficult.

What attacks?  What security issues is zero-ing a tiny amount of state
going to prevent when the exec'd image has full control over the whole host?

Your reasoning is completely bogus.

>> Linux uses call.
> 
> OK, to be precise in almost all cases it uses ret. It uses call if preserve
> context is enabled. However, I do not know who is using that feature (once
> someone considered even removal of this feature). So ret is main path to go
> to the new system image. Please check the sources.

I missed that, which means I'm definitely not changing this from the
obvious call.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-07 10:55         ` David Vrabel
@ 2013-10-07 14:49           ` Daniel Kiper
  2013-10-07 17:44             ` David Vrabel
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Kiper @ 2013-10-07 14:49 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Keir Fraser, Jan Beulich, xen-devel

On Mon, Oct 07, 2013 at 11:55:08AM +0100, David Vrabel wrote:
> On 07/10/13 11:39, Daniel Kiper wrote:
> > On Mon, Oct 07, 2013 at 10:23:09AM +0100, David Vrabel wrote:
> >> On 04/10/13 22:23, Daniel Kiper wrote:
> >>> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
> >>>> --- /dev/null
> >>>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> >>>> @@ -0,0 +1,208 @@
> >> [...]
> >>>> +ENTRY(kexec_reloc)
> >>>> +        /* %rdi - code page maddr */
> >>>> +        /* %rsi - page table maddr */
> >>>> +        /* %rdx - indirection page maddr */
> >>>> +        /* %rcx - entry maddr */
> >>>> +        /* %r8 - flags */
> >>>> +
> >>>> +        movq %rdx, %rbx
> >>>
> >>> Delete movq %rdx, %rbx
> >>
> >> We avoid using %rdx in case we need to re-add the UART debugging.
> >
> > That does not make sense to me. We could re-add it also if we remove this movq.
> > Now it is not clear why it is here. I think that it should be removed.
>
> outb uses %rdx so avoiding using %rdx means any UART debugging macros
> are trivial (since they don't have to save/restore the value in %rdx).

Once again, there is no UART code so there is no sense in this movq.
Any smart developer (we have dozens of them here) knows how to write
the relevant code. Now this movq only obfuscates things.

> >>>> +        /* Need to switch to 32-bit mode? */
> >>>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
> >>>> +        jnz call_32_bit
> >>>> +
> >>>> +call_64_bit:
> >>>> +        /* Call the image entry point.  This should never return. */
> >>>
> >>> I think that all general purpose registers (including %rsi, %rdi, %rbp
> >>> and %rsp) should be zeroed here. We should leave as little as possible
> >>> info about previous system. Especially in kexec case. Just in case.
> >>> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> >>> for more details.
> >>
> >> Not initializing the registers is a deliberate design decision so exec'd
> >> images cannot mistakenly rely on the register values.
> >
> > Anybody who does this asks for problems. This is not our issue.
>
> Zeroing the registers makes that part of the ABI for calling images,
> which means it can never be changed.  If the ABI is the register values
> are undefined then this can be changed in the future to something that
> is defined.

I have never ever tried to define any ABI here. I have never ever said
that the caller must pass this and the callee must expect that. There is
no such definition in current Linux Kernel implementation too. Even purgatory
expects nothing special in registers. I am just saying that it is worth to wipe
data from GPRs. No more no less. If you would like to use any register to pass
argument later you could do that. My proposal does not impose any limits.

> >> Clearing a handful of words when all of host memory is accessible by the
> >> exec'd image does nothing for security (as you suggest in a later email).
> >
> > I am aware that this does not solve all security issues but it could make simple
> > attacks more difficult.
>
> What attacks?  What security issues is zero-ing a tiny amount of state
> going to prevent when the exec'd image has full control over the whole host?

I said "more difficult" not "prevent" and it makes difference.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-07 14:49           ` Daniel Kiper
@ 2013-10-07 17:44             ` David Vrabel
  2013-10-07 20:21               ` Daniel Kiper
  0 siblings, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-10-07 17:44 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: Keir Fraser, Jan Beulich, xen-devel

On 07/10/13 15:49, Daniel Kiper wrote:
> On Mon, Oct 07, 2013 at 11:55:08AM +0100, David Vrabel wrote:
>> On 07/10/13 11:39, Daniel Kiper wrote:
>>> On Mon, Oct 07, 2013 at 10:23:09AM +0100, David Vrabel wrote:
>>>> On 04/10/13 22:23, Daniel Kiper wrote:
>>>>> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
>>>>>> --- /dev/null
>>>>>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
>>>>>> @@ -0,0 +1,208 @@
>>>> [...]
>>>>>> +ENTRY(kexec_reloc)
>>>>>> +        /* %rdi - code page maddr */
>>>>>> +        /* %rsi - page table maddr */
>>>>>> +        /* %rdx - indirection page maddr */
>>>>>> +        /* %rcx - entry maddr */
>>>>>> +        /* %r8 - flags */
>>>>>> +
>>>>>> +        movq %rdx, %rbx
>>>>>
>>>>> Delete movq %rdx, %rbx
>>>>
>>>> We avoid using %rdx in case we need to re-add the UART debugging.
>>>
> >>> That does not make sense to me. We could re-add it also if we remove this movq.
>>> Now it is not clear why it is here. I think that it should be removed.
>>
>> outb uses %rdx so avoiding using %rdx means any UART debugging macros
>> are trivial (since they don't have to save/restore the value in %rdx).
> 
> Once again, there is no UART code so there is no sense in this movq.
> Any smart developer (we have dozens of them here) knows how to write
> the relevant code. Now this movq only obfuscates things.

As a smart engineer I know I would much prefer to drop in debugging code
with minimal (risky) changes to existing code.

That said, I now have access to an ICE so I don't much care about
debugging with a UART so I'll make the suggested change.

>>>>>> +        /* Need to switch to 32-bit mode? */
>>>>>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
>>>>>> +        jnz call_32_bit
>>>>>> +
>>>>>> +call_64_bit:
>>>>>> +        /* Call the image entry point.  This should never return. */
>>>>>
>>>>> I think that all general purpose registers (including %rsi, %rdi, %rbp
>>>>> and %rsp) should be zeroed here. We should leave as little as possible
>>>>> info about previous system. Especially in kexec case. Just in case.
>>>>> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
>>>>> for more details.
>>>>
>>>> Not initializing the registers is a deliberate design decision so exec'd
>>>> images cannot mistakenly rely on the register values.
>>>
>>> Anybody who does this asks for problems. This is not our issue.
>>
>> Zeroing the registers makes that part of the ABI for calling images,
>> which means it can never be changed.  If the ABI is the register values
>> are undefined then this can be changed in the future to something that
>> is defined.
> 
> I have never ever tried to define any ABI here. I have never ever said
> that the caller must pass this and the callee must expect that. There is
> no such definition in current Linux Kernel implementation too. Even purgatory
> expects nothing special in registers. I am just saying that it is worth to wipe
> data from GPRs. No more no less. If you would like to use any register to pass
> argument later you could do that. My proposal does not impose any limits.

There's an explicit ABI and an implicit one based on the implementation.
 In the absence of a formal test suite and certification program,
implicit ABI are often just as fixed and constraining as explicit ones.

>>>> Clearing a handful of words when all of host memory is accessible by the
>>>> exec'd image does nothing for security (as you suggest in a later email).
>>>
>>> I am aware that this does not solve all security issues but it could make simple
>>> attacks more difficult.
>>
>> What attacks?  What security issues is zero-ing a tiny amount of state
>> going to prevent when the exec'd image has full control over the whole host?
> 
> I said "more difficult" not "prevent" and it makes difference.

You didn't actually answer my question.  What security issue do you
think zeroing registers will mitigate?  Given that the exec'd image has
full control over the host, access to all memory and all devices.

Lipstick on one pig in a whole herd of pigs comes to mind...

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-07 17:44             ` David Vrabel
@ 2013-10-07 20:21               ` Daniel Kiper
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Kiper @ 2013-10-07 20:21 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, Jan Beulich, xen-devel

On Mon, Oct 07, 2013 at 06:44:17PM +0100, David Vrabel wrote:
> On 07/10/13 15:49, Daniel Kiper wrote:
> > On Mon, Oct 07, 2013 at 11:55:08AM +0100, David Vrabel wrote:
> >> On 07/10/13 11:39, Daniel Kiper wrote:
> >>> On Mon, Oct 07, 2013 at 10:23:09AM +0100, David Vrabel wrote:
> >>>> On 04/10/13 22:23, Daniel Kiper wrote:
> >>>>> On Fri, Sep 20, 2013 at 02:10:50PM +0100, David Vrabel wrote:
> >>>>>> --- /dev/null
> >>>>>> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> >>>>>> @@ -0,0 +1,208 @@
> >>>> [...]
> >>>>>> +ENTRY(kexec_reloc)
> >>>>>> +        /* %rdi - code page maddr */
> >>>>>> +        /* %rsi - page table maddr */
> >>>>>> +        /* %rdx - indirection page maddr */
> >>>>>> +        /* %rcx - entry maddr */
> >>>>>> +        /* %r8 - flags */
> >>>>>> +
> >>>>>> +        movq %rdx, %rbx
> >>>>>
> >>>>> Delete movq %rdx, %rbx
> >>>>
> >>>> We avoid using %rdx in case we need to re-add the UART debugging.
> >>>
> > >>> That does not make sense to me. We could re-add it also if we remove this movq.
> >>> Now it is not clear why it is here. I think that it should be removed.
> >>
> >> outb uses %rdx so avoiding using %rdx means any UART debugging macros
> >> are trivial (since they don't have to save/restore the value in %rdx).
> >
> > Once again, there is no UART code so there is no sense in this movq.
> > Any smart developer (we have dozens of them here) knows how to write
> > the relevant code. Now this movq only obfuscates things.
>
> As a smart engineer I know I would much prefer to drop in debugging code
> with minimal (risky) changes to existing code.
>
> That said, I now have access to an ICE so I don't much care about
> debugging with a UART so I'll make the suggested change.

Still, I am not convinced. This way you could add similar redundant stuff
in every piece of code, just in case, and finally make it unreadable. That
does not make sense to me. Please either remove this redundant movq, or add
the UART code, or add a relevant comment which explains why you do it this
way. 1 and 2 make sense to me; 3 much less, if at all.

> >>>>>> +        /* Need to switch to 32-bit mode? */
> >>>>>> +        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
> >>>>>> +        jnz call_32_bit
> >>>>>> +
> >>>>>> +call_64_bit:
> >>>>>> +        /* Call the image entry point.  This should never return. */
> >>>>>
> >>>>> I think that all general purpose registers (including %rsi, %rdi, %rbp
> >>>>> and %rsp) should be zeroed here. We should leave as little as possible
> >>>>> info about previous system. Especially in kexec case. Just in case.
> >>>>> Please look into linux/arch/x86/kernel/relocate_kernel_64.S
> >>>>> for more details.
> >>>>
> >>>> Not initializing the registers is a deliberate design decision so exec'd
> >>>> images cannot mistakenly rely on the register values.
> >>>
> >>> Anybody who does this asks for problems. This is not our issue.
> >>
> >> Zeroing the registers makes that part of the ABI for calling images,
> >> which means it can never be changed.  If the ABI is the register values
> >> are undefined then this can be changed in the future to something that
> >> is defined.
> >
> > I have never ever tried to define any ABI here. I have never ever said
> > that the caller must pass this and the callee must expect that. There is
> > no such definition in current Linux Kernel implementation too. Even purgatory
> > expects nothing special in registers. I am just saying that it is worth to wipe
> > data from GPRs. No more no less. If you would like to use any register to pass
> > argument later you could do that. My proposal does not impose any limits.
>
> There's an explicit ABI and an implicit one based on the implementation.
>  In the absence of a formal test suite and certification program,
> implicit ABI are often just as fixed and constraining as explicit ones.

Could you tell me where you see an ABI here?

> >>>> Clearing a handful of words when all of host memory is accessible by the
> >>>> exec'd image does nothing for security (as you suggest in a later email).
> >>>
> >>> I am aware that this does not solve all security issues but it could make simple
> >>> attacks more difficult.
> >>
> >> What attacks?  What security issues is zero-ing a tiny amount of state
> >> going to prevent when the exec'd image has full control over the whole host?
> >
> > I said "more difficult" not "prevent" and it makes difference.
>
> You didn't actually answer my question.  What security issue do you
> think zeroing registers will mitigate?  Given that the exec'd image has
> full control over the host, access to all memory and all devices.

I am able to imagine a situation in which some register values could
tell you how to put all the interesting pieces together from somewhere.
Without them it could be difficult or even impossible in some cases.
It is difficult to point out a specific example now due to the complexity
of Xen itself and the other stuff involved. However, I believe that there
are smart guys in the wild with sufficient time and money to find one
(sooner or later). Why would we want to make their lives easier?
A few extra instructions cost us nothing. For them this could be "to be
or not to be".

> Lipstick on one pig in a whole herd of pigs comes to mind...

I have never stated that this solves all security issues.
However, this way maybe we could avoid some problems in the future.

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-11-06 14:49   ` David Vrabel
  (?)
@ 2013-11-07 20:56   ` Don Slutz
  -1 siblings, 0 replies; 31+ messages in thread
From: Don Slutz @ 2013-11-07 20:56 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> In the existing kexec hypercall, the load and unload ops depend on
> internals of the Linux kernel (the page list and code page provided by
> the kernel).  The code page is used to transition between Xen context
> and the image so using kernel code doesn't make sense and will not
> work for PVH guests.
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> that no longer require a code page to be provided by the guest -- Xen
> now provides the code for calling the image directly.
>
> The new load op looks similar to the Linux kexec_load system call and
> allows the guest to provide the image data to be loaded.  The guest
> specifies the architecture of the image which may be a 32-bit subarch
> of the hypervisor's architecture (i.e., an EM_386 image on an
> EM_X86_64 hypervisor).
>
> The toolstack can now load images without kernel involvement.  This is
> required for supporting kexec when using a dom0 with an upstream
> kernel.
>
> Crash images are copied directly into the crash region on load.
> Default images are copied into domheap pages and a list of source and
> destination machine addresses is created.  This is list is used in
> kexec_reloc() to relocate the image to its destination.
>
> The old load and unload sub-ops are still available (as
> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
> of the new infrastructure.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
>   xen/arch/x86/x86_64/Makefile        |    2 +-
>   xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
>   xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
>   xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
>   xen/common/kimage.c                 |  122 +++++++++++-
>   xen/include/asm-x86/fixmap.h        |    3 -
>   xen/include/asm-x86/machine_kexec.h |   16 ++
>   xen/include/xen/kexec.h             |   16 +-
>   xen/include/xen/kimage.h            |    6 +
>   10 files changed, 804 insertions(+), 336 deletions(-)
>   delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
>   create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
>   create mode 100644 xen/include/asm-x86/machine_kexec.h
>
> diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
> index 68b9705..b70d5a6 100644
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -1,9 +1,18 @@
>   /******************************************************************************
>    * machine_kexec.c
>    *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
>    * Xen port written by:
>    * - Simon 'Horms' Horman <horms@verge.net.au>
>    * - Magnus Damm <magnus@valinux.co.jp>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
>    */
>   
>   #include <xen/types.h>
> @@ -11,63 +20,124 @@
>   #include <xen/guest_access.h>
>   #include <asm/fixmap.h>
>   #include <asm/hpet.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
>   
> -typedef void (*relocate_new_kernel_t)(
> -                unsigned long indirection_page,
> -                unsigned long *page_list,
> -                unsigned long start_address,
> -                unsigned int preserve_context);
> -
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
> +/*
> + * Add a mapping for a page to the page tables used during kexec.
> + */
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr)
>   {
> -    unsigned long prev_ma = 0;
> -    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
> -    int k;
> +    struct page_info *l4_page;
> +    struct page_info *l3_page;
> +    struct page_info *l2_page;
> +    struct page_info *l1_page;
> +    l4_pgentry_t *l4 = NULL;
> +    l3_pgentry_t *l3 = NULL;
> +    l2_pgentry_t *l2 = NULL;
> +    l1_pgentry_t *l1 = NULL;
> +    int ret = -ENOMEM;
> +
> +    l4_page = image->aux_page;
> +    if ( !l4_page )
> +    {
> +        l4_page = kimage_alloc_control_page(image, 0);
> +        if ( !l4_page )
> +            goto out;
> +        image->aux_page = l4_page;
> +    }
>   
> -    /* setup fixmap to point to our pages and record the virtual address
> -     * in every odd index in page_list[].
> -     */
> +    l4 = __map_domain_page(l4_page);
> +    l4 += l4_table_offset(vaddr);
> +    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
> +    {
> +        l3_page = kimage_alloc_control_page(image, 0);
> +        if ( !l3_page )
> +            goto out;
> +        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l3_page = l4e_get_page(*l4);
> +
> +    l3 = __map_domain_page(l3_page);
> +    l3 += l3_table_offset(vaddr);
> +    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
> +    {
> +        l2_page = kimage_alloc_control_page(image, 0);
> +        if ( !l2_page )
> +            goto out;
> +        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l2_page = l3e_get_page(*l3);
> +
> +    l2 = __map_domain_page(l2_page);
> +    l2 += l2_table_offset(vaddr);
> +    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
> +    {
> +        l1_page = kimage_alloc_control_page(image, 0);
> +        if ( !l1_page )
> +            goto out;
> +        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l1_page = l2e_get_page(*l2);
> +
> +    l1 = __map_domain_page(l1_page);
> +    l1 += l1_table_offset(vaddr);
> +    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
> +
> +    ret = 0;
> +out:
> +    if ( l1 )
> +        unmap_domain_page(l1);
> +    if ( l2 )
> +        unmap_domain_page(l2);
> +    if ( l3 )
> +        unmap_domain_page(l3);
> +    if ( l4 )
> +        unmap_domain_page(l4);
> +    return ret;
> +}
>   
> -    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
> +int machine_kexec_load(struct kexec_image *image)
> +{
> +    void *code_page;
> +    int ret;
> +
> +    switch ( image->arch )
>       {
> -        if ( (k & 1) == 0 )
> -        {
> -            /* Even pages: machine address. */
> -            prev_ma = image->page_list[k];
> -        }
> -        else
> -        {
> -            /* Odd pages: va for previous ma. */
> -            if ( is_pv_32on64_domain(dom0) )
> -            {
> -                /*
> -                 * The compatability bounce code sets up a page table
> -                 * with a 1-1 mapping of the first 1G of memory so
> -                 * VA==PA here.
> -                 *
> -                 * This Linux purgatory code still sets up separate
> -                 * high and low mappings on the control page (entries
> -                 * 0 and 1) but it is harmless if they are equal since
> -                 * that PT is not live at the time.
> -                 */
> -                image->page_list[k] = prev_ma;
> -            }
> -            else
> -            {
> -                set_fixmap(fix_base + (k >> 1), prev_ma);
> -                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
> -            }
> -        }
> +    case EM_386:
> +    case EM_X86_64:
> +        break;
> +    default:
> +        return -EINVAL;
>       }
>   
> +    code_page = __map_domain_page(image->control_code_page);
> +    memcpy(code_page, kexec_reloc, kexec_reloc_size);
> +    unmap_domain_page(code_page);
> +
> +    /*
> +     * Add a mapping for the control code page to the same virtual
> +     * address as kexec_reloc.  This allows us to keep running after
> +     * these page tables are loaded in kexec_reloc.
> +     */
> +    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
> +                                 page_to_maddr(image->control_code_page));
> +    if ( ret < 0 )
> +        return ret;
> +
>       return 0;
>   }
>   
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
> +void machine_kexec_unload(struct kexec_image *image)
>   {
> +    /* no-op. kimage_free() frees all control pages. */
>   }
>   
> -void machine_reboot_kexec(xen_kexec_image_t *image)
> +void machine_reboot_kexec(struct kexec_image *image)
>   {
>       BUG_ON(smp_processor_id() != 0);
>       smp_send_stop();
> @@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
>       BUG();
>   }
>   
> -void machine_kexec(xen_kexec_image_t *image)
> +void machine_kexec(struct kexec_image *image)
>   {
> -    struct desc_ptr gdt_desc = {
> -        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
> -        .limit = LAST_RESERVED_GDT_BYTE
> -    };
>       int i;
> +    unsigned long reloc_flags = 0;
>   
>       /* We are about to permenantly jump out of the Xen context into the kexec
>        * purgatory code.  We really dont want to be still servicing interupts.
> @@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
>        * not like running with NMIs disabled. */
>       enable_nmis();
>   
> -    /*
> -     * compat_machine_kexec() returns to idle pagetables, which requires us
> -     * to be running on a static GDT mapping (idle pagetables have no GDT
> -     * mappings in their per-domain mapping area).
> -     */
> -    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
> +    if ( image->arch == EM_386 )
> +        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
>   
> -    if ( is_pv_32on64_domain(dom0) )
> -    {
> -        compat_machine_kexec(image->page_list[1],
> -                             image->indirection_page,
> -                             image->page_list,
> -                             image->start_address);
> -    }
> -    else
> -    {
> -        relocate_new_kernel_t rnk;
> -
> -        rnk = (relocate_new_kernel_t) image->page_list[1];
> -        (*rnk)(image->indirection_page, image->page_list,
> -               image->start_address,
> -               0 /* preserve_context */);
> -    }
> +    kexec_reloc(page_to_maddr(image->control_code_page),
> +                page_to_maddr(image->aux_page),
> +                image->head, image->entry_maddr, reloc_flags);
>   }
>   
>   int machine_kexec_get(xen_kexec_range_t *range)
> diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
> index d56e12d..7f8fb3d 100644
> --- a/xen/arch/x86/x86_64/Makefile
> +++ b/xen/arch/x86/x86_64/Makefile
> @@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
>   obj-y += mmconfig_64.o
>   obj-y += mmconfig-shared.o
>   obj-y += compat.o
> -obj-bin-y += compat_kexec.o
>   obj-y += domain.o
>   obj-y += physdev.o
>   obj-y += platform_hypercall.o
>   obj-y += cpu_idle.o
>   obj-y += cpufreq.o
> +obj-bin-y += kexec_reloc.o
>   
>   obj-$(crash_debug)   += gdbstub.o
> diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
> deleted file mode 100644
> index fc92af9..0000000
> --- a/xen/arch/x86/x86_64/compat_kexec.S
> +++ /dev/null
> @@ -1,187 +0,0 @@
> -/*
> - * Compatibility kexec handler.
> - */
> -
> -/*
> - * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
> - * currently true but if it ever changes then compat_pg_table will
> - * need to be moved back below 4G at run time.
> - */
> -
> -#include <xen/config.h>
> -
> -#include <asm/asm_defns.h>
> -#include <asm/msr.h>
> -#include <asm/page.h>
> -
> -/* The unrelocated physical address of a symbol. */
> -#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
> -
> -/* Load physical address of symbol into register and relocate it. */
> -#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
> -                               add xen_phys_start(%rip), reg
> -
> -/*
> - * Relocate a physical address in memory. Size of temporary register
> - * determines size of the value to relocate.
> - */
> -#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
> -                               add xen_phys_start(%rip), reg ; \
> -                               mov reg, addr(%rip)
> -
> -        .text
> -
> -        .code64
> -
> -ENTRY(compat_machine_kexec)
> -        /* x86/64                        x86/32  */
> -        /* %rdi - relocate_new_kernel_t  CALL    */
> -        /* %rsi - indirection page       4(%esp) */
> -        /* %rdx - page_list              8(%esp) */
> -        /* %rcx - start address         12(%esp) */
> -        /*        cpu has pae           16(%esp) */
> -
> -        /* Shim the 64 bit page_list into a 32 bit page_list. */
> -        mov $12,%r9
> -        lea compat_page_list(%rip), %rbx
> -1:      dec %r9
> -        movl (%rdx,%r9,8),%eax
> -        movl %eax,(%rbx,%r9,4)
> -        test %r9,%r9
> -        jnz 1b
> -
> -        RELOCATE_SYM(compat_page_list,%rdx)
> -
> -        /* Relocate compatibility mode entry point address. */
> -        RELOCATE_MEM(compatibility_mode_far,%eax)
> -
> -        /* Relocate compat_pg_table. */
> -        RELOCATE_MEM(compat_pg_table,     %rax)
> -        RELOCATE_MEM(compat_pg_table+0x8, %rax)
> -        RELOCATE_MEM(compat_pg_table+0x10,%rax)
> -        RELOCATE_MEM(compat_pg_table+0x18,%rax)
> -
> -        /*
> -         * Setup an identity mapped region in PML4[0] of idle page
> -         * table.
> -         */
> -        RELOCATE_SYM(l3_identmap,%rax)
> -        or  $0x63,%rax
> -        mov %rax, idle_pg_table(%rip)
> -
> -        /* Switch to idle page table. */
> -        RELOCATE_SYM(idle_pg_table,%rax)
> -        movq %rax, %cr3
> -
> -        /* Switch to identity mapped compatibility stack. */
> -        RELOCATE_SYM(compat_stack,%rax)
> -        movq %rax, %rsp
> -
> -        /* Save xen_phys_start for 32 bit code. */
> -        movq xen_phys_start(%rip), %rbx
> -
> -        /* Jump to low identity mapping in compatibility mode. */
> -        ljmp *compatibility_mode_far(%rip)
> -        ud2
> -
> -compatibility_mode_far:
> -        .long SYM_PHYS(compatibility_mode)
> -        .long __HYPERVISOR_CS32
> -
> -        /*
> -         * We use 5 words of stack for the arguments passed to the kernel. The
> -         * kernel only uses 1 word before switching to its own stack. Allocate
> -         * 16 words to give "plenty" of room.
> -         */
> -        .fill 16,4,0
> -compat_stack:
> -
> -        .code32
> -
> -#undef RELOCATE_SYM
> -#undef RELOCATE_MEM
> -
> -/*
> - * Load physical address of symbol into register and relocate it. %rbx
> - * contains xen_phys_start(%rip) saved before jump to compatibility
> - * mode.
> - */
> -#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
> -                              add %ebx, reg
> -
> -compatibility_mode:
> -        /* Setup some sane segments. */
> -        movl $__HYPERVISOR_DS32, %eax
> -        movl %eax, %ds
> -        movl %eax, %es
> -        movl %eax, %fs
> -        movl %eax, %gs
> -        movl %eax, %ss
> -
> -        /* Push arguments onto stack. */
> -        pushl $0   /* 20(%esp) - preserve context */
> -        pushl $1   /* 16(%esp) - cpu has pae */
> -        pushl %ecx /* 12(%esp) - start address */
> -        pushl %edx /*  8(%esp) - page list */
> -        pushl %esi /*  4(%esp) - indirection page */
> -        pushl %edi /*  0(%esp) - CALL */
> -
> -        /* Disable paging and therefore leave 64 bit mode. */
> -        movl %cr0, %eax
> -        andl $~X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -
> -        /* Switch to 32 bit page table. */
> -        RELOCATE_SYM(compat_pg_table, %eax)
> -        movl  %eax, %cr3
> -
> -        /* Clear MSR_EFER[LME], disabling long mode */
> -        movl    $MSR_EFER,%ecx
> -        rdmsr
> -        btcl    $_EFER_LME,%eax
> -        wrmsr
> -
> -        /* Re-enable paging, but only 32 bit mode now. */
> -        movl %cr0, %eax
> -        orl $X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -        jmp 1f
> -1:
> -
> -        popl %eax
> -        call *%eax
> -        ud2
> -
> -        .data
> -        .align 4
> -compat_page_list:
> -        .fill 12,4,0
> -
> -        .align 32,0
> -
> -        /*
> -         * These compat page tables contain an identity mapping of the
> -         * first 4G of the physical address space.
> -         */
> -compat_pg_table:
> -        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
> -
> -        .section .data.page_aligned, "aw", @progbits
> -        .align PAGE_SIZE,0
> -compat_pg_table_l2:
> -        .macro identmap from=0, count=512
> -        .if \count-1
> -        identmap "(\from+0)","(\count/2)"
> -        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
> -        .else
> -        .quad 0x00000000000000e3 + \from
> -        .endif
> -        .endm
> -
> -        identmap 0x00000000
> -        identmap 0x40000000
> -        identmap 0x80000000
> -        identmap 0xc0000000
> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> new file mode 100644
> index 0000000..7a16c85
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> @@ -0,0 +1,198 @@
> +/*
> + * Relocate a kexec_image to its destination and call it.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <xen/config.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/asm_defns.h>
> +#include <asm/msr.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
> +
> +        .text
> +        .align PAGE_SIZE
> +        .code64
> +
> +ENTRY(kexec_reloc)
> +        /* %rdi - code page maddr */
> +        /* %rsi - page table maddr */
> +        /* %rdx - indirection page maddr */
> +        /* %rcx - entry maddr (%rbp) */
> +        /* %r8 - flags */
> +
> +        movq    %rcx, %rbp
> +
> +        /* Setup stack. */
> +        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
> +
> +        /* Load reloc page table. */
> +        movq    %rsi, %cr3
> +
> +        /* Jump to identity mapped code. */
> +        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
> +        jmpq    *%rax
> +
> +identity_mapped:
> +        /*
> +         * Set cr0 to a known state:
> +         *  - Paging enabled
> +         *  - Alignment check disabled
> +         *  - Write protect disabled
> +         *  - No task switch
> +         *  - Don't do FP software emulation.
> +         *  - Protected mode enabled
> +         */
> +        movq    %cr0, %rax
> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
> +        movq    %rax, %cr0
> +
> +        /*
> +         * Set cr4 to a known state:
> +         *  - physical address extension enabled
> +         */
> +        movl    $X86_CR4_PAE, %eax
> +        movq    %rax, %cr4
> +
> +        movq    %rdx, %rdi
> +        call    relocate_pages
> +
> +        /* Need to switch to 32-bit mode? */
> +        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
> +        jnz     call_32_bit
> +
> +call_64_bit:
> +        /* Call the image entry point.  This should never return. */
> +        callq   *%rbp
> +        ud2
> +
> +call_32_bit:
> +        /* Setup IDT. */
> +        lidt    compat_mode_idt(%rip)
> +
> +        /* Load compat GDT. */
> +        leaq    compat_mode_gdt(%rip), %rax
> +        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
> +        lgdt    compat_mode_gdt_desc(%rip)
> +
> +        /* Relocate compatibility mode entry point address. */
> +        leal    compatibility_mode(%rip), %eax
> +        movl    %eax, compatibility_mode_far(%rip)
> +
> +        /* Enter compatibility mode. */
> +        ljmp    *compatibility_mode_far(%rip)
> +
> +relocate_pages:
> +        /* %rdi - indirection page maddr */
> +        pushq   %rbx
> +
> +        cld
> +        movq    %rdi, %rbx
> +        xorl    %edi, %edi
> +        xorl    %esi, %esi
> +
> +next_entry: /* top, read another word for the indirection page */
> +
> +        movq    (%rbx), %rcx
> +        addq    $8, %rbx
> +is_dest:
> +        testb   $IND_DESTINATION, %cl
> +        jz      is_ind
> +        movq    %rcx, %rdi
> +        andq    $PAGE_MASK, %rdi
> +        jmp     next_entry
> +is_ind:
> +        testb   $IND_INDIRECTION, %cl
> +        jz      is_done
> +        movq    %rcx, %rbx
> +        andq    $PAGE_MASK, %rbx
> +        jmp     next_entry
> +is_done:
> +        testb   $IND_DONE, %cl
> +        jnz     done
> +is_source:
> +        testb   $IND_SOURCE, %cl
> +        jz      is_zero
> +        movq    %rcx, %rsi      /* For every source page do a copy */
> +        andq    $PAGE_MASK, %rsi
> +        movl    $(PAGE_SIZE / 8), %ecx
> +        rep movsq
> +        jmp     next_entry
> +is_zero:
> +        testb   $IND_ZERO, %cl
> +        jz      next_entry
> +        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
> +        xorl    %eax, %eax
> +        rep stosq
> +        jmp     next_entry
> +done:
> +        popq    %rbx
> +        ret
> +
> +        .code32
> +
> +compatibility_mode:
> +        /* Setup some sane segments. */
> +        movl    $0x0008, %eax
> +        movl    %eax, %ds
> +        movl    %eax, %es
> +        movl    %eax, %fs
> +        movl    %eax, %gs
> +        movl    %eax, %ss
> +
> +        /* Disable paging and therefore leave 64 bit mode. */
> +        movl    %cr0, %eax
> +        andl    $~X86_CR0_PG, %eax
> +        movl    %eax, %cr0
> +
> +        /* Disable long mode */
> +        movl    $MSR_EFER, %ecx
> +        rdmsr
> +        andl    $~EFER_LME, %eax
> +        wrmsr
> +
> +        /* Clear cr4 to disable PAE. */
> +        xorl    %eax, %eax
> +        movl    %eax, %cr4
> +
> +        /* Call the image entry point.  This should never return. */
> +        call    *%ebp
> +        ud2
> +
> +        .align 4
> +compatibility_mode_far:
> +        .long 0x00000000             /* set in call_32_bit above */
> +        .word 0x0010
> +
> +compat_mode_gdt_desc:
> +        .word (3*8)-1
> +        .quad 0x0000000000000000     /* set in call_32_bit above */
> +
> +        .align 8
> +compat_mode_gdt:
> +        .quad 0x0000000000000000     /* null                              */
> +        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
> +        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
> +
> +compat_mode_idt:
> +        .word 0                      /* limit */
> +        .long 0                      /* base */
> +
> +        /*
> +         * 16 words of stack are more than enough.
> +         */
> +        .fill 16,8,0
> +reloc_stack:
> +
> +        .globl kexec_reloc_size
> +kexec_reloc_size:
> +        .long . - kexec_reloc
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index 7b23df0..c5450ba 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -25,6 +25,7 @@
>   #include <xen/version.h>
>   #include <xen/console.h>
>   #include <xen/kexec.h>
> +#include <xen/kimage.h>
>   #include <public/elfnote.h>
>   #include <xsm/xsm.h>
>   #include <xen/cpu.h>
> @@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
>   
>   static cpumask_t crash_saved_cpus;
>   
> -static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
> +static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
>   
>   #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
>   #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
> @@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
>   
>   static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
>   
> -static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
> -
>   static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
>   static size_t vmcoreinfo_size = 0;
>   
> @@ -311,14 +310,14 @@ void kexec_crash(void)
>       kexec_common_shutdown();
>       kexec_crash_save_cpu();
>       machine_crash_shutdown();
> -    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
> +    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
>   
>       BUG();
>   }
>   
>   static long kexec_reboot(void *_image)
>   {
> -    xen_kexec_image_t *image = _image;
> +    struct kexec_image *image = _image;
>   
>       kexecing = TRUE;
>   
> @@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
>   #endif
>   }
>   
> -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
> +static void kexec_unload_image(struct kexec_image *image)
>   {
> -    xen_kexec_image_t *image;
> +    if ( !image )
> +        return;
> +
> +    machine_kexec_unload(image);
> +    kimage_free(image);
> +}
> +
> +static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_exec_t exec;
> +    struct kexec_image *image;
> +    int base, bit, pos, ret = -EINVAL;
> +
> +    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +        return -EFAULT;
> +
> +    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +        return -EINVAL;
> +
> +    pos = (test_bit(bit, &kexec_flags) != 0);
> +
> +    /* Only allow kexec/kdump into loaded images */
> +    if ( !test_bit(base + pos, &kexec_flags) )
> +        return -ENOENT;
> +
> +    switch (exec.type)
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        image = kexec_image[base + pos];
> +        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        kexec_crash(); /* Does not return */
> +        break;
> +    }
> +
> +    return -EINVAL; /* never reached */
> +}
> +
> +static int kexec_swap_images(int type, struct kexec_image *new,
> +                             struct kexec_image **old)
> +{
> +    static DEFINE_SPINLOCK(kexec_lock);
>       int base, bit, pos;
> -    int ret = 0;
> +    int new_slot, old_slot;
> +
> +    *old = NULL;
> +
> +    spin_lock(&kexec_lock);
> +
> +    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    {
> +        spin_unlock(&kexec_lock);
> +        return -EBUSY;
> +    }
>   
> -    if ( kexec_load_get_bits(load->type, &base, &bit) )
> +    if ( kexec_load_get_bits(type, &base, &bit) )
>           return -EINVAL;
>   
>       pos = (test_bit(bit, &kexec_flags) != 0);
> +    old_slot = base + pos;
> +    new_slot = base + !pos;
>   
> -    /* Load the user data into an unused image */
> -    if ( op == KEXEC_CMD_kexec_load )
> +    if ( new )
>       {
> -        image = &kexec_image[base + !pos];
> +        kexec_image[new_slot] = new;
> +        set_bit(new_slot, &kexec_flags);
> +    }
> +    change_bit(bit, &kexec_flags);
>   
> -        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
> +    clear_bit(old_slot, &kexec_flags);
> +    *old = kexec_image[old_slot];
>   
> -        memcpy(image, &load->image, sizeof(*image));
> +    spin_unlock(&kexec_lock);
>   
> -        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
> -        {
> -            /* Set image present bit */
> -            set_bit((base + !pos), &kexec_flags);
> +    return 0;
> +}
>   
> -            /* Make new image the active one */
> -            change_bit(bit, &kexec_flags);
> -        }
> +static int kexec_load_slot(struct kexec_image *kimage)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret = -ENOMEM;
> +
> +    ret = machine_kexec_load(kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    crash_save_vmcoreinfo();
> +
> +    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static uint16_t kexec_load_v1_arch(void)
> +{
> +#ifdef CONFIG_X86
> +    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
> +#else
> +    return EM_NONE;
> +#endif
> +}
>   
> -        crash_save_vmcoreinfo();
> +static int kexec_segments_add_segment(
> +    unsigned int *nr_segments, xen_kexec_segment_t *segments,
> +    unsigned long mfn)
> +{
> +    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
> +    unsigned int n = *nr_segments;
> +
> +    /* Need a new segment? */
> +    if ( n == 0
> +         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
> +    {
> +        n++;
> +        if ( n > KEXEC_SEGMENT_MAX )
> +            return -EINVAL;
> +        *nr_segments = n;
> +
> +        set_xen_guest_handle(segments[n-1].buf.h, NULL);
> +        segments[n-1].buf_size = 0;
> +        segments[n-1].dest_maddr = maddr;
> +        segments[n-1].dest_size = 0;
>       }
>   
> -    /* Unload the old image if present and load successful */
> -    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    return 0;
> +}
> +
> +static int kexec_segments_from_ind_page(unsigned long mfn,
> +                                        unsigned int *nr_segments,
> +                                        xen_kexec_segment_t *segments,
> +                                        bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +
> +    page = map_domain_page(mfn);
> +
> +    /*
> +     * Walk the indirection page list, adding destination pages to the
> +     * segments.
> +     */
> +    for ( entry = page; ; )
>       {
> -        if ( test_and_clear_bit((base + pos), &kexec_flags) )
> +        unsigned long ind;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
>           {
> -            image = &kexec_image[base + pos];
> -            machine_kexec_unload(load->type, base + pos, image);
> +        case IND_DESTINATION:
> +            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            entry = page = map_domain_page(mfn);
> +            continue;
> +        case IND_DONE:
> +            goto done;
> +        case IND_SOURCE:
> +            if ( *nr_segments == 0 )
> +            {
> +                ret = -EINVAL;
> +                goto done;
> +            }
> +            segments[*nr_segments-1].dest_size += PAGE_SIZE;
> +            break;
> +        default:
> +            ret = -EINVAL;
> +            goto done;
>           }
> +        entry = kimage_entry_next(entry, compat);
>       }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
>   
> +static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
> +{
> +    struct kexec_image *kimage = NULL;
> +    xen_kexec_segment_t *segments;
> +    uint16_t arch;
> +    unsigned int nr_segments = 0;
> +    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
> +    int ret;
> +
> +    arch = kexec_load_v1_arch();
> +    if ( arch == EM_NONE )
> +        return -ENOSYS;
> +
> +    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
> +    if ( segments == NULL )
> +        return -ENOMEM;
> +
> +    /*
> +     * Work out the image segments (destination only) from the
> +     * indirection pages.
> +     *
> +     * This is needed so we don't allocate pages that will overlap
> +     * with the destination when building the new set of indirection
> +     * pages below.
> +     */
> +    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
> +                       nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    /*
> +     * Build a new set of indirection pages in the native format.
> +     *
> +     * This walks the guest provided indirection pages a second time.
> +     * The guest could have altered them, invalidating the segment
> +     * information constructed above.  This will only result in the
> +     * resulting image being potentially unrelocatable.
> +     */
> +    ret = kimage_build_ind(kimage, ind_mfn, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( !kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
>       return ret;
>   }
>   
> -static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>       xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&load, uarg, 1)) )
>           return -EFAULT;
>   
> -    return kexec_load_unload_internal(op, &load);
> +    return kexec_do_load_v1(&load, 0);
>   }
>   
> -static int kexec_load_unload_compat(unsigned long op,
> -                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>   #ifdef CONFIG_COMPAT
>       compat_kexec_load_v1_t compat_load;
> @@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
>       load.type = compat_load.type;
>       XLAT_kexec_image(&load.image, &compat_load.image);
>   
> -    return kexec_load_unload_internal(op, &load);
> -#else /* CONFIG_COMPAT */
> +    return kexec_do_load_v1(&load, 1);
> +#else
>       return 0;
> -#endif /* CONFIG_COMPAT */
> +#endif
>   }
>   
> -static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
> -    xen_kexec_exec_t exec;
> -    xen_kexec_image_t *image;
> -    int base, bit, pos, ret = -EINVAL;
> +    xen_kexec_load_t load;
> +    xen_kexec_segment_t *segments;
> +    struct kexec_image *kimage = NULL;
> +    int ret;
>   
> -    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +    if ( copy_from_guest(&load, uarg, 1) )
>           return -EFAULT;
>   
> -    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
>           return -EINVAL;
>   
> -    pos = (test_bit(bit, &kexec_flags) != 0);
> -
> -    /* Only allow kexec/kdump into loaded images */
> -    if ( !test_bit(base + pos, &kexec_flags) )
> -        return -ENOENT;
> +    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
> +    if ( segments == NULL )
> +        return -ENOMEM;
>   
> -    switch (exec.type)
> +    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
>       {
> -    case KEXEC_TYPE_DEFAULT:
> -        image = &kexec_image[base + pos];
> -        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> -        break;
> -    case KEXEC_TYPE_CRASH:
> -        kexec_crash(); /* Does not return */
> -        break;
> +        ret = -EFAULT;
> +        goto error;
>       }
>   
> -    return -EINVAL; /* never reached */
> +    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
> +                       load.nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_load_segments(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( ! kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
> +    return ret;
> +}
> +
> +static int kexec_do_unload(xen_kexec_unload_t *unload)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret;
> +
> +    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_load_v1_t load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = load.type;
> +    return kexec_do_unload(&unload);
> +}
> +
> +static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +#ifdef CONFIG_COMPAT
> +    compat_kexec_load_v1_t compat_load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&compat_load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = compat_load.type;
> +    return kexec_do_unload(&unload);
> +#else
> +    return 0;
> +#endif
> +}
> +
> +static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_unload_t unload;
> +
> +    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
> +        return -EFAULT;
> +
> +    return kexec_do_unload(&unload);
>   }
>   
>   static int do_kexec_op_internal(unsigned long op,
>                                   XEN_GUEST_HANDLE_PARAM(void) uarg,
>                                   bool_t compat)
>   {
> -    unsigned long flags;
>       int ret = -EINVAL;
>   
>       ret = xsm_kexec(XSM_PRIV);
> @@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
>                   ret = kexec_get_range(uarg);
>           break;
>       case KEXEC_CMD_kexec_load_v1:
> +        if ( compat )
> +            ret = kexec_load_v1_compat(uarg);
> +        else
> +            ret = kexec_load_v1(uarg);
> +        break;
>       case KEXEC_CMD_kexec_unload_v1:
> -        spin_lock_irqsave(&kexec_lock, flags);
> -        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
> -        {
> -                if (compat)
> -                        ret = kexec_load_unload_compat(op, uarg);
> -                else
> -                        ret = kexec_load_unload(op, uarg);
> -        }
> -        spin_unlock_irqrestore(&kexec_lock, flags);
> +        if ( compat )
> +            ret = kexec_unload_v1_compat(uarg);
> +        else
> +            ret = kexec_unload_v1(uarg);
>           break;
>       case KEXEC_CMD_kexec:
>           ret = kexec_exec(uarg);
>           break;
> +    case KEXEC_CMD_kexec_load:
> +        ret = kexec_load(uarg);
> +        break;
> +    case KEXEC_CMD_kexec_unload:
> +        ret = kexec_unload(uarg);
> +        break;
>       }
>   
>       return ret;
> diff --git a/xen/common/kimage.c b/xen/common/kimage.c
> index 02ee37e..10fb785 100644
> --- a/xen/common/kimage.c
> +++ b/xen/common/kimage.c
> @@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
>       image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
>       if ( !image->control_code_page )
>           goto out;
> +    result = machine_kexec_add_page(image,
> +                                    page_to_maddr(image->control_code_page),
> +                                    page_to_maddr(image->control_code_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       /* Add an empty indirection page. */
>       image->entry_page = kimage_alloc_control_page(image, 0);
>       if ( !image->entry_page )
>           goto out;
> +    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
> +                                    page_to_maddr(image->entry_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       image->head = page_to_maddr(image->entry_page);
>   
> @@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>           if ( addr == destination )
>           {
>               page_list_del(page, &image->dest_pages);
> -            return page;
> +            goto found;
>           }
>       }
>       page = NULL;
> @@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>               page_list_add(page, &image->dest_pages);
>           }
>       }
> +found:
> +    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
>       return page;
>   }
>   
> @@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
>   static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
>   {
>       int result = -ENOMEM;
> +    paddr_t addr;
>   
>       if ( !guest_handle_is_null(segment->buf.h) )
>       {
> @@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
>           }
>       }
>   
> +    for ( addr = segment->dest_maddr & PAGE_MASK;
> +          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
> +    {
> +        result = machine_kexec_add_page(image, addr, addr);
> +        if ( result < 0 )
> +            break;
> +    }
> +
>       return result;
>   }
>   
> @@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
>       return 0;
>   }
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return (kimage_entry_t *)((uint32_t *)entry + 1);
> +    return entry + 1;
> +}
> +
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry >> PAGE_SHIFT;
> +    return *entry >> PAGE_SHIFT;
> +}
> +
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry & 0xf;
> +    return *entry & 0xf;
> +}
> +
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +    paddr_t dest = KIMAGE_NO_DEST;
> +
> +    page = map_domain_page(ind_mfn);
> +    if ( !page )
> +        return -ENOMEM;
> +
> +    /*
> +     * Walk the guest-supplied indirection pages, adding entries to
> +     * the image's indirection pages.
> +     */
> +    for ( entry = page; ;  )
> +    {
> +        unsigned long ind;
> +        unsigned long mfn;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
> +        {
> +        case IND_DESTINATION:
> +            dest = (paddr_t)mfn << PAGE_SHIFT;
> +            ret = kimage_set_destination(image, dest);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            page = map_domain_page(mfn);
> +            entry = page;
> +            continue;
> +        case IND_DONE:
> +            kimage_terminate(image);
> +            goto done;
> +        case IND_SOURCE:
> +        {
> +            struct page_info *guest_page, *xen_page;
> +
> +            guest_page = mfn_to_page(mfn);
> +            if ( !get_page(guest_page, current->domain) )
> +            {
> +                ret = -EFAULT;
> +                goto done;
> +            }
> +
> +            xen_page = kimage_alloc_page(image, dest);
> +            if ( !xen_page )
> +            {
> +                put_page(guest_page);
> +                ret = -ENOMEM;
> +                goto done;
> +            }
> +
> +            copy_domain_page(page_to_mfn(xen_page), mfn);
> +            put_page(guest_page);
> +
> +            ret = kimage_add_page(image, page_to_maddr(xen_page));
> +            if ( ret < 0 )
> +                goto done;
> +            dest += PAGE_SIZE;
> +            break;
> +        }
> +        default:
> +            ret = -EINVAL;
> +            goto done;
> +        }
> +        entry = kimage_entry_next(entry, compat);
> +    }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
> +
>   /*
>    * Local variables:
>    * mode: C
> diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
> index 8b4266d..48c5676 100644
> --- a/xen/include/asm-x86/fixmap.h
> +++ b/xen/include/asm-x86/fixmap.h
> @@ -56,9 +56,6 @@ enum fixed_addresses {
>       FIX_ACPI_BEGIN,
>       FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
>       FIX_HPET_BASE,
> -    FIX_KEXEC_BASE_0,
> -    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
> -      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
>       FIX_TBOOT_SHARED_BASE,
>       FIX_MSIX_IO_RESERV_BASE,
>       FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
> diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
> new file mode 100644
> index 0000000..ba0d469
> --- /dev/null
> +++ b/xen/include/asm-x86/machine_kexec.h
> @@ -0,0 +1,16 @@
> +#ifndef __X86_MACHINE_KEXEC_H__
> +#define __X86_MACHINE_KEXEC_H__
> +
> +#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
> +
> +#ifndef __ASSEMBLY__
> +
> +extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
> +                        unsigned long ind_maddr, unsigned long entry_maddr,
> +                        unsigned long flags);
> +
> +extern unsigned int kexec_reloc_size;
> +
> +#endif
> +
> +#endif /* __X86_MACHINE_KEXEC_H__ */
> diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
> index 1a5dda1..bd17747 100644
> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -6,6 +6,7 @@
>   #include <public/kexec.h>
>   #include <asm/percpu.h>
>   #include <xen/elfcore.h>
> +#include <xen/kimage.h>
>   
>   typedef struct xen_kexec_reserve {
>       unsigned long size;
> @@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
>   extern paddr_t crashinfo_maxaddr_bits;
>   void kexec_early_calculations(void);
>   
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr);
> +int machine_kexec_load(struct kexec_image *image);
> +void machine_kexec_unload(struct kexec_image *image);
>   void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
> -void machine_reboot_kexec(xen_kexec_image_t *image);
> -void machine_kexec(xen_kexec_image_t *image);
> +void machine_reboot_kexec(struct kexec_image *image);
> +void machine_kexec(struct kexec_image *image);
>   void kexec_crash(void);
>   void kexec_crash_save_cpu(void);
>   crash_xen_info_t *kexec_crash_save_info(void);
> @@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
>   int machine_kexec_get(xen_kexec_range_t *range);
>   int machine_kexec_get_xen(xen_kexec_range_t *range);
>   
> -void compat_machine_kexec(unsigned long rnk,
> -                          unsigned long indirection_page,
> -                          unsigned long *page_list,
> -                          unsigned long start_address);
> -
>   /* vmcoreinfo stuff */
>   #define VMCOREINFO_BYTES           (4096)
>   #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
> diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
> index 0ebd37a..d10ebf7 100644
> --- a/xen/include/xen/kimage.h
> +++ b/xen/include/xen/kimage.h
> @@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
>   struct page_info *kimage_alloc_control_page(struct kexec_image *image,
>                                               unsigned memflags);
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat);
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* __XEN_KIMAGE_H__ */

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-11-06 14:49 [PATCHv10 " David Vrabel
@ 2013-11-06 14:49   ` David Vrabel
  0 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image, so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.
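
As a purely illustrative sketch (not part of this patch), unloading
whichever default image is currently loaded now needs nothing more
than the new unload structure; how the kexec_op hypercall is actually
issued is left to the caller:

    xen_kexec_unload_t unload = { .type = KEXEC_TYPE_DEFAULT };
    /* Issue KEXEC_CMD_kexec_unload with &unload via the kexec_op
       hypercall. */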

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image, which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).
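
To make the shape of the new ABI concrete, here is a minimal, purely
illustrative sketch describing a single-segment 64-bit image.
image_data, image_size and the destination address are placeholders,
as is the final hypercall invocation:

    xen_kexec_segment_t seg;
    xen_kexec_load_t load;

    set_xen_guest_handle(seg.buf.h, image_data); /* data to copy in */
    seg.buf_size   = image_size;
    seg.dest_maddr = 0x1000000;    /* page-aligned destination */
    seg.dest_size  = image_size;   /* size reserved at the destination */

    load.type        = KEXEC_TYPE_DEFAULT;
    load.arch        = EM_X86_64;  /* EM_386 for a 32-bit image */
    load.entry_maddr = 0x1000000;
    load.nr_segments = 1;
    set_xen_guest_handle(load.segments.h, &seg);

    /* Issue KEXEC_CMD_kexec_load with &load via the kexec_op
       hypercall. */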

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
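
The entries in that list use the indirection format handled by
kimage_build_ind() and kexec_reloc.S below: each word is a
page-aligned machine address with an IND_* flag in its low bits.  A
rough C rendering of the relocate_pages loop in kexec_reloc.S, for
illustration only (the assembler below is the authoritative version):

    static void relocate_pages(kimage_entry_t *entry)
    {
        unsigned char *dest = NULL;

        for ( ; ; entry++ )
        {
            unsigned long e = *entry;
            /* Everything is identity mapped while this code runs. */
            unsigned char *page = (unsigned char *)(e & PAGE_MASK);

            if ( e & IND_DESTINATION )
                dest = page;               /* subsequent copies land here */
            else if ( e & IND_INDIRECTION )
                entry = (kimage_entry_t *)page - 1; /* chain to next list */
            else if ( e & IND_DONE )
                return;
            else if ( e & IND_SOURCE )
            {
                memcpy(dest, page, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
            else if ( e & IND_ZERO )
            {
                memset(dest, 0, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
        }
    }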

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
implemented on top of the new infrastructure.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
 xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  122 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 804 insertions(+), 336 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..7a16c85
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,198 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr (%rbp) */
+        /* %r8 - flags */
+
+        movq    %rcx, %rbp
+
+        /* Setup stack. */
+        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq    %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq    *%rax
+
+identity_mapped:
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq    %rdx, %rdi
+        call    relocate_pages
+
+        /* Need to switch to 32-bit mode? */
+        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz     call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq   *%rbp
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt    compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq    compat_mode_gdt(%rip), %rax
+        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt    compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal    compatibility_mode(%rip), %eax
+        movl    %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp    *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        pushq   %rbx
+
+        cld
+        movq    %rdi, %rbx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        popq    %rbx
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl    $0x0008, %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %fs
+        movl    %eax, %gs
+        movl    %eax, %ss
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl    %cr0, %eax
+        andl    $~X86_CR0_PG, %eax
+        movl    %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call    *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..c5450ba 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
 
 static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
 
-static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
-
 static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
 static size_t vmcoreinfo_size = 0;
 
@@ -311,14 +310,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
 {
-    xen_kexec_image_t *image;
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
+{
+    static DEFINE_SPINLOCK(kexec_lock);
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
 
-        crash_save_vmcoreinfo();
+static int kexec_segments_add_segment(
+    unsigned int *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    unsigned int n = *nr_segments;
+
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned int *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            entry = page = map_domain_page(mfn);
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            if ( *nr_segments == 0 )
+            {
+                ret = -EINVAL;
+                goto done;
+            }
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
 
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned int nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  At worst, this only makes the
+     * resulting image unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 02ee37e..10fb785 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        goto out;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        goto out;
 
     image->head = page_to_maddr(image->entry_page);
 
@@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
@ 2013-11-06 14:49   ` David Vrabel
  0 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between the Xen
context and the image, so using kernel code doesn't make sense and
will not work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).
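
As a rough illustration, a single-segment load with the new op might be
set up as below.  This is only a sketch: the field names mirror what
kexec_load() in this patch reads (type, arch, entry_maddr, nr_segments,
segments.h, and the per-segment buf/dest fields), but the public ABI
layout itself comes from patch 2/9, and image_buf, the addresses and
the HYPERVISOR_kexec_op() wrapper are placeholders for whatever the
caller actually uses.

    /*
     * Sketch only: a guest loading one segment with the new op.  The
     * values for dest/entry and the hypercall wrapper are assumptions,
     * not part of this patch.
     */
    static int load_one_segment(void *image_buf, size_t image_size,
                                uint64_t dest, uint64_t entry)
    {
        xen_kexec_segment_t seg;
        xen_kexec_load_t load;

        memset(&seg, 0, sizeof(seg));
        memset(&load, 0, sizeof(load));

        set_xen_guest_handle(seg.buf.h, image_buf); /* image data in the guest */
        seg.buf_size   = image_size;
        seg.dest_maddr = dest;                      /* machine address to load at */
        seg.dest_size  = (image_size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

        load.type        = KEXEC_TYPE_DEFAULT;      /* or KEXEC_TYPE_CRASH */
        load.arch        = EM_X86_64;               /* EM_386 for a 32-bit image */
        load.entry_maddr = entry;
        load.nr_segments = 1;
        set_xen_guest_handle(load.segments.h, &seg);

        return HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load, &load);
    }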

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
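
The list uses the kimage indirection-entry format walked by
relocate_pages in kexec_reloc.S: each entry carries an IND_* flag in
its low bits and a page address in the rest, as decoded by
kimage_entry_ind()/kimage_entry_mfn() below.  Purely as an illustration
of that walk (the real copy is done by the assembly in this patch), a C
rendering might look like this, with maddr_to_virt() standing in for
the identity mappings the relocation page tables provide:

    /*
     * Illustration only: C equivalent of relocate_pages() in
     * kexec_reloc.S.  memcpy()/memset() stand in for rep movsq/stosq,
     * and maddr_to_virt() for the mappings added via
     * machine_kexec_add_page().
     */
    static void relocate_pages_c(kimage_entry_t *entry)
    {
        char *dest = NULL;

        for ( ; ; entry++ )
        {
            unsigned long e = *entry;
            void *page = maddr_to_virt(e & PAGE_MASK);

            if ( e & IND_DESTINATION )      /* subsequent copies go here */
                dest = page;
            else if ( e & IND_INDIRECTION ) /* continue in another entry page */
                entry = (kimage_entry_t *)page - 1;
            else if ( e & IND_DONE )
                return;
            else if ( e & IND_SOURCE )      /* copy one source page */
            {
                memcpy(dest, page, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
            else if ( e & IND_ZERO )        /* clear one destination page */
            {
                memset(dest, 0, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
        }
    }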

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
implemented on top of the new infrastructure.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
 xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  122 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 804 insertions(+), 336 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..7a16c85
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,198 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr (%rbp) */
+        /* %r8 - flags */
+
+        movq    %rcx, %rbp
+
+        /* Setup stack. */
+        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq    %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq    *%rax
+
+identity_mapped:
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq    %rdx, %rdi
+        call    relocate_pages
+
+        /* Need to switch to 32-bit mode? */
+        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz     call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq   *%rbp
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt    compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq    compat_mode_gdt(%rip), %rax
+        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt    compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal    compatibility_mode(%rip), %eax
+        movl    %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp    *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        pushq   %rbx
+
+        cld
+        movq    %rdi, %rbx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        popq    %rbx
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl    $0x0008, %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %fs
+        movl    %eax, %gs
+        movl    %eax, %ss
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl    %cr0, %eax
+        andl    $~X86_CR0_PG, %eax
+        movl    %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call    *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..c5450ba 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
 
 static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
 
-static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
-
 static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
 static size_t vmcoreinfo_size = 0;
 
@@ -311,14 +310,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
 {
-    xen_kexec_image_t *image;
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
+{
+    static DEFINE_SPINLOCK(kexec_lock);
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
 
-        crash_save_vmcoreinfo();
+static int kexec_segments_add_segment(
+    unsigned int *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    unsigned int n = *nr_segments;
+
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned int *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            entry = page = map_domain_page(mfn);
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            if ( *nr_segments == 0 )
+            {
+                ret = -EINVAL;
+                goto done;
+            }
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
 
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned int nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  At worst, this only makes the
+     * resulting image unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 02ee37e..10fb785 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        goto out;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        goto out;
 
     image->head = page_to_maddr(image->entry_page);
 
@@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-08 16:55 ` [PATCH 4/9] kexec: extend hypercall with improved load/unload ops David Vrabel
@ 2013-11-05 22:43   ` Don Slutz
  0 siblings, 0 replies; 31+ messages in thread
From: Don Slutz @ 2013-11-05 22:43 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Keir Fraser, Jan Beulich

On 10/08/13 12:55, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
[...]
> +
> +static int kexec_segments_from_ind_page(unsigned long mfn,
> +                                        unsigned *nr_segments,
> +                                        xen_kexec_segment_t *segments,
> +                                        bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +
> +    page = map_domain_page(mfn);
> +
> +    /*
> +     * Walk the indirection page list, adding destination pages to the
> +     * segments.
> +     */
> +    for ( entry = page; ; )
>       {
> -        if ( test_and_clear_bit((base + pos), &kexec_flags) )
> +        unsigned long ind;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
>           {
> -            image = &kexec_image[base + pos];
> -            machine_kexec_unload(load->type, base + pos, image);
> +        case IND_DESTINATION:
> +            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            page = map_domain_page(mfn);
> +            if ( page == NULL )
> +                return -ENOMEM;
> +            entry = page;
> +            continue;
> +        case IND_DONE:
> +            goto done;
> +        case IND_SOURCE:
> +            segments[*nr_segments-1].dest_size += PAGE_SIZE;
I have not been able to prove that *nr_segments cannot be zero when you get here, so I think this needs to be checked rather than corrupting memory.
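A minimal guard along these lines (a sketch only; -EINVAL is my guess at
the preferred error code) would avoid the out-of-bounds write:

    case IND_SOURCE:
        if ( *nr_segments == 0 )
        {
            ret = -EINVAL;
            goto done;
        }
        segments[*nr_segments-1].dest_size += PAGE_SIZE;
        break;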
> +            break;
> +        default:
> +            ret = -EINVAL;
> +            goto done;
>           }
> +        entry = kimage_entry_next(entry, compat);
>       }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
[...]
>
    -Don Slutz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-10-08 16:55 [PATCHv9 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
@ 2013-10-08 16:55 ` David Vrabel
  2013-11-05 22:43   ` Don Slutz
  0 siblings, 1 reply; 31+ messages in thread
From: David Vrabel @ 2013-10-08 16:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).
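
As a rough illustration of the new interface (not part of this patch;
the field names are taken from the hypervisor code below and the buffer
handling is simplified), a caller loading a single segment might fill
in something like:

    xen_kexec_segment_t seg = {
        .dest_maddr = 0x1000000,     /* where the segment must end up */
        .dest_size  = PAGE_SIZE,
        .buf_size   = PAGE_SIZE,
    };
    xen_kexec_load_t load = {
        .type        = KEXEC_TYPE_DEFAULT,
        .arch        = EM_X86_64,    /* or EM_386 for a 32-bit image */
        .entry_maddr = 0x1000000,
        .nr_segments = 1,
    };

    set_xen_guest_handle(seg.buf.h, image_data);  /* hypothetical guest buffer */
    set_xen_guest_handle(load.segments.h, &seg);
    /* ... then issue KEXEC_CMD_kexec_load via the kexec_op hypercall ... */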

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
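
For illustration only (this is not part of the patch): the relocation
list consumed by kexec_reloc()'s relocate_pages routine is a sequence
of page-aligned machine addresses tagged in their low bits, and a C
rendering of the assembly loop below would look roughly like:

    /* Runs with an identity mapping, so maddrs can be used as pointers. */
    static void relocate_pages_c(const unsigned long *entry)
    {
        unsigned long dest = 0;

        for ( ; ; entry++ )
        {
            unsigned long e = *entry;
            unsigned long page = e & PAGE_MASK;

            if ( e & IND_DESTINATION )
                dest = page;                    /* subsequent copies land here */
            else if ( e & IND_INDIRECTION )
                entry = (const unsigned long *)page - 1;  /* chain to next list page */
            else if ( e & IND_DONE )
                break;
            else if ( e & IND_SOURCE )
            {
                memcpy((void *)dest, (void *)page, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
            else if ( e & IND_ZERO )
            {
                memset((void *)dest, 0, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
        }
    }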

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
implemented on top of the new infrastructure.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++-------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 -----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  206 ++++++++++++++++++
 xen/common/kexec.c                  |  393 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  123 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 809 insertions(+), 335 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..de75e04
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,206 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr */
+        /* %r8 - flags */
+
+        /* Setup stack. */
+        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq    %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq    *%rax
+
+identity_mapped:
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        pushq   %rdi
+        movq    %rdx, %rdi
+        call    relocate_pages
+        popq    %rdi
+
+        /* Need to switch to 32-bit mode? */
+        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz     call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq   *%rcx
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt    compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq    (compat_mode_gdt - kexec_reloc)(%rdi), %rax
+        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt    compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal    (compatibility_mode - kexec_reloc)(%edi), %eax
+        movl    %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp    *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        pushq   %rdi
+        pushq   %rsi
+        pushq   %rbx
+        pushq   %rcx
+
+        cld
+        movq    %rdi, %rbx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        popq    %rcx
+        popq    %rbx
+        popq    %rsi
+        popq    %rdi
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl    $0x0008, %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %fs
+        movl    %eax, %gs
+        movl    %eax, %ss
+
+        movl    %ecx, %ebp
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl    %cr0, %eax
+        andl    $~X86_CR0_PG, %eax
+        movl    %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call    *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..5548c37 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -311,14 +312,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +735,261 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
 {
-    xen_kexec_image_t *image;
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
+
+static int kexec_segments_add_segment(
+    unsigned *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    int n = *nr_segments;
 
-        crash_save_vmcoreinfo();
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            if ( page == NULL )
+                return -ENOMEM;
+            entry = page;
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  The only consequence is that the
+     * resulting image may not be relocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
 
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1008,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1130,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 9783e5a..6bee9cf 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,14 +175,22 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        return result;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        return result;
 
     image->head = page_to_maddr(image->entry_page);
-    image->next_entry = 0;
 
     result = 0;
 out:
@@ -591,7 +599,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -643,6 +651,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -749,6 +759,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -763,6 +774,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -806,6 +825,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-09-12 19:48 [PATCHv7 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
@ 2013-09-12 19:49   ` David Vrabel
  0 siblings, 0 replies; 31+ messages in thread
From: David Vrabel @ 2013-09-12 19:49 UTC (permalink / raw)
  To: xen-devel; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
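
For reference only (again, not part of the patch), each entry in that
list is a page-aligned machine address with a type flag in its low
bits, as in Linux's kexec.  A rough C model of what the relocation
loop does with it, assuming the IND_* flags and kimage_entry_t from
xen/kimage.h and using maddr_to_virt() as a stand-in for the identity
mapping the real (assembly) code runs under:

    #include <xen/kimage.h>     /* kimage_entry_t, IND_* flags */
    #include <xen/string.h>     /* memcpy(), memset() */
    #include <asm/page.h>       /* PAGE_MASK, PAGE_SIZE, maddr_to_virt() */

    /* Sketch only: 'head' is the maddr of the first indirection page
     * (image->head); the real work is done by kexec_reloc.S below. */
    static void relocate_pages_model(paddr_t head)
    {
        kimage_entry_t *entry = maddr_to_virt(head & PAGE_MASK);
        unsigned char *dest = NULL;

        while ( !(*entry & IND_DONE) )
        {
            kimage_entry_t e = *entry++;
            void *page = maddr_to_virt(e & PAGE_MASK);

            if ( e & IND_DESTINATION )
                dest = page;              /* following pages land here */
            else if ( e & IND_INDIRECTION )
                entry = page;             /* continue in a new list page */
            else if ( e & IND_SOURCE )
            {
                memcpy(dest, page, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
            else if ( e & IND_ZERO )
            {
                memset(dest, 0, PAGE_SIZE);
                dest += PAGE_SIZE;
            }
        }
    }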

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
implemented on top of the new infrastructure.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++-------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 -----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  209 +++++++++++++++++++
 xen/common/kexec.c                  |  393 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  123 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 812 insertions(+), 335 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..31a8cee
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,209 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr */
+        /* %r8 - flags */
+
+        movq %rdx, %rbx
+
+        /* Setup stack. */
+        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq *%rax
+
+identity_mapped:
+        pushq %rcx
+        pushq %rbx
+        pushq %rsi
+        pushq %rdi
+
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq %rbx, %rdi
+        call relocate_pages
+
+        popq %rdi
+        popq %rsi
+        popq %rbx
+        popq %rcx
+
+        /* Need to switch to 32-bit mode? */
+        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq *%rcx
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax
+        movq %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal (compatibility_mode - kexec_reloc)(%edi), %eax
+        movl %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        cld
+        movq    %rdi, %rcx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+        jmp     is_dest
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl $0x0008, %eax
+        movl %eax, %ds
+        movl %eax, %es
+        movl %eax, %fs
+        movl %eax, %gs
+        movl %eax, %ss
+
+        movl %ecx, %ebp
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl %cr0, %eax
+        andl $~X86_CR0_PG, %eax
+        movl %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl __kexec_reloc_size, kexec_reloc_size
+        .set __kexec_reloc_size, . - kexec_reloc
+kexec_reloc_size:
+        .long __kexec_reloc_size
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..5548c37 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -311,14 +312,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +735,261 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
 {
-    xen_kexec_image_t *image;
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
+
+static int kexec_segments_add_segment(
+    unsigned *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    int n = *nr_segments;
 
-        crash_save_vmcoreinfo();
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            if ( page == NULL )
+                return -ENOMEM;
+            entry = page;
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest-provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  The only consequence is that
+     * the resulting image may be unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
 
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1008,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1130,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 9783e5a..6bee9cf 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,14 +175,22 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        return result;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        return result;
 
     image->head = page_to_maddr(image->entry_page);
-    image->next_entry = 0;
 
     result = 0;
 out:
@@ -591,7 +599,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -643,6 +651,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -749,6 +759,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -763,6 +774,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -806,6 +825,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr */
+        /* %r8 - flags */
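+        /*
+         * All address arguments are machine addresses.  The only flag
+         * currently defined is KEXEC_RELOC_FLAG_COMPAT, set when the
+         * image entry point expects 32-bit (compatibility) mode.
+         */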
+
+        movq %rdx, %rbx
+
+        /* Setup stack. */
+        leaq (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq *%rax
+
+identity_mapped:
+        pushq %rcx
+        pushq %rbx
+        pushq %rsi
+        pushq %rdi
+
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Proctected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq %rbx, %rdi
+        call relocate_pages
+
+        popq %rdi
+        popq %rsi
+        popq %rbx
+        popq %rcx
+
+        /* Need to switch to 32-bit mode? */
+        testq $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq *%rcx
+        ud2
+
+call_32_bit:
+        /* Setup IDT. */
+        lidt compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq (compat_mode_gdt - kexec_reloc)(%rdi), %rax
+        movq %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal (compatibility_mode - kexec_reloc)(%edi), %eax
+        movl %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
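+        /*
+         * Each entry in the indirection list is a page-aligned machine
+         * address with an IND_* flag (destination, indirection, done,
+         * source or zero) encoded in its low bits.
+         */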
+        cld
+        movq    %rdi, %rcx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+        jmp     is_dest
+
+next_entry: /* top, read another word from the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Setup some sane segments. */
+        movl $0x0008, %eax
+        movl %eax, %ds
+        movl %eax, %es
+        movl %eax, %fs
+        movl %eax, %gs
+        movl %eax, %ss
+
+        movl %ecx, %ebp
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl %cr0, %eax
+        andl $~X86_CR0_PG, %eax
+        movl %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl __kexec_reloc_size, kexec_reloc_size
+        .set __kexec_reloc_size, . - kexec_reloc
+kexec_reloc_size:
+        .long __kexec_reloc_size
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..5548c37 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -311,14 +312,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +735,261 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
 {
-    xen_kexec_image_t *image;
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
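+    /*
+     * Install the new image (if any) in the free slot, flip the
+     * active-slot bit, and return the previously active image via
+     * *old for the caller to unload and free.
+     */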
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
+
+static int kexec_segments_add_segment(
+    unsigned *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    int n = *nr_segments;
 
-        crash_save_vmcoreinfo();
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            if ( page == NULL )
+                return -ENOMEM;
+            entry = page;
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest-provided indirection pages a second time.
+     * The guest could have altered them since, invalidating the
+     * segment information constructed above.  The only consequence is
+     * that the resulting image may be unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
 
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1008,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1130,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 9783e5a..6bee9cf 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,14 +175,22 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
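+    /*
+     * Give the control page a 1:1 mapping in the image's page tables
+     * so the relocation code can keep running once they are loaded.
+     */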
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        return result;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        return result;
 
     image->head = page_to_maddr(image->entry_page);
-    image->next_entry = 0;
 
     result = 0;
 out:
@@ -591,7 +599,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -643,6 +651,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -749,6 +759,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -763,6 +774,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    /*
+     * Give every destination page a 1:1 mapping in the image's page
+     * tables so the copy in kexec_reloc() can reach it.
+     */
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -806,6 +825,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
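+            /*
+             * Copy the source data into a Xen-owned page so the image
+             * no longer depends on guest memory once the load
+             * completes.
+             */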
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
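+/*
+ * Entry point of the relocation code copied into the image's control
+ * page.  It loads the relocation page table, processes the
+ * indirection list (copying/zeroing destination pages) and jumps to
+ * the image entry point, via compatibility mode if
+ * KEXEC_RELOC_FLAG_COMPAT is set.
+ */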
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5

