[PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels

From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

The series (for Xen 4.4) improves the kexec hypercall by making Xen
responsible for loading and relocating the image.  This allows kexec
to be usable by pv-ops kernels and should allow kexec to be usable
from an HVM or PVH privileged domain.

I have now tested this with a Linux kernel image using the VGA console,
which was causing problems in v9 (this turned out to be a kexec-tools
bug).

The required patch series for kexec-tools will be posted shortly and
is available from the xen-v7 branch of:

http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summary

Changes in v10:

- Document host state on exec.
- Fix kimage_alloc() error path (double free, crash on zero kimage->head).
- Check for segment before expanding it in load_v1.
- Move kexec_lock define into kexec_swap_images().

Changes in v9:

- Update comments to correctly say 4.4.
- Minor updates to the kexec_reloc assembly to improve maintainability
  a bit.

Changes in v8:

- Use #defines for compat ABI structures.
- Tweak link time check for kexec_reloc.

Changes in v7:

- No longer use GUEST_HANDLE_64(), get a uniform ABI by using unions
  and explicit padding.
- Only map the segments and not all of RAM.
- Add a mechanism to create mappings for use by the exec'd image (a
  segment with a NULL buf handle).
- Fix a bug where a crash image's code page would be placed at machine
  address 0 (instead of inside the crash region).

Changes in v6:

- Fix double free in KEXEC_load_v1 failure path.
- Only copy the relocation code and not the whole page.
- Add myself as the kexec maintainer.

Changes in v5 (not posted to the list):

- _rsvd -> _pad in one of the public ABI structures.
- Fix bug where trailing pages were not zeroed. This fixes loading a
  64-bit Linux kernel using a more recent version of kexec-tools.
- Check the relocation code fits into a page at link time.

Changes in v4:

- Use paddr_t and page_to_maddr() etc. for portability.
- Add explicit padding to hypercall structures where required.
- Minor cleanup of the kexec_reloc assembly.
- Print a message before exec'ing a crash image.
- Style fixes (tabs, trailing whitespace) and typos.
- Fix a bug where using the V1 interface and unloading an image may crash.

Changes in v3:

- Provide old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3
- Adjust new struct xen_kexec_load to avoid unnecessary padding.
- Use domheap pages for the image and control pages.
- Remove the DBG() macros from the reloc code.

David


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

* [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry
From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

FIX_EFI_MPF was defined as FIX_KEXEC_BASE_0, which is going away, so
give FIX_EFI_MPF its own entry.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/mpparse.c       |    2 --
 xen/include/asm-x86/fixmap.h |    1 +
 2 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
index 97d34bc..3753704 100644
--- a/xen/arch/x86/mpparse.c
+++ b/xen/arch/x86/mpparse.c
@@ -538,8 +538,6 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
 	}
 }
 
-#define FIX_EFI_MPF FIX_KEXEC_BASE_0
-
 static __init void efi_unmap_mpf(void)
 {
 	if (efi_enabled)
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index d850be4..8b4266d 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -66,6 +66,7 @@ enum fixed_addresses {
     FIX_APEI_RANGE_BASE,
     FIX_APEI_RANGE_END = FIX_APEI_RANGE_BASE + FIX_APEI_RANGE_MAX -1,
     FIX_IGD_MMIO,
+    FIX_EFI_MPF,
     __end_of_fixed_addresses
 };
 
-- 
1.7.2.5
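
Before this patch FIX_EFI_MPF was a #define alias for FIX_KEXEC_BASE_0, so it had no fixmap slot of its own once that entry is removed. Placing a new enumerator before the __end_of_fixed_addresses sentinel gives it a unique index, and therefore a unique virtual page, with no manual bookkeeping. The following is a minimal, self-contained model of that idea. The DEMO_ names and constants are hypothetical, and the `FIXADDR_TOP - (idx << PAGE_SHIFT)` mapping is the common fixmap convention, assumed here rather than quoted from Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants; only the relative layout matters here. */
#define DEMO_PAGE_SHIFT      12
#define DEMO_FIXADDR_TOP     0xffff82d0bffff000ULL
#define DEMO_APEI_RANGE_MAX  4

/* Trimmed model of the tail of enum fixed_addresses after this patch. */
enum demo_fixed_addresses {
    DEMO_FIX_APEI_RANGE_BASE,
    DEMO_FIX_APEI_RANGE_END = DEMO_FIX_APEI_RANGE_BASE + DEMO_APEI_RANGE_MAX - 1,
    DEMO_FIX_IGD_MMIO,
    DEMO_FIX_EFI_MPF,             /* new: a slot of its own, not an alias */
    DEMO_END_OF_FIXED_ADDRESSES   /* sentinel: number of slots */
};

/* Each index maps one page, counting down from the top of the fixmap area. */
static uint64_t demo_fix_to_virt(unsigned int idx)
{
    return DEMO_FIXADDR_TOP - ((uint64_t)idx << DEMO_PAGE_SHIFT);
}
```

Adding the enumerator grows the sentinel by one automatically, which is why the patch only needs a single added line in fixmap.h.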

* [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops
From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload sub-ops
to the kexec hypercall.  These new sub-ops allow a privileged guest to
provide the image data to be loaded into Xen memory or the crash
region instead of guests loading the image data themselves and
providing the relocation code and metadata.

The old interface is provided to guests requesting an interface
version prior to 4.4.

Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/kexec.c              |   12 +++---
 xen/include/public/kexec.h      |   92 +++++++++++++++++++++++++++++++++++++--
 xen/include/public/xen-compat.h |    2 +-
 3 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7cd151f..7b23df0 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -734,7 +734,7 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
+static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
 {
     xen_kexec_image_t *image;
     int base, bit, pos;
@@ -781,7 +781,7 @@ static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
 
 static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_load_t load;
+    xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
@@ -793,8 +793,8 @@ static int kexec_load_unload_compat(unsigned long op,
                                     XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
-    compat_kexec_load_t compat_load;
-    xen_kexec_load_t load;
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&compat_load, uarg, 1)) )
         return -EFAULT;
@@ -866,8 +866,8 @@ static int do_kexec_op_internal(unsigned long op,
         else
                 ret = kexec_get_range(uarg);
         break;
-    case KEXEC_CMD_kexec_load:
-    case KEXEC_CMD_kexec_unload:
+    case KEXEC_CMD_kexec_load_v1:
+    case KEXEC_CMD_kexec_unload_v1:
         spin_lock_irqsave(&kexec_lock, flags);
         if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
         {
diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h
index 36409ff..a6a0a88 100644
--- a/xen/include/public/kexec.h
+++ b/xen/include/public/kexec.h
@@ -105,6 +105,20 @@ typedef struct xen_kexec_image {
  * Perform kexec having previously loaded a kexec or kdump kernel
  * as appropriate.
  * type == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
+ *
+ * Control is transferred to the image entry point with the host in
+ * the following state.
+ *
+ * - The image may be executed on any PCPU and all other PCPUs are
+ *   stopped.
+ *
+ * - Local interrupts are disabled.
+ *
+ * - Register values are undefined.
+ *
+ * - The image segments have writeable 1:1 virtual to machine
+ *   mappings.  The location of any page tables is undefined and these
+ *   page table frames are not mapped.
  */
 #define KEXEC_CMD_kexec                 0
 typedef struct xen_kexec_exec {
@@ -116,12 +130,12 @@ typedef struct xen_kexec_exec {
  * type  == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
  * image == relocation information for kexec (ignored for unload) [in]
  */
-#define KEXEC_CMD_kexec_load            1
-#define KEXEC_CMD_kexec_unload          2
-typedef struct xen_kexec_load {
+#define KEXEC_CMD_kexec_load_v1         1 /* obsolete since 0x00040400 */
+#define KEXEC_CMD_kexec_unload_v1       2 /* obsolete since 0x00040400 */
+typedef struct xen_kexec_load_v1 {
     int type;
     xen_kexec_image_t image;
-} xen_kexec_load_t;
+} xen_kexec_load_v1_t;
 
 #define KEXEC_RANGE_MA_CRASH      0 /* machine address and size of crash area */
 #define KEXEC_RANGE_MA_XEN        1 /* machine address and size of Xen itself */
@@ -152,6 +166,76 @@ typedef struct xen_kexec_range {
     unsigned long start;
 } xen_kexec_range_t;
 
+#if __XEN_INTERFACE_VERSION__ >= 0x00040400
+/*
+ * A contiguous chunk of a kexec image and its destination machine
+ * address.
+ */
+typedef struct xen_kexec_segment {
+    union {
+        XEN_GUEST_HANDLE(const_void) h;
+        uint64_t _pad;
+    } buf;
+    uint64_t buf_size;
+    uint64_t dest_maddr;
+    uint64_t dest_size;
+} xen_kexec_segment_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t);
+
+/*
+ * Load a kexec image into memory.
+ *
+ * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM.
+ * The image is relocated prior to being executed.
+ *
+ * For KEXEC_TYPE_CRASH images, each segment of the image must reside
+ * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and
+ * the entry point must be within the image. The caller is responsible
+ * for ensuring that multiple images do not overlap.
+ *
+ * All image segments will be loaded to their destination machine
+ * addresses prior to being executed.  The trailing portion of any
+ * segments with a source buffer (from dest_maddr + buf_size to
+ * dest_maddr + dest_size) will be zeroed.
+ *
+ * Segments with no source buffer will be accessible to the image when
+ * it is executed.
+ */
+
+#define KEXEC_CMD_kexec_load 4
+typedef struct xen_kexec_load {
+    uint8_t  type;        /* One of KEXEC_TYPE_* */
+    uint8_t  _pad;
+    uint16_t arch;        /* ELF machine type (EM_*). */
+    uint32_t nr_segments;
+    union {
+        XEN_GUEST_HANDLE(xen_kexec_segment_t) h;
+        uint64_t _pad;
+    } segments;
+    uint64_t entry_maddr; /* image entry point machine address. */
+} xen_kexec_load_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t);
+
+/*
+ * Unload a kexec image.
+ *
+ * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH.
+ */
+#define KEXEC_CMD_kexec_unload 5
+typedef struct xen_kexec_unload {
+    uint8_t type;
+} xen_kexec_unload_t;
+DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t);
+
+#else /* __XEN_INTERFACE_VERSION__ < 0x00040400 */
+
+#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1
+#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1
+#define xen_kexec_load xen_kexec_load_v1
+#define xen_kexec_load_t xen_kexec_load_v1_t
+
+#endif
+
 #endif /* _XEN_PUBLIC_KEXEC_H */
 
 /*
diff --git a/xen/include/public/xen-compat.h b/xen/include/public/xen-compat.h
index 69141c4..3eb80a0 100644
--- a/xen/include/public/xen-compat.h
+++ b/xen/include/public/xen-compat.h
@@ -27,7 +27,7 @@
 #ifndef __XEN_PUBLIC_XEN_COMPAT_H__
 #define __XEN_PUBLIC_XEN_COMPAT_H__
 
-#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040300
+#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400
 
 #if defined(__XEN__) || defined(__XEN_TOOLS__)
 /* Xen is built with matching headers and implements the latest interface. */
-- 
1.7.2.5
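
The new public structures no longer use GUEST_HANDLE_64(); each guest handle instead sits in a union with a `uint64_t`, so the layout is the same for 32- and 64-bit callers. The load comment also specifies that the tail of every segment with a source buffer is zeroed. The stand-alone sketch below illustrates both points. It is not Xen code: the `demo_` types replace `XEN_GUEST_HANDLE()` with a plain pointer, and `demo_load_segment()` models machine memory as a byte array starting at address 0; both are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirror of xen_kexec_segment_t with a plain pointer as the "handle". */
typedef struct demo_kexec_segment {
    union {
        const void *h;
        uint64_t _pad;      /* keeps the field 8 bytes on 32-bit callers */
    } buf;
    uint64_t buf_size;
    uint64_t dest_maddr;
    uint64_t dest_size;
} demo_kexec_segment_t;

/* Mirror of the new xen_kexec_load_t. */
typedef struct demo_kexec_load {
    uint8_t  type;
    uint8_t  _pad;
    uint16_t arch;
    uint32_t nr_segments;
    union {
        demo_kexec_segment_t *h;
        uint64_t _pad;
    } segments;
    uint64_t entry_maddr;
} demo_kexec_load_t;

/* These hold on common 32- and 64-bit ABIs thanks to the explicit padding. */
_Static_assert(sizeof(demo_kexec_segment_t) == 32, "segment size");
_Static_assert(offsetof(demo_kexec_load_t, segments) == 8, "segments offset");
_Static_assert(sizeof(demo_kexec_load_t) == 24, "load size");

/*
 * Model of the documented load semantics: copy buf_size bytes to the
 * destination, then zero from dest_maddr + buf_size to
 * dest_maddr + dest_size.  Returns 0 on success, -1 if the segment is
 * malformed or does not fit in the modelled memory.
 */
static int demo_load_segment(uint8_t *mem, size_t mem_size,
                             const demo_kexec_segment_t *seg)
{
    if ( seg->buf_size > seg->dest_size ||
         seg->dest_maddr > mem_size ||
         seg->dest_size > mem_size - seg->dest_maddr )
        return -1;

    if ( seg->buf.h != NULL )
    {
        memcpy(mem + seg->dest_maddr, seg->buf.h, seg->buf_size);
        memset(mem + seg->dest_maddr + seg->buf_size, 0,
               seg->dest_size - seg->buf_size);
    }
    /* Segments without a source buffer are only made accessible. */
    return 0;
}
```

For example, a 4-byte buffer loaded into a 16-byte destination leaves the remaining 12 bytes zeroed and does not touch memory outside the segment.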

* [PATCH 3/9] kexec: add infrastructure for handling kexec images
From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add the code needed to handle and load kexec images into Xen memory or
into the crash region.  This is needed for the new KEXEC_CMD_kexec_load
and KEXEC_CMD_kexec_unload hypercall sub-ops.

Much of this code is derived from the Linux kernel.
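
Before any pages are allocated, do_kimage_alloc() rejects segment lists whose destinations are not page aligned, overlap one another, or are smaller than their source buffers. kimage_crash_alloc() additionally requires every segment with a source buffer to lie inside the crash area. The sketch below restates those checks in plain C for experimentation. The `demo_` names and the reduced segment structure are hypothetical, a 4 KiB page size is assumed, and the crash-area bounds are passed in rather than read from kexec_crash_area.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL   /* assumed page size */

/* Hypothetical reduced form of xen_kexec_segment_t. */
struct demo_segment {
    const void *buf;        /* NULL: no source buffer */
    uint64_t    buf_size;
    uint64_t    dest_maddr;
    uint64_t    dest_size;
};

/* Restates the three checks made by do_kimage_alloc(). */
static int demo_check_segments(const struct demo_segment *seg,
                               unsigned long nr)
{
    unsigned long i, j;

    /* Destinations must start and end on a page boundary. */
    for ( i = 0; i < nr; i++ )
    {
        uint64_t mstart = seg[i].dest_maddr;
        uint64_t mend = mstart + seg[i].dest_size;

        if ( (mstart % DEMO_PAGE_SIZE) || (mend % DEMO_PAGE_SIZE) )
            return -EADDRNOTAVAIL;
    }

    /* Half-open ranges [start, end) must not overlap. */
    for ( i = 0; i < nr; i++ )
        for ( j = 0; j < i; j++ )
        {
            uint64_t mstart = seg[i].dest_maddr;
            uint64_t mend = mstart + seg[i].dest_size;
            uint64_t pstart = seg[j].dest_maddr;
            uint64_t pend = pstart + seg[j].dest_size;

            if ( (mend > pstart) && (mstart < pend) )
                return -EINVAL;
        }

    /* A source buffer may not be larger than its destination. */
    for ( i = 0; i < nr; i++ )
        if ( seg[i].buf_size > seg[i].dest_size )
            return -EINVAL;

    return 0;
}

/* Crash images: segments with a buffer must lie inside the crash area. */
static int demo_check_crash_segments(const struct demo_segment *seg,
                                     unsigned long nr,
                                     uint64_t area_start, uint64_t area_size)
{
    unsigned long i;

    for ( i = 0; i < nr; i++ )
    {
        uint64_t mstart = seg[i].dest_maddr;
        uint64_t mend = mstart + seg[i].dest_size;

        if ( seg[i].buf == NULL )
            continue;
        if ( (mstart < area_start) || (mend > area_start + area_size) )
            return -EADDRNOTAVAIL;
    }

    return demo_check_segments(seg, nr);
}
```

Checking everything up front keeps the later page-allocation code free of these special cases, which is the rationale given in the comments of do_kimage_alloc().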

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/Makefile      |    1 +
 xen/common/kimage.c      |  821 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/kimage.h |   62 ++++
 3 files changed, 884 insertions(+), 0 deletions(-)
 create mode 100644 xen/common/kimage.c
 create mode 100644 xen/include/xen/kimage.h
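
The comment at the top of kimage.c describes the image as a list of descriptor entries, held in indirection pages, that the relocation stub walks to copy each source page to its destination. KIMAGE_LAST_ENTRY is the index of the final slot in such a page (511, assuming 4 KiB pages and 8-byte entries). A minimal sketch of that encoding follows. The IND_* flag names and values are borrowed from Linux's kexec implementation and are an assumption here, since the contents of xen/include/xen/kimage.h are not shown in this excerpt; chaining to further indirection pages is not modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_entry_t;        /* stands in for kimage_entry_t */

#define DEMO_PAGE_SIZE   4096u
/* Flag bits as used by Linux kexec (assumed, not taken from this series). */
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8

/* Same formula as KIMAGE_LAST_ENTRY in kimage.c. */
#define DEMO_LAST_ENTRY  (DEMO_PAGE_SIZE / sizeof(demo_entry_t) - 1)

/* Hypothetical in-memory model of one indirection page. */
struct demo_list {
    demo_entry_t entries[DEMO_LAST_ENTRY + 1];
    unsigned int n;
};

static int demo_add(struct demo_list *l, uint64_t maddr, unsigned int flag)
{
    /* The last slot is kept for an IND_INDIRECTION link to the next page. */
    if ( l->n >= DEMO_LAST_ENTRY )
        return -1;
    /* maddr is page aligned, so its low bits are free to hold the flag. */
    l->entries[l->n++] = maddr | flag;
    return 0;
}

/* Record the page copies a relocation stub would perform. */
static unsigned int demo_walk(const struct demo_list *l,
                              uint64_t *src_out, uint64_t *dst_out,
                              unsigned int max)
{
    uint64_t dest = 0;
    unsigned int i, copies = 0;

    for ( i = 0; i < l->n; i++ )
    {
        demo_entry_t e = l->entries[i];
        uint64_t addr = e & ~(uint64_t)(DEMO_PAGE_SIZE - 1);

        if ( e & IND_DONE )
            break;
        if ( e & IND_DESTINATION )
            dest = addr;              /* following sources land here */
        else if ( e & IND_SOURCE )
        {
            if ( copies == max )
                break;
            src_out[copies] = addr;
            dst_out[copies] = dest;
            copies++;
            dest += DEMO_PAGE_SIZE;   /* next source fills the next page */
        }
        /* IND_INDIRECTION (jump to another page) is not modelled. */
    }

    return copies;
}
```

One destination entry followed by several source entries thus describes a physically contiguous destination, which keeps the list compact for large segments.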

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 686f7a1..3683ae3 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -13,6 +13,7 @@ obj-y += irq.o
 obj-y += kernel.o
 obj-y += keyhandler.o
 obj-$(HAS_KEXEC) += kexec.o
+obj-$(HAS_KEXEC) += kimage.o
 obj-y += lib.o
 obj-y += memory.o
 obj-y += multicall.o
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
new file mode 100644
index 0000000..02ee37e
--- /dev/null
+++ b/xen/common/kimage.c
@@ -0,0 +1,821 @@
+/*
+ * Kexec Image
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Derived from kernel/kexec.c from Linux:
+ *
+ *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+#include <xen/errno.h>
+#include <xen/spinlock.h>
+#include <xen/guest_access.h>
+#include <xen/mm.h>
+#include <xen/kexec.h>
+#include <xen/kimage.h>
+
+#include <asm/page.h>
+
+/*
+ * When kexec transitions to the new kernel there is a one-to-one
+ * mapping between physical and virtual addresses.  On processors
+ * where you can disable the MMU this is trivial, and easy.  For
+ * others it is still a simple predictable page table to setup.
+ *
+ * The code for the transition from the current kernel to the new
+ * kernel is placed in the page-size control_code_buffer.  This memory
+ * must be identity mapped in the transition from virtual to physical
+ * addresses.
+ *
+ * The assembly stub in the control code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages.  As this data
+ * structure is not used in the context of the current OS, it must
+ * be self-contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it).  The end product of this is that most of the
+ * physical address space, and most of RAM can be used.
+ *
+ * Future directions include:
+ *  - allocating a page table with the control code buffer identity
+ *    mapped, to simplify machine_kexec and make kexec_on_panic more
+ *    reliable.
+ */
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address..., for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+/*
+ * Offset of the last entry in an indirection page.
+ */
+#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
+
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start, paddr_t end);
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t dest);
+
+static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
+{
+    struct page_info *page;
+
+    page = alloc_domheap_page(NULL, memflags);
+    if ( !page )
+        return NULL;
+
+    clear_domain_page(page_to_mfn(page));
+
+    return page;
+}
+
+static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
+                           unsigned long nr_segments,
+                           xen_kexec_segment_t *segments, uint8_t type)
+{
+    struct kexec_image *image;
+    unsigned long i;
+    int result;
+
+    /* Allocate a controlling structure */
+    result = -ENOMEM;
+    image = xzalloc(typeof(*image));
+    if ( !image )
+        goto out;
+
+    image->entry_maddr = entry;
+    image->type = type;
+    image->nr_segments = nr_segments;
+    image->segments = segments;
+
+    image->next_crash_page = kexec_crash_area.start;
+
+    INIT_PAGE_LIST_HEAD(&image->control_pages);
+    INIT_PAGE_LIST_HEAD(&image->dest_pages);
+    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
+
+    /*
+     * Verify we have good destination addresses.  The caller is
+     * responsible for making certain we don't attempt to load the new
+     * image into invalid or reserved areas of RAM.  This just
+     * verifies it is an address we can use.
+     *
+     * Since the kernel does everything in page size chunks ensure the
+     * destination addresses are page aligned.  Too many special cases
+     * crop up when we don't do this.  The most insidious is getting
+     * overlapping destination addresses simply because addresses are
+     * changed to page size granularity.
+     */
+    result = -EADDRNOTAVAIL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
+            goto out;
+    }
+
+    /*
+     * Verify our destination addresses do not overlap.  If we allowed
+     * overlapping destination addresses through very weird things can
+     * happen with no easy explanation as one segment stops on
+     * another.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+        unsigned long j;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        for (j = 0; j < i; j++ )
+        {
+            paddr_t pstart, pend;
+            pstart = image->segments[j].dest_maddr;
+            pend   = pstart + image->segments[j].dest_size;
+            /* Do the segments overlap? */
+            if ( (mend > pstart) && (mstart < pend) )
+                goto out;
+        }
+    }
+
+    /*
+     * Ensure our buffer sizes are no larger than our memory
+     * sizes.  This should always be the case, and it is easier to
+     * check up front than to be surprised later on.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        if ( image->segments[i].buf_size > image->segments[i].dest_size )
+            goto out;
+    }
+
+    /* 
+     * Page for the relocation code must still be accessible after the
+     * processor has switched to 32-bit mode.
+     */
+    result = -ENOMEM;
+    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
+    if ( !image->control_code_page )
+        goto out;
+
+    /* Add an empty indirection page. */
+    image->entry_page = kimage_alloc_control_page(image, 0);
+    if ( !image->entry_page )
+        goto out;
+
+    image->head = page_to_maddr(image->entry_page);
+
+    result = 0;
+out:
+    if ( result == 0 )
+        *rimage = image;
+    else if ( image )
+    {
+        image->segments = NULL; /* caller frees segments after an error */
+        kimage_free(image);
+    }
+
+    return result;
+
+}
+
+static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
+                               unsigned long nr_segments,
+                               xen_kexec_segment_t *segments)
+{
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_DEFAULT);
+}
+
+static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
+                              unsigned long nr_segments,
+                              xen_kexec_segment_t *segments)
+{
+    unsigned long i;
+    int result;
+
+    /* Verify we have a valid entry point */
+    if ( (entry < kexec_crash_area.start)
+         || (entry > kexec_crash_area.start + kexec_crash_area.size))
+        return -EADDRNOTAVAIL;
+
+    /*
+     * Verify we have good destination addresses.  Normally
+     * the caller is responsible for making certain we don't
+     * attempt to load the new image into invalid or reserved
+     * areas of RAM.  But crash kernels are preloaded into a
+     * reserved area of ram.  We must ensure the addresses
+     * are in the reserved area otherwise preloading the
+     * kernel could corrupt things.
+     */
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        if ( guest_handle_is_null(segments[i].buf.h) )
+            continue;
+
+        mstart = segments[i].dest_maddr;
+        mend = mstart + segments[i].dest_size;
+        /* Ensure we are within the crash kernel limits. */
+        if ( (mstart < kexec_crash_area.start )
+             || (mend > kexec_crash_area.start + kexec_crash_area.size))
+            return -EADDRNOTAVAIL;
+    }
+
+    /* Allocate and initialize a controlling structure. */
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_CRASH);
+}
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start,
+                                       paddr_t end)
+{
+    unsigned long i;
+
+    for ( i = 0; i < image->nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend = mstart + image->segments[i].dest_size;
+        if ( (end > mstart) && (start < mend) )
+            return 1;
+    }
+
+    return 0;
+}
+
+static void kimage_free_page_list(struct page_list_head *list)
+{
+    struct page_info *page, *next;
+
+    page_list_for_each_safe(page, next, list)
+    {
+        page_list_del(page, list);
+        free_domheap_page(page);
+    }
+}
+
+static struct page_info *kimage_alloc_normal_control_page(
+    struct kexec_image *image, unsigned memflags)
+{
+    /*
+     * Control pages are special: they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * At worst this runs in time linear in the image size.
+     */
+    struct page_list_head extra_pages;
+    struct page_info *page = NULL;
+
+    INIT_PAGE_LIST_HEAD(&extra_pages);
+
+    /*
+     * Loop while I can allocate a page and the page allocated is a
+     * destination page.
+     */
+    do {
+        unsigned long mfn, emfn;
+        paddr_t addr, eaddr;
+
+        page = kimage_alloc_zeroed_page(memflags);
+        if ( !page )
+            break;
+        mfn   = page_to_mfn(page);
+        emfn  = mfn + 1;
+        addr  = page_to_maddr(page);
+        eaddr = addr + PAGE_SIZE;
+        if ( kimage_is_destination_range(image, addr, eaddr) )
+        {
+            page_list_add(page, &extra_pages);
+            page = NULL;
+        }
+    } while ( !page );
+
+    if ( page )
+    {
+        /* Remember the allocated page... */
+        page_list_add(page, &image->control_pages);
+
+        /*
+         * Because the page is already in its destination location we
+         * will never allocate another page at that address.
+         * Therefore kimage_alloc_page will not return it (again) and
+         * we don't need to give it an entry in image->segments[].
+         */
+    }
+    /*
+     * Deal with the destination pages I have inadvertently allocated.
+     *
+     * Ideally I would convert multi-page allocations into single page
+     * allocations, and add everything to image->dest_pages.
+     *
+     * For now it is simpler to just free the pages.
+     */
+    kimage_free_page_list(&extra_pages);
+
+    return page;
+}
+
+static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
+{
+    /*
+     * Control pages are special: they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * Control pages are also the only pages we must allocate when
+     * loading a crash kernel.  All of the other pages are specified
+     * by the segments and we just memcpy into them directly.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * Given the low demand this implements a very simple allocator
+     * that finds the first hole of the appropriate size in the
+     * reserved memory region, and allocates all of the memory up to
+     * and including the hole.
+     */
+    paddr_t hole_start, hole_end;
+    struct page_info *page = NULL;
+
+    hole_start = PAGE_ALIGN(image->next_crash_page);
+    hole_end   = hole_start + PAGE_SIZE;
+    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
+    {
+        unsigned long i;
+
+        /* See if I overlap any of the segments. */
+        for ( i = 0; i < image->nr_segments; i++ )
+        {
+            paddr_t mstart, mend;
+
+            mstart = image->segments[i].dest_maddr;
+            mend   = mstart + image->segments[i].dest_size;
+            if ( (hole_end > mstart) && (hole_start < mend) )
+            {
+                /* Advance the hole to the end of the segment. */
+                hole_start = PAGE_ALIGN(mend);
+                hole_end   = hole_start + PAGE_SIZE;
+                break;
+            }
+        }
+        /* If I don't overlap any segments I have found my hole! */
+        if ( i == image->nr_segments )
+        {
+            page = maddr_to_page(hole_start);
+            break;
+        }
+    }
+    if ( page )
+    {
+        image->next_crash_page = hole_end;
+        clear_domain_page(page_to_mfn(page));
+    }
+
+    return page;
+}
+
+
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags)
+{
+    struct page_info *pages = NULL;
+
+    switch ( image->type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        pages = kimage_alloc_normal_control_page(image, memflags);
+        break;
+    case KEXEC_TYPE_CRASH:
+        pages = kimage_alloc_crash_control_page(image);
+        break;
+    }
+    return pages;
+}
+
+static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
+{
+    kimage_entry_t *entries;
+
+    if ( image->next_entry == KIMAGE_LAST_ENTRY )
+    {
+        struct page_info *page;
+
+        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
+        if ( !page )
+            return -ENOMEM;
+
+        entries = __map_domain_page(image->entry_page);
+        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
+        unmap_domain_page(entries);
+
+        image->entry_page = page;
+        image->next_entry = 0;
+    }
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = entry;
+    image->next_entry++;
+    unmap_domain_page(entries);
+
+    return 0;
+}
+
+static int kimage_set_destination(struct kexec_image *image,
+                                  paddr_t destination)
+{
+    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
+}
+
+
+static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
+{
+    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
+}
+
+
+static void kimage_free_extra_pages(struct kexec_image *image)
+{
+    kimage_free_page_list(&image->dest_pages);
+    kimage_free_page_list(&image->unusable_pages);
+}
+
+static void kimage_terminate(struct kexec_image *image)
+{
+    kimage_entry_t *entries;
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = IND_DONE;
+    unmap_domain_page(entries);
+}
+
+/*
+ * Iterate over all the entries in the indirection pages.
+ *
+ * Call unmap_domain_page(ptr) after the loop exits.
+ */
+#define for_each_kimage_entry(image, ptr, entry)                        \
+    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
+          (entry = *ptr) && !(entry & IND_DONE);                        \
+          ptr = (entry & IND_INDIRECTION) ?                             \
+              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
+              : ptr + 1 )
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+    struct page_info *page;
+
+    page = mfn_to_page(entry >> PAGE_SHIFT);
+    free_domheap_page(page);
+}
+
+static void kimage_free_all_entries(struct kexec_image *image)
+{
+    kimage_entry_t *ptr, entry;
+    kimage_entry_t ind = 0;
+
+    if ( !image->head )
+        return;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_INDIRECTION )
+        {
+            /* Free the previous indirection page */
+            if ( ind & IND_INDIRECTION )
+                kimage_free_entry(ind);
+            /* Save this indirection page until we are done with it. */
+            ind = entry;
+        }
+        else if ( entry & IND_SOURCE )
+            kimage_free_entry(entry);
+    }
+    unmap_domain_page(ptr);
+
+    /* Free the final indirection page. */
+    if ( ind & IND_INDIRECTION )
+        kimage_free_entry(ind);
+}
+
+void kimage_free(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    kimage_free_extra_pages(image);
+    kimage_free_all_entries(image);
+    kimage_free_page_list(&image->control_pages);
+    xfree(image->segments);
+    xfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
+                                       paddr_t maddr)
+{
+    kimage_entry_t *ptr, entry;
+    paddr_t destination = 0;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_DESTINATION )
+            destination = entry & PAGE_MASK;
+        else if ( entry & IND_SOURCE )
+        {
+            if ( maddr == destination )
+                return ptr;
+            destination += PAGE_SIZE;
+        }
+    }
+    unmap_domain_page(ptr);
+
+    return NULL;
+}
+
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t destination)
+{
+    /*
+     * Here we implement safeguards to ensure that a source page is
+     * not copied to its destination page before the data on the
+     * destination page is no longer useful.
+     *
+     * To do this we maintain the invariant that a source page is
+     * either its own destination page, or it is not a destination
+     * page at all.
+     *
+     * That is slightly stronger than required, but the proof that no
+     * problems will occur is trivial, and the implementation is
+     * simple to verify.
+     *
+     * When allocating all pages normally this algorithm will run in
+     * O(N) time, but in the worst case it will run in O(N^2) time.
+     * If the runtime is a problem the data structures can be fixed.
+     */
+    struct page_info *page;
+    paddr_t addr;
+
+    /*
+     * Walk through the list of destination pages, and see if I have a
+     * match.
+     */
+    page_list_for_each(page, &image->dest_pages)
+    {
+        addr = page_to_maddr(page);
+        if ( addr == destination )
+        {
+            page_list_del(page, &image->dest_pages);
+            return page;
+        }
+    }
+    page = NULL;
+    for (;;)
+    {
+        kimage_entry_t *old;
+
+        /* Allocate a page, if we run out of memory give up. */
+        page = kimage_alloc_zeroed_page(0);
+        if ( !page )
+            return NULL;
+        addr = page_to_maddr(page);
+
+        /* If it is the destination page we want use it. */
+        if ( addr == destination )
+            break;
+
+        /* If the page is not a destination page use it. */
+        if ( !kimage_is_destination_range(image, addr,
+                                          addr + PAGE_SIZE) )
+            break;
+
+        /*
+         * I know that the page is someone's destination page.  See if
+         * there is already a source page for this destination page.
+         * And if so swap the source pages.
+         */
+        old = kimage_dst_used(image, addr);
+        if ( old )
+        {
+            /* If so move it. */
+            unsigned long old_mfn = *old >> PAGE_SHIFT;
+            unsigned long mfn = addr >> PAGE_SHIFT;
+
+            copy_domain_page(mfn, old_mfn);
+            clear_domain_page(old_mfn);
+            *old = (addr & PAGE_MASK) | IND_SOURCE;
+            unmap_domain_page(old);
+
+            page = mfn_to_page(old_mfn);
+            break;
+        }
+        else
+        {
+            /*
+             * Place the page on the destination list; I will use it
+             * later.
+             */
+            page_list_add(page, &image->dest_pages);
+        }
+    }
+    return page;
+}
+
+static int kimage_load_normal_segment(struct kexec_image *image,
+                                      xen_kexec_segment_t *segment)
+{
+    unsigned long to_copy;
+    unsigned long src_offset;
+    paddr_t dest, end;
+    int ret;
+
+    to_copy = segment->buf_size;
+    src_offset = 0;
+    dest = segment->dest_maddr;
+
+    ret = kimage_set_destination(image, dest);
+    if ( ret < 0 )
+        return ret;
+
+    while ( to_copy )
+    {
+        unsigned long dest_mfn;
+        struct page_info *page;
+        void *dest_va;
+        size_t size;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        size = min_t(unsigned long, PAGE_SIZE, to_copy);
+
+        page = kimage_alloc_page(image, dest);
+        if ( !page )
+            return -ENOMEM;
+        ret = kimage_add_page(image, page_to_maddr(page));
+        if ( ret < 0 )
+            return ret;
+
+        dest_va = __map_domain_page(page);
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        to_copy -= size;
+        src_offset += size;
+        dest += PAGE_SIZE;
+    }
+
+    /* Remainder of the destination should be zeroed. */
+    end = segment->dest_maddr + segment->dest_size;
+    for ( ret = 0; ret == 0 && dest < end; dest += PAGE_SIZE )
+        ret = kimage_add_entry(image, IND_ZERO);
+
+    return ret;
+}
+
+static int kimage_load_crash_segment(struct kexec_image *image,
+                                     xen_kexec_segment_t *segment)
+{
+    /*
+     * For crash dump kernels we simply copy the data from the guest
+     * buffer to its destination.
+     */
+    paddr_t dest;
+    unsigned long sbytes, dbytes;
+    int ret = 0;
+    unsigned long src_offset = 0;
+
+    sbytes = segment->buf_size;
+    dbytes = segment->dest_size;
+    dest = segment->dest_maddr;
+
+    while ( dbytes )
+    {
+        unsigned long dest_mfn;
+        void *dest_va;
+        size_t schunk, dchunk;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        dchunk = PAGE_SIZE;
+        schunk = min(dchunk, sbytes);
+
+        dest_va = map_domain_page(dest_mfn);
+        if ( !dest_va )
+            return -EINVAL;
+
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
+        memset(dest_va + schunk, 0, dchunk - schunk);
+
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        dbytes -= dchunk;
+        sbytes -= schunk;
+        dest += dchunk;
+        src_offset += schunk;
+    }
+
+    return 0;
+}
+
+static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
+{
+    int result = 0; /* Segments with a NULL buf have nothing to copy. */
+
+    if ( !guest_handle_is_null(segment->buf.h) )
+    {
+        switch ( image->type )
+        {
+        case KEXEC_TYPE_DEFAULT:
+            result = kimage_load_normal_segment(image, segment);
+            break;
+        case KEXEC_TYPE_CRASH:
+            result = kimage_load_crash_segment(image, segment);
+            break;
+        }
+    }
+
+    return result;
+}
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment)
+{
+    int result;
+
+    switch ( type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    case KEXEC_TYPE_CRASH:
+        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    default:
+        result = -EINVAL;
+        break;
+    }
+    if ( result < 0 )
+        return result;
+
+    (*rimage)->arch = arch;
+
+    return result;
+}
+
+int kimage_load_segments(struct kexec_image *image)
+{
+    unsigned int s;
+    int result;
+
+    for ( s = 0; s < image->nr_segments; s++ ) {
+        result = kimage_load_segment(image, &image->segments[s]);
+        if ( result < 0 )
+            return result;
+    }
+    kimage_terminate(image);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
new file mode 100644
index 0000000..0ebd37a
--- /dev/null
+++ b/xen/include/xen/kimage.h
@@ -0,0 +1,62 @@
+#ifndef __XEN_KIMAGE_H__
+#define __XEN_KIMAGE_H__
+
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+#define IND_ZERO        0x10
+
+#ifndef __ASSEMBLY__
+
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <public/kexec.h>
+
+#define KEXEC_SEGMENT_MAX 16
+
+typedef paddr_t kimage_entry_t;
+
+struct kexec_image {
+    uint8_t type;
+    uint16_t arch;
+    uint64_t entry_maddr;
+    uint32_t nr_segments;
+    xen_kexec_segment_t *segments;
+
+    kimage_entry_t head;
+    struct page_info *entry_page;
+    unsigned next_entry;
+
+    struct page_info *control_code_page;
+    struct page_info *aux_page;
+
+    struct page_list_head control_pages;
+    struct page_list_head dest_pages;
+    struct page_list_head unusable_pages;
+
+    /* Address of next control page to allocate for crash kernels. */
+    paddr_t next_crash_page;
+};
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment);
+void kimage_free(struct kexec_image *image);
+int kimage_load_segments(struct kexec_image *image);
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __XEN_KIMAGE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.2.5
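
A rough userspace model of how a relocation stub could consume the indirection list built by kimage_add_entry() may help when reviewing this patch.  It is a sketch under stated assumptions, not the kexec_reloc assembly: machine memory is faked with a small array and every sim_* name is hypothetical.  The IND_* values are copied from kimage.h above, the walk stops on a zero entry as for_each_kimage_entry() does, and the IND_ZERO behaviour (zero the current destination page and advance) is an assumption inferred from kimage_load_normal_segment().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values as defined in xen/include/xen/kimage.h. */
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define IND_ZERO        0x10

#define SIM_PAGE_SHIFT 12
#define SIM_PAGE_SIZE  ((uint64_t)1 << SIM_PAGE_SHIFT)
#define SIM_PAGE_MASK  (~(SIM_PAGE_SIZE - 1))
#define SIM_NR_PAGES   8

typedef uint64_t kimage_entry_t;

/* Hypothetical fake machine memory: page N lives at maddr N << SIM_PAGE_SHIFT. */
static uint64_t sim_mem[SIM_NR_PAGES][SIM_PAGE_SIZE / sizeof(uint64_t)];

static void *sim_maddr_to_va(uint64_t maddr)
{
    assert((maddr >> SIM_PAGE_SHIFT) < SIM_NR_PAGES);
    return sim_mem[maddr >> SIM_PAGE_SHIFT];
}

/* Walk the list starting at head; returns the number of source pages copied. */
static int sim_relocate(kimage_entry_t head)
{
    kimage_entry_t *ptr = sim_maddr_to_va(head & SIM_PAGE_MASK);
    uint64_t dest = 0;
    int copied = 0;

    for ( ;; )
    {
        kimage_entry_t entry = *ptr;

        /* As in for_each_kimage_entry(), a zero entry also ends the walk. */
        if ( !entry || (entry & IND_DONE) )
            break;

        if ( entry & IND_INDIRECTION )
        {
            /* Continue with the first entry of the next entry page. */
            ptr = sim_maddr_to_va(entry & SIM_PAGE_MASK);
            continue;
        }

        if ( entry & IND_DESTINATION )
            dest = entry & SIM_PAGE_MASK;
        else if ( entry & IND_SOURCE )
        {
            memcpy(sim_maddr_to_va(dest), sim_maddr_to_va(entry & SIM_PAGE_MASK),
                   SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
            copied++;
        }
        else if ( entry & IND_ZERO )
        {
            /* Assumed semantics: zero the current destination page. */
            memset(sim_maddr_to_va(dest), 0, SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
        }
        ptr++;
    }
    return copied;
}

/* Hypothetical demo image: entry pages 0 and 3, source pages 1 and 2. */
static int sim_demo(void)
{
    kimage_entry_t *e0, *e3;

    memset(sim_mem, 0, sizeof(sim_mem));
    memset(sim_mem[1], 'A', SIM_PAGE_SIZE);
    memset(sim_mem[2], 'B', SIM_PAGE_SIZE);
    memset(sim_mem[7], 0xff, SIM_PAGE_SIZE);  /* stale data for IND_ZERO to clear */

    e0 = sim_maddr_to_va(0);
    e3 = sim_maddr_to_va(3 * SIM_PAGE_SIZE);
    e0[0] = (5 * SIM_PAGE_SIZE) | IND_DESTINATION;
    e0[1] = (1 * SIM_PAGE_SIZE) | IND_SOURCE;
    e0[2] = (3 * SIM_PAGE_SIZE) | IND_INDIRECTION;
    e3[0] = (2 * SIM_PAGE_SIZE) | IND_SOURCE;
    e3[1] = IND_ZERO;
    e3[2] = IND_DONE;

    return sim_relocate(0);
}
```

sim_demo() chains two entry pages through an IND_INDIRECTION entry and returns 2; afterwards page 5 holds the 'A' bytes, page 6 the 'B' bytes, and page 7 is zeroed.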


* [PATCH 3/9] kexec: add infrastructure for handling kexec images
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add the code needed to handle and load kexec images into Xen memory or
into the crash region.  This is needed for the new KEXEC_CMD_load and
KEXEC_CMD_unload hypercall sub-ops.

Much of this code is derived from the Linux kernel.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/Makefile      |    1 +
 xen/common/kimage.c      |  821 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/kimage.h |   62 ++++
 3 files changed, 884 insertions(+), 0 deletions(-)
 create mode 100644 xen/common/kimage.c
 create mode 100644 xen/include/xen/kimage.h
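
For review purposes, the first-fit search done by kimage_alloc_crash_control_page() can be modelled in plain C as below.  This is an illustrative sketch under assumptions, not the patch code: struct sim_segment and sim_find_crash_hole() are hypothetical names, a 4 KiB page size is assumed, and addresses are plain uint64_t values rather than paddr_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4 KiB pages; the patch uses Xen's PAGE_SIZE and PAGE_ALIGN. */
#define SIM_PAGE_SIZE     ((uint64_t)4096)
#define SIM_PAGE_ALIGN(a) (((a) + SIM_PAGE_SIZE - 1) & ~(SIM_PAGE_SIZE - 1))

/* Hypothetical stand-in for the dest_maddr/dest_size fields of a segment. */
struct sim_segment {
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/*
 * Return the first page-sized hole at or after 'next' inside
 * [area_start, area_start + area_size) that overlaps no segment,
 * or UINT64_MAX if the crash area has no free page left.
 */
static uint64_t sim_find_crash_hole(uint64_t area_start, uint64_t area_size,
                                    uint64_t next,
                                    const struct sim_segment *segs, size_t nr)
{
    uint64_t hole_start = SIM_PAGE_ALIGN(next);
    uint64_t hole_end = hole_start + SIM_PAGE_SIZE;

    while ( hole_end <= area_start + area_size )
    {
        size_t i;

        for ( i = 0; i < nr; i++ )
        {
            uint64_t mstart = segs[i].dest_maddr;
            uint64_t mend = mstart + segs[i].dest_size;

            /* do_kimage_alloc() has already rejected unaligned segments. */
            assert((mstart & (SIM_PAGE_SIZE - 1)) == 0);

            if ( hole_end > mstart && hole_start < mend )
            {
                /* Overlap: restart the search just past this segment. */
                hole_start = SIM_PAGE_ALIGN(mend);
                hole_end = hole_start + SIM_PAGE_SIZE;
                break;
            }
        }
        if ( i == nr )
            return hole_start;   /* no segment overlaps this page */
    }
    return UINT64_MAX;
}
```

With segments at 0x100000-0x103000 and 0x104000-0x105000 in a 64 KiB crash area starting at 0x100000, a search from the start of the area returns 0x103000.  On success the real allocator also records hole_end in image->next_crash_page, so the following search starts after the page just handed out.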

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 686f7a1..3683ae3 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -13,6 +13,7 @@ obj-y += irq.o
 obj-y += kernel.o
 obj-y += keyhandler.o
 obj-$(HAS_KEXEC) += kexec.o
+obj-$(HAS_KEXEC) += kimage.o
 obj-y += lib.o
 obj-y += memory.o
 obj-y += multicall.o
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
new file mode 100644
index 0000000..02ee37e
--- /dev/null
+++ b/xen/common/kimage.c
@@ -0,0 +1,821 @@
+/*
+ * Kexec Image
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Derived from kernel/kexec.c from Linux:
+ *
+ *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+#include <xen/errno.h>
+#include <xen/spinlock.h>
+#include <xen/guest_access.h>
+#include <xen/mm.h>
+#include <xen/kexec.h>
+#include <xen/kimage.h>
+
+#include <asm/page.h>
+
+/*
+ * When kexec transitions to the new kernel there is a one-to-one
+ * mapping between physical and virtual addresses.  On processors
+ * where you can disable the MMU this is trivial.  For
+ * others it is still a simple, predictable page table to set up.
+ *
+ * The code for the transition from the current kernel to the new
+ * kernel is placed in the page-sized control_code_page.  This memory
+ * must be identity mapped in the transition from virtual to physical
+ * addresses.
+ *
+ * The assembly stub in the control code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages.  As this data
+ * structure is not used in the context of the current OS, it must
+ * be self-contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it).  The end product of this is that most of the
+ * physical address space, and most of RAM, can be used.
+ *
+ * Future directions include:
+ *  - allocating a page table with the control code buffer identity
+ *    mapped, to simplify machine_kexec and make kexec_on_panic more
+ *    reliable.
+ */
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address, used when
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+/*
+ * Offset of the last entry in an indirection page.
+ */
+#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
+
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start, paddr_t end);
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t dest);
+
+static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
+{
+    struct page_info *page;
+
+    page = alloc_domheap_page(NULL, memflags);
+    if ( !page )
+        return NULL;
+
+    clear_domain_page(page_to_mfn(page));
+
+    return page;
+}
+
+static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
+                           unsigned long nr_segments,
+                           xen_kexec_segment_t *segments, uint8_t type)
+{
+    struct kexec_image *image;
+    unsigned long i;
+    int result;
+
+    /* Allocate a controlling structure */
+    result = -ENOMEM;
+    image = xzalloc(typeof(*image));
+    if ( !image )
+        goto out;
+
+    image->entry_maddr = entry;
+    image->type = type;
+    image->nr_segments = nr_segments;
+    image->segments = segments;
+
+    image->next_crash_page = kexec_crash_area.start;
+
+    INIT_PAGE_LIST_HEAD(&image->control_pages);
+    INIT_PAGE_LIST_HEAD(&image->dest_pages);
+    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
+
+    /*
+     * Verify we have good destination addresses.  The caller is
+     * responsible for making certain we don't attempt to load the new
+     * image into invalid or reserved areas of RAM.  This just
+     * verifies it is an address we can use.
+     *
+     * Since the kernel does everything in page-size chunks, ensure the
+     * destination addresses are page aligned.  Too many special cases
+     * crop up when we don't do this.  The most insidious is getting
+     * overlapping destination addresses simply because addresses are
+     * changed to page size granularity.
+     */
+    result = -EADDRNOTAVAIL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
+            goto out;
+    }
+
+    /*
+     * Verify our destination addresses do not overlap.  If we allowed
+     * overlapping destination addresses through, very weird things can
+     * happen with no easy explanation as one segment stomps on
+     * another.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+        unsigned long j;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        for ( j = 0; j < i; j++ )
+        {
+            paddr_t pstart, pend;
+            pstart = image->segments[j].dest_maddr;
+            pend   = pstart + image->segments[j].dest_size;
+            /* Do the segments overlap? */
+            if ( (mend > pstart) && (mstart < pend) )
+                goto out;
+        }
+    }
+
+    /*
+     * Ensure our buffer sizes are no larger than our memory
+     * sizes.  This should always be the case, and it is easier to
+     * check up front than to be surprised later on.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        if ( image->segments[i].buf_size > image->segments[i].dest_size )
+            goto out;
+    }
+
+    /*
+     * The page for the relocation code must remain accessible after
+     * the processor has switched to 32-bit mode.
+     */
+    result = -ENOMEM;
+    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
+    if ( !image->control_code_page )
+        goto out;
+
+    /* Add an empty indirection page. */
+    image->entry_page = kimage_alloc_control_page(image, 0);
+    if ( !image->entry_page )
+        goto out;
+
+    image->head = page_to_maddr(image->entry_page);
+
+    result = 0;
+out:
+    if ( result == 0 )
+        *rimage = image;
+    else if ( image )
+    {
+        image->segments = NULL; /* caller frees segments after an error */
+        kimage_free(image);
+    }
+
+    return result;
+
+}
+
+static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
+                               unsigned long nr_segments,
+                               xen_kexec_segment_t *segments)
+{
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_DEFAULT);
+}
+
+static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
+                              unsigned long nr_segments,
+                              xen_kexec_segment_t *segments)
+{
+    unsigned long i;
+    int result;
+
+    /* Verify we have a valid entry point */
+    if ( (entry < kexec_crash_area.start)
+         || (entry >= kexec_crash_area.start + kexec_crash_area.size))
+        return -EADDRNOTAVAIL;
+
+    /*
+     * Verify we have good destination addresses.  Normally
+     * the caller is responsible for making certain we don't
+     * attempt to load the new image into invalid or reserved
+     * areas of RAM.  But crash kernels are preloaded into a
+     * reserved area of RAM.  We must ensure the addresses
+     * are in the reserved area, otherwise preloading the
+     * kernel could corrupt things.
+     */
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        if ( guest_handle_is_null(segments[i].buf.h) )
+            continue;
+
+        mstart = segments[i].dest_maddr;
+        mend = mstart + segments[i].dest_size;
+        /* Ensure we are within the crash kernel limits. */
+        if ( (mstart < kexec_crash_area.start )
+             || (mend > kexec_crash_area.start + kexec_crash_area.size))
+            return -EADDRNOTAVAIL;
+    }
+
+    /* Allocate and initialize a controlling structure. */
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_CRASH);
+}
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start,
+                                       paddr_t end)
+{
+    unsigned long i;
+
+    for ( i = 0; i < image->nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend = mstart + image->segments[i].dest_size;
+        if ( (end > mstart) && (start < mend) )
+            return 1;
+    }
+
+    return 0;
+}
+
+static void kimage_free_page_list(struct page_list_head *list)
+{
+    struct page_info *page, *next;
+
+    page_list_for_each_safe(page, next, list)
+    {
+        page_list_del(page, list);
+        free_domheap_page(page);
+    }
+}
+
+static struct page_info *kimage_alloc_normal_control_page(
+    struct kexec_image *image, unsigned memflags)
+{
+    /*
+     * Control pages are special: they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * At worst this runs in time linear in the image size.
+     */
+    struct page_list_head extra_pages;
+    struct page_info *page = NULL;
+
+    INIT_PAGE_LIST_HEAD(&extra_pages);
+
+    /*
+     * Loop while I can allocate a page and the page allocated is a
+     * destination page.
+     */
+    do {
+        unsigned long mfn, emfn;
+        paddr_t addr, eaddr;
+
+        page = kimage_alloc_zeroed_page(memflags);
+        if ( !page )
+            break;
+        mfn   = page_to_mfn(page);
+        emfn  = mfn + 1;
+        addr  = page_to_maddr(page);
+        eaddr = addr + PAGE_SIZE;
+        if ( kimage_is_destination_range(image, addr, eaddr) )
+        {
+            page_list_add(page, &extra_pages);
+            page = NULL;
+        }
+    } while ( !page );
+
+    if ( page )
+    {
+        /* Remember the allocated page... */
+        page_list_add(page, &image->control_pages);
+
+        /*
+         * Because the page is already in its destination location we
+         * will never allocate another page at that address.
+         * Therefore kimage_alloc_page will not return it (again) and
+         * we don't need to give it an entry in image->segments[].
+         */
+    }
+    /*
+     * Deal with the destination pages I have inadvertently allocated.
+     *
+     * Ideally I would convert multi-page allocations into single page
+     * allocations, and add everything to image->dest_pages.
+     *
+     * For now it is simpler to just free the pages.
+     */
+    kimage_free_page_list(&extra_pages);
+
+    return page;
+}
+
+static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
+{
+    /*
+     * Control pages are special, they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * Control pages are also the only pages we must allocate when
+     * loading a crash kernel.  All of the other pages are specified
+     * by the segments and we just memcpy into them directly.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * Given the low demand this implements a very simple allocator
+     * that finds the first hole of the appropriate size in the
+     * reserved memory region, and allocates all of the memory up to
+     * and including the hole.
+     */
+    paddr_t hole_start, hole_end;
+    struct page_info *page = NULL;
+
+    hole_start = PAGE_ALIGN(image->next_crash_page);
+    hole_end   = hole_start + PAGE_SIZE;
+    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
+    {
+        unsigned long i;
+
+        /* See if I overlap any of the segments. */
+        for ( i = 0; i < image->nr_segments; i++ )
+        {
+            paddr_t mstart, mend;
+
+            mstart = image->segments[i].dest_maddr;
+            mend   = mstart + image->segments[i].dest_size;
+            if ( (hole_end > mstart) && (hole_start < mend) )
+            {
+                /* Advance the hole to the end of the segment. */
+                hole_start = PAGE_ALIGN(mend);
+                hole_end   = hole_start + PAGE_SIZE;
+                break;
+            }
+        }
+        /* If I don't overlap any segments I have found my hole! */
+        if ( i == image->nr_segments )
+        {
+            page = maddr_to_page(hole_start);
+            break;
+        }
+    }
+    if ( page )
+    {
+        image->next_crash_page = hole_end;
+        clear_domain_page(page_to_mfn(page));
+    }
+
+    return page;
+}
+
+
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags)
+{
+    struct page_info *pages = NULL;
+
+    switch ( image->type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        pages = kimage_alloc_normal_control_page(image, memflags);
+        break;
+    case KEXEC_TYPE_CRASH:
+        pages = kimage_alloc_crash_control_page(image);
+        break;
+    }
+    return pages;
+}
+
+static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
+{
+    kimage_entry_t *entries;
+
+    if ( image->next_entry == KIMAGE_LAST_ENTRY )
+    {
+        struct page_info *page;
+
+        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
+        if ( !page )
+            return -ENOMEM;
+
+        entries = __map_domain_page(image->entry_page);
+        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
+        unmap_domain_page(entries);
+
+        image->entry_page = page;
+        image->next_entry = 0;
+    }
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = entry;
+    image->next_entry++;
+    unmap_domain_page(entries);
+
+    return 0;
+}
+
+static int kimage_set_destination(struct kexec_image *image,
+                                  paddr_t destination)
+{
+    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
+}
+
+
+static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
+{
+    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
+}
+
+
+static void kimage_free_extra_pages(struct kexec_image *image)
+{
+    kimage_free_page_list(&image->dest_pages);
+    kimage_free_page_list(&image->unusable_pages);
+}
+
+static void kimage_terminate(struct kexec_image *image)
+{
+    kimage_entry_t *entries;
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = IND_DONE;
+    unmap_domain_page(entries);
+}
+
+/*
+ * Iterate over all the entries in the indirection pages.
+ *
+ * Call unmap_domain_page(ptr) after the loop exits.
+ */
+#define for_each_kimage_entry(image, ptr, entry)                        \
+    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
+          (entry = *ptr) && !(entry & IND_DONE);                        \
+          ptr = (entry & IND_INDIRECTION) ?                             \
+              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
+              : ptr + 1 )
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+    struct page_info *page;
+
+    page = mfn_to_page(entry >> PAGE_SHIFT);
+    free_domheap_page(page);
+}
+
+static void kimage_free_all_entries(struct kexec_image *image)
+{
+    kimage_entry_t *ptr, entry;
+    kimage_entry_t ind = 0;
+
+    if ( !image->head )
+        return;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_INDIRECTION )
+        {
+            /* Free the previous indirection page */
+            if ( ind & IND_INDIRECTION )
+                kimage_free_entry(ind);
+            /* Save this indirection page until we are done with it. */
+            ind = entry;
+        }
+        else if ( entry & IND_SOURCE )
+            kimage_free_entry(entry);
+    }
+    unmap_domain_page(ptr);
+
+    /* Free the final indirection page. */
+    if ( ind & IND_INDIRECTION )
+        kimage_free_entry(ind);
+}
+
+void kimage_free(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    kimage_free_extra_pages(image);
+    kimage_free_all_entries(image);
+    kimage_free_page_list(&image->control_pages);
+    xfree(image->segments);
+    xfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
+                                       paddr_t maddr)
+{
+    kimage_entry_t *ptr, entry;
+    unsigned long destination = 0;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_DESTINATION )
+            destination = entry & PAGE_MASK;
+        else if ( entry & IND_SOURCE )
+        {
+            if ( maddr == destination )
+                return ptr;
+            destination += PAGE_SIZE;
+        }
+    }
+    unmap_domain_page(ptr);
+
+    return NULL;
+}
+
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t destination)
+{
+    /*
+     * Here we implement safeguards to ensure that a source page is
+     * not copied to its destination page before the data on the
+     * destination page is no longer useful.
+     *
+     * To do this we maintain the invariant that a source page is
+     * either its own destination page, or it is not a destination
+     * page at all.
+     *
+     * That is slightly stronger than required, but the proof that no
+     * problems will occur is trivial, and the implementation is
+     * simple to verify.
+     *
+     * When allocating all pages normally this algorithm will run in
+     * O(N) time, but in the worst case it will run in O(N^2) time.
+     * If the runtime is a problem the data structures can be fixed.
+     */
+    struct page_info *page;
+    paddr_t addr;
+
+    /*
+     * Walk through the list of destination pages, and see if I have a
+     * match.
+     */
+    page_list_for_each(page, &image->dest_pages)
+    {
+        addr = page_to_maddr(page);
+        if ( addr == destination )
+        {
+            page_list_del(page, &image->dest_pages);
+            return page;
+        }
+    }
+    page = NULL;
+    for (;;)
+    {
+        kimage_entry_t *old;
+
+        /* Allocate a page, if we run out of memory give up. */
+        page = kimage_alloc_zeroed_page(0);
+        if ( !page )
+            return NULL;
+        addr = page_to_maddr(page);
+
+        /* If it is the destination page we want, use it. */
+        if ( addr == destination )
+            break;
+
+        /* If the page is not a destination page use it. */
+        if ( !kimage_is_destination_range(image, addr,
+                                          addr + PAGE_SIZE) )
+            break;
+
+        /*
+         * I know that the page is someone's destination page.  See if
+         * there is already a source page for this destination page.
+         * And if so swap the source pages.
+         */
+        old = kimage_dst_used(image, addr);
+        if ( old )
+        {
+            /* If so move it. */
+            unsigned long old_mfn = *old >> PAGE_SHIFT;
+            unsigned long mfn = addr >> PAGE_SHIFT;
+
+            copy_domain_page(mfn, old_mfn);
+            clear_domain_page(old_mfn);
+            *old = (addr & PAGE_MASK) | IND_SOURCE;
+            unmap_domain_page(old);
+
+            page = mfn_to_page(old_mfn);
+            break;
+        }
+        else
+        {
+            /*
+             * Place the page on the destination list; I will use it
+             * later.
+             */
+            page_list_add(page, &image->dest_pages);
+        }
+    }
+    return page;
+}
+
+static int kimage_load_normal_segment(struct kexec_image *image,
+                                      xen_kexec_segment_t *segment)
+{
+    unsigned long to_copy;
+    unsigned long src_offset;
+    paddr_t dest, end;
+    int ret;
+
+    to_copy = segment->buf_size;
+    src_offset = 0;
+    dest = segment->dest_maddr;
+
+    ret = kimage_set_destination(image, dest);
+    if ( ret < 0 )
+        return ret;
+
+    while ( to_copy )
+    {
+        unsigned long dest_mfn;
+        struct page_info *page;
+        void *dest_va;
+        size_t size;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        size = min_t(unsigned long, PAGE_SIZE, to_copy);
+
+        page = kimage_alloc_page(image, dest);
+        if ( !page )
+            return -ENOMEM;
+        ret = kimage_add_page(image, page_to_maddr(page));
+        if ( ret < 0 )
+            return ret;
+
+        dest_va = __map_domain_page(page);
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        to_copy -= size;
+        src_offset += size;
+        dest += PAGE_SIZE;
+    }
+
+    /* Remainder of the destination should be zeroed. */
+    end = segment->dest_maddr + segment->dest_size;
+    for ( ; dest < end && ret == 0; dest += PAGE_SIZE )
+        ret = kimage_add_entry(image, IND_ZERO);
+
+    return ret;
+}
+
+static int kimage_load_crash_segment(struct kexec_image *image,
+                                     xen_kexec_segment_t *segment)
+{
+    /*
+     * For crash dump kernels we simply copy the data from the guest
+     * to its destination.
+     */
+    paddr_t dest;
+    unsigned long sbytes, dbytes;
+    int ret = 0;
+    unsigned long src_offset = 0;
+
+    sbytes = segment->buf_size;
+    dbytes = segment->dest_size;
+    dest = segment->dest_maddr;
+
+    while ( dbytes )
+    {
+        unsigned long dest_mfn;
+        void *dest_va;
+        size_t schunk, dchunk;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        dchunk = PAGE_SIZE;
+        schunk = min(dchunk, sbytes);
+
+        dest_va = map_domain_page(dest_mfn);
+        if ( !dest_va )
+            return -EINVAL;
+
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
+        memset(dest_va + schunk, 0, dchunk - schunk);
+
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        dbytes -= dchunk;
+        sbytes -= schunk;
+        dest += dchunk;
+        src_offset += schunk;
+    }
+
+    return 0;
+}
+
+static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
+{
+    int result = -ENOMEM;
+
+    if ( !guest_handle_is_null(segment->buf.h) )
+    {
+        switch ( image->type )
+        {
+        case KEXEC_TYPE_DEFAULT:
+            result = kimage_load_normal_segment(image, segment);
+            break;
+        case KEXEC_TYPE_CRASH:
+            result = kimage_load_crash_segment(image, segment);
+            break;
+        }
+    }
+
+    return result;
+}
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment)
+{
+    int result;
+
+    switch ( type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    case KEXEC_TYPE_CRASH:
+        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    default:
+        result = -EINVAL;
+        break;
+    }
+    if ( result < 0 )
+        return result;
+
+    (*rimage)->arch = arch;
+
+    return result;
+}
+
+int kimage_load_segments(struct kexec_image *image)
+{
+    int s;
+    int result;
+
+    for ( s = 0; s < image->nr_segments; s++ ) {
+        result = kimage_load_segment(image, &image->segments[s]);
+        if ( result < 0 )
+            return result;
+    }
+    kimage_terminate(image);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
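The first-fit search in kimage_alloc_crash_control_page() can be tried outside Xen. The following is a minimal userspace sketch, assuming plain uint64_t machine addresses and a hypothetical struct sketch_segment in place of xen_kexec_segment_t; it returns an address rather than a struct page_info pointer and does not clear the page.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_ALIGN(x) \
    (((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical stand-in for the dest_maddr/dest_size fields of a segment. */
struct sketch_segment {
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/*
 * First-fit search for a free page in the crash area, starting at *next
 * and ending at area_end (exclusive).  On success store the hole in *hole,
 * advance *next past it and return 1; return 0 if the area is exhausted.
 */
static int find_control_hole(uint64_t *next, uint64_t area_end,
                             const struct sketch_segment *segs, size_t nr,
                             uint64_t *hole)
{
    uint64_t hole_start = SKETCH_PAGE_ALIGN(*next);
    uint64_t hole_end = hole_start + SKETCH_PAGE_SIZE;

    while ( hole_end <= area_end )
    {
        size_t i;

        for ( i = 0; i < nr; i++ )
        {
            uint64_t mstart = segs[i].dest_maddr;
            uint64_t mend = mstart + segs[i].dest_size;

            if ( hole_end > mstart && hole_start < mend )
            {
                /* Overlap: move the candidate just past this segment. */
                hole_start = SKETCH_PAGE_ALIGN(mend);
                hole_end = hole_start + SKETCH_PAGE_SIZE;
                break;
            }
        }
        if ( i == nr )
        {
            /* No segment overlaps: everything up to hole_end is consumed. */
            *hole = hole_start;
            *next = hole_end;
            return 1;
        }
    }
    return 0;
}
```

Because *next only moves forward, pages skipped while looking for a hole are never handed out later, which is the "allocates all of the memory up to and including the hole" behaviour the comment describes.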
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
new file mode 100644
index 0000000..0ebd37a
--- /dev/null
+++ b/xen/include/xen/kimage.h
@@ -0,0 +1,62 @@
+#ifndef __XEN_KIMAGE_H__
+#define __XEN_KIMAGE_H__
+
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+#define IND_ZERO        0x10
+
+#ifndef __ASSEMBLY__
+
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <public/kexec.h>
+
+#define KEXEC_SEGMENT_MAX 16
+
+typedef paddr_t kimage_entry_t;
+
+struct kexec_image {
+    uint8_t type;
+    uint16_t arch;
+    uint64_t entry_maddr;
+    uint32_t nr_segments;
+    xen_kexec_segment_t *segments;
+
+    kimage_entry_t head;
+    struct page_info *entry_page;
+    unsigned next_entry;
+
+    struct page_info *control_code_page;
+    struct page_info *aux_page;
+
+    struct page_list_head control_pages;
+    struct page_list_head dest_pages;
+    struct page_list_head unusable_pages;
+
+    /* Address of next control page to allocate for crash kernels. */
+    paddr_t next_crash_page;
+};
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment);
+void kimage_free(struct kexec_image *image);
+int kimage_load_segments(struct kexec_image *image);
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __XEN_KIMAGE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
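kimage_alloc_page() keeps the invariant that a source page is either its own destination page or not a destination page at all. A minimal sketch of that check, assuming destination ranges are supplied as hypothetical [start, end) pairs rather than derived from image->segments:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical destination range [start, end) of one segment. */
struct sketch_range {
    uint64_t start;
    uint64_t end;
};

/* Does the page [addr, addr + PAGE_SIZE) overlap any destination range? */
static int is_destination_page(uint64_t addr, const struct sketch_range *dst,
                               size_t nr)
{
    size_t i;

    for ( i = 0; i < nr; i++ )
        if ( addr + SKETCH_PAGE_SIZE > dst[i].start && addr < dst[i].end )
            return 1;
    return 0;
}

/* May the data destined for 'dest' be staged in the page at 'src'? */
static int source_page_allowed(uint64_t src, uint64_t dest,
                               const struct sketch_range *dst, size_t nr)
{
    return src == dest || !is_destination_page(src, dst, nr);
}
```

If every source entry satisfies source_page_allowed(), no copy can overwrite a source page that has not been copied yet, whatever order the pages are processed in. The real allocator enforces this by parking unsuitable pages on image->dest_pages or swapping contents via kimage_dst_used().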
-- 
1.7.2.5


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
@ 2013-11-06 14:49 David Vrabel
From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image, so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image, which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).
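On x86 the architecture check reduces to choosing whether kexec_reloc must drop to 32-bit mode before entering the image. A minimal sketch of that mapping, using the EM_* values from the ELF specification and a hypothetical value for KEXEC_RELOC_FLAG_COMPAT, whose real definition is not shown here:

```c
#include <errno.h>

/* ELF e_machine values from the ELF specification. */
#define EM_386     3
#define EM_X86_64 62

/* Hypothetical value; only the name is taken from the patch. */
#define KEXEC_RELOC_FLAG_COMPAT 0x1

/* Map an image architecture to relocation flags, rejecting anything else. */
static int reloc_flags_for_arch(unsigned arch, unsigned long *flags)
{
    switch ( arch )
    {
    case EM_386:
        *flags = KEXEC_RELOC_FLAG_COMPAT;  /* enter the image in 32-bit mode */
        return 0;
    case EM_X86_64:
        *flags = 0;                        /* stay in 64-bit mode */
        return 0;
    default:
        return -EINVAL;
    }
}
```

In the patch itself these two steps are split: machine_kexec_load() rejects unsupported architectures at load time, and machine_kexec() sets the compat flag just before calling kexec_reloc().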

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
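To make the list format concrete, here is a small userspace simulation of building such a list and replaying it. It is a sketch only: it assumes 64-byte "pages" in a flat array standing in for machine memory, reuses the IND_* flag values defined in xen/include/xen/kimage.h, assumes an IND_ZERO entry zeroes the current destination page and advances it, and the replay loop illustrates the entry semantics rather than transcribing kexec_reloc.S.

```c
#include <stdint.h>
#include <string.h>

/* Tiny simulated pages so the example stays small (hypothetical sizes). */
#define PG               64
#define NR_PAGES         16
#define ENTRIES_PER_PAGE (PG / sizeof(uint64_t))

#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define IND_ZERO         0x10

static unsigned char mem[NR_PAGES * PG];   /* simulated machine memory */
static uint64_t entry_page, next_entry;    /* current indirection page/slot */
static uint64_t next_free_page;            /* bump allocator for new pages */

static void put_entry(uint64_t page, uint64_t slot, uint64_t e)
{
    memcpy(&mem[page + slot * sizeof(e)], &e, sizeof(e));
}

/* Like kimage_add_entry(): the last slot of each page chains to the next. */
static void add_entry(uint64_t e)
{
    if ( next_entry == ENTRIES_PER_PAGE - 1 )
    {
        uint64_t page = next_free_page;    /* no out-of-memory handling */

        next_free_page += PG;
        put_entry(entry_page, next_entry, page | IND_INDIRECTION);
        entry_page = page;
        next_entry = 0;
    }
    put_entry(entry_page, next_entry++, e);
}

/* Like kimage_terminate(). */
static void terminate(void)
{
    put_entry(entry_page, next_entry, IND_DONE);
}

/* Walk the list starting at 'head', copying or zeroing each page. */
static void relocate(uint64_t head)
{
    uint64_t pos = head, dest = 0;

    for ( ;; )
    {
        uint64_t e, addr;

        memcpy(&e, &mem[pos], sizeof(e));
        pos += sizeof(e);
        addr = e & ~(uint64_t)(PG - 1);

        if ( e == 0 || (e & IND_DONE) )
            break;
        if ( e & IND_INDIRECTION )
            pos = addr;                    /* continue on the next page */
        else if ( e & IND_DESTINATION )
            dest = addr;                   /* start a new destination run */
        else if ( e & IND_SOURCE )
        {
            memmove(&mem[dest], &mem[addr], PG);
            dest += PG;
        }
        else if ( e & IND_ZERO )
        {
            memset(&mem[dest], 0, PG);
            dest += PG;
        }
    }
}
```

Because the flags live in the low bits of page-aligned addresses, a single word per entry is enough, and one slot per page is reserved so the list can grow across pages without a separate length field.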

The old load and unload sub-ops are still available (as
KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
of the new infrastructure.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
 xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  122 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 804 insertions(+), 336 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
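machine_kexec_add_page() walks the four levels of the x86-64 page tables, allocating a control page for any missing intermediate table. The sketch below reproduces that walk in userspace, assuming a fixed pool of simulated tables where an entry stores the next table's index instead of a machine address; table_offset() computes the same 9-bit indices as l4_table_offset() through l1_table_offset().

```c
#include <stdint.h>

#define ENTRIES    512
#define PRESENT    0x1ULL
#define MAX_TABLES 16

/* Simulated page-table pages; tables[0] plays the role of the L4 page. */
static uint64_t tables[MAX_TABLES][ENTRIES];
static unsigned nr_tables = 1;

/* Index into the level-N table: 9 bits per level above the 12-bit offset. */
static unsigned table_offset(uint64_t va, int level)
{
    return (unsigned)(va >> (12 + 9 * (level - 1))) & (ENTRIES - 1);
}

/* Map va to maddr, creating missing L3/L2/L1 tables; -1 if out of tables. */
static int add_page(uint64_t va, uint64_t maddr)
{
    unsigned t = 0;
    int level;

    for ( level = 4; level > 1; level-- )
    {
        uint64_t *e = &tables[t][table_offset(va, level)];

        if ( !(*e & PRESENT) )
        {
            if ( nr_tables == MAX_TABLES )
                return -1;
            /* The "address" of the new table is just its index, shifted. */
            *e = ((uint64_t)nr_tables++ << 12) | PRESENT;
        }
        t = (unsigned)(*e >> 12);
    }
    tables[t][table_offset(va, 1)] = (maddr & ~0xfffULL) | PRESENT;
    return 0;
}
```

A second mapping in the same 2 MiB region reuses all three intermediate tables, which is why the patch only needs a handful of control pages for the code page and any extra segment mappings.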
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..7a16c85
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,198 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr (%rbp) */
+        /* %r8 - flags */
+
+        movq    %rcx, %rbp
+
+        /* Set up stack. */
+        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq    %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq    *%rax
+
+identity_mapped:
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq    %rdx, %rdi
+        call    relocate_pages
+
+        /* Need to switch to 32-bit mode? */
+        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz     call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq   *%rbp
+        ud2
+
+call_32_bit:
+        /* Set up IDT. */
+        lidt    compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq    compat_mode_gdt(%rip), %rax
+        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt    compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal    compatibility_mode(%rip), %eax
+        movl    %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp    *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        pushq   %rbx
+
+        cld
+        movq    %rdi, %rbx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        popq    %rbx
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Set up some sane segments. */
+        movl    $0x0008, %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %fs
+        movl    %eax, %gs
+        movl    %eax, %ss
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl    %cr0, %eax
+        andl    $~X86_CR0_PG, %eax
+        movl    %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call    *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..c5450ba 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
 
 static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
 
-static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
-
 static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
 static size_t vmcoreinfo_size = 0;
 
@@ -311,14 +310,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
 {
-    xen_kexec_image_t *image;
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
+{
+    static DEFINE_SPINLOCK(kexec_lock);
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
+
+    spin_lock(&kexec_lock);
+
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret = -ENOMEM;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
 
-        crash_save_vmcoreinfo();
+static int kexec_segments_add_segment(
+    unsigned int *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    unsigned int n = *nr_segments;
+
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned int *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            entry = page = map_domain_page(mfn);
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            if ( *nr_segments == 0 )
+            {
+                ret = -EINVAL;
+                goto done;
+            }
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
 
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned int nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  At worst, this may leave the
+     * resulting image unrelocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( ! kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 02ee37e..10fb785 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        goto out;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        goto out;
 
     image->head = page_to_maddr(image->entry_page);
 
@@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5


* [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
@ 2013-11-06 14:49   ` David Vrabel
From: David Vrabel @ 2013-11-06 14:49 UTC
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel).  The code page is used to transition between Xen context
and the image, so using kernel code doesn't make sense and will not
work for PVH guests.

Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.

The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded.  The guest
specifies the architecture of the image, which may be a 32-bit subarch
of the hypervisor's architecture (i.e., an EM_386 image on an
EM_X86_64 hypervisor).
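
As a rough illustration of how the architecture field might be consumed,
the sketch below maps the ELF machine number from a load request onto the
flags passed to kexec_reloc().  EM_386 and EM_X86_64 are the standard ELF
values and KEXEC_RELOC_FLAG_COMPAT mirrors asm-x86/machine_kexec.h; the
helper reloc_flags_for_arch() is hypothetical, not the hypervisor's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Standard ELF e_machine values. */
#define EM_386     3
#define EM_X86_64  62

/* Mirrors the definition in asm-x86/machine_kexec.h. */
#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */

/*
 * Hypothetical helper: choose the kexec_reloc() flags for an image of
 * the given architecture.  Returns 0 on success, -EINVAL if an x86-64
 * hypervisor cannot enter an image of that architecture.
 */
static int reloc_flags_for_arch(uint16_t arch, unsigned long *flags)
{
    switch ( arch )
    {
    case EM_X86_64:
        *flags = 0;                       /* call entry point in 64-bit mode */
        return 0;
    case EM_386:
        *flags = KEXEC_RELOC_FLAG_COMPAT; /* leave long mode before the call */
        return 0;
    default:
        return -EINVAL;
    }
}
```

With the compat flag set, kexec_reloc takes its call_32_bit path, which
switches to compatibility mode, disables paging and long mode, and then
calls the entry point.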

The toolstack can now load images without kernel involvement.  This is
required for supporting kexec when using a dom0 with an upstream
kernel.

Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created.  This list is used in
kexec_reloc() to relocate the image to its destination.
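
A minimal C sketch of the walk that relocate_pages in kexec_reloc
performs over this list.  It assumes the Linux kexec flag values
(IND_DESTINATION 0x1, IND_INDIRECTION 0x2, IND_DONE 0x4, IND_SOURCE 0x8),
models machine memory as a small byte array, and omits the IND_ZERO case.
relocate_pages_sketch(), read_entry() and write_entry() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Entry flags: assumed to match the Linux kexec values. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

/* Toy machine memory: machine address maddr is mem[maddr]. */
#define NR_PAGES 16
static uint8_t mem[NR_PAGES * PAGE_SIZE];

static uint64_t read_entry(uint64_t maddr)
{
    uint64_t e;
    memcpy(&e, &mem[maddr], sizeof(e)); /* byte array may be unaligned */
    return e;
}

static void write_entry(uint64_t maddr, uint64_t e)
{
    memcpy(&mem[maddr], &e, sizeof(e));
}

/* Copy every source page to its destination, following the list. */
static void relocate_pages_sketch(uint64_t ind_maddr)
{
    uint64_t entry = ind_maddr; /* address of the next list entry */
    uint64_t dest = 0;          /* current destination page */

    for ( ;; )
    {
        uint64_t e = read_entry(entry);
        entry += sizeof(e);

        if ( e & IND_DESTINATION )
            dest = e & PAGE_MASK;
        else if ( e & IND_INDIRECTION )
            entry = e & PAGE_MASK;      /* list continues on another page */
        else if ( e & IND_DONE )
            return;
        else if ( e & IND_SOURCE )
        {
            assert(dest + PAGE_SIZE <= sizeof(mem));
            memcpy(&mem[dest], &mem[e & PAGE_MASK], PAGE_SIZE);
            dest += PAGE_SIZE;          /* like rep movsq advancing %rdi */
        }
    }
}
```

The destination advances by one page after every copy, so a destination
entry followed by N source entries fills N consecutive pages.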

The old load and unload sub-ops are still available (as
KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are implemented on top
of the new infrastructure.
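
For the v1 load op, the destination ranges are first recovered from the
guest's indirection pages so that newly allocated pages do not overlap
them.  The sketch below illustrates the coalescing idea behind
kexec_segments_add_segment(): a destination that continues the previous
segment extends it, any other destination opens a new segment, and each
source page grows the current segment by one page.  The flattened entry
list, SEGMENT_MAX and all names here are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define SEGMENT_MAX 16          /* stand-in for KEXEC_SEGMENT_MAX */

struct seg { uint64_t dest_maddr, dest_size; };

/* Flattened view of a guest indirection list (hypothetical). */
enum kind { DEST, SOURCE, DONE };
struct entry { enum kind kind; uint64_t maddr; };

/* Open a new segment at maddr unless it continues the previous one. */
static int add_dest(struct seg *segs, unsigned int *nr, uint64_t maddr)
{
    unsigned int n = *nr;

    if ( n == 0 || segs[n - 1].dest_maddr + segs[n - 1].dest_size != maddr )
    {
        if ( n == SEGMENT_MAX )
            return -1;          /* too many discontiguous ranges */
        segs[n].dest_maddr = maddr;
        segs[n].dest_size = 0;
        *nr = n + 1;
    }
    return 0;
}

static int segments_from_entries(const struct entry *e, struct seg *segs,
                                 unsigned int *nr)
{
    *nr = 0;
    for ( ; e->kind != DONE; e++ )
    {
        if ( e->kind == DEST )
        {
            if ( add_dest(segs, nr, e->maddr) < 0 )
                return -1;
        }
        else if ( *nr == 0 )
            return -1;          /* source page with no destination */
        else
            segs[*nr - 1].dest_size += PAGE_SIZE;
    }
    return 0;
}
```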

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
 xen/arch/x86/x86_64/Makefile        |    2 +-
 xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
 xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
 xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
 xen/common/kimage.c                 |  122 +++++++++++-
 xen/include/asm-x86/fixmap.h        |    3 -
 xen/include/asm-x86/machine_kexec.h |   16 ++
 xen/include/xen/kexec.h             |   16 +-
 xen/include/xen/kimage.h            |    6 +
 10 files changed, 804 insertions(+), 336 deletions(-)
 delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
 create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
 create mode 100644 xen/include/asm-x86/machine_kexec.h
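
machine_kexec_add_page(), in the diff below, builds the page tables used
during kexec one page at a time, allocating an L3, L2 or L1 table only
when the entry above it is not present; its callers pass vaddr == maddr,
so the result is an identity mapping.  The toy model below assumes
standard x86-64 4-level paging (512 entries per table, 9 index bits per
level above the 12-bit page offset) and uses calloc() in place of
kimage_alloc_control_page(); all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define PT_ORDER   9
#define PT_ENTRIES (1u << PT_ORDER)

/* Index into the table at `level` (4 = L4 ... 1 = L1) for vaddr. */
static unsigned int table_offset(uint64_t vaddr, int level)
{
    return (unsigned int)((vaddr >> (PAGE_SHIFT + PT_ORDER * (level - 1)))
                          & (PT_ENTRIES - 1));
}

/* Toy table: upper levels use child[], the L1 table uses leaf[]. */
struct pt {
    struct pt *child[PT_ENTRIES];
    uint64_t   leaf[PT_ENTRIES];
};

static unsigned int tables_allocated;

/* Map vaddr -> maddr, allocating lower-level tables only when absent. */
static int add_page(struct pt *l4, uint64_t vaddr, uint64_t maddr)
{
    struct pt *t = l4;

    for ( int level = 4; level > 1; level-- )
    {
        struct pt **slot = &t->child[table_offset(vaddr, level)];

        if ( !*slot )
        {
            *slot = calloc(1, sizeof(**slot));
            if ( !*slot )
                return -1;
            tables_allocated++;
        }
        t = *slot;
    }
    t->leaf[table_offset(vaddr, 1)] = maddr;
    return 0;
}

static void free_tables(struct pt *t, int level)
{
    if ( level > 1 )
        for ( unsigned int i = 0; i < PT_ENTRIES; i++ )
            if ( t->child[i] )
                free_tables(t->child[i], level - 1);
    free(t);
}
```

Mapping a second page in the same 2 MiB region reuses all three
lower-level tables, so only the first mapping in a region allocates
new table pages.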

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index 68b9705..b70d5a6 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -1,9 +1,18 @@
 /******************************************************************************
  * machine_kexec.c
  *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
  * Xen port written by:
  * - Simon 'Horms' Horman <horms@verge.net.au>
  * - Magnus Damm <magnus@valinux.co.jp>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
  */
 
 #include <xen/types.h>
@@ -11,63 +20,124 @@
 #include <xen/guest_access.h>
 #include <asm/fixmap.h>
 #include <asm/hpet.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
 
-typedef void (*relocate_new_kernel_t)(
-                unsigned long indirection_page,
-                unsigned long *page_list,
-                unsigned long start_address,
-                unsigned int preserve_context);
-
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
+/*
+ * Add a mapping for a page to the page tables used during kexec.
+ */
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr)
 {
-    unsigned long prev_ma = 0;
-    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
-    int k;
+    struct page_info *l4_page;
+    struct page_info *l3_page;
+    struct page_info *l2_page;
+    struct page_info *l1_page;
+    l4_pgentry_t *l4 = NULL;
+    l3_pgentry_t *l3 = NULL;
+    l2_pgentry_t *l2 = NULL;
+    l1_pgentry_t *l1 = NULL;
+    int ret = -ENOMEM;
+
+    l4_page = image->aux_page;
+    if ( !l4_page )
+    {
+        l4_page = kimage_alloc_control_page(image, 0);
+        if ( !l4_page )
+            goto out;
+        image->aux_page = l4_page;
+    }
 
-    /* setup fixmap to point to our pages and record the virtual address
-     * in every odd index in page_list[].
-     */
+    l4 = __map_domain_page(l4_page);
+    l4 += l4_table_offset(vaddr);
+    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
+    {
+        l3_page = kimage_alloc_control_page(image, 0);
+        if ( !l3_page )
+            goto out;
+        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l3_page = l4e_get_page(*l4);
+
+    l3 = __map_domain_page(l3_page);
+    l3 += l3_table_offset(vaddr);
+    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
+    {
+        l2_page = kimage_alloc_control_page(image, 0);
+        if ( !l2_page )
+            goto out;
+        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l2_page = l3e_get_page(*l3);
+
+    l2 = __map_domain_page(l2_page);
+    l2 += l2_table_offset(vaddr);
+    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
+    {
+        l1_page = kimage_alloc_control_page(image, 0);
+        if ( !l1_page )
+            goto out;
+        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
+    }
+    else
+        l1_page = l2e_get_page(*l2);
+
+    l1 = __map_domain_page(l1_page);
+    l1 += l1_table_offset(vaddr);
+    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
+
+    ret = 0;
+out:
+    if ( l1 )
+        unmap_domain_page(l1);
+    if ( l2 )
+        unmap_domain_page(l2);
+    if ( l3 )
+        unmap_domain_page(l3);
+    if ( l4 )
+        unmap_domain_page(l4);
+    return ret;
+}
 
-    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
+int machine_kexec_load(struct kexec_image *image)
+{
+    void *code_page;
+    int ret;
+
+    switch ( image->arch )
     {
-        if ( (k & 1) == 0 )
-        {
-            /* Even pages: machine address. */
-            prev_ma = image->page_list[k];
-        }
-        else
-        {
-            /* Odd pages: va for previous ma. */
-            if ( is_pv_32on64_domain(dom0) )
-            {
-                /*
-                 * The compatability bounce code sets up a page table
-                 * with a 1-1 mapping of the first 1G of memory so
-                 * VA==PA here.
-                 *
-                 * This Linux purgatory code still sets up separate
-                 * high and low mappings on the control page (entries
-                 * 0 and 1) but it is harmless if they are equal since
-                 * that PT is not live at the time.
-                 */
-                image->page_list[k] = prev_ma;
-            }
-            else
-            {
-                set_fixmap(fix_base + (k >> 1), prev_ma);
-                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
-            }
-        }
+    case EM_386:
+    case EM_X86_64:
+        break;
+    default:
+        return -EINVAL;
     }
 
+    code_page = __map_domain_page(image->control_code_page);
+    memcpy(code_page, kexec_reloc, kexec_reloc_size);
+    unmap_domain_page(code_page);
+
+    /*
+     * Add a mapping for the control code page to the same virtual
+     * address as kexec_reloc.  This allows us to keep running after
+     * these page tables are loaded in kexec_reloc.
+     */
+    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
+                                 page_to_maddr(image->control_code_page));
+    if ( ret < 0 )
+        return ret;
+
     return 0;
 }
 
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
+void machine_kexec_unload(struct kexec_image *image)
 {
+    /* no-op. kimage_free() frees all control pages. */
 }
 
-void machine_reboot_kexec(xen_kexec_image_t *image)
+void machine_reboot_kexec(struct kexec_image *image)
 {
     BUG_ON(smp_processor_id() != 0);
     smp_send_stop();
@@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
     BUG();
 }
 
-void machine_kexec(xen_kexec_image_t *image)
+void machine_kexec(struct kexec_image *image)
 {
-    struct desc_ptr gdt_desc = {
-        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
-        .limit = LAST_RESERVED_GDT_BYTE
-    };
     int i;
+    unsigned long reloc_flags = 0;
 
     /* We are about to permenantly jump out of the Xen context into the kexec
      * purgatory code.  We really dont want to be still servicing interupts.
@@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
      * not like running with NMIs disabled. */
     enable_nmis();
 
-    /*
-     * compat_machine_kexec() returns to idle pagetables, which requires us
-     * to be running on a static GDT mapping (idle pagetables have no GDT
-     * mappings in their per-domain mapping area).
-     */
-    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
+    if ( image->arch == EM_386 )
+        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
 
-    if ( is_pv_32on64_domain(dom0) )
-    {
-        compat_machine_kexec(image->page_list[1],
-                             image->indirection_page,
-                             image->page_list,
-                             image->start_address);
-    }
-    else
-    {
-        relocate_new_kernel_t rnk;
-
-        rnk = (relocate_new_kernel_t) image->page_list[1];
-        (*rnk)(image->indirection_page, image->page_list,
-               image->start_address,
-               0 /* preserve_context */);
-    }
+    kexec_reloc(page_to_maddr(image->control_code_page),
+                page_to_maddr(image->aux_page),
+                image->head, image->entry_maddr, reloc_flags);
 }
 
 int machine_kexec_get(xen_kexec_range_t *range)
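
machine_kexec_add_page() above builds the kexec page tables one level at a time. It allocates a control page for any missing L3, L2 or L1 table and indexes each level with l4_table_offset() through l1_table_offset(). The following is a rough sketch of that index split. It assumes 4 KiB pages and 512 eight-byte entries per table, so each level consumes 9 bits of the virtual address. sketch_table_offset() and the SKETCH_* constants are hypothetical stand-ins, not Xen's macros:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the x86-64 4-level index split used by
 * l{4,3,2,1}_table_offset().  Assumes 4 KiB pages and 512 entries
 * per table, i.e. 9 address bits per level.
 */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PT_ENTRIES  512u

/*
 * level 1 selects bits 12-20, level 2 bits 21-29,
 * level 3 bits 30-38 and level 4 bits 39-47 of vaddr.
 */
static unsigned int sketch_table_offset(uint64_t vaddr, int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned int)((vaddr >> (SKETCH_PAGE_SHIFT + 9 * (level - 1)))
                          & (SKETCH_PT_ENTRIES - 1));
}
```

For the example address 0xffff82d080200000 this gives L4 slot 261, L3 slot 322, L2 slot 1 and L1 slot 0. A second address that differs only in bits 12-20 reuses the same L4, L2 and L3 tables, so adding it only writes another L1 entry and allocates no new control pages.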
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index d56e12d..7f8fb3d 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
 obj-y += mmconfig_64.o
 obj-y += mmconfig-shared.o
 obj-y += compat.o
-obj-bin-y += compat_kexec.o
 obj-y += domain.o
 obj-y += physdev.o
 obj-y += platform_hypercall.o
 obj-y += cpu_idle.o
 obj-y += cpufreq.o
+obj-bin-y += kexec_reloc.o
 
 obj-$(crash_debug)   += gdbstub.o
diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
deleted file mode 100644
index fc92af9..0000000
--- a/xen/arch/x86/x86_64/compat_kexec.S
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Compatibility kexec handler.
- */
-
-/*
- * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
- * currently true but if it ever changes then compat_pg_table will
- * need to be moved back below 4G at run time.
- */
-
-#include <xen/config.h>
-
-#include <asm/asm_defns.h>
-#include <asm/msr.h>
-#include <asm/page.h>
-
-/* The unrelocated physical address of a symbol. */
-#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
-
-/* Load physical address of symbol into register and relocate it. */
-#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
-                               add xen_phys_start(%rip), reg
-
-/*
- * Relocate a physical address in memory. Size of temporary register
- * determines size of the value to relocate.
- */
-#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
-                               add xen_phys_start(%rip), reg ; \
-                               mov reg, addr(%rip)
-
-        .text
-
-        .code64
-
-ENTRY(compat_machine_kexec)
-        /* x86/64                        x86/32  */
-        /* %rdi - relocate_new_kernel_t  CALL    */
-        /* %rsi - indirection page       4(%esp) */
-        /* %rdx - page_list              8(%esp) */
-        /* %rcx - start address         12(%esp) */
-        /*        cpu has pae           16(%esp) */
-
-        /* Shim the 64 bit page_list into a 32 bit page_list. */
-        mov $12,%r9
-        lea compat_page_list(%rip), %rbx
-1:      dec %r9
-        movl (%rdx,%r9,8),%eax
-        movl %eax,(%rbx,%r9,4)
-        test %r9,%r9
-        jnz 1b
-
-        RELOCATE_SYM(compat_page_list,%rdx)
-
-        /* Relocate compatibility mode entry point address. */
-        RELOCATE_MEM(compatibility_mode_far,%eax)
-
-        /* Relocate compat_pg_table. */
-        RELOCATE_MEM(compat_pg_table,     %rax)
-        RELOCATE_MEM(compat_pg_table+0x8, %rax)
-        RELOCATE_MEM(compat_pg_table+0x10,%rax)
-        RELOCATE_MEM(compat_pg_table+0x18,%rax)
-
-        /*
-         * Setup an identity mapped region in PML4[0] of idle page
-         * table.
-         */
-        RELOCATE_SYM(l3_identmap,%rax)
-        or  $0x63,%rax
-        mov %rax, idle_pg_table(%rip)
-
-        /* Switch to idle page table. */
-        RELOCATE_SYM(idle_pg_table,%rax)
-        movq %rax, %cr3
-
-        /* Switch to identity mapped compatibility stack. */
-        RELOCATE_SYM(compat_stack,%rax)
-        movq %rax, %rsp
-
-        /* Save xen_phys_start for 32 bit code. */
-        movq xen_phys_start(%rip), %rbx
-
-        /* Jump to low identity mapping in compatibility mode. */
-        ljmp *compatibility_mode_far(%rip)
-        ud2
-
-compatibility_mode_far:
-        .long SYM_PHYS(compatibility_mode)
-        .long __HYPERVISOR_CS32
-
-        /*
-         * We use 5 words of stack for the arguments passed to the kernel. The
-         * kernel only uses 1 word before switching to its own stack. Allocate
-         * 16 words to give "plenty" of room.
-         */
-        .fill 16,4,0
-compat_stack:
-
-        .code32
-
-#undef RELOCATE_SYM
-#undef RELOCATE_MEM
-
-/*
- * Load physical address of symbol into register and relocate it. %rbx
- * contains xen_phys_start(%rip) saved before jump to compatibility
- * mode.
- */
-#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
-                              add %ebx, reg
-
-compatibility_mode:
-        /* Setup some sane segments. */
-        movl $__HYPERVISOR_DS32, %eax
-        movl %eax, %ds
-        movl %eax, %es
-        movl %eax, %fs
-        movl %eax, %gs
-        movl %eax, %ss
-
-        /* Push arguments onto stack. */
-        pushl $0   /* 20(%esp) - preserve context */
-        pushl $1   /* 16(%esp) - cpu has pae */
-        pushl %ecx /* 12(%esp) - start address */
-        pushl %edx /*  8(%esp) - page list */
-        pushl %esi /*  4(%esp) - indirection page */
-        pushl %edi /*  0(%esp) - CALL */
-
-        /* Disable paging and therefore leave 64 bit mode. */
-        movl %cr0, %eax
-        andl $~X86_CR0_PG, %eax
-        movl %eax, %cr0
-
-        /* Switch to 32 bit page table. */
-        RELOCATE_SYM(compat_pg_table, %eax)
-        movl  %eax, %cr3
-
-        /* Clear MSR_EFER[LME], disabling long mode */
-        movl    $MSR_EFER,%ecx
-        rdmsr
-        btcl    $_EFER_LME,%eax
-        wrmsr
-
-        /* Re-enable paging, but only 32 bit mode now. */
-        movl %cr0, %eax
-        orl $X86_CR0_PG, %eax
-        movl %eax, %cr0
-        jmp 1f
-1:
-
-        popl %eax
-        call *%eax
-        ud2
-
-        .data
-        .align 4
-compat_page_list:
-        .fill 12,4,0
-
-        .align 32,0
-
-        /*
-         * These compat page tables contain an identity mapping of the
-         * first 4G of the physical address space.
-         */
-compat_pg_table:
-        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
-        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
-
-        .section .data.page_aligned, "aw", @progbits
-        .align PAGE_SIZE,0
-compat_pg_table_l2:
-        .macro identmap from=0, count=512
-        .if \count-1
-        identmap "(\from+0)","(\count/2)"
-        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
-        .else
-        .quad 0x00000000000000e3 + \from
-        .endif
-        .endm
-
-        identmap 0x00000000
-        identmap 0x40000000
-        identmap 0x80000000
-        identmap 0xc0000000
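
The new kexec_reloc.S below enters compatibility mode through a three-entry flat GDT (compat_mode_gdt). The following sketch decodes those descriptor constants into base, limit, access byte and flags. It uses the standard x86 code/data segment-descriptor layout. decode_desc() and struct seg_desc are illustrative names only:

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a legacy x86 code/data segment descriptor. */
struct seg_desc {
    uint32_t base;      /* bits 16-39 and 56-63 */
    uint32_t limit;     /* raw 20-bit limit: bits 0-15 and 48-51 */
    uint8_t  access;    /* bits 40-47: P, DPL, S, type */
    uint8_t  flags;     /* bits 52-55: G, D/B, L, AVL */
};

static struct seg_desc decode_desc(uint64_t d)
{
    struct seg_desc s;

    s.limit  = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 48) & 0xf) << 16;
    s.base   = (uint32_t)((d >> 16) & 0xffffff) |
               (uint32_t)((d >> 56) & 0xff) << 24;
    s.access = (uint8_t)((d >> 40) & 0xff);
    s.flags  = (uint8_t)((d >> 52) & 0xf);
    return s;
}
```

Both non-null entries decode to base 0 and limit 0xfffff with G=1, which is a flat 4 GiB segment. Access byte 0x9a is a present ring-0 execute/read code segment, and 0x92 is a present ring-0 read/write data segment. The L bit (0x2 in the flags nibble) is clear in the code descriptor. That is why the far jump through selector 0x0010 lands in 32-bit compatibility mode rather than 64-bit mode.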
diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
new file mode 100644
index 0000000..7a16c85
--- /dev/null
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -0,0 +1,198 @@
+/*
+ * Relocate a kexec_image to its destination and call it.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
+ *
+ *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <xen/config.h>
+#include <xen/kimage.h>
+
+#include <asm/asm_defns.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/machine_kexec.h>
+
+        .text
+        .align PAGE_SIZE
+        .code64
+
+ENTRY(kexec_reloc)
+        /* %rdi - code page maddr */
+        /* %rsi - page table maddr */
+        /* %rdx - indirection page maddr */
+        /* %rcx - entry maddr (%rbp) */
+        /* %r8 - flags */
+
+        movq    %rcx, %rbp
+
+        /* Set up the stack. */
+        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
+
+        /* Load reloc page table. */
+        movq    %rsi, %cr3
+
+        /* Jump to identity mapped code. */
+        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
+        jmpq    *%rax
+
+identity_mapped:
+        /*
+         * Set cr0 to a known state:
+         *  - Paging enabled
+         *  - Alignment check disabled
+         *  - Write protect disabled
+         *  - No task switch
+         *  - Don't do FP software emulation.
+         *  - Protected mode enabled
+         */
+        movq    %cr0, %rax
+        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
+        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
+        movq    %rax, %cr0
+
+        /*
+         * Set cr4 to a known state:
+         *  - physical address extension enabled
+         */
+        movl    $X86_CR4_PAE, %eax
+        movq    %rax, %cr4
+
+        movq    %rdx, %rdi
+        call    relocate_pages
+
+        /* Need to switch to 32-bit mode? */
+        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
+        jnz     call_32_bit
+
+call_64_bit:
+        /* Call the image entry point.  This should never return. */
+        callq   *%rbp
+        ud2
+
+call_32_bit:
+        /* Load an empty IDT. */
+        lidt    compat_mode_idt(%rip)
+
+        /* Load compat GDT. */
+        leaq    compat_mode_gdt(%rip), %rax
+        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
+        lgdt    compat_mode_gdt_desc(%rip)
+
+        /* Relocate compatibility mode entry point address. */
+        leal    compatibility_mode(%rip), %eax
+        movl    %eax, compatibility_mode_far(%rip)
+
+        /* Enter compatibility mode. */
+        ljmp    *compatibility_mode_far(%rip)
+
+relocate_pages:
+        /* %rdi - indirection page maddr */
+        pushq   %rbx
+
+        cld
+        movq    %rdi, %rbx
+        xorl    %edi, %edi
+        xorl    %esi, %esi
+
+next_entry: /* top, read another word for the indirection page */
+
+        movq    (%rbx), %rcx
+        addq    $8, %rbx
+is_dest:
+        testb   $IND_DESTINATION, %cl
+        jz      is_ind
+        movq    %rcx, %rdi
+        andq    $PAGE_MASK, %rdi
+        jmp     next_entry
+is_ind:
+        testb   $IND_INDIRECTION, %cl
+        jz      is_done
+        movq    %rcx, %rbx
+        andq    $PAGE_MASK, %rbx
+        jmp     next_entry
+is_done:
+        testb   $IND_DONE, %cl
+        jnz     done
+is_source:
+        testb   $IND_SOURCE, %cl
+        jz      is_zero
+        movq    %rcx, %rsi      /* For every source page do a copy */
+        andq    $PAGE_MASK, %rsi
+        movl    $(PAGE_SIZE / 8), %ecx
+        rep movsq
+        jmp     next_entry
+is_zero:
+        testb   $IND_ZERO, %cl
+        jz      next_entry
+        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
+        xorl    %eax, %eax
+        rep stosq
+        jmp     next_entry
+done:
+        popq    %rbx
+        ret
+
+        .code32
+
+compatibility_mode:
+        /* Set up some sane segments. */
+        movl    $0x0008, %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %fs
+        movl    %eax, %gs
+        movl    %eax, %ss
+
+        /* Disable paging and therefore leave 64 bit mode. */
+        movl    %cr0, %eax
+        andl    $~X86_CR0_PG, %eax
+        movl    %eax, %cr0
+
+        /* Disable long mode */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        andl    $~EFER_LME, %eax
+        wrmsr
+
+        /* Clear cr4 to disable PAE. */
+        xorl    %eax, %eax
+        movl    %eax, %cr4
+
+        /* Call the image entry point.  This should never return. */
+        call    *%ebp
+        ud2
+
+        .align 4
+compatibility_mode_far:
+        .long 0x00000000             /* set in call_32_bit above */
+        .word 0x0010
+
+compat_mode_gdt_desc:
+        .word (3*8)-1
+        .quad 0x0000000000000000     /* set in call_32_bit above */
+
+        .align 8
+compat_mode_gdt:
+        .quad 0x0000000000000000     /* null                              */
+        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
+        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
+
+compat_mode_idt:
+        .word 0                      /* limit */
+        .long 0                      /* base */
+
+        /*
+         * 16 words of stack are more than enough.
+         */
+        .fill 16,8,0
+reloc_stack:
+
+        .globl kexec_reloc_size
+kexec_reloc_size:
+        .long . - kexec_reloc
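
relocate_pages walks the indirection list one 8-byte entry at a time. The C model below runs that loop against a small simulated memory array, which may make the control flow easier to follow. The SK_IND_* values are assumed to match Linux's IND_DESTINATION/IND_INDIRECTION/IND_DONE/IND_SOURCE encoding plus an IND_ZERO flag. Xen's kimage.h values are not shown in this patch and may differ. The 32-byte "page" is purely for the sketch:

```c
#include <stdint.h>
#include <string.h>

#define SK_PAGE_WORDS 4                          /* tiny page: 4 words */
#define SK_PAGE_SIZE  (SK_PAGE_WORDS * 8)
#define SK_PAGE_MASK  (~(uint64_t)(SK_PAGE_SIZE - 1))
#define SK_NR_PAGES   16

/* Assumed flag values (low bits of a page-aligned address). */
#define SK_IND_DESTINATION 0x1
#define SK_IND_INDIRECTION 0x2
#define SK_IND_DONE        0x4
#define SK_IND_SOURCE      0x8
#define SK_IND_ZERO        0x10

static uint64_t sk_mem[SK_NR_PAGES * SK_PAGE_WORDS];   /* simulated RAM */

static uint64_t *sk_word(uint64_t addr) { return &sk_mem[addr / 8]; }

/* Same test order as the assembly: dest, ind, done, source, zero. */
static void sk_relocate_pages(uint64_t ind)
{
    uint64_t *p = sk_word(ind);
    uint64_t dest = 0;                           /* xorl %edi, %edi */

    for ( ;; )
    {
        uint64_t entry = *p++;                   /* next_entry */

        if ( entry & SK_IND_DESTINATION )
            dest = entry & SK_PAGE_MASK;
        else if ( entry & SK_IND_INDIRECTION )
            p = sk_word(entry & SK_PAGE_MASK);   /* follow to next list page */
        else if ( entry & SK_IND_DONE )
            return;
        else if ( entry & SK_IND_SOURCE )
        {
            memcpy(sk_word(dest), sk_word(entry & SK_PAGE_MASK), SK_PAGE_SIZE);
            dest += SK_PAGE_SIZE;                /* rep movsq advances %rdi */
        }
        else if ( entry & SK_IND_ZERO )
        {
            memset(sk_word(dest), 0, SK_PAGE_SIZE);
            dest += SK_PAGE_SIZE;                /* rep stosq advances %rdi */
        }
    }
}
```

As in the assembly, the destination pointer advances implicitly by one page after every copy or zero. Consecutive IND_SOURCE and IND_ZERO entries therefore fill consecutive destination pages after a single IND_DESTINATION entry.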
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 7b23df0..c5450ba 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,6 +25,7 @@
 #include <xen/version.h>
 #include <xen/console.h>
 #include <xen/kexec.h>
+#include <xen/kimage.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
 #include <xen/cpu.h>
@@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
 
 static cpumask_t crash_saved_cpus;
 
-static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
+static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
 
 #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
 #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
@@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
 
 static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
 
-static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
-
 static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
 static size_t vmcoreinfo_size = 0;
 
@@ -311,14 +310,14 @@ void kexec_crash(void)
     kexec_common_shutdown();
     kexec_crash_save_cpu();
     machine_crash_shutdown();
-    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
+    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
 
     BUG();
 }
 
 static long kexec_reboot(void *_image)
 {
-    xen_kexec_image_t *image = _image;
+    struct kexec_image *image = _image;
 
     kexecing = TRUE;
 
@@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
 #endif
 }
 
-static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
+static void kexec_unload_image(struct kexec_image *image)
 {
-    xen_kexec_image_t *image;
+    if ( !image )
+        return;
+
+    machine_kexec_unload(image);
+    kimage_free(image);
+}
+
+static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_exec_t exec;
+    struct kexec_image *image;
+    int base, bit, pos, ret = -EINVAL;
+
+    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+        return -EFAULT;
+
+    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+        return -EINVAL;
+
+    pos = (test_bit(bit, &kexec_flags) != 0);
+
+    /* Only allow kexec/kdump into loaded images */
+    if ( !test_bit(base + pos, &kexec_flags) )
+        return -ENOENT;
+
+    switch (exec.type)
+    {
+    case KEXEC_TYPE_DEFAULT:
+        image = kexec_image[base + pos];
+        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
+        break;
+    case KEXEC_TYPE_CRASH:
+        kexec_crash(); /* Does not return */
+        break;
+    }
+
+    return -EINVAL; /* never reached */
+}
+
+static int kexec_swap_images(int type, struct kexec_image *new,
+                             struct kexec_image **old)
+{
+    static DEFINE_SPINLOCK(kexec_lock);
     int base, bit, pos;
-    int ret = 0;
+    int new_slot, old_slot;
+
+    *old = NULL;
 
-    if ( kexec_load_get_bits(load->type, &base, &bit) )
+    if ( kexec_load_get_bits(type, &base, &bit) )
         return -EINVAL;
 
+    spin_lock(&kexec_lock);
+    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    {
+        spin_unlock(&kexec_lock);
+        return -EBUSY;
+    }
+
     pos = (test_bit(bit, &kexec_flags) != 0);
+    old_slot = base + pos;
+    new_slot = base + !pos;
 
-    /* Load the user data into an unused image */
-    if ( op == KEXEC_CMD_kexec_load )
+    if ( new )
     {
-        image = &kexec_image[base + !pos];
+        kexec_image[new_slot] = new;
+        set_bit(new_slot, &kexec_flags);
+    }
+    change_bit(bit, &kexec_flags);
 
-        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
+    clear_bit(old_slot, &kexec_flags);
+    *old = kexec_image[old_slot];
+    kexec_image[old_slot] = NULL;
 
-        memcpy(image, &load->image, sizeof(*image));
+    spin_unlock(&kexec_lock);
 
-        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
-        {
-            /* Set image present bit */
-            set_bit((base + !pos), &kexec_flags);
+    return 0;
+}
 
-            /* Make new image the active one */
-            change_bit(bit, &kexec_flags);
-        }
+static int kexec_load_slot(struct kexec_image *kimage)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = machine_kexec_load(kimage);
+    if ( ret < 0 )
+        return ret;
+
+    crash_save_vmcoreinfo();
+
+    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static uint16_t kexec_load_v1_arch(void)
+{
+#ifdef CONFIG_X86
+    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
+#else
+    return EM_NONE;
+#endif
+}
 
-        crash_save_vmcoreinfo();
+static int kexec_segments_add_segment(
+    unsigned int *nr_segments, xen_kexec_segment_t *segments,
+    unsigned long mfn)
+{
+    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
+    unsigned int n = *nr_segments;
+
+    /* Need a new segment? */
+    if ( n == 0
+         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
+    {
+        n++;
+        if ( n > KEXEC_SEGMENT_MAX )
+            return -EINVAL;
+        *nr_segments = n;
+
+        set_xen_guest_handle(segments[n-1].buf.h, NULL);
+        segments[n-1].buf_size = 0;
+        segments[n-1].dest_maddr = maddr;
+        segments[n-1].dest_size = 0;
     }
 
-    /* Unload the old image if present and load successful */
-    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
+    return 0;
+}
+
+static int kexec_segments_from_ind_page(unsigned long mfn,
+                                        unsigned int *nr_segments,
+                                        xen_kexec_segment_t *segments,
+                                        bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+
+    page = map_domain_page(mfn);
+
+    /*
+     * Walk the indirection page list, adding destination pages to the
+     * segments.
+     */
+    for ( entry = page; ; )
     {
-        if ( test_and_clear_bit((base + pos), &kexec_flags) )
+        unsigned long ind;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
         {
-            image = &kexec_image[base + pos];
-            machine_kexec_unload(load->type, base + pos, image);
+        case IND_DESTINATION:
+            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            entry = page = map_domain_page(mfn);
+            continue;
+        case IND_DONE:
+            goto done;
+        case IND_SOURCE:
+            if ( *nr_segments == 0 )
+            {
+                ret = -EINVAL;
+                goto done;
+            }
+            segments[*nr_segments-1].dest_size += PAGE_SIZE;
+            break;
+        default:
+            ret = -EINVAL;
+            goto done;
         }
+        entry = kimage_entry_next(entry, compat);
     }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
 
+static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
+{
+    struct kexec_image *kimage = NULL;
+    xen_kexec_segment_t *segments;
+    uint16_t arch;
+    unsigned int nr_segments = 0;
+    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
+    int ret;
+
+    arch = kexec_load_v1_arch();
+    if ( arch == EM_NONE )
+        return -ENOSYS;
+
+    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
+    if ( segments == NULL )
+        return -ENOMEM;
+
+    /*
+     * Work out the image segments (destination only) from the
+     * indirection pages.
+     *
+     * This is needed so we don't allocate pages that will overlap
+     * with the destination when building the new set of indirection
+     * pages below.
+     */
+    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
+                       nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    /*
+     * Build a new set of indirection pages in the native format.
+     *
+     * This walks the guest-provided indirection pages a second time.
+     * The guest could have altered them, invalidating the segment
+     * information constructed above.  The only consequence is that
+     * the resulting image may not be relocatable.
+     */
+    ret = kimage_build_ind(kimage, ind_mfn, compat);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
     return ret;
 }
 
-static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
     xen_kexec_load_v1_t load;
 
     if ( unlikely(copy_from_guest(&load, uarg, 1)) )
         return -EFAULT;
 
-    return kexec_load_unload_internal(op, &load);
+    return kexec_do_load_v1(&load, 0);
 }
 
-static int kexec_load_unload_compat(unsigned long op,
-                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
 #ifdef CONFIG_COMPAT
     compat_kexec_load_v1_t compat_load;
@@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
     load.type = compat_load.type;
     XLAT_kexec_image(&load.image, &compat_load.image);
 
-    return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
+    return kexec_do_load_v1(&load, 1);
+#else
     return 0;
-#endif /* CONFIG_COMPAT */
+#endif
 }
 
-static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
+static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
 {
-    xen_kexec_exec_t exec;
-    xen_kexec_image_t *image;
-    int base, bit, pos, ret = -EINVAL;
+    xen_kexec_load_t load;
+    xen_kexec_segment_t *segments;
+    struct kexec_image *kimage = NULL;
+    int ret;
 
-    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
+    if ( copy_from_guest(&load, uarg, 1) )
         return -EFAULT;
 
-    if ( kexec_load_get_bits(exec.type, &base, &bit) )
+    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
         return -EINVAL;
 
-    pos = (test_bit(bit, &kexec_flags) != 0);
-
-    /* Only allow kexec/kdump into loaded images */
-    if ( !test_bit(base + pos, &kexec_flags) )
-        return -ENOENT;
+    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
+    if ( segments == NULL )
+        return -ENOMEM;
 
-    switch (exec.type)
+    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
     {
-    case KEXEC_TYPE_DEFAULT:
-        image = &kexec_image[base + pos];
-        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
-        break;
-    case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
-        break;
+        ret = -EFAULT;
+        goto error;
     }
 
-    return -EINVAL; /* never reached */
+    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
+                       load.nr_segments, segments);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kimage_load_segments(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    ret = kexec_load_slot(kimage);
+    if ( ret < 0 )
+        goto error;
+
+    return 0;
+
+error:
+    if ( !kimage )
+        xfree(segments);
+    kimage_free(kimage);
+    return ret;
+}
+
+static int kexec_do_unload(xen_kexec_unload_t *unload)
+{
+    struct kexec_image *old_kimage;
+    int ret;
+
+    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
+    if ( ret < 0 )
+        return ret;
+
+    kexec_unload_image(old_kimage);
+
+    return 0;
+}
+
+static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_load_v1_t load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = load.type;
+    return kexec_do_unload(&unload);
+}
+
+static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+#ifdef CONFIG_COMPAT
+    compat_kexec_load_v1_t compat_load;
+    xen_kexec_unload_t unload;
+
+    if ( copy_from_guest(&compat_load, uarg, 1) )
+        return -EFAULT;
+
+    unload.type = compat_load.type;
+    return kexec_do_unload(&unload);
+#else
+    return 0;
+#endif
+}
+
+static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
+{
+    xen_kexec_unload_t unload;
+
+    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
+        return -EFAULT;
+
+    return kexec_do_unload(&unload);
 }
 
 static int do_kexec_op_internal(unsigned long op,
                                 XEN_GUEST_HANDLE_PARAM(void) uarg,
                                 bool_t compat)
 {
-    unsigned long flags;
     int ret = -EINVAL;
 
     ret = xsm_kexec(XSM_PRIV);
@@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
                 ret = kexec_get_range(uarg);
         break;
     case KEXEC_CMD_kexec_load_v1:
+        if ( compat )
+            ret = kexec_load_v1_compat(uarg);
+        else
+            ret = kexec_load_v1(uarg);
+        break;
     case KEXEC_CMD_kexec_unload_v1:
-        spin_lock_irqsave(&kexec_lock, flags);
-        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
-        spin_unlock_irqrestore(&kexec_lock, flags);
+        if ( compat )
+            ret = kexec_unload_v1_compat(uarg);
+        else
+            ret = kexec_unload_v1(uarg);
         break;
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
+    case KEXEC_CMD_kexec_load:
+        ret = kexec_load(uarg);
+        break;
+    case KEXEC_CMD_kexec_unload:
+        ret = kexec_unload(uarg);
+        break;
     }
 
     return ret;
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 02ee37e..10fb785 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
     image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
     if ( !image->control_code_page )
         goto out;
+    result = machine_kexec_add_page(image,
+                                    page_to_maddr(image->control_code_page),
+                                    page_to_maddr(image->control_code_page));
+    if ( result < 0 )
+        goto out;
 
     /* Add an empty indirection page. */
     image->entry_page = kimage_alloc_control_page(image, 0);
     if ( !image->entry_page )
         goto out;
+    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
+                                    page_to_maddr(image->entry_page));
+    if ( result < 0 )
+        goto out;
 
     image->head = page_to_maddr(image->entry_page);
 
@@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
         if ( addr == destination )
         {
             page_list_del(page, &image->dest_pages);
-            return page;
+            goto found;
         }
     }
     page = NULL;
@@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
             page_list_add(page, &image->dest_pages);
         }
     }
+found:
+    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
     return page;
 }
 
@@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
 static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
 {
     int result = -ENOMEM;
+    paddr_t addr;
 
     if ( !guest_handle_is_null(segment->buf.h) )
     {
@@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
         }
     }
 
+    for ( addr = segment->dest_maddr & PAGE_MASK;
+          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
+    {
+        result = machine_kexec_add_page(image, addr, addr);
+        if ( result < 0 )
+            break;
+    }
+
     return result;
 }
 
@@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
     return 0;
 }
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return (kimage_entry_t *)((uint32_t *)entry + 1);
+    return entry + 1;
+}
+
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry >> PAGE_SHIFT;
+    return *entry >> PAGE_SHIFT;
+}
+
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
+{
+    if ( compat )
+        return *(uint32_t *)entry & 0xf;
+    return *entry & 0xf;
+}
+
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat)
+{
+    void *page;
+    kimage_entry_t *entry;
+    int ret = 0;
+    paddr_t dest = KIMAGE_NO_DEST;
+
+    page = map_domain_page(ind_mfn);
+    if ( !page )
+        return -ENOMEM;
+
+    /*
+     * Walk the guest-supplied indirection pages, adding entries to
+     * the image's indirection pages.
+     */
+    for ( entry = page; ;  )
+    {
+        unsigned long ind;
+        unsigned long mfn;
+
+        ind = kimage_entry_ind(entry, compat);
+        mfn = kimage_entry_mfn(entry, compat);
+
+        switch ( ind )
+        {
+        case IND_DESTINATION:
+            dest = (paddr_t)mfn << PAGE_SHIFT;
+            ret = kimage_set_destination(image, dest);
+            if ( ret < 0 )
+                goto done;
+            break;
+        case IND_INDIRECTION:
+            unmap_domain_page(page);
+            page = map_domain_page(mfn);
+            entry = page;
+            continue;
+        case IND_DONE:
+            kimage_terminate(image);
+            goto done;
+        case IND_SOURCE:
+        {
+            struct page_info *guest_page, *xen_page;
+
+            guest_page = mfn_to_page(mfn);
+            if ( !get_page(guest_page, current->domain) )
+            {
+                ret = -EFAULT;
+                goto done;
+            }
+
+            xen_page = kimage_alloc_page(image, dest);
+            if ( !xen_page )
+            {
+                put_page(guest_page);
+                ret = -ENOMEM;
+                goto done;
+            }
+
+            copy_domain_page(page_to_mfn(xen_page), mfn);
+            put_page(guest_page);
+
+            ret = kimage_add_page(image, page_to_maddr(xen_page));
+            if ( ret < 0 )
+                goto done;
+            dest += PAGE_SIZE;
+            break;
+        }
+        default:
+            ret = -EINVAL;
+            goto done;
+        }
+        entry = kimage_entry_next(entry, compat);
+    }
+done:
+    unmap_domain_page(page);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 8b4266d..48c5676 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -56,9 +56,6 @@ enum fixed_addresses {
     FIX_ACPI_BEGIN,
     FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
     FIX_HPET_BASE,
-    FIX_KEXEC_BASE_0,
-    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
-      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
     FIX_TBOOT_SHARED_BASE,
     FIX_MSIX_IO_RESERV_BASE,
     FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
new file mode 100644
index 0000000..ba0d469
--- /dev/null
+++ b/xen/include/asm-x86/machine_kexec.h
@@ -0,0 +1,16 @@
+#ifndef __X86_MACHINE_KEXEC_H__
+#define __X86_MACHINE_KEXEC_H__
+
+#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
+
+#ifndef __ASSEMBLY__
+
+extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
+                        unsigned long ind_maddr, unsigned long entry_maddr,
+                        unsigned long flags);
+
+extern unsigned int kexec_reloc_size;
+
+#endif
+
+#endif /* __X86_MACHINE_KEXEC_H__ */
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index 1a5dda1..bd17747 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -6,6 +6,7 @@
 #include <public/kexec.h>
 #include <asm/percpu.h>
 #include <xen/elfcore.h>
+#include <xen/kimage.h>
 
 typedef struct xen_kexec_reserve {
     unsigned long size;
@@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
 extern paddr_t crashinfo_maxaddr_bits;
 void kexec_early_calculations(void);
 
-int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
-void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
+int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
+                           unsigned long maddr);
+int machine_kexec_load(struct kexec_image *image);
+void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
-void machine_reboot_kexec(xen_kexec_image_t *image);
-void machine_kexec(xen_kexec_image_t *image);
+void machine_reboot_kexec(struct kexec_image *image);
+void machine_kexec(struct kexec_image *image);
 void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
@@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
 int machine_kexec_get(xen_kexec_range_t *range);
 int machine_kexec_get_xen(xen_kexec_range_t *range);
 
-void compat_machine_kexec(unsigned long rnk,
-                          unsigned long indirection_page,
-                          unsigned long *page_list,
-                          unsigned long start_address);
-
 /* vmcoreinfo stuff */
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
index 0ebd37a..d10ebf7 100644
--- a/xen/include/xen/kimage.h
+++ b/xen/include/xen/kimage.h
@@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
 struct page_info *kimage_alloc_control_page(struct kexec_image *image,
                                             unsigned memflags);
 
+kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
+unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
+int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
+                     bool_t compat);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __XEN_KIMAGE_H__ */
-- 
1.7.2.5
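The loop added to kimage_load_segment() passes every page that a segment touches to machine_kexec_add_page(). It starts at dest_maddr rounded down to a page boundary and keeps going while the address is below dest_maddr + dest_size, so the end of the range is effectively rounded up. The arithmetic can be checked on its own with the sketch below. It assumes 4 KiB pages, and count_segment_pages() and the SKETCH_* macros are hypothetical names that are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages, for illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Count the pages that the kimage_load_segment() loop would pass to
 * machine_kexec_add_page() for the range [dest, dest + size).
 */
static unsigned long count_segment_pages(uint64_t dest, uint64_t size)
{
    unsigned long n = 0;
    uint64_t addr;

    assert(dest + size >= dest);   /* the sketch assumes no wrap-around */
    for ( addr = dest & SKETCH_PAGE_MASK; addr < dest + size;
          addr += SKETCH_PAGE_SIZE )
        n++;
    return n;
}
```

For example, a 4 KiB segment starting at 0x1800 straddles a page boundary, so the loop visits both 0x1000 and 0x2000.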


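kimage_build_ind() walks a guest-supplied indirection list. Each entry holds a page-aligned machine address with a type flag in its low four bits. 32-bit (compat) callers supply 32-bit entries, which is why kimage_entry_next() advances by sizeof(uint32_t) in that case. Below is a minimal standalone model of that walk, built on several assumptions:

- The IND_* values are the ones used by Linux kexec (1, 2, 4, 8).
- Pages are 4 KiB.
- A hypothetical pages[] lookup replaces map_domain_page().
- IND_SOURCE entries are only counted, not copied.

walk_ind() and struct walk_result are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry types, assumed to match the kexec ABI used by Linux and Xen. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

struct walk_result {
    unsigned long copied;   /* number of IND_SOURCE entries seen */
    uint64_t next_dest;     /* destination after the last source page */
};

/* Read entry 'idx': 8-byte entries natively, 4-byte entries for compat. */
static uint64_t read_entry(const void *page, size_t idx, int compat)
{
    if ( compat )
        return ((const uint32_t *)page)[idx];
    return ((const uint64_t *)page)[idx];
}

/*
 * Walk the list starting in frame 'start_mfn'; pages[mfn] stands in for
 * map_domain_page(mfn).  Returns 0 at IND_DONE, -1 on an unknown type.
 */
static int walk_ind(const void *const *pages, unsigned long start_mfn,
                    int compat, struct walk_result *res)
{
    const void *page = pages[start_mfn];
    size_t idx = 0;
    uint64_t dest = 0;

    res->copied = 0;
    for ( ;; )
    {
        uint64_t e = read_entry(page, idx, compat);
        unsigned long mfn = (unsigned long)(e >> SKETCH_PAGE_SHIFT);

        switch ( e & 0xf )
        {
        case IND_DESTINATION:   /* following source pages are copied here */
            dest = (uint64_t)mfn << SKETCH_PAGE_SHIFT;
            break;
        case IND_INDIRECTION:   /* continue at the top of another page */
            page = pages[mfn];
            assert(page != NULL);
            idx = 0;
            continue;
        case IND_DONE:          /* end of the list */
            res->next_dest = dest;
            return 0;
        case IND_SOURCE:        /* real code copies frame 'mfn' to 'dest' */
            res->copied++;
            dest += SKETCH_PAGE_SIZE;
            break;
        default:
            return -1;
        }
        idx++;
    }
}
```

As in kimage_build_ind(), an IND_INDIRECTION entry restarts at the first entry of the new page rather than advancing.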
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

* [PATCH 5/9] xen: kexec crash image when dom0 crashes
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (7 preceding siblings ...)
  2013-11-06 14:49 ` [PATCH 5/9] xen: kexec crash image when dom0 crashes David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-06 14:49 ` [PATCH 6/9] libxc: add hypercall buffer arrays David Vrabel
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/kexec.c    |    2 ++
 xen/common/shutdown.c |    3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index c5450ba..9999bab 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -305,6 +305,8 @@ void kexec_crash(void)
     if ( !test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags) )
         return;
 
+    printk("Executing crash image\n");
+
     kexecing = TRUE;
 
     kexec_common_shutdown();
diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c
index 20f04b0..9bccd34 100644
--- a/xen/common/shutdown.c
+++ b/xen/common/shutdown.c
@@ -47,6 +47,9 @@ void dom0_shutdown(u8 reason)
     {
         debugger_trap_immediate();
         printk("Domain 0 crashed: ");
+#ifdef CONFIG_KEXEC
+        kexec_crash();
+#endif
         maybe_reboot();
         break; /* not reached */
     }
-- 
1.7.2.5
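After this change, a dom0 crash first calls kexec_crash(). That function returns immediately when no crash image is loaded (the test_bit() check in the first hunk), so dom0_shutdown() still falls through to maybe_reboot(). When an image is loaded, kexec_crash() is not expected to return. Builds without CONFIG_KEXEC keep the old behaviour because the call is compiled out. The sketch below models only that control flow. crash_image_loaded, try_kexec_crash(), dom0_crashed() and enum crash_action are hypothetical stand-ins, not Xen code.

```c
#include <assert.h>
#include <stdbool.h>

enum crash_action { ACTION_NONE, ACTION_CRASH_IMAGE, ACTION_REBOOT };

/* Hypothetical stand-ins for kexec_flags and for what actually happened. */
static bool crash_image_loaded;
static enum crash_action last_action = ACTION_NONE;

/* Models kexec_crash(): only "returns" (false) when there is no image. */
static bool try_kexec_crash(void)
{
    if ( !crash_image_loaded )
        return false;                   /* early return, as in kexec_crash() */
    last_action = ACTION_CRASH_IMAGE;   /* real code jumps to the crash kernel */
    return true;
}

/* Models the dom0 crash branch of dom0_shutdown() after this patch. */
static void dom0_crashed(void)
{
    if ( try_kexec_crash() )
        return;                         /* not reached in Xen */
    last_action = ACTION_REBOOT;        /* models maybe_reboot() */
}
```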

* [PATCH 6/9] libxc: add hypercall buffer arrays
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (8 preceding siblings ...)
  2013-11-06 14:49 ` David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-06 14:49 ` David Vrabel
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Hypercall buffer arrays are used when a hypercall takes a variable-length
array of buffers.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_hcall_buf.c |   73 ++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h      |   27 ++++++++++++++++
 2 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/tools/libxc/xc_hcall_buf.c b/tools/libxc/xc_hcall_buf.c
index c354677..e762a93 100644
--- a/tools/libxc/xc_hcall_buf.c
+++ b/tools/libxc/xc_hcall_buf.c
@@ -228,6 +228,79 @@ void xc__hypercall_bounce_post(xc_interface *xch, xc_hypercall_buffer_t *b)
     xc__hypercall_buffer_free(xch, b);
 }
 
+struct xc_hypercall_buffer_array {
+    unsigned max_bufs;
+    xc_hypercall_buffer_t *bufs;
+};
+
+xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch,
+                                                              unsigned n)
+{
+    xc_hypercall_buffer_array_t *array;
+    xc_hypercall_buffer_t *bufs = NULL;
+
+    array = malloc(sizeof(*array));
+    if ( array == NULL )
+        goto error;
+
+    bufs = calloc(n, sizeof(*bufs));
+    if ( bufs == NULL )
+        goto error;
+
+    array->max_bufs = n;
+    array->bufs     = bufs;
+
+    return array;
+
+error:
+    free(bufs);
+    free(array);
+    return NULL;
+}
+
+void *xc__hypercall_buffer_array_alloc(xc_interface *xch,
+                                       xc_hypercall_buffer_array_t *array,
+                                       unsigned index,
+                                       xc_hypercall_buffer_t *hbuf,
+                                       size_t size)
+{
+    void *buf;
+
+    if ( index >= array->max_bufs || array->bufs[index].hbuf )
+        abort();
+
+    buf = xc__hypercall_buffer_alloc(xch, hbuf, size);
+    if ( buf )
+        array->bufs[index] = *hbuf;
+    return buf;
+}
+
+void *xc__hypercall_buffer_array_get(xc_interface *xch,
+                                     xc_hypercall_buffer_array_t *array,
+                                     unsigned index,
+                                     xc_hypercall_buffer_t *hbuf)
+{
+    if ( index >= array->max_bufs || array->bufs[index].hbuf == NULL )
+        abort();
+
+    *hbuf = array->bufs[index];
+    return array->bufs[index].hbuf;
+}
+
+void xc_hypercall_buffer_array_destroy(xc_interface *xc,
+                                       xc_hypercall_buffer_array_t *array)
+{
+    unsigned i;
+
+    if ( array == NULL )
+        return;
+
+    for ( i = 0; i < array->max_bufs; i++ )
+        xc__hypercall_buffer_free(xc, &array->bufs[i]);
+    free(array->bufs);
+    free(array);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 8cf3f3b..a7e8c31 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -321,6 +321,33 @@ void xc__hypercall_buffer_free_pages(xc_interface *xch, xc_hypercall_buffer_t *b
 #define xc_hypercall_buffer_free_pages(_xch, _name, _nr) xc__hypercall_buffer_free_pages(_xch, HYPERCALL_BUFFER(_name), _nr)
 
 /*
+ * Array of hypercall buffers.
+ *
+ * Create an array with xc_hypercall_buffer_array_create() and
+ * populate it by declaring one hypercall buffer in a loop and
+ * allocating the buffer with xc_hypercall_buffer_array_alloc().
+ *
+ * To access a previously allocated buffer, declare a new hypercall
+ * buffer and call xc_hypercall_buffer_array_get().
+ *
+ * Destroy the array with xc_hypercall_buffer_array_destroy() to free
+ * the array and all its allocated hypercall buffers.
+ */
+struct xc_hypercall_buffer_array;
+typedef struct xc_hypercall_buffer_array xc_hypercall_buffer_array_t;
+
+xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch, unsigned n);
+void *xc__hypercall_buffer_array_alloc(xc_interface *xch, xc_hypercall_buffer_array_t *array,
+                                       unsigned index, xc_hypercall_buffer_t *hbuf, size_t size);
+#define xc_hypercall_buffer_array_alloc(_xch, _array, _index, _name, _size) \
+    xc__hypercall_buffer_array_alloc(_xch, _array, _index, HYPERCALL_BUFFER(_name), _size)
+void *xc__hypercall_buffer_array_get(xc_interface *xch, xc_hypercall_buffer_array_t *array,
+                                     unsigned index, xc_hypercall_buffer_t *hbuf);
+#define xc_hypercall_buffer_array_get(_xch, _array, _index, _name, _size) \
+    xc__hypercall_buffer_array_get(_xch, _array, _index, HYPERCALL_BUFFER(_name))
+void xc_hypercall_buffer_array_destroy(xc_interface *xc, xc_hypercall_buffer_array_t *array);
+
+/*
  * CPUMAP handling
  */
 typedef uint8_t *xc_cpumap_t;
-- 
1.7.2.5
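The xenctrl.h comment describes the intended life cycle: create an array of n slots, allocate each slot at most once, fetch a slot again later by index, and free everything with one destroy call. The sketch below models that bookkeeping with plain malloc() so it can run outside libxc. struct buf_array, buf_array_alloc() and the other names are hypothetical. The libxc code calls abort() on an out-of-range or misused index; this model reports misuse by returning NULL instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical malloc()-backed model of xc_hypercall_buffer_array_t. */
struct buf_array {
    unsigned max_bufs;
    void **bufs;               /* NULL means "slot not allocated yet" */
};

static struct buf_array *buf_array_create(unsigned n)
{
    struct buf_array *array = malloc(sizeof(*array));

    if ( array == NULL )
        return NULL;
    array->bufs = calloc(n, sizeof(*array->bufs));
    if ( array->bufs == NULL && n != 0 )
    {
        free(array);
        return NULL;
    }
    array->max_bufs = n;
    return array;
}

/* Allocate slot 'index'; each slot may be allocated only once. */
static void *buf_array_alloc(struct buf_array *array, unsigned index,
                             size_t size)
{
    if ( index >= array->max_bufs || array->bufs[index] != NULL )
        return NULL;           /* libxc aborts here instead */
    array->bufs[index] = malloc(size);
    return array->bufs[index];
}

/* Look up a previously allocated slot (NULL if it was never allocated). */
static void *buf_array_get(struct buf_array *array, unsigned index)
{
    if ( index >= array->max_bufs )
        return NULL;
    return array->bufs[index];
}

/* Free every allocated slot and then the array itself. */
static void buf_array_destroy(struct buf_array *array)
{
    unsigned i;

    if ( array == NULL )
        return;
    for ( i = 0; i < array->max_bufs; i++ )
        free(array->bufs[i]);
    free(array->bufs);
    free(array);
}
```

In libxc itself each slot stores an xc_hypercall_buffer_t rather than a bare pointer. That is why the header comment asks callers to declare a fresh hypercall buffer in each loop iteration.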

* [PATCH 7/9] libxc: add API for kexec hypercall
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (11 preceding siblings ...)
  2013-11-06 14:49 ` [PATCH 7/9] libxc: add API for kexec hypercall David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-06 14:49 ` [PATCH 8/9] x86: check kexec relocation code fits in a page David Vrabel
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add xc_kexec_exec(), xc_kexec_get_range(), xc_kexec_load(), and
xc_kexec_unload().  The load and unload calls require the v2 load and
unload ops.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/Makefile   |    1 +
 tools/libxc/xc_kexec.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h  |   55 +++++++++++++++++++
 3 files changed, 196 insertions(+), 0 deletions(-)
 create mode 100644 tools/libxc/xc_kexec.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 4c64c15..f2d6e56 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -31,6 +31,7 @@ CTRL_SRCS-y       += xc_mem_access.c
 CTRL_SRCS-y       += xc_memshr.c
 CTRL_SRCS-y       += xc_hcall_buf.c
 CTRL_SRCS-y       += xc_foreign_memory.c
+CTRL_SRCS-y       += xc_kexec.c
 CTRL_SRCS-y       += xtl_core.c
 CTRL_SRCS-y       += xtl_logger_stdio.c
 CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c
new file mode 100644
index 0000000..a49cffb
--- /dev/null
+++ b/tools/libxc/xc_kexec.c
@@ -0,0 +1,140 @@
+/******************************************************************************
+ * xc_kexec.c
+ *
+ * API for loading and executing kexec images.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ */
+#include "xc_private.h"
+
+int xc_kexec_exec(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec);
+    int ret = -1;
+
+    exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec));
+    if ( exec == NULL )
+    {
+        PERROR("Could not alloc bounce buffer for kexec_exec hypercall");
+        goto out;
+    }
+
+    exec->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, exec);
+
+    return ret;
+}
+
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range);
+    int ret = -1;
+
+    get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range));
+    if ( get_range == NULL )
+    {
+        PERROR("Could not alloc bounce buffer for kexec_get_range hypercall");
+        goto out;
+    }
+
+    get_range->range = range;
+    get_range->nr = nr;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_get_range;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+    *size = get_range->size;
+    *start = get_range->start;
+
+out:
+    xc_hypercall_buffer_free(xch, get_range);
+
+    return ret;
+}
+
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments)
+{
+    int ret = -1;
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments,
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load);
+
+    if ( xc_hypercall_bounce_pre(xch, segments) )
+    {
+        PERROR("Could not allocate bounce buffer for kexec load hypercall");
+        goto out;
+    }
+    load = xc_hypercall_buffer_alloc(xch, load, sizeof(*load));
+    if ( load == NULL )
+    {
+        PERROR("Could not allocate buffer for kexec load hypercall");
+        goto out;
+    }
+
+    load->type = type;
+    load->arch = arch;
+    load->entry_maddr = entry_maddr;
+    load->nr_segments = nr_segments;
+    set_xen_guest_handle(load->segments.h, segments);
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_load;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, load);
+    xc_hypercall_bounce_post(xch, segments);
+
+    return ret;
+}
+
+int xc_kexec_unload(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload);
+    int ret = -1;
+
+    unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload));
+    if ( unload == NULL )
+    {
+        PERROR("Could not alloc buffer for kexec unload hypercall");
+        goto out;
+    }
+
+    unload->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_unload;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, unload);
+
+    return ret;
+}
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index a7e8c31..4ac6b8a 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -46,6 +46,7 @@
 #include <xen/hvm/params.h>
 #include <xen/xsm/flask_op.h>
 #include <xen/tmem.h>
+#include <xen/kexec.h>
 
 #include "xentoollog.h"
 
@@ -2340,4 +2341,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf,
 				   unsigned long compbuf_size,
 				   unsigned long *compbuf_pos, char *dest);
 
+/*
+ * Execute an image previously loaded with xc_kexec_load().
+ *
+ * Does not return on success.
+ *
+ * Fails with:
+ *   ENOENT if the specified image has not been loaded.
+ */
+int xc_kexec_exec(xc_interface *xch, int type);
+
+/*
+ * Find the machine address and size of certain memory areas.
+ *
+ *   KEXEC_RANGE_MA_CRASH       crash area
+ *   KEXEC_RANGE_MA_XEN         Xen itself
+ *   KEXEC_RANGE_MA_CPU         CPU note for CPU number 'nr'
+ *   KEXEC_RANGE_MA_XENHEAP     xenheap
+ *   KEXEC_RANGE_MA_EFI_MEMMAP  EFI Memory Map
+ *   KEXEC_RANGE_MA_VMCOREINFO  vmcoreinfo
+ *
+ * Fails with:
+ *   EINVAL if the range or CPU number isn't valid.
+ */
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start);
+
+/*
+ * Load a kexec image into memory.
+ *
+ * The image may be of type KEXEC_TYPE_DEFAULT (executed on request)
+ * or KEXEC_TYPE_CRASH (executed on a crash).
+ *
+ * The image architecture may be a 32-bit variant of the hypervisor
+ * architecture (e.g., EM_386 on an x86-64 hypervisor).
+ *
+ * Fails with:
+ *   ENOMEM if there is insufficient memory for the new image.
+ *   EINVAL if the image does not fit into the crash area or the entry
+ *          point isn't within one of the segments.
+ *   EBUSY  if another image is being executed.
+ */
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments);
+
+/*
+ * Unload a kexec image.
+ *
+ * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from
+ * being executed.  The crash images are not cleared from the crash
+ * region.
+ */
+int xc_kexec_unload(xc_interface *xch, int type);
+
 #endif /* XENCTRL_H */
-- 
1.7.2.5
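
A caller may want to check a KEXEC_TYPE_CRASH image against the EINVAL
conditions documented above before calling xc_kexec_load().  This is a
minimal sketch, assuming the crash region start and size were already
obtained with xc_kexec_get_range(KEXEC_RANGE_MA_CRASH).  struct
sketch_seg, seg_in_region(), entry_in_segments() and crash_load_ok() are
hypothetical helpers, not libxc API; the hypervisor remains the
authority on what it accepts.

```c
#include <stdint.h>

/* Hypothetical mirror of the placement fields of xen_kexec_segment_t
 * (the buffer handle is omitted because it does not affect placement). */
struct sketch_seg {
    uint64_t buf_size;
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/* 1 if [dest_maddr, dest_maddr + dest_size) lies inside
 * [start, start + size); written so that no sum can overflow. */
static int seg_in_region(const struct sketch_seg *s,
                         uint64_t start, uint64_t size)
{
    uint64_t off;

    if ( s->dest_maddr < start )
        return 0;
    off = s->dest_maddr - start;
    return off <= size && s->dest_size <= size - off;
}

/* 1 if entry falls inside the destination range of some segment. */
static int entry_in_segments(const struct sketch_seg *segs, uint32_t nr,
                             uint64_t entry)
{
    uint32_t i;

    for ( i = 0; i < nr; i++ )
        if ( entry >= segs[i].dest_maddr &&
             entry - segs[i].dest_maddr < segs[i].dest_size )
            return 1;
    return 0;
}

/* Pre-flight check for a crash image: every segment inside the crash
 * region and the entry point inside one of the segments. */
int crash_load_ok(const struct sketch_seg *segs, uint32_t nr, uint64_t entry,
                  uint64_t crash_start, uint64_t crash_size)
{
    uint32_t i;

    for ( i = 0; i < nr; i++ )
        if ( !seg_in_region(&segs[i], crash_start, crash_size) )
            return 0;
    return entry_in_segments(segs, nr, entry);
}
```

For example, with a made-up 128 MiB crash region at 0x10000000, a
0x400000-byte segment at 0x10200000 passes with an entry point of
0x10200000, while an entry point of 0x20000000 fails the entry-point
check.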

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 7/9] libxc: add API for kexec hypercall
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (10 preceding siblings ...)
  2013-11-06 14:49 ` David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-07 20:48   ` Don Slutz
  2013-11-07 20:48   ` Don Slutz
  2013-11-06 14:49 ` David Vrabel
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Add xc_kexec_exec(), xc_kexec_get_range(), xc_kexec_load(), and
xc_kexec_unload().  The load and unload calls require the v2 load and
unload sub-ops (KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/Makefile   |    1 +
 tools/libxc/xc_kexec.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h  |   55 +++++++++++++++++++
 3 files changed, 196 insertions(+), 0 deletions(-)
 create mode 100644 tools/libxc/xc_kexec.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 4c64c15..f2d6e56 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -31,6 +31,7 @@ CTRL_SRCS-y       += xc_mem_access.c
 CTRL_SRCS-y       += xc_memshr.c
 CTRL_SRCS-y       += xc_hcall_buf.c
 CTRL_SRCS-y       += xc_foreign_memory.c
+CTRL_SRCS-y       += xc_kexec.c
 CTRL_SRCS-y       += xtl_core.c
 CTRL_SRCS-y       += xtl_logger_stdio.c
 CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c
new file mode 100644
index 0000000..a49cffb
--- /dev/null
+++ b/tools/libxc/xc_kexec.c
@@ -0,0 +1,140 @@
+/******************************************************************************
+ * xc_kexec.c
+ *
+ * API for loading and executing kexec images.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ */
+#include "xc_private.h"
+
+int xc_kexec_exec(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec);
+    int ret = -1;
+
+    exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec));
+    if ( exec == NULL )
+    {
+        PERROR("Could not alloc buffer for kexec_exec hypercall");
+        goto out;
+    }
+
+    exec->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, exec);
+
+    return ret;
+}
+
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range);
+    int ret = -1;
+
+    get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range));
+    if ( get_range == NULL )
+    {
+        PERROR("Could not alloc buffer for kexec_get_range hypercall");
+        goto out;
+    }
+
+    get_range->range = range;
+    get_range->nr = nr;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_get_range;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+    *size = get_range->size;
+    *start = get_range->start;
+
+out:
+    xc_hypercall_buffer_free(xch, get_range);
+
+    return ret;
+}
+
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments)
+{
+    int ret = -1;
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments,
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load);
+
+    if ( xc_hypercall_bounce_pre(xch, segments) )
+    {
+        PERROR("Could not allocate bounce buffer for kexec load hypercall");
+        goto out;
+    }
+    load = xc_hypercall_buffer_alloc(xch, load, sizeof(*load));
+    if ( load == NULL )
+    {
+        PERROR("Could not allocate buffer for kexec load hypercall");
+        goto out;
+    }
+
+    load->type = type;
+    load->arch = arch;
+    load->entry_maddr = entry_maddr;
+    load->nr_segments = nr_segments;
+    set_xen_guest_handle(load->segments.h, segments);
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_load;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, load);
+    xc_hypercall_bounce_post(xch, segments);
+
+    return ret;
+}
+
+int xc_kexec_unload(xc_interface *xch, int type)
+{
+    DECLARE_HYPERCALL;
+    DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload);
+    int ret = -1;
+
+    unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload));
+    if ( unload == NULL )
+    {
+        PERROR("Could not alloc buffer for kexec unload hypercall");
+        goto out;
+    }
+
+    unload->type = type;
+
+    hypercall.op = __HYPERVISOR_kexec_op;
+    hypercall.arg[0] = KEXEC_CMD_kexec_unload;
+    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload);
+
+    ret = do_xen_hypercall(xch, &hypercall);
+
+out:
+    xc_hypercall_buffer_free(xch, unload);
+
+    return ret;
+}
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index a7e8c31..4ac6b8a 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -46,6 +46,7 @@
 #include <xen/hvm/params.h>
 #include <xen/xsm/flask_op.h>
 #include <xen/tmem.h>
+#include <xen/kexec.h>
 
 #include "xentoollog.h"
 
@@ -2340,4 +2341,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf,
 				   unsigned long compbuf_size,
 				   unsigned long *compbuf_pos, char *dest);
 
+/*
+ * Execute an image previously loaded with xc_kexec_load().
+ *
+ * Does not return on success.
+ *
+ * Fails with:
+ *   ENOENT if the specified image has not been loaded.
+ */
+int xc_kexec_exec(xc_interface *xch, int type);
+
+/*
+ * Find the machine address and size of certain memory areas.
+ *
+ *   KEXEC_RANGE_MA_CRASH       crash area
+ *   KEXEC_RANGE_MA_XEN         Xen itself
+ *   KEXEC_RANGE_MA_CPU         CPU note for CPU number 'nr'
+ *   KEXEC_RANGE_MA_XENHEAP     xenheap
+ *   KEXEC_RANGE_MA_EFI_MEMMAP  EFI Memory Map
+ *   KEXEC_RANGE_MA_VMCOREINFO  vmcoreinfo
+ *
+ * Fails with:
+ *   EINVAL if the range or CPU number isn't valid.
+ */
+int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
+                       uint64_t *size, uint64_t *start);
+
+/*
+ * Load a kexec image into memory.
+ *
+ * The image may be of type KEXEC_TYPE_DEFAULT (executed on request)
+ * or KEXEC_TYPE_CRASH (executed on a crash).
+ *
+ * The image architecture may be a 32-bit variant of the hypervisor
+ * architecture (e.g., EM_386 on an x86-64 hypervisor).
+ *
+ * Fails with:
+ *   ENOMEM if there is insufficient memory for the new image.
+ *   EINVAL if the image does not fit into the crash area or the entry
+ *          point isn't within one of the segments.
+ *   EBUSY  if another image is being executed.
+ */
+int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
+                  uint64_t entry_maddr,
+                  uint32_t nr_segments, xen_kexec_segment_t *segments);
+
+/*
+ * Unload a kexec image.
+ *
+ * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from
+ * being executed.  The crash images are not cleared from the crash
+ * region.
+ */
+int xc_kexec_unload(xc_interface *xch, int type);
+
 #endif /* XENCTRL_H */
-- 
1.7.2.5


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
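
xc_kexec_load() bounces sizeof(*segments) * nr_segments bytes to Xen,
so that size must be the same for 32-bit and 64-bit callers; the unions
with explicit 64-bit padding in the public header are what achieve this.
The sketch below mirrors the v2 structures with a plain pointer in place
of the XEN_GUEST_HANDLE, which is an assumption about how a handle is
represented.  The sketch_* names are hypothetical, and the checked
offsets are what the usual i386 and x86-64 ABIs produce.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of xen_kexec_segment_t; a plain pointer stands in
 * for XEN_GUEST_HANDLE(const_void). */
struct sketch_kexec_segment {
    union {
        const void *h;
        uint64_t _pad;    /* widens the handle to 8 bytes on 32-bit */
    } buf;
    uint64_t buf_size;
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/* Hypothetical mirror of the v2 xen_kexec_load_t. */
struct sketch_kexec_load {
    uint8_t  type;
    uint8_t  _pad;
    uint16_t arch;
    uint32_t nr_segments;
    union {
        struct sketch_kexec_segment *h;
        uint64_t _pad;
    } segments;
    uint64_t entry_maddr;
};

/* These hold for both ILP32 (i386) and LP64 (x86-64) callers. */
_Static_assert(sizeof(struct sketch_kexec_segment) == 32, "segment size");
_Static_assert(offsetof(struct sketch_kexec_segment, buf_size) == 8,
               "buf_size offset");
_Static_assert(offsetof(struct sketch_kexec_load, segments) == 8,
               "segments offset");
_Static_assert(offsetof(struct sketch_kexec_load, entry_maddr) == 16,
               "entry_maddr offset");
_Static_assert(sizeof(struct sketch_kexec_load) == 24, "load size");

/* Bytes that would be bounced to the hypervisor for nr segments. */
size_t sketch_segments_bytes(uint32_t nr)
{
    return sizeof(struct sketch_kexec_segment) * nr;
}
```

Without the _pad members, a 32-bit caller would see 4-byte handles and
different offsets, and the hypervisor would need a separate compat
layout.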

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 8/9] x86: check kexec relocation code fits in a page
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (12 preceding siblings ...)
  2013-11-06 14:49 ` David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-06 14:49 ` David Vrabel
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

The kexec relocation (control) code must fit in a single page, so add a
link-time check for this.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/xen.lds.S |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 9600cdf..17db361 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -198,3 +198,5 @@ SECTIONS
   .stab.indexstr 0 : { *(.stab.indexstr) }
   .comment 0 : { *(.comment) }
 }
+
+ASSERT(kexec_reloc_size - kexec_reloc <= PAGE_SIZE, "kexec_reloc is too large")
-- 
1.7.2.5
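
The ASSERT subtracts two linker symbols, so the size it checks is only
known at link time.  For code or data whose size is a compile-time
constant, C11 _Static_assert gives the same guarantee one step earlier.
In this sketch reloc_blob and SKETCH_PAGE_SIZE are hypothetical stand-ins,
the latter assuming 4 KiB x86 pages.

```c
#include <stddef.h>

/* Hypothetical stand-in for PAGE_SIZE, assuming 4 KiB x86 pages. */
#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for the relocation code: a blob whose size the
 * compiler knows, unlike kexec_reloc, whose size is only fixed once the
 * linker has placed the symbols. */
const unsigned char reloc_blob[1024] = { 0 };

/* Compilation fails with this message if the blob outgrows one page,
 * mirroring "kexec_reloc is too large" in xen.lds.S. */
_Static_assert(sizeof(reloc_blob) <= SKETCH_PAGE_SIZE,
               "reloc_blob is too large");

/* Size that would be copied into the single control page. */
size_t reloc_blob_size(void)
{
    return sizeof(reloc_blob);
}
```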

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (15 preceding siblings ...)
  2013-11-06 14:49 ` [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer David Vrabel
@ 2013-11-06 14:49 ` David Vrabel
  2013-11-07 21:16 ` [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels Daniel Kiper
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-06 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, kexec, David Vrabel, Jan Beulich

From: David Vrabel <david.vrabel@citrix.com>

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 MAINTAINERS |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index adacac2..4aac28c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -197,6 +197,14 @@ X:	xen/drivers/passthrough/amd/
 X:	xen/drivers/passthrough/vtd/
 F:	xen/include/xen/iommu.h
 
+KEXEC
+M:      David Vrabel <david.vrabel@citrix.com>
+S:      Supported
+F:      xen/common/{kexec,kimage}.c
+F:      xen/include/{kexec,kimage}.h
+F:      xen/arch/x86/machine_kexec.c
+F:      xen/arch/x86/x86_64/kexec_reloc.S
+
 LINUX (PV_OPS)
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 S:	Supported
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry
  2013-11-06 14:49 ` David Vrabel
  2013-11-06 18:49   ` [Xen-devel] " Don Slutz
@ 2013-11-06 18:49   ` Don Slutz
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-06 18:49 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

Looks good.

Reviewed-by: Don Slutz <dslutz@verizon.com>

    -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> FIX_EFI_MPF was the same as FIX_KEXEC_BASE_0, which is going away, so
> give it its own entry.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/arch/x86/mpparse.c       |    2 --
>   xen/include/asm-x86/fixmap.h |    1 +
>   2 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
> index 97d34bc..3753704 100644
> --- a/xen/arch/x86/mpparse.c
> +++ b/xen/arch/x86/mpparse.c
> @@ -538,8 +538,6 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
>   	}
>   }
>   
> -#define FIX_EFI_MPF FIX_KEXEC_BASE_0
> -
>   static __init void efi_unmap_mpf(void)
>   {
>   	if (efi_enabled)
> diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
> index d850be4..8b4266d 100644
> --- a/xen/include/asm-x86/fixmap.h
> +++ b/xen/include/asm-x86/fixmap.h
> @@ -66,6 +66,7 @@ enum fixed_addresses {
>       FIX_APEI_RANGE_BASE,
>       FIX_APEI_RANGE_END = FIX_APEI_RANGE_BASE + FIX_APEI_RANGE_MAX -1,
>       FIX_IGD_MMIO,
> +    FIX_EFI_MPF,
>       __end_of_fixed_addresses
>   };
>   

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer
  2013-11-06 14:49 ` [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer David Vrabel
@ 2013-11-06 18:50   ` Don Slutz
  2013-11-06 18:50   ` Don Slutz
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-06 18:50 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

Looks good.

Reviewed-by: Don Slutz <dslutz@verizon.com>

    -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>   MAINTAINERS |    8 ++++++++
>   1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index adacac2..4aac28c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -197,6 +197,14 @@ X:	xen/drivers/passthrough/amd/
>   X:	xen/drivers/passthrough/vtd/
>   F:	xen/include/xen/iommu.h
>   
> +KEXEC
> +M:      David Vrabel <david.vrabel@citrix.com>
> +S:      Supported
> +F:      xen/common/{kexec,kimage}.c
> +F:      xen/include/{kexec,kimage}.h
> +F:      xen/arch/x86/machine_kexec.c
> +F:      xen/arch/x86/x86_64/kexec_reloc.S
> +
>   LINUX (PV_OPS)
>   M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>   S:	Supported

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 8/9] x86: check kexec relocation code fits in a page
  2013-11-06 14:49 ` David Vrabel
  2013-11-06 18:51   ` [Xen-devel] " Don Slutz
@ 2013-11-06 18:51   ` Don Slutz
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-06 18:51 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

Also

Reviewed-by: Don Slutz <dslutz@verizon.com>

    -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> The kexec relocation (control) code must fit in a single page, so add a
> link-time check for this.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/arch/x86/xen.lds.S |    2 ++
>   1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
> index 9600cdf..17db361 100644
> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -198,3 +198,5 @@ SECTIONS
>     .stab.indexstr 0 : { *(.stab.indexstr) }
>     .comment 0 : { *(.comment) }
>   }
> +
> +ASSERT(kexec_reloc_size - kexec_reloc <= PAGE_SIZE, "kexec_reloc is too large")

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops
  2013-11-06 14:49 ` David Vrabel
  2013-11-07 20:38   ` Don Slutz
@ 2013-11-07 20:38   ` Don Slutz
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 20:38 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload sub-ops
> to the kexec hypercall.  These new sub-ops allow a privileged guest to
> provide the image data to be loaded into Xen memory or the crash
> region instead of guests loading the image data themselves and
> providing the relocation code and metadata.
>
> The old interface is provided to guests requesting an interface
> version prior to 4.4.
>
> Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/kexec.c              |   12 +++---
>   xen/include/public/kexec.h      |   92 +++++++++++++++++++++++++++++++++++++--
>   xen/include/public/xen-compat.h |    2 +-
>   3 files changed, 95 insertions(+), 11 deletions(-)
>
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index 7cd151f..7b23df0 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -734,7 +734,7 @@ static void crash_save_vmcoreinfo(void)
>   #endif
>   }
>   
> -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
> +static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
>   {
>       xen_kexec_image_t *image;
>       int base, bit, pos;
> @@ -781,7 +781,7 @@ static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
>   
>   static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
> -    xen_kexec_load_t load;
> +    xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&load, uarg, 1)) )
>           return -EFAULT;
> @@ -793,8 +793,8 @@ static int kexec_load_unload_compat(unsigned long op,
>                                       XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>   #ifdef CONFIG_COMPAT
> -    compat_kexec_load_t compat_load;
> -    xen_kexec_load_t load;
> +    compat_kexec_load_v1_t compat_load;
> +    xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&compat_load, uarg, 1)) )
>           return -EFAULT;
> @@ -866,8 +866,8 @@ static int do_kexec_op_internal(unsigned long op,
>           else
>                   ret = kexec_get_range(uarg);
>           break;
> -    case KEXEC_CMD_kexec_load:
> -    case KEXEC_CMD_kexec_unload:
> +    case KEXEC_CMD_kexec_load_v1:
> +    case KEXEC_CMD_kexec_unload_v1:
>           spin_lock_irqsave(&kexec_lock, flags);
>           if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
>           {
> diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h
> index 36409ff..a6a0a88 100644
> --- a/xen/include/public/kexec.h
> +++ b/xen/include/public/kexec.h
> @@ -105,6 +105,20 @@ typedef struct xen_kexec_image {
>    * Perform kexec having previously loaded a kexec or kdump kernel
>    * as appropriate.
>    * type == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
> + *
> + * Control is transferred to the image entry point with the host in
> + * the following state.
> + *
> + * - The image may be executed on any PCPU and all other PCPUs are
> + *   stopped.
> + *
> + * - Local interrupts are disabled.
> + *
> + * - Register values are undefined.
> + *
> + * - The image segments have writeable 1:1 virtual to machine
> + *   mappings.  The location of any page tables is undefined and these
> + *   page table frames are not mapped.
>    */
>   #define KEXEC_CMD_kexec                 0
>   typedef struct xen_kexec_exec {
> @@ -116,12 +130,12 @@ typedef struct xen_kexec_exec {
>    * type  == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
>    * image == relocation information for kexec (ignored for unload) [in]
>    */
> -#define KEXEC_CMD_kexec_load            1
> -#define KEXEC_CMD_kexec_unload          2
> -typedef struct xen_kexec_load {
> +#define KEXEC_CMD_kexec_load_v1         1 /* obsolete since 0x00040400 */
> +#define KEXEC_CMD_kexec_unload_v1       2 /* obsolete since 0x00040400 */
> +typedef struct xen_kexec_load_v1 {
>       int type;
>       xen_kexec_image_t image;
> -} xen_kexec_load_t;
> +} xen_kexec_load_v1_t;
>   
>   #define KEXEC_RANGE_MA_CRASH      0 /* machine address and size of crash area */
>   #define KEXEC_RANGE_MA_XEN        1 /* machine address and size of Xen itself */
> @@ -152,6 +166,76 @@ typedef struct xen_kexec_range {
>       unsigned long start;
>   } xen_kexec_range_t;
>   
> +#if __XEN_INTERFACE_VERSION__ >= 0x00040400
> +/*
> + * A contiguous chunk of a kexec image and its destination machine
> + * address.
> + */
> +typedef struct xen_kexec_segment {
> +    union {
> +        XEN_GUEST_HANDLE(const_void) h;
> +        uint64_t _pad;
> +    } buf;
> +    uint64_t buf_size;
> +    uint64_t dest_maddr;
> +    uint64_t dest_size;
> +} xen_kexec_segment_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t);
> +
> +/*
> + * Load a kexec image into memory.
> + *
> + * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM.
> + * The image is relocated prior to being executed.
> + *
> + * For KEXEC_TYPE_CRASH images, each segment of the image must reside
> + * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and
> + * the entry point must be within the image. The caller is responsible
> + * for ensuring that multiple images do not overlap.
> + *
> + * All image segments will be loaded to their destination machine
> + * addresses prior to being executed.  The trailing portion of any
> + * segment with a source buffer (from dest_maddr + buf_size to
> + * dest_maddr + dest_size) will be zeroed.
> + *
> + * Segments with no source buffer will be accessible to the image when
> + * it is executed.
> + */
> +
> +#define KEXEC_CMD_kexec_load 4
> +typedef struct xen_kexec_load {
> +    uint8_t  type;        /* One of KEXEC_TYPE_* */
> +    uint8_t  _pad;
> +    uint16_t arch;        /* ELF machine type (EM_*). */
> +    uint32_t nr_segments;
> +    union {
> +        XEN_GUEST_HANDLE(xen_kexec_segment_t) h;
> +        uint64_t _pad;
> +    } segments;
> +    uint64_t entry_maddr; /* image entry point machine address. */
> +} xen_kexec_load_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t);
> +
> +/*
> + * Unload a kexec image.
> + *
> + * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH.
> + */
> +#define KEXEC_CMD_kexec_unload 5
> +typedef struct xen_kexec_unload {
> +    uint8_t type;
> +} xen_kexec_unload_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t);
> +
> +#else /* __XEN_INTERFACE_VERSION__ < 0x00040400 */
> +
> +#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1
> +#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1
> +#define xen_kexec_load xen_kexec_load_v1
> +#define xen_kexec_load_t xen_kexec_load_v1_t
> +
> +#endif
> +
>   #endif /* _XEN_PUBLIC_KEXEC_H */
>   
>   /*
> diff --git a/xen/include/public/xen-compat.h b/xen/include/public/xen-compat.h
> index 69141c4..3eb80a0 100644
> --- a/xen/include/public/xen-compat.h
> +++ b/xen/include/public/xen-compat.h
> @@ -27,7 +27,7 @@
>   #ifndef __XEN_PUBLIC_XEN_COMPAT_H__
>   #define __XEN_PUBLIC_XEN_COMPAT_H__
>   
> -#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040300
> +#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400
>   
>   #if defined(__XEN__) || defined(__XEN_TOOLS__)
>   /* Xen is built with matching headers and implements the latest interface. */
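
The load semantics described for xen_kexec_segment_t (copy buf_size bytes
to dest_maddr, zero the rest of dest_size, leave buffer-less segments
alone) can be modelled on an ordinary byte array.  In this minimal sketch
a host pointer replaces the guest handle and dest_maddr is an offset into
the array.  struct sketch_segment and sketch_load_segment() are
hypothetical, and rejecting buf_size > dest_size is an assumption read
from the description, not taken from the Xen code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified segment: buf is a host pointer (NULL means
 * no source buffer) and dest_maddr is an offset into 'mem'. */
struct sketch_segment {
    const void *buf;
    uint64_t buf_size;
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/* Copy buf_size bytes to dest_maddr and zero up to dest_maddr + dest_size.
 * Segments without a source buffer are only range-checked, since the
 * header says they are merely made accessible to the image.
 * Returns 0 on success, -1 if the segment does not fit. */
static int sketch_load_segment(uint8_t *mem, uint64_t mem_size,
                               const struct sketch_segment *s)
{
    if ( s->dest_maddr > mem_size || s->dest_size > mem_size - s->dest_maddr )
        return -1;
    if ( s->buf == NULL )
        return 0;
    if ( s->buf_size > s->dest_size )   /* assumption: treated as invalid */
        return -1;
    memcpy(mem + s->dest_maddr, s->buf, s->buf_size);
    memset(mem + s->dest_maddr + s->buf_size, 0, s->dest_size - s->buf_size);
    return 0;
}
```

For instance, a 3-byte buffer loaded with dest_maddr 4 and dest_size 8
fills bytes 4 to 6 and zeroes bytes 7 to 11, leaving byte 12 untouched.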

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops
  2013-11-06 14:49 ` David Vrabel
@ 2013-11-07 20:38   ` Don Slutz
  2013-11-07 20:38   ` Don Slutz
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 20:38 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add replacement KEXEC_CMD_load and KEXEC_CMD_unload sub-ops to the
> kexec hypercall.  These new sub-ops allow a privileged guest to
> provide the image data to be loaded into Xen memory or the crash
> region instead of guests loading the image data themselves and
> providing the relocation code and metadata.
>
> The old interface is provided to guests requesting an interface
> version prior to 4.4.
>
> Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/kexec.c              |   12 +++---
>   xen/include/public/kexec.h      |   92 +++++++++++++++++++++++++++++++++++++--
>   xen/include/public/xen-compat.h |    2 +-
>   3 files changed, 95 insertions(+), 11 deletions(-)
>
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index 7cd151f..7b23df0 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -734,7 +734,7 @@ static void crash_save_vmcoreinfo(void)
>   #endif
>   }
>   
> -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
> +static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
>   {
>       xen_kexec_image_t *image;
>       int base, bit, pos;
> @@ -781,7 +781,7 @@ static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_t *load)
>   
>   static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
> -    xen_kexec_load_t load;
> +    xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&load, uarg, 1)) )
>           return -EFAULT;
> @@ -793,8 +793,8 @@ static int kexec_load_unload_compat(unsigned long op,
>                                       XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>   #ifdef CONFIG_COMPAT
> -    compat_kexec_load_t compat_load;
> -    xen_kexec_load_t load;
> +    compat_kexec_load_v1_t compat_load;
> +    xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&compat_load, uarg, 1)) )
>           return -EFAULT;
> @@ -866,8 +866,8 @@ static int do_kexec_op_internal(unsigned long op,
>           else
>                   ret = kexec_get_range(uarg);
>           break;
> -    case KEXEC_CMD_kexec_load:
> -    case KEXEC_CMD_kexec_unload:
> +    case KEXEC_CMD_kexec_load_v1:
> +    case KEXEC_CMD_kexec_unload_v1:
>           spin_lock_irqsave(&kexec_lock, flags);
>           if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
>           {
> diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h
> index 36409ff..a6a0a88 100644
> --- a/xen/include/public/kexec.h
> +++ b/xen/include/public/kexec.h
> @@ -105,6 +105,20 @@ typedef struct xen_kexec_image {
>    * Perform kexec having previously loaded a kexec or kdump kernel
>    * as appropriate.
>    * type == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
> + *
> + * Control is transferred to the image entry point with the host in
> + * the following state.
> + *
> + * - The image may be executed on any PCPU and all other PCPUs are
> + *   stopped.
> + *
> + * - Local interrupts are disabled.
> + *
> + * - Register values are undefined.
> + *
> + * - The image segments have writeable 1:1 virtual to machine
> + *   mappings.  The location of any page tables is undefined and these
> + *   page table frames are not mapped.
>    */
>   #define KEXEC_CMD_kexec                 0
>   typedef struct xen_kexec_exec {
> @@ -116,12 +130,12 @@ typedef struct xen_kexec_exec {
>    * type  == KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH [in]
>    * image == relocation information for kexec (ignored for unload) [in]
>    */
> -#define KEXEC_CMD_kexec_load            1
> -#define KEXEC_CMD_kexec_unload          2
> -typedef struct xen_kexec_load {
> +#define KEXEC_CMD_kexec_load_v1         1 /* obsolete since 0x00040400 */
> +#define KEXEC_CMD_kexec_unload_v1       2 /* obsolete since 0x00040400 */
> +typedef struct xen_kexec_load_v1 {
>       int type;
>       xen_kexec_image_t image;
> -} xen_kexec_load_t;
> +} xen_kexec_load_v1_t;
>   
>   #define KEXEC_RANGE_MA_CRASH      0 /* machine address and size of crash area */
>   #define KEXEC_RANGE_MA_XEN        1 /* machine address and size of Xen itself */
> @@ -152,6 +166,76 @@ typedef struct xen_kexec_range {
>       unsigned long start;
>   } xen_kexec_range_t;
>   
> +#if __XEN_INTERFACE_VERSION__ >= 0x00040400
> +/*
> + * A contiguous chunk of a kexec image and its destination machine
> + * address.
> + */
> +typedef struct xen_kexec_segment {
> +    union {
> +        XEN_GUEST_HANDLE(const_void) h;
> +        uint64_t _pad;
> +    } buf;
> +    uint64_t buf_size;
> +    uint64_t dest_maddr;
> +    uint64_t dest_size;
> +} xen_kexec_segment_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_segment_t);
> +
> +/*
> + * Load a kexec image into memory.
> + *
> + * For KEXEC_TYPE_DEFAULT images, the segments may be anywhere in RAM.
> + * The image is relocated prior to being executed.
> + *
> + * For KEXEC_TYPE_CRASH images, each segment of the image must reside
> + * in the memory region reserved for kexec (KEXEC_RANGE_MA_CRASH) and
> + * the entry point must be within the image. The caller is responsible
> + * for ensuring that multiple images do not overlap.
> + *
> + * All image segments will be loaded to their destination machine
> + * addresses prior to being executed.  The trailing portion of any
> + * segments with a source buffer (from dest_maddr + buf_size to
> + * dest_maddr + dest_size) will be zeroed.
> + *
> + * Segments with no source buffer will be accessible to the image when
> + * it is executed.
> + */
> +
> +#define KEXEC_CMD_kexec_load 4
> +typedef struct xen_kexec_load {
> +    uint8_t  type;        /* One of KEXEC_TYPE_* */
> +    uint8_t  _pad;
> +    uint16_t arch;        /* ELF machine type (EM_*). */
> +    uint32_t nr_segments;
> +    union {
> +        XEN_GUEST_HANDLE(xen_kexec_segment_t) h;
> +        uint64_t _pad;
> +    } segments;
> +    uint64_t entry_maddr; /* image entry point machine address. */
> +} xen_kexec_load_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_load_t);
> +
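A standalone sketch of the uniform layout follows. It assumes each guest-handle union is represented by its uint64_t padding member. The `_mirror` types and `segment_zero_range()` are hypothetical and not part of the patch. The static assertions pin every field offset, which is what the explicit padding buys: 32-bit and 64-bit callers see the same structure. The helper computes the machine-address range that the comment above says is zeroed after the buffer is copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of xen_kexec_segment_t; the guest handle union is
 * represented by its 8-byte uint64_t padding member. */
typedef struct kexec_segment_mirror {
    uint64_t buf;         /* stands in for the guest handle union */
    uint64_t buf_size;
    uint64_t dest_maddr;
    uint64_t dest_size;
} kexec_segment_mirror;

/* Hypothetical mirror of xen_kexec_load_t. */
typedef struct kexec_load_mirror {
    uint8_t  type;        /* KEXEC_TYPE_* */
    uint8_t  _pad;        /* explicit padding, no compiler-inserted holes */
    uint16_t arch;        /* EM_* */
    uint32_t nr_segments;
    uint64_t segments;    /* stands in for the segment handle union */
    uint64_t entry_maddr;
} kexec_load_mirror;

/* Every offset is fixed, whatever the caller's word size. */
_Static_assert(offsetof(kexec_load_mirror, arch) == 2, "arch offset");
_Static_assert(offsetof(kexec_load_mirror, nr_segments) == 4, "nr_segments offset");
_Static_assert(offsetof(kexec_load_mirror, segments) == 8, "segments offset");
_Static_assert(offsetof(kexec_load_mirror, entry_maddr) == 16, "entry offset");
_Static_assert(sizeof(kexec_load_mirror) == 24, "load size");
_Static_assert(sizeof(kexec_segment_mirror) == 32, "segment size");

/* Range [*zero_start, *zero_end) zero-filled after the buffer is copied;
 * empty when buf_size == dest_size. */
static void segment_zero_range(const kexec_segment_mirror *seg,
                               uint64_t *zero_start, uint64_t *zero_end)
{
    assert(seg->buf_size <= seg->dest_size);
    *zero_start = seg->dest_maddr + seg->buf_size;
    *zero_end   = seg->dest_maddr + seg->dest_size;
}
```
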
> +/*
> + * Unload a kexec image.
> + *
> + * Type must be one of KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH.
> + */
> +#define KEXEC_CMD_kexec_unload 5
> +typedef struct xen_kexec_unload {
> +    uint8_t type;
> +} xen_kexec_unload_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_kexec_unload_t);
> +
> +#else /* __XEN_INTERFACE_VERSION__ < 0x00040400 */
> +
> +#define KEXEC_CMD_kexec_load KEXEC_CMD_kexec_load_v1
> +#define KEXEC_CMD_kexec_unload KEXEC_CMD_kexec_unload_v1
> +#define xen_kexec_load xen_kexec_load_v1
> +#define xen_kexec_load_t xen_kexec_load_v1_t
> +
> +#endif
> +
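The #else branch means that a caller built against an interface version older than 0x00040400 still gets the v1 sub-op numbers under the old names. Because v1 (1, 2) and the new sub-ops (4, 5) use distinct numbers, the hypervisor can dispatch both. The header does this mapping at compile time; below is a runtime sketch of it, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the sub-op number KEXEC_CMD_kexec_load resolves
 * to for a given __XEN_INTERFACE_VERSION__. */
static unsigned int kexec_load_cmd_for(uint32_t interface_version)
{
    return interface_version >= 0x00040400 ? 4u   /* new load sub-op */
                                           : 1u;  /* KEXEC_CMD_kexec_load_v1 */
}

/* Hypothetical helper: the same mapping for KEXEC_CMD_kexec_unload. */
static unsigned int kexec_unload_cmd_for(uint32_t interface_version)
{
    return interface_version >= 0x00040400 ? 5u   /* new unload sub-op */
                                           : 2u;  /* KEXEC_CMD_kexec_unload_v1 */
}
```
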
>   #endif /* _XEN_PUBLIC_KEXEC_H */
>   
>   /*
> diff --git a/xen/include/public/xen-compat.h b/xen/include/public/xen-compat.h
> index 69141c4..3eb80a0 100644
> --- a/xen/include/public/xen-compat.h
> +++ b/xen/include/public/xen-compat.h
> @@ -27,7 +27,7 @@
>   #ifndef __XEN_PUBLIC_XEN_COMPAT_H__
>   #define __XEN_PUBLIC_XEN_COMPAT_H__
>   
> -#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040300
> +#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400
>   
>   #if defined(__XEN__) || defined(__XEN_TOOLS__)
>   /* Xen is built with matching headers and implements the latest interface. */




* Re: [PATCH 3/9] kexec: add infrastructure for handling kexec images
  2013-11-06 14:49 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
  2013-11-07 20:40   ` [Xen-devel] " Don Slutz
@ 2013-11-07 20:40   ` Don Slutz
  2013-11-08 12:50   ` [PATCHv11 " David Vrabel
  2 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 20:40 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add the code needed to handle and load kexec images into Xen memory or
> into the crash region.  This is needed for the new KEXEC_CMD_load and
> KEXEC_CMD_unload hypercall sub-ops.
>
> Much of this code is derived from the Linux kernel.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/Makefile      |    1 +
>   xen/common/kimage.c      |  821 ++++++++++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/kimage.h |   62 ++++
>   3 files changed, 884 insertions(+), 0 deletions(-)
>   create mode 100644 xen/common/kimage.c
>   create mode 100644 xen/include/xen/kimage.h
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index 686f7a1..3683ae3 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -13,6 +13,7 @@ obj-y += irq.o
>   obj-y += kernel.o
>   obj-y += keyhandler.o
>   obj-$(HAS_KEXEC) += kexec.o
> +obj-$(HAS_KEXEC) += kimage.o
>   obj-y += lib.o
>   obj-y += memory.o
>   obj-y += multicall.o
> diff --git a/xen/common/kimage.c b/xen/common/kimage.c
> new file mode 100644
> index 0000000..02ee37e
> --- /dev/null
> +++ b/xen/common/kimage.c
> @@ -0,0 +1,821 @@
> +/*
> + * Kexec Image
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Derived from kernel/kexec.c from Linux:
> + *
> + *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include <xen/config.h>
> +#include <xen/types.h>
> +#include <xen/init.h>
> +#include <xen/kernel.h>
> +#include <xen/errno.h>
> +#include <xen/spinlock.h>
> +#include <xen/guest_access.h>
> +#include <xen/mm.h>
> +#include <xen/kexec.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/page.h>
> +
> +/*
> + * When kexec transitions to the new kernel there is a one-to-one
> + * mapping between physical and virtual addresses.  On processors
> + * where you can disable the MMU this is trivial and easy.  For
> + * others it is still a simple, predictable page table to set up.
> + *
> + * The code for the transition from the current kernel to the new
> + * kernel is placed in the page-size control_code_buffer.  This memory
> + * must be identity mapped in the transition from virtual to physical
> + * addresses.
> + *
> + * The assembly stub in the control code buffer is passed a linked list
> + * of descriptor pages detailing the source pages of the new kernel,
> + * and the destination addresses of those source pages.  As this data
> + * structure is not used in the context of the current OS, it must
> + * be self-contained.
> + *
> + * The code has been made to work with highmem pages and will use a
> + * destination page in its final resting place (if it happens
> + * to allocate it).  The end product of this is that most of the
> + * physical address space, and most of RAM can be used.
> + *
> + * Future directions include:
> + *  - allocating a page table with the control code buffer identity
> + *    mapped, to simplify machine_kexec and make kexec_on_panic more
> + *    reliable.
> + */
> +
> +/*
> + * KIMAGE_NO_DEST is an impossible destination address, used for
> + * allocating pages whose destination address we do not care about.
> + */
> +#define KIMAGE_NO_DEST (-1UL)
> +
> +/*
> + * Offset of the last entry in an indirection page.
> + */
> +#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
> +
> +
> +static int kimage_is_destination_range(struct kexec_image *image,
> +                                       paddr_t start, paddr_t end);
> +static struct page_info *kimage_alloc_page(struct kexec_image *image,
> +                                           paddr_t dest);
> +
> +static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
> +{
> +    struct page_info *page;
> +
> +    page = alloc_domheap_page(NULL, memflags);
> +    if ( !page )
> +        return NULL;
> +
> +    clear_domain_page(page_to_mfn(page));
> +
> +    return page;
> +}
> +
> +static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
> +                           unsigned long nr_segments,
> +                           xen_kexec_segment_t *segments, uint8_t type)
> +{
> +    struct kexec_image *image;
> +    unsigned long i;
> +    int result;
> +
> +    /* Allocate a controlling structure */
> +    result = -ENOMEM;
> +    image = xzalloc(typeof(*image));
> +    if ( !image )
> +        goto out;
> +
> +    image->entry_maddr = entry;
> +    image->type = type;
> +    image->nr_segments = nr_segments;
> +    image->segments = segments;
> +
> +    image->next_crash_page = kexec_crash_area.start;
> +
> +    INIT_PAGE_LIST_HEAD(&image->control_pages);
> +    INIT_PAGE_LIST_HEAD(&image->dest_pages);
> +    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
> +
> +    /*
> +     * Verify we have good destination addresses.  The caller is
> +     * responsible for making certain we don't attempt to load the new
> +     * image into invalid or reserved areas of RAM.  This just
> +     * verifies it is an address we can use.
> +     *
> +     * Since the kernel does everything in page size chunks ensure the
> +     * destination addresses are page aligned.  Too many special cases
> + * crop up when we don't do this.  The most insidious is getting
> +     * overlapping destination addresses simply because addresses are
> +     * changed to page size granularity.
> +     */
> +    result = -EADDRNOTAVAIL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend   = mstart + image->segments[i].dest_size;
> +        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
> +            goto out;
> +    }
> +
> +    /*
> +     * Verify our destination addresses do not overlap.  If we allowed
> +     * overlapping destination addresses through, very weird things can
> +     * happen with no easy explanation as one segment stomps on
> +     * another.
> +     */
> +    result = -EINVAL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +        unsigned long j;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend   = mstart + image->segments[i].dest_size;
> +        for ( j = 0; j < i; j++ )
> +        {
> +            paddr_t pstart, pend;
> +            pstart = image->segments[j].dest_maddr;
> +            pend   = pstart + image->segments[j].dest_size;
> +            /* Do the segments overlap? */
> +            if ( (mend > pstart) && (mstart < pend) )
> +                goto out;
> +        }
> +    }
> +
> +    /*
> +     * Ensure our buffer sizes are no larger than our memory
> +     * sizes.  This should always be the case, and it is easier to
> +     * check up front than to be surprised later on.
> +     */
> +    result = -EINVAL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        if ( image->segments[i].buf_size > image->segments[i].dest_size )
> +            goto out;
> +    }
> +
> +    /*
> +     * Page for the relocation code must still be accessible after the
> +     * processor has switched to 32-bit mode.
> +     */
> +    result = -ENOMEM;
> +    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
> +    if ( !image->control_code_page )
> +        goto out;
> +
> +    /* Add an empty indirection page. */
> +    image->entry_page = kimage_alloc_control_page(image, 0);
> +    if ( !image->entry_page )
> +        goto out;
> +
> +    image->head = page_to_maddr(image->entry_page);
> +
> +    result = 0;
> +out:
> +    if ( result == 0 )
> +        *rimage = image;
> +    else if ( image )
> +    {
> +        image->segments = NULL; /* caller frees segments after an error */
> +        kimage_free(image);
> +    }
> +
> +    return result;
> +
> +}
> +
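Here is a standalone sketch of the three checks do_kimage_alloc() performs on the segment list. It assumes 4 KiB pages and a simplified segment struct. `struct seg`, `validate_segments()` and the distinct return codes are hypothetical; the patch returns -EADDRNOTAVAIL for misalignment and -EINVAL for the other two failures. Destinations are treated as half-open ranges, so segments that merely touch do not count as overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumption: 4 KiB pages */

struct seg { uint64_t buf_size, dest_maddr, dest_size; };   /* hypothetical */

/* 0 if valid; -1 misaligned, -2 overlapping, -3 buffer larger than dest. */
static int validate_segments(const struct seg *s, size_t n)
{
    assert(n == 0 || s != NULL);

    /* Start and end of every destination must be page aligned. */
    for (size_t i = 0; i < n; i++) {
        uint64_t start = s[i].dest_maddr, end = start + s[i].dest_size;
        if ((start | end) & (SKETCH_PAGE_SIZE - 1))
            return -1;
    }

    /* Pairwise overlap test on half-open ranges [start, end). */
    for (size_t i = 0; i < n; i++) {
        uint64_t mstart = s[i].dest_maddr, mend = mstart + s[i].dest_size;
        for (size_t j = 0; j < i; j++) {
            uint64_t pstart = s[j].dest_maddr, pend = pstart + s[j].dest_size;
            if (mend > pstart && mstart < pend)
                return -2;
        }
    }

    /* The source buffer must fit in the destination. */
    for (size_t i = 0; i < n; i++)
        if (s[i].buf_size > s[i].dest_size)
            return -3;

    return 0;
}
```
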
> +static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
> +                               unsigned long nr_segments,
> +                               xen_kexec_segment_t *segments)
> +{
> +    return do_kimage_alloc(rimage, entry, nr_segments, segments,
> +                           KEXEC_TYPE_DEFAULT);
> +}
> +
> +static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
> +                              unsigned long nr_segments,
> +                              xen_kexec_segment_t *segments)
> +{
> +    unsigned long i;
> +    int result;
> +
> +    /* Verify we have a valid entry point */
> +    if ( (entry < kexec_crash_area.start)
> +         || (entry > kexec_crash_area.start + kexec_crash_area.size))
> +        return -EADDRNOTAVAIL;
> +
> +    /*
> +     * Verify we have good destination addresses.  Normally
> +     * the caller is responsible for making certain we don't
> +     * attempt to load the new image into invalid or reserved
> +     * areas of RAM.  But crash kernels are preloaded into a
> +     * reserved area of RAM.  We must ensure the addresses
> +     * are in the reserved area otherwise preloading the
> +     * kernel could corrupt things.
> +     */
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        if ( guest_handle_is_null(segments[i].buf.h) )
> +            continue;
> +
> +        mstart = segments[i].dest_maddr;
> +        mend = mstart + segments[i].dest_size;
> +        /* Ensure we are within the crash kernel limits. */
> +        if ( (mstart < kexec_crash_area.start )
> +             || (mend > kexec_crash_area.start + kexec_crash_area.size))
> +            return -EADDRNOTAVAIL;
> +    }
> +
> +    /* Allocate and initialize a controlling structure. */
> +    return do_kimage_alloc(rimage, entry, nr_segments, segments,
> +                           KEXEC_TYPE_CRASH);
> +}
> +
> +static int kimage_is_destination_range(struct kexec_image *image,
> +                                       paddr_t start,
> +                                       paddr_t end)
> +{
> +    unsigned long i;
> +
> +    for ( i = 0; i < image->nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend = mstart + image->segments[i].dest_size;
> +        if ( (end > mstart) && (start < mend) )
> +            return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void kimage_free_page_list(struct page_list_head *list)
> +{
> +    struct page_info *page, *next;
> +
> +    page_list_for_each_safe(page, next, list)
> +    {
> +        page_list_del(page, list);
> +        free_domheap_page(page);
> +    }
> +}
> +
> +static struct page_info *kimage_alloc_normal_control_page(
> +    struct kexec_image *image, unsigned memflags)
> +{
> +    /*
> +     * Control pages are special; they are the intermediaries that are
> +     * needed while we copy the rest of the pages to their final
> +     * resting place.  As such they must not conflict with either the
> +     * destination addresses or memory the kernel is already using.
> +     *
> +     * The only case where we really need more than one of these is
> +     * for architectures where we cannot disable the MMU and must
> +     * instead generate an identity mapped page table for all of the
> +     * memory.
> +     *
> +     * At worst this runs in O(N) of the image size.
> +     */
> +    struct page_list_head extra_pages;
> +    struct page_info *page = NULL;
> +
> +    INIT_PAGE_LIST_HEAD(&extra_pages);
> +
> +    /*
> +     * Loop while I can allocate a page and the page allocated is a
> +     * destination page.
> +     */
> +    do {
> +        unsigned long mfn, emfn;
> +        paddr_t addr, eaddr;
> +
> +        page = kimage_alloc_zeroed_page(memflags);
> +        if ( !page )
> +            break;
> +        mfn   = page_to_mfn(page);
> +        emfn  = mfn + 1;
> +        addr  = page_to_maddr(page);
> +        eaddr = addr + PAGE_SIZE;
> +        if ( kimage_is_destination_range(image, addr, eaddr) )
> +        {
> +            page_list_add(page, &extra_pages);
> +            page = NULL;
> +        }
> +    } while ( !page );
> +
> +    if ( page )
> +    {
> +        /* Remember the allocated page... */
> +        page_list_add(page, &image->control_pages);
> +
> +        /*
> +         * Because the page is already in its destination location we
> +         * will never allocate another page at that address.
> +         * Therefore kimage_alloc_page will not return it (again) and
> +         * we don't need to give it an entry in image->segments[].
> +         */
> +    }
> +    /*
> +     * Deal with the destination pages I have inadvertently allocated.
> +     *
> +     * Ideally I would convert multi-page allocations into single page
> +     * allocations, and add everything to image->dest_pages.
> +     *
> +     * For now it is simpler to just free the pages.
> +     */
> +    kimage_free_page_list(&extra_pages);
> +
> +    return page;
> +}
> +
> +static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
> +{
> +    /*
> +     * Control pages are special; they are the intermediaries that are
> +     * needed while we copy the rest of the pages to their final
> +     * resting place.  As such they must not conflict with either the
> +     * destination addresses or memory the kernel is already using.
> +     *
> +     * Control pages are also the only pages we must allocate when
> +     * loading a crash kernel.  All of the other pages are specified
> +     * by the segments and we just memcpy into them directly.
> +     *
> +     * The only case where we really need more than one of these is
> +     * for architectures where we cannot disable the MMU and must
> +     * instead generate an identity mapped page table for all of the
> +     * memory.
> +     *
> +     * Given the low demand this implements a very simple allocator
> +     * that finds the first hole of the appropriate size in the
> +     * reserved memory region, and allocates all of the memory up to
> +     * and including the hole.
> +     */
> +    paddr_t hole_start, hole_end;
> +    struct page_info *page = NULL;
> +
> +    hole_start = PAGE_ALIGN(image->next_crash_page);
> +    hole_end   = hole_start + PAGE_SIZE;
> +    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
> +    {
> +        unsigned long i;
> +
> +        /* See if I overlap any of the segments. */
> +        for ( i = 0; i < image->nr_segments; i++ )
> +        {
> +            paddr_t mstart, mend;
> +
> +            mstart = image->segments[i].dest_maddr;
> +            mend   = mstart + image->segments[i].dest_size;
> +            if ( (hole_end > mstart) && (hole_start < mend) )
> +            {
> +                /* Advance the hole to the end of the segment. */
> +                hole_start = PAGE_ALIGN(mend);
> +                hole_end   = hole_start + PAGE_SIZE;
> +                break;
> +            }
> +        }
> +        /* If I don't overlap any segments I have found my hole! */
> +        if ( i == image->nr_segments )
> +        {
> +            page = maddr_to_page(hole_start);
> +            break;
> +        }
> +    }
> +    if ( page )
> +    {
> +        image->next_crash_page = hole_end;
> +        clear_domain_page(page_to_mfn(page));
> +    }
> +
> +    return page;
> +}
> +
> +
> +struct page_info *kimage_alloc_control_page(struct kexec_image *image,
> +                                            unsigned memflags)
> +{
> +    struct page_info *pages = NULL;
> +
> +    switch ( image->type )
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        pages = kimage_alloc_normal_control_page(image, memflags);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        pages = kimage_alloc_crash_control_page(image);
> +        break;
> +    }
> +    return pages;
> +}
> +
> +static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
> +{
> +    kimage_entry_t *entries;
> +
> +    if ( image->next_entry == KIMAGE_LAST_ENTRY )
> +    {
> +        struct page_info *page;
> +
> +        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
> +        if ( !page )
> +            return -ENOMEM;
> +
> +        entries = __map_domain_page(image->entry_page);
> +        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
> +        unmap_domain_page(entries);
> +
> +        image->entry_page = page;
> +        image->next_entry = 0;
> +    }
> +
> +    entries = __map_domain_page(image->entry_page);
> +    entries[image->next_entry] = entry;
> +    image->next_entry++;
> +    unmap_domain_page(entries);
> +
> +    return 0;
> +}
> +
> +static int kimage_set_destination(struct kexec_image *image,
> +                                  paddr_t destination)
> +{
> +    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
> +}
> +
> +
> +static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
> +{
> +    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
> +}
> +
> +
> +static void kimage_free_extra_pages(struct kexec_image *image)
> +{
> +    kimage_free_page_list(&image->dest_pages);
> +    kimage_free_page_list(&image->unusable_pages);
> +}
> +
> +static void kimage_terminate(struct kexec_image *image)
> +{
> +    kimage_entry_t *entries;
> +
> +    entries = __map_domain_page(image->entry_page);
> +    entries[image->next_entry] = IND_DONE;
> +    unmap_domain_page(entries);
> +}
> +
> +/*
> + * Iterate over all the entries in the indirection pages.
> + *
> + * Call unmap_domain_page(ptr) after the loop exits.
> + */
> +#define for_each_kimage_entry(image, ptr, entry)                        \
> +    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
> +          (entry = *ptr) && !(entry & IND_DONE);                        \
> +          ptr = (entry & IND_INDIRECTION) ?                             \
> +              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
> +              : ptr + 1 )
> +
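The following is a user-space simulation of the indirection list that kimage_add_entry() builds and for_each_kimage_entry walks. The IND_* values are an assumption: they follow Linux's kexec, with IND_ZERO taken as 0x10. The real definitions are in the patch's kimage.h, which this excerpt does not show. "Pages" are small in-memory arrays so that chaining happens quickly. Each IND_ZERO entry is assumed to zero one destination page and advance the destination, as the per-page loop in kimage_load_normal_segment() implies. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values (see lead-in). */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8
#define IND_ZERO        0x10

#define SHIFT            12
#define ENTRIES_PER_PAGE 4      /* tiny, to force chaining */
#define MAX_PAGES        8

typedef uint64_t entry_t;
/* Simulated indirection pages; "page n" has address n << SHIFT. */
static entry_t pages[MAX_PAGES][ENTRIES_PER_PAGE];
static unsigned int cur_page, next_entry, pages_used = 1;

static void add_entry(entry_t e)
{
    /* Last slot is reserved for the link to the next indirection page. */
    if (next_entry == ENTRIES_PER_PAGE - 1) {
        unsigned int np = pages_used++;
        assert(np < MAX_PAGES);
        pages[cur_page][next_entry] = ((entry_t)np << SHIFT) | IND_INDIRECTION;
        cur_page = np;
        next_entry = 0;
    }
    pages[cur_page][next_entry++] = e;
}

static void terminate(void) { pages[cur_page][next_entry] = IND_DONE; }

/* Walk like for_each_kimage_entry; report each data page's destination
 * and source (0 means "zero-fill"). Returns the number of data pages. */
static unsigned int walk(uint64_t *dest_out, uint64_t *src_out, unsigned int max)
{
    const entry_t *ptr = pages[0];
    const uint64_t mask = ~((1ULL << SHIFT) - 1);   /* strips flag bits */
    uint64_t dest = 0;
    unsigned int n = 0;
    entry_t e;

    while ((e = *ptr) != 0 && !(e & IND_DONE)) {
        if (e & IND_INDIRECTION) {
            assert((e >> SHIFT) < MAX_PAGES);
            ptr = pages[e >> SHIFT];   /* follow the link, don't advance */
            continue;
        }
        if (e & IND_DESTINATION) {
            dest = e & mask;
        } else if (e & (IND_SOURCE | IND_ZERO)) {
            assert(n < max);
            dest_out[n] = dest;
            src_out[n] = (e & IND_SOURCE) ? (e & mask) : 0;
            n++;
            dest += 1ULL << SHIFT;
        }
        ptr++;
    }
    return n;
}
```
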
> +static void kimage_free_entry(kimage_entry_t entry)
> +{
> +    struct page_info *page;
> +
> +    page = mfn_to_page(entry >> PAGE_SHIFT);
> +    free_domheap_page(page);
> +}
> +
> +static void kimage_free_all_entries(struct kexec_image *image)
> +{
> +    kimage_entry_t *ptr, entry;
> +    kimage_entry_t ind = 0;
> +
> +    if ( !image->head )
> +        return;
> +
> +    for_each_kimage_entry(image, ptr, entry)
> +    {
> +        if ( entry & IND_INDIRECTION )
> +        {
> +            /* Free the previous indirection page */
> +            if ( ind & IND_INDIRECTION )
> +                kimage_free_entry(ind);
> +            /* Save this indirection page until we are done with it. */
> +            ind = entry;
> +        }
> +        else if ( entry & IND_SOURCE )
> +            kimage_free_entry(entry);
> +    }
> +    unmap_domain_page(ptr);
> +
> +    /* Free the final indirection page. */
> +    if ( ind & IND_INDIRECTION )
> +        kimage_free_entry(ind);
> +}
> +
> +void kimage_free(struct kexec_image *image)
> +{
> +    if ( !image )
> +        return;
> +
> +    kimage_free_extra_pages(image);
> +    kimage_free_all_entries(image);
> +    kimage_free_page_list(&image->control_pages);
> +    xfree(image->segments);
> +    xfree(image);
> +}
> +
> +static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
> +                                       paddr_t maddr)
> +{
> +    kimage_entry_t *ptr, entry;
> +    unsigned long destination = 0;
> +
> +    for_each_kimage_entry(image, ptr, entry)
> +    {
> +        if ( entry & IND_DESTINATION )
> +            destination = entry & PAGE_MASK;
> +        else if ( entry & IND_SOURCE )
> +        {
> +            if ( maddr == destination )
> +                return ptr;
> +            destination += PAGE_SIZE;
> +        }
> +    }
> +    unmap_domain_page(ptr);
> +
> +    return NULL;
> +}
> +
> +static struct page_info *kimage_alloc_page(struct kexec_image *image,
> +                                           paddr_t destination)
> +{
> +    /*
> +     * Here we implement safeguards to ensure that a source page is
> +     * not copied to its destination page before the data on the
> +     * destination page is no longer useful.
> +     *
> +     * To do this we maintain the invariant that a source page is
> +     * either its own destination page, or it is not a destination
> +     * page at all.
> +     *
> +     * That is slightly stronger than required, but the proof that no
> +     * problems will occur is trivial, and the implementation is
> +     * simple to verify.
> +     *
> +     * When allocating all pages normally this algorithm will run in
> +     * O(N) time, but in the worst case it will run in O(N^2) time.
> +     * If the runtime is a problem the data structures can be fixed.
> +     */
> +    struct page_info *page;
> +    paddr_t addr;
> +
> +    /*
> +     * Walk through the list of destination pages, and see if I have a
> +     * match.
> +     */
> +    page_list_for_each(page, &image->dest_pages)
> +    {
> +        addr = page_to_maddr(page);
> +        if ( addr == destination )
> +        {
> +            page_list_del(page, &image->dest_pages);
> +            return page;
> +        }
> +    }
> +    page = NULL;
> +    for (;;)
> +    {
> +        kimage_entry_t *old;
> +
> +        /* Allocate a page, if we run out of memory give up. */
> +        page = kimage_alloc_zeroed_page(0);
> +        if ( !page )
> +            return NULL;
> +        addr = page_to_maddr(page);
> +
> +        /* If it is the destination page we want, use it. */
> +        if ( addr == destination )
> +            break;
> +
> +        /* If the page is not a destination page use it. */
> +        if ( !kimage_is_destination_range(image, addr,
> +                                          addr + PAGE_SIZE) )
> +            break;
> +
> +        /*
> +         * I know that the page is someone's destination page.  See if
> +         * there is already a source page for this destination page.
> +         * And if so swap the source pages.
> +         */
> +        old = kimage_dst_used(image, addr);
> +        if ( old )
> +        {
> +            /* If so move it. */
> +            unsigned long old_mfn = *old >> PAGE_SHIFT;
> +            unsigned long mfn = addr >> PAGE_SHIFT;
> +
> +            copy_domain_page(mfn, old_mfn);
> +            clear_domain_page(old_mfn);
> +            *old = (addr & PAGE_MASK) | IND_SOURCE;
> +            unmap_domain_page(old);
> +
> +            page = mfn_to_page(old_mfn);
> +            break;
> +        }
> +        else
> +        {
> +            /*
> +             * Place the page on the destination list; I will use it
> +             * later.
> +             */
> +            page_list_add(page, &image->dest_pages);
> +        }
> +    }
> +    return page;
> +}
> +
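Here is a sketch of the invariant kimage_alloc_page() enforces, using hypothetical names and half-open destination ranges. Suppose a source page S were another page's destination. Copying that other page first would overwrite S before S itself is copied. Requiring every source to be either its own destination or outside all destinations removes any dependence on copy order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dest_range { uint64_t start, end; };   /* hypothetical, [start, end) */

static bool is_destination(const struct dest_range *r, unsigned long n,
                           uint64_t addr, uint64_t len)
{
    for (unsigned long i = 0; i < n; i++)
        if (addr + len > r[i].start && addr < r[i].end)
            return true;
    return false;
}

/* A source page is safe if it is its own destination or no destination. */
static bool source_page_safe(const struct dest_range *r, unsigned long n,
                             uint64_t src, uint64_t dest, uint64_t page_size)
{
    assert(page_size != 0);
    return src == dest || !is_destination(r, n, src, page_size);
}
```

Suppose the freshly allocated page is already someone's destination and that destination already has a source. In that case the patch copies the data into the new page and returns the old source page instead. The new page becomes its own destination, and the old page was not a destination at all, so the invariant still holds.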
> +static int kimage_load_normal_segment(struct kexec_image *image,
> +                                      xen_kexec_segment_t *segment)
> +{
> +    unsigned long to_copy;
> +    unsigned long src_offset;
> +    paddr_t dest, end;
> +    int ret;
> +
> +    to_copy = segment->buf_size;
> +    src_offset = 0;
> +    dest = segment->dest_maddr;
> +
> +    ret = kimage_set_destination(image, dest);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    while ( to_copy )
> +    {
> +        unsigned long dest_mfn;
> +        struct page_info *page;
> +        void *dest_va;
> +        size_t size;
> +
> +        dest_mfn = dest >> PAGE_SHIFT;
> +
> +        size = min_t(unsigned long, PAGE_SIZE, to_copy);
> +
> +        page = kimage_alloc_page(image, dest);
> +        if ( !page )
> +            return -ENOMEM;
> +        ret = kimage_add_page(image, page_to_maddr(page));
> +        if ( ret < 0 )
> +            return ret;
> +
> +        dest_va = __map_domain_page(page);
> +        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
> +        unmap_domain_page(dest_va);
> +        if ( ret )
> +            return -EFAULT;
> +
> +        to_copy -= size;
> +        src_offset += size;
> +        dest += PAGE_SIZE;
> +    }
> +
> +    /* Remainder of the destination should be zeroed. */
> +    end = segment->dest_maddr + segment->dest_size;
> +    for ( ; dest < end; dest += PAGE_SIZE )
> +        kimage_add_entry(image, IND_ZERO);
> +
> +    return 0;
> +}
> +
> +static int kimage_load_crash_segment(struct kexec_image *image,
> +                                     xen_kexec_segment_t *segment)
> +{
> +    /*
> +     * For crash dump kernels we simply copy the data from the guest
> +     * to its destination.
> +     */
> +    paddr_t dest;
> +    unsigned long sbytes, dbytes;
> +    int ret = 0;
> +    unsigned long src_offset = 0;
> +
> +    sbytes = segment->buf_size;
> +    dbytes = segment->dest_size;
> +    dest = segment->dest_maddr;
> +
> +    while ( dbytes )
> +    {
> +        unsigned long dest_mfn;
> +        void *dest_va;
> +        size_t schunk, dchunk;
> +
> +        dest_mfn = dest >> PAGE_SHIFT;
> +
> +        dchunk = PAGE_SIZE;
> +        schunk = min(dchunk, sbytes);
> +
> +        dest_va = map_domain_page(dest_mfn);
> +        if ( !dest_va )
> +            return -EINVAL;
> +
> +        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
> +        memset(dest_va + schunk, 0, dchunk - schunk);
> +
> +        unmap_domain_page(dest_va);
> +        if ( ret )
> +            return -EFAULT;
> +
> +        dbytes -= dchunk;
> +        sbytes -= schunk;
> +        dest += dchunk;
> +        src_offset += schunk;
> +    }
> +
> +    return 0;
> +}
> +
> +static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
> +{
> +    int result = 0; /* NULL buf: nothing to copy */
> +
> +    if ( !guest_handle_is_null(segment->buf.h) )
> +    {
> +        switch ( image->type )
> +        {
> +        case KEXEC_TYPE_DEFAULT:
> +            result = kimage_load_normal_segment(image, segment);
> +            break;
> +        case KEXEC_TYPE_CRASH:
> +            result = kimage_load_crash_segment(image, segment);
> +            break;
> +        }
> +    }
> +
> +    return result;
> +}
> +
> +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
> +                 uint64_t entry_maddr,
> +                 uint32_t nr_segments, xen_kexec_segment_t *segment)
> +{
> +    int result;
> +
> +    switch ( type )
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
> +        break;
> +    default:
> +        result = -EINVAL;
> +        break;
> +    }
> +    if ( result < 0 )
> +        return result;
> +
> +    (*rimage)->arch = arch;
> +
> +    return result;
> +}
> +
> +int kimage_load_segments(struct kexec_image *image)
> +{
> +    int s;
> +    int result;
> +
> +    for ( s = 0; s < image->nr_segments; s++ ) {
> +        result = kimage_load_segment(image, &image->segments[s]);
> +        if ( result < 0 )
> +            return result;
> +    }
> +    kimage_terminate(image);
> +    return 0;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
> new file mode 100644
> index 0000000..0ebd37a
> --- /dev/null
> +++ b/xen/include/xen/kimage.h
> @@ -0,0 +1,62 @@
> +#ifndef __XEN_KIMAGE_H__
> +#define __XEN_KIMAGE_H__
> +
> +#define IND_DESTINATION  0x1
> +#define IND_INDIRECTION  0x2
> +#define IND_DONE         0x4
> +#define IND_SOURCE       0x8
> +#define IND_ZERO        0x10
> +
> +#ifndef __ASSEMBLY__
> +
> +#include <xen/list.h>
> +#include <xen/mm.h>
> +#include <public/kexec.h>
> +
> +#define KEXEC_SEGMENT_MAX 16
> +
> +typedef paddr_t kimage_entry_t;
> +
> +struct kexec_image {
> +    uint8_t type;
> +    uint16_t arch;
> +    uint64_t entry_maddr;
> +    uint32_t nr_segments;
> +    xen_kexec_segment_t *segments;
> +
> +    kimage_entry_t head;
> +    struct page_info *entry_page;
> +    unsigned next_entry;
> +
> +    struct page_info *control_code_page;
> +    struct page_info *aux_page;
> +
> +    struct page_list_head control_pages;
> +    struct page_list_head dest_pages;
> +    struct page_list_head unusable_pages;
> +
> +    /* Address of next control page to allocate for crash kernels. */
> +    paddr_t next_crash_page;
> +};
> +
> +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
> +                 uint64_t entry_maddr,
> +                 uint32_t nr_segments, xen_kexec_segment_t *segment);
> +void kimage_free(struct kexec_image *image);
> +int kimage_load_segments(struct kexec_image *image);
> +struct page_info *kimage_alloc_control_page(struct kexec_image *image,
> +                                            unsigned memflags);
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* __XEN_KIMAGE_H__ */
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
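
The entry list built by kimage_add_entry() and kimage_terminate() is consumed by the relocation code. That code is the kexec_reloc assembly mentioned in the cover letter, and it is not part of this patch. The following is a rough user-space C sketch of how such a walker could interpret the IND_* entries. It is an illustration only: it assumes 4 KiB pages and uses a small array in place of machine memory. The names sim_mem, sim_va() and sim_relocate() are hypothetical. Only the IND_* values are copied from kimage.h above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical simulation parameters: 4 KiB pages, 8 pages of "memory". */
#define SIM_PAGE_SIZE   4096u
#define SIM_PAGE_MASK   (~(uint64_t)(SIM_PAGE_SIZE - 1))
#define SIM_NR_PAGES    8

/* Entry flags, copied from kimage.h. */
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define IND_ZERO        0x10

typedef uint64_t kimage_entry_t;

/* A uint64_t backing store keeps page-aligned entry pointers aligned. */
static uint64_t sim_mem[SIM_NR_PAGES * SIM_PAGE_SIZE / sizeof(uint64_t)];

/* Translate a simulated machine address into a pointer. */
static void *sim_va(uint64_t maddr)
{
    return (unsigned char *)sim_mem + maddr;
}

/* Walk the entry list starting at 'head', copying and zeroing pages. */
static void sim_relocate(uint64_t head)
{
    kimage_entry_t *ptr = sim_va(head & SIM_PAGE_MASK);
    uint64_t dest = 0;

    for ( ;; )
    {
        kimage_entry_t entry = *ptr;

        if ( !entry || (entry & IND_DONE) )
            break;                                /* end of the list */

        if ( entry & IND_DESTINATION )
        {
            dest = entry & SIM_PAGE_MASK;         /* start a new run */
            ptr++;
        }
        else if ( entry & IND_INDIRECTION )
            ptr = sim_va(entry & SIM_PAGE_MASK);  /* next entry page */
        else if ( entry & IND_SOURCE )
        {
            /* Copy one source page to the current destination. */
            memcpy(sim_va(dest), sim_va(entry & SIM_PAGE_MASK), SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
            ptr++;
        }
        else if ( entry & IND_ZERO )
        {
            /* Zero one destination page. */
            memset(sim_va(dest), 0, SIM_PAGE_SIZE);
            dest += SIM_PAGE_SIZE;
            ptr++;
        }
        else
            break;                                /* unknown entry: stop */
    }
}
```

In this sketch, each IND_SOURCE or IND_ZERO entry fills one page at the current destination and then advances it. This matches kimage_load_normal_segment(), which emits one IND_ZERO entry per remaining destination page.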


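do_kimage_alloc() in kimage.c (PATCH 3/9) applies three destination checks to every segment, and they can be exercised on their own. The following is a stand-alone C approximation that assumes 4 KiB pages. struct sim_segment and sim_validate_segments() are hypothetical names. The sketch models only these checks from the patch:

- page-aligned destination start and end (-EADDRNOTAVAIL);
- no overlapping destination ranges (-EINVAL);
- buf_size not exceeding dest_size (-EINVAL).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE        4096u
#define SIM_PAGE_OFFSET_MASK ((uint64_t)SIM_PAGE_SIZE - 1)

/* Hypothetical stand-in for the xen_kexec_segment_t fields used here. */
struct sim_segment {
    uint64_t dest_maddr;
    uint64_t dest_size;
    uint64_t buf_size;
};

static int sim_validate_segments(const struct sim_segment *seg, size_t nr)
{
    size_t i, j;

    /* Destination start and end must both be page aligned. */
    for ( i = 0; i < nr; i++ )
    {
        uint64_t start = seg[i].dest_maddr;
        uint64_t end = start + seg[i].dest_size;

        if ( (start & SIM_PAGE_OFFSET_MASK) || (end & SIM_PAGE_OFFSET_MASK) )
            return -EADDRNOTAVAIL;
    }

    /* No two half-open destination ranges [start, end) may overlap. */
    for ( i = 0; i < nr; i++ )
        for ( j = 0; j < i; j++ )
        {
            uint64_t ms = seg[i].dest_maddr, me = ms + seg[i].dest_size;
            uint64_t ps = seg[j].dest_maddr, pe = ps + seg[j].dest_size;

            if ( (me > ps) && (ms < pe) )
                return -EINVAL;
        }

    /* The buffer must fit in the destination; the rest is zero-filled. */
    for ( i = 0; i < nr; i++ )
        if ( seg[i].buf_size > seg[i].dest_size )
            return -EINVAL;

    return 0;
}
```

Because the ranges are half-open, two segments that merely touch (one ending where the next begins) are accepted. A buffer exactly as large as its destination is also accepted.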
* Re: [Xen-devel] [PATCH 3/9] kexec: add infrastructure for handling kexec images
  2013-11-06 14:49 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
@ 2013-11-07 20:40   ` Don Slutz
  2013-11-07 23:51       ` [Xen-devel] " Don Slutz
  2013-11-07 20:40   ` Don Slutz
  2013-11-08 12:50   ` [PATCHv11 " David Vrabel
  2 siblings, 1 reply; 99+ messages in thread
From: Don Slutz @ 2013-11-07 20:40 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add the code needed to handle and load kexec images into Xen memory or
> into the crash region.  This is needed for the new KEXEC_CMD_load and
> KEXEC_CMD_unload hypercall sub-ops.
>
> Much of this code is derived from the Linux kernel.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/Makefile      |    1 +
>   xen/common/kimage.c      |  821 ++++++++++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/kimage.h |   62 ++++
>   3 files changed, 884 insertions(+), 0 deletions(-)
>   create mode 100644 xen/common/kimage.c
>   create mode 100644 xen/include/xen/kimage.h
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index 686f7a1..3683ae3 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -13,6 +13,7 @@ obj-y += irq.o
>   obj-y += kernel.o
>   obj-y += keyhandler.o
>   obj-$(HAS_KEXEC) += kexec.o
> +obj-$(HAS_KEXEC) += kimage.o
>   obj-y += lib.o
>   obj-y += memory.o
>   obj-y += multicall.o
> diff --git a/xen/common/kimage.c b/xen/common/kimage.c
> new file mode 100644
> index 0000000..02ee37e
> --- /dev/null
> +++ b/xen/common/kimage.c
> @@ -0,0 +1,821 @@
> +/*
> + * Kexec Image
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Derived from kernel/kexec.c from Linux:
> + *
> + *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include <xen/config.h>
> +#include <xen/types.h>
> +#include <xen/init.h>
> +#include <xen/kernel.h>
> +#include <xen/errno.h>
> +#include <xen/spinlock.h>
> +#include <xen/guest_access.h>
> +#include <xen/mm.h>
> +#include <xen/kexec.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/page.h>
> +
> +/*
> + * When kexec transitions to the new kernel there is a one-to-one
> + * mapping between physical and virtual addresses.  On processors
> + * where you can disable the MMU this is trivial and easy.  For
> + * others it is still a simple, predictable page table to set up.
> + *
> + * The code for the transition from the current kernel to the new
> + * kernel is placed in the page-size control_code_buffer.  This memory
> + * must be identity mapped in the transition from virtual to physical
> + * addresses.
> + *
> + * The assembly stub in the control code buffer is passed a linked list
> + * of descriptor pages detailing the source pages of the new kernel,
> + * and the destination addresses of those source pages.  As this data
> + * structure is not used in the context of the current OS, it must
> + * be self-contained.
> + *
> + * The code has been made to work with highmem pages and will use a
> + * destination page in its final resting place (if it happens
> + * to allocate it).  The end product of this is that most of the
> + * physical address space, and most of RAM can be used.
> + *
> + * Future directions include:
> + *  - allocating a page table with the control code buffer identity
> + *    mapped, to simplify machine_kexec and make kexec_on_panic more
> + *    reliable.
> + */
> +
> +/*
> + * KIMAGE_NO_DEST is an impossible destination address, used when
> + * allocating pages whose destination address we do not care about.
> + */
> +#define KIMAGE_NO_DEST (-1UL)
> +
> +/*
> + * Offset of the last entry in an indirection page.
> + */
> +#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
> +
> +
> +static int kimage_is_destination_range(struct kexec_image *image,
> +                                       paddr_t start, paddr_t end);
> +static struct page_info *kimage_alloc_page(struct kexec_image *image,
> +                                           paddr_t dest);
> +
> +static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
> +{
> +    struct page_info *page;
> +
> +    page = alloc_domheap_page(NULL, memflags);
> +    if ( !page )
> +        return NULL;
> +
> +    clear_domain_page(page_to_mfn(page));
> +
> +    return page;
> +}
> +
> +static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
> +                           unsigned long nr_segments,
> +                           xen_kexec_segment_t *segments, uint8_t type)
> +{
> +    struct kexec_image *image;
> +    unsigned long i;
> +    int result;
> +
> +    /* Allocate a controlling structure */
> +    result = -ENOMEM;
> +    image = xzalloc(typeof(*image));
> +    if ( !image )
> +        goto out;
> +
> +    image->entry_maddr = entry;
> +    image->type = type;
> +    image->nr_segments = nr_segments;
> +    image->segments = segments;
> +
> +    image->next_crash_page = kexec_crash_area.start;
> +
> +    INIT_PAGE_LIST_HEAD(&image->control_pages);
> +    INIT_PAGE_LIST_HEAD(&image->dest_pages);
> +    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
> +
> +    /*
> +     * Verify we have good destination addresses.  The caller is
> +     * responsible for making certain we don't attempt to load the new
> +     * image into invalid or reserved areas of RAM.  This just
> +     * verifies it is an address we can use.
> +     *
> +     * Since the kernel does everything in page-size chunks, ensure the
> +     * destination addresses are page aligned.  Too many special cases
> +     * crop up when we don't do this.  The most insidious is getting
> +     * overlapping destination addresses simply because addresses are
> +     * changed to page size granularity.
> +     */
> +    result = -EADDRNOTAVAIL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend   = mstart + image->segments[i].dest_size;
> +        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
> +            goto out;
> +    }
> +
> +    /*
> +     * Verify our destination addresses do not overlap.  If we allowed
> +     * overlapping destination addresses through, very weird things can
> +     * happen with no easy explanation as one segment stomps on
> +     * another.
> +     */
> +    result = -EINVAL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +        unsigned long j;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend   = mstart + image->segments[i].dest_size;
> +        for (j = 0; j < i; j++ )
> +        {
> +            paddr_t pstart, pend;
> +            pstart = image->segments[j].dest_maddr;
> +            pend   = pstart + image->segments[j].dest_size;
> +            /* Do the segments overlap? */
> +            if ( (mend > pstart) && (mstart < pend) )
> +                goto out;
> +        }
> +    }
> +
> +    /*
> +     * Ensure our buffer sizes are no larger than our memory
> +     * sizes.  This should always be the case, and it is easier to
> +     * check up front than to be surprised later on.
> +     */
> +    result = -EINVAL;
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        if ( image->segments[i].buf_size > image->segments[i].dest_size )
> +            goto out;
> +    }
> +
> +    /*
> +     * Page for the relocation code must still be accessible after the
> +     * processor has switched to 32-bit mode.
> +     */
> +    result = -ENOMEM;
> +    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
> +    if ( !image->control_code_page )
> +        goto out;
> +
> +    /* Add an empty indirection page. */
> +    image->entry_page = kimage_alloc_control_page(image, 0);
> +    if ( !image->entry_page )
> +        goto out;
> +
> +    image->head = page_to_maddr(image->entry_page);
> +
> +    result = 0;
> +out:
> +    if ( result == 0 )
> +        *rimage = image;
> +    else if ( image )
> +    {
> +        image->segments = NULL; /* caller frees segments after an error */
> +        kimage_free(image);
> +    }
> +
> +    return result;
> +
> +}
> +
> +static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
> +                               unsigned long nr_segments,
> +                               xen_kexec_segment_t *segments)
> +{
> +    return do_kimage_alloc(rimage, entry, nr_segments, segments,
> +                           KEXEC_TYPE_DEFAULT);
> +}
> +
> +static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
> +                              unsigned long nr_segments,
> +                              xen_kexec_segment_t *segments)
> +{
> +    unsigned long i;
> +    int result;
> +
> +    /* Verify we have a valid entry point */
> +    if ( (entry < kexec_crash_area.start)
> +         || (entry >= kexec_crash_area.start + kexec_crash_area.size))
> +        return -EADDRNOTAVAIL;
> +
> +    /*
> +     * Verify we have good destination addresses.  Normally
> +     * the caller is responsible for making certain we don't
> +     * attempt to load the new image into invalid or reserved
> +     * areas of RAM.  But crash kernels are preloaded into a
> +     * reserved area of RAM.  We must ensure the addresses
> +     * are in the reserved area otherwise preloading the
> +     * kernel could corrupt things.
> +     */
> +    for ( i = 0; i < nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        if ( guest_handle_is_null(segments[i].buf.h) )
> +            continue;
> +
> +        mstart = segments[i].dest_maddr;
> +        mend = mstart + segments[i].dest_size;
> +        /* Ensure we are within the crash kernel limits. */
> +        if ( (mstart < kexec_crash_area.start )
> +             || (mend > kexec_crash_area.start + kexec_crash_area.size))
> +            return -EADDRNOTAVAIL;
> +    }
> +
> +    /* Allocate and initialize a controlling structure. */
> +    return do_kimage_alloc(rimage, entry, nr_segments, segments,
> +                           KEXEC_TYPE_CRASH);
> +}
> +
> +static int kimage_is_destination_range(struct kexec_image *image,
> +                                       paddr_t start,
> +                                       paddr_t end)
> +{
> +    unsigned long i;
> +
> +    for ( i = 0; i < image->nr_segments; i++ )
> +    {
> +        paddr_t mstart, mend;
> +
> +        mstart = image->segments[i].dest_maddr;
> +        mend = mstart + image->segments[i].dest_size;
> +        if ( (end > mstart) && (start < mend) )
> +            return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void kimage_free_page_list(struct page_list_head *list)
> +{
> +    struct page_info *page, *next;
> +
> +    page_list_for_each_safe(page, next, list)
> +    {
> +        page_list_del(page, list);
> +        free_domheap_page(page);
> +    }
> +}
> +
> +static struct page_info *kimage_alloc_normal_control_page(
> +    struct kexec_image *image, unsigned memflags)
> +{
> +    /*
> +     * Control pages are special: they are the intermediaries that are
> +     * needed while we copy the rest of the pages to their final
> +     * resting place.  As such they must not conflict with either the
> +     * destination addresses or memory the kernel is already using.
> +     *
> +     * The only case where we really need more than one of these is
> +     * for architectures where we cannot disable the MMU and must
> +     * instead generate an identity mapped page table for all of the
> +     * memory.
> +     *
> +     * At worst this runs in O(N) of the image size.
> +     */
> +    struct page_list_head extra_pages;
> +    struct page_info *page = NULL;
> +
> +    INIT_PAGE_LIST_HEAD(&extra_pages);
> +
> +    /*
> +     * Loop while I can allocate a page and the page allocated is a
> +     * destination page.
> +     */
> +    do {
> +        unsigned long mfn, emfn;
> +        paddr_t addr, eaddr;
> +
> +        page = kimage_alloc_zeroed_page(memflags);
> +        if ( !page )
> +            break;
> +        mfn   = page_to_mfn(page);
> +        emfn  = mfn + 1;
> +        addr  = page_to_maddr(page);
> +        eaddr = addr + PAGE_SIZE;
> +        if ( kimage_is_destination_range(image, addr, eaddr) )
> +        {
> +            page_list_add(page, &extra_pages);
> +            page = NULL;
> +        }
> +    } while ( !page );
> +
> +    if ( page )
> +    {
> +        /* Remember the allocated page... */
> +        page_list_add(page, &image->control_pages);
> +
> +        /*
> +         * Because the page is already in its destination location we
> +         * will never allocate another page at that address.
> +         * Therefore kimage_alloc_page will not return it (again) and
> +         * we don't need to give it an entry in image->segments[].
> +         */
> +    }
> +    /*
> +     * Deal with the destination pages I have inadvertently allocated.
> +     *
> +     * Ideally I would convert multi-page allocations into single page
> +     * allocations, and add everything to image->dest_pages.
> +     *
> +     * For now it is simpler to just free the pages.
> +     */
> +    kimage_free_page_list(&extra_pages);
> +
> +    return page;
> +}
> +
> +static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
> +{
> +    /*
> +     * Control pages are special: they are the intermediaries that are
> +     * needed while we copy the rest of the pages to their final
> +     * resting place.  As such they must not conflict with either the
> +     * destination addresses or memory the kernel is already using.
> +     *
> +     * Control pages are also the only pages we must allocate when
> +     * loading a crash kernel.  All of the other pages are specified
> +     * by the segments and we just memcpy into them directly.
> +     *
> +     * The only case where we really need more than one of these is
> +     * for architectures where we cannot disable the MMU and must
> +     * instead generate an identity mapped page table for all of the
> +     * memory.
> +     *
> +     * Given the low demand this implements a very simple allocator
> +     * that finds the first hole of the appropriate size in the
> +     * reserved memory region, and allocates all of the memory up to
> +     * and including the hole.
> +     */
> +    paddr_t hole_start, hole_end;
> +    struct page_info *page = NULL;
> +
> +    hole_start = PAGE_ALIGN(image->next_crash_page);
> +    hole_end   = hole_start + PAGE_SIZE;
> +    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
> +    {
> +        unsigned long i;
> +
> +        /* See if I overlap any of the segments. */
> +        for ( i = 0; i < image->nr_segments; i++ )
> +        {
> +            paddr_t mstart, mend;
> +
> +            mstart = image->segments[i].dest_maddr;
> +            mend   = mstart + image->segments[i].dest_size;
> +            if ( (hole_end > mstart) && (hole_start < mend) )
> +            {
> +                /* Advance the hole to the end of the segment. */
> +                hole_start = PAGE_ALIGN(mend);
> +                hole_end   = hole_start + PAGE_SIZE;
> +                break;
> +            }
> +        }
> +        /* If I don't overlap any segments I have found my hole! */
> +        if ( i == image->nr_segments )
> +        {
> +            page = maddr_to_page(hole_start);
> +            break;
> +        }
> +    }
> +    if ( page )
> +    {
> +        image->next_crash_page = hole_end;
> +        clear_domain_page(page_to_mfn(page));
> +    }
> +
> +    return page;
> +}
> +
> +
> +struct page_info *kimage_alloc_control_page(struct kexec_image *image,
> +                                            unsigned memflags)
> +{
> +    struct page_info *pages = NULL;
> +
> +    switch ( image->type )
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        pages = kimage_alloc_normal_control_page(image, memflags);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        pages = kimage_alloc_crash_control_page(image);
> +        break;
> +    }
> +    return pages;
> +}
> +
> +static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
> +{
> +    kimage_entry_t *entries;
> +
> +    if ( image->next_entry == KIMAGE_LAST_ENTRY )
> +    {
> +        struct page_info *page;
> +
> +        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
> +        if ( !page )
> +            return -ENOMEM;
> +
> +        entries = __map_domain_page(image->entry_page);
> +        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
> +        unmap_domain_page(entries);
> +
> +        image->entry_page = page;
> +        image->next_entry = 0;
> +    }
> +
> +    entries = __map_domain_page(image->entry_page);
> +    entries[image->next_entry] = entry;
> +    image->next_entry++;
> +    unmap_domain_page(entries);
> +
> +    return 0;
> +}
> +
> +static int kimage_set_destination(struct kexec_image *image,
> +                                  paddr_t destination)
> +{
> +    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
> +}
> +
> +
> +static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
> +{
> +    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
> +}
> +
> +
> +static void kimage_free_extra_pages(struct kexec_image *image)
> +{
> +    kimage_free_page_list(&image->dest_pages);
> +    kimage_free_page_list(&image->unusable_pages);
> +}
> +
> +static void kimage_terminate(struct kexec_image *image)
> +{
> +    kimage_entry_t *entries;
> +
> +    entries = __map_domain_page(image->entry_page);
> +    entries[image->next_entry] = IND_DONE;
> +    unmap_domain_page(entries);
> +}
> +
> +/*
> + * Iterate over all the entries in the indirection pages.
> + *
> + * Call unmap_domain_page(ptr) after the loop exits.
> + */
> +#define for_each_kimage_entry(image, ptr, entry)                        \
> +    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
> +          (entry = *ptr) && !(entry & IND_DONE);                        \
> +          ptr = (entry & IND_INDIRECTION) ?                             \
> +              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
> +              : ptr + 1 )
> +
> +static void kimage_free_entry(kimage_entry_t entry)
> +{
> +    struct page_info *page;
> +
> +    page = mfn_to_page(entry >> PAGE_SHIFT);
> +    free_domheap_page(page);
> +}
> +
> +static void kimage_free_all_entries(struct kexec_image *image)
> +{
> +    kimage_entry_t *ptr, entry;
> +    kimage_entry_t ind = 0;
> +
> +    if ( !image->head )
> +        return;
> +
> +    for_each_kimage_entry(image, ptr, entry)
> +    {
> +        if ( entry & IND_INDIRECTION )
> +        {
> +            /* Free the previous indirection page */
> +            if ( ind & IND_INDIRECTION )
> +                kimage_free_entry(ind);
> +            /* Save this indirection page until we are done with it. */
> +            ind = entry;
> +        }
> +        else if ( entry & IND_SOURCE )
> +            kimage_free_entry(entry);
> +    }
> +    unmap_domain_page(ptr);
> +
> +    /* Free the final indirection page. */
> +    if ( ind & IND_INDIRECTION )
> +        kimage_free_entry(ind);
> +}
> +
> +void kimage_free(struct kexec_image *image)
> +{
> +    if ( !image )
> +        return;
> +
> +    kimage_free_extra_pages(image);
> +    kimage_free_all_entries(image);
> +    kimage_free_page_list(&image->control_pages);
> +    xfree(image->segments);
> +    xfree(image);
> +}
> +
> +static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
> +                                       paddr_t maddr)
> +{
> +    kimage_entry_t *ptr, entry;
> +    unsigned long destination = 0;
> +
> +    for_each_kimage_entry(image, ptr, entry)
> +    {
> +        if ( entry & IND_DESTINATION )
> +            destination = entry & PAGE_MASK;
> +        else if ( entry & IND_SOURCE )
> +        {
> +            if ( maddr == destination )
> +                return ptr;
> +            destination += PAGE_SIZE;
> +        }
> +    }
> +    unmap_domain_page(ptr);
> +
> +    return NULL;
> +}
> +
> +static struct page_info *kimage_alloc_page(struct kexec_image *image,
> +                                           paddr_t destination)
> +{
> +    /*
> +     * Here we implement safeguards to ensure that a source page is
> +     * not copied to its destination page before the data on the
> +     * destination page is no longer useful.
> +     *
> +     * To do this we maintain the invariant that a source page is
> +     * either its own destination page, or it is not a destination
> +     * page at all.
> +     *
> +     * That is slightly stronger than required, but the proof that no
> +     * problems will occur is trivial, and the implementation is
> +     * simple to verify.
> +     *
> +     * When allocating all pages normally this algorithm will run in
> +     * O(N) time, but in the worst case it will run in O(N^2) time.
> +     * If the runtime is a problem the data structures can be fixed.
> +     */
> +    struct page_info *page;
> +    paddr_t addr;
> +
> +    /*
> +     * Walk through the list of destination pages, and see if I have a
> +     * match.
> +     */
> +    page_list_for_each(page, &image->dest_pages)
> +    {
> +        addr = page_to_maddr(page);
> +        if ( addr == destination )
> +        {
> +            page_list_del(page, &image->dest_pages);
> +            return page;
> +        }
> +    }
> +    page = NULL;
> +    for (;;)
> +    {
> +        kimage_entry_t *old;
> +
> +        /* Allocate a page; if we run out of memory, give up. */
> +        page = kimage_alloc_zeroed_page(0);
> +        if ( !page )
> +            return NULL;
> +        addr = page_to_maddr(page);
> +
> +        /* If it is the destination page we want, use it. */
> +        if ( addr == destination )
> +            break;
> +
> +        /* If the page is not a destination page, use it. */
> +        if ( !kimage_is_destination_range(image, addr,
> +                                          addr + PAGE_SIZE) )
> +            break;
> +
> +        /*
> +         * I know that the page is someone's destination page.  See if
> +         * there is already a source page for this destination page.
> +         * And if so swap the source pages.
> +         */
> +        old = kimage_dst_used(image, addr);
> +        if ( old )
> +        {
> +            /* If so move it. */
> +            unsigned long old_mfn = *old >> PAGE_SHIFT;
> +            unsigned long mfn = addr >> PAGE_SHIFT;
> +
> +            copy_domain_page(mfn, old_mfn);
> +            clear_domain_page(old_mfn);
> +            *old = (addr & PAGE_MASK) | IND_SOURCE;
> +            unmap_domain_page(old);
> +
> +            page = mfn_to_page(old_mfn);
> +            break;
> +        }
> +        else
> +        {
> +            /*
> +             * Place the page on the destination list; I will use it
> +             * later.
> +             */
> +            page_list_add(page, &image->dest_pages);
> +        }
> +    }
> +    return page;
> +}
> +
> +static int kimage_load_normal_segment(struct kexec_image *image,
> +                                      xen_kexec_segment_t *segment)
> +{
> +    unsigned long to_copy;
> +    unsigned long src_offset;
> +    paddr_t dest, end;
> +    int ret;
> +
> +    to_copy = segment->buf_size;
> +    src_offset = 0;
> +    dest = segment->dest_maddr;
> +
> +    ret = kimage_set_destination(image, dest);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    while ( to_copy )
> +    {
> +        unsigned long dest_mfn;
> +        struct page_info *page;
> +        void *dest_va;
> +        size_t size;
> +
> +        dest_mfn = dest >> PAGE_SHIFT;
> +
> +        size = min_t(unsigned long, PAGE_SIZE, to_copy);
> +
> +        page = kimage_alloc_page(image, dest);
> +        if ( !page )
> +            return -ENOMEM;
> +        ret = kimage_add_page(image, page_to_maddr(page));
> +        if ( ret < 0 )
> +            return ret;
> +
> +        dest_va = __map_domain_page(page);
> +        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
> +        unmap_domain_page(dest_va);
> +        if ( ret )
> +            return -EFAULT;
> +
> +        to_copy -= size;
> +        src_offset += size;
> +        dest += PAGE_SIZE;
> +    }
> +
> +    /* Remainder of the destination should be zeroed. */
> +    end = segment->dest_maddr + segment->dest_size;
> +    for ( ; dest < end; dest += PAGE_SIZE )
> +        kimage_add_entry(image, IND_ZERO);
> +
> +    return 0;
> +}
> +
> +static int kimage_load_crash_segment(struct kexec_image *image,
> +                                     xen_kexec_segment_t *segment)
> +{
> +    /*
> +     * For crash dump kernels we simply copy the data from the guest
> +     * to its destination.
> +     */
> +    paddr_t dest;
> +    unsigned long sbytes, dbytes;
> +    int ret = 0;
> +    unsigned long src_offset = 0;
> +
> +    sbytes = segment->buf_size;
> +    dbytes = segment->dest_size;
> +    dest = segment->dest_maddr;
> +
> +    while ( dbytes )
> +    {
> +        unsigned long dest_mfn;
> +        void *dest_va;
> +        size_t schunk, dchunk;
> +
> +        dest_mfn = dest >> PAGE_SHIFT;
> +
> +        dchunk = PAGE_SIZE;
> +        schunk = min(dchunk, sbytes);
> +
> +        dest_va = map_domain_page(dest_mfn);
> +        if ( !dest_va )
> +            return -EINVAL;
> +
> +        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
> +        memset(dest_va + schunk, 0, dchunk - schunk);
> +
> +        unmap_domain_page(dest_va);
> +        if ( ret )
> +            return -EFAULT;
> +
> +        dbytes -= dchunk;
> +        sbytes -= schunk;
> +        dest += dchunk;
> +        src_offset += schunk;
> +    }
> +
> +    return 0;
> +}
> +
> +static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
> +{
> +    int result = -ENOMEM;
> +
> +    if ( !guest_handle_is_null(segment->buf.h) )
> +    {
> +        switch ( image->type )
> +        {
> +        case KEXEC_TYPE_DEFAULT:
> +            result = kimage_load_normal_segment(image, segment);
> +            break;
> +        case KEXEC_TYPE_CRASH:
> +            result = kimage_load_crash_segment(image, segment);
> +            break;
> +        }
> +    }
> +
> +    return result;
> +}
> +
> +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
> +                 uint64_t entry_maddr,
> +                 uint32_t nr_segments, xen_kexec_segment_t *segment)
> +{
> +    int result;
> +
> +    switch ( type )
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
> +        break;
> +    default:
> +        result = -EINVAL;
> +        break;
> +    }
> +    if ( result < 0 )
> +        return result;
> +
> +    (*rimage)->arch = arch;
> +
> +    return result;
> +}
> +
> +int kimage_load_segments(struct kexec_image *image)
> +{
> +    int s;
> +    int result;
> +
> +    for ( s = 0; s < image->nr_segments; s++ ) {
> +        result = kimage_load_segment(image, &image->segments[s]);
> +        if ( result < 0 )
> +            return result;
> +    }
> +    kimage_terminate(image);
> +    return 0;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
> new file mode 100644
> index 0000000..0ebd37a
> --- /dev/null
> +++ b/xen/include/xen/kimage.h
> @@ -0,0 +1,62 @@
> +#ifndef __XEN_KIMAGE_H__
> +#define __XEN_KIMAGE_H__
> +
> +#define IND_DESTINATION  0x1
> +#define IND_INDIRECTION  0x2
> +#define IND_DONE         0x4
> +#define IND_SOURCE       0x8
> +#define IND_ZERO        0x10
> +
> +#ifndef __ASSEMBLY__
> +
> +#include <xen/list.h>
> +#include <xen/mm.h>
> +#include <public/kexec.h>
> +
> +#define KEXEC_SEGMENT_MAX 16
> +
> +typedef paddr_t kimage_entry_t;
> +
> +struct kexec_image {
> +    uint8_t type;
> +    uint16_t arch;
> +    uint64_t entry_maddr;
> +    uint32_t nr_segments;
> +    xen_kexec_segment_t *segments;
> +
> +    kimage_entry_t head;
> +    struct page_info *entry_page;
> +    unsigned next_entry;
> +
> +    struct page_info *control_code_page;
> +    struct page_info *aux_page;
> +
> +    struct page_list_head control_pages;
> +    struct page_list_head dest_pages;
> +    struct page_list_head unusable_pages;
> +
> +    /* Address of next control page to allocate for crash kernels. */
> +    paddr_t next_crash_page;
> +};
> +
> +int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
> +                 uint64_t entry_maddr,
> +                 uint32_t nr_segments, xen_kexec_segment_t *segment);
> +void kimage_free(struct kexec_image *image);
> +int kimage_load_segments(struct kexec_image *image);
> +struct page_info *kimage_alloc_control_page(struct kexec_image *image,
> +                                            unsigned memflags);
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* __XEN_KIMAGE_H__ */
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
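
The IND_* values in kimage.h tag page-aligned machine addresses in the image's entry list (kimage_entry_t is a paddr_t, so the low bits are free for flags). The following is a minimal userspace sketch of how such a list could be consumed. It assumes Linux-style kexec semantics: IND_DESTINATION sets the current destination, IND_SOURCE copies one page there and advances it, IND_ZERO zeroes one page and advances it (consistent with kimage_load_normal_segment() adding one IND_ZERO entry per remaining page), IND_INDIRECTION continues in another page of entries, and IND_DONE ends the list. In Xen the walk is done by the kexec_reloc code, which is not part of this patch. The simulated mem array, PAGE_SZ, IND_FLAGS, page_ok() and walk_entries() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from xen/include/xen/kimage.h. */
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define IND_ZERO        0x10
#define IND_FLAGS       0x1f   /* all flag bits (hypothetical helper) */

/* Hypothetical simulated machine memory made of tiny pages. */
#define PAGE_SZ  64u
#define NR_PAGES 8u

static unsigned char mem[PAGE_SZ * NR_PAGES];

/* True if addr is page aligned and the whole page lies inside mem. */
static int page_ok(uint64_t addr)
{
    return addr % PAGE_SZ == 0 && addr <= sizeof(mem) - PAGE_SZ;
}

/*
 * Process the entry list whose first page is at 'head'.  Returns the
 * number of destination pages written, or -1 for a malformed list.
 */
static int walk_entries(uint64_t head)
{
    uint64_t list = head, dest = 0;
    unsigned int i = 0;
    int written = 0;

    assert(page_ok(head));

    for ( ;; )
    {
        uint64_t e, addr;

        if ( i == PAGE_SZ / sizeof(e) )
            return -1;                  /* no IND_DONE in this page */
        memcpy(&e, &mem[list + i * sizeof(e)], sizeof(e));
        i++;
        addr = e & ~(uint64_t)IND_FLAGS;

        switch ( e & IND_FLAGS )
        {
        case IND_DESTINATION:           /* where the next page goes */
            dest = addr;
            break;
        case IND_INDIRECTION:           /* continue in another list page */
            if ( !page_ok(addr) )
                return -1;
            list = addr;
            i = 0;
            break;
        case IND_SOURCE:                /* copy one page, advance dest */
            if ( !page_ok(dest) || !page_ok(addr) )
                return -1;
            memmove(&mem[dest], &mem[addr], PAGE_SZ);
            dest += PAGE_SZ;
            written++;
            break;
        case IND_ZERO:                  /* zero one page, advance dest */
            if ( !page_ok(dest) )
                return -1;
            memset(&mem[dest], 0, PAGE_SZ);
            dest += PAGE_SZ;
            written++;
            break;
        case IND_DONE:
            return written;
        default:                        /* combined or unknown flags */
            return -1;
        }
    }
}
```

Rejecting entries that combine several flags is a simplification made by this sketch, not necessarily by the relocation code.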


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
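
kimage_load_crash_segment() above writes straight into the crash region one page at a time: each destination page receives min(PAGE_SIZE, remaining source bytes) of data and the rest of that page is zero-filled, so a short buf_size is padded out to dest_size. The loop only terminates if dest_size is a whole number of pages, because otherwise dbytes -= dchunk wraps around. The userspace model below asserts that precondition (and that the source is no longer than the destination) instead of assuming a caller checked it. memcpy() and memset() stand in for map_domain_page() and copy_from_guest_offset(); load_crash_chunks() and the 16-byte page size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16u   /* hypothetical tiny page size for the model */

/*
 * Model of the kimage_load_crash_segment() loop: fill dest_len bytes of
 * 'dest' one page at a time, copying up to a page of source data into
 * each page and zero-filling whatever the source does not cover.
 */
static void load_crash_chunks(unsigned char *dest, size_t dest_len,
                              const unsigned char *src, size_t src_len)
{
    size_t sbytes = src_len, dbytes = dest_len, src_off = 0;

    /* Whole pages only, or "dbytes -= dchunk" would wrap around. */
    assert(dest_len % PAGE_SZ == 0 && src_len <= dest_len);

    while ( dbytes )
    {
        size_t dchunk = PAGE_SZ;
        size_t schunk = sbytes < dchunk ? sbytes : dchunk;

        if ( schunk )
            memcpy(dest, src + src_off, schunk);
        memset(dest + schunk, 0, dchunk - schunk);

        dbytes -= dchunk;
        sbytes -= schunk;
        dest += dchunk;
        src_off += schunk;
    }
}
```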

* Re: [PATCH 5/9] xen: kexec crash image when dom0 crashes
From: Don Slutz @ 2013-11-07 20:44 UTC
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/kexec.c    |    2 ++
>   xen/common/shutdown.c |    3 +++
>   2 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index c5450ba..9999bab 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -305,6 +305,8 @@ void kexec_crash(void)
>       if ( !test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags) )
>           return;
>   
> +    printk("Executing crash image\n");
> +
>       kexecing = TRUE;
>   
>       kexec_common_shutdown();
> diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c
> index 20f04b0..9bccd34 100644
> --- a/xen/common/shutdown.c
> +++ b/xen/common/shutdown.c
> @@ -47,6 +47,9 @@ void dom0_shutdown(u8 reason)
>       {
>           debugger_trap_immediate();
>           printk("Domain 0 crashed: ");
> +#ifdef CONFIG_KEXEC
> +        kexec_crash();
> +#endif
>           maybe_reboot();
>           break; /* not reached */
>       }
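
With this patch the dom0 crash path tries the crash image before rebooting: kexec_crash() returns early when no crash image is loaded (the test_bit() on kexec_flags), and dom0_shutdown() then falls through to maybe_reboot() as before. A toy model of that ordering follows; crash_image_loaded, the model_* functions and the enum outcomes are hypothetical stand-ins, and the real kexec_crash() does not return once it goes on to execute the image.

```c
#include <assert.h>
#include <stdbool.h>

enum crash_outcome { RAN_CRASH_IMAGE, FELL_THROUGH_TO_REBOOT };

/* Stand-in for test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags). */
static bool crash_image_loaded;

/* Model of kexec_crash(): returns false when there is nothing to run. */
static bool model_kexec_crash(void)
{
    if ( !crash_image_loaded )
        return false;
    /* Xen prints "Executing crash image" here and does not return. */
    return true;
}

/* Model of the "Domain 0 crashed" case of dom0_shutdown(). */
static enum crash_outcome model_dom0_crashed(void)
{
    if ( model_kexec_crash() )
        return RAN_CRASH_IMAGE;
    return FELL_THROUGH_TO_REBOOT;   /* maybe_reboot(), as before */
}
```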

* Re: [PATCH 6/9] libxc: add hypercall buffer arrays
From: Don Slutz @ 2013-11-07 20:46 UTC
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Hypercall buffer arrays are used when a hypercall takes a variable
> length array of buffers.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   tools/libxc/xc_hcall_buf.c |   73 ++++++++++++++++++++++++++++++++++++++++++++
>   tools/libxc/xenctrl.h      |   27 ++++++++++++++++
>   2 files changed, 100 insertions(+), 0 deletions(-)
>
> diff --git a/tools/libxc/xc_hcall_buf.c b/tools/libxc/xc_hcall_buf.c
> index c354677..e762a93 100644
> --- a/tools/libxc/xc_hcall_buf.c
> +++ b/tools/libxc/xc_hcall_buf.c
> @@ -228,6 +228,79 @@ void xc__hypercall_bounce_post(xc_interface *xch, xc_hypercall_buffer_t *b)
>       xc__hypercall_buffer_free(xch, b);
>   }
>   
> +struct xc_hypercall_buffer_array {
> +    unsigned max_bufs;
> +    xc_hypercall_buffer_t *bufs;
> +};
> +
> +xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch,
> +                                                              unsigned n)
> +{
> +    xc_hypercall_buffer_array_t *array;
> +    xc_hypercall_buffer_t *bufs = NULL;
> +
> +    array = malloc(sizeof(*array));
> +    if ( array == NULL )
> +        goto error;
> +
> +    bufs = calloc(n, sizeof(*bufs));
> +    if ( bufs == NULL )
> +        goto error;
> +
> +    array->max_bufs = n;
> +    array->bufs     = bufs;
> +
> +    return array;
> +
> +error:
> +    free(bufs);
> +    free(array);
> +    return NULL;
> +}
> +
> +void *xc__hypercall_buffer_array_alloc(xc_interface *xch,
> +                                       xc_hypercall_buffer_array_t *array,
> +                                       unsigned index,
> +                                       xc_hypercall_buffer_t *hbuf,
> +                                       size_t size)
> +{
> +    void *buf;
> +
> +    if ( index >= array->max_bufs || array->bufs[index].hbuf )
> +        abort();
> +
> +    buf = xc__hypercall_buffer_alloc(xch, hbuf, size);
> +    if ( buf )
> +        array->bufs[index] = *hbuf;
> +    return buf;
> +}
> +
> +void *xc__hypercall_buffer_array_get(xc_interface *xch,
> +                                     xc_hypercall_buffer_array_t *array,
> +                                     unsigned index,
> +                                     xc_hypercall_buffer_t *hbuf)
> +{
> +    if ( index >= array->max_bufs || array->bufs[index].hbuf == NULL )
> +        abort();
> +
> +    *hbuf = array->bufs[index];
> +    return array->bufs[index].hbuf;
> +}
> +
> +void xc_hypercall_buffer_array_destroy(xc_interface *xc,
> +                                       xc_hypercall_buffer_array_t *array)
> +{
> +    unsigned i;
> +
> +    if ( array == NULL )
> +        return;
> +
> +    for ( i = 0; i < array->max_bufs; i++ )
> +        xc__hypercall_buffer_free(xc, &array->bufs[i]);
> +    free(array->bufs);
> +    free(array);
> +}
> +
>   /*
>    * Local variables:
>    * mode: C
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index 8cf3f3b..a7e8c31 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -321,6 +321,33 @@ void xc__hypercall_buffer_free_pages(xc_interface *xch, xc_hypercall_buffer_t *b
>   #define xc_hypercall_buffer_free_pages(_xch, _name, _nr) xc__hypercall_buffer_free_pages(_xch, HYPERCALL_BUFFER(_name), _nr)
>   
>   /*
> + * Array of hypercall buffers.
> + *
> + * Create an array with xc_hypercall_buffer_array_create() and
> + * populate it by declaring one hypercall buffer in a loop and
> + * allocating the buffer with xc_hypercall_buffer_array_alloc().
> + *
> + * To access a previously allocated buffer, declare a new hypercall
> + * buffer and call xc_hypercall_buffer_array_get().
> + *
> + * Destroy the array with xc_hypercall_buffer_array_destroy() to free
> + * the array and all its allocated hypercall buffers.
> + */
> +struct xc_hypercall_buffer_array;
> +typedef struct xc_hypercall_buffer_array xc_hypercall_buffer_array_t;
> +
> +xc_hypercall_buffer_array_t *xc_hypercall_buffer_array_create(xc_interface *xch, unsigned n);
> +void *xc__hypercall_buffer_array_alloc(xc_interface *xch, xc_hypercall_buffer_array_t *array,
> +                                       unsigned index, xc_hypercall_buffer_t *hbuf, size_t size);
> +#define xc_hypercall_buffer_array_alloc(_xch, _array, _index, _name, _size) \
> +    xc__hypercall_buffer_array_alloc(_xch, _array, _index, HYPERCALL_BUFFER(_name), _size)
> +void *xc__hypercall_buffer_array_get(xc_interface *xch, xc_hypercall_buffer_array_t *array,
> +                                     unsigned index, xc_hypercall_buffer_t *hbuf);
> +#define xc_hypercall_buffer_array_get(_xch, _array, _index, _name, _size) \
> +    xc__hypercall_buffer_array_get(_xch, _array, _index, HYPERCALL_BUFFER(_name))
> +void xc_hypercall_buffer_array_destroy(xc_interface *xc, xc_hypercall_buffer_array_t *array);
> +
> +/*
>    * CPUMAP handling
>    */
>   typedef uint8_t *xc_cpumap_t;
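
The xenctrl.h comment above describes a lifecycle: create the array, allocate each slot once (typically in a loop), fetch a slot again later by index, then destroy the array and every buffer in it with one call. The sketch below models only that bookkeeping, using malloc() in place of hypercall-safe memory and a simplified struct model_buf in place of xc_hypercall_buffer_t; all model_* names are hypothetical and nothing here calls libxc. As in the patch, misuse (an out-of-range index, allocating a slot twice, or fetching an empty slot) calls abort().

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for xc_hypercall_buffer_t. */
struct model_buf {
    void *hbuf;
    size_t sz;
};

struct model_buf_array {
    unsigned max_bufs;
    struct model_buf *bufs;   /* calloc()ed, so every slot starts empty */
};

static struct model_buf_array *model_array_create(unsigned n)
{
    struct model_buf_array *array = malloc(sizeof(*array));

    if ( array == NULL )
        return NULL;
    array->bufs = calloc(n, sizeof(*array->bufs));
    if ( array->bufs == NULL )
    {
        free(array);
        return NULL;
    }
    array->max_bufs = n;
    return array;
}

/* Allocate slot 'index'; aborts on a bad index or an already-used slot. */
static void *model_array_alloc(struct model_buf_array *array,
                               unsigned index, size_t size)
{
    void *p;

    if ( index >= array->max_bufs || array->bufs[index].hbuf )
        abort();
    p = malloc(size);
    if ( p )
    {
        array->bufs[index].hbuf = p;
        array->bufs[index].sz = size;
    }
    return p;
}

/* Return a previously allocated slot; aborts if it is empty. */
static void *model_array_get(struct model_buf_array *array, unsigned index)
{
    if ( index >= array->max_bufs || array->bufs[index].hbuf == NULL )
        abort();
    return array->bufs[index].hbuf;
}

/* Free every slot; empty slots are NULL, which free() accepts. */
static void model_array_destroy(struct model_buf_array *array)
{
    unsigned i;

    if ( array == NULL )
        return;
    for ( i = 0; i < array->max_bufs; i++ )
        free(array->bufs[i].hbuf);
    free(array->bufs);
    free(array);
}
```

In the model, slots that were never allocated are still NULL, so destroying a partly filled array is safe; xc_hypercall_buffer_array_destroy() in the patch likewise passes every slot, allocated or not, to xc__hypercall_buffer_free().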

* Re: [PATCH 7/9] libxc: add API for kexec hypercall
From: Don Slutz @ 2013-11-07 20:48 UTC
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add xc_kexec_exec(), xc_kexec_get_range(), xc_kexec_load(), and
> xc_kexec_unload().  The load and unload calls require the v2 load and
> unload ops.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   tools/libxc/Makefile   |    1 +
>   tools/libxc/xc_kexec.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
>   tools/libxc/xenctrl.h  |   55 +++++++++++++++++++
>   3 files changed, 196 insertions(+), 0 deletions(-)
>   create mode 100644 tools/libxc/xc_kexec.c
>
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> index 4c64c15..f2d6e56 100644
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -31,6 +31,7 @@ CTRL_SRCS-y       += xc_mem_access.c
>   CTRL_SRCS-y       += xc_memshr.c
>   CTRL_SRCS-y       += xc_hcall_buf.c
>   CTRL_SRCS-y       += xc_foreign_memory.c
> +CTRL_SRCS-y       += xc_kexec.c
>   CTRL_SRCS-y       += xtl_core.c
>   CTRL_SRCS-y       += xtl_logger_stdio.c
>   CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
> diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c
> new file mode 100644
> index 0000000..a49cffb
> --- /dev/null
> +++ b/tools/libxc/xc_kexec.c
> @@ -0,0 +1,140 @@
> +/******************************************************************************
> + * xc_kexec.c
> + *
> + * API for loading and executing kexec images.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation;
> + * version 2.1 of the License.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + */
> +#include "xc_private.h"
> +
> +int xc_kexec_exec(xc_interface *xch, int type)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec);
> +    int ret = -1;
> +
> +    exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec));
> +    if ( exec == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec_exec hypercall");
> +        goto out;
> +    }
> +
> +    exec->type = type;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, exec);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
> +                       uint64_t *size, uint64_t *start)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range);
> +    int ret = -1;
> +
> +    get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range));
> +    if ( get_range == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec_get_range hypercall");
> +        goto out;
> +    }
> +
> +    get_range->range = range;
> +    get_range->nr = nr;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_get_range;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +    *size = get_range->size;
> +    *start = get_range->start;
> +
> +out:
> +    xc_hypercall_buffer_free(xch, get_range);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
> +                  uint64_t entry_maddr,
> +                  uint32_t nr_segments, xen_kexec_segment_t *segments)
> +{
> +    int ret = -1;
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments,
> +                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load);
> +
> +    if ( xc_hypercall_bounce_pre(xch, segments) )
> +    {
> +        PERROR("Could not allocate bounce buffer for kexec load hypercall");
> +        goto out;
> +    }
> +    load = xc_hypercall_buffer_alloc(xch, load, sizeof(*load));
> +    if ( load == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec load hypercall");
> +        goto out;
> +    }
> +
> +    load->type = type;
> +    load->arch = arch;
> +    load->entry_maddr = entry_maddr;
> +    load->nr_segments = nr_segments;
> +    set_xen_guest_handle(load->segments.h, segments);
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_load;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, load);
> +    xc_hypercall_bounce_post(xch, segments);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_unload(xc_interface *xch, int type)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload);
> +    int ret = -1;
> +
> +    unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload));
> +    if ( unload == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec unload hypercall");
> +        goto out;
> +    }
> +
> +    unload->type = type;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_unload;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, unload);
> +
> +    return ret;
> +}
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index a7e8c31..4ac6b8a 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -46,6 +46,7 @@
>   #include <xen/hvm/params.h>
>   #include <xen/xsm/flask_op.h>
>   #include <xen/tmem.h>
> +#include <xen/kexec.h>
>   
>   #include "xentoollog.h"
>   
> @@ -2340,4 +2341,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf,
>   				   unsigned long compbuf_size,
>   				   unsigned long *compbuf_pos, char *dest);
>   
> +/*
> + * Execute an image previously loaded with xc_kexec_load().
> + *
> + * Does not return on success.
> + *
> + * Fails with:
> + *   ENOENT if the specified image has not been loaded.
> + */
> +int xc_kexec_exec(xc_interface *xch, int type);
> +
> +/*
> + * Find the machine address and size of certain memory areas.
> + *
> + *   KEXEC_RANGE_MA_CRASH       crash area
> + *   KEXEC_RANGE_MA_XEN         Xen itself
> + *   KEXEC_RANGE_MA_CPU         CPU note for CPU number 'nr'
> + *   KEXEC_RANGE_MA_XENHEAP     xenheap
> + *   KEXEC_RANGE_MA_EFI_MEMMAP  EFI Memory Map
> + *   KEXEC_RANGE_MA_VMCOREINFO  vmcoreinfo
> + *
> + * Fails with:
> + *   EINVAL if the range or CPU number isn't valid.
> + */
> +int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
> +                       uint64_t *size, uint64_t *start);
> +
> +/*
> + * Load a kexec image into memory.
> + *
> + * The image may be of type KEXEC_TYPE_DEFAULT (executed on request)
> + * or KEXEC_TYPE_CRASH (executed on a crash).
> + *
> + * The image architecture may be a 32-bit variant of the hypervisor
> + * architecture (e.g., EM_386 on an x86-64 hypervisor).
> + *
> + * Fails with:
> + *   ENOMEM if there is insufficient memory for the new image.
> + *   EINVAL if the image does not fit into the crash area or the entry
> + *          point isn't within one of the segments.
> + *   EBUSY  if another image is being executed.
> + */
> +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
> +                  uint64_t entry_maddr,
> +                  uint32_t nr_segments, xen_kexec_segment_t *segments);
> +
> +/*
> + * Unload a kexec image.
> + *
> + * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from
> + * being executed.  The crash images are not cleared from the crash
> + * region.
> + */
> +int xc_kexec_unload(xc_interface *xch, int type);
> +
>   #endif /* XENCTRL_H */
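
The xc_kexec_load() comment lists two EINVAL cases: the image does not fit into the crash area, or the entry point is not inside any segment. A caller could pre-check both before issuing the hypercall. The sketch below is hypothetical: struct seg mirrors only the dest_maddr and dest_size fields of xen_kexec_segment_t, check_segments() and within() are illustrative names, and the hypervisor's own checks (not shown in this patch) may differ in detail, for example at range boundaries.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the destination fields of xen_kexec_segment_t. */
struct seg {
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/* True if [start, start + size) lies inside [lo, lo + len), overflow-safe. */
static int within(uint64_t start, uint64_t size, uint64_t lo, uint64_t len)
{
    return start >= lo && size <= len && start - lo <= len - size;
}

/*
 * Return 0 if the entry point is inside some segment and, for a crash
 * image, every segment fits in the crash area; otherwise -EINVAL.
 */
static int check_segments(const struct seg *segs, uint32_t nr,
                          uint64_t entry_maddr, int is_crash,
                          uint64_t crash_start, uint64_t crash_size)
{
    int entry_found = 0;
    uint32_t i;

    for ( i = 0; i < nr; i++ )
    {
        const struct seg *s = &segs[i];

        if ( is_crash && !within(s->dest_maddr, s->dest_size,
                                 crash_start, crash_size) )
            return -EINVAL;
        if ( entry_maddr >= s->dest_maddr &&
             entry_maddr - s->dest_maddr < s->dest_size )
            entry_found = 1;
    }
    return entry_found ? 0 : -EINVAL;
}
```

In a real caller the crash area would presumably come from xc_kexec_get_range() with KEXEC_RANGE_MA_CRASH (and nr = 0, since nr only matters for KEXEC_RANGE_MA_CPU); note that this function takes the size pointer before the start pointer.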

* Re: [PATCH 7/9] libxc: add API for kexec hypercall
From: Don Slutz @ 2013-11-07 20:48 UTC
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Add xc_kexec_exec(), xc_kexec_get_range(), xc_kexec_load(), and
> xc_kexec_unload().  The load and unload calls require the v2 load and
> unload ops.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   tools/libxc/Makefile   |    1 +
>   tools/libxc/xc_kexec.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
>   tools/libxc/xenctrl.h  |   55 +++++++++++++++++++
>   3 files changed, 196 insertions(+), 0 deletions(-)
>   create mode 100644 tools/libxc/xc_kexec.c
>
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> index 4c64c15..f2d6e56 100644
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -31,6 +31,7 @@ CTRL_SRCS-y       += xc_mem_access.c
>   CTRL_SRCS-y       += xc_memshr.c
>   CTRL_SRCS-y       += xc_hcall_buf.c
>   CTRL_SRCS-y       += xc_foreign_memory.c
> +CTRL_SRCS-y       += xc_kexec.c
>   CTRL_SRCS-y       += xtl_core.c
>   CTRL_SRCS-y       += xtl_logger_stdio.c
>   CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
> diff --git a/tools/libxc/xc_kexec.c b/tools/libxc/xc_kexec.c
> new file mode 100644
> index 0000000..a49cffb
> --- /dev/null
> +++ b/tools/libxc/xc_kexec.c
> @@ -0,0 +1,140 @@
> +/******************************************************************************
> + * xc_kexec.c
> + *
> + * API for loading and executing kexec images.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation;
> + * version 2.1 of the License.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + */
> +#include "xc_private.h"
> +
> +int xc_kexec_exec(xc_interface *xch, int type)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_exec_t, exec);
> +    int ret = -1;
> +
> +    exec = xc_hypercall_buffer_alloc(xch, exec, sizeof(*exec));
> +    if ( exec == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec_exec hypercall");
> +        goto out;
> +    }
> +
> +    exec->type = type;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(exec);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, exec);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
> +                       uint64_t *size, uint64_t *start)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_range_t, get_range);
> +    int ret = -1;
> +
> +    get_range = xc_hypercall_buffer_alloc(xch, get_range, sizeof(*get_range));
> +    if ( get_range == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec_get_range hypercall");
> +        goto out;
> +    }
> +
> +    get_range->range = range;
> +    get_range->nr = nr;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_get_range;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(get_range);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +    *size = get_range->size;
> +    *start = get_range->start;
> +
> +out:
> +    xc_hypercall_buffer_free(xch, get_range);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
> +                  uint64_t entry_maddr,
> +                  uint32_t nr_segments, xen_kexec_segment_t *segments)
> +{
> +    int ret = -1;
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BOUNCE(segments, sizeof(*segments) * nr_segments,
> +                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_load_t, load);
> +
> +    if ( xc_hypercall_bounce_pre(xch, segments) )
> +    {
> +        PERROR("Could not allocate bounce buffer for kexec load hypercall");
> +        goto out;
> +    }
> +    load = xc_hypercall_buffer_alloc(xch, load, sizeof(*load));
> +    if ( load == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec load hypercall");
> +        goto out;
> +    }
> +
> +    load->type = type;
> +    load->arch = arch;
> +    load->entry_maddr = entry_maddr;
> +    load->nr_segments = nr_segments;
> +    set_xen_guest_handle(load->segments.h, segments);
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_load;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(load);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, load);
> +    xc_hypercall_bounce_post(xch, segments);
> +
> +    return ret;
> +}
> +
> +int xc_kexec_unload(xc_interface *xch, int type)
> +{
> +    DECLARE_HYPERCALL;
> +    DECLARE_HYPERCALL_BUFFER(xen_kexec_unload_t, unload);
> +    int ret = -1;
> +
> +    unload = xc_hypercall_buffer_alloc(xch, unload, sizeof(*unload));
> +    if ( unload == NULL )
> +    {
> +        PERROR("Could not allocate buffer for kexec unload hypercall");
> +        goto out;
> +    }
> +
> +    unload->type = type;
> +
> +    hypercall.op = __HYPERVISOR_kexec_op;
> +    hypercall.arg[0] = KEXEC_CMD_kexec_unload;
> +    hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(unload);
> +
> +    ret = do_xen_hypercall(xch, &hypercall);
> +
> +out:
> +    xc_hypercall_buffer_free(xch, unload);
> +
> +    return ret;
> +}
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index a7e8c31..4ac6b8a 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -46,6 +46,7 @@
>   #include <xen/hvm/params.h>
>   #include <xen/xsm/flask_op.h>
>   #include <xen/tmem.h>
> +#include <xen/kexec.h>
>   
>   #include "xentoollog.h"
>   
> @@ -2340,4 +2341,58 @@ int xc_compression_uncompress_page(xc_interface *xch, char *compbuf,
>   				   unsigned long compbuf_size,
>   				   unsigned long *compbuf_pos, char *dest);
>   
> +/*
> + * Execute an image previously loaded with xc_kexec_load().
> + *
> + * Does not return on success.
> + *
> + * Fails with:
> + *   ENOENT if the specified image has not been loaded.
> + */
> +int xc_kexec_exec(xc_interface *xch, int type);
> +
> +/*
> + * Find the machine address and size of certain memory areas.
> + *
> + *   KEXEC_RANGE_MA_CRASH       crash area
> + *   KEXEC_RANGE_MA_XEN         Xen itself
> + *   KEXEC_RANGE_MA_CPU         CPU note for CPU number 'nr'
> + *   KEXEC_RANGE_MA_XENHEAP     xenheap
> + *   KEXEC_RANGE_MA_EFI_MEMMAP  EFI Memory Map
> + *   KEXEC_RANGE_MA_VMCOREINFO  vmcoreinfo
> + *
> + * Fails with:
> + *   EINVAL if the range or CPU number isn't valid.
> + */
> +int xc_kexec_get_range(xc_interface *xch, int range,  int nr,
> +                       uint64_t *size, uint64_t *start);
> +
> +/*
> + * Load a kexec image into memory.
> + *
> + * The image may be of type KEXEC_TYPE_DEFAULT (executed on request)
> + * or KEXEC_TYPE_CRASH (executed on a crash).
> + *
> + * The image architecture may be a 32-bit variant of the hypervisor
> + * architecture (e.g., EM_386 on an x86-64 hypervisor).
> + *
> + * Fails with:
> + *   ENOMEM if there is insufficient memory for the new image.
> + *   EINVAL if the image does not fit into the crash area or the entry
> + *          point isn't within one of the segments.
> + *   EBUSY  if another image is being executed.
> + */
> +int xc_kexec_load(xc_interface *xch, uint8_t type, uint16_t arch,
> +                  uint64_t entry_maddr,
> +                  uint32_t nr_segments, xen_kexec_segment_t *segments);
> +
> +/*
> + * Unload a kexec image.
> + *
> + * This prevents a KEXEC_TYPE_DEFAULT or KEXEC_TYPE_CRASH image from
> + * being executed.  The crash images are not cleared from the crash
> + * region.
> + */
> +int xc_kexec_unload(xc_interface *xch, int type);
> +
>   #endif /* XENCTRL_H */
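The xc_kexec_load() comment above says the call fails with EINVAL if the entry point is not within one of the segments. A minimal sketch of such a containment check, assuming each segment covers the half-open range [dest_maddr, dest_maddr + dest_size). struct kexec_seg_sketch and entry_in_segments() are hypothetical names; the struct keeps only the size and destination fields of xen_kexec_segment_t and drops the guest buffer handle.

```c
#include <stdint.h>

/* Hypothetical stand-in for xen_kexec_segment_t: only the size and
 * destination fields are modelled, not the guest buffer handle. */
struct kexec_seg_sketch {
    uint64_t buf_size;   /* bytes supplied by the caller */
    uint64_t dest_maddr; /* machine address the segment is loaded to */
    uint64_t dest_size;  /* bytes reserved at the destination */
};

/* Return 1 if entry_maddr lies in [dest_maddr, dest_maddr + dest_size)
 * of at least one segment, 0 otherwise.  The subtraction form avoids
 * overflow when dest_maddr + dest_size would wrap. */
int entry_in_segments(uint64_t entry_maddr,
                      const struct kexec_seg_sketch *segs, uint32_t nr)
{
    for (uint32_t i = 0; i < nr; i++) {
        if (entry_maddr >= segs[i].dest_maddr &&
            entry_maddr - segs[i].dest_maddr < segs[i].dest_size)
            return 1;
    }
    return 0;
}
```

Under this assumption, an entry point exactly at the end of a segment is rejected, while one anywhere inside a zero-buf_size segment is still accepted.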


* Re: [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
  2013-11-06 14:49   ` David Vrabel
@ 2013-11-07 20:56     ` Don Slutz
  -1 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 20:56 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> In the existing kexec hypercall, the load and unload ops depend on
> internals of the Linux kernel (the page list and code page provided by
> the kernel).  The code page is used to transition between Xen context
> and the image, so using kernel code there doesn't make sense and will
> not work for PVH guests.
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> that no longer require a code page to be provided by the guest -- Xen
> now provides the code for calling the image directly.
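In the diff below, both new ops finish in kexec_swap_images(). That function keeps two slots per image type and flips a bit to select the active one, and it hands the previously active image back to the caller to free. The following is a minimal model of that swap, without the lock. struct slot_pair and swap_image() are hypothetical names. The real code tracks slot presence with kexec_flags bits under kexec_lock rather than storing NULL.

```c
#include <stddef.h>

/* Hypothetical model of one image type's double-buffered slots. */
struct slot_pair {
    void *image[2];
    int active;          /* index of the currently active slot */
};

/* Install new_image (NULL models an unload) in the inactive slot,
 * make that slot active, and return the previously active image so
 * the caller can free it outside any lock. */
void *swap_image(struct slot_pair *p, void *new_image)
{
    int old_slot = p->active;
    int new_slot = !p->active;
    void *old;

    p->image[new_slot] = new_image;
    p->active = new_slot;

    old = p->image[old_slot];
    p->image[old_slot] = NULL;
    return old;
}
```

Returning the old image instead of freeing it inside the swap mirrors how kexec_load_slot() calls kexec_unload_image() only after the lock has been dropped.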
>
> The new load op looks similar to the Linux kexec_load system call and
> allows the guest to provide the image data to be loaded.  The guest
> specifies the architecture of the image, which may be a 32-bit subarch
> of the hypervisor's architecture (e.g., an EM_386 image on an
> EM_X86_64 hypervisor).
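On x86-64, the image architecture determines whether kexec_reloc drops to 32-bit mode before calling the entry point. machine_kexec_load() rejects anything other than EM_386 or EM_X86_64, and machine_kexec() sets KEXEC_RELOC_FLAG_COMPAT for EM_386. The following is a minimal sketch of that mapping. RELOC_FLAG_COMPAT is a hypothetical stand-in whose value is an assumption; the real flag presumably comes from asm/machine_kexec.h. EM_386 and EM_X86_64 use their standard ELF e_machine values.

```c
#include <stdint.h>

#define EM_386     3   /* ELF e_machine: Intel 80386 */
#define EM_X86_64 62   /* ELF e_machine: AMD x86-64 */

/* Hypothetical value; not taken from the Xen headers. */
#define RELOC_FLAG_COMPAT 0x1

/* Return the relocation flags for an image architecture, or -1 where
 * machine_kexec_load() would return -EINVAL. */
long reloc_flags_for_arch(uint16_t arch)
{
    switch (arch) {
    case EM_386:
        return RELOC_FLAG_COMPAT;  /* leave long mode before the call */
    case EM_X86_64:
        return 0;                  /* call the entry point in long mode */
    default:
        return -1;
    }
}
```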
>
> The toolstack can now load images without kernel involvement.  This is
> required for supporting kexec when using a dom0 with an upstream
> kernel.
>
> Crash images are copied directly into the crash region on load.
> Default images are copied into domheap pages and a list of source and
> destination machine addresses is created.  This list is used by
> kexec_reloc() to relocate the image to its destination.
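The list that kexec_reloc() walks is the indirection list decoded by relocate_pages in kexec_reloc.S. Each entry is a page-aligned address OR'd with one IND_* flag, and the destination pointer advances by one page after each copy or zero. The following is a C model of that loop. The flag values are assumed to match xen/include/xen/kimage.h. User-space pointers stand in for machine addresses, and SKETCH_PAGE_SIZE is a hypothetical 4 KiB constant.

```c
#include <stdint.h>
#include <string.h>

/* Entry flags, assumed to match xen/include/xen/kimage.h. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8
#define IND_ZERO        0x10

#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Model of relocate_pages: process entries until IND_DONE. */
void relocate_pages(const uintptr_t *entry)
{
    unsigned char *dest = NULL;

    for (;;) {
        uintptr_t e = *entry++;
        unsigned char *addr = (unsigned char *)(e & SKETCH_PAGE_MASK);

        if (e & IND_DESTINATION) {          /* set next destination */
            dest = addr;
        } else if (e & IND_INDIRECTION) {   /* continue in another page */
            entry = (const uintptr_t *)addr;
        } else if (e & IND_DONE) {
            return;
        } else if (e & IND_SOURCE) {        /* copy one page */
            memcpy(dest, addr, SKETCH_PAGE_SIZE);
            dest += SKETCH_PAGE_SIZE;
        } else if (e & IND_ZERO) {          /* zero one page */
            memset(dest, 0, SKETCH_PAGE_SIZE);
            dest += SKETCH_PAGE_SIZE;
        }
    }
}
```

As in the assembly, an IND_ZERO entry's address bits are ignored, and only pages reached through IND_INDIRECTION need to be page aligned.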
>
> The old load and unload sub-ops are still available (as
> KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
> of the new infrastructure.
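The old load ops pass a list of page frames. In the diff below, kexec_segments_add_segment() rebuilds that list as segments and starts a new segment only when a page is not physically contiguous with the previous one. The following is a minimal sketch of that merging. The tail of the function is cut off in this excerpt, so growing the last segment by one page per call is an assumption. SKETCH_SEGMENT_MAX is a hypothetical stand-in for KEXEC_SEGMENT_MAX, and struct sketch_segment keeps only the destination fields.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PAGE_SIZE   (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_SEGMENT_MAX 16   /* hypothetical limit */

struct sketch_segment {
    uint64_t dest_maddr;
    uint64_t dest_size;
};

/* Append the page at mfn, extending the last segment when contiguous.
 * Returns 0, or -1 when a new segment would exceed the limit (the
 * Xen code returns -EINVAL there). */
int add_page(unsigned int *nr, struct sketch_segment *segs,
             unsigned long mfn)
{
    uint64_t maddr = (uint64_t)mfn << SKETCH_PAGE_SHIFT;
    unsigned int n = *nr;

    if (n == 0 || segs[n - 1].dest_maddr + segs[n - 1].dest_size != maddr) {
        if (n + 1 > SKETCH_SEGMENT_MAX)
            return -1;
        segs[n].dest_maddr = maddr;
        segs[n].dest_size = 0;
        *nr = ++n;
    }
    segs[n - 1].dest_size += SKETCH_PAGE_SIZE;
    return 0;
}
```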
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
>   xen/arch/x86/x86_64/Makefile        |    2 +-
>   xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
>   xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
>   xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
>   xen/common/kimage.c                 |  122 +++++++++++-
>   xen/include/asm-x86/fixmap.h        |    3 -
>   xen/include/asm-x86/machine_kexec.h |   16 ++
>   xen/include/xen/kexec.h             |   16 +-
>   xen/include/xen/kimage.h            |    6 +
>   10 files changed, 804 insertions(+), 336 deletions(-)
>   delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
>   create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
>   create mode 100644 xen/include/asm-x86/machine_kexec.h
>
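machine_kexec_add_page() in the diff below walks the kexec page tables one level at a time and allocates a table whenever the entry for vaddr is not present. The index at each level is a 9-bit slice of the virtual address. The following is a minimal sketch of that decomposition, assuming standard 4-level x86-64 paging with 4 KiB pages. The results should correspond to what l4_table_offset() through l1_table_offset() compute. PT_INDEX, struct pt_indices and pt_indices_for() are hypothetical names.

```c
#include <stdint.h>

/* 512 entries per table: 9 address bits per level above a 12-bit
 * page offset. */
#define PT_INDEX(va, shift) ((unsigned int)(((va) >> (shift)) & 0x1ff))

struct pt_indices {
    unsigned int l4, l3, l2, l1;
};

struct pt_indices pt_indices_for(uint64_t vaddr)
{
    struct pt_indices idx = {
        PT_INDEX(vaddr, 39),   /* L4 (PML4) slot */
        PT_INDEX(vaddr, 30),   /* L3 slot */
        PT_INDEX(vaddr, 21),   /* L2 slot */
        PT_INDEX(vaddr, 12),   /* L1 slot */
    };
    return idx;
}
```

Because kexec_reloc is mapped at its own Xen virtual address, these indices select the slots that machine_kexec_add_page() fills in the private tables rooted at image->aux_page.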
> diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
> index 68b9705..b70d5a6 100644
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -1,9 +1,18 @@
>   /******************************************************************************
>    * machine_kexec.c
>    *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
>    * Xen port written by:
>    * - Simon 'Horms' Horman <horms@verge.net.au>
>    * - Magnus Damm <magnus@valinux.co.jp>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
>    */
>   
>   #include <xen/types.h>
> @@ -11,63 +20,124 @@
>   #include <xen/guest_access.h>
>   #include <asm/fixmap.h>
>   #include <asm/hpet.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
>   
> -typedef void (*relocate_new_kernel_t)(
> -                unsigned long indirection_page,
> -                unsigned long *page_list,
> -                unsigned long start_address,
> -                unsigned int preserve_context);
> -
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
> +/*
> + * Add a mapping for a page to the page tables used during kexec.
> + */
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr)
>   {
> -    unsigned long prev_ma = 0;
> -    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
> -    int k;
> +    struct page_info *l4_page;
> +    struct page_info *l3_page;
> +    struct page_info *l2_page;
> +    struct page_info *l1_page;
> +    l4_pgentry_t *l4 = NULL;
> +    l3_pgentry_t *l3 = NULL;
> +    l2_pgentry_t *l2 = NULL;
> +    l1_pgentry_t *l1 = NULL;
> +    int ret = -ENOMEM;
> +
> +    l4_page = image->aux_page;
> +    if ( !l4_page )
> +    {
> +        l4_page = kimage_alloc_control_page(image, 0);
> +        if ( !l4_page )
> +            goto out;
> +        image->aux_page = l4_page;
> +    }
>   
> -    /* setup fixmap to point to our pages and record the virtual address
> -     * in every odd index in page_list[].
> -     */
> +    l4 = __map_domain_page(l4_page);
> +    l4 += l4_table_offset(vaddr);
> +    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
> +    {
> +        l3_page = kimage_alloc_control_page(image, 0);
> +        if ( !l3_page )
> +            goto out;
> +        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l3_page = l4e_get_page(*l4);
> +
> +    l3 = __map_domain_page(l3_page);
> +    l3 += l3_table_offset(vaddr);
> +    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
> +    {
> +        l2_page = kimage_alloc_control_page(image, 0);
> +        if ( !l2_page )
> +            goto out;
> +        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l2_page = l3e_get_page(*l3);
> +
> +    l2 = __map_domain_page(l2_page);
> +    l2 += l2_table_offset(vaddr);
> +    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
> +    {
> +        l1_page = kimage_alloc_control_page(image, 0);
> +        if ( !l1_page )
> +            goto out;
> +        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l1_page = l2e_get_page(*l2);
> +
> +    l1 = __map_domain_page(l1_page);
> +    l1 += l1_table_offset(vaddr);
> +    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
> +
> +    ret = 0;
> +out:
> +    if ( l1 )
> +        unmap_domain_page(l1);
> +    if ( l2 )
> +        unmap_domain_page(l2);
> +    if ( l3 )
> +        unmap_domain_page(l3);
> +    if ( l4 )
> +        unmap_domain_page(l4);
> +    return ret;
> +}
>   
> -    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
> +int machine_kexec_load(struct kexec_image *image)
> +{
> +    void *code_page;
> +    int ret;
> +
> +    switch ( image->arch )
>       {
> -        if ( (k & 1) == 0 )
> -        {
> -            /* Even pages: machine address. */
> -            prev_ma = image->page_list[k];
> -        }
> -        else
> -        {
> -            /* Odd pages: va for previous ma. */
> -            if ( is_pv_32on64_domain(dom0) )
> -            {
> -                /*
> -                 * The compatability bounce code sets up a page table
> -                 * with a 1-1 mapping of the first 1G of memory so
> -                 * VA==PA here.
> -                 *
> -                 * This Linux purgatory code still sets up separate
> -                 * high and low mappings on the control page (entries
> -                 * 0 and 1) but it is harmless if they are equal since
> -                 * that PT is not live at the time.
> -                 */
> -                image->page_list[k] = prev_ma;
> -            }
> -            else
> -            {
> -                set_fixmap(fix_base + (k >> 1), prev_ma);
> -                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
> -            }
> -        }
> +    case EM_386:
> +    case EM_X86_64:
> +        break;
> +    default:
> +        return -EINVAL;
>       }
>   
> +    code_page = __map_domain_page(image->control_code_page);
> +    memcpy(code_page, kexec_reloc, kexec_reloc_size);
> +    unmap_domain_page(code_page);
> +
> +    /*
> +     * Add a mapping for the control code page to the same virtual
> +     * address as kexec_reloc.  This allows us to keep running after
> +     * these page tables are loaded in kexec_reloc.
> +     */
> +    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
> +                                 page_to_maddr(image->control_code_page));
> +    if ( ret < 0 )
> +        return ret;
> +
>       return 0;
>   }
>   
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
> +void machine_kexec_unload(struct kexec_image *image)
>   {
> +    /* no-op. kimage_free() frees all control pages. */
>   }
>   
> -void machine_reboot_kexec(xen_kexec_image_t *image)
> +void machine_reboot_kexec(struct kexec_image *image)
>   {
>       BUG_ON(smp_processor_id() != 0);
>       smp_send_stop();
> @@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
>       BUG();
>   }
>   
> -void machine_kexec(xen_kexec_image_t *image)
> +void machine_kexec(struct kexec_image *image)
>   {
> -    struct desc_ptr gdt_desc = {
> -        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
> -        .limit = LAST_RESERVED_GDT_BYTE
> -    };
>       int i;
> +    unsigned long reloc_flags = 0;
>   
>       /* We are about to permenantly jump out of the Xen context into the kexec
>        * purgatory code.  We really dont want to be still servicing interupts.
> @@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
>        * not like running with NMIs disabled. */
>       enable_nmis();
>   
> -    /*
> -     * compat_machine_kexec() returns to idle pagetables, which requires us
> -     * to be running on a static GDT mapping (idle pagetables have no GDT
> -     * mappings in their per-domain mapping area).
> -     */
> -    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
> +    if ( image->arch == EM_386 )
> +        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
>   
> -    if ( is_pv_32on64_domain(dom0) )
> -    {
> -        compat_machine_kexec(image->page_list[1],
> -                             image->indirection_page,
> -                             image->page_list,
> -                             image->start_address);
> -    }
> -    else
> -    {
> -        relocate_new_kernel_t rnk;
> -
> -        rnk = (relocate_new_kernel_t) image->page_list[1];
> -        (*rnk)(image->indirection_page, image->page_list,
> -               image->start_address,
> -               0 /* preserve_context */);
> -    }
> +    kexec_reloc(page_to_maddr(image->control_code_page),
> +                page_to_maddr(image->aux_page),
> +                image->head, image->entry_maddr, reloc_flags);
>   }
>   
>   int machine_kexec_get(xen_kexec_range_t *range)
> diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
> index d56e12d..7f8fb3d 100644
> --- a/xen/arch/x86/x86_64/Makefile
> +++ b/xen/arch/x86/x86_64/Makefile
> @@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
>   obj-y += mmconfig_64.o
>   obj-y += mmconfig-shared.o
>   obj-y += compat.o
> -obj-bin-y += compat_kexec.o
>   obj-y += domain.o
>   obj-y += physdev.o
>   obj-y += platform_hypercall.o
>   obj-y += cpu_idle.o
>   obj-y += cpufreq.o
> +obj-bin-y += kexec_reloc.o
>   
>   obj-$(crash_debug)   += gdbstub.o
> diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
> deleted file mode 100644
> index fc92af9..0000000
> --- a/xen/arch/x86/x86_64/compat_kexec.S
> +++ /dev/null
> @@ -1,187 +0,0 @@
> -/*
> - * Compatibility kexec handler.
> - */
> -
> -/*
> - * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
> - * currently true but if it ever changes then compat_pg_table will
> - * need to be moved back below 4G at run time.
> - */
> -
> -#include <xen/config.h>
> -
> -#include <asm/asm_defns.h>
> -#include <asm/msr.h>
> -#include <asm/page.h>
> -
> -/* The unrelocated physical address of a symbol. */
> -#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
> -
> -/* Load physical address of symbol into register and relocate it. */
> -#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
> -                               add xen_phys_start(%rip), reg
> -
> -/*
> - * Relocate a physical address in memory. Size of temporary register
> - * determines size of the value to relocate.
> - */
> -#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
> -                               add xen_phys_start(%rip), reg ; \
> -                               mov reg, addr(%rip)
> -
> -        .text
> -
> -        .code64
> -
> -ENTRY(compat_machine_kexec)
> -        /* x86/64                        x86/32  */
> -        /* %rdi - relocate_new_kernel_t  CALL    */
> -        /* %rsi - indirection page       4(%esp) */
> -        /* %rdx - page_list              8(%esp) */
> -        /* %rcx - start address         12(%esp) */
> -        /*        cpu has pae           16(%esp) */
> -
> -        /* Shim the 64 bit page_list into a 32 bit page_list. */
> -        mov $12,%r9
> -        lea compat_page_list(%rip), %rbx
> -1:      dec %r9
> -        movl (%rdx,%r9,8),%eax
> -        movl %eax,(%rbx,%r9,4)
> -        test %r9,%r9
> -        jnz 1b
> -
> -        RELOCATE_SYM(compat_page_list,%rdx)
> -
> -        /* Relocate compatibility mode entry point address. */
> -        RELOCATE_MEM(compatibility_mode_far,%eax)
> -
> -        /* Relocate compat_pg_table. */
> -        RELOCATE_MEM(compat_pg_table,     %rax)
> -        RELOCATE_MEM(compat_pg_table+0x8, %rax)
> -        RELOCATE_MEM(compat_pg_table+0x10,%rax)
> -        RELOCATE_MEM(compat_pg_table+0x18,%rax)
> -
> -        /*
> -         * Setup an identity mapped region in PML4[0] of idle page
> -         * table.
> -         */
> -        RELOCATE_SYM(l3_identmap,%rax)
> -        or  $0x63,%rax
> -        mov %rax, idle_pg_table(%rip)
> -
> -        /* Switch to idle page table. */
> -        RELOCATE_SYM(idle_pg_table,%rax)
> -        movq %rax, %cr3
> -
> -        /* Switch to identity mapped compatibility stack. */
> -        RELOCATE_SYM(compat_stack,%rax)
> -        movq %rax, %rsp
> -
> -        /* Save xen_phys_start for 32 bit code. */
> -        movq xen_phys_start(%rip), %rbx
> -
> -        /* Jump to low identity mapping in compatibility mode. */
> -        ljmp *compatibility_mode_far(%rip)
> -        ud2
> -
> -compatibility_mode_far:
> -        .long SYM_PHYS(compatibility_mode)
> -        .long __HYPERVISOR_CS32
> -
> -        /*
> -         * We use 5 words of stack for the arguments passed to the kernel. The
> -         * kernel only uses 1 word before switching to its own stack. Allocate
> -         * 16 words to give "plenty" of room.
> -         */
> -        .fill 16,4,0
> -compat_stack:
> -
> -        .code32
> -
> -#undef RELOCATE_SYM
> -#undef RELOCATE_MEM
> -
> -/*
> - * Load physical address of symbol into register and relocate it. %rbx
> - * contains xen_phys_start(%rip) saved before jump to compatibility
> - * mode.
> - */
> -#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
> -                              add %ebx, reg
> -
> -compatibility_mode:
> -        /* Setup some sane segments. */
> -        movl $__HYPERVISOR_DS32, %eax
> -        movl %eax, %ds
> -        movl %eax, %es
> -        movl %eax, %fs
> -        movl %eax, %gs
> -        movl %eax, %ss
> -
> -        /* Push arguments onto stack. */
> -        pushl $0   /* 20(%esp) - preserve context */
> -        pushl $1   /* 16(%esp) - cpu has pae */
> -        pushl %ecx /* 12(%esp) - start address */
> -        pushl %edx /*  8(%esp) - page list */
> -        pushl %esi /*  4(%esp) - indirection page */
> -        pushl %edi /*  0(%esp) - CALL */
> -
> -        /* Disable paging and therefore leave 64 bit mode. */
> -        movl %cr0, %eax
> -        andl $~X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -
> -        /* Switch to 32 bit page table. */
> -        RELOCATE_SYM(compat_pg_table, %eax)
> -        movl  %eax, %cr3
> -
> -        /* Clear MSR_EFER[LME], disabling long mode */
> -        movl    $MSR_EFER,%ecx
> -        rdmsr
> -        btcl    $_EFER_LME,%eax
> -        wrmsr
> -
> -        /* Re-enable paging, but only 32 bit mode now. */
> -        movl %cr0, %eax
> -        orl $X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -        jmp 1f
> -1:
> -
> -        popl %eax
> -        call *%eax
> -        ud2
> -
> -        .data
> -        .align 4
> -compat_page_list:
> -        .fill 12,4,0
> -
> -        .align 32,0
> -
> -        /*
> -         * These compat page tables contain an identity mapping of the
> -         * first 4G of the physical address space.
> -         */
> -compat_pg_table:
> -        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
> -
> -        .section .data.page_aligned, "aw", @progbits
> -        .align PAGE_SIZE,0
> -compat_pg_table_l2:
> -        .macro identmap from=0, count=512
> -        .if \count-1
> -        identmap "(\from+0)","(\count/2)"
> -        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
> -        .else
> -        .quad 0x00000000000000e3 + \from
> -        .endif
> -        .endm
> -
> -        identmap 0x00000000
> -        identmap 0x40000000
> -        identmap 0x80000000
> -        identmap 0xc0000000
> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> new file mode 100644
> index 0000000..7a16c85
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> @@ -0,0 +1,198 @@
> +/*
> + * Relocate a kexec_image to its destination and call it.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <xen/config.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/asm_defns.h>
> +#include <asm/msr.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
> +
> +        .text
> +        .align PAGE_SIZE
> +        .code64
> +
> +ENTRY(kexec_reloc)
> +        /* %rdi - code page maddr */
> +        /* %rsi - page table maddr */
> +        /* %rdx - indirection page maddr */
> +        /* %rcx - entry maddr (%rbp) */
> +        /* %r8 - flags */
> +
> +        movq    %rcx, %rbp
> +
> +        /* Setup stack. */
> +        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
> +
> +        /* Load reloc page table. */
> +        movq    %rsi, %cr3
> +
> +        /* Jump to identity mapped code. */
> +        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
> +        jmpq    *%rax
> +
> +identity_mapped:
> +        /*
> +         * Set cr0 to a known state:
> +         *  - Paging enabled
> +         *  - Alignment check disabled
> +         *  - Write protect disabled
> +         *  - No task switch
> +         *  - Don't do FP software emulation.
> +         *  - Protected mode enabled
> +         */
> +        movq    %cr0, %rax
> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
> +        movq    %rax, %cr0
> +
> +        /*
> +         * Set cr4 to a known state:
> +         *  - physical address extension enabled
> +         */
> +        movl    $X86_CR4_PAE, %eax
> +        movq    %rax, %cr4
> +
> +        movq    %rdx, %rdi
> +        call    relocate_pages
> +
> +        /* Need to switch to 32-bit mode? */
> +        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
> +        jnz     call_32_bit
> +
> +call_64_bit:
> +        /* Call the image entry point.  This should never return. */
> +        callq   *%rbp
> +        ud2
> +
> +call_32_bit:
> +        /* Setup IDT. */
> +        lidt    compat_mode_idt(%rip)
> +
> +        /* Load compat GDT. */
> +        leaq    compat_mode_gdt(%rip), %rax
> +        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
> +        lgdt    compat_mode_gdt_desc(%rip)
> +
> +        /* Relocate compatibility mode entry point address. */
> +        leal    compatibility_mode(%rip), %eax
> +        movl    %eax, compatibility_mode_far(%rip)
> +
> +        /* Enter compatibility mode. */
> +        ljmp    *compatibility_mode_far(%rip)
> +
> +relocate_pages:
> +        /* %rdi - indirection page maddr */
> +        pushq   %rbx
> +
> +        cld
> +        movq    %rdi, %rbx
> +        xorl    %edi, %edi
> +        xorl    %esi, %esi
> +
> +next_entry: /* top, read another word for the indirection page */
> +
> +        movq    (%rbx), %rcx
> +        addq    $8, %rbx
> +is_dest:
> +        testb   $IND_DESTINATION, %cl
> +        jz      is_ind
> +        movq    %rcx, %rdi
> +        andq    $PAGE_MASK, %rdi
> +        jmp     next_entry
> +is_ind:
> +        testb   $IND_INDIRECTION, %cl
> +        jz      is_done
> +        movq    %rcx, %rbx
> +        andq    $PAGE_MASK, %rbx
> +        jmp     next_entry
> +is_done:
> +        testb   $IND_DONE, %cl
> +        jnz     done
> +is_source:
> +        testb   $IND_SOURCE, %cl
> +        jz      is_zero
> +        movq    %rcx, %rsi      /* For every source page do a copy */
> +        andq    $PAGE_MASK, %rsi
> +        movl    $(PAGE_SIZE / 8), %ecx
> +        rep movsq
> +        jmp     next_entry
> +is_zero:
> +        testb   $IND_ZERO, %cl
> +        jz      next_entry
> +        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
> +        xorl    %eax, %eax
> +        rep stosq
> +        jmp     next_entry
> +done:
> +        popq    %rbx
> +        ret
> +
> +        .code32
> +
> +compatibility_mode:
> +        /* Setup some sane segments. */
> +        movl    $0x0008, %eax
> +        movl    %eax, %ds
> +        movl    %eax, %es
> +        movl    %eax, %fs
> +        movl    %eax, %gs
> +        movl    %eax, %ss
> +
> +        /* Disable paging and therefore leave 64 bit mode. */
> +        movl    %cr0, %eax
> +        andl    $~X86_CR0_PG, %eax
> +        movl    %eax, %cr0
> +
> +        /* Disable long mode */
> +        movl    $MSR_EFER, %ecx
> +        rdmsr
> +        andl    $~EFER_LME, %eax
> +        wrmsr
> +
> +        /* Clear cr4 to disable PAE. */
> +        xorl    %eax, %eax
> +        movl    %eax, %cr4
> +
> +        /* Call the image entry point.  This should never return. */
> +        call    *%ebp
> +        ud2
> +
> +        .align 4
> +compatibility_mode_far:
> +        .long 0x00000000             /* set in call_32_bit above */
> +        .word 0x0010
> +
> +compat_mode_gdt_desc:
> +        .word (3*8)-1
> +        .quad 0x0000000000000000     /* set in call_32_bit above */
> +
> +        .align 8
> +compat_mode_gdt:
> +        .quad 0x0000000000000000     /* null                              */
> +        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
> +        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
> +
> +compat_mode_idt:
> +        .word 0                      /* limit */
> +        .long 0                      /* base */
> +
> +        /*
> +         * 16 words of stack are more than enough.
> +         */
> +        .fill 16,8,0
> +reloc_stack:
> +
> +        .globl kexec_reloc_size
> +kexec_reloc_size:
> +        .long . - kexec_reloc
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index 7b23df0..c5450ba 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -25,6 +25,7 @@
>   #include <xen/version.h>
>   #include <xen/console.h>
>   #include <xen/kexec.h>
> +#include <xen/kimage.h>
>   #include <public/elfnote.h>
>   #include <xsm/xsm.h>
>   #include <xen/cpu.h>
> @@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
>   
>   static cpumask_t crash_saved_cpus;
>   
> -static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
> +static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
>   
>   #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
>   #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
> @@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
>   
>   static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
>   
> -static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
> -
>   static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
>   static size_t vmcoreinfo_size = 0;
>   
> @@ -311,14 +310,14 @@ void kexec_crash(void)
>       kexec_common_shutdown();
>       kexec_crash_save_cpu();
>       machine_crash_shutdown();
> -    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
> +    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
>   
>       BUG();
>   }
>   
>   static long kexec_reboot(void *_image)
>   {
> -    xen_kexec_image_t *image = _image;
> +    struct kexec_image *image = _image;
>   
>       kexecing = TRUE;
>   
> @@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
>   #endif
>   }
>   
> -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
> +static void kexec_unload_image(struct kexec_image *image)
>   {
> -    xen_kexec_image_t *image;
> +    if ( !image )
> +        return;
> +
> +    machine_kexec_unload(image);
> +    kimage_free(image);
> +}
> +
> +static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_exec_t exec;
> +    struct kexec_image *image;
> +    int base, bit, pos, ret = -EINVAL;
> +
> +    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +        return -EFAULT;
> +
> +    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +        return -EINVAL;
> +
> +    pos = (test_bit(bit, &kexec_flags) != 0);
> +
> +    /* Only allow kexec/kdump into loaded images */
> +    if ( !test_bit(base + pos, &kexec_flags) )
> +        return -ENOENT;
> +
> +    switch (exec.type)
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        image = kexec_image[base + pos];
> +        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        kexec_crash(); /* Does not return */
> +        break;
> +    }
> +
> +    return -EINVAL; /* never reached */
> +}
> +
> +static int kexec_swap_images(int type, struct kexec_image *new,
> +                             struct kexec_image **old)
> +{
> +    static DEFINE_SPINLOCK(kexec_lock);
>       int base, bit, pos;
> -    int ret = 0;
> +    int new_slot, old_slot;
> +
> +    *old = NULL;
> +
> +    spin_lock(&kexec_lock);
> +
> +    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    {
> +        spin_unlock(&kexec_lock);
> +        return -EBUSY;
> +    }
>   
> -    if ( kexec_load_get_bits(load->type, &base, &bit) )
> +    if ( kexec_load_get_bits(type, &base, &bit) )
>           return -EINVAL;
>   
>       pos = (test_bit(bit, &kexec_flags) != 0);
> +    old_slot = base + pos;
> +    new_slot = base + !pos;
>   
> -    /* Load the user data into an unused image */
> -    if ( op == KEXEC_CMD_kexec_load )
> +    if ( new )
>       {
> -        image = &kexec_image[base + !pos];
> +        kexec_image[new_slot] = new;
> +        set_bit(new_slot, &kexec_flags);
> +    }
> +    change_bit(bit, &kexec_flags);
>   
> -        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
> +    clear_bit(old_slot, &kexec_flags);
> +    *old = kexec_image[old_slot];
>   
> -        memcpy(image, &load->image, sizeof(*image));
> +    spin_unlock(&kexec_lock);
>   
> -        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
> -        {
> -            /* Set image present bit */
> -            set_bit((base + !pos), &kexec_flags);
> +    return 0;
> +}
>   
> -            /* Make new image the active one */
> -            change_bit(bit, &kexec_flags);
> -        }
> +static int kexec_load_slot(struct kexec_image *kimage)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret = -ENOMEM;
> +
> +    ret = machine_kexec_load(kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    crash_save_vmcoreinfo();
> +
> +    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static uint16_t kexec_load_v1_arch(void)
> +{
> +#ifdef CONFIG_X86
> +    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
> +#else
> +    return EM_NONE;
> +#endif
> +}
>   
> -        crash_save_vmcoreinfo();
> +static int kexec_segments_add_segment(
> +    unsigned int *nr_segments, xen_kexec_segment_t *segments,
> +    unsigned long mfn)
> +{
> +    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
> +    unsigned int n = *nr_segments;
> +
> +    /* Need a new segment? */
> +    if ( n == 0
> +         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
> +    {
> +        n++;
> +        if ( n > KEXEC_SEGMENT_MAX )
> +            return -EINVAL;
> +        *nr_segments = n;
> +
> +        set_xen_guest_handle(segments[n-1].buf.h, NULL);
> +        segments[n-1].buf_size = 0;
> +        segments[n-1].dest_maddr = maddr;
> +        segments[n-1].dest_size = 0;
>       }
>   
> -    /* Unload the old image if present and load successful */
> -    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    return 0;
> +}
> +
> +static int kexec_segments_from_ind_page(unsigned long mfn,
> +                                        unsigned int *nr_segments,
> +                                        xen_kexec_segment_t *segments,
> +                                        bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +
> +    page = map_domain_page(mfn);
> +
> +    /*
> +     * Walk the indirection page list, adding destination pages to the
> +     * segments.
> +     */
> +    for ( entry = page; ; )
>       {
> -        if ( test_and_clear_bit((base + pos), &kexec_flags) )
> +        unsigned long ind;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
>           {
> -            image = &kexec_image[base + pos];
> -            machine_kexec_unload(load->type, base + pos, image);
> +        case IND_DESTINATION:
> +            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            entry = page = map_domain_page(mfn);
> +            continue;
> +        case IND_DONE:
> +            goto done;
> +        case IND_SOURCE:
> +            if ( *nr_segments == 0 )
> +            {
> +                ret = -EINVAL;
> +                goto done;
> +            }
> +            segments[*nr_segments-1].dest_size += PAGE_SIZE;
> +            break;
> +        default:
> +            ret = -EINVAL;
> +            goto done;
>           }
> +        entry = kimage_entry_next(entry, compat);
>       }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
>   
> +static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
> +{
> +    struct kexec_image *kimage = NULL;
> +    xen_kexec_segment_t *segments;
> +    uint16_t arch;
> +    unsigned int nr_segments = 0;
> +    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
> +    int ret;
> +
> +    arch = kexec_load_v1_arch();
> +    if ( arch == EM_NONE )
> +        return -ENOSYS;
> +
> +    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
> +    if ( segments == NULL )
> +        return -ENOMEM;
> +
> +    /*
> +     * Work out the image segments (destination only) from the
> +     * indirection pages.
> +     *
> +     * This is needed so we don't allocate pages that will overlap
> +     * with the destination when building the new set of indirection
> +     * pages below.
> +     */
> +    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
> +                       nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    /*
> +     * Build a new set of indirection pages in the native format.
> +     *
> +     * This walks the guest provided indirection pages a second time.
> +     * The guest could have altered them, invalidating the segment
> +     * information constructed above.  This will only result in the
> +     * resulting image being potentially unrelocatable.
> +     */
> +    ret = kimage_build_ind(kimage, ind_mfn, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( !kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
>       return ret;
>   }
>   
> -static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>       xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&load, uarg, 1)) )
>           return -EFAULT;
>   
> -    return kexec_load_unload_internal(op, &load);
> +    return kexec_do_load_v1(&load, 0);
>   }
>   
> -static int kexec_load_unload_compat(unsigned long op,
> -                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>   #ifdef CONFIG_COMPAT
>       compat_kexec_load_v1_t compat_load;
> @@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
>       load.type = compat_load.type;
>       XLAT_kexec_image(&load.image, &compat_load.image);
>   
> -    return kexec_load_unload_internal(op, &load);
> -#else /* CONFIG_COMPAT */
> +    return kexec_do_load_v1(&load, 1);
> +#else
>       return 0;
> -#endif /* CONFIG_COMPAT */
> +#endif
>   }
>   
> -static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
> -    xen_kexec_exec_t exec;
> -    xen_kexec_image_t *image;
> -    int base, bit, pos, ret = -EINVAL;
> +    xen_kexec_load_t load;
> +    xen_kexec_segment_t *segments;
> +    struct kexec_image *kimage = NULL;
> +    int ret;
>   
> -    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +    if ( copy_from_guest(&load, uarg, 1) )
>           return -EFAULT;
>   
> -    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
>           return -EINVAL;
>   
> -    pos = (test_bit(bit, &kexec_flags) != 0);
> -
> -    /* Only allow kexec/kdump into loaded images */
> -    if ( !test_bit(base + pos, &kexec_flags) )
> -        return -ENOENT;
> +    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
> +    if ( segments == NULL )
> +        return -ENOMEM;
>   
> -    switch (exec.type)
> +    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
>       {
> -    case KEXEC_TYPE_DEFAULT:
> -        image = &kexec_image[base + pos];
> -        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> -        break;
> -    case KEXEC_TYPE_CRASH:
> -        kexec_crash(); /* Does not return */
> -        break;
> +        ret = -EFAULT;
> +        goto error;
>       }
>   
> -    return -EINVAL; /* never reached */
> +    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
> +                       load.nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_load_segments(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( ! kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
> +    return ret;
> +}
> +
> +static int kexec_do_unload(xen_kexec_unload_t *unload)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret;
> +
> +    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_load_v1_t load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = load.type;
> +    return kexec_do_unload(&unload);
> +}
> +
> +static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +#ifdef CONFIG_COMPAT
> +    compat_kexec_load_v1_t compat_load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&compat_load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = compat_load.type;
> +    return kexec_do_unload(&unload);
> +#else
> +    return 0;
> +#endif
> +}
> +
> +static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_unload_t unload;
> +
> +    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
> +        return -EFAULT;
> +
> +    return kexec_do_unload(&unload);
>   }
>   
>   static int do_kexec_op_internal(unsigned long op,
>                                   XEN_GUEST_HANDLE_PARAM(void) uarg,
>                                   bool_t compat)
>   {
> -    unsigned long flags;
>       int ret = -EINVAL;
>   
>       ret = xsm_kexec(XSM_PRIV);
> @@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
>                   ret = kexec_get_range(uarg);
>           break;
>       case KEXEC_CMD_kexec_load_v1:
> +        if ( compat )
> +            ret = kexec_load_v1_compat(uarg);
> +        else
> +            ret = kexec_load_v1(uarg);
> +        break;
>       case KEXEC_CMD_kexec_unload_v1:
> -        spin_lock_irqsave(&kexec_lock, flags);
> -        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
> -        {
> -                if (compat)
> -                        ret = kexec_load_unload_compat(op, uarg);
> -                else
> -                        ret = kexec_load_unload(op, uarg);
> -        }
> -        spin_unlock_irqrestore(&kexec_lock, flags);
> +        if ( compat )
> +            ret = kexec_unload_v1_compat(uarg);
> +        else
> +            ret = kexec_unload_v1(uarg);
>           break;
>       case KEXEC_CMD_kexec:
>           ret = kexec_exec(uarg);
>           break;
> +    case KEXEC_CMD_kexec_load:
> +        ret = kexec_load(uarg);
> +        break;
> +    case KEXEC_CMD_kexec_unload:
> +        ret = kexec_unload(uarg);
> +        break;
>       }
>   
>       return ret;
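The slot handling in kexec_swap_images() is a double-buffering scheme. Each image type owns two slots and a toggle bit records which one is active. A load fills the inactive slot and then flips the toggle, and the previously active image is handed back so it can be freed after the lock is dropped. Below is a minimal single-threaded sketch. The flag layout, `struct image` and the helper names are assumptions for illustration. The real code also takes kexec_lock and refuses to swap while KEXEC_FLAG_IN_PROGRESS is set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical flag layout modelled on kexec_swap_images(): two slots
 * per image type, one "present" bit per slot (bits 0-3) and one
 * "active slot" toggle bit per type (bits 4-5).
 */
enum { TYPE_DEFAULT = 0, TYPE_CRASH = 1, NR_TYPES = 2 };
enum { NR_SLOTS = 2 * NR_TYPES, FLAG_TOGGLE_BASE = NR_SLOTS };

struct image { int id; };

static unsigned long flags;
static struct image *slots[NR_SLOTS];

static int  test_bit(int nr)   { return (int)((flags >> nr) & 1UL); }
static void set_bit(int nr)    { flags |= 1UL << nr; }
static void clear_bit(int nr)  { flags &= ~(1UL << nr); }
static void change_bit(int nr) { flags ^= 1UL << nr; }

/*
 * Install 'new_img' into the inactive slot of 'type' (or install
 * nothing, to unload), make that slot active and return the previously
 * active image so the caller can free it.  No locking in this sketch.
 */
static struct image *swap_images(int type, struct image *new_img)
{
    int base = 2 * type;
    int bit = FLAG_TOGGLE_BASE + type;
    int pos = test_bit(bit);
    int old_slot = base + pos, new_slot = base + !pos;
    struct image *old;

    if ( new_img )
    {
        slots[new_slot] = new_img;
        set_bit(new_slot);
    }
    change_bit(bit);              /* the other slot becomes active */

    clear_bit(old_slot);
    old = slots[old_slot];
    slots[old_slot] = NULL;       /* vacated: a repeated unload yields NULL */
    return old;
}

/* The image kexec would use for 'type', or NULL if none is loaded. */
static struct image *active_image(int type)
{
    int slot = 2 * type + test_bit(FLAG_TOGGLE_BASE + type);

    return test_bit(slot) ? slots[slot] : NULL;
}
```

Unlike the quoted code, the sketch also clears the vacated slot pointer, so unloading twice returns NULL the second time instead of an image that was already handed back.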
> diff --git a/xen/common/kimage.c b/xen/common/kimage.c
> index 02ee37e..10fb785 100644
> --- a/xen/common/kimage.c
> +++ b/xen/common/kimage.c
> @@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
>       image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
>       if ( !image->control_code_page )
>           goto out;
> +    result = machine_kexec_add_page(image,
> +                                    page_to_maddr(image->control_code_page),
> +                                    page_to_maddr(image->control_code_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       /* Add an empty indirection page. */
>       image->entry_page = kimage_alloc_control_page(image, 0);
>       if ( !image->entry_page )
>           goto out;
> +    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
> +                                    page_to_maddr(image->entry_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       image->head = page_to_maddr(image->entry_page);
>   
> @@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>           if ( addr == destination )
>           {
>               page_list_del(page, &image->dest_pages);
> -            return page;
> +            goto found;
>           }
>       }
>       page = NULL;
> @@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>               page_list_add(page, &image->dest_pages);
>           }
>       }
> +found:
> +    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
>       return page;
>   }
>   
> @@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
>   static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
>   {
>       int result = -ENOMEM;
> +    paddr_t addr;
>   
>       if ( !guest_handle_is_null(segment->buf.h) )
>       {
> @@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
>           }
>       }
>   
> +    for ( addr = segment->dest_maddr & PAGE_MASK;
> +          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
> +    {
> +        result = machine_kexec_add_page(image, addr, addr);
> +        if ( result < 0 )
> +            break;
> +    }
> +
>       return result;
>   }
>   
> @@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
>       return 0;
>   }
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return (kimage_entry_t *)((uint32_t *)entry + 1);
> +    return entry + 1;
> +}
> +
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry >> PAGE_SHIFT;
> +    return *entry >> PAGE_SHIFT;
> +}
> +
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry & 0xf;
> +    return *entry & 0xf;
> +}
> +
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +    paddr_t dest = KIMAGE_NO_DEST;
> +
> +    page = map_domain_page(ind_mfn);
> +    if ( !page )
> +        return -ENOMEM;
> +
> +    /*
> +     * Walk the guest-supplied indirection pages, adding entries to
> +     * the image's indirection pages.
> +     */
> +    for ( entry = page; ;  )
> +    {
> +        unsigned long ind;
> +        unsigned long mfn;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
> +        {
> +        case IND_DESTINATION:
> +            dest = (paddr_t)mfn << PAGE_SHIFT;
> +            ret = kimage_set_destination(image, dest);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            page = map_domain_page(mfn);
> +            entry = page;
> +            continue;
> +        case IND_DONE:
> +            kimage_terminate(image);
> +            goto done;
> +        case IND_SOURCE:
> +        {
> +            struct page_info *guest_page, *xen_page;
> +
> +            guest_page = mfn_to_page(mfn);
> +            if ( !get_page(guest_page, current->domain) )
> +            {
> +                ret = -EFAULT;
> +                goto done;
> +            }
> +
> +            xen_page = kimage_alloc_page(image, dest);
> +            if ( !xen_page )
> +            {
> +                put_page(guest_page);
> +                ret = -ENOMEM;
> +                goto done;
> +            }
> +
> +            copy_domain_page(page_to_mfn(xen_page), mfn);
> +            put_page(guest_page);
> +
> +            ret = kimage_add_page(image, page_to_maddr(xen_page));
> +            if ( ret < 0 )
> +                goto done;
> +            dest += PAGE_SIZE;
> +            break;
> +        }
> +        default:
> +            ret = -EINVAL;
> +            goto done;
> +        }
> +        entry = kimage_entry_next(entry, compat);
> +    }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
> +
>   /*
>    * Local variables:
>    * mode: C
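kexec_do_load_v1() walks the guest's indirection pages twice. First, kexec_segments_from_ind_page() derives destination-only segments, so that pages allocated while building the new indirection pages do not overlap the destination. Then kimage_build_ind() above copies the source pages. The first walk can be sketched over an already-decoded, single-page entry list as shown below. Chaining through IND_INDIRECTION is not modelled. MAX_SEGMENTS, `struct segment` and segments_from_entries() are hypothetical stand-ins for KEXEC_SEGMENT_MAX, xen_kexec_segment_t and the real function. The IND_* values are assumed to follow Linux's kexec ABI.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define IND_DESTINATION 0x1
#define IND_DONE        0x4
#define IND_SOURCE      0x8
#define MAX_SEGMENTS    16   /* hypothetical stand-in for KEXEC_SEGMENT_MAX */

struct segment { uint64_t dest_maddr, dest_size; };

/*
 * Coalesce destination pages into segments.  Each entry is a
 * page-aligned machine address ORed with an IND_* flag.
 * Returns 0 and sets *nr on IND_DONE, or -1 on a malformed list.
 */
static int segments_from_entries(const uint64_t *entries,
                                 struct segment *segs, unsigned int *nr)
{
    unsigned int n = 0;

    for ( ; ; entries++ )
    {
        uint64_t maddr = *entries & ~(uint64_t)(PAGE_SIZE - 1);

        switch ( *entries & 0xf )
        {
        case IND_DESTINATION:
            /* Start a new segment unless this continues the last one. */
            if ( n == 0 ||
                 segs[n-1].dest_maddr + segs[n-1].dest_size != maddr )
            {
                if ( n == MAX_SEGMENTS )
                    return -1;
                segs[n].dest_maddr = maddr;
                segs[n].dest_size = 0;
                n++;
            }
            break;
        case IND_SOURCE:
            if ( n == 0 )
                return -1;      /* source page with no destination */
            segs[n-1].dest_size += PAGE_SIZE;
            break;
        case IND_DONE:
            *nr = n;
            return 0;
        default:
            return -1;
        }
    }
}
```

Only the destinations matter here. Each source entry simply extends the current segment by one page, mirroring how the relocation step advances the destination after every copied page.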
> diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
> index 8b4266d..48c5676 100644
> --- a/xen/include/asm-x86/fixmap.h
> +++ b/xen/include/asm-x86/fixmap.h
> @@ -56,9 +56,6 @@ enum fixed_addresses {
>       FIX_ACPI_BEGIN,
>       FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
>       FIX_HPET_BASE,
> -    FIX_KEXEC_BASE_0,
> -    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
> -      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
>       FIX_TBOOT_SHARED_BASE,
>       FIX_MSIX_IO_RESERV_BASE,
>       FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
> diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
> new file mode 100644
> index 0000000..ba0d469
> --- /dev/null
> +++ b/xen/include/asm-x86/machine_kexec.h
> @@ -0,0 +1,16 @@
> +#ifndef __X86_MACHINE_KEXEC_H__
> +#define __X86_MACHINE_KEXEC_H__
> +
> +#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
> +
> +#ifndef __ASSEMBLY__
> +
> +extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
> +                        unsigned long ind_maddr, unsigned long entry_maddr,
> +                        unsigned long flags);
> +
> +extern unsigned int kexec_reloc_size;
> +
> +#endif
> +
> +#endif /* __X86_MACHINE_KEXEC_H__ */
> diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
> index 1a5dda1..bd17747 100644
> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -6,6 +6,7 @@
>   #include <public/kexec.h>
>   #include <asm/percpu.h>
>   #include <xen/elfcore.h>
> +#include <xen/kimage.h>
>   
>   typedef struct xen_kexec_reserve {
>       unsigned long size;
> @@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
>   extern paddr_t crashinfo_maxaddr_bits;
>   void kexec_early_calculations(void);
>   
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr);
> +int machine_kexec_load(struct kexec_image *image);
> +void machine_kexec_unload(struct kexec_image *image);
>   void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
> -void machine_reboot_kexec(xen_kexec_image_t *image);
> -void machine_kexec(xen_kexec_image_t *image);
> +void machine_reboot_kexec(struct kexec_image *image);
> +void machine_kexec(struct kexec_image *image);
>   void kexec_crash(void);
>   void kexec_crash_save_cpu(void);
>   crash_xen_info_t *kexec_crash_save_info(void);
> @@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
>   int machine_kexec_get(xen_kexec_range_t *range);
>   int machine_kexec_get_xen(xen_kexec_range_t *range);
>   
> -void compat_machine_kexec(unsigned long rnk,
> -                          unsigned long indirection_page,
> -                          unsigned long *page_list,
> -                          unsigned long start_address);
> -
>   /* vmcoreinfo stuff */
>   #define VMCOREINFO_BYTES           (4096)
>   #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
> diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
> index 0ebd37a..d10ebf7 100644
> --- a/xen/include/xen/kimage.h
> +++ b/xen/include/xen/kimage.h
> @@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
>   struct page_info *kimage_alloc_control_page(struct kexec_image *image,
>                                               unsigned memflags);
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat);
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* __XEN_KIMAGE_H__ */
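The kimage_entry_next(), kimage_entry_mfn() and kimage_entry_ind() helpers declared above let one walker handle both guest ABIs. A 32-bit guest writes 4-byte entries and a 64-bit guest writes 8-byte entries. Each entry holds a page-aligned machine address with IND_* flags in its low four bits. The sketch below reads entries with memcpy() instead of pointer casts. entry_at(), entry_ind() and entry_mfn() are hypothetical names. The IND_* values are those of Linux's kexec ABI and are assumed here to match Xen's kimage.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
/* Flag values as in Linux's kexec ABI (assumed to match Xen's kimage.h). */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

/* Read entry 'i' from a raw indirection page: 4-byte entries for a
 * 32-bit (compat) guest, 8-byte entries otherwise. */
static uint64_t entry_at(const unsigned char *page, unsigned int i, int compat)
{
    if ( compat )
    {
        uint32_t e;

        memcpy(&e, page + (size_t)i * sizeof(e), sizeof(e));
        return e;
    }
    else
    {
        uint64_t e;

        memcpy(&e, page + (size_t)i * sizeof(e), sizeof(e));
        return e;
    }
}

/* Low four bits: the IND_* flag. */
static unsigned int entry_ind(uint64_t e) { return (unsigned int)(e & 0xf); }

/* Remaining bits: the machine frame number. */
static uint64_t entry_mfn(uint64_t e) { return e >> PAGE_SHIFT; }
```

The entry width is the only ABI difference. Once an entry is widened to 64 bits, the flag and frame extraction is identical for both guest types.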

* Re: [Xen-devel] [PATCH 4/9] kexec: extend hypercall with improved load/unload ops
@ 2013-11-07 20:56     ` Don Slutz
From: Don Slutz @ 2013-11-07 20:56 UTC
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

For what it is worth.

Reviewed-by: Don Slutz <dslutz@verizon.com>
     -Don Slutz

On 11/06/13 09:49, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> In the existing kexec hypercall, the load and unload ops depend on
> internals of the Linux kernel (the page list and code page provided by
> the kernel).  The code page is used to transition between Xen context
> and the image, so using kernel code doesn't make sense and will not
> work for PVH guests.
>
> Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
> that no longer require a code page to be provided by the guest -- Xen
> now provides the code for calling the image directly.
>
> The new load op looks similar to the Linux kexec_load system call and
> allows the guest to provide the image data to be loaded.  The guest
> specifies the architecture of the image, which may be a 32-bit subarch
> of the hypervisor's architecture (i.e., an EM_386 image on an
> EM_X86_64 hypervisor).
>
> The toolstack can now load images without kernel involvement.  This is
> required for supporting kexec when using a dom0 with an upstream
> kernel.
>
> Crash images are copied directly into the crash region on load.
> Default images are copied into domheap pages and a list of source and
> destination machine addresses is created.  This list is used by
> kexec_reloc() to relocate the image to its destination.
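
The source/destination list drives the relocation step. Conceptually, the walk over it works as follows. An IND_DESTINATION entry sets the next destination page. Each IND_SOURCE entry copies one page there and advances the destination. IND_INDIRECTION continues in another list page, and IND_DONE ends the walk. The C sketch below is a model over a toy memory array with hypothetical names (mem, relocate(), read_entry(), write_entry()). It is not the assembly in kexec_reloc.S, and the IND_* values are assumed to follow Linux's kexec ABI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NR_PAGES   16      /* size of the toy "machine memory" */
#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(uint64_t))

/* IND_* flag values assumed to follow Linux's kexec ABI. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

/* Toy machine memory: frame 'mfn' lives at mem[mfn]. */
static unsigned char mem[NR_PAGES][PAGE_SIZE];

static uint64_t read_entry(uint64_t mfn, unsigned int idx)
{
    uint64_t e;

    memcpy(&e, mem[mfn] + idx * sizeof(e), sizeof(e));
    return e;
}

static void write_entry(uint64_t mfn, unsigned int idx, uint64_t e)
{
    memcpy(mem[mfn] + idx * sizeof(e), &e, sizeof(e));
}

/* Walk the list starting at list page 'head_mfn', copying every source
 * page to its destination.  Returns 0 on IND_DONE, -1 on a bad entry. */
static int relocate(uint64_t head_mfn)
{
    uint64_t page = head_mfn, dest = NR_PAGES;   /* no destination yet */
    unsigned int idx = 0;

    if ( head_mfn >= NR_PAGES )
        return -1;

    while ( idx < ENTRIES_PER_PAGE )
    {
        uint64_t e = read_entry(page, idx++);
        uint64_t mfn = e >> PAGE_SHIFT;

        if ( mfn >= NR_PAGES )
            return -1;

        switch ( e & 0xf )
        {
        case IND_DESTINATION:
            dest = mfn;
            break;
        case IND_INDIRECTION:
            page = mfn;            /* continue in another list page */
            idx = 0;
            break;
        case IND_SOURCE:
            if ( dest >= NR_PAGES )
                return -1;         /* no (or out-of-range) destination */
            memcpy(mem[dest], mem[mfn], PAGE_SIZE);
            dest++;                /* consecutive sources, consecutive pages */
            break;
        case IND_DONE:
            return 0;
        default:
            return -1;
        }
    }
    return -1;                     /* list page not terminated */
}
```

The toy model assumes that list pages, source pages and destinations never overlap. Avoiding such overlaps is the job of the allocator when the list is built.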
>
> The old load and unload sub-ops are still available (as
> KEXEC_CMD_kexec_load_v1 and KEXEC_CMD_kexec_unload_v1) and are
> implemented on top of the new infrastructure.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/arch/x86/machine_kexec.c        |  192 +++++++++++------
>   xen/arch/x86/x86_64/Makefile        |    2 +-
>   xen/arch/x86/x86_64/compat_kexec.S  |  187 ----------------
>   xen/arch/x86/x86_64/kexec_reloc.S   |  198 +++++++++++++++++
>   xen/common/kexec.c                  |  398 +++++++++++++++++++++++++++++------
>   xen/common/kimage.c                 |  122 +++++++++++-
>   xen/include/asm-x86/fixmap.h        |    3 -
>   xen/include/asm-x86/machine_kexec.h |   16 ++
>   xen/include/xen/kexec.h             |   16 +-
>   xen/include/xen/kimage.h            |    6 +
>   10 files changed, 804 insertions(+), 336 deletions(-)
>   delete mode 100644 xen/arch/x86/x86_64/compat_kexec.S
>   create mode 100644 xen/arch/x86/x86_64/kexec_reloc.S
>   create mode 100644 xen/include/asm-x86/machine_kexec.h
>
> diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
> index 68b9705..b70d5a6 100644
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -1,9 +1,18 @@
>   /******************************************************************************
>    * machine_kexec.c
>    *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/machine_kexec_64.c.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
>    * Xen port written by:
>    * - Simon 'Horms' Horman <horms@verge.net.au>
>    * - Magnus Damm <magnus@valinux.co.jp>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
>    */
>   
>   #include <xen/types.h>
> @@ -11,63 +20,124 @@
>   #include <xen/guest_access.h>
>   #include <asm/fixmap.h>
>   #include <asm/hpet.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
>   
> -typedef void (*relocate_new_kernel_t)(
> -                unsigned long indirection_page,
> -                unsigned long *page_list,
> -                unsigned long start_address,
> -                unsigned int preserve_context);
> -
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image)
> +/*
> + * Add a mapping for a page to the page tables used during kexec.
> + */
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr)
>   {
> -    unsigned long prev_ma = 0;
> -    int fix_base = FIX_KEXEC_BASE_0 + (slot * (KEXEC_XEN_NO_PAGES >> 1));
> -    int k;
> +    struct page_info *l4_page;
> +    struct page_info *l3_page;
> +    struct page_info *l2_page;
> +    struct page_info *l1_page;
> +    l4_pgentry_t *l4 = NULL;
> +    l3_pgentry_t *l3 = NULL;
> +    l2_pgentry_t *l2 = NULL;
> +    l1_pgentry_t *l1 = NULL;
> +    int ret = -ENOMEM;
> +
> +    l4_page = image->aux_page;
> +    if ( !l4_page )
> +    {
> +        l4_page = kimage_alloc_control_page(image, 0);
> +        if ( !l4_page )
> +            goto out;
> +        image->aux_page = l4_page;
> +    }
>   
> -    /* setup fixmap to point to our pages and record the virtual address
> -     * in every odd index in page_list[].
> -     */
> +    l4 = __map_domain_page(l4_page);
> +    l4 += l4_table_offset(vaddr);
> +    if ( !(l4e_get_flags(*l4) & _PAGE_PRESENT) )
> +    {
> +        l3_page = kimage_alloc_control_page(image, 0);
> +        if ( !l3_page )
> +            goto out;
> +        l4e_write(l4, l4e_from_page(l3_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l3_page = l4e_get_page(*l4);
> +
> +    l3 = __map_domain_page(l3_page);
> +    l3 += l3_table_offset(vaddr);
> +    if ( !(l3e_get_flags(*l3) & _PAGE_PRESENT) )
> +    {
> +        l2_page = kimage_alloc_control_page(image, 0);
> +        if ( !l2_page )
> +            goto out;
> +        l3e_write(l3, l3e_from_page(l2_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l2_page = l3e_get_page(*l3);
> +
> +    l2 = __map_domain_page(l2_page);
> +    l2 += l2_table_offset(vaddr);
> +    if ( !(l2e_get_flags(*l2) & _PAGE_PRESENT) )
> +    {
> +        l1_page = kimage_alloc_control_page(image, 0);
> +        if ( !l1_page )
> +            goto out;
> +        l2e_write(l2, l2e_from_page(l1_page, __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l1_page = l2e_get_page(*l2);
> +
> +    l1 = __map_domain_page(l1_page);
> +    l1 += l1_table_offset(vaddr);
> +    l1e_write(l1, l1e_from_pfn(maddr >> PAGE_SHIFT, __PAGE_HYPERVISOR));
> +
> +    ret = 0;
> +out:
> +    if ( l1 )
> +        unmap_domain_page(l1);
> +    if ( l2 )
> +        unmap_domain_page(l2);
> +    if ( l3 )
> +        unmap_domain_page(l3);
> +    if ( l4 )
> +        unmap_domain_page(l4);
> +    return ret;
> +}
>   
> -    for ( k = 0; k < KEXEC_XEN_NO_PAGES; k++ )
> +int machine_kexec_load(struct kexec_image *image)
> +{
> +    void *code_page;
> +    int ret;
> +
> +    switch ( image->arch )
>       {
> -        if ( (k & 1) == 0 )
> -        {
> -            /* Even pages: machine address. */
> -            prev_ma = image->page_list[k];
> -        }
> -        else
> -        {
> -            /* Odd pages: va for previous ma. */
> -            if ( is_pv_32on64_domain(dom0) )
> -            {
> -                /*
> -                 * The compatability bounce code sets up a page table
> -                 * with a 1-1 mapping of the first 1G of memory so
> -                 * VA==PA here.
> -                 *
> -                 * This Linux purgatory code still sets up separate
> -                 * high and low mappings on the control page (entries
> -                 * 0 and 1) but it is harmless if they are equal since
> -                 * that PT is not live at the time.
> -                 */
> -                image->page_list[k] = prev_ma;
> -            }
> -            else
> -            {
> -                set_fixmap(fix_base + (k >> 1), prev_ma);
> -                image->page_list[k] = fix_to_virt(fix_base + (k >> 1));
> -            }
> -        }
> +    case EM_386:
> +    case EM_X86_64:
> +        break;
> +    default:
> +        return -EINVAL;
>       }
>   
> +    code_page = __map_domain_page(image->control_code_page);
> +    memcpy(code_page, kexec_reloc, kexec_reloc_size);
> +    unmap_domain_page(code_page);
> +
> +    /*
> +     * Add a mapping for the control code page to the same virtual
> +     * address as kexec_reloc.  This allows us to keep running after
> +     * these page tables are loaded in kexec_reloc.
> +     */
> +    ret = machine_kexec_add_page(image, (unsigned long)kexec_reloc,
> +                                 page_to_maddr(image->control_code_page));
> +    if ( ret < 0 )
> +        return ret;
> +
>       return 0;
>   }
>   
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image)
> +void machine_kexec_unload(struct kexec_image *image)
>   {
> +    /* no-op. kimage_free() frees all control pages. */
>   }
>   
> -void machine_reboot_kexec(xen_kexec_image_t *image)
> +void machine_reboot_kexec(struct kexec_image *image)
>   {
>       BUG_ON(smp_processor_id() != 0);
>       smp_send_stop();
> @@ -75,13 +145,10 @@ void machine_reboot_kexec(xen_kexec_image_t *image)
>       BUG();
>   }
>   
> -void machine_kexec(xen_kexec_image_t *image)
> +void machine_kexec(struct kexec_image *image)
>   {
> -    struct desc_ptr gdt_desc = {
> -        .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
> -        .limit = LAST_RESERVED_GDT_BYTE
> -    };
>       int i;
> +    unsigned long reloc_flags = 0;
>   
>       /* We are about to permenantly jump out of the Xen context into the kexec
>        * purgatory code.  We really dont want to be still servicing interupts.
> @@ -109,29 +176,12 @@ void machine_kexec(xen_kexec_image_t *image)
>        * not like running with NMIs disabled. */
>       enable_nmis();
>   
> -    /*
> -     * compat_machine_kexec() returns to idle pagetables, which requires us
> -     * to be running on a static GDT mapping (idle pagetables have no GDT
> -     * mappings in their per-domain mapping area).
> -     */
> -    asm volatile ( "lgdt %0" : : "m" (gdt_desc) );
> +    if ( image->arch == EM_386 )
> +        reloc_flags |= KEXEC_RELOC_FLAG_COMPAT;
>   
> -    if ( is_pv_32on64_domain(dom0) )
> -    {
> -        compat_machine_kexec(image->page_list[1],
> -                             image->indirection_page,
> -                             image->page_list,
> -                             image->start_address);
> -    }
> -    else
> -    {
> -        relocate_new_kernel_t rnk;
> -
> -        rnk = (relocate_new_kernel_t) image->page_list[1];
> -        (*rnk)(image->indirection_page, image->page_list,
> -               image->start_address,
> -               0 /* preserve_context */);
> -    }
> +    kexec_reloc(page_to_maddr(image->control_code_page),
> +                page_to_maddr(image->aux_page),
> +                image->head, image->entry_maddr, reloc_flags);
>   }
>   
>   int machine_kexec_get(xen_kexec_range_t *range)
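machine_kexec_add_page() above fills a private four-level page table one page at a time, allocating any missing L3, L2 or L1 table as it descends. The C sketch below models that descent with heap-allocated toy tables. The index arithmetic (9 bits per level, shifts of 39, 30, 21 and 12 for 4 KiB pages) is standard x86-64 paging. `struct table`, add_page() and lookup() are hypothetical names. The toy tables are never freed; in the patch, kimage_free() releases the control pages instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define ENTRIES    512          /* 9 bits of index per level */

/* Index into the level-'lvl' table (4 = top) for virtual address 'va'. */
static unsigned int table_offset(uint64_t va, int lvl)
{
    return (unsigned int)((va >> (PAGE_SHIFT + 9 * (lvl - 1))) & (ENTRIES - 1));
}

/* Toy table: levels 4-2 hold child pointers, level 1 holds frame numbers. */
struct table {
    union { struct table *child; uint64_t mfn; } e[ENTRIES];
    unsigned char present[ENTRIES];
};

/* Map 'va' to 'maddr', allocating missing intermediate tables on demand. */
static int add_page(struct table *l4, uint64_t va, uint64_t maddr)
{
    struct table *t = l4;
    int lvl;

    for ( lvl = 4; lvl > 1; lvl-- )
    {
        unsigned int i = table_offset(va, lvl);

        if ( !t->present[i] )
        {
            t->e[i].child = calloc(1, sizeof(struct table));
            if ( !t->e[i].child )
                return -1;
            t->present[i] = 1;
        }
        t = t->e[i].child;
    }
    t->e[table_offset(va, 1)].mfn = maddr >> PAGE_SHIFT;
    t->present[table_offset(va, 1)] = 1;
    return 0;
}

/* Translate 'va'; returns 0 if unmapped (so frame 0 cannot be told apart). */
static uint64_t lookup(const struct table *l4, uint64_t va)
{
    const struct table *t = l4;
    int lvl;

    for ( lvl = 4; lvl > 1; lvl-- )
    {
        unsigned int i = table_offset(va, lvl);

        if ( !t->present[i] )
            return 0;
        t = t->e[i].child;
    }
    if ( !t->present[table_offset(va, 1)] )
        return 0;
    return (t->e[table_offset(va, 1)].mfn << PAGE_SHIFT) |
           (va & ((1u << PAGE_SHIFT) - 1));
}
```

Because intermediate tables are shared, mapping many pages in the same 2 MiB region costs only one extra L1 table. This is why the patch can add mappings page by page (segments, control pages and the relocation code) without pre-sizing the table.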
> diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
> index d56e12d..7f8fb3d 100644
> --- a/xen/arch/x86/x86_64/Makefile
> +++ b/xen/arch/x86/x86_64/Makefile
> @@ -11,11 +11,11 @@ obj-y += mmconf-fam10h.o
>   obj-y += mmconfig_64.o
>   obj-y += mmconfig-shared.o
>   obj-y += compat.o
> -obj-bin-y += compat_kexec.o
>   obj-y += domain.o
>   obj-y += physdev.o
>   obj-y += platform_hypercall.o
>   obj-y += cpu_idle.o
>   obj-y += cpufreq.o
> +obj-bin-y += kexec_reloc.o
>   
>   obj-$(crash_debug)   += gdbstub.o
> diff --git a/xen/arch/x86/x86_64/compat_kexec.S b/xen/arch/x86/x86_64/compat_kexec.S
> deleted file mode 100644
> index fc92af9..0000000
> --- a/xen/arch/x86/x86_64/compat_kexec.S
> +++ /dev/null
> @@ -1,187 +0,0 @@
> -/*
> - * Compatibility kexec handler.
> - */
> -
> -/*
> - * NOTE: We rely on Xen not relocating itself above the 4G boundary. This is
> - * currently true but if it ever changes then compat_pg_table will
> - * need to be moved back below 4G at run time.
> - */
> -
> -#include <xen/config.h>
> -
> -#include <asm/asm_defns.h>
> -#include <asm/msr.h>
> -#include <asm/page.h>
> -
> -/* The unrelocated physical address of a symbol. */
> -#define SYM_PHYS(sym)          ((sym) - __XEN_VIRT_START)
> -
> -/* Load physical address of symbol into register and relocate it. */
> -#define RELOCATE_SYM(sym,reg)  mov $SYM_PHYS(sym), reg ; \
> -                               add xen_phys_start(%rip), reg
> -
> -/*
> - * Relocate a physical address in memory. Size of temporary register
> - * determines size of the value to relocate.
> - */
> -#define RELOCATE_MEM(addr,reg) mov addr(%rip), reg ; \
> -                               add xen_phys_start(%rip), reg ; \
> -                               mov reg, addr(%rip)
> -
> -        .text
> -
> -        .code64
> -
> -ENTRY(compat_machine_kexec)
> -        /* x86/64                        x86/32  */
> -        /* %rdi - relocate_new_kernel_t  CALL    */
> -        /* %rsi - indirection page       4(%esp) */
> -        /* %rdx - page_list              8(%esp) */
> -        /* %rcx - start address         12(%esp) */
> -        /*        cpu has pae           16(%esp) */
> -
> -        /* Shim the 64 bit page_list into a 32 bit page_list. */
> -        mov $12,%r9
> -        lea compat_page_list(%rip), %rbx
> -1:      dec %r9
> -        movl (%rdx,%r9,8),%eax
> -        movl %eax,(%rbx,%r9,4)
> -        test %r9,%r9
> -        jnz 1b
> -
> -        RELOCATE_SYM(compat_page_list,%rdx)
> -
> -        /* Relocate compatibility mode entry point address. */
> -        RELOCATE_MEM(compatibility_mode_far,%eax)
> -
> -        /* Relocate compat_pg_table. */
> -        RELOCATE_MEM(compat_pg_table,     %rax)
> -        RELOCATE_MEM(compat_pg_table+0x8, %rax)
> -        RELOCATE_MEM(compat_pg_table+0x10,%rax)
> -        RELOCATE_MEM(compat_pg_table+0x18,%rax)
> -
> -        /*
> -         * Setup an identity mapped region in PML4[0] of idle page
> -         * table.
> -         */
> -        RELOCATE_SYM(l3_identmap,%rax)
> -        or  $0x63,%rax
> -        mov %rax, idle_pg_table(%rip)
> -
> -        /* Switch to idle page table. */
> -        RELOCATE_SYM(idle_pg_table,%rax)
> -        movq %rax, %cr3
> -
> -        /* Switch to identity mapped compatibility stack. */
> -        RELOCATE_SYM(compat_stack,%rax)
> -        movq %rax, %rsp
> -
> -        /* Save xen_phys_start for 32 bit code. */
> -        movq xen_phys_start(%rip), %rbx
> -
> -        /* Jump to low identity mapping in compatibility mode. */
> -        ljmp *compatibility_mode_far(%rip)
> -        ud2
> -
> -compatibility_mode_far:
> -        .long SYM_PHYS(compatibility_mode)
> -        .long __HYPERVISOR_CS32
> -
> -        /*
> -         * We use 5 words of stack for the arguments passed to the kernel. The
> -         * kernel only uses 1 word before switching to its own stack. Allocate
> -         * 16 words to give "plenty" of room.
> -         */
> -        .fill 16,4,0
> -compat_stack:
> -
> -        .code32
> -
> -#undef RELOCATE_SYM
> -#undef RELOCATE_MEM
> -
> -/*
> - * Load physical address of symbol into register and relocate it. %rbx
> - * contains xen_phys_start(%rip) saved before jump to compatibility
> - * mode.
> - */
> -#define RELOCATE_SYM(sym,reg) mov $SYM_PHYS(sym), reg ; \
> -                              add %ebx, reg
> -
> -compatibility_mode:
> -        /* Setup some sane segments. */
> -        movl $__HYPERVISOR_DS32, %eax
> -        movl %eax, %ds
> -        movl %eax, %es
> -        movl %eax, %fs
> -        movl %eax, %gs
> -        movl %eax, %ss
> -
> -        /* Push arguments onto stack. */
> -        pushl $0   /* 20(%esp) - preserve context */
> -        pushl $1   /* 16(%esp) - cpu has pae */
> -        pushl %ecx /* 12(%esp) - start address */
> -        pushl %edx /*  8(%esp) - page list */
> -        pushl %esi /*  4(%esp) - indirection page */
> -        pushl %edi /*  0(%esp) - CALL */
> -
> -        /* Disable paging and therefore leave 64 bit mode. */
> -        movl %cr0, %eax
> -        andl $~X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -
> -        /* Switch to 32 bit page table. */
> -        RELOCATE_SYM(compat_pg_table, %eax)
> -        movl  %eax, %cr3
> -
> -        /* Clear MSR_EFER[LME], disabling long mode */
> -        movl    $MSR_EFER,%ecx
> -        rdmsr
> -        btcl    $_EFER_LME,%eax
> -        wrmsr
> -
> -        /* Re-enable paging, but only 32 bit mode now. */
> -        movl %cr0, %eax
> -        orl $X86_CR0_PG, %eax
> -        movl %eax, %cr0
> -        jmp 1f
> -1:
> -
> -        popl %eax
> -        call *%eax
> -        ud2
> -
> -        .data
> -        .align 4
> -compat_page_list:
> -        .fill 12,4,0
> -
> -        .align 32,0
> -
> -        /*
> -         * These compat page tables contain an identity mapping of the
> -         * first 4G of the physical address space.
> -         */
> -compat_pg_table:
> -        .long SYM_PHYS(compat_pg_table_l2) + 0*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 1*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 2*PAGE_SIZE + 0x01, 0
> -        .long SYM_PHYS(compat_pg_table_l2) + 3*PAGE_SIZE + 0x01, 0
> -
> -        .section .data.page_aligned, "aw", @progbits
> -        .align PAGE_SIZE,0
> -compat_pg_table_l2:
> -        .macro identmap from=0, count=512
> -        .if \count-1
> -        identmap "(\from+0)","(\count/2)"
> -        identmap "(\from+(0x200000*(\count/2)))","(\count/2)"
> -        .else
> -        .quad 0x00000000000000e3 + \from
> -        .endif
> -        .endm
> -
> -        identmap 0x00000000
> -        identmap 0x40000000
> -        identmap 0x80000000
> -        identmap 0xc0000000
> diff --git a/xen/arch/x86/x86_64/kexec_reloc.S b/xen/arch/x86/x86_64/kexec_reloc.S
> new file mode 100644
> index 0000000..7a16c85
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/kexec_reloc.S
> @@ -0,0 +1,198 @@
> +/*
> + * Relocate a kexec_image to its destination and call it.
> + *
> + * Copyright (C) 2013 Citrix Systems R&D Ltd.
> + *
> + * Portions derived from Linux's arch/x86/kernel/relocate_kernel_64.S.
> + *
> + *   Copyright (C) 2002-2005 Eric Biederman  <ebiederm@xmission.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <xen/config.h>
> +#include <xen/kimage.h>
> +
> +#include <asm/asm_defns.h>
> +#include <asm/msr.h>
> +#include <asm/page.h>
> +#include <asm/machine_kexec.h>
> +
> +        .text
> +        .align PAGE_SIZE
> +        .code64
> +
> +ENTRY(kexec_reloc)
> +        /* %rdi - code page maddr */
> +        /* %rsi - page table maddr */
> +        /* %rdx - indirection page maddr */
> +        /* %rcx - entry maddr (%rbp) */
> +        /* %r8 - flags */
> +
> +        movq    %rcx, %rbp
> +
> +        /* Set up stack. */
> +        leaq    (reloc_stack - kexec_reloc)(%rdi), %rsp
> +
> +        /* Load reloc page table. */
> +        movq    %rsi, %cr3
> +
> +        /* Jump to identity mapped code. */
> +        leaq    (identity_mapped - kexec_reloc)(%rdi), %rax
> +        jmpq    *%rax
> +
> +identity_mapped:
> +        /*
> +         * Set cr0 to a known state:
> +         *  - Paging enabled
> +         *  - Alignment check disabled
> +         *  - Write protect disabled
> +         *  - No task switch
> +         *  - Don't do FP software emulation.
> +         *  - Protected mode enabled
> +         */
> +        movq    %cr0, %rax
> +        andl    $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
> +        orl     $(X86_CR0_PG | X86_CR0_PE), %eax
> +        movq    %rax, %cr0
> +
> +        /*
> +         * Set cr4 to a known state:
> +         *  - physical address extension enabled
> +         */
> +        movl    $X86_CR4_PAE, %eax
> +        movq    %rax, %cr4
> +
> +        movq    %rdx, %rdi
> +        call    relocate_pages
> +
> +        /* Need to switch to 32-bit mode? */
> +        testq   $KEXEC_RELOC_FLAG_COMPAT, %r8
> +        jnz     call_32_bit
> +
> +call_64_bit:
> +        /* Call the image entry point.  This should never return. */
> +        callq   *%rbp
> +        ud2
> +
> +call_32_bit:
> +        /* Set up IDT. */
> +        lidt    compat_mode_idt(%rip)
> +
> +        /* Load compat GDT. */
> +        leaq    compat_mode_gdt(%rip), %rax
> +        movq    %rax, (compat_mode_gdt_desc + 2)(%rip)
> +        lgdt    compat_mode_gdt_desc(%rip)
> +
> +        /* Relocate compatibility mode entry point address. */
> +        leal    compatibility_mode(%rip), %eax
> +        movl    %eax, compatibility_mode_far(%rip)
> +
> +        /* Enter compatibility mode. */
> +        ljmp    *compatibility_mode_far(%rip)
> +
> +relocate_pages:
> +        /* %rdi - indirection page maddr */
> +        pushq   %rbx
> +
> +        cld
> +        movq    %rdi, %rbx
> +        xorl    %edi, %edi
> +        xorl    %esi, %esi
> +
> +next_entry: /* top, read another word for the indirection page */
> +
> +        movq    (%rbx), %rcx
> +        addq    $8, %rbx
> +is_dest:
> +        testb   $IND_DESTINATION, %cl
> +        jz      is_ind
> +        movq    %rcx, %rdi
> +        andq    $PAGE_MASK, %rdi
> +        jmp     next_entry
> +is_ind:
> +        testb   $IND_INDIRECTION, %cl
> +        jz      is_done
> +        movq    %rcx, %rbx
> +        andq    $PAGE_MASK, %rbx
> +        jmp     next_entry
> +is_done:
> +        testb   $IND_DONE, %cl
> +        jnz     done
> +is_source:
> +        testb   $IND_SOURCE, %cl
> +        jz      is_zero
> +        movq    %rcx, %rsi      /* For every source page do a copy */
> +        andq    $PAGE_MASK, %rsi
> +        movl    $(PAGE_SIZE / 8), %ecx
> +        rep movsq
> +        jmp     next_entry
> +is_zero:
> +        testb   $IND_ZERO, %cl
> +        jz      next_entry
> +        movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
> +        xorl    %eax, %eax
> +        rep stosq
> +        jmp     next_entry
> +done:
> +        popq    %rbx
> +        ret
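The relocate_pages loop above is easier to follow in C. This is a minimal sketch, not Xen code. Machine memory is simulated with a static array, a "machine address" is simply an offset into that array, and nothing is bounds-checked. The IND_* values are assumptions (in particular IND_ZERO = 0x10) and should be checked against xen/include/xen/kimage.h. As in the assembly, `rep movsq`/`rep stosq` leave the destination one page further on, so consecutive source or zero entries fill consecutive destination pages.

```c
#include <stdint.h>
#include <string.h>

/* Entry flags in the low bits of each 64-bit entry (values assumed). */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8
#define IND_ZERO        0x10

#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))
#define SK_NR_PAGES   8

/* Simulated machine memory; a "machine address" is an offset into it. */
static uint64_t mem[SK_NR_PAGES * SK_PAGE_SIZE / sizeof(uint64_t)];

static void *maddr_to_ptr(uint64_t maddr)
{
    return (unsigned char *)mem + maddr;
}

/* Walk the indirection list starting at ind_maddr, as relocate_pages does. */
static void relocate_pages_sketch(uint64_t ind_maddr)
{
    const uint64_t *entry = maddr_to_ptr(ind_maddr);
    unsigned char *dest = NULL;

    for ( ;; )
    {
        uint64_t e = *entry++;
        uint64_t addr = e & SK_PAGE_MASK;

        if ( e & IND_DESTINATION )          /* set the copy destination */
            dest = maddr_to_ptr(addr);
        else if ( e & IND_INDIRECTION )     /* continue in another list page */
            entry = maddr_to_ptr(addr);
        else if ( e & IND_DONE )
            return;
        else if ( e & IND_SOURCE )          /* copy one page, advance dest */
        {
            memcpy(dest, maddr_to_ptr(addr), SK_PAGE_SIZE);
            dest += SK_PAGE_SIZE;
        }
        else if ( e & IND_ZERO )            /* zero one page, advance dest */
        {
            memset(dest, 0, SK_PAGE_SIZE);
            dest += SK_PAGE_SIZE;
        }
        /* Entries with no recognised flag are skipped. */
    }
}
```

Because the destination only moves on IND_DESTINATION or after a copy or zero, one destination entry followed by N source entries lays out N contiguous pages.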
> +
> +        .code32
> +
> +compatibility_mode:
> +        /* Set up some sane segments. */
> +        movl    $0x0008, %eax
> +        movl    %eax, %ds
> +        movl    %eax, %es
> +        movl    %eax, %fs
> +        movl    %eax, %gs
> +        movl    %eax, %ss
> +
> +        /* Disable paging and therefore leave 64 bit mode. */
> +        movl    %cr0, %eax
> +        andl    $~X86_CR0_PG, %eax
> +        movl    %eax, %cr0
> +
> +        /* Disable long mode */
> +        movl    $MSR_EFER, %ecx
> +        rdmsr
> +        andl    $~EFER_LME, %eax
> +        wrmsr
> +
> +        /* Clear cr4 to disable PAE. */
> +        xorl    %eax, %eax
> +        movl    %eax, %cr4
> +
> +        /* Call the image entry point.  This should never return. */
> +        call    *%ebp
> +        ud2
> +
> +        .align 4
> +compatibility_mode_far:
> +        .long 0x00000000             /* set in call_32_bit above */
> +        .word 0x0010
> +
> +compat_mode_gdt_desc:
> +        .word (3*8)-1
> +        .quad 0x0000000000000000     /* set in call_32_bit above */
> +
> +        .align 8
> +compat_mode_gdt:
> +        .quad 0x0000000000000000     /* null                              */
> +        .quad 0x00cf92000000ffff     /* 0x0008 ring 0 data                */
> +        .quad 0x00cf9a000000ffff     /* 0x0010 ring 0 code, compatibility */
> +
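The compat GDT entries and the `lgdt` operand are packed bit fields. The small decoder below assumes the standard x86 segment-descriptor layout described in the Intel SDM. It shows that 0x00cf9a000000ffff is a flat 4 GiB ring-0 code segment with D=1 and L=0, which makes it 32-bit code even while EFER.LMA is still set. It also shows that 0x00cf92000000ffff is the matching flat read/write data segment. In 64-bit mode `lgdt` reads a 10-byte operand: a 16-bit limit followed by a 64-bit base. The packed `desc_ptr64` struct models that layout and shows why the base is stored at `compat_mode_gdt_desc + 2`. The struct relies on a GCC/Clang extension, and all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct seg_desc_fields {
    uint32_t base;
    uint32_t limit;   /* raw 20-bit limit, before granularity scaling */
    uint8_t  access;  /* P, DPL, S, type */
    uint8_t  flags;   /* G (0x8), D/B (0x4), L (0x2), AVL (0x1) */
};

/* Split a legacy 8-byte segment descriptor into its fields. */
static struct seg_desc_fields decode_desc(uint64_t d)
{
    struct seg_desc_fields f;

    f.limit  = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 48) & 0xf) << 16;
    f.base   = (uint32_t)((d >> 16) & 0xffffff) | (uint32_t)((d >> 56) & 0xff) << 24;
    f.access = (uint8_t)((d >> 40) & 0xff);
    f.flags  = (uint8_t)((d >> 52) & 0xf);
    return f;
}

/* Highest valid offset: with G set the limit counts 4 KiB units. */
static uint64_t effective_limit(struct seg_desc_fields f)
{
    return (f.flags & 0x8) ? ((uint64_t)f.limit << 12) | 0xfff : f.limit;
}

/* lgdt operand in 64-bit mode: 16-bit limit, then 64-bit base, no padding. */
struct __attribute__((packed)) desc_ptr64 {
    uint16_t limit;
    uint64_t base;
};
```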
> +compat_mode_idt:
> +        .word 0                      /* limit */
> +        .long 0                      /* base */
> +
> +        /*
> +         * 16 words of stack are more than enough.
> +         */
> +        .fill 16,8,0
> +reloc_stack:
> +
> +        .globl kexec_reloc_size
> +kexec_reloc_size:
> +        .long . - kexec_reloc
> diff --git a/xen/common/kexec.c b/xen/common/kexec.c
> index 7b23df0..c5450ba 100644
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -25,6 +25,7 @@
>   #include <xen/version.h>
>   #include <xen/console.h>
>   #include <xen/kexec.h>
> +#include <xen/kimage.h>
>   #include <public/elfnote.h>
>   #include <xsm/xsm.h>
>   #include <xen/cpu.h>
> @@ -47,7 +48,7 @@ static Elf_Note *xen_crash_note;
>   
>   static cpumask_t crash_saved_cpus;
>   
> -static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
> +static struct kexec_image *kexec_image[KEXEC_IMAGE_NR];
>   
>   #define KEXEC_FLAG_DEFAULT_POS   (KEXEC_IMAGE_NR + 0)
>   #define KEXEC_FLAG_CRASH_POS     (KEXEC_IMAGE_NR + 1)
> @@ -55,8 +56,6 @@ static xen_kexec_image_t kexec_image[KEXEC_IMAGE_NR];
>   
>   static unsigned long kexec_flags = 0; /* the lowest bits are for KEXEC_IMAGE... */
>   
> -static spinlock_t kexec_lock = SPIN_LOCK_UNLOCKED;
> -
>   static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
>   static size_t vmcoreinfo_size = 0;
>   
> @@ -311,14 +310,14 @@ void kexec_crash(void)
>       kexec_common_shutdown();
>       kexec_crash_save_cpu();
>       machine_crash_shutdown();
> -    machine_kexec(&kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
> +    machine_kexec(kexec_image[KEXEC_IMAGE_CRASH_BASE + pos]);
>   
>       BUG();
>   }
>   
>   static long kexec_reboot(void *_image)
>   {
> -    xen_kexec_image_t *image = _image;
> +    struct kexec_image *image = _image;
>   
>       kexecing = TRUE;
>   
> @@ -734,63 +733,264 @@ static void crash_save_vmcoreinfo(void)
>   #endif
>   }
>   
> -static int kexec_load_unload_internal(unsigned long op, xen_kexec_load_v1_t *load)
> +static void kexec_unload_image(struct kexec_image *image)
>   {
> -    xen_kexec_image_t *image;
> +    if ( !image )
> +        return;
> +
> +    machine_kexec_unload(image);
> +    kimage_free(image);
> +}
> +
> +static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_exec_t exec;
> +    struct kexec_image *image;
> +    int base, bit, pos, ret = -EINVAL;
> +
> +    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +        return -EFAULT;
> +
> +    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +        return -EINVAL;
> +
> +    pos = (test_bit(bit, &kexec_flags) != 0);
> +
> +    /* Only allow kexec/kdump into loaded images */
> +    if ( !test_bit(base + pos, &kexec_flags) )
> +        return -ENOENT;
> +
> +    switch (exec.type)
> +    {
> +    case KEXEC_TYPE_DEFAULT:
> +        image = kexec_image[base + pos];
> +        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> +        break;
> +    case KEXEC_TYPE_CRASH:
> +        kexec_crash(); /* Does not return */
> +        break;
> +    }
> +
> +    return -EINVAL; /* never reached */
> +}
> +
> +static int kexec_swap_images(int type, struct kexec_image *new,
> +                             struct kexec_image **old)
> +{
> +    static DEFINE_SPINLOCK(kexec_lock);
>       int base, bit, pos;
> -    int ret = 0;
> +    int new_slot, old_slot;
> +
> +    *old = NULL;
> +
> +    spin_lock(&kexec_lock);
> +
> +    if ( test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    {
> +        spin_unlock(&kexec_lock);
> +        return -EBUSY;
> +    }
>   
> -    if ( kexec_load_get_bits(load->type, &base, &bit) )
> +    if ( kexec_load_get_bits(type, &base, &bit) )
>           return -EINVAL;
>   
>       pos = (test_bit(bit, &kexec_flags) != 0);
> +    old_slot = base + pos;
> +    new_slot = base + !pos;
>   
> -    /* Load the user data into an unused image */
> -    if ( op == KEXEC_CMD_kexec_load )
> +    if ( new )
>       {
> -        image = &kexec_image[base + !pos];
> +        kexec_image[new_slot] = new;
> +        set_bit(new_slot, &kexec_flags);
> +    }
> +    change_bit(bit, &kexec_flags);
>   
> -        BUG_ON(test_bit((base + !pos), &kexec_flags)); /* must be free */
> +    clear_bit(old_slot, &kexec_flags);
> +    *old = kexec_image[old_slot];
>   
> -        memcpy(image, &load->image, sizeof(*image));
> +    spin_unlock(&kexec_lock);
>   
> -        if ( !(ret = machine_kexec_load(load->type, base + !pos, image)) )
> -        {
> -            /* Set image present bit */
> -            set_bit((base + !pos), &kexec_flags);
> +    return 0;
> +}
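kexec_swap_images() double-buffers each image type. There are two slots per type, plus one bit saying which slot is current. Below is a minimal sketch of that scheme in portable C11. The flag layout, `struct image` and the atomic_flag spinlock are all assumptions standing in for kexec_flags, struct kexec_image and DEFINE_SPINLOCK. The sketch validates the type before taking the lock, so every return path leaves the lock free. It also clears the old slot once that slot's image has been handed back, so a later swap cannot return the same pointer twice.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical layout: bits 0-3 mark occupied slots (two per type),
 * bits 4-5 select the current slot of each type, bit 6 means "in progress". */
enum { TYPE_DEFAULT = 0, TYPE_CRASH = 1 };
#define IMAGE_NR          4
#define FLAG_ACTIVE(t)    (IMAGE_NR + (t))
#define FLAG_IN_PROGRESS  (IMAGE_NR + 2)

struct image { int id; };

static struct image *slots[IMAGE_NR];
static unsigned long flags;
static atomic_flag lock = ATOMIC_FLAG_INIT;

static void lock_acquire(void) { while ( atomic_flag_test_and_set(&lock) ) { /* spin */ } }
static void lock_release(void) { atomic_flag_clear(&lock); }

static int flag_set(int nr) { return (int)((flags >> nr) & 1UL); }

/* Install new_img (NULL to unload) and hand back the displaced image. */
int swap_images(int type, struct image *new_img, struct image **old)
{
    int base, pos, old_slot, new_slot;

    *old = NULL;

    /* Validate before locking so no error path returns with the lock held. */
    if ( type != TYPE_DEFAULT && type != TYPE_CRASH )
        return -EINVAL;
    base = 2 * type;

    lock_acquire();
    if ( flag_set(FLAG_IN_PROGRESS) )
    {
        lock_release();
        return -EBUSY;
    }

    pos = flag_set(FLAG_ACTIVE(type));
    old_slot = base + pos;
    new_slot = base + !pos;

    if ( new_img )
    {
        slots[new_slot] = new_img;
        flags |= 1UL << new_slot;
    }
    flags ^= 1UL << FLAG_ACTIVE(type);   /* the other slot becomes current */

    flags &= ~(1UL << old_slot);
    *old = slots[old_slot];
    slots[old_slot] = NULL;              /* never hand the same image back twice */

    lock_release();
    return 0;
}
```

The caller frees whatever comes back in `*old` after the lock is dropped, which is the same split the patch uses between kexec_swap_images() and kexec_unload_image().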
>   
> -            /* Make new image the active one */
> -            change_bit(bit, &kexec_flags);
> -        }
> +static int kexec_load_slot(struct kexec_image *kimage)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret = -ENOMEM;
> +
> +    ret = machine_kexec_load(kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    crash_save_vmcoreinfo();
> +
> +    ret = kexec_swap_images(kimage->type, kimage, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static uint16_t kexec_load_v1_arch(void)
> +{
> +#ifdef CONFIG_X86
> +    return is_pv_32on64_domain(dom0) ? EM_386 : EM_X86_64;
> +#else
> +    return EM_NONE;
> +#endif
> +}
>   
> -        crash_save_vmcoreinfo();
> +static int kexec_segments_add_segment(
> +    unsigned int *nr_segments, xen_kexec_segment_t *segments,
> +    unsigned long mfn)
> +{
> +    paddr_t maddr = (paddr_t)mfn << PAGE_SHIFT;
> +    unsigned int n = *nr_segments;
> +
> +    /* Need a new segment? */
> +    if ( n == 0
> +         || segments[n-1].dest_maddr + segments[n-1].dest_size != maddr )
> +    {
> +        n++;
> +        if ( n > KEXEC_SEGMENT_MAX )
> +            return -EINVAL;
> +        *nr_segments = n;
> +
> +        set_xen_guest_handle(segments[n-1].buf.h, NULL);
> +        segments[n-1].buf_size = 0;
> +        segments[n-1].dest_maddr = maddr;
> +        segments[n-1].dest_size = 0;
>       }
>   
> -    /* Unload the old image if present and load successful */
> -    if ( ret == 0 && !test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
> +    return 0;
> +}
> +
> +static int kexec_segments_from_ind_page(unsigned long mfn,
> +                                        unsigned int *nr_segments,
> +                                        xen_kexec_segment_t *segments,
> +                                        bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +
> +    page = map_domain_page(mfn);
> +
> +    /*
> +     * Walk the indirection page list, adding destination pages to the
> +     * segments.
> +     */
> +    for ( entry = page; ; )
>       {
> -        if ( test_and_clear_bit((base + pos), &kexec_flags) )
> +        unsigned long ind;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
>           {
> -            image = &kexec_image[base + pos];
> -            machine_kexec_unload(load->type, base + pos, image);
> +        case IND_DESTINATION:
> +            ret = kexec_segments_add_segment(nr_segments, segments, mfn);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            entry = page = map_domain_page(mfn);
> +            continue;
> +        case IND_DONE:
> +            goto done;
> +        case IND_SOURCE:
> +            if ( *nr_segments == 0 )
> +            {
> +                ret = -EINVAL;
> +                goto done;
> +            }
> +            segments[*nr_segments-1].dest_size += PAGE_SIZE;
> +            break;
> +        default:
> +            ret = -EINVAL;
> +            goto done;
>           }
> +        entry = kimage_entry_next(entry, compat);
>       }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
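kexec_segments_add_segment() and the IND_SOURCE case together turn a v1 indirection list into destination-only segments. A destination that starts exactly where the previous segment ends extends that segment. Any other destination opens a new one, and each source page grows the current segment by one page. The sketch below applies the same coalescing to a flat array of entries. The `struct entry` type, SEG_MAX and the IND_* values are illustrative only, not Xen's.

```c
#include <errno.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uint64_t)1 << SK_PAGE_SHIFT)
#define SEG_MAX       16   /* stands in for KEXEC_SEGMENT_MAX; value assumed */

enum ind { IND_DESTINATION = 0x1, IND_DONE = 0x4, IND_SOURCE = 0x8 };

struct entry   { enum ind ind; uint64_t mfn; };
struct segment { uint64_t dest_maddr, dest_size; };

static int add_segment(unsigned int *nr, struct segment *segs, uint64_t mfn)
{
    uint64_t maddr = mfn << SK_PAGE_SHIFT;
    unsigned int n = *nr;

    /* Open a new segment unless this destination continues the last one. */
    if ( n == 0 || segs[n - 1].dest_maddr + segs[n - 1].dest_size != maddr )
    {
        if ( n == SEG_MAX )
            return -EINVAL;
        segs[n].dest_maddr = maddr;
        segs[n].dest_size = 0;
        *nr = n + 1;
    }
    return 0;
}

static int segments_from_entries(const struct entry *e, unsigned int *nr,
                                 struct segment *segs)
{
    *nr = 0;
    for ( ;; e++ )
    {
        switch ( e->ind )
        {
        case IND_DESTINATION:
            if ( add_segment(nr, segs, e->mfn) < 0 )
                return -EINVAL;
            break;
        case IND_SOURCE:
            if ( *nr == 0 )   /* a source page before any destination */
                return -EINVAL;
            segs[*nr - 1].dest_size += SK_PAGE_SIZE;
            break;
        case IND_DONE:
            return 0;
        default:
            return -EINVAL;
        }
    }
}
```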
>   
> +static int kexec_do_load_v1(xen_kexec_load_v1_t *load, int compat)
> +{
> +    struct kexec_image *kimage = NULL;
> +    xen_kexec_segment_t *segments;
> +    uint16_t arch;
> +    unsigned int nr_segments = 0;
> +    unsigned long ind_mfn = load->image.indirection_page >> PAGE_SHIFT;
> +    int ret;
> +
> +    arch = kexec_load_v1_arch();
> +    if ( arch == EM_NONE )
> +        return -ENOSYS;
> +
> +    segments = xmalloc_array(xen_kexec_segment_t, KEXEC_SEGMENT_MAX);
> +    if ( segments == NULL )
> +        return -ENOMEM;
> +
> +    /*
> +     * Work out the image segments (destination only) from the
> +     * indirection pages.
> +     *
> +     * This is needed so we don't allocate pages that will overlap
> +     * with the destination when building the new set of indirection
> +     * pages below.
> +     */
> +    ret = kexec_segments_from_ind_page(ind_mfn, &nr_segments, segments, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_alloc(&kimage, load->type, arch, load->image.start_address,
> +                       nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    /*
> +     * Build a new set of indirection pages in the native format.
> +     *
> +     * This walks the guest provided indirection pages a second time.
> +     * The guest could have altered them, invalidating the segment
> +     * information constructed above.  At worst, this leaves the
> +     * resulting image unrelocatable.
> +     */
> +    ret = kimage_build_ind(kimage, ind_mfn, compat);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( !kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
>       return ret;
>   }
>   
> -static int kexec_load_unload(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>       xen_kexec_load_v1_t load;
>   
>       if ( unlikely(copy_from_guest(&load, uarg, 1)) )
>           return -EFAULT;
>   
> -    return kexec_load_unload_internal(op, &load);
> +    return kexec_do_load_v1(&load, 0);
>   }
>   
> -static int kexec_load_unload_compat(unsigned long op,
> -                                    XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
>   #ifdef CONFIG_COMPAT
>       compat_kexec_load_v1_t compat_load;
> @@ -809,49 +1009,113 @@ static int kexec_load_unload_compat(unsigned long op,
>       load.type = compat_load.type;
>       XLAT_kexec_image(&load.image, &compat_load.image);
>   
> -    return kexec_load_unload_internal(op, &load);
> -#else /* CONFIG_COMPAT */
> +    return kexec_do_load_v1(&load, 1);
> +#else
>       return 0;
> -#endif /* CONFIG_COMPAT */
> +#endif
>   }
>   
> -static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +static int kexec_load(XEN_GUEST_HANDLE_PARAM(void) uarg)
>   {
> -    xen_kexec_exec_t exec;
> -    xen_kexec_image_t *image;
> -    int base, bit, pos, ret = -EINVAL;
> +    xen_kexec_load_t load;
> +    xen_kexec_segment_t *segments;
> +    struct kexec_image *kimage = NULL;
> +    int ret;
>   
> -    if ( unlikely(copy_from_guest(&exec, uarg, 1)) )
> +    if ( copy_from_guest(&load, uarg, 1) )
>           return -EFAULT;
>   
> -    if ( kexec_load_get_bits(exec.type, &base, &bit) )
> +    if ( load.nr_segments >= KEXEC_SEGMENT_MAX )
>           return -EINVAL;
>   
> -    pos = (test_bit(bit, &kexec_flags) != 0);
> -
> -    /* Only allow kexec/kdump into loaded images */
> -    if ( !test_bit(base + pos, &kexec_flags) )
> -        return -ENOENT;
> +    segments = xmalloc_array(xen_kexec_segment_t, load.nr_segments);
> +    if ( segments == NULL )
> +        return -ENOMEM;
>   
> -    switch (exec.type)
> +    if ( copy_from_guest(segments, load.segments.h, load.nr_segments) )
>       {
> -    case KEXEC_TYPE_DEFAULT:
> -        image = &kexec_image[base + pos];
> -        ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
> -        break;
> -    case KEXEC_TYPE_CRASH:
> -        kexec_crash(); /* Does not return */
> -        break;
> +        ret = -EFAULT;
> +        goto error;
>       }
>   
> -    return -EINVAL; /* never reached */
> +    ret = kimage_alloc(&kimage, load.type, load.arch, load.entry_maddr,
> +                       load.nr_segments, segments);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kimage_load_segments(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    ret = kexec_load_slot(kimage);
> +    if ( ret < 0 )
> +        goto error;
> +
> +    return 0;
> +
> +error:
> +    if ( ! kimage )
> +        xfree(segments);
> +    kimage_free(kimage);
> +    return ret;
> +}
> +
> +static int kexec_do_unload(xen_kexec_unload_t *unload)
> +{
> +    struct kexec_image *old_kimage;
> +    int ret;
> +
> +    ret = kexec_swap_images(unload->type, NULL, &old_kimage);
> +    if ( ret < 0 )
> +        return ret;
> +
> +    kexec_unload_image(old_kimage);
> +
> +    return 0;
> +}
> +
> +static int kexec_unload_v1(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_load_v1_t load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = load.type;
> +    return kexec_do_unload(&unload);
> +}
> +
> +static int kexec_unload_v1_compat(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +#ifdef CONFIG_COMPAT
> +    compat_kexec_load_v1_t compat_load;
> +    xen_kexec_unload_t unload;
> +
> +    if ( copy_from_guest(&compat_load, uarg, 1) )
> +        return -EFAULT;
> +
> +    unload.type = compat_load.type;
> +    return kexec_do_unload(&unload);
> +#else
> +    return 0;
> +#endif
> +}
> +
> +static int kexec_unload(XEN_GUEST_HANDLE_PARAM(void) uarg)
> +{
> +    xen_kexec_unload_t unload;
> +
> +    if ( unlikely(copy_from_guest(&unload, uarg, 1)) )
> +        return -EFAULT;
> +
> +    return kexec_do_unload(&unload);
>   }
>   
>   static int do_kexec_op_internal(unsigned long op,
>                                   XEN_GUEST_HANDLE_PARAM(void) uarg,
>                                   bool_t compat)
>   {
> -    unsigned long flags;
>       int ret = -EINVAL;
>   
>       ret = xsm_kexec(XSM_PRIV);
> @@ -867,20 +1131,26 @@ static int do_kexec_op_internal(unsigned long op,
>                   ret = kexec_get_range(uarg);
>           break;
>       case KEXEC_CMD_kexec_load_v1:
> +        if ( compat )
> +            ret = kexec_load_v1_compat(uarg);
> +        else
> +            ret = kexec_load_v1(uarg);
> +        break;
>       case KEXEC_CMD_kexec_unload_v1:
> -        spin_lock_irqsave(&kexec_lock, flags);
> -        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
> -        {
> -                if (compat)
> -                        ret = kexec_load_unload_compat(op, uarg);
> -                else
> -                        ret = kexec_load_unload(op, uarg);
> -        }
> -        spin_unlock_irqrestore(&kexec_lock, flags);
> +        if ( compat )
> +            ret = kexec_unload_v1_compat(uarg);
> +        else
> +            ret = kexec_unload_v1(uarg);
>           break;
>       case KEXEC_CMD_kexec:
>           ret = kexec_exec(uarg);
>           break;
> +    case KEXEC_CMD_kexec_load:
> +        ret = kexec_load(uarg);
> +        break;
> +    case KEXEC_CMD_kexec_unload:
> +        ret = kexec_unload(uarg);
> +        break;
>       }
>   
>       return ret;
> diff --git a/xen/common/kimage.c b/xen/common/kimage.c
> index 02ee37e..10fb785 100644
> --- a/xen/common/kimage.c
> +++ b/xen/common/kimage.c
> @@ -175,11 +175,20 @@ static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
>       image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
>       if ( !image->control_code_page )
>           goto out;
> +    result = machine_kexec_add_page(image,
> +                                    page_to_maddr(image->control_code_page),
> +                                    page_to_maddr(image->control_code_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       /* Add an empty indirection page. */
>       image->entry_page = kimage_alloc_control_page(image, 0);
>       if ( !image->entry_page )
>           goto out;
> +    result = machine_kexec_add_page(image, page_to_maddr(image->entry_page),
> +                                    page_to_maddr(image->entry_page));
> +    if ( result < 0 )
> +        goto out;
>   
>       image->head = page_to_maddr(image->entry_page);
>   
> @@ -595,7 +604,7 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>           if ( addr == destination )
>           {
>               page_list_del(page, &image->dest_pages);
> -            return page;
> +            goto found;
>           }
>       }
>       page = NULL;
> @@ -647,6 +656,8 @@ static struct page_info *kimage_alloc_page(struct kexec_image *image,
>               page_list_add(page, &image->dest_pages);
>           }
>       }
> +found:
> +    machine_kexec_add_page(image, page_to_maddr(page), page_to_maddr(page));
>       return page;
>   }
>   
> @@ -753,6 +764,7 @@ static int kimage_load_crash_segment(struct kexec_image *image,
>   static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
>   {
>       int result = -ENOMEM;
> +    paddr_t addr;
>   
>       if ( !guest_handle_is_null(segment->buf.h) )
>       {
> @@ -767,6 +779,14 @@ static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *s
>           }
>       }
>   
> +    for ( addr = segment->dest_maddr & PAGE_MASK;
> +          addr < segment->dest_maddr + segment->dest_size; addr += PAGE_SIZE )
> +    {
> +        result = machine_kexec_add_page(image, addr, addr);
> +        if ( result < 0 )
> +            break;
> +    }
> +
>       return result;
>   }
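The new loop in kimage_load_segment() registers every page that overlaps the segment's destination range, starting from the page that contains dest_maddr. Here is a minimal sketch of the same address arithmetic. It assumes 4 KiB pages, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE ((uint64_t)4096)
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

/* Write the address of each page overlapping [dest_maddr, dest_maddr +
 * dest_size) into out[], stopping after max entries; return the count. */
static size_t segment_pages(uint64_t dest_maddr, uint64_t dest_size,
                            uint64_t *out, size_t max)
{
    size_t n = 0;

    for ( uint64_t addr = dest_maddr & SK_PAGE_MASK;
          addr < dest_maddr + dest_size && n < max;
          addr += SK_PAGE_SIZE )
        out[n++] = addr;

    return n;
}
```

For example, a 0x1000-byte segment at 0x1800 straddles two pages, so both 0x1000 and 0x2000 are added. A zero-sized segment at a page-aligned address adds nothing.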
>   
> @@ -810,6 +830,106 @@ int kimage_load_segments(struct kexec_image *image)
>       return 0;
>   }
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return (kimage_entry_t *)((uint32_t *)entry + 1);
> +    return entry + 1;
> +}
> +
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry >> PAGE_SHIFT;
> +    return *entry >> PAGE_SHIFT;
> +}
> +
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat)
> +{
> +    if ( compat )
> +        return *(uint32_t *)entry & 0xf;
> +    return *entry & 0xf;
> +}
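The kimage_entry_*() helpers let a single walker handle both guest formats. A 32-bit (compat) dom0 writes 4-byte entries and a 64-bit dom0 writes 8-byte ones. In both formats the IND_* flags sit in the low four bits and the frame number sits above PAGE_SHIFT. A 4-byte entry can therefore only name machine addresses below 4 GiB. The sketch below mirrors the three helpers under hypothetical names. Unlike the patch, it walks a byte pointer and reads through memcpy(), which keeps it well-defined in portable C regardless of alignment and aliasing.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SHIFT 12

/* Raw entry at p: 32 bits wide for a compat guest, 64 bits otherwise. */
static uint64_t entry_raw(const unsigned char *p, bool compat)
{
    if ( compat )
    {
        uint32_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static const unsigned char *entry_next(const unsigned char *p, bool compat)
{
    return p + (compat ? sizeof(uint32_t) : sizeof(uint64_t));
}

static uint64_t entry_mfn(const unsigned char *p, bool compat)
{
    return entry_raw(p, compat) >> SK_PAGE_SHIFT;
}

static unsigned int entry_ind(const unsigned char *p, bool compat)
{
    return (unsigned int)(entry_raw(p, compat) & 0xf);
}
```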
> +
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat)
> +{
> +    void *page;
> +    kimage_entry_t *entry;
> +    int ret = 0;
> +    paddr_t dest = KIMAGE_NO_DEST;
> +
> +    page = map_domain_page(ind_mfn);
> +    if ( !page )
> +        return -ENOMEM;
> +
> +    /*
> +     * Walk the guest-supplied indirection pages, adding entries to
> +     * the image's indirection pages.
> +     */
> +    for ( entry = page; ;  )
> +    {
> +        unsigned long ind;
> +        unsigned long mfn;
> +
> +        ind = kimage_entry_ind(entry, compat);
> +        mfn = kimage_entry_mfn(entry, compat);
> +
> +        switch ( ind )
> +        {
> +        case IND_DESTINATION:
> +            dest = (paddr_t)mfn << PAGE_SHIFT;
> +            ret = kimage_set_destination(image, dest);
> +            if ( ret < 0 )
> +                goto done;
> +            break;
> +        case IND_INDIRECTION:
> +            unmap_domain_page(page);
> +            page = map_domain_page(mfn);
> +            entry = page;
> +            continue;
> +        case IND_DONE:
> +            kimage_terminate(image);
> +            goto done;
> +        case IND_SOURCE:
> +        {
> +            struct page_info *guest_page, *xen_page;
> +
> +            guest_page = mfn_to_page(mfn);
> +            if ( !get_page(guest_page, current->domain) )
> +            {
> +                ret = -EFAULT;
> +                goto done;
> +            }
> +
> +            xen_page = kimage_alloc_page(image, dest);
> +            if ( !xen_page )
> +            {
> +                put_page(guest_page);
> +                ret = -ENOMEM;
> +                goto done;
> +            }
> +
> +            copy_domain_page(page_to_mfn(xen_page), mfn);
> +            put_page(guest_page);
> +
> +            ret = kimage_add_page(image, page_to_maddr(xen_page));
> +            if ( ret < 0 )
> +                goto done;
> +            dest += PAGE_SIZE;
> +            break;
> +        }
> +        default:
> +            ret = -EINVAL;
> +            goto done;
> +        }
> +        entry = kimage_entry_next(entry, compat);
> +    }
> +done:
> +    unmap_domain_page(page);
> +    return ret;
> +}
> +
>   /*
>    * Local variables:
>    * mode: C
> diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
> index 8b4266d..48c5676 100644
> --- a/xen/include/asm-x86/fixmap.h
> +++ b/xen/include/asm-x86/fixmap.h
> @@ -56,9 +56,6 @@ enum fixed_addresses {
>       FIX_ACPI_BEGIN,
>       FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
>       FIX_HPET_BASE,
> -    FIX_KEXEC_BASE_0,
> -    FIX_KEXEC_BASE_END = FIX_KEXEC_BASE_0 \
> -      + ((KEXEC_XEN_NO_PAGES >> 1) * KEXEC_IMAGE_NR) - 1,
>       FIX_TBOOT_SHARED_BASE,
>       FIX_MSIX_IO_RESERV_BASE,
>       FIX_MSIX_IO_RESERV_END = FIX_MSIX_IO_RESERV_BASE + FIX_MSIX_MAX_PAGES -1,
> diff --git a/xen/include/asm-x86/machine_kexec.h b/xen/include/asm-x86/machine_kexec.h
> new file mode 100644
> index 0000000..ba0d469
> --- /dev/null
> +++ b/xen/include/asm-x86/machine_kexec.h
> @@ -0,0 +1,16 @@
> +#ifndef __X86_MACHINE_KEXEC_H__
> +#define __X86_MACHINE_KEXEC_H__
> +
> +#define KEXEC_RELOC_FLAG_COMPAT 0x1 /* 32-bit image */
> +
> +#ifndef __ASSEMBLY__
> +
> +extern void kexec_reloc(unsigned long reloc_code, unsigned long reloc_pt,
> +                        unsigned long ind_maddr, unsigned long entry_maddr,
> +                        unsigned long flags);
> +
> +extern unsigned int kexec_reloc_size;
> +
> +#endif
> +
> +#endif /* __X86_MACHINE_KEXEC_H__ */
> diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
> index 1a5dda1..bd17747 100644
> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -6,6 +6,7 @@
>   #include <public/kexec.h>
>   #include <asm/percpu.h>
>   #include <xen/elfcore.h>
> +#include <xen/kimage.h>
>   
>   typedef struct xen_kexec_reserve {
>       unsigned long size;
> @@ -40,11 +41,13 @@ extern enum low_crashinfo low_crashinfo_mode;
>   extern paddr_t crashinfo_maxaddr_bits;
>   void kexec_early_calculations(void);
>   
> -int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
> -void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
> +int machine_kexec_add_page(struct kexec_image *image, unsigned long vaddr,
> +                           unsigned long maddr);
> +int machine_kexec_load(struct kexec_image *image);
> +void machine_kexec_unload(struct kexec_image *image);
>   void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
> -void machine_reboot_kexec(xen_kexec_image_t *image);
> -void machine_kexec(xen_kexec_image_t *image);
> +void machine_reboot_kexec(struct kexec_image *image);
> +void machine_kexec(struct kexec_image *image);
>   void kexec_crash(void);
>   void kexec_crash_save_cpu(void);
>   crash_xen_info_t *kexec_crash_save_info(void);
> @@ -52,11 +55,6 @@ void machine_crash_shutdown(void);
>   int machine_kexec_get(xen_kexec_range_t *range);
>   int machine_kexec_get_xen(xen_kexec_range_t *range);
>   
> -void compat_machine_kexec(unsigned long rnk,
> -                          unsigned long indirection_page,
> -                          unsigned long *page_list,
> -                          unsigned long start_address);
> -
>   /* vmcoreinfo stuff */
>   #define VMCOREINFO_BYTES           (4096)
>   #define VMCOREINFO_NOTE_NAME       "VMCOREINFO_XEN"
> diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
> index 0ebd37a..d10ebf7 100644
> --- a/xen/include/xen/kimage.h
> +++ b/xen/include/xen/kimage.h
> @@ -47,6 +47,12 @@ int kimage_load_segments(struct kexec_image *image);
>   struct page_info *kimage_alloc_control_page(struct kexec_image *image,
>                                               unsigned memflags);
>   
> +kimage_entry_t *kimage_entry_next(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_mfn(kimage_entry_t *entry, bool_t compat);
> +unsigned long kimage_entry_ind(kimage_entry_t *entry, bool_t compat);
> +int kimage_build_ind(struct kexec_image *image, unsigned long ind_mfn,
> +                     bool_t compat);
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* __XEN_KIMAGE_H__ */
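The segment-mapping loop quoted above (the first hunk, which calls machine_kexec_add_page() for each page) starts at the page containing dest_maddr and stops once it reaches dest_maddr + dest_size. This means an unaligned segment can touch one more page than its size suggests. Below is a minimal stand-alone sketch of that arithmetic. It assumes 4 KiB pages, defines the page macros locally, and uses a hypothetical add_page callback in place of machine_kexec_add_page():

```c
#include <stdint.h>

/* Assumption: 4 KiB pages as on x86; defined locally, not Xen's headers. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical stand-in for machine_kexec_add_page(): vaddr -> maddr. */
typedef int (*add_page_fn)(uint64_t vaddr, uint64_t maddr, void *ctx);

/*
 * Identity-map (vaddr == maddr) every page touched by the segment
 * [dest_maddr, dest_maddr + dest_size), stopping at the first error.
 */
static int map_segment(uint64_t dest_maddr, uint64_t dest_size,
                       add_page_fn add_page, void *ctx)
{
    int result = 0;
    uint64_t addr;

    for ( addr = dest_maddr & PAGE_MASK;
          addr < dest_maddr + dest_size; addr += PAGE_SIZE )
    {
        result = add_page(addr, addr, ctx);
        if ( result < 0 )
            break;
    }
    return result;
}

/* Example callback: only counts the pages it is asked to map. */
static int count_page(uint64_t vaddr, uint64_t maddr, void *ctx)
{
    (void)vaddr;
    (void)maddr;
    ++*(unsigned int *)ctx;
    return 0;
}
```

For example, a one-page segment that starts 0x800 bytes into a page (dest_maddr = 0x1800, dest_size = 0x1000) covers [0x1800, 0x2800). The loop therefore maps two pages, at 0x1000 and 0x2000. A page-aligned segment of the same size maps only one.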


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
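kimage_build_ind() in the patch above decodes each guest entry as a page address with a flag in its low 4 bits. It steps by 4 bytes for compat (32-bit) guests and by 8 bytes otherwise, and it follows IND_INDIRECTION entries onto the next list page. The sketch below models only that walk, under several assumptions:

- The IND_* values are assumed to match Linux's kexec flags (0x1, 0x2, 0x4, 0x8).
- A hypothetical `pages` array indexed by mfn stands in for map_domain_page().
- Source pages are not copied. The sketch only records where each one would land.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Assumed to match Linux's kexec IND_* flag values. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

typedef uint64_t kimage_entry_t;

/* Compat guests pack entries as uint32_t; native ones as 64-bit words. */
static const void *entry_next(const void *entry, int compat)
{
    if ( compat )
        return (const uint32_t *)entry + 1;
    return (const kimage_entry_t *)entry + 1;
}

static uint64_t entry_raw(const void *entry, int compat)
{
    if ( compat )
        return *(const uint32_t *)entry;
    return *(const kimage_entry_t *)entry;
}

struct walk_result {
    unsigned int nr_sources; /* IND_SOURCE entries seen */
    uint64_t last_dest;      /* destination of the last source page */
};

/*
 * Walk an indirection list starting at page first_mfn. `pages` is a
 * hypothetical mfn-indexed table of list pages. Returns 0 on IND_DONE
 * and -1 on an unknown flag or an out-of-range list page.
 */
static int walk_ind(void *const *pages, size_t nr_pages,
                    unsigned long first_mfn, int compat,
                    struct walk_result *res)
{
    const void *entry;
    uint64_t dest = 0;

    res->nr_sources = 0;
    res->last_dest = 0;
    if ( first_mfn >= nr_pages )
        return -1;
    entry = pages[first_mfn];

    for ( ;; )
    {
        uint64_t raw = entry_raw(entry, compat);
        unsigned long ind = raw & 0xf;
        uint64_t mfn = raw >> PAGE_SHIFT;

        switch ( ind )
        {
        case IND_DESTINATION:
            dest = mfn << PAGE_SHIFT;
            break;
        case IND_INDIRECTION:
            if ( mfn >= nr_pages )
                return -1;
            entry = pages[mfn];  /* continue on the next list page */
            continue;
        case IND_DONE:
            return 0;
        case IND_SOURCE:
            /* kimage_build_ind() copies the source page to `dest` here. */
            res->nr_sources++;
            res->last_dest = dest;
            dest += (uint64_t)1 << PAGE_SHIFT;
            break;
        default:
            return -1;
        }
        entry = entry_next(entry, compat);
    }
}
```

The destination advances by one page for every IND_SOURCE entry, which mirrors the `dest += PAGE_SIZE` in the patch. Reading a compat list with 64-bit strides would merge pairs of entries, which is why the helpers take the `compat` flag.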

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (16 preceding siblings ...)
  2013-11-06 14:49 ` David Vrabel
@ 2013-11-07 21:16 ` Daniel Kiper
  2013-11-07 21:16 ` Daniel Kiper
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-07 21:16 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Jan Beulich, xen-devel

On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image.  This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from a HVM or PVH privileged domain.
>
> I have now tested this with a Linux kernel image using the VGA console
> which was what was causing problems in v9 (this turned out to be a
> kexec-tools bug).
>
> The required patch series for kexec-tools will be posted shortly and
> are available from the xen-v7 branch of:

In general it works. However, quite often I am not able to execute the panic
kernel. The machine hangs with the following message:

(XEN) Domain 0 crashed: Executing crash image

gdb shows:

(gdb) bt
#0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
#1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
#2  0x0000000000000000 in ?? ()
(gdb)

Especially the second bt line scares me... ;-)))

I have not been able to identify why the NMI was triggered because the
stack is completely cleared. I tried to record execution in gdb,
but it stops with the following message:

cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
    at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
108             clear_bit(cpumask_check(cpu), dstp->bits);
Process record: failed to record execution log.

Do you know how to find out why the NMI was triggered?

I can almost always reproduce this issue as follows:
  - boot Xen,
  - load panic kernel,
  - echo c > /proc/sysrq-trigger,
  - reboot from command line,
  - boot Xen,
  - load panic kernel,
  - echo c > /proc/sysrq-trigger.

Additionally, my compiler fails because it detects an unused result
variable in xen/common/kimage.c:kimage_crash_alloc().

Daniel

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:16 ` Daniel Kiper
@ 2013-11-07 21:25   ` Andrew Cooper
  2013-11-07 21:25   ` [Xen-devel] " Andrew Cooper
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: Andrew Cooper @ 2013-11-07 21:25 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, David Vrabel, Jan Beulich, xen-devel

On 07/11/13 21:16, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>> responsible for loading and relocating the image.  This allows kexec
>> to be usable by pv-ops kernels and should allow kexec to be usable
>> from a HVM or PVH privileged domain.
>>
>> I have now tested this with a Linux kernel image using the VGA console
>> which was what was causing problems in v9 (this turned out to be a
>> kexec-tools bug).
>>
>> The required patch series for kexec-tools will be posted shortly and
>> are available from the xen-v7 branch of:
> In general it works. However, quite often I am not able to execute the panic
> kernel. The machine hangs with the following message:
>
> (XEN) Domain 0 crashed: Executing crash image
>
> gdb shows:
>
> (gdb) bt
> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> #2  0x0000000000000000 in ?? ()
> (gdb)
>
> Especially the second bt line scares me... ;-)))

Why? This is completely normal.  If you look in crash.c at that line, it
is a for (;;) halt(); loop.

How are you hooking gdb up?

>
> I have not been able to identify why the NMI was triggered because the
> stack is completely cleared. I tried to record execution in gdb,
> but it stops with the following message:

NMIs are used for cpu shootdown of the non-crashing cpus.  Again, this
is not touched by the series.

~Andrew

>
> cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
>     at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
> 108             clear_bit(cpumask_check(cpu), dstp->bits);
> Process record: failed to record execution log.
>
> Do you know how to find out why the NMI was triggered?
>
> I can almost always reproduce this issue as follows:
>   - boot Xen,
>   - load panic kernel,
>   - echo c > /proc/sysrq-trigger,
>   - reboot from command line,
>   - boot Xen,
>   - load panic kernel,
>   - echo c > /proc/sysrq-trigger.
>
> Additionally, my compiler fails because it detects an unused result
> variable in xen/common/kimage.c:kimage_crash_alloc().
>
> Daniel
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:25   ` [Xen-devel] " Andrew Cooper
@ 2013-11-07 21:41     ` Daniel Kiper
  2013-11-07 21:41     ` [Xen-devel] " Daniel Kiper
  1 sibling, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-07 21:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: kexec, David Vrabel, Jan Beulich, xen-devel

On Thu, Nov 07, 2013 at 09:25:33PM +0000, Andrew Cooper wrote:
> On 07/11/13 21:16, Daniel Kiper wrote:
> > On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
> >> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> >> responsible for loading and relocating the image.  This allows kexec
> >> to be usable by pv-ops kernels and should allow kexec to be usable
> >> from a HVM or PVH privileged domain.
> >>
> >> I have now tested this with a Linux kernel image using the VGA console
> >> which was what was causing problems in v9 (this turned out to be a
> >> kexec-tools bug).
> >>
> >> The required patch series for kexec-tools will be posted shortly and
> >> are available from the xen-v7 branch of:
> > In general it works. However, quite often I am not able to execute the panic
> > kernel. The machine hangs with the following message:
> >
> > (XEN) Domain 0 crashed: Executing crash image
> >
> > gdb shows:
> >
> > (gdb) bt
> > #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> > #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> > #2  0x0000000000000000 in ?? ()
> > (gdb)
> >
> > Especially the second bt line scares me... ;-)))
>
> Why? This is completely normal.  If you look in crash.c at that line, it
> is a for (;;) halt(); loop.

I was referring more to this:

#1  0xffff82d0802281d9 in nmi_crash () at entry.S:666

Look at the end of this line... ;-)))

> How are you hooking gdb up?

I am doing tests in QEMU and using QEMU's -gdb option.

> > I have not been able to identify why the NMI was triggered because the
> > stack is completely cleared. I tried to record execution in gdb,
> > but it stops with the following message:
>
> NMIs are used for cpu shootdown of the non-crashing cpus.  Again, this
> is not touched by the series.

Ahh... That makes sense. However, why does the machine hang at this stage?
Hmmm... Does the CPU sending the NMIs receive one itself and halt instead
of ignoring it?

Daniel

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:41     ` [Xen-devel] " Daniel Kiper
@ 2013-11-07 21:57       ` Andrew Cooper
  2013-11-07 21:57       ` [Xen-devel] " Andrew Cooper
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 99+ messages in thread
From: Andrew Cooper @ 2013-11-07 21:57 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, David Vrabel, Jan Beulich, xen-devel

On 07/11/2013 21:41, Daniel Kiper wrote:
> On Thu, Nov 07, 2013 at 09:25:33PM +0000, Andrew Cooper wrote:
>> On 07/11/13 21:16, Daniel Kiper wrote:
>>> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>>>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>>>> responsible for loading and relocating the image.  This allows kexec
>>>> to be usable by pv-ops kernels and should allow kexec to be usable
>>>> from a HVM or PVH privileged domain.
>>>>
>>>> I have now tested this with a Linux kernel image using the VGA console
>>>> which was what was causing problems in v9 (this turned out to be a
>>>> kexec-tools bug).
>>>>
>>>> The required patch series for kexec-tools will be posted shortly and
>>>> are available from the xen-v7 branch of:
>>> In general it works. However, quite often I am not able to execute the panic
>>> kernel. The machine hangs with the following message:
>>>
>>> (XEN) Domain 0 crashed: Executing crash image
>>>
>>> gdb shows:
>>>
>>> (gdb) bt
>>> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
>>> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>>> #2  0x0000000000000000 in ?? ()
>>> (gdb)
>>>
>>> Especially the second bt line scares me... ;-)))
>> Why? This is completely normal.  If you look in crash.c at that line, it
>> is a for (;;) halt(); loop.
> I was referring more to this:
>
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>
> Look at the end of this line... ;-)))

Which line and what about it?  In current master, that is a SAVE_ALL,
but as the call to do_nmi_crash has happened, I presume
0xffff82d0802281d9 is a ud2 instruction in your tree?

>
>> How are you hooking gdb up?
> I am doing tests in QEMU and using QEMU's -gdb option.
>
>>> I have not been able to identify why the NMI was triggered because the
>>> stack is completely cleared. I tried to record execution in gdb,
>>> but it stops with the following message:
>> NMIs are used for cpu shootdown of the non-crashing cpus.  Again, this
>> is not touched by the series.
> Ahh... That makes sense. However, why does the machine hang at this stage?
> Hmmm... Does the CPU sending the NMIs receive one itself and halt instead
> of ignoring it?
>
> Daniel

No - there is very clear protection against racing down the crash path.
The crashing CPU forces all other CPUs into nmi_crash(), where they
stay until reset.

The one CPU that is not executing nmi_crash() is the one that ends up
executing the crash image.

~Andrew
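
Andrew's point is that only one CPU can win the race onto the crash path, and every other CPU is parked in nmi_crash(). The following is a minimal sketch of such a single-winner guard using a C11 compare-and-swap. The names `crashing_cpu`, `claim_crash_path()` and `crash_owner()` are hypothetical, and Xen's actual mechanism may differ:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NO_CPU (-1)

/* Hypothetical record of which CPU owns the crash path. */
static atomic_int crashing_cpu = NO_CPU;

/*
 * Atomically claim the crash path for `cpu`. Exactly one caller ever
 * sees true. Any later caller sees false and, in the scheme described
 * above, would halt forever (as nmi_crash() does) instead of also
 * trying to execute the crash image.
 */
static bool claim_crash_path(int cpu)
{
    int expected = NO_CPU;

    return atomic_compare_exchange_strong(&crashing_cpu, &expected, cpu);
}

/* Which CPU won the race, or NO_CPU if no crash has started. */
static int crash_owner(void)
{
    return atomic_load(&crashing_cpu);
}
```

The compare-and-swap succeeds only while the value is still NO_CPU. In this sketch, a CPU that arrives after the owner has been recorded therefore cannot claim the path as well, and the owner itself is never redirected into the halt loop.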

* Re: [PATCH 3/9] kexec: add infrastructure for handling kexec images
  2013-11-07 20:40   ` [Xen-devel] " Don Slutz
@ 2013-11-07 23:51       ` Don Slutz
  0 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 23:51 UTC (permalink / raw)
  To: Don Slutz, David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 1642 bytes --]

Sigh, my build just stopped.

kimage.c: In function 'kimage_crash_alloc':
kimage.c:222:9: error: unused variable 'result' [-Werror=unused-variable]
cc1: all warnings being treated as errors

A late change from v9 to v10 missed the removal of this variable.

Here is what I did to fix:

From 09587856fa36ae38a500e218979f7111cb4546f4 Mon Sep 17 00:00:00 2001
From: Don Slutz <dslutz@verizon.com>
Date: Thu, 7 Nov 2013 18:46:23 -0500
Subject: [PATCH] kexec: remove result.

kimage.c: In function 'kimage_crash_alloc':
kimage.c:222:9: error: unused variable 'result' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Signed-off-by: Don Slutz <dslutz@verizon.com>
---
 xen/common/kimage.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 10fb785..5c3e3b3 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -219,7 +219,6 @@ static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
                               xen_kexec_segment_t *segments)
 {
     unsigned long i;
-    int result;
 
     /* Verify we have a valid entry point */
     if ( (entry < kexec_crash_area.start)
-- 
1.7.1

     -Don Slutz

On 11/07/13 15:40, Don Slutz wrote:
> For what it is worth.
>
> Reviewed-by: Don Slutz <dslutz@verizon.com>
>     -Don Slutz
>
> On 11/06/13 09:49, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> Add the code needed to handle and load kexec images into Xen memory or
>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>> KEXEC_CMD_unload hypercall sub-ops.
>>

[...]



* Re: [Xen-devel] [PATCH 3/9] kexec: add infrastructure for handling kexec images
@ 2013-11-07 23:51       ` Don Slutz
  0 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-07 23:51 UTC (permalink / raw)
  To: Don Slutz, David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 1642 bytes --]

Sigh, my build just stopped.

kimage.c: In function 'kimage_crash_alloc':
kimage.c:222:9: error: unused variable 'result' [-Werror=unused-variable]
cc1: all warnings being treated as errors

A late change from v9 to v10 missed the removal of this variable.

Here is what I did to fix:

From 09587856fa36ae38a500e218979f7111cb4546f4 Mon Sep 17 00:00:00 2001
From: Don Slutz <dslutz@verizon.com>
Date: Thu, 7 Nov 2013 18:46:23 -0500
Subject: [PATCH] kexec: remove result.

kimage.c: In function 'kimage_crash_alloc':
kimage.c:222:9: error: unused variable 'result' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Signed-off-by: Don Slutz <dslutz@verizon.com>
---
 xen/common/kimage.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/xen/common/kimage.c b/xen/common/kimage.c
index 10fb785..5c3e3b3 100644
--- a/xen/common/kimage.c
+++ b/xen/common/kimage.c
@@ -219,7 +219,6 @@ static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
                               xen_kexec_segment_t *segments)
 {
     unsigned long i;
-    int result;
 
     /* Verify we have a valid entry point */
     if ( (entry < kexec_crash_area.start)
-- 
1.7.1

     -Don Slutz

On 11/07/13 15:40, Don Slutz wrote:
> For what it is worth.
>
> Reviewed-by: Don Slutz <dslutz@verizon.com>
>     -Don Slutz
>
> On 11/06/13 09:49, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> Add the code needed to handle and load kexec images into Xen memory or
>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>> KEXEC_CMD_unload hypercall sub-ops.
>>

[...]



[-- Attachment #3: Type: text/plain, Size: 143 bytes --]

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-06 14:49 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
  2013-11-07 20:40   ` [Xen-devel] " Don Slutz
  2013-11-07 20:40   ` Don Slutz
@ 2013-11-08 12:50   ` David Vrabel
  2013-11-11 14:37     ` Don Slutz
  2013-11-15 14:35     ` Jan Beulich
  2 siblings, 2 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-08 12:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Daniel Kiper, David Vrabel, Jan Beulich

Add the code needed to handle and load kexec images into Xen memory or
into the crash region.  This is needed for the new KEXEC_CMD_load and
KEXEC_CMD_unload hypercall sub-ops.

Much of this code is derived from the Linux kernel.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/Makefile      |    1 +
 xen/common/kimage.c      |  820 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/kimage.h |   62 ++++
 3 files changed, 883 insertions(+), 0 deletions(-)
 create mode 100644 xen/common/kimage.c
 create mode 100644 xen/include/xen/kimage.h

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 686f7a1..3683ae3 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -13,6 +13,7 @@ obj-y += irq.o
 obj-y += kernel.o
 obj-y += keyhandler.o
 obj-$(HAS_KEXEC) += kexec.o
+obj-$(HAS_KEXEC) += kimage.o
 obj-y += lib.o
 obj-y += memory.o
 obj-y += multicall.o
diff --git a/xen/common/kimage.c b/xen/common/kimage.c
new file mode 100644
index 0000000..cba4458
--- /dev/null
+++ b/xen/common/kimage.c
@@ -0,0 +1,820 @@
+/*
+ * Kexec Image
+ *
+ * Copyright (C) 2013 Citrix Systems R&D Ltd.
+ *
+ * Derived from kernel/kexec.c from Linux:
+ *
+ *   Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+#include <xen/errno.h>
+#include <xen/spinlock.h>
+#include <xen/guest_access.h>
+#include <xen/mm.h>
+#include <xen/kexec.h>
+#include <xen/kimage.h>
+
+#include <asm/page.h>
+
+/*
+ * When kexec transitions to the new kernel there is a one-to-one
+ * mapping between physical and virtual addresses.  On processors
+ * where you can disable the MMU this is trivial, and easy.  For
+ * others it is still a simple predictable page table to setup.
+ *
+ * The code for the transition from the current kernel to the new
+ * kernel is placed in the page-size control_code_buffer.  This memory
+ * must be identity mapped in the transition from virtual to physical
+ * addresses.
+ *
+ * The assembly stub in the control code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages.  As this data
+ * structure is not used in the context of the current OS, it must
+ * be self-contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it).  The end product of this is that most of the
+ * physical address space, and most of RAM can be used.
+ *
+ * Future directions include:
+ *  - allocating a page table with the control code buffer identity
+ *    mapped, to simplify machine_kexec and make kexec_on_panic more
+ *    reliable.
+ */
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address, used when
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+/*
+ * Offset of the last entry in an indirection page.
+ */
+#define KIMAGE_LAST_ENTRY (PAGE_SIZE/sizeof(kimage_entry_t) - 1)
+
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start, paddr_t end);
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t dest);
+
+static struct page_info *kimage_alloc_zeroed_page(unsigned memflags)
+{
+    struct page_info *page;
+
+    page = alloc_domheap_page(NULL, memflags);
+    if ( !page )
+        return NULL;
+
+    clear_domain_page(page_to_mfn(page));
+
+    return page;
+}
+
+static int do_kimage_alloc(struct kexec_image **rimage, paddr_t entry,
+                           unsigned long nr_segments,
+                           xen_kexec_segment_t *segments, uint8_t type)
+{
+    struct kexec_image *image;
+    unsigned long i;
+    int result;
+
+    /* Allocate a controlling structure */
+    result = -ENOMEM;
+    image = xzalloc(typeof(*image));
+    if ( !image )
+        goto out;
+
+    image->entry_maddr = entry;
+    image->type = type;
+    image->nr_segments = nr_segments;
+    image->segments = segments;
+
+    image->next_crash_page = kexec_crash_area.start;
+
+    INIT_PAGE_LIST_HEAD(&image->control_pages);
+    INIT_PAGE_LIST_HEAD(&image->dest_pages);
+    INIT_PAGE_LIST_HEAD(&image->unusable_pages);
+
+    /*
+     * Verify we have good destination addresses.  The caller is
+     * responsible for making certain we don't attempt to load the new
+     * image into invalid or reserved areas of RAM.  This just
+     * verifies it is an address we can use.
+     *
+     * Since the kernel does everything in page size chunks ensure the
+     * destination addresses are page aligned.  Too many special cases
+     * crop up when we don't do this.  The most insidious is getting
+     * overlapping destination addresses simply because addresses are
+     * changed to page size granularity.
+     */
+    result = -EADDRNOTAVAIL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        if ( (mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK) )
+            goto out;
+    }
+
+    /*
+     * Verify our destination addresses do not overlap.  If we allowed
+     * overlapping destination addresses through, very weird things can
+     * happen with no easy explanation as one segment stomps on
+     * another.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+        unsigned long j;
+
+        mstart = image->segments[i].dest_maddr;
+        mend   = mstart + image->segments[i].dest_size;
+        for ( j = 0; j < i; j++ )
+        {
+            paddr_t pstart, pend;
+            pstart = image->segments[j].dest_maddr;
+            pend   = pstart + image->segments[j].dest_size;
+            /* Do the segments overlap? */
+            if ( (mend > pstart) && (mstart < pend) )
+                goto out;
+        }
+    }
+
+    /*
+     * Ensure our buffer sizes are no larger than our memory
+     * sizes.  This should always be the case, and it is easier to
+     * check up front than to be surprised later on.
+     */
+    result = -EINVAL;
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        if ( image->segments[i].buf_size > image->segments[i].dest_size )
+            goto out;
+    }
+
+    /* 
+     * Page for the relocation code must still be accessible after the
+     * processor has switched to 32-bit mode.
+     */
+    result = -ENOMEM;
+    image->control_code_page = kimage_alloc_control_page(image, MEMF_bits(32));
+    if ( !image->control_code_page )
+        goto out;
+
+    /* Add an empty indirection page. */
+    image->entry_page = kimage_alloc_control_page(image, 0);
+    if ( !image->entry_page )
+        goto out;
+
+    image->head = page_to_maddr(image->entry_page);
+
+    result = 0;
+out:
+    if ( result == 0 )
+        *rimage = image;
+    else if ( image )
+    {
+        image->segments = NULL; /* caller frees segments after an error */
+        kimage_free(image);
+    }
+
+    return result;
+
+}
+
+static int kimage_normal_alloc(struct kexec_image **rimage, paddr_t entry,
+                               unsigned long nr_segments,
+                               xen_kexec_segment_t *segments)
+{
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_DEFAULT);
+}
+
+static int kimage_crash_alloc(struct kexec_image **rimage, paddr_t entry,
+                              unsigned long nr_segments,
+                              xen_kexec_segment_t *segments)
+{
+    unsigned long i;
+
+    /* Verify we have a valid entry point */
+    if ( (entry < kexec_crash_area.start)
+         || (entry > kexec_crash_area.start + kexec_crash_area.size))
+        return -EADDRNOTAVAIL;
+
+    /*
+     * Verify we have good destination addresses.  Normally
+     * the caller is responsible for making certain we don't
+     * attempt to load the new image into invalid or reserved
+     * areas of RAM.  But crash kernels are preloaded into a
+     * reserved area of ram.  We must ensure the addresses
+     * are in the reserved area otherwise preloading the
+     * kernel could corrupt things.
+     */
+    for ( i = 0; i < nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        if ( guest_handle_is_null(segments[i].buf.h) )
+            continue;
+
+        mstart = segments[i].dest_maddr;
+        mend = mstart + segments[i].dest_size;
+        /* Ensure we are within the crash kernel limits. */
+        if ( (mstart < kexec_crash_area.start)
+             || (mend > kexec_crash_area.start + kexec_crash_area.size))
+            return -EADDRNOTAVAIL;
+    }
+
+    /* Allocate and initialize a controlling structure. */
+    return do_kimage_alloc(rimage, entry, nr_segments, segments,
+                           KEXEC_TYPE_CRASH);
+}
+
+static int kimage_is_destination_range(struct kexec_image *image,
+                                       paddr_t start,
+                                       paddr_t end)
+{
+    unsigned long i;
+
+    for ( i = 0; i < image->nr_segments; i++ )
+    {
+        paddr_t mstart, mend;
+
+        mstart = image->segments[i].dest_maddr;
+        mend = mstart + image->segments[i].dest_size;
+        if ( (end > mstart) && (start < mend) )
+            return 1;
+    }
+
+    return 0;
+}
+
+static void kimage_free_page_list(struct page_list_head *list)
+{
+    struct page_info *page, *next;
+
+    page_list_for_each_safe(page, next, list)
+    {
+        page_list_del(page, list);
+        free_domheap_page(page);
+    }
+}
+
+static struct page_info *kimage_alloc_normal_control_page(
+    struct kexec_image *image, unsigned memflags)
+{
+    /*
+     * Control pages are special, they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * At worst this runs in O(N) of the image size.
+     */
+    struct page_list_head extra_pages;
+    struct page_info *page = NULL;
+
+    INIT_PAGE_LIST_HEAD(&extra_pages);
+
+    /*
+     * Loop while I can allocate a page and the page allocated is a
+     * destination page.
+     */
+    do {
+        unsigned long mfn, emfn;
+        paddr_t addr, eaddr;
+
+        page = kimage_alloc_zeroed_page(memflags);
+        if ( !page )
+            break;
+        mfn   = page_to_mfn(page);
+        emfn  = mfn + 1;
+        addr  = page_to_maddr(page);
+        eaddr = addr + PAGE_SIZE;
+        if ( kimage_is_destination_range(image, addr, eaddr) )
+        {
+            page_list_add(page, &extra_pages);
+            page = NULL;
+        }
+    } while ( !page );
+
+    if ( page )
+    {
+        /* Remember the allocated page... */
+        page_list_add(page, &image->control_pages);
+
+        /*
+         * Because the page is already in its destination location we
+         * will never allocate another page at that address.
+         * Therefore kimage_alloc_page will not return it (again) and
+         * we don't need to give it an entry in image->segments[].
+         */
+    }
+    /*
+     * Deal with the destination pages I have inadvertently allocated.
+     *
+     * Ideally I would convert multi-page allocations into single page
+     * allocations, and add everything to image->dest_pages.
+     *
+     * For now it is simpler to just free the pages.
+     */
+    kimage_free_page_list(&extra_pages);
+
+    return page;
+}
+
+static struct page_info *kimage_alloc_crash_control_page(struct kexec_image *image)
+{
+    /*
+     * Control pages are special, they are the intermediaries that are
+     * needed while we copy the rest of the pages to their final
+     * resting place.  As such they must not conflict with either the
+     * destination addresses or memory the kernel is already using.
+     *
+     * Control pages are also the only pages we must allocate when
+     * loading a crash kernel.  All of the other pages are specified
+     * by the segments and we just memcpy into them directly.
+     *
+     * The only case where we really need more than one of these is
+     * for architectures where we cannot disable the MMU and must
+     * instead generate an identity mapped page table for all of the
+     * memory.
+     *
+     * Given the low demand this implements a very simple allocator
+     * that finds the first hole of the appropriate size in the
+     * reserved memory region, and allocates all of the memory up to
+     * and including the hole.
+     */
+    paddr_t hole_start, hole_end;
+    struct page_info *page = NULL;
+
+    hole_start = PAGE_ALIGN(image->next_crash_page);
+    hole_end   = hole_start + PAGE_SIZE;
+    while ( hole_end <= kexec_crash_area.start + kexec_crash_area.size )
+    {
+        unsigned long i;
+
+        /* See if I overlap any of the segments. */
+        for ( i = 0; i < image->nr_segments; i++ )
+        {
+            paddr_t mstart, mend;
+
+            mstart = image->segments[i].dest_maddr;
+            mend   = mstart + image->segments[i].dest_size;
+            if ( (hole_end > mstart) && (hole_start < mend) )
+            {
+                /* Advance the hole to the end of the segment. */
+                hole_start = PAGE_ALIGN(mend);
+                hole_end   = hole_start + PAGE_SIZE;
+                break;
+            }
+        }
+        /* If I don't overlap any segments I have found my hole! */
+        if ( i == image->nr_segments )
+        {
+            page = maddr_to_page(hole_start);
+            break;
+        }
+    }
+    if ( page )
+    {
+        image->next_crash_page = hole_end;
+        clear_domain_page(page_to_mfn(page));
+    }
+
+    return page;
+}
+
+
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags)
+{
+    struct page_info *pages = NULL;
+
+    switch ( image->type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        pages = kimage_alloc_normal_control_page(image, memflags);
+        break;
+    case KEXEC_TYPE_CRASH:
+        pages = kimage_alloc_crash_control_page(image);
+        break;
+    }
+    return pages;
+}
+
+static int kimage_add_entry(struct kexec_image *image, kimage_entry_t entry)
+{
+    kimage_entry_t *entries;
+
+    if ( image->next_entry == KIMAGE_LAST_ENTRY )
+    {
+        struct page_info *page;
+
+        page = kimage_alloc_page(image, KIMAGE_NO_DEST);
+        if ( !page )
+            return -ENOMEM;
+
+        entries = __map_domain_page(image->entry_page);
+        entries[image->next_entry] = page_to_maddr(page) | IND_INDIRECTION;
+        unmap_domain_page(entries);
+
+        image->entry_page = page;
+        image->next_entry = 0;
+    }
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = entry;
+    image->next_entry++;
+    unmap_domain_page(entries);
+
+    return 0;
+}
+
+static int kimage_set_destination(struct kexec_image *image,
+                                  paddr_t destination)
+{
+    return kimage_add_entry(image, (destination & PAGE_MASK) | IND_DESTINATION);
+}
+
+
+static int kimage_add_page(struct kexec_image *image, paddr_t maddr)
+{
+    return kimage_add_entry(image, (maddr & PAGE_MASK) | IND_SOURCE);
+}
+
+
+static void kimage_free_extra_pages(struct kexec_image *image)
+{
+    kimage_free_page_list(&image->dest_pages);
+    kimage_free_page_list(&image->unusable_pages);
+}
+
+static void kimage_terminate(struct kexec_image *image)
+{
+    kimage_entry_t *entries;
+
+    entries = __map_domain_page(image->entry_page);
+    entries[image->next_entry] = IND_DONE;
+    unmap_domain_page(entries);
+}
+
+/*
+ * Iterate over all the entries in the indirection pages.
+ *
+ * Call unmap_domain_page(ptr) after the loop exits.
+ */
+#define for_each_kimage_entry(image, ptr, entry)                        \
+    for ( ptr = map_domain_page(image->head >> PAGE_SHIFT);             \
+          (entry = *ptr) && !(entry & IND_DONE);                        \
+          ptr = (entry & IND_INDIRECTION) ?                             \
+              (unmap_domain_page(ptr), map_domain_page(entry >> PAGE_SHIFT)) \
+              : ptr + 1 )
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+    struct page_info *page;
+
+    page = mfn_to_page(entry >> PAGE_SHIFT);
+    free_domheap_page(page);
+}
+
+static void kimage_free_all_entries(struct kexec_image *image)
+{
+    kimage_entry_t *ptr, entry;
+    kimage_entry_t ind = 0;
+
+    if ( !image->head )
+        return;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_INDIRECTION )
+        {
+            /* Free the previous indirection page */
+            if ( ind & IND_INDIRECTION )
+                kimage_free_entry(ind);
+            /* Save this indirection page until we are done with it. */
+            ind = entry;
+        }
+        else if ( entry & IND_SOURCE )
+            kimage_free_entry(entry);
+    }
+    unmap_domain_page(ptr);
+
+    /* Free the final indirection page. */
+    if ( ind & IND_INDIRECTION )
+        kimage_free_entry(ind);
+}
+
+void kimage_free(struct kexec_image *image)
+{
+    if ( !image )
+        return;
+
+    kimage_free_extra_pages(image);
+    kimage_free_all_entries(image);
+    kimage_free_page_list(&image->control_pages);
+    xfree(image->segments);
+    xfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kexec_image *image,
+                                       paddr_t maddr)
+{
+    kimage_entry_t *ptr, entry;
+    unsigned long destination = 0;
+
+    for_each_kimage_entry(image, ptr, entry)
+    {
+        if ( entry & IND_DESTINATION )
+            destination = entry & PAGE_MASK;
+        else if ( entry & IND_SOURCE )
+        {
+            if ( maddr == destination )
+                return ptr;
+            destination += PAGE_SIZE;
+        }
+    }
+    unmap_domain_page(ptr);
+
+    return NULL;
+}
+
+static struct page_info *kimage_alloc_page(struct kexec_image *image,
+                                           paddr_t destination)
+{
+    /*
+     * Here we implement safeguards to ensure that a source page is
+     * not copied to its destination page before the data on the
+     * destination page is no longer useful.
+     *
+     * To do this we maintain the invariant that a source page is
+     * either its own destination page, or it is not a destination
+     * page at all.
+     *
+     * That is slightly stronger than required, but the proof that no
+     * problems will occur is trivial, and the implementation is
+     * simple to verify.
+     *
+     * When allocating all pages normally this algorithm will run in
+     * O(N) time, but in the worst case it will run in O(N^2) time.
+     * If the runtime is a problem the data structures can be fixed.
+     */
+    struct page_info *page;
+    paddr_t addr;
+
+    /*
+     * Walk through the list of destination pages, and see if I have a
+     * match.
+     */
+    page_list_for_each(page, &image->dest_pages)
+    {
+        addr = page_to_maddr(page);
+        if ( addr == destination )
+        {
+            page_list_del(page, &image->dest_pages);
+            return page;
+        }
+    }
+    page = NULL;
+    for (;;)
+    {
+        kimage_entry_t *old;
+
+        /* Allocate a page, if we run out of memory give up. */
+        page = kimage_alloc_zeroed_page(0);
+        if ( !page )
+            return NULL;
+        addr = page_to_maddr(page);
+
+        /* If it is the destination page we want use it. */
+        if ( addr == destination )
+            break;
+
+        /* If the page is not a destination page use it. */
+        if ( !kimage_is_destination_range(image, addr,
+                                          addr + PAGE_SIZE) )
+            break;
+
+        /*
+         * I know that the page is someone's destination page.  See if
+         * there is already a source page for this destination page.
+         * And if so swap the source pages.
+         */
+        old = kimage_dst_used(image, addr);
+        if ( old )
+        {
+            /* If so move it. */
+            unsigned long old_mfn = *old >> PAGE_SHIFT;
+            unsigned long mfn = addr >> PAGE_SHIFT;
+
+            copy_domain_page(mfn, old_mfn);
+            clear_domain_page(old_mfn);
+            *old = (addr & PAGE_MASK) | IND_SOURCE;
+            unmap_domain_page(old);
+
+            page = mfn_to_page(old_mfn);
+            break;
+        }
+        else
+        {
+            /*
+             * Place the page on the destination list; I will use it
+             * later.
+             */
+            page_list_add(page, &image->dest_pages);
+        }
+    }
+    return page;
+}
+
+static int kimage_load_normal_segment(struct kexec_image *image,
+                                      xen_kexec_segment_t *segment)
+{
+    unsigned long to_copy;
+    unsigned long src_offset;
+    paddr_t dest, end;
+    int ret;
+
+    to_copy = segment->buf_size;
+    src_offset = 0;
+    dest = segment->dest_maddr;
+
+    ret = kimage_set_destination(image, dest);
+    if ( ret < 0 )
+        return ret;
+
+    while ( to_copy )
+    {
+        unsigned long dest_mfn;
+        struct page_info *page;
+        void *dest_va;
+        size_t size;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        size = min_t(unsigned long, PAGE_SIZE, to_copy);
+
+        page = kimage_alloc_page(image, dest);
+        if ( !page )
+            return -ENOMEM;
+        ret = kimage_add_page(image, page_to_maddr(page));
+        if ( ret < 0 )
+            return ret;
+
+        dest_va = __map_domain_page(page);
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, size);
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        to_copy -= size;
+        src_offset += size;
+        dest += PAGE_SIZE;
+    }
+
+    /* Remainder of the destination should be zeroed. */
+    end = segment->dest_maddr + segment->dest_size;
+    for ( ; dest < end; dest += PAGE_SIZE )
+        kimage_add_entry(image, IND_ZERO);
+
+    return 0;
+}
+
+static int kimage_load_crash_segment(struct kexec_image *image,
+                                     xen_kexec_segment_t *segment)
+{
+    /*
+     * For crash dump kernels we simply copy the data from user space
+     * to its destination.
+     */
+    paddr_t dest;
+    unsigned long sbytes, dbytes;
+    int ret = 0;
+    unsigned long src_offset = 0;
+
+    sbytes = segment->buf_size;
+    dbytes = segment->dest_size;
+    dest = segment->dest_maddr;
+
+    while ( dbytes )
+    {
+        unsigned long dest_mfn;
+        void *dest_va;
+        size_t schunk, dchunk;
+
+        dest_mfn = dest >> PAGE_SHIFT;
+
+        dchunk = PAGE_SIZE;
+        schunk = min(dchunk, sbytes);
+
+        dest_va = map_domain_page(dest_mfn);
+        if ( !dest_va )
+            return -EINVAL;
+
+        ret = copy_from_guest_offset(dest_va, segment->buf.h, src_offset, schunk);
+        memset(dest_va + schunk, 0, dchunk - schunk);
+
+        unmap_domain_page(dest_va);
+        if ( ret )
+            return -EFAULT;
+
+        dbytes -= dchunk;
+        sbytes -= schunk;
+        dest += dchunk;
+        src_offset += schunk;
+    }
+
+    return 0;
+}
+
+static int kimage_load_segment(struct kexec_image *image, xen_kexec_segment_t *segment)
+{
+    int result = -ENOMEM;
+
+    if ( !guest_handle_is_null(segment->buf.h) )
+    {
+        switch ( image->type )
+        {
+        case KEXEC_TYPE_DEFAULT:
+            result = kimage_load_normal_segment(image, segment);
+            break;
+        case KEXEC_TYPE_CRASH:
+            result = kimage_load_crash_segment(image, segment);
+            break;
+        }
+    }
+
+    return result;
+}
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment)
+{
+    int result;
+
+    switch ( type )
+    {
+    case KEXEC_TYPE_DEFAULT:
+        result = kimage_normal_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    case KEXEC_TYPE_CRASH:
+        result = kimage_crash_alloc(rimage, entry_maddr, nr_segments, segment);
+        break;
+    default:
+        result = -EINVAL;
+        break;
+    }
+    if ( result < 0 )
+        return result;
+
+    (*rimage)->arch = arch;
+
+    return result;
+}
+
+int kimage_load_segments(struct kexec_image *image)
+{
+    int s;
+    int result;
+
+    for ( s = 0; s < image->nr_segments; s++ ) {
+        result = kimage_load_segment(image, &image->segments[s]);
+        if ( result < 0 )
+            return result;
+    }
+    kimage_terminate(image);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/kimage.h b/xen/include/xen/kimage.h
new file mode 100644
index 0000000..0ebd37a
--- /dev/null
+++ b/xen/include/xen/kimage.h
@@ -0,0 +1,62 @@
+#ifndef __XEN_KIMAGE_H__
+#define __XEN_KIMAGE_H__
+
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+#define IND_ZERO        0x10
+
+#ifndef __ASSEMBLY__
+
+#include <xen/list.h>
+#include <xen/mm.h>
+#include <public/kexec.h>
+
+#define KEXEC_SEGMENT_MAX 16
+
+typedef paddr_t kimage_entry_t;
+
+struct kexec_image {
+    uint8_t type;
+    uint16_t arch;
+    uint64_t entry_maddr;
+    uint32_t nr_segments;
+    xen_kexec_segment_t *segments;
+
+    kimage_entry_t head;
+    struct page_info *entry_page;
+    unsigned next_entry;
+
+    struct page_info *control_code_page;
+    struct page_info *aux_page;
+
+    struct page_list_head control_pages;
+    struct page_list_head dest_pages;
+    struct page_list_head unusable_pages;
+
+    /* Address of next control page to allocate for crash kernels. */
+    paddr_t next_crash_page;
+};
+
+int kimage_alloc(struct kexec_image **rimage, uint8_t type, uint16_t arch,
+                 uint64_t entry_maddr,
+                 uint32_t nr_segments, xen_kexec_segment_t *segment);
+void kimage_free(struct kexec_image *image);
+int kimage_load_segments(struct kexec_image *image);
+struct page_info *kimage_alloc_control_page(struct kexec_image *image,
+                                            unsigned memflags);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* __XEN_KIMAGE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.2.5
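The entry list managed by kimage_add_entry(), kimage_terminate() and for_each_kimage_entry() is easiest to follow with a concrete run. Each kimage_entry_t is a page-aligned machine address with an IND_* flag in its low bits, and the last slot of every indirection page is reserved for an IND_INDIRECTION link to the next one; with 4 KiB pages and a 64-bit paddr_t, KIMAGE_LAST_ENTRY is 4096 / 8 - 1 = 511. The C below is a user-space sketch, not code from this series. It assumes a hypothetical 16-page machine with 64-byte pages (so only 7 data slots per indirection page), and the toy_* functions, mem, BYTES() and ENTRY() are invented for illustration. How the walker acts on each entry (copy a source page to the current destination, clear a page for IND_ZERO, advancing the destination after either) is an assumed model of the relocation code, which this patch does not contain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Entry flags, as in xen/include/xen/kimage.h from PATCHv11 3/9. */
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8
#define IND_ZERO        0x10
typedef uint64_t kimage_entry_t;

/* Hypothetical toy machine: 16 pages of 64 bytes, addressed by offset.
 * TOY_LAST_ENTRY uses the KIMAGE_LAST_ENTRY formula: 64 / 8 - 1 = 7. */
#define TOY_PAGE_SIZE   64u
#define TOY_PAGE_MASK   (~(uint64_t)(TOY_PAGE_SIZE - 1))
#define TOY_LAST_ENTRY  (TOY_PAGE_SIZE / sizeof(kimage_entry_t) - 1)
static kimage_entry_t mem[16 * TOY_PAGE_SIZE / sizeof(kimage_entry_t)];
#define BYTES(addr)  ((uint8_t *)mem + (addr))               /* byte view */
#define ENTRY(addr)  (mem[(addr) / sizeof(kimage_entry_t)])  /* entry view */

struct toy_image {
    uint64_t entry_page;  /* indirection page being filled (head is 0) */
    unsigned next_entry;  /* next free slot in entry_page */
    uint64_t next_free;   /* bump allocator for fresh pages */
};

static uint64_t toy_alloc_page(struct toy_image *img)
{
    uint64_t page = img->next_free;

    assert(page + TOY_PAGE_SIZE <= sizeof(mem));
    img->next_free += TOY_PAGE_SIZE;
    memset(BYTES(page), 0, TOY_PAGE_SIZE);
    return page;
}

/* Like kimage_add_entry(): when only the last slot is left, it becomes an
 * IND_INDIRECTION link to a freshly allocated page. */
static void toy_add_entry(struct toy_image *img, kimage_entry_t entry)
{
    if (img->next_entry == TOY_LAST_ENTRY) {
        uint64_t page = toy_alloc_page(img);
        ENTRY(img->entry_page + TOY_LAST_ENTRY * sizeof(entry)) =
            page | IND_INDIRECTION;
        img->entry_page = page;
        img->next_entry = 0;
    }
    ENTRY(img->entry_page + img->next_entry * sizeof(entry)) = entry;
    img->next_entry++;
}

/* Walks entries like for_each_kimage_entry(); the copy/zero actions are an
 * assumed model of the relocation code, which this patch does not contain. */
static void toy_relocate(uint64_t head)
{
    uint64_t ptr = head, dest = 0;

    for (;;) {
        kimage_entry_t entry = ENTRY(ptr);
        if (entry == 0 || (entry & IND_DONE))
            break;
        if (entry & IND_INDIRECTION) {
            ptr = entry & TOY_PAGE_MASK;  /* continue in the next page */
            continue;
        }
        if (entry & IND_DESTINATION) {
            dest = entry & TOY_PAGE_MASK;
        } else if (entry & IND_SOURCE) {
            memmove(BYTES(dest), BYTES(entry & TOY_PAGE_MASK), TOY_PAGE_SIZE);
            dest += TOY_PAGE_SIZE;
        } else if (entry & IND_ZERO) {
            memset(BYTES(dest), 0, TOY_PAGE_SIZE);
            dest += TOY_PAGE_SIZE;
        }
        ptr += sizeof(kimage_entry_t);
    }
}

/* Four source pages plus three zero pages, destined for toy page 8. */
static void toy_demo(void)
{
    struct toy_image img = { 0, 0, TOY_PAGE_SIZE };  /* head is page 0 */
    unsigned i;

    memset(mem, 0xff, sizeof(mem));
    memset(mem, 0, TOY_PAGE_SIZE);                  /* empty head page */
    toy_add_entry(&img, 8 * TOY_PAGE_SIZE | IND_DESTINATION);
    for (i = 0; i < 4; i++) {
        uint64_t src = toy_alloc_page(&img);        /* pages 1..4 */
        memset(BYTES(src), 'A' + i, TOY_PAGE_SIZE);
        toy_add_entry(&img, src | IND_SOURCE);
    }
    for (i = 0; i < 3; i++)
        toy_add_entry(&img, IND_ZERO);  /* the third one chains to page 5 */
    ENTRY(img.entry_page + img.next_entry * sizeof(kimage_entry_t)) = IND_DONE;
    toy_relocate(0);
}
```

In toy_demo() the eighth entry (the third IND_ZERO) does not fit in the head page's seven data slots, so toy_add_entry() writes an IND_INDIRECTION link into slot 7 and carries on in a new page, the same path kimage_add_entry() takes once image->next_entry reaches KIMAGE_LAST_ENTRY. After the walk, toy pages 8 to 11 hold 'A' to 'D' and pages 12 to 14 are cleared.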

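The validation in do_kimage_alloc() and the first-fit search in kimage_alloc_crash_control_page() from PATCHv11 3/9 both rely on the same half-open overlap test, (end > other_start) && (start < other_end). Below is a stand-alone sketch of both under stated assumptions: struct seg is a hypothetical stand-in for xen_kexec_segment_t holding only the three fields these checks read, SK_PAGE_SIZE assumes 4 KiB pages, and the end of the crash area is passed in rather than taken from kexec_crash_area. The three validation passes are kept separate, as in the patch, so a segment list with several problems fails with the same error the original would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE      4096ULL                 /* assumed 4 KiB pages */
#define SK_PAGE_ALIGN(x)  (((x) + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1))

/* Hypothetical stand-in for xen_kexec_segment_t: only the fields used here. */
struct seg {
    uint64_t dest_maddr;
    uint64_t dest_size;
    uint64_t buf_size;
};

/* [a_start, a_end) and [b_start, b_end) overlap iff each range starts
 * before the other one ends. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
                           uint64_t b_start, uint64_t b_end)
{
    return a_end > b_start && a_start < b_end;
}

/* The three passes of do_kimage_alloc(), in the same order and with the
 * same error codes, so the first failing pass decides the result. */
static int check_segments(const struct seg *s, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {            /* page-aligned start and end */
        uint64_t end = s[i].dest_maddr + s[i].dest_size;

        if ((s[i].dest_maddr | end) & (SK_PAGE_SIZE - 1))
            return -EADDRNOTAVAIL;
    }
    for (i = 0; i < n; i++)              /* no two destinations overlap */
        for (j = 0; j < i; j++)
            if (ranges_overlap(s[i].dest_maddr, s[i].dest_maddr + s[i].dest_size,
                               s[j].dest_maddr, s[j].dest_maddr + s[j].dest_size))
                return -EINVAL;
    for (i = 0; i < n; i++)              /* buffer fits in its destination */
        if (s[i].buf_size > s[i].dest_size)
            return -EINVAL;
    return 0;
}

/* First-fit search modelled on kimage_alloc_crash_control_page(): find the
 * first page at or after *next that overlaps no segment and ends within
 * area_end, store it in *hole and advance *next past it. */
static bool find_crash_hole(const struct seg *s, size_t n, uint64_t area_end,
                            uint64_t *next, uint64_t *hole)
{
    uint64_t start, end;

    assert(next != NULL && hole != NULL);
    start = SK_PAGE_ALIGN(*next);
    end = start + SK_PAGE_SIZE;
    while (end <= area_end) {
        size_t i;

        for (i = 0; i < n; i++) {
            uint64_t ms = s[i].dest_maddr, me = ms + s[i].dest_size;

            if (ranges_overlap(start, end, ms, me)) {
                start = SK_PAGE_ALIGN(me);   /* jump past this segment */
                end = start + SK_PAGE_SIZE;
                break;
            }
        }
        if (i == n) {                        /* no overlap: found a hole */
            *hole = start;
            *next = end;
            return true;
        }
    }
    return false;
}
```

For example, with segments at [0x1000000, 0x1003000) and [0x1004000, 0x1006000) and the search starting at 0x1000000, the first control page found is 0x1003000 (the gap between the two segments) and the next is 0x1006000, just past the second segment. Moving the first segment to start at 0x1000800 makes check_segments() return -EADDRNOTAVAIL.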

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:16 ` Daniel Kiper
                     ` (2 preceding siblings ...)
  2013-11-08 13:13   ` David Vrabel
@ 2013-11-08 13:13   ` David Vrabel
  2013-11-09 19:18   ` Daniel Kiper
  2013-11-09 19:18   ` Daniel Kiper
  5 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-08 13:13 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: Keir Fraser, kexec, Jan Beulich, xen-devel

Keir,

Sorry, forgot to CC you on this series.

Can we have your opinion on whether this kexec series can be merged?
And if not, what further work and/or testing is required?

On 07/11/13 21:16, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>> responsible for loading and relocating the image.  This allows kexec
>> to be usable by pv-ops kernels and should allow kexec to be usable
>> from a HVM or PVH privileged domain.
>>
>> I have now tested this with a Linux kernel image using the VGA console
>> which was what was causing problems in v9 (this turned out to be a
>> kexec-tools bug).
>>
>> The required patch series for kexec-tools will be posted shortly and
>> are available from the xen-v7 branch of:
> 
> In general it works. However, quite often I am not able to execute the panic
> kernel. The machine hangs with the following message:

I cannot reproduce any failures, either on my dev box or on any of the
automated XenServer tests that run on a range of different hardware
platforms.  I find kexec to be very reliable and an earlier version of
this series has been in production within XenServer for a while now and
has seen real use in the field.

None of the issues reported so far have been regressions; they have been
failures in specific uses of the new support for pv-ops kernels.

I really can't see how I can do anything else to make this series
acceptable for merging.

In my opinion, the current implementation is so broken[1] and useless[2]
that anything that even vaguely looks like it might work is a significant
improvement, and something that is deployed usefully in production
should definitely be merged.

[1] Uses code provided by the guest to jump out of Xen into the image
which works only through luck. Does not work (and has never worked)
reliably with 32-bit dom0.

[2] Does not work at all (and will never work) with upstream kernels.

> (XEN) Domain 0 crashed: Executing crash image
> 
> gdb shows:
> 
> (gdb) bt
> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> #2  0x0000000000000000 in ?? ()
> (gdb)
> 
> Especially second bt line scares me... ;-)))
> 
> I have not been able to identify why NMI was activated because
> stack is completely cleared.

All this you have described here is correct and expected behavior,
which, quite frankly, you should have been able to see with even the
most cursory look at the code.

> Additionally, my compiler fails because it detects unused result
> variable in xen/common/kimage.c:kimage_crash_alloc().

Yes, sorry about that.  That was fallout from a last minute trivial
cleanup.  I've posted an updated patch correcting this.

David

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 13:13   ` David Vrabel
  2013-11-08 13:19     ` Jan Beulich
@ 2013-11-08 13:19     ` Jan Beulich
  2013-11-08 13:48     ` Daniel Kiper
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: Jan Beulich @ 2013-11-08 13:19 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, Daniel Kiper, kexec, xen-devel

>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
> Keir,
> 
> Sorry, forgot to CC you on this series.
> 
> Can we have your opinion on whether this kexec series can be merged?
> And if not, what further work and/or testing is required?

Just to clarify - unless I missed something, there was still no
review of this from Daniel or someone else known to be
familiar with the subject. If Keir gave his ack, formally this
could go in, but I wouldn't feel comfortable with that (all the
more so since, apart from not having reviewed it, Daniel also
seems to continue to have problems with it).

Jan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:41     ` [Xen-devel] " Daniel Kiper
                         ` (2 preceding siblings ...)
  2013-11-08 13:20       ` David Vrabel
@ 2013-11-08 13:20       ` David Vrabel
  3 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-08 13:20 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: Andrew Cooper, kexec, Jan Beulich, xen-devel

On 07/11/13 21:41, Daniel Kiper wrote:
> 
> I am doing tests in QEMU and using QEMU's -gdb option.

Er.  I'm not sure this is a very interesting real-world use case.  I
would suggest the failure here is more likely to be caused by bugs in
qemu's emulation.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 13:13   ` David Vrabel
  2013-11-08 13:19     ` Jan Beulich
  2013-11-08 13:19     ` Jan Beulich
@ 2013-11-08 13:48     ` Daniel Kiper
  2013-11-08 13:48     ` Daniel Kiper
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-08 13:48 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, kexec, Jan Beulich, xen-devel

On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:
> Keir,
>
> Sorry, forgot to CC you on this series.
>
> Can we have your opinion on whether this kexec series can be merged?
> And if not, what further work and/or testing is required?
>
> On 07/11/13 21:16, Daniel Kiper wrote:
> > On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
> >> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> >> responsible for loading and relocating the image.  This allows kexec
> >> to be usable by pv-ops kernels and should allow kexec to be usable
> >> from a HVM or PVH privileged domain.
> >>
> >> I have now tested this with a Linux kernel image using the VGA console
> >> which was what was causing problems in v9 (this turned out to be a
> >> kexec-tools bug).
> >>
> >> The required patch series for kexec-tools will be posted shortly and
> >> are available from the xen-v7 branch of:
> >
> > In general it works. However, quite often I am not able to execute panic
> > kernel. Machine hangs with following message:
>
> I cannot reproduce any failures, neither on my dev box nor on any of the
> automated XenServer tests that run on a range of different hardware
> platforms.  I find kexec to be very reliable and an earlier version of
> this series has been in production within XenServer for a while now and
> has seen real use in the field.
>
> None of the issues reported so far have been regressions but failures in
> specific uses of the new support for pv-ops kernels.
>
> I really can't see how I can do anything else to make this series
> acceptable for merging.

I think that in general it is OK. However, we must solve the discovered
issues or confirm that they are not problems in the current
implementation. That is all. I hope that we can finally do that next
week (FYI, Monday is a public holiday in Poland).

Additionally, we agreed that shortly after this patch series is applied
we will decide whether or not registers should be cleared before
jumping into the new image. I think that this will be done quickly too.

Daniel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 13:19     ` Jan Beulich
@ 2013-11-08 14:01       ` Andrew Cooper
  2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
  1 sibling, 0 replies; 99+ messages in thread
From: Andrew Cooper @ 2013-11-08 14:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kexec, Daniel Kiper, Keir Fraser, David Vrabel, xen-devel

On 08/11/13 13:19, Jan Beulich wrote:
>>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
>> Keir,
>>
>> Sorry, forgot to CC you on this series.
>>
>> Can we have your opinion on whether this kexec series can be merged?
>> And if not, what further work and/or testing is required?
> Just to clarify - unless I missed something, there was still no
> review of this from Daniel or someone else known to be
> familiar with the subject. If Keir gave his ack, formally this
> could go in, but I wouldn't feel too well with that (the more
> that apart from not having reviewed it, Daniel seems to also
> continue to have problems with it).
>
> Jan

Can I be deemed to be familiar with the subject as far as this
is concerned?

A noticeable quantity of my contributions to Xen have been in the kexec
/ crash areas, and I am the author of the xen-crashdump-analyser.

I do realise that I am certainly not impartial as far as this series is
concerned, being a co-developer.

David's statement that "the current implementation is so broken[1] and
useless[2] that..." is completely accurate.  It is frankly a miracle
that the current code ever worked at all (and from XenServer's point of
view, it failed far more often than it worked).

For reference, XenServer 6.2 shipped with approximately v7 of this
series, and an appropriate kexec-tools and xen-crashdump-analyser. 
Since we put the code in, we have not had a single failure-to-kexec in
automated testing (both specific crash tests, and from unexpected host
crashes), whereas we were seeing reliable failures to crash on most of
our test infrastructure.

In stark contrast to previous versions of XenServer, we have not had a
single customer-reported host crash where the kexec path has failed.
There was one systematic failure where the HPSA driver was unhappy with
the state of the hardware, resulting in no root filesystem to write logs
to, and a repeated panic and Xen deadlock in the queued invalidation
codepath.

~Andrew

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 13:48     ` Daniel Kiper
  2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
@ 2013-11-08 14:01       ` Andrew Cooper
  1 sibling, 0 replies; 99+ messages in thread
From: Andrew Cooper @ 2013-11-08 14:01 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Keir Fraser, David Vrabel, Jan Beulich, xen-devel

On 08/11/13 13:48, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:
>> Keir,
>>
>> Sorry, forgot to CC you on this series.
>>
>> Can we have your opinion on whether this kexec series can be merged?
>> And if not, what further work and/or testing is required?
>>
>> On 07/11/13 21:16, Daniel Kiper wrote:
>>> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>>>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>>>> responsible for loading and relocating the image.  This allows kexec
>>>> to be usable by pv-ops kernels and should allow kexec to be usable
>>>> from a HVM or PVH privileged domain.
>>>>
>>>> I have now tested this with a Linux kernel image using the VGA console
>>>> which was what was causing problems in v9 (this turned out to be a
>>>> kexec-tools bug).
>>>>
>>>> The required patch series for kexec-tools will be posted shortly and
>>>> are available from the xen-v7 branch of:
>>> In general it works. However, quite often I am not able to execute panic
>>> kernel. Machine hangs with following message:
>> I cannot reproduce any failures, neither on my dev box nor on any of the
>> automated XenServer tests that run on a range of different hardware
>> platforms.  I find kexec to be very reliable and an earlier version of
>> this series has been in production within XenServer for a while now and
>> has seen real use in the field.
>>
>> None of the issues reported so far have been regressions but failures in
>> specific uses of the new support for pv-ops kernels.
>>
>> I really can't see how I can do anything else to make this series
>> acceptable for merging.
> I think that in general it is OK. However, we must solve discovered
> issues or confirm that it is not a problem of current implementation.
> That is all. I hope that we finally do that next week (FYI, Monday
> is public holiday in Poland).

What outstanding issues do you think are present then?

~Andrew

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
@ 2013-11-08 14:22         ` Don Slutz
  2013-11-08 14:22         ` [Xen-devel] " Don Slutz
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-08 14:22 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Daniel Kiper, kexec, xen-devel, David Vrabel, Jan Beulich

On 11/08/13 09:01, Andrew Cooper wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
>>>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
>>> Keir,
>>>
>>> Sorry, forgot to CC you on this series.
>>>
>>> Can we have your opinion on whether this kexec series can be merged?
>>> And if not, what further work and/or testing is required?
>> Just to clarify - unless I missed something, there was still no
>> review of this from Daniel or someone else known to be
>> familiar with the subject. If Keir gave his ack, formally this
>> could go in, but I wouldn't feel too well with that (the more
>> that apart from not having reviewed it, Daniel seems to also
>> continue to have problems with it).
If I am following this correctly, Daniel is testing this by running Xen
under QEMU.  All my testing has been on bare metal.
>> Jan
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
>
> A noticeable quantity of my contributions to Xen have been in the kexec
> / crash areas, and I am the author of the xen-crashdump-analyser.
>
> I do realise that I certainly not impartial as far as this series is
> concerned, being a co-developer.
>
> Davids statement of "the current implementation is so broken[1] and
> useless[2] that..." is completely accurate.  It is frankly a miracle
> that the current code ever worked at all (and from XenServers point of
> view, failed far more often than it worked).
>
>
> For reference, XenServer 6.2 shipped with approximately v7 of this
> series, and an appropriate kexec-tools and xen-crashdump-analyser.
> Since we put the code in, we have not had a single failure-to-kexec in
> automated testing (both specific crash tests, and from unexpected host
> crashes), whereas we were seeing reliable failures to crash on most of
> our test infrastructure.
Verizon is also using an older version backported to 4.2.1, and we have
yet to see a failure to get into the crash kernel via kexec (though it is
a very small sample size: ~6 Dom0 crashes so far).  I have only done 10
crashes so far with v10+ (soon to be v11).
    -Don Slutz
> In stark contrast to previous versions of XenServer, we have not had a
> single customer reported host crash where the kexec path has failed.
> There was one systematic failure where the HPSA driver was unhappy with
> the state of the hardware, resulting in no root filesystem to write logs
> to, and a repeated panic and Xen deadlock in the queued invalidation
> codepath.
>
> ~Andrew

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
                           ` (2 preceding siblings ...)
  2013-11-08 14:36         ` Jan Beulich
@ 2013-11-08 14:36         ` Jan Beulich
  2013-11-08 15:15         ` Daniel Kiper
  2013-11-08 15:15         ` [Xen-devel] " Daniel Kiper
  5 siblings, 0 replies; 99+ messages in thread
From: Jan Beulich @ 2013-11-08 14:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Daniel Kiper, kexec, David Vrabel, xen-devel

>>> On 08.11.13 at 15:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
>>>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
>>> Keir,
>>>
>>> Sorry, forgot to CC you on this series.
>>>
>>> Can we have your opinion on whether this kexec series can be merged?
>>> And if not, what further work and/or testing is required?
>> Just to clarify - unless I missed something, there was still no
>> review of this from Daniel or someone else known to be
>> familiar with the subject. If Keir gave his ack, formally this
>> could go in, but I wouldn't feel too well with that (the more
>> that apart from not having reviewed it, Daniel seems to also
>> continue to have problems with it).
> 
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
> 
> A noticeable quantity of my contributions to Xen have been in the kexec
> / crash areas, and I am the author of the xen-crashdump-analyser.

I'm sorry, I didn't mean to offend you in any way. In fact David
and I briefly discussed this situation at the summit, and he sort
of understood that I consider your review valuable, but ...

> I do realise that I am certainly not impartial as far as this series is
> concerned, being a co-developer.

... possibly/likely biased, not least because both of you work
for Citrix. I'm therefore looking for a second, truly independent
review.

Please forgive me for not having expressed myself clearly.

Jan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 13:13   ` David Vrabel
                       ` (3 preceding siblings ...)
  2013-11-08 13:48     ` Daniel Kiper
@ 2013-11-08 15:04     ` Daniel Kiper
  2013-11-08 15:04     ` Daniel Kiper
  5 siblings, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-08 15:04 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, kexec, Jan Beulich, xen-devel

On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:

[...]

> > (XEN) Domain 0 crashed: Executing crash image
> >
> > gdb shows:
> >
> > (gdb) bt
> > #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> > #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> > #2  0x0000000000000000 in ?? ()
> > (gdb)
> >
> > Especially second bt line scares me... ;-)))
> >
> > I have not been able to identify why NMI was activated because
> > stack is completely cleared.
>
> All this you have described here is correct and expected behavior,
> which, quite frankly, you should have been able to see with even the
> most cursory look at the code.

This is more for fun than a real concern. That is why I added a
smiley at the end of my statement. nmi_crash () at entry.S:666 is
quite an interesting coincidence for me... ;-)))

Anyway, it is interesting that all CPUs were stopped at this stage.
One of them should still have been executing the kdump code. I will
try to reproduce this on real hardware.
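As a rough illustration of why other CPUs ending up in do_nmi_crash() is expected, while the crashing CPU itself should still go on to run the kdump code, here is a toy Python model of a crash-time NMI shootdown. It is a simplified sketch under our own assumptions: threads stand in for CPUs, a `threading.Event` stands in for the NMI IPI, and a bounded wait stands in for the acknowledgement loop. The function name `nmi_shootdown` and its structure are hypothetical and do not reflect Xen's actual implementation.

```python
import threading


def nmi_shootdown(nr_cpus, crashing_cpu, timeout=1.0):
    """Toy model of a crash-time NMI shootdown (not Xen's code).

    The crashing CPU "NMIs" every other CPU; each of those records its
    state and parks, as a CPU sitting in do_nmi_crash() would.  The
    crashing CPU waits (bounded by `timeout`) for the acknowledgements.
    Returns (sorted list of parked CPUs, whether all acknowledged in time).
    """
    nmi = threading.Event()          # stands in for the broadcast NMI IPI
    lock = threading.Lock()
    parked = []                      # CPUs that saved state and halted
    all_parked = threading.Event()
    expected = nr_cpus - 1

    def other_cpu(cpu):
        nmi.wait()                   # real CPUs are interrupted asynchronously
        with lock:
            parked.append(cpu)       # "save registers for the crash notes"
            if len(parked) == expected:
                all_parked.set()
        # A real CPU would now halt forever; the thread simply returns.

    others = [threading.Thread(target=other_cpu, args=(cpu,))
              for cpu in range(nr_cpus) if cpu != crashing_cpu]
    for t in others:
        t.start()

    nmi.set()                        # the crashing CPU sends the shootdown
    acked = all_parked.wait(timeout) if expected else True
    for t in others:
        t.join()
    # In the real flow, only the crashing CPU would now jump to the crash image.
    return sorted(parked), acked
```

In this model the end state is every non-crashing CPU parked and exactly one CPU left running; a hang at this point would mean the crashing CPU never got past the handover, which is the case worth reproducing on real hardware.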

Daniel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
                           ` (3 preceding siblings ...)
  2013-11-08 14:36         ` Jan Beulich
@ 2013-11-08 15:15         ` Daniel Kiper
  2013-11-08 15:15         ` [Xen-devel] " Daniel Kiper
  5 siblings, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-08 15:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: kexec, Keir Fraser, David Vrabel, Jan Beulich, xen-devel

On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
> >>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
> >> Keir,
> >>
> >> Sorry, forgot to CC you on this series.
> >>
> >> Can we have your opinion on whether this kexec series can be merged?
> >> And if not, what further work and/or testing is required?
> > Just to clarify - unless I missed something, there was still no
> > review of this from Daniel or someone else known to be
> > familiar with the subject. If Keir gave his ack, formally this
> > could go in, but I wouldn't feel too well with that (the more
> > that apart from not having reviewed it, Daniel seems to also
> > continue to have problems with it).
> >
> > Jan
>
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
>
> A noticeable quantity of my contributions to Xen have been in the kexec
> / crash areas, and I am the author of the xen-crashdump-analyser.
>
> I do realise that I am certainly not impartial as far as this series is
> concerned, being a co-developer.
>
> David's statement of "the current implementation is so broken[1] and
> useless[2] that..." is completely accurate.  It is frankly a miracle
> that the current code ever worked at all (and from XenServer's point of
> view, failed far more often than it worked).
>
>
> For reference, XenServer 6.2 shipped with approximately v7 of this
> series, and an appropriate kexec-tools and xen-crashdump-analyser.
> Since we put the code in, we have not had a single failure-to-kexec in
> automated testing (both specific crash tests, and from unexpected host
> crashes), whereas we were seeing reliable failures to crash on most of
> our test infrastructure.
>
> In stark contrast to previous versions of XenServer, we have not had a
> single customer reported host crash where the kexec path has failed.
> There was one systematic failure where the HPSA driver was unhappy with
> the state of the hardware, resulting in no root filesystem to write logs
> to, and a repeated panic and Xen deadlock in the queued invalidation
> codepath.

Andrew, the fact that it runs on all your hardware does not mean that
it runs everywhere. I have discovered a problem (hopefully the last one)
and it should be taken into consideration. A separate question is what
the source of this problem is. Maybe it is QEMU, but that should be
checked, not ignored.

Daniel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 15:15         ` [Xen-devel] " Daniel Kiper
  2013-11-08 15:42           ` Konrad Rzeszutek Wilk
@ 2013-11-08 15:42           ` Konrad Rzeszutek Wilk
  2013-11-08 15:48           ` [Xen-devel] " Andrew Cooper
  2013-11-08 15:48           ` Andrew Cooper
  3 siblings, 0 replies; 99+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-11-08 15:42 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: Keir Fraser, Andrew Cooper, kexec, xen-devel, David Vrabel, Jan Beulich

On Fri, Nov 08, 2013 at 07:15:00AM -0800, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
> > On 08/11/13 13:19, Jan Beulich wrote:
> > >>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
> > >> Keir,
> > >>
> > >> Sorry, forgot to CC you on this series.
> > >>
> > >> Can we have your opinion on whether this kexec series can be merged?
> > >> And if not, what further work and/or testing is required?
> > > Just to clarify - unless I missed something, there was still no
> > > review of this from Daniel or someone else known to be
> > > familiar with the subject. If Keir gave his ack, formally this
> > > could go in, but I wouldn't feel too well with that (the more
> > > that apart from not having reviewed it, Daniel seems to also
> > > continue to have problems with it).
> > >
> > > Jan
> >
> > Can I have myself deemed to be familiar with the subject as far as this
> > is concerned?
> >
> > A noticeable quantity of my contributions to Xen have been in the kexec
> > / crash areas, and I am the author of the xen-crashdump-analyser.
> >
> > I do realise that I am certainly not impartial as far as this series is
> > concerned, being a co-developer.
> >
> > David's statement of "the current implementation is so broken[1] and
> > useless[2] that..." is completely accurate.  It is frankly a miracle
> > that the current code ever worked at all (and from XenServer's point of
> > view, failed far more often than it worked).
> >
> >
> > For reference, XenServer 6.2 shipped with approximately v7 of this
> > series, and an appropriate kexec-tools and xen-crashdump-analyser.
> > Since we put the code in, we have not had a single failure-to-kexec in
> > automated testing (both specific crash tests, and from unexpected host
> > crashes), whereas we were seeing reliable failures to crash on most of
> > our test infrastructure.
> >
> > In stark contrast to previous versions of XenServer, we have not had a
> > single customer reported host crash where the kexec path has failed.
> > There was one systematic failure where the HPSA driver was unhappy with
> > the state of the hardware, resulting in no root filesystem to write logs
> > to, and a repeated panic and Xen deadlock in the queued invalidation
> > codepath.
> 
> Andrew, the fact that it runs on all your hardware does not mean that
> it runs everywhere. I have discovered a problem (hopefully the last one)
> and it should be taken into consideration. A separate question is what
> the source of this problem is. Maybe it is QEMU, but that should be
> checked, not ignored.

I think the question is this: the feature freeze is on the 18th, so
should this single bug halt the integration of this whole patchset?

Or is it OK to put the patchset in and deal with the bugs, rather
than stall this initial patchset?

> 
> Daniel
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 15:15         ` [Xen-devel] " Daniel Kiper
                             ` (2 preceding siblings ...)
  2013-11-08 15:48           ` [Xen-devel] " Andrew Cooper
@ 2013-11-08 15:48           ` Andrew Cooper
  3 siblings, 0 replies; 99+ messages in thread
From: Andrew Cooper @ 2013-11-08 15:48 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Keir Fraser, David Vrabel, Jan Beulich, xen-devel

On 08/11/13 15:15, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
>> On 08/11/13 13:19, Jan Beulich wrote:
>>>>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> Keir,
>>>>
>>>> Sorry, forgot to CC you on this series.
>>>>
>>>> Can we have your opinion on whether this kexec series can be merged?
>>>> And if not, what further work and/or testing is required?
>>> Just to clarify - unless I missed something, there was still no
>>> review of this from Daniel or someone else known to be
>>> familiar with the subject. If Keir gave his ack, formally this
>>> could go in, but I wouldn't feel too well with that (the more
>>> that apart from not having reviewed it, Daniel seems to also
>>> continue to have problems with it).
>>>
>>> Jan
>> Can I have myself deemed to be familiar with the subject as far as this
>> is concerned?
>>
>> A noticeable quantity of my contributions to Xen have been in the kexec
>> / crash areas, and I am the author of the xen-crashdump-analyser.
>>
>> I do realise that I am certainly not impartial as far as this series is
>> concerned, being a co-developer.
>>
>> David's statement of "the current implementation is so broken[1] and
>> useless[2] that..." is completely accurate.  It is frankly a miracle
>> that the current code ever worked at all (and from XenServer's point of
>> view, failed far more often than it worked).
>>
>>
>> For reference, XenServer 6.2 shipped with approximately v7 of this
>> series, and an appropriate kexec-tools and xen-crashdump-analyser.
>> Since we put the code in, we have not had a single failure-to-kexec in
>> automated testing (both specific crash tests, and from unexpected host
>> crashes), whereas we were seeing reliable failures to crash on most of
>> our test infrastructure.
>>
>> In stark contrast to previous versions of XenServer, we have not had a
>> single customer reported host crash where the kexec path has failed.
>> There was one systematic failure where the HPSA driver was unhappy with
>> the state of the hardware, resulting in no root filesystem to write logs
>> to, and a repeated panic and Xen deadlock in the queued invalidation
>> codepath.
> Andrew, the fact that it runs on all your hardware does not mean that
> it runs everywhere. I have discovered a problem (hopefully the last one)
> and it should be taken into consideration. A separate question is what
> the source of this problem is. Maybe it is QEMU, but that should be
> checked, not ignored.
>
> Daniel

I am not trying to suggest that it is 100% perfect with all corner cases
covered.

However, I feel that a QEMU failure in the NMI shootdown logic (which
has not been touched by this series, and has been present in Xen since
the 4.3 development cycle) should not be held against the series.
Or do you mean that the QEMU failure is a regression caused by the
series?

For interest, our nightly tests consist of:

* xl debug-keys C
* echo c > /proc/sysrq-trigger
** This is further repeated several times with a 1-vcpu dom0 pinned to
pcpu 0, 1, -1 and 2 further randomly chosen pcpus.
* echo c > /proc/sysrq-trigger with the server running VM workloads.

These are chained back-to-back with our crashdump environment, which
collects logs and automatically reboots.  For each individual crash, the
crashdump-analyser logs are checked for correctness.
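The nightly matrix described above can be sketched as a dry-run plan generator in Python. This is a hypothetical illustration, not XenServer's test harness: the function name `nightly_crash_plan`, the step descriptions, reading "-1" as the highest-numbered pcpu, and using `xl vcpu-set` / `xl vcpu-pin` to obtain a 1-vcpu pinned dom0 are all assumptions. It only builds the command list and never executes anything, because every step deliberately crashes the host.

```python
import random


def nightly_crash_plan(nr_pcpus, seed=None):
    """Build (but do not run) an ordered list of crash-test steps.

    Each step is (description, [shell commands]).  Every step crashes the
    host, so a real harness would run one step per boot and hand over to
    the crashdump environment in between.
    """
    rng = random.Random(seed)
    sysrq = "echo c > /proc/sysrq-trigger"
    steps = [("Xen crash via debug key", ["xl debug-keys C"]),
             ("dom0 crash via sysrq", [sysrq])]

    # Assumption: "-1" in the email is read as the highest-numbered pcpu.
    fixed = sorted({0, 1 % nr_pcpus, nr_pcpus - 1})
    remaining = [p for p in range(nr_pcpus) if p not in fixed]
    extra = sorted(rng.sample(remaining, min(2, len(remaining))))
    for pcpu in fixed + extra:
        steps.append((f"1-vcpu dom0 pinned to pcpu {pcpu}",
                      ["xl vcpu-set Domain-0 1",          # assumed way to get 1 vcpu
                       f"xl vcpu-pin Domain-0 0 {pcpu}",  # pin vcpu 0 to this pcpu
                       sysrq]))

    # VM workloads are assumed to be started by the harness beforehand.
    steps.append(("dom0 crash via sysrq under VM load", [sysrq]))
    return steps


if __name__ == "__main__":
    for desc, cmds in nightly_crash_plan(8, seed=0):
        print(desc, "->", "; ".join(cmds))
```

Passing a fixed `seed` makes the two "randomly chosen" pcpus reproducible, which helps when a failing night needs to be re-run with the same pinning.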

There is a separate test on supporting hardware which uses an IPMI
controller to inject an IOCK NMI.

The above tests get run on a random server every single night.  During
development, when the lab was idle, we repeatedly ran the tests against
every unique machine we had available (about 100 types: different
brands, different generations of technology).

~Andrew

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-08 15:42           ` Konrad Rzeszutek Wilk
@ 2013-11-08 16:28             ` Daniel Kiper
  2013-11-08 16:28             ` [Xen-devel] " Daniel Kiper
  1 sibling, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-08 16:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Andrew Cooper, kexec, xen-devel, David Vrabel, Jan Beulich

On Fri, Nov 08, 2013 at 10:42:51AM -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 08, 2013 at 07:15:00AM -0800, Daniel Kiper wrote:
> > On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
> > > On 08/11/13 13:19, Jan Beulich wrote:
> > > >>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
> > > >> Keir,
> > > >>
> > > >> Sorry, forgot to CC you on this series.
> > > >>
> > > >> Can we have your opinion on whether this kexec series can be merged?
> > > >> And if not, what further work and/or testing is required?
> > > > Just to clarify - unless I missed something, there was still no
> > > > review of this from Daniel or someone else known to be
> > > > familiar with the subject. If Keir gave his ack, formally this
> > > > could go in, but I wouldn't feel too well with that (the more
> > > > that apart from not having reviewed it, Daniel seems to also
> > > > continue to have problems with it).
> > > >
> > > > Jan
> > >
> > > Can I have myself deemed to be familiar with the subject as far as this
> > > is concerned?
> > >
> > > A noticeable quantity of my contributions to Xen have been in the kexec
> > > / crash areas, and I am the author of the xen-crashdump-analyser.
> > >
> > > I do realise that I am certainly not impartial as far as this series is
> > > concerned, being a co-developer.
> > >
> > > David's statement of "the current implementation is so broken[1] and
> > > useless[2] that..." is completely accurate.  It is frankly a miracle
> > > that the current code ever worked at all (and from XenServer's point of
> > > view, failed far more often than it worked).
> > >
> > >
> > > For reference, XenServer 6.2 shipped with approximately v7 of this
> > > series, and an appropriate kexec-tools and xen-crashdump-analyser.
> > > Since we put the code in, we have not had a single failure-to-kexec in
> > > automated testing (both specific crash tests, and from unexpected host
> > > crashes), whereas we were seeing reliable failures to crash on most of
> > > our test infrastructure.
> > >
> > > In stark contrast to previous versions of XenServer, we have not had a
> > > single customer reported host crash where the kexec path has failed.
> > > There was one systematic failure where the HPSA driver was unhappy with
> > > the state of the hardware, resulting in no root filesystem to write logs
> > > to, and a repeated panic and Xen deadlock in the queued invalidation
> > > codepath.
> >
> > Andrew, just because it runs on all your hardware does not mean that it
> > runs everywhere. I have discovered a problem (I hope the last one) and it
> > should be taken into consideration. Another question is what the
> > source of this problem is. Maybe QEMU, but it should be checked and not
> > ignored.
>
> I think the question is, given that the feature freeze is on the 18th,
> whether this single bug should halt the integration of this whole patchset.
>
> Or whether it is OK to put the patchset in and deal with the bugs,
> rather than stall this initial patchset.

I have never stated that I would like to block this patch series
indefinitely due to this one bug (I am still not sure that it
is a bug; currently, I feel that I am the only person trying
to verify that). We have more than one week and I think that we
are able to discover what is going on. If not, I think that we
can work out a reasonable solution for this issue (as we did in other
cases). Last but not least, I would like to stress that I also wish
for this patch series to be included in Xen 4.4. However,
it must be done in a sensible way.

Daniel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-07 21:16 ` Daniel Kiper
                     ` (4 preceding siblings ...)
  2013-11-09 19:18   ` Daniel Kiper
@ 2013-11-09 19:18   ` Daniel Kiper
  5 siblings, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-09 19:18 UTC (permalink / raw)
  To: David Vrabel; +Cc: kexec, Jan Beulich, xen-devel

On Thu, Nov 07, 2013 at 10:16:51PM +0100, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
> > The series (for Xen 4.4) improves the kexec hypercall by making Xen
> > responsible for loading and relocating the image.  This allows kexec
> > to be usable by pv-ops kernels and should allow kexec to be usable
> > from a HVM or PVH privileged domain.
> >
> > I have now tested this with a Linux kernel image using the VGA console
> > which was what was causing problems in v9 (this turned out to be a
> > kexec-tools bug).
> >
> > The required patch series for kexec-tools will be posted shortly and
> > are available from the xen-v7 branch of:
>
> In general it works. However, quite often I am not able to execute the panic
> kernel. The machine hangs with the following message:
>
> (XEN) Domain 0 crashed: Executing crash image
>
> gdb shows:
>
> (gdb) bt
> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> #2  0x0000000000000000 in ?? ()
> (gdb)
>
> The second bt line especially scares me... ;-)))
>
> I have not been able to identify why the NMI was triggered because
> the stack is completely cleared. I tried to record execution in gdb,
> but it stops with the following message:
>
> cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
>     at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
> 108             clear_bit(cpumask_check(cpu), dstp->bits);
> Process record: failed to record execution log.
>
> Do you know how to find out why the NMI was triggered?
>
> I can almost always reproduce this issue as follows:
>   - boot Xen,
>   - load panic kernel,
>   - echo c > /proc/sysrq-trigger,
>   - reboot from command line,
>   - boot Xen,
>   - load panic kernel,
>   - echo c > /proc/sysrq-trigger.

I am not able to reproduce this on real hardware. Sorry for the confusion.

Hence, for the whole Xen kexec/kdump series:

Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

Daniel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-09 19:18   ` Daniel Kiper
  2013-11-11 14:34     ` Don Slutz
@ 2013-11-11 14:34     ` Don Slutz
  2013-11-11 15:09     ` David Vrabel
  2013-11-11 15:09     ` David Vrabel
  3 siblings, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-11 14:34 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, David Vrabel, Jan Beulich, xen-devel

On 11/09/13 14:18, Daniel Kiper wrote:
> On Thu, Nov 07, 2013 at 10:16:51PM +0100, Daniel Kiper wrote:
>> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>>> responsible for loading and relocating the image.  This allows kexec
>>> to be usable by pv-ops kernels and should allow kexec to be usable
>>> from a HVM or PVH privileged domain.
>>>
>>> I have now tested this with a Linux kernel image using the VGA console
>>> which was what was causing problems in v9 (this turned out to be a
>>> kexec-tools bug).
>>>
>>> The required patch series for kexec-tools will be posted shortly and
>>> are available from the xen-v7 branch of:
>> In general it works. However, quite often I am not able to execute the panic
>> kernel. The machine hangs with the following message:
>>
>> (XEN) Domain 0 crashed: Executing crash image
>>
>> gdb shows:
>>
>> (gdb) bt
>> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
>> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>> #2  0x0000000000000000 in ?? ()
>> (gdb)
>>
>> The second bt line especially scares me... ;-)))
>>
>> I have not been able to identify why the NMI was triggered because
>> the stack is completely cleared. I tried to record execution in gdb,
>> but it stops with the following message:
>>
>> cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
>>      at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
>> 108             clear_bit(cpumask_check(cpu), dstp->bits);
>> Process record: failed to record execution log.
>>
>> Do you know how to find out why the NMI was triggered?
>>
>> I can almost always reproduce this issue as follows:
>>    - boot Xen,
>>    - load panic kernel,
>>    - echo c > /proc/sysrq-trigger,
>>    - reboot from command line,
>>    - boot Xen,
>>    - load panic kernel,
>>    - echo c > /proc/sysrq-trigger.
> I am not able to reproduce this on real hardware. Sorry for the confusion.
>
> Hence, for the whole Xen kexec/kdump series:
>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
>
> Daniel
Also

Tested-by: Don Slutz <dslutz@verizon.com>

    -Don Slutz
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-08 12:50   ` [PATCHv11 " David Vrabel
@ 2013-11-11 14:37     ` Don Slutz
  2013-11-15 14:35     ` Jan Beulich
  1 sibling, 0 replies; 99+ messages in thread
From: Don Slutz @ 2013-11-11 14:37 UTC (permalink / raw)
  To: David Vrabel; +Cc: Daniel Kiper, Jan Beulich, xen-devel

On 11/08/13 07:50, David Vrabel wrote:
> Add the code needed to handle and load kexec images into Xen memory or
> into the crash region.  This is needed for the new KEXEC_CMD_load and
> KEXEC_CMD_unload hypercall sub-ops.
>
> Much of this code is derived from the Linux kernel.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[...]

Reviewed-by: Don Slutz <dslutz@verizon.com>
Tested-by: Don Slutz <dslutz@verizon.com>

    -Don Slutz

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-09 19:18   ` Daniel Kiper
  2013-11-11 14:34     ` Don Slutz
  2013-11-11 14:34     ` Don Slutz
@ 2013-11-11 15:09     ` David Vrabel
  2013-11-11 15:09     ` David Vrabel
  3 siblings, 0 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-11 15:09 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: kexec, Jan Beulich, xen-devel

On 09/11/13 19:18, Daniel Kiper wrote:
> 
> Hence, for the whole Xen kexec/kdump series:
> 
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

Thanks.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
  2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
                   ` (18 preceding siblings ...)
  2013-11-07 21:16 ` Daniel Kiper
@ 2013-11-11 17:18 ` Keir Fraser
  2013-11-11 17:18 ` [Xen-devel] " Keir Fraser
  20 siblings, 0 replies; 99+ messages in thread
From: Keir Fraser @ 2013-11-11 17:18 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Daniel Kiper, kexec, Jan Beulich

On 06/11/2013 14:49, "David Vrabel" <david.vrabel@citrix.com> wrote:

> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image.  This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from a HVM or PVH privileged domain.

Acked-by: Keir Fraser <keir@xen.org>

> I have now tested this with a Linux kernel image using the VGA console
> which was what was causing problems in v9 (this turned out to be a
> kexec-tools bug).
> 
> The required patch series for kexec-tools will be posted shortly and
> are available from the xen-v7 branch of:
> 
> http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summary
> 
> Changes in v10:
> 
> - Document host state on exec.
> - Fix kimage_alloc() error path (double free, crash on zero kimage->head).
> - Check for segment before expanding it in load_v1.
> - Move kexec_lock define into kexec_swap_images().
> 
> Changes in v9:
> 
> - Update comments to correctly say 4.4.
> - Minor updates to the kexec_reloc assembly to improve maintainability a
>   bit.
> 
> Changes in v8:
> 
> - Use #defines for compat ABI structures.
> - Tweak link time check for kexec_reloc.
> 
> Changes in v7:
> 
> - No longer use GUEST_HANDLE_64(), get a uniform ABI by using unions
>   and explicit padding.
> - Only map the segments and not all of RAM.
> - Add a mechanism to create mappings for use by the exec'd image (a
>   segment with a NULL buf handle).
> - Fix a bug where a crash image's code page would be placed at machine
>   address 0 (instead of inside the crash region).
> 
> Changes in v6:
> 
> - Fix double free in KEXEC_load_v1 failure path.
> - Only copy the relocation code and not the whole page.
> - Add myself as the kexec maintainer.
> 
> Changes in v5 (not posted to the list):
> 
> - _rsvd -> _pad in one of the public ABI structures.
> - Fix bug where trailing pages were not zeroed. This fixes loading a
>   64-bit Linux kernel using a more recent version of kexec-tools.
> - Check the relocation code fits into a page at link time.
> 
> Changes in v4:
> 
> - Use paddr_t and page_to_maddr() etc. for portability.
> - Add explicit padding to hypercall structures where required.
> - Minor cleanup of the kexec_reloc assembly.
> - Print a message before exec'ing a crash image.
> - Style fixes (tabs, trailing whitespace) and typos.
> - Fix a bug where using the V1 interface and unloading an image may crash.
> 
> Changes in v3:
> 
> - Provide old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3
> - Adjust new struct xen_kexec_load to avoid unnecessary padding.
> - Use domheap pages for the image and control pages.
> - Remove the DBG() macros from the reloc code.
> 
> David
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-08 12:50   ` [PATCHv11 " David Vrabel
  2013-11-11 14:37     ` Don Slutz
@ 2013-11-15 14:35     ` Jan Beulich
  2013-11-15 18:31       ` David Vrabel
  1 sibling, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-15 14:35 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Daniel Kiper

>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
> Add the code needed to handle and load kexec images into Xen memory or
> into the crash region.  This is needed for the new KEXEC_CMD_load and
> KEXEC_CMD_unload hypercall sub-ops.

I know it's late in the game, but just now I started getting the
impression that this introduced a new limitation that needs to
be taken into consideration elsewhere: With the old
implementation it was the kernel's responsibility to write to
the reserved space or, where Xen needed to touch the space,
it did so via fixmap entries. Hence there was no need for the
area to have corresponding struct page_info.

The new code, however, appears to make assumptions that
the memory used here is part of the range covered by the
frame table, and hence setup.c's determination of the base
address would need to be adjusted accordingly. (I realize
that this only matters on systems having more RAM than the
hypervisor can make use of.)
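A minimal sketch in C of the constraint described above, assuming the frame table covers machine frames 0 through frame_table_pages - 1. frame_table_pages is a hypothetical stand-in for whatever limit setup.c derives, and range_in_frame_table() and the macros are illustrative definitions, not Xen's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; not Xen's actual definitions. */
typedef uint64_t paddr_t;
#define PAGE_SHIFT 12
#define PAGE_SIZE  ((paddr_t)1 << PAGE_SHIFT)
#define PFN_UP(x)  (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/*
 * Hypothetical check: true if every frame of [start, start + size)
 * has a struct page_info, i.e. lies below frame_table_pages.
 */
static bool range_in_frame_table(paddr_t start, paddr_t size,
                                 unsigned long frame_table_pages)
{
    assert(size > 0);
    if (start + size < start)   /* address wrap-around: reject */
        return false;
    return PFN_UP(start + size) <= frame_table_pages;
}
```

For example, with 4 GiB covered (frame_table_pages = 0x100000), a 64 MiB region at 16 MiB passes, while a 128 MiB region starting at 0xFC000000 does not, because it ends above 4 GiB.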

Jan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-15 14:35     ` Jan Beulich
@ 2013-11-15 18:31       ` David Vrabel
  2013-11-18  8:07         ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: David Vrabel @ 2013-11-15 18:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Daniel Kiper, David Vrabel

On 15/11/13 14:35, Jan Beulich wrote:
>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
>> Add the code needed to handle and load kexec images into Xen memory or
>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>> KEXEC_CMD_unload hypercall sub-ops.
> 
> I know it's late in the game, but just now I started getting the
> impression that this introduced a new limitation that needs to
> be taken into consideration elsewhere: With the old
> implementation it was the kernel's responsibility to write to
> the reserved space or, where Xen needed to touch the space,
> it did so via fixmap entries. Hence there was no need for the
> area to have corresponding struct page_info.
> 
> The new code, however, appears to make assumptions that
> the memory used here is part of the range covered by the
> frame table, and hence setup.c's determination of the base
> address would need to be adjusted accordingly. (I realize
> that this only matters on systems having more RAM than the
> hypervisor can make use of.)

The relocation code wrote the image into the crash region, not the
kernel, but I take your point.

Is this a real problem or just a theoretical one for now? I don't think
it's unreasonable to require the crash region to be within the frame table.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-15 18:31       ` David Vrabel
@ 2013-11-18  8:07         ` Jan Beulich
  2013-11-18 11:04           ` David Vrabel
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-18  8:07 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Daniel Kiper

>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
> On 15/11/13 14:35, Jan Beulich wrote:
>>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
>>> Add the code needed to handle and load kexec images into Xen memory or
>>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>>> KEXEC_CMD_unload hypercall sub-ops.
>> 
>> I know it's late in the game, but just now I started getting the
>> impression that this introduced a new limitation that needs to
>> be taken into consideration elsewhere: With the old
>> implementation it was the kernel's responsibility to write to
>> the reserved space or, where Xen needed to touch the space,
>> it did so via fixmap entries. Hence there was no need for the
>> area to have corresponding struct page_info.
>> 
>> The new code, however, appears to make assumptions that
>> the memory used here is part of the range covered by the
>> frame table, and hence setup.c's determination of the base
>> address would need to be adjusted accordingly. (I realize
>> that this only matters on systems having more RAM than the
>> hypervisor can make use of.)
> 
> The relocation code wrote the image into the crash region, not the
> kernel, but I take your point.
> 
> Is this a real problem or just a theoretical one for now?

Not sure what "theoretical" here means - I know of actual systems
(even if perhaps not commercially available yet) that would be
affected by this.

> I don't think
> it's unreasonable to require the crash region to be within the frame table.

Right - as I assume you don't want to change all of your mapping
code, the only alternative is for the restriction to be enforced when
allocating the memory block.
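Enforcing the restriction when the memory block is allocated could look roughly like the sketch below. It assumes a hypothetical list of free RAM ranges and a limit equal to the end of frame-table coverage, and it picks the highest suitably aligned block that ends at or below that limit. struct ram_range, place_below_limit() and the top-down policy are illustrative choices, not Xen's allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;

/* Hypothetical free-RAM range, e.g. derived from an E820-style map. */
struct ram_range {
    paddr_t start;
    paddr_t end;    /* exclusive */
};

/*
 * Find the highest block of `size` bytes, aligned to `align` (a power
 * of two), that lies inside one range and ends at or below `limit`.
 * Returns false if no range can hold such a block.
 */
static bool place_below_limit(const struct ram_range *r, size_t n,
                              paddr_t size, paddr_t align, paddr_t limit,
                              paddr_t *out)
{
    bool found = false;

    assert(size > 0 && align != 0 && (align & (align - 1)) == 0);
    for (size_t i = 0; i < n; i++) {
        /* Clip the range so the block cannot extend past the limit. */
        paddr_t end = r[i].end < limit ? r[i].end : limit;
        paddr_t start;

        if (end < size)
            continue;                         /* would underflow */
        start = (end - size) & ~(align - 1);  /* round down to alignment */
        if (start < r[i].start)
            continue;                         /* does not fit this range */
        if (!found || start > *out) {
            *out = start;
            found = true;
        }
    }
    return found;
}
```

With free RAM at 1 MiB to 2 GiB and 4 GiB to 36 GiB, a 64 MiB request aligned to 16 MiB under a 4 GiB limit is placed at 0x7C000000, even though higher memory is free.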

Jan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18  8:07         ` Jan Beulich
@ 2013-11-18 11:04           ` David Vrabel
  2013-11-18 11:34             ` Jan Beulich
  2013-11-18 11:43             ` Daniel Kiper
  0 siblings, 2 replies; 99+ messages in thread
From: David Vrabel @ 2013-11-18 11:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Daniel Kiper

On 18/11/13 08:07, Jan Beulich wrote:
>>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 15/11/13 14:35, Jan Beulich wrote:
>>>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> Add the code needed to handle and load kexec images into Xen memory or
>>>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>>>> KEXEC_CMD_unload hypercall sub-ops.
>>>
>>> I know it's late in the game, but just now I started getting the
>>> impression that this introduced a new limitation that needs to
>>> be taken into consideration elsewhere: With the old
>>> implementation it was the kernel's responsibility to write to
>>> the reserved space or, where Xen needed to touch the space,
>>> it did so via fixmap entries. Hence there was no need for the
>>> area to have corresponding struct page_info.
>>>
>>> The new code, however, appears to make assumptions that
>>> the memory used here is part of the range covered by the
>>> frame table, and hence setup.c's determination of the base
>>> address would need to be adjusted accordingly. (I realize
>>> that this only matters on systems having more RAM than the
>>> hypervisor can make use of.)
>>
>> The relocation code wrote the image into the crash region, not the
>> kernel, but I take your point.
>>
>> Is this a real problem or just a theoretical one for now?
> 
> Not sure what "theoretical" here means - I know of actual systems
> (even if perhaps not commercially available yet) that would be
> affected by this.

The administrator has to configure the location of the crash region.  I
was asking if there are systems that configure the crash region such
that it would end in the wrong place.

It does appear that the simplest crashkernel configuration would get it
wrong.  e.g., crashkernel=0-:64M

>> I don't think
>> it's unreasonable to require the crash region to be within the frame table.
> 
> Right - as I assume you don't want to change all of your mapping
> code, the only alternative is for the restriction to be enforced when
> allocating the memory block.

The

   map_pages_to_xen((unsigned long)__va(kexec_crash_area.start),
                     kexec_crash_area.start >> PAGE_SHIFT,
                     PFN_UP(kexec_crash_area.size), PAGE_HYPERVISOR);

call in __start_xen() suggests that this isn't a new problem.

This seems like a minor issue and if no one finds the time to fix it, I
think simply adding a release note would do.

David

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 11:04           ` David Vrabel
@ 2013-11-18 11:34             ` Jan Beulich
  2013-11-18 12:25               ` Daniel Kiper
  2013-11-18 11:43             ` Daniel Kiper
  1 sibling, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-18 11:34 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Daniel Kiper

>>> On 18.11.13 at 12:04, David Vrabel <david.vrabel@citrix.com> wrote:
> On 18/11/13 08:07, Jan Beulich wrote:
>>>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
>>> On 15/11/13 14:35, Jan Beulich wrote:
>>>>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
>>>>> Add the code needed to handle and load kexec images into Xen memory or
>>>>> into the crash region.  This is needed for the new KEXEC_CMD_load and
>>>>> KEXEC_CMD_unload hypercall sub-ops.
>>>>
>>>> I know it's late in the game, but just now I started getting the
>>>> impression that this introduced a new limitation that needs to
>>>> be taken into consideration elsewhere: With the old
>>>> implementation it was the kernel's responsibility to write to
>>>> the reserved space or, where Xen needed to touch the space,
>>>> it did so via fixmap entries. Hence there was no need for the
>>>> area to have corresponding struct page_info.
>>>>
>>>> The new code, however, appears to make assumptions that
>>>> the memory used here is part of the range covered by the
>>>> frame table, and hence setup.c's determination of the base
>>>> address would need to be adjusted accordingly. (I realize
>>>> that this only matters on systems having more RAM than the
>>>> hypervisor can make use of.)
>>>
>>> The relocation code wrote the image into the crash region, not the
>>> kernel, but I take your point.
>>>
>>> Is this a real problem or just a theoretical one for now?
>> 
>> Not sure what "theoretical" here means - I know of actual systems
>> (even if perhaps not commercially available yet) that would be
>> affected by this.
> 
> The administrator has to configure the location of the crash region.

All he needs to specify is the size; specifying the location is optional.

>  I
> was asking if there are systems that configure the crash region such
> that it would end in the wrong place.
> 
> It does appear that the simplest crashkernel configuration would get it
> wrong.  e.g., crashkernel=0-:64M

Which you seem to confirm here.

>>> I don't think
>>> it's unreasonable to require the crash region to be within the frame table.
>> 
>> Right - as I assume you don't want to change all of your mapping
>> code, the only alternative is for the restriction to be enforced when
>> allocating the memory block.
> 
> The
> 
>    map_pages_to_xen((unsigned long)__va(kexec_crash_area.start),
>                      kexec_crash_area.start >> PAGE_SHIFT,
>                      PFN_UP(kexec_crash_area.size), PAGE_HYPERVISOR);
> 
> call in __start_xen() suggests that this isn't a new problem.

Oh, indeed. So I looked at all the (old) kexec code, not finding any
such implication, and completely overlooked that boot time thing
(which appears to be superfluous with both the old _and_ new
implementations).

Jan

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 11:04           ` David Vrabel
  2013-11-18 11:34             ` Jan Beulich
@ 2013-11-18 11:43             ` Daniel Kiper
  1 sibling, 0 replies; 99+ messages in thread
From: Daniel Kiper @ 2013-11-18 11:43 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Jan Beulich

On Mon, Nov 18, 2013 at 11:04:00AM +0000, David Vrabel wrote:
> On 18/11/13 08:07, Jan Beulich wrote:
> >>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
> >> On 15/11/13 14:35, Jan Beulich wrote:
> >>>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
> >>>> Add the code needed to handle and load kexec images into Xen memory or
> >>>> into the crash region.  This is needed for the new KEXEC_CMD_load and
> >>>> KEXEC_CMD_unload hypercall sub-ops.
> >>>
> >>> I know it's late in the game, but just now I started getting the
> >>> impression that this introduced a new limitation that needs to
> >>> be taken into consideration elsewhere: With the old
> >>> implementation it was the kernel's responsibility to write to
> >>> the reserved space or, where Xen needed to touch the space,
> >>> it did so via fixmap entries. Hence there was no need for the
> >>> area to have corresponding struct page_info.
> >>>
> >>> The new code, however, appears to make assumptions that
> >>> the memory used here is part of the range covered by the
> >>> frame table, and hence setup.c's determination of the base
> >>> address would need to be adjusted accordingly. (I realize
> >>> that this only matters on systems having more RAM than the
> >>> hypervisor can make use of.)
> >>
> >> The relocation code wrote the image into the crash region, not the
> >> kernel, but I take your point.
> >>
> >> Is this a real problem or just a theoretical one for now?
> >
> > Not sure what "theoretical" here means - I know of actual systems
> > (even if perhaps not commercially available yet) that would be
> > affected by this.
>
> The administrator has to configure the location of the crash region.  I
> was asking if there are systems that configure the crash region such
> that it would end in the wrong place.
>
> It does appear that the simplest crashkernel configuration would get it
> wrong.  e.g., crashkernel=0-:64M
>
> >> I don't think
> >> it's unreasonable to require the crash region to be within the frame table.
> >
> > Right - as I assume you don't want to change all of your mapping
> > code, the only alternative is for the restriction to be enforced when
> > allocating the memory block.
>
> The
>
>    map_pages_to_xen((unsigned long)__va(kexec_crash_area.start),
>                      kexec_crash_area.start >> PAGE_SHIFT,
>                      PFN_UP(kexec_crash_area.size), PAGE_HYPERVISOR);
>
> call in __start_xen() suggests that this isn't a new problem.
>
> This seems like a minor issue and if no one finds the time to fix it, I
> think simply adding a release note would do.

I think that at this stage we could require that the crashkernel region
live below 5 TiB and not overlap with Xen code and/or structures. This
way the user will know that he/she has chosen bad values. Later we could
think about a better solution.

David

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 11:34             ` Jan Beulich
@ 2013-11-18 12:25               ` Daniel Kiper
  2013-11-18 12:53                 ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel Kiper @ 2013-11-18 12:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On Mon, Nov 18, 2013 at 11:34:56AM +0000, Jan Beulich wrote:
> >>> On 18.11.13 at 12:04, David Vrabel <david.vrabel@citrix.com> wrote:
> > On 18/11/13 08:07, Jan Beulich wrote:
> >>>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
> >>> On 15/11/13 14:35, Jan Beulich wrote:
> >>>>>>> On 08.11.13 at 13:50, David Vrabel <david.vrabel@citrix.com> wrote:
> >>>>> Add the code needed to handle and load kexec images into Xen memory or
> >>>>> into the crash region.  This is needed for the new KEXEC_CMD_load and
> >>>>> KEXEC_CMD_unload hypercall sub-ops.
> >>>>
> >>>> I know it's late in the game, but just now I started getting the
> >>>> impression that this introduced a new limitation that needs to
> >>>> be taken into consideration elsewhere: With the old
> >>>> implementation it was the kernel's responsibility to write to
> >>>> the reserved space or, where Xen needed to touch the space,
> >>>> it did so via fixmap entries. Hence there was no need for the
> >>>> area to have corresponding struct page_info.
> >>>>
> >>>> The new code, however, appears to make assumptions that
> >>>> the memory used here is part of the range covered by the
> >>>> frame table, and hence setup.c's determination of the base
> >>>> address would need to be adjusted accordingly. (I realize
> >>>> that this only matters on systems having more RAM than the
> >>>> hypervisor can make use of.)
> >>>
> >>> The relocation code wrote the image into the crash region, not the
> >>> kernel, but I take your point.
> >>>
> >>> Is this a real problem or just a theoretical one for now?
> >>
> >> Not sure what "theoretical" here means - I know of actual systems
> >> (even if perhaps not commercially available yet) that would be
> >> affected by this.
> >
> > The administrator has to configure the location of the crash region.
>
> All he needs to specify is the size; specifying the location is optional.
>
> >  I
> > was asking if there are systems that configure the crash region such
> > that it would end in the wrong place.
> >
> > It does appear that the simplest crashkernel configuration would get it
> > wrong.  e.g., crashkernel=0-:64M
>
> Which you seem to confirm here.

Even if this does not make sense, mapping should work without any issue.
We are mapping only one page at a time. So what is the limit in that case?

> >>> I don't think
> >>> it's unreasonable to require the crash region to be within the frame table.
> >>
> >> Right - as I assume you don't want to change all of your mapping
> >> code, the only alternative is for the restriction to be enforced when
> >> allocating the memory block.
> >
> > The
> >
> >    map_pages_to_xen((unsigned long)__va(kexec_crash_area.start),
> >                      kexec_crash_area.start >> PAGE_SHIFT,
> >                      PFN_UP(kexec_crash_area.size), PAGE_HYPERVISOR);
> >
> > call in __start_xen() suggests that this isn't a new problem.
>
> Oh, indeed. So I looked at all the (old) kexec code, not finding any
> such implication, and completely overlooked that boot time thing
> (which appears to be superfluous with both the old _and_ new
> implementations).

Ugh... I forgot that we are mapping/unmapping page by page in kexec.
Hence, this could simply be removed.

Daniel

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 12:25               ` Daniel Kiper
@ 2013-11-18 12:53                 ` Jan Beulich
  2013-11-18 13:24                   ` Daniel Kiper
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-18 12:53 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: xen-devel, David Vrabel

>>> On 18.11.13 at 13:25, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Mon, Nov 18, 2013 at 11:34:56AM +0000, Jan Beulich wrote:
>> >>> On 18.11.13 at 12:04, David Vrabel <david.vrabel@citrix.com> wrote:
>> > On 18/11/13 08:07, Jan Beulich wrote:
>> >>>>> On 15.11.13 at 19:31, David Vrabel <david.vrabel@citrix.com> wrote:
>> >>> On 15/11/13 14:35, Jan Beulich wrote:
>> >>>> The new code, however, appears to make assumptions that
>> >>>> the memory used here is part of the range covered by the
>> >>>> frame table, and hence setup.c's determination of the base
>> >>>> address would need to be adjusted accordingly. (I realize
>> >>>> that this only matters on systems having more RAM than the
>> >>>> hypervisor can make use of.)
>> >>>
>> >>> The relocation code wrote the image into the crash region, not the
>> >>> kernel, but I take your point.
>> >>>
>> >>> Is this a real problem or just a theoretical one for now?
>> >>
>> >> Not sure what "theoretical" here means - I know of actual systems
>> >> (even if perhaps not commercially available yet) that would be
>> >> affected by this.
>> >
>> > The administrator has to configure the location of the crash region.
>>
>> All he needs to specify is the size; specifying the location is optional.
>>
>> >  I
>> > was asking if there are systems that configure the crash region such
>> > that it would end in the wrong place.
>> >
>> > It does appear that the simplest crashkernel configuration would get it
>> > wrong.  e.g., crashkernel=0-:64M
>>
>> Which you seem to confirm here.
> 
> Even if this does not make sense, mapping should work without any issue.
> We are mapping only one page at a time. So what is the limit in that case?

The issue is not a limit on mappable pages, but the fact that there's
potentially no struct page_info for some or all of the crash area.

Jan

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 12:53                 ` Jan Beulich
@ 2013-11-18 13:24                   ` Daniel Kiper
  2013-11-18 13:43                     ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel Kiper @ 2013-11-18 13:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:

[...]

> The issue is not a limit on mappable pages, but the fact that there's
> potentially no struct page_info for some or all of the crash area.

OK, are they allocated at boot time for the whole system memory or just
for pages owned by the Xen hypervisor? What about pages owned by domains?
As far as I can see, we access the crash region as if it were owned by a
domain. AIUI, it was done that way because we wanted to be in line with
the normal kexec case. However, I am not sure right now that this is a
good idea. Maybe we should do something else in the crash dump case?

We could establish a limit (4 GiB?) for the crash region as a workaround.
Additionally, it should not overlap with Xen code and/or structures.

Daniel

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 13:24                   ` Daniel Kiper
@ 2013-11-18 13:43                     ` Jan Beulich
  2013-11-18 14:23                       ` Daniel Kiper
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-18 13:43 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: xen-devel, David Vrabel

>>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
> 
> [...]
> 
>> The issue is not a limit on mappable pages, but the fact that there's
>> potentially no struct page_info for some or all of the crash area.
> 
> OK, are they allocated at boot time for the whole system memory or just
> for pages owned by the Xen hypervisor? What about pages owned by domains?

Sorry, I don't understand the question.

> As far as I can see, we access the crash region as if it were owned by a
> domain. AIUI, it was done that way because we wanted to be in line with
> the normal kexec case. However, I am not sure right now that this is a
> good idea. Maybe we should do something else in the crash dump case?
> 
> We could establish a limit (4 GiB?) for the crash region as a workaround.

Are you talking about a size limit or an address one?

> Additionally, it should not overlap with Xen code and/or structures.

That's being guaranteed already afaict.

Jan

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 13:43                     ` Jan Beulich
@ 2013-11-18 14:23                       ` Daniel Kiper
  2013-11-18 15:24                         ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel Kiper @ 2013-11-18 14:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On Mon, Nov 18, 2013 at 01:43:35PM +0000, Jan Beulich wrote:
> >>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
> >
> > [...]
> >
> >> The issue is not a limit on mappable pages, but the fact that there's
> >> potentially no struct page_info for some or all of the crash area.
> >
> > OK, are they allocated at boot time for the whole system memory or just
> > for pages owned by the Xen hypervisor? What about pages owned by domains?
>
> Sorry, I don't understand the question.

Is struct page_info created for every page of system/machine memory?
Are they created at Xen boot time? Is it possible to create them later
when Xen is running?

> > As far as I can see, we access the crash region as if it were owned by a
> > domain. AIUI, it was done that way because we wanted to be in line with
> > the normal kexec case. However, I am not sure right now that this is a
> > good idea. Maybe we should do something else in the crash dump case?
> >
> > We could establish a limit (4 GiB?) for the crash region as a workaround.
>
> Are you talking about a size limit or an address one?

Size.

> > Additionally, it should not overlap with Xen code and/or structures.
>
> That's being guaranteed already afaict.

Yep.

Daniel

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 14:23                       ` Daniel Kiper
@ 2013-11-18 15:24                         ` Jan Beulich
  2013-11-18 21:50                           ` Daniel Kiper
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-18 15:24 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: xen-devel, David Vrabel

>>> On 18.11.13 at 15:23, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Mon, Nov 18, 2013 at 01:43:35PM +0000, Jan Beulich wrote:
>> >>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
>> > On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
>> >
>> > [...]
>> >
>> >> The issue is not a limit on mappable pages, but the fact that there's
>> >> potentially no struct page_info for some or all of the crash area.
>> >
>> > OK, are they allocated at boot time for the whole system memory or just
>> > for pages owned by the Xen hypervisor? What about pages owned by domains?
>>
>> Sorry, I don't understand the question.
> 
> Is struct page_info created for every page of system/machine memory?
> Are they created at Xen boot time? Is it possible to create them later
> when Xen is running?

No - if they aren't created, then generally because there's no
virtual address space to cover struct page_info itself or the 1:1
mapping that would also be needed for a "normal" page.

>> > As far as I can see, we access the crash region as if it were owned by a
>> > domain. AIUI, it was done that way because we wanted to be in line with
>> > the normal kexec case. However, I am not sure right now that this is a
>> > good idea. Maybe we should do something else in the crash dump case?
>> >
>> > We could establish a limit (4 GiB?) for the crash region as a workaround.
>>
>> Are you talking about a size limit or an address one?
> 
> Size.

So how would limiting the size help?

Jan

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 15:24                         ` Jan Beulich
@ 2013-11-18 21:50                           ` Daniel Kiper
  2013-11-19 12:40                             ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel Kiper @ 2013-11-18 21:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On Mon, Nov 18, 2013 at 03:24:03PM +0000, Jan Beulich wrote:
> >>> On 18.11.13 at 15:23, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > On Mon, Nov 18, 2013 at 01:43:35PM +0000, Jan Beulich wrote:
> >> >>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> >> > On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
> >> >
> >> > [...]
> >> >
> >> >> The issue is not a limit on mappable pages, but the fact that there's
> >> >> potentially no struct page_info for some or all of the crash area.
> >> >
> >> > OK, are they allocated at boot time for the whole system memory or just
> >> > for pages owned by the Xen hypervisor? What about pages owned by domains?
> >>
> >> Sorry, I don't understand the question.
> >
> > Is struct page_info created for every page of system/machine memory?
> > Are they created at Xen boot time? Is it possible to create them later
> > when Xen is running?
>
> No - if they aren't created, then generally because there's no
> virtual address space to cover struct page_info itself or the 1:1
> mapping that would also be needed for a "normal" page.

AIUI, frame_table is used to store struct page_info. It starts at
0xffff82e000000000 on x86_64 and its size is 128 GiB, while
sizeof(struct page_info) == 32 on x86_64. Hence, frame_table can hold
struct page_info entries for 16 TiB of RAM. The 1:1 mapping, on the other
hand, covers 5 TiB plus a 119.5 TiB continuation, i.e. 124.5 TiB. So the
main ceiling here is frame_table. Could we increase its size? Do you have
a machine with more than 16 TiB of RAM? Probably yes.

I think this issue is not kexec specific. It probably hurts Xen in general
because, AIUI, pages are not accessible if they do not have a relevant
struct page_info in frame_table (or a 1:1 mapping in the page tables).

Hmmm... What is the 1:1 mapping used for if the same page could be mapped
by map_domain_page()?

> >> > As far as I can see, we access the crash region as if it were owned by a
> >> > domain. AIUI, it was done that way because we wanted to be in line with
> >> > the normal kexec case. However, I am not sure right now that this is a
> >> > good idea. Maybe we should do something else in the crash dump case?
> >> >
> >> > We could establish a limit (4 GiB?) for the crash region as a workaround.
> >>
> >> Are you talking about a size limit or an address one?
> >
> > Size.
>
> So how would limiting the size help?

I thought that struct page_info for every page was created in a different
way. For now, we could require that the crash dump region end at or below
16 TiB, or increase the frame_table size if that is possible.

Daniel

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-18 21:50                           ` Daniel Kiper
@ 2013-11-19 12:40                             ` Jan Beulich
  2013-11-20 19:59                               ` Daniel Kiper
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Beulich @ 2013-11-19 12:40 UTC (permalink / raw)
  To: Daniel Kiper; +Cc: xen-devel, David Vrabel

>>> On 18.11.13 at 22:50, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Mon, Nov 18, 2013 at 03:24:03PM +0000, Jan Beulich wrote:
>> >>> On 18.11.13 at 15:23, Daniel Kiper <daniel.kiper@oracle.com> wrote:
>> > On Mon, Nov 18, 2013 at 01:43:35PM +0000, Jan Beulich wrote:
>> >> >>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
>> >> > On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
>> >> >
>> >> > [...]
>> >> >
>> >> >> The issue is not a limit on mappable pages, but the fact that there's
>> >> >> potentially no struct page_info for some or all of the crash area.
>> >> >
>> >> > OK, are they allocated at boot time for the whole system memory or just
>> >> > for pages owned by the Xen hypervisor? What about pages owned by domains?
>> >>
>> >> Sorry, I don't understand the question.
>> >
>> > Is struct page_info created for every page of system/machine memory?
>> > Are they created at Xen boot time? Is it possible to create them later
>> > when Xen is running?
>>
>> No - if they aren't created, then generally because there's no
>> virtual address space to cover struct page_info itself or the 1:1
>> mapping that would also be needed for a "normal" page.
> 
> AIUI, frame_table is used to store struct page_info. It starts at
> 0xffff82e000000000 on x86_64 and its size is 128 GiB, while
> sizeof(struct page_info) == 32 on x86_64. Hence, frame_table can hold
> struct page_info entries for 16 TiB of RAM. The 1:1 mapping, on the other
> hand, covers 5 TiB plus a 119.5 TiB continuation, i.e. 124.5 TiB. So the
> main ceiling here is frame_table. Could we increase its size? Do you have
> a machine with more than 16 TiB of RAM? Probably yes.
> 
> I think this issue is not kexec specific. It probably hurts Xen in general
> because, AIUI, pages are not accessible if they do not have a relevant
> struct page_info in frame_table (or a 1:1 mapping in the page tables).

It would certainly have helped if you had looked at the relevant
changes to that code. We can't simply go beyond 16 TiB, as that
means crossing the 44-bit boundary (turning into the 32-bit
boundary for MFNs).

And yes, the problem _is_ kexec specific - memory not usable
for "normal" purposes gets ignored.

> Hmmm... What is the 1:1 mapping used for if the same page could be mapped
> by map_domain_page()?

This is purely a simplification and a performance optimization:
obviously you don't want e.g. each caller of xmalloc() to do a
map/unmap operation.

>> >> > As far as I can see, we access the crash region as if it were owned by a
>> >> > domain. AIUI, it was done that way because we wanted to be in line with
>> >> > the normal kexec case. However, I am not sure right now that this is a
>> >> > good idea. Maybe we should do something else in the crash dump case?
>> >> >
>> >> > We could establish a limit (4 GiB?) for the crash region as a workaround.
>> >>
>> >> Are you talking about a size limit or an address one?
>> >
>> > Size.
>>
>> So how would limiting the size help?
> 
> I thought that struct page_info for every page was created in a different
> way. For now, we could require that the crash dump region end at or below
> 16 TiB, or increase the frame_table size if that is possible.

The former is exactly what I was asking to be done.

Jan

* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-19 12:40                             ` Jan Beulich
@ 2013-11-20 19:59                               ` Daniel Kiper
  2013-11-21 16:19                                 ` Jan Beulich
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel Kiper @ 2013-11-20 19:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On Tue, Nov 19, 2013 at 12:40:08PM +0000, Jan Beulich wrote:
> >>> On 18.11.13 at 22:50, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > On Mon, Nov 18, 2013 at 03:24:03PM +0000, Jan Beulich wrote:
> >> >>> On 18.11.13 at 15:23, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> >> > On Mon, Nov 18, 2013 at 01:43:35PM +0000, Jan Beulich wrote:
> >> >> >>> On 18.11.13 at 14:24, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> >> >> > On Mon, Nov 18, 2013 at 12:53:39PM +0000, Jan Beulich wrote:
> >> >> >
> >> >> > [...]
> >> >> >
> >> >> >> The issue is not a limit on mappable pages, but the fact that there's
> >> >> >> potentially no struct page_info for some or all of the crash area.
> >> >> >
> >> >> > OK, are they allocated at boot time for the whole system memory or just
> >> >> > for pages owned by the Xen hypervisor? What about pages owned by domains?
> >> >>
> >> >> Sorry, I don't understand the question.
> >> >
> >> > Is struct page_info created for every page of system/machine memory?
> >> > Are they created at Xen boot time? Is it possible to create them later
> >> > when Xen is running?
> >>
> >> No - if they aren't created, then generally because there's no
> >> virtual address space to cover struct page_info itself or the 1:1
> >> mapping that would also be needed for a "normal" page.
> >
> > AIUI, frame_table is used to store struct page_info. It starts at
> > 0xffff82e000000000 on x86_64 and its size is 128 GiB, while
> > sizeof(struct page_info) == 32 on x86_64. Hence, frame_table can hold
> > struct page_info entries for 16 TiB of RAM. The 1:1 mapping, on the other
> > hand, covers 5 TiB plus a 119.5 TiB continuation, i.e. 124.5 TiB. So the
> > main ceiling here is frame_table. Could we increase its size? Do you have
> > a machine with more than 16 TiB of RAM? Probably yes.
> >
> > I think this issue is not kexec specific. It probably hurts Xen in general
> > because, AIUI, pages are not accessible if they do not have a relevant
> > struct page_info in frame_table (or a 1:1 mapping in the page tables).
>
> It would certainly have helped if you had looked at the relevant
> changes to that code. We can't simply go beyond 16 TiB, as that
> means crossing the 44-bit boundary (turning into the 32-bit
> boundary for MFNs).

I could not find any real explanation/comment/doc of why there is a 44-bit
boundary. Could you give me a hint? I have a feeling that MFN bits 32-39
were used as flags somewhere? However, I could not find the relevant code
right now.

> And yes, the problem _is_ kexec specific - memory not usable
> for "normal" purposes gets ignored.

What do you mean by "normal" purposes? Xen heap, domain heap, etc.
If yes, then, AIUI, it means that in general Xen will work on machines
with such a huge amount of RAM, but all memory above 16 TiB will not be
available. I suppose that sooner or later we would like to make Xen
work with the whole memory on such machines. So are there any plans
to fix this issue? Just curious...

> > Hmmm... What is the 1:1 mapping used for if the same page could be mapped
> > by map_domain_page()?
>
> This is purely a simplification and a performance optimization:
> obviously you don't want e.g. each caller of xmalloc() to do a
> map/unmap operation.

Right. Does it mean that the whole system memory has a 1:1 mapping?
Or are there some regions deliberately omitted?

> >> >> > As far as I can see, we access the crash region as if it were owned by a
> >> >> > domain. AIUI, it was done that way because we wanted to be in line with
> >> >> > the normal kexec case. However, I am not sure right now that this is a
> >> >> > good idea. Maybe we should do something else in the crash dump case?
> >> >> >
> >> >> > We could establish a limit (4 GiB?) for the crash region as a workaround.
> >> >>
> >> >> Are you talking about a size limit or an address one?
> >> >
> >> > Size.
> >>
> >> So how would limiting the size help?
> >
> > I thought that struct page_info for every page is created in another way.
> > For now, we could either require that the crash dump region end at or
> > below 16 TiB, or increase the frame_table size if that is possible.
>
> The former is exactly what I was asking to be done.

I will probably prepare the relevant patches next week.
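The restriction discussed above, requiring the crash region to end at or below 16 TiB, boils down to a bounds check. A minimal C sketch, assuming the limit is exactly the 44-bit boundary (2^44 bytes); `CRASH_REGION_LIMIT` and `crash_region_below_limit()` are hypothetical names, not existing Xen code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit: the 44-bit boundary (2^44 bytes = 16 TiB). Hypothetical name. */
#define CRASH_REGION_LIMIT (UINT64_C(1) << 44)

/*
 * Hypothetical helper: returns true if the crash region
 * [start, start + size) ends at or below 16 TiB.  The comparison is
 * arranged so that start + size is never computed and cannot overflow.
 */
static bool crash_region_below_limit(uint64_t start, uint64_t size)
{
    assert(size != 0);              /* an empty crash region makes no sense */
    if (start >= CRASH_REGION_LIMIT)
        return false;               /* starts at or above the boundary */
    return size <= CRASH_REGION_LIMIT - start;
}
```

For example, a 1 GiB region ending exactly at 16 TiB passes, while a 2 GiB region starting at the same address is rejected. Comparing `size` against `CRASH_REGION_LIMIT - start` rather than computing `start + size` avoids 64-bit overflow for bogus inputs.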

Do you have a machine with such a huge amount of RAM to run some tests on?

Daniel


* Re: [PATCHv11 3/9] kexec: add infrastructure for handling kexec images
  2013-11-20 19:59                               ` Daniel Kiper
@ 2013-11-21 16:19                                 ` Jan Beulich
From: Jan Beulich @ 2013-11-21 16:19 UTC
  To: daniel.kiper; +Cc: xen-devel, david.vrabel

>>> Daniel Kiper <daniel.kiper@oracle.com> 11/20/13 8:59 PM >>>
>I could not find any real explanation/comment/doc of why there is a 44-bit
>boundary. Could you give me a hint? I have a feeling that MFN bits 32-39 are
>used as flags somewhere, but I could not find the relevant code.

asm-x86/mm.h has

#define __pdx_t unsigned int

struct page_list_entry
{
    __pdx_t next, prev;
};
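That 32-bit link type is where the 16 TiB ceiling comes from: a list link can index at most 2^32 pages, and with 4 KiB pages that covers 2^44 bytes. A small C sketch of the arithmetic, assuming 4 KiB pages and a 32-bit `unsigned int`; only `__pdx_t` and `struct page_list_entry` mirror the quoted header, while `max_ram_bytes()` is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the definitions quoted above from asm-x86/mm.h. */
#define __pdx_t unsigned int

struct page_list_entry
{
    __pdx_t next, prev;
};

/* Assumption: 4 KiB pages, i.e. PAGE_SHIFT == 12, as on x86. */
#define PAGE_SHIFT 12

/* The sketch assumes a 32-bit unsigned int, so a link names at most 2^32 pages. */
static_assert(sizeof(__pdx_t) == 4, "sketch assumes a 32-bit unsigned int");

/*
 * Hypothetical helper: bytes of RAM reachable when every page must be
 * addressable through a 32-bit index:
 * 2^32 pages * 2^12 bytes per page = 2^44 bytes = 16 TiB.
 */
static uint64_t max_ram_bytes(void)
{
    uint64_t max_pages = (uint64_t)UINT32_MAX + 1;   /* 2^32 */
    return max_pages << PAGE_SHIFT;                  /* 2^44 */
}
```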

>> And yes, the problem _is_ kexec specific - memory not usable
>> for "normal" purposes gets ignored.
>
>What do you mean by "normal" purposes: Xen heap, domain heap, etc.?

Right.

>If so then, AIUI, Xen will in general work on machines with such a huge
>amount of RAM, but all memory above 16 TiB will not be available. I suppose
>that sooner or later we would like Xen to use all of the memory on such
>machines. So are there any plans to fix this issue? Just curious...

Sure. But the brute force approach (using 64-bit next/prev pointers) would
have the downside of growing struct page_info from 32 to 40 bytes. Not only
does that mean higher memory overhead, it also means that calculating the
entry from an MFN (which we do a lot) can't be done by a simple shift anymore.
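To make the cost concrete: a 32-byte entry can be located with a single shift (index << 5), whereas a 40-byte entry needs a multiply or a shift-and-add sequence, and the frame table grows by a quarter. A C sketch with hypothetical stand-in structs (only their sizes matter; these are not Xen's `struct page_info`), assuming 4 KiB pages:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for struct page_info: only their sizes matter.
 * 32 bytes is a power of two; 40 bytes is not.
 */
struct page_info_32 { uint64_t words[4]; };  /* 32 bytes */
struct page_info_40 { uint64_t words[5]; };  /* 40 bytes */

static_assert(sizeof(struct page_info_32) == 32, "expected 32 bytes");
static_assert(sizeof(struct page_info_40) == 40, "expected 40 bytes");

/* Byte offset of a page's entry in the frame table: a single shift. */
static uint64_t offset_32(uint64_t idx)
{
    return idx << 5;               /* idx * 32 */
}

/* With a 40-byte entry the offset needs a multiply (or shift-and-add). */
static uint64_t offset_40(uint64_t idx)
{
    return idx * 40;               /* == (idx << 5) + (idx << 3) */
}

/* Frame-table overhead per 4 KiB page, in parts per million (truncated). */
static uint64_t overhead_ppm(size_t entry_size)
{
    return (uint64_t)entry_size * 1000000 / 4096;
}
```

With 2^32 pages (16 TiB of RAM) this works out to a 128 GiB frame table at 32 bytes per entry versus 160 GiB at 40 bytes.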

>> > Hmmm... What is the 1:1 mapping used for if the same page could be
>> > mapped by map_domain_page()?
>>
>> This is purely simplification and a performance optimization:
>> Obviously you don't want e.g. each caller of xmalloc() to do a
>> map/unmap operation.
>
>Right. Does this mean that all system memory has a 1:1 mapping?
>Or are some regions deliberately omitted?

All "normal" memory has a 1:1 mapping, but as you may recall, not all of the
1:1 mapping is available at all times. Hence the need for map_domain_page()
for anything not coming from the Xen heap.
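The trade-off described above can be sketched as follows: a frame inside the currently usable part of the 1:1 map is reached by plain address arithmetic, and anything else needs a temporary mapping such as the one map_domain_page() provides. The base address, the usable-frame limit, and `directmap_lookup()` below are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Illustrative values only: where the 1:1 map starts in the virtual
 * address space, and how many frames of it are currently usable. */
#define DIRECTMAP_BASE UINT64_C(0xffff830000000000)
static uint64_t directmap_usable_frames = UINT64_C(1) << 20; /* 4 GiB */

/*
 * Hypothetical lookup: if the frame lies in the usable part of the 1:1
 * map, store its virtual address in *va and return true.  Otherwise
 * return false; the caller would then need a temporary mapping instead
 * (in Xen, that is the job of map_domain_page()).
 */
static bool directmap_lookup(uint64_t mfn, uint64_t *va)
{
    assert(va != NULL);
    if (mfn >= directmap_usable_frames)
        return false;
    *va = DIRECTMAP_BASE + (mfn << PAGE_SHIFT);
    return true;
}
```

The fast path is a shift and an add with no page-table updates, which is why xmalloc()-style callers use the 1:1 map rather than a map/unmap pair per access.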

>Do you have a machine with such a huge amount of RAM to run some tests on?

No, I don't. When fixing issues in this area, it's generally in response to
some partner having found it, and hence being able to test it for us.

Jan


Thread overview: 99+ messages
2013-11-06 14:49 [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels David Vrabel
2013-11-06 14:49 ` [PATCH 1/9] x86: give FIX_EFI_MPF its own fixmap entry David Vrabel
2013-11-06 14:49 ` David Vrabel
2013-11-06 18:49   ` [Xen-devel] " Don Slutz
2013-11-06 18:49   ` Don Slutz
2013-11-06 14:49 ` [PATCH 2/9] kexec: add public interface for improved load/unload sub-ops David Vrabel
2013-11-06 14:49 ` David Vrabel
2013-11-07 20:38   ` Don Slutz
2013-11-06 14:49 ` [PATCH 3/9] kexec: add infrastructure for handling kexec images David Vrabel
2013-11-07 20:40   ` [Xen-devel] " Don Slutz
2013-11-07 23:51     ` Don Slutz
2013-11-07 23:51       ` [Xen-devel] " Don Slutz
2013-11-07 20:40   ` Don Slutz
2013-11-08 12:50   ` [PATCHv11 " David Vrabel
2013-11-11 14:37     ` Don Slutz
2013-11-15 14:35     ` Jan Beulich
2013-11-15 18:31       ` David Vrabel
2013-11-18  8:07         ` Jan Beulich
2013-11-18 11:04           ` David Vrabel
2013-11-18 11:34             ` Jan Beulich
2013-11-18 12:25               ` Daniel Kiper
2013-11-18 12:53                 ` Jan Beulich
2013-11-18 13:24                   ` Daniel Kiper
2013-11-18 13:43                     ` Jan Beulich
2013-11-18 14:23                       ` Daniel Kiper
2013-11-18 15:24                         ` Jan Beulich
2013-11-18 21:50                           ` Daniel Kiper
2013-11-19 12:40                             ` Jan Beulich
2013-11-20 19:59                               ` Daniel Kiper
2013-11-21 16:19                                 ` Jan Beulich
2013-11-18 11:43             ` Daniel Kiper
2013-11-06 14:49 ` [PATCH " David Vrabel
2013-11-06 14:49 ` [PATCH 4/9] kexec: extend hypercall with improved load/unload ops David Vrabel
2013-11-06 14:49   ` David Vrabel
2013-11-07 20:56   ` Don Slutz
2013-11-07 20:56     ` [Xen-devel] " Don Slutz
2013-11-06 14:49 ` [PATCH 5/9] xen: kexec crash image when dom0 crashes David Vrabel
2013-11-07 20:44   ` Don Slutz
2013-11-07 20:44   ` [Xen-devel] " Don Slutz
2013-11-06 14:49 ` David Vrabel
2013-11-06 14:49 ` [PATCH 6/9] libxc: add hypercall buffer arrays David Vrabel
2013-11-06 14:49 ` David Vrabel
2013-11-07 20:46   ` Don Slutz
2013-11-07 20:46   ` [Xen-devel] " Don Slutz
2013-11-06 14:49 ` [PATCH 7/9] libxc: add API for kexec hypercall David Vrabel
2013-11-07 20:48   ` Don Slutz
2013-11-06 14:49 ` David Vrabel
2013-11-06 14:49 ` [PATCH 8/9] x86: check kexec relocation code fits in a page David Vrabel
2013-11-06 14:49 ` David Vrabel
2013-11-06 18:51   ` [Xen-devel] " Don Slutz
2013-11-06 18:51   ` Don Slutz
2013-11-06 14:49 ` [PATCH 9/9] MAINTAINERS: Add KEXEC maintainer David Vrabel
2013-11-06 18:50   ` Don Slutz
2013-11-06 14:49 ` David Vrabel
2013-11-07 21:16 ` [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels Daniel Kiper
2013-11-07 21:16 ` Daniel Kiper
2013-11-07 21:25   ` Andrew Cooper
2013-11-07 21:25   ` [Xen-devel] " Andrew Cooper
2013-11-07 21:41     ` Daniel Kiper
2013-11-07 21:41     ` [Xen-devel] " Daniel Kiper
2013-11-07 21:57       ` Andrew Cooper
2013-11-07 21:57       ` [Xen-devel] " Andrew Cooper
2013-11-08 13:20       ` David Vrabel
2013-11-08 13:13   ` David Vrabel
2013-11-08 13:19     ` Jan Beulich
2013-11-08 14:01       ` Andrew Cooper
2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
2013-11-08 14:22         ` Don Slutz
2013-11-08 14:22         ` [Xen-devel] " Don Slutz
2013-11-08 14:36         ` Jan Beulich
2013-11-08 15:15         ` Daniel Kiper
2013-11-08 15:15         ` [Xen-devel] " Daniel Kiper
2013-11-08 15:42           ` Konrad Rzeszutek Wilk
2013-11-08 16:28             ` Daniel Kiper
2013-11-08 16:28             ` [Xen-devel] " Daniel Kiper
2013-11-08 15:42           ` Konrad Rzeszutek Wilk
2013-11-08 15:48           ` [Xen-devel] " Andrew Cooper
2013-11-08 15:48           ` Andrew Cooper
2013-11-08 13:19     ` Jan Beulich
2013-11-08 13:48     ` Daniel Kiper
2013-11-08 14:01       ` [Xen-devel] " Andrew Cooper
2013-11-08 14:01       ` Andrew Cooper
2013-11-08 15:04     ` Daniel Kiper
2013-11-08 13:13   ` David Vrabel
2013-11-09 19:18   ` Daniel Kiper
2013-11-11 14:34     ` Don Slutz
2013-11-11 15:09     ` David Vrabel
2013-11-09 19:18   ` Daniel Kiper
2013-11-11 17:18 ` Keir Fraser
2013-11-11 17:18 ` [Xen-devel] " Keir Fraser
