All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0 of 3] KEXEC fixes
@ 2011-12-22 17:36 Andrew Cooper
  2011-12-22 17:36 ` [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2 Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andrew Cooper @ 2011-12-22 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

This set of 3 patches contain fixes and functional improvements to the
kexec path in Xen.  The first two have already been posted to the list
individually but are presented here as a group.

Patch 1 contains an implementation of a new hypercall which will not
truncate 64bit pointers to 32bit when in compatability mode.

Patch 2 changes the per cpu crash notes to be allocated on boot.

Patch 3 implements logic to allow Xen to specifically allocate the
crash notes and console buffer in lower memory because if 64/32bit issues
on machines with large quantities of RAM.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2
  2011-12-22 17:36 [PATCH 0 of 3] KEXEC fixes Andrew Cooper
@ 2011-12-22 17:36 ` Andrew Cooper
  2011-12-23  8:34   ` Jan Beulich
  2011-12-22 17:36 ` [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6 Andrew Cooper
  2011-12-22 17:36 ` [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory Andrew Cooper
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2011-12-22 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Currently, KEXEC_CMD_kexec_get_range in compat mode, or with 32bit
Xen, will truncate 64bit pointers to 32bits, leading to problems on
machines with more than 4GiB of RAM.  As 32bit dom0 kernels tend to be
PAE capable, this is especially wasteful, as most structures currently
limited to <4GiB could easily be <64GiB instead.

Therefore, introduce a new hypercall 'KEXEC_CMD_kexec64_get_range'
which passes similar information as KEXEC_CMD_kexec_get_range, but
which uses a structure with explicitly specified field widths, causing
it to be identical in the compat and regular case.  This new
structure, xen_kexec64_range_t, will be the same as xen_kexec_range_t
if Xen is compiled for x86_64, but will still use 64bit integers if
Xen is compiled for x86_32.

To fix 32bit Xen which uses 32bit integers for addresses and sizes,
change the internals to use xen_kexec64_range_t which will use 64bit
integers instead.  This also involves changing several casts to
explicitly use uint64_ts rather than unsigned longs.

In addition, the hypercall entry points need updating to be able to
cater for all possibilities.

|Xen/dom0 bits|          Bit width of addresses in structure for        |
+------+------+---------------------------+-----------------------------+
| Xen  | dom0 | KEXEC_CMD_kexec_get_range | KEXEC_CMD_kexec64_get_range |
+------+------+---------------------------+-----------------------------+
|  32  |  32  |             32            |              64             |
|  64  |  32  |             32            |              64             |
|  64  |  64  |             64            |              64             |
+------+------+---------------------------+-----------------------------+

This has been solved by splitting do_kexec_op_internal back into
do_kexec_op{,_compat} and changing kexec_get_range{,_compat} to
kexec_get_range{32,64} which now exist irrespective of CONFIG_COMPAT
or not.

The knock-on effect of removing do_kexec_op_internal means that
kexec_load_unload_compat is only ever called from inside an #ifdef
CONFIG_COMPAT codepath, which does not exist on Xen x86_32.
Therefore, move the #ifdef CONFIG_COMPAT guards to include the entire
function.

Finally, and a little unrelatedly, fix KEXEC_CMD_kexec_{load,unload}
to return -EBUSY instead of EOK if a kexec is in progress.

Changes since v1:
 *  Fix check for pointer truncation to work when Xen is compiled for
    32 bit mode as well.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r e56500f95b6a -r f571dc8e4368 xen/arch/ia64/xen/machine_kexec.c
--- a/xen/arch/ia64/xen/machine_kexec.c
+++ b/xen/arch/ia64/xen/machine_kexec.c
@@ -102,10 +102,10 @@ void machine_reboot_kexec(xen_kexec_imag
 	machine_kexec(image);
 }
 
-int machine_kexec_get_xen(xen_kexec_range_t *range)
+int machine_kexec_get_xen(xen_kexec64_range_t *range)
 {
 	range->start = ia64_tpa(_text);
-	range->size = (unsigned long)_end - (unsigned long)_text;
+	range->size = (uint64_t)_end - (uint64_t)_text;
 	return 0;
 }
 
@@ -113,31 +113,31 @@ int machine_kexec_get_xen(xen_kexec_rang
 #define ELF_PAGE_SIZE  (__IA64_UL_CONST(1) << ELF_PAGE_SHIFT)
 #define ELF_PAGE_MASK  (~(ELF_PAGE_SIZE - 1))
 
-static int machine_kexec_get_xenheap(xen_kexec_range_t *range)
+static int machine_kexec_get_xenheap(xen_kexec64_range_t *range)
 {
 	range->start = (ia64_tpa(_end) + (ELF_PAGE_SIZE - 1)) & ELF_PAGE_MASK;
 	range->size =
-		(((unsigned long)range->start + KERNEL_TR_PAGE_SIZE) &
+		(((uint64_t)range->start + KERNEL_TR_PAGE_SIZE) &
          ~(KERNEL_TR_PAGE_SIZE - 1))
-		- (unsigned long)range->start;
+		- (uint64_t)range->start;
 	return 0;
 }
 
-static int machine_kexec_get_boot_param(xen_kexec_range_t *range)
+static int machine_kexec_get_boot_param(xen_kexec64_range_t *range)
 {
 	range->start = __pa(ia64_boot_param);
 	range->size = sizeof(*ia64_boot_param);
 	return 0;
 }
 
-static int machine_kexec_get_efi_memmap(xen_kexec_range_t *range)
+static int machine_kexec_get_efi_memmap(xen_kexec64_range_t *range)
 {
 	range->start = ia64_boot_param->efi_memmap;
 	range->size = ia64_boot_param->efi_memmap_size;
 	return 0;
 }
 
-int machine_kexec_get(xen_kexec_range_t *range)
+int machine_kexec_get(xen_kexec64_range_t *range)
 {
 	switch (range->range) {
 	case KEXEC_RANGE_MA_XEN:
diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -120,7 +120,7 @@ void machine_kexec(xen_kexec_image_t *im
     }
 }
 
-int machine_kexec_get(xen_kexec_range_t *range)
+int machine_kexec_get(xen_kexec64_range_t *range)
 {
 	if (range->range != KEXEC_RANGE_MA_XEN)
 		return -EINVAL;
diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/x86_32/machine_kexec.c
--- a/xen/arch/x86/x86_32/machine_kexec.c
+++ b/xen/arch/x86/x86_32/machine_kexec.c
@@ -11,11 +11,11 @@
 #include <asm/page.h>
 #include <public/kexec.h>
 
-int machine_kexec_get_xen(xen_kexec_range_t *range)
+int machine_kexec_get_xen(xen_kexec64_range_t *range)
 {
         range->start = virt_to_maddr(_start);
-        range->size = (unsigned long)xenheap_phys_end -
-                      (unsigned long)range->start;
+        range->size = (uint64_t)xenheap_phys_end -
+                      (uint64_t)range->start;
         return 0;
 }
 
diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/x86_64/machine_kexec.c
--- a/xen/arch/x86/x86_64/machine_kexec.c
+++ b/xen/arch/x86/x86_64/machine_kexec.c
@@ -11,10 +11,10 @@
 #include <asm/page.h>
 #include <public/kexec.h>
 
-int machine_kexec_get_xen(xen_kexec_range_t *range)
+int machine_kexec_get_xen(xen_kexec64_range_t *range)
 {
         range->start = virt_to_maddr(_start);
-        range->size = virt_to_maddr(_end) - (unsigned long)range->start;
+        range->size = virt_to_maddr(_end) - (uint64_t)range->start;
         return 0;
 }
 
diff -r e56500f95b6a -r f571dc8e4368 xen/common/kexec.c
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -56,6 +56,14 @@ static struct {
     unsigned long size;
 } ranges[16] __initdata;
 
+/* This XLAT macro is required now even without CONFIG_COMPAT. */
+#define TRANSLATE_kexec_range(_d_, _s_) do { \
+    (_d_)->range = (_s_)->range; \
+    (_d_)->nr = (_s_)->nr; \
+    (_d_)->size = (_s_)->size; \
+    (_d_)->start = (_s_)->start; \
+} while (0)
+
 /*
  * Parse command lines in the format
  *
@@ -280,7 +288,7 @@ static int sizeof_note(const char *name,
             ELFNOTE_ALIGN(descsz));
 }
 
-static int kexec_get_reserve(xen_kexec_range_t *range)
+static int kexec_get_reserve(xen_kexec64_range_t *range)
 {
     if ( kexec_crash_area.size > 0 && kexec_crash_area.start > 0) {
         range->start = kexec_crash_area.start;
@@ -291,7 +299,7 @@ static int kexec_get_reserve(xen_kexec_r
     return 0;
 }
 
-static int kexec_get_cpu(xen_kexec_range_t *range)
+static int kexec_get_cpu(xen_kexec64_range_t *range)
 {
     int nr = range->nr;
     int nr_bytes = 0;
@@ -335,14 +343,14 @@ static int kexec_get_cpu(xen_kexec_range
     return 0;
 }
 
-static int kexec_get_vmcoreinfo(xen_kexec_range_t *range)
+static int kexec_get_vmcoreinfo(xen_kexec64_range_t *range)
 {
     range->start = __pa((unsigned long)vmcoreinfo_data);
     range->size = VMCOREINFO_BYTES;
     return 0;
 }
 
-static int kexec_get_range_internal(xen_kexec_range_t *range)
+static int kexec_get_range_internal(xen_kexec64_range_t *range)
 {
     int ret = -EINVAL;
 
@@ -365,9 +373,14 @@ static int kexec_get_range_internal(xen_
     return ret;
 }
 
-static int kexec_get_range(XEN_GUEST_HANDLE(void) uarg)
+/* This function can be invoked from either a KEXEC_CMD_kexec64_get_range
+ * hypercall, or from a KEXEC_CMD_kexec_get_range hypercall with 64bit dom0
+ * on 64bit Xen.  In the first case, the guest structure is a 
+ * xen_kexec64_range_t, and in the second case, the xen_kexec_range_t guest
+ * structure is identical to xen_kexec64_range_t. */
+static int kexec_get_range64(XEN_GUEST_HANDLE(void) uarg)
 {
-    xen_kexec_range_t range;
+    xen_kexec64_range_t range;
     int ret = -EINVAL;
 
     if ( unlikely(copy_from_guest(&range, uarg, 1)) )
@@ -381,34 +394,40 @@ static int kexec_get_range(XEN_GUEST_HAN
     return ret;
 }
 
-static int kexec_get_range_compat(XEN_GUEST_HANDLE(void) uarg)
+/* This function can be invoked from either a KEXEC_CMD_kexec_get_range
+ * compat hypercall for 32bit dom0 on 64bit Xen, or from the same hypercall
+ * on 32bit Xen.  In both cases, the guest argument uses 32bit integers, so 
+ * translate them to 64bit for use by kexec_get_range_internal.  The
+ * preprocessor guards are to choose the correct 32bit structure, as the
+ * compat_* structures dont exist in 32bit Xen. */
+static int kexec_get_range32(XEN_GUEST_HANDLE(void) uarg)
 {
+    xen_kexec64_range_t range64;
 #ifdef CONFIG_COMPAT
-    xen_kexec_range_t range;
-    compat_kexec_range_t compat_range;
+    compat_kexec_range_t range32;
+#else
+    xen_kexec_range_t range32;
+#endif
     int ret = -EINVAL;
 
-    if ( unlikely(copy_from_guest(&compat_range, uarg, 1)) )
+    if ( unlikely(copy_from_guest(&range32, uarg, 1)) )
         return -EFAULT;
 
-    XLAT_kexec_range(&range, &compat_range);
+    TRANSLATE_kexec_range(&range64, &range32);
 
-    ret = kexec_get_range_internal(&range);
+    ret = kexec_get_range_internal(&range64);
 
     /* Dont silently truncate physical addresses or sizes. */
-    if ( (range.start | range.size) & ~(unsigned long)(~0u) )
+    if ( (range64.start | range64.size) & 0xffffffff00000000ULL )
         return -ERANGE;
 
     if ( ret == 0 ) {
-        XLAT_kexec_range(&compat_range, &range);
-        if ( unlikely(copy_to_guest(uarg, &compat_range, 1)) )
+        TRANSLATE_kexec_range(&range32, &range64);
+        if ( unlikely(copy_to_guest(uarg, &range32, 1)) )
              return -EFAULT;
     }
 
     return ret;
-#else /* CONFIG_COMPAT */
-    return 0;
-#endif /* CONFIG_COMPAT */
 }
 
 static int kexec_load_get_bits(int type, int *base, int *bit)
@@ -543,10 +562,10 @@ static int kexec_load_unload(unsigned lo
     return kexec_load_unload_internal(op, &load);
 }
 
+#ifdef CONFIG_COMPAT
 static int kexec_load_unload_compat(unsigned long op,
                                     XEN_GUEST_HANDLE(void) uarg)
 {
-#ifdef CONFIG_COMPAT
     compat_kexec_load_t compat_load;
     xen_kexec_load_t load;
 
@@ -564,10 +583,8 @@ static int kexec_load_unload_compat(unsi
     XLAT_kexec_image(&load.image, &compat_load.image);
 
     return kexec_load_unload_internal(op, &load);
-#else /* CONFIG_COMPAT */
-    return 0;
+}
 #endif /* CONFIG_COMPAT */
-}
 
 static int kexec_exec(XEN_GUEST_HANDLE(void) uarg)
 {
@@ -601,8 +618,51 @@ static int kexec_exec(XEN_GUEST_HANDLE(v
     return -EINVAL; /* never reached */
 }
 
-int do_kexec_op_internal(unsigned long op, XEN_GUEST_HANDLE(void) uarg,
-                           int compat)
+long do_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
+{
+    unsigned long flags;
+    int ret = -EINVAL;
+
+    if ( !IS_PRIV(current->domain) )
+        return -EPERM;
+
+    ret = xsm_kexec();
+    if ( ret )
+        return ret;
+
+    switch ( op )
+    {
+#ifdef __i386__
+    case KEXEC_CMD_kexec_get_range:
+        ret = kexec_get_range32(uarg);
+        break;
+#else
+    case KEXEC_CMD_kexec_get_range:
+#endif
+    case KEXEC_CMD_kexec64_get_range:
+        ret = kexec_get_range64(uarg);
+        break;
+
+    case KEXEC_CMD_kexec_load:
+    case KEXEC_CMD_kexec_unload:
+        spin_lock_irqsave(&kexec_lock, flags);
+        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
+            ret = kexec_load_unload(op, uarg);
+        else
+            ret = -EBUSY;
+        spin_unlock_irqrestore(&kexec_lock, flags);
+        break;
+
+    case KEXEC_CMD_kexec:
+        ret = kexec_exec(uarg);
+        break;
+    }
+
+    return ret;
+}
+
+#ifdef CONFIG_COMPAT
+int compat_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
 {
     unsigned long flags;
     int ret = -EINVAL;
@@ -617,23 +677,21 @@ int do_kexec_op_internal(unsigned long o
     switch ( op )
     {
     case KEXEC_CMD_kexec_get_range:
-        if (compat)
-                ret = kexec_get_range_compat(uarg);
-        else
-                ret = kexec_get_range(uarg);
+        ret = kexec_get_range32(uarg);
+        break;
+    case KEXEC_CMD_kexec64_get_range:
+        ret = kexec_get_range64(uarg);
         break;
     case KEXEC_CMD_kexec_load:
     case KEXEC_CMD_kexec_unload:
         spin_lock_irqsave(&kexec_lock, flags);
         if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
-        {
-                if (compat)
-                        ret = kexec_load_unload_compat(op, uarg);
-                else
-                        ret = kexec_load_unload(op, uarg);
-        }
+            ret = kexec_load_unload_compat(op, uarg);
+        else
+            ret = -EBUSY;
         spin_unlock_irqrestore(&kexec_lock, flags);
         break;
+
     case KEXEC_CMD_kexec:
         ret = kexec_exec(uarg);
         break;
@@ -641,17 +699,6 @@ int do_kexec_op_internal(unsigned long o
 
     return ret;
 }
-
-long do_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
-{
-    return do_kexec_op_internal(op, uarg, 0);
-}
-
-#ifdef CONFIG_COMPAT
-int compat_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
-{
-    return do_kexec_op_internal(op, uarg, 1);
-}
 #endif
 
 /*
diff -r e56500f95b6a -r f571dc8e4368 xen/include/public/kexec.h
--- a/xen/include/public/kexec.h
+++ b/xen/include/public/kexec.h
@@ -155,6 +155,15 @@ typedef struct xen_kexec_range {
     unsigned long start;
 } xen_kexec_range_t;
 
+#define KEXEC_CMD_kexec64_get_range      4
+/* xen_kexec_range_t using explicit sizes for fields. */
+typedef struct xen_kexec64_range {
+    int32_t range;
+    int32_t nr;
+    uint64_t size;
+    uint64_t start;
+} xen_kexec64_range_t;
+
 #endif /* _XEN_PUBLIC_KEXEC_H */
 
 /*
diff -r e56500f95b6a -r f571dc8e4368 xen/include/xen/kexec.h
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -34,8 +34,8 @@ void kexec_crash(void);
 void kexec_crash_save_cpu(void);
 crash_xen_info_t *kexec_crash_save_info(void);
 void machine_crash_shutdown(void);
-int machine_kexec_get(xen_kexec_range_t *range);
-int machine_kexec_get_xen(xen_kexec_range_t *range);
+int machine_kexec_get(xen_kexec64_range_t *range);
+int machine_kexec_get_xen(xen_kexec64_range_t *range);
 
 void compat_machine_kexec(unsigned long rnk,
                           unsigned long indirection_page,
diff -r e56500f95b6a -r f571dc8e4368 xen/include/xlat.lst
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -50,9 +50,7 @@
 ?	grant_entry_v1			grant_table.h
 ?       grant_entry_header              grant_table.h
 ?	grant_entry_v2			grant_table.h
-?	kexec_exec			kexec.h
 !	kexec_image			kexec.h
-!	kexec_range			kexec.h
 !	add_to_physmap			memory.h
 !	foreign_memory_map		memory.h
 !	memory_exchange			memory.h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6
  2011-12-22 17:36 [PATCH 0 of 3] KEXEC fixes Andrew Cooper
  2011-12-22 17:36 ` [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2 Andrew Cooper
@ 2011-12-22 17:36 ` Andrew Cooper
  2011-12-23  8:42   ` Jan Beulich
  2011-12-22 17:36 ` [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory Andrew Cooper
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2011-12-22 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Currently, the buffers for crash notes are allocated per CPU when a
KEXEC_CMD_kexec_get_range hypercall is made, referencing the CPU in
question.

Although it certainly works in general, there are a few edge case problems:

1) There appears to be no guarentee that dom0 will make this hypercall
   for each pcpu on the system.  There appears to be variable
   behaviour depending on how many cpus dom0 is compiled for, and how
   many vcpus Xen gives to dom0.

2) The allocation of these buffers occur at the whim of dom0.  While
   this is typically very early on dom0 boot, but not guarenteed.

3) It is possible (although not sensible) for a crash
   kernel to be loaded without these (or some of these) hypercalls
   being made. Under these circumstances, a crash would cause the
   crash note code path will suffer a slew of null pointer deferences.

4) If the hypercalls are made after late in the day, it is possible
   for the hypercall to return -ENOMEM.  As code tends to be more
   fragile once memory is enhausted, the likelyhood of us needing the
   crash kernel is greater.

In addition, my forthcoming code to support 32bit kdump kernels on
64bit Xen on large (>64GB) boxes will require some guarentees as to
where the crash note buffers are actually allocated in physical
memory.  This is far easier to sort out at boot time, rather than
after dom0 has been booted and potentially using the physical memory
required.

Therefore, allocate the crash note buffers at boot time.

Changes since v5:
 *  Introduce sizeof_cpu_notes to move calculation of note size into a
    separate location.
 *  Tweak sizeof_note() to return size_t rather than int, as it is a
    function based upon sizeof().

Changes since v4:

 *  Replace the current cpu crash note scheme of using void pointers
    and hand calculating the size each time is needed, by a range
    structure containing a pointer and a size.  This removes duplicate
    times where the size is calculated.
 *  Tweak kexec_get_cpu().  Don't fail if a cpu is offline because it
    may already have crash notes, and may be up by the time a crash
    happens.  Split the error conditions up to return ERANGE for an
    out-of-range cpu request rather than EINVAL.  Finally, returning a
    range of zeros is acceptable, so do this in preference to failing.

Changes since v3:

 *  Alter the spinlocks to avoid calling xmalloc/xfree while holding
    the lock.
 *  Tidy up the coding style used.

Changes since v2:

 *  Allocate crash_notes dynamically using nr_cpu_ids at boot time,
    rather than statically using NR_CPUS.
 *  Fix the incorrect use of signed integers for cpu id.
 *  Fix collateral damage to do_crashdump_trigger() and
    crashdump_trigger_handler caused by reordering sizeof_note() and
    setup_note()
 *  Tweak the issue about returing -ENOMEM from kexec_init_cpu_note().
    No functional change.
 *  Change kexec_get_cpu() to attempt to allocate crash note buffers
    in case we have more free memory now than when the pcpu came up.
 *  Now that there are two codepaths possibly allocating crash notes,
    protect the allocation itself with a spinlock.

Changes since v1:

 *  Use cpu hotplug notifiers to handle allocating of the notes
    buffers rather than assuming the boot state of cpus will be the
    same as the crash state.
 *  Move crash_notes from being per_cpu.  This is because the kdump
    kernel elf binary put in the crash area is hard coded to physical
    addresses which the dom0 kernel typically obtains at boot time.
    If a cpu is offlined, its buffer should not be deallocated because
    the kdump kernel would read junk when trying to get the crash
    notes.  Similarly, the same problem would occur if the cpu was
    re-onlined later and its crash notes buffer was allocated elsewhere.
 *  Only attempt to allocate buffers if a crash area has been
    specified.  Else, allocating crash note buffers is a waste of
    space.  Along with this, change the test in kexec_get_cpu to
    return -EINVAL if no buffers have been allocated.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r f571dc8e4368 -r 3da37c68284f xen/common/kexec.c
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -25,13 +25,19 @@
 #include <xen/kexec.h>
 #include <public/elfnote.h>
 #include <xsm/xsm.h>
+#include <xen/cpu.h>
 #ifdef CONFIG_COMPAT
 #include <compat/kexec.h>
 #endif
 
 bool_t kexecing = FALSE;
 
-static DEFINE_PER_CPU_READ_MOSTLY(void *, crash_notes);
+/* Memory regions to store the per cpu register state etc. on a crash. */
+typedef struct { Elf_Note * start; size_t size; } crash_note_range_t;
+static crash_note_range_t * crash_notes;
+
+/* Lock to prevent race conditions when allocating the crash note buffers. */
+static DEFINE_SPINLOCK(crash_notes_lock);
 
 static Elf_Note *xen_crash_note;
 
@@ -173,13 +179,17 @@ static void one_cpu_only(void)
 void kexec_crash_save_cpu(void)
 {
     int cpu = smp_processor_id();
-    Elf_Note *note = per_cpu(crash_notes, cpu);
+    Elf_Note *note;
     ELF_Prstatus *prstatus;
     crash_xen_core_t *xencore;
 
+    BUG_ON ( ! crash_notes );
+
     if ( cpumask_test_and_set_cpu(cpu, &crash_saved_cpus) )
         return;
 
+    note = crash_notes[cpu].start;
+
     prstatus = (ELF_Prstatus *)ELFNOTE_DESC(note);
 
     note = ELFNOTE_NEXT(note);
@@ -265,13 +275,6 @@ static struct keyhandler crashdump_trigg
     .desc = "trigger a crashdump"
 };
 
-static __init int register_crashdump_trigger(void)
-{
-    register_keyhandler('C', &crashdump_trigger_keyhandler);
-    return 0;
-}
-__initcall(register_crashdump_trigger);
-
 static void setup_note(Elf_Note *n, const char *name, int type, int descsz)
 {
     int l = strlen(name) + 1;
@@ -281,13 +284,144 @@ static void setup_note(Elf_Note *n, cons
     n->type = type;
 }
 
-static int sizeof_note(const char *name, int descsz)
+static size_t sizeof_note(const char *name, int descsz)
 {
     return (sizeof(Elf_Note) +
             ELFNOTE_ALIGN(strlen(name)+1) +
             ELFNOTE_ALIGN(descsz));
 }
 
+static size_t sizeof_cpu_notes(const unsigned long cpu)
+{
+    /* All CPUs present a PRSTATUS and crash_xen_core note. */
+    size_t bytes =
+        + sizeof_note("CORE", sizeof(ELF_Prstatus)) +
+        + sizeof_note("Xen", sizeof(crash_xen_core_t));
+
+    /* CPU0 also presents the crash_xen_info note. */
+    if ( ! cpu )
+        bytes = bytes +
+            sizeof_note("Xen", sizeof(crash_xen_info_t));
+
+    return bytes;
+}
+
+/* Allocate a crash note buffer for a newly onlined cpu. */
+static int kexec_init_cpu_notes(const unsigned long cpu)
+{
+    Elf_Note * note = NULL;
+    int ret = 0;
+    int nr_bytes = 0;
+
+    BUG_ON( cpu >= nr_cpu_ids || ! crash_notes );
+
+    /* If already allocated, nothing to do. */
+    if ( crash_notes[cpu].start )
+        return ret;
+
+    nr_bytes = sizeof_cpu_notes(cpu);
+    note = xmalloc_bytes(nr_bytes);
+
+    /* Protect the write into crash_notes[] with a spinlock, as this function
+     * is on a hotplug path and a hypercall path. */
+    spin_lock(&crash_notes_lock);
+
+    /* If we are racing with another CPU and it has beaten us, give up
+     * gracefully. */
+    if ( crash_notes[cpu].start )
+    {
+        spin_unlock(&crash_notes_lock);
+        /* Always return ok, because whether we successfully allocated or not,
+         * another CPU has successfully allocated. */
+        if ( note )
+            xfree(note);
+    }
+    else
+    {
+        crash_notes[cpu].start = note;
+        crash_notes[cpu].size = nr_bytes;
+        spin_unlock(&crash_notes_lock);
+
+        /* If the allocation failed, and another CPU did not beat us, give
+         * up with ENOMEM. */
+        if ( ! note )
+            ret = -ENOMEM;
+        /* else all is good so lets set up the notes. */
+        else
+        {
+            /* Set up CORE note. */
+            setup_note(note, "CORE", NT_PRSTATUS, sizeof(ELF_Prstatus));
+            note = ELFNOTE_NEXT(note);
+
+            /* Set up Xen CORE note. */
+            setup_note(note, "Xen", XEN_ELFNOTE_CRASH_REGS,
+                       sizeof(crash_xen_core_t));
+
+            if ( ! cpu )
+            {
+                /* Set up Xen Crash Info note. */
+                xen_crash_note = note = ELFNOTE_NEXT(note);
+                setup_note(note, "Xen", XEN_ELFNOTE_CRASH_INFO,
+                           sizeof(crash_xen_info_t));
+            }
+        }
+    }
+
+    return ret;
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned long cpu = (unsigned long)hcpu;
+
+    /* Only hook on CPU_UP_PREPARE because once a crash_note has been reported
+     * to dom0, it must keep it around in case of a crash, as the crash kernel
+     * will be hard coded to the original physical address reported. */
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        /* Ignore return value.  If this boot time, -ENOMEM will cause all
+         * manner of problems elsewhere very soon, and if it is during runtime,
+         * then failing to allocate crash notes is not a good enough reason to
+         * fail the CPU_UP_PREPARE */
+        kexec_init_cpu_notes(cpu);
+        break;
+    default:
+        break;
+    }
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback
+};
+
+static int __init kexec_init(void)
+{
+    void *cpu = (void *)(unsigned long)smp_processor_id();
+
+    /* If no crash area, no need to allocate space for notes. */
+    if ( !kexec_crash_area.size )
+        return 0;
+
+    register_keyhandler('C', &crashdump_trigger_keyhandler);
+
+    crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
+    if ( ! crash_notes )
+        return -ENOMEM;
+
+    memset(crash_notes, 0, sizeof(crash_note_range_t) * nr_cpu_ids);
+
+    cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+/* The reason for this to be a presmp_initcall as opposed to a regular
+ * __initcall is to allow the setup of the cpu hotplug handler before APs are
+ * brought up. */
+presmp_initcall(kexec_init);
+
 static int kexec_get_reserve(xen_kexec64_range_t *range)
 {
     if ( kexec_crash_area.size > 0 && kexec_crash_area.start > 0) {
@@ -302,44 +436,29 @@ static int kexec_get_reserve(xen_kexec64
 static int kexec_get_cpu(xen_kexec64_range_t *range)
 {
     int nr = range->nr;
-    int nr_bytes = 0;
 
-    if ( nr < 0 || nr >= nr_cpu_ids || !cpu_online(nr) )
+    if ( nr < 0 || nr >= nr_cpu_ids )
+        return -ERANGE;
+
+    if ( ! crash_notes )
         return -EINVAL;
 
-    nr_bytes += sizeof_note("CORE", sizeof(ELF_Prstatus));
-    nr_bytes += sizeof_note("Xen", sizeof(crash_xen_core_t));
+    /* Try once again to allocate room for the crash notes.  It is just possible
+     * that more space has become available since we last tried.  If space has
+     * already been allocated, kexec_init_cpu_notes() will return early with 0.
+     */
+    kexec_init_cpu_notes(nr);
 
-    /* The Xen info note is included in CPU0's range. */
-    if ( nr == 0 )
-        nr_bytes += sizeof_note("Xen", sizeof(crash_xen_info_t));
+    /* In the case of still not having enough memory to allocate buffer room,
+     * returning a range of 0,0 is still valid. */
+    if ( crash_notes[nr].start )
+    {
+        range->start = __pa(crash_notes[nr].start);
+        range->size = crash_notes[nr].size;
+    }
+    else
+        range->start = range->size = 0;
 
-    if ( per_cpu(crash_notes, nr) == NULL )
-    {
-        Elf_Note *note;
-
-        note = per_cpu(crash_notes, nr) = xmalloc_bytes(nr_bytes);
-
-        if ( note == NULL )
-            return -ENOMEM;
-
-        /* Setup CORE note. */
-        setup_note(note, "CORE", NT_PRSTATUS, sizeof(ELF_Prstatus));
-
-        /* Setup Xen CORE note. */
-        note = ELFNOTE_NEXT(note);
-        setup_note(note, "Xen", XEN_ELFNOTE_CRASH_REGS, sizeof(crash_xen_core_t));
-
-        if (nr == 0)
-        {
-            /* Setup system wide Xen info note. */
-            xen_crash_note = note = ELFNOTE_NEXT(note);
-            setup_note(note, "Xen", XEN_ELFNOTE_CRASH_INFO, sizeof(crash_xen_info_t));
-        }
-    }
-
-    range->start = __pa((unsigned long)per_cpu(crash_notes, nr));
-    range->size = nr_bytes;
     return 0;
 }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-22 17:36 [PATCH 0 of 3] KEXEC fixes Andrew Cooper
  2011-12-22 17:36 ` [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2 Andrew Cooper
  2011-12-22 17:36 ` [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6 Andrew Cooper
@ 2011-12-22 17:36 ` Andrew Cooper
  2011-12-22 18:09   ` David Vrabel
  2011-12-23  9:06   ` Jan Beulich
  2 siblings, 2 replies; 18+ messages in thread
From: Andrew Cooper @ 2011-12-22 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
as the CPU crash notes will go into the xenheap, which tends to be in
upper memory.  This causes problems on machines with more than 64GB
(or 4GB if no PAE support) of ram as the crashkernel physically cant
access the crash notes.

The solution is to force Xen to allocate certain structures in lower
memory.  This is achieved by introducing two new command line
parameters; low_crashinfo and crashinfo_maxaddr.  Because of the
potential impact on 32bit PV guests, and that this problem does not
exist for 64bit dom0 on 64bit Xen, this new functionality defaults to
the codebase's previous behavior, requiring the user to explicitly
add extra command line parameters to change the behavior.

This patch consists of 3 logically distinct but closely related
changes.
 1) Add the two new command line parameters.
 2) Change crash note allocation to use lower memory when instructed.
 3) Change the conring buffer to use lower memory when instructed.

There result is that the crash notes and console ring will be placed
in lower memory so useful information can be recovered in the case of
a crash.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r 3da37c68284f -r 23c31d59ffb1 xen/arch/x86/setup.c
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -586,6 +586,10 @@ void __init __start_xen(unsigned long mb
     }
     cmdline_parse(cmdline);
 
+    /* Must be after command line argument parsing and before 
+     * allocing any xenheap structures wanted in lower memory. */
+    kexec_early_calculations();
+
     parse_video_info();
 
     set_current((struct vcpu *)0xfffff000); /* debug sanity */
diff -r 3da37c68284f -r 23c31d59ffb1 xen/common/kexec.c
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -36,7 +36,9 @@ bool_t kexecing = FALSE;
 typedef struct { Elf_Note * start; size_t size; } crash_note_range_t;
 static crash_note_range_t * crash_notes;
 
-/* Lock to prevent race conditions when allocating the crash note buffers. */
+/* Lock to prevent race conditions when allocating the crash note buffers.
+ * It also serves to protect calls to alloc_from_crash_heap when allocating
+ * crash note buffers in lower memory. */
 static DEFINE_SPINLOCK(crash_notes_lock);
 
 static Elf_Note *xen_crash_note;
@@ -70,6 +72,23 @@ static struct {
     (_d_)->start = (_s_)->start; \
 } while (0)
 
+/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
+ * defaults without needing to know the state of the others. */
+enum low_crashinfo low_crashinfo_mode = LOW_CRASHINFO_INVALID;
+
+/* This value is only considered if low_crash_mode is set to MIN or ALL, so
+ * setting a default here is safe. Default to 4GB.  This is because the current
+ * KEXEC_CMD_get_range compat hypercall trucates 64bit pointers to 32 bits. The
+ * typical usecase for crashinfo_maxaddr will be for 64bit Xen with 32bit dom0 
+ * and 32bit crash kernel. */
+static paddr_t crashinfo_maxaddr = 4ULL << 30;
+
+/* = log base 2 of crashinfo_maxaddr after checking for sanity. */
+paddr_t crashinfo_maxaddr_bits = 0;
+
+/* Pointers to keep track of the crash heap region. */
+static void *crash_heap_current = NULL, *crash_heap_end = NULL;
+
 /*
  * Parse command lines in the format
  *
@@ -148,6 +167,58 @@ static void __init parse_crashkernel(con
 }
 custom_param("crashkernel", parse_crashkernel);
 
+/* Parse command lines in the format:
+ *
+ *   low_crashinfo=[none,min,all]
+ * 
+ * - none disables the low allocation of crash info.
+ * - min will allocate enough low information for the crash kernel to be able
+ *       to extract the hypervisor and dom0 message ring buffers.
+ * - all will allocate additional structures such as domain and vcpu structs
+ *       low so the crash kernel can perform an extended analysis of state.
+ */
+static void __init parse_low_crashinfo(const char * str)
+{
+
+    if ( !strlen(str) )
+        /* default to min if user just specifies "low_crashinfo" */
+        low_crashinfo_mode = LOW_CRASHINFO_MIN;
+    else if ( !strcmp(str, "none" ) )
+        low_crashinfo_mode = LOW_CRASHINFO_NONE;
+    else if ( !strcmp(str, "min" ) )
+        low_crashinfo_mode = LOW_CRASHINFO_MIN;
+    else if ( !strcmp(str, "all" ) )
+        low_crashinfo_mode = LOW_CRASHINFO_ALL;
+    else
+    {
+        printk("Unknown low_crashinfo parameter '%s'.  Defaulting to min.\n", str);
+        low_crashinfo_mode = LOW_CRASHINFO_MIN;
+    }
+}
+custom_param("low_crashinfo", parse_low_crashinfo);
+
+/* Parse command lines in the format:
+ * 
+ *   crashinfo_maxaddr=<addr>
+ *
+ * <addr> will be rounded down to the nearest power of two.  Defaults to 64G
+ */
+static void __init parse_crashinfo_maxaddr(const char * str)
+{
+    u64 addr;
+
+    /* if low_crashinfo_mode is unset, default to min. */
+    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
+        low_crashinfo_mode = LOW_CRASHINFO_MIN;
+
+    if ( (addr = parse_size_and_unit(str, NULL)) )
+        crashinfo_maxaddr = addr;
+    else
+        printk("Unable to parse crashinfo_maxaddr. Defaulting to %p\n",
+               (void*)crashinfo_maxaddr);
+}
+custom_param("crashinfo_maxaddr", parse_crashinfo_maxaddr);
+
 void __init set_kexec_crash_area_size(u64 system_ram)
 {
     unsigned int idx;
@@ -306,6 +377,20 @@ static size_t sizeof_cpu_notes(const uns
     return bytes;
 }
 
+/* Allocate size_t bytes of space from the previously allocated
+ * crash heap if the user has requested that crash notes be allocated
+ * in lower memory.  There is currently no case where the crash notes
+ * should be free()'d. */
+static void * alloc_from_crash_heap(const size_t bytes)
+{
+    void * ret;
+    if ( crash_heap_current + bytes > crash_heap_end )
+        return NULL;
+    ret = (void*)crash_heap_current;
+    crash_heap_current += bytes;
+    return ret;
+}
+
 /* Allocate a crash note buffer for a newly onlined cpu. */
 static int kexec_init_cpu_notes(const unsigned long cpu)
 {
@@ -320,7 +405,10 @@ static int kexec_init_cpu_notes(const un
         return ret;
 
     nr_bytes = sizeof_cpu_notes(cpu);
-    note = xmalloc_bytes(nr_bytes);
+
+    /* If we dont care about the position of allocation, malloc. */
+    if ( low_crashinfo_mode == LOW_CRASHINFO_NONE )
+        note = xmalloc_bytes(nr_bytes);
 
     /* Protect the write into crash_notes[] with a spinlock, as this function
      * is on a hotplug path and a hypercall path. */
@@ -338,6 +426,11 @@ static int kexec_init_cpu_notes(const un
     }
     else
     {
+        /* If we care about memory possition, alloc from the crash heap,
+         * also protected by the crash_notes_lock. */
+        if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
+            note = alloc_from_crash_heap(nr_bytes);
+
         crash_notes[cpu].start = note;
         crash_notes[cpu].size = nr_bytes;
         spin_unlock(&crash_notes_lock);
@@ -397,6 +490,24 @@ static struct notifier_block cpu_nfb = {
     .notifier_call = cpu_callback
 };
 
+void __init kexec_early_calculations(void)
+{
+    /* If low_crashinfo_mode is still INVALID, neither "low_crashinfo" nor
+     * "crashinfo_maxaddr" have been specified on the command line, so
+     * explicitly set to NONE. */
+    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
+        low_crashinfo_mode = LOW_CRASHINFO_NONE;
+
+    crashinfo_maxaddr_bits = 0;
+    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
+    {
+        paddr_t tmp = crashinfo_maxaddr;
+
+        while ( tmp >>= 1 )
+            crashinfo_maxaddr_bits++;
+    }
+}
+
 static int __init kexec_init(void)
 {
     void *cpu = (void *)(unsigned long)smp_processor_id();
@@ -407,6 +518,29 @@ static int __init kexec_init(void)
 
     register_keyhandler('C', &crashdump_trigger_keyhandler);
 
+    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
+    {
+        size_t crash_heap_size;
+
+        /* This calculation is safe even if the machine is booted in 
+         * uniprocessor mode. */
+        crash_heap_size = sizeof_cpu_notes(0) +
+            sizeof_cpu_notes(1) * (nr_cpu_ids - 1);
+        crash_heap_size = PAGE_ALIGN(crash_heap_size);
+
+        crash_heap_current = alloc_xenheap_pages(
+            get_order_from_bytes(crash_heap_size),
+            MEMF_bits(crashinfo_maxaddr_bits) );
+
+        if ( crash_heap_current )
+            crash_heap_end = crash_heap_current + crash_heap_size;
+        else
+            return -ENOMEM;
+    }
+
+    /* crash_notes may be allocated anywhere Xen can reach in memory.
+       Only the individual CPU crash notes themselves must be allocated
+       in lower memory if requested. */
     crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
     if ( ! crash_notes )
         return -ENOMEM;
diff -r 3da37c68284f -r 23c31d59ffb1 xen/drivers/char/console.c
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -584,6 +584,7 @@ void __init console_init_postirq(void)
 {
     char *ring;
     unsigned int i, order;
+    u64 memflags;
 
     serial_init_postirq();
 
@@ -591,7 +592,8 @@ void __init console_init_postirq(void)
         opt_conring_size = num_present_cpus() << (9 + xenlog_lower_thresh);
 
     order = get_order_from_bytes(max(opt_conring_size, conring_size));
-    while ( (ring = alloc_xenheap_pages(order, 0)) == NULL )
+    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? MEMF_bits(crashinfo_maxaddr_bits) : 0;
+    while ( (ring = alloc_xenheap_pages(order, memflags)) == NULL )
     {
         BUG_ON(order == 0);
         order--;
diff -r 3da37c68284f -r 23c31d59ffb1 xen/include/xen/kexec.h
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -25,6 +25,19 @@ void set_kexec_crash_area_size(u64 syste
 #define KEXEC_IMAGE_CRASH_BASE   2
 #define KEXEC_IMAGE_NR           4
 
+enum low_crashinfo {
+    LOW_CRASHINFO_INVALID = 0,
+    LOW_CRASHINFO_NONE = 1,
+    LOW_CRASHINFO_MIN = 2,
+    LOW_CRASHINFO_ALL = 3
+};
+
+/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
+ * defaults without needing to know the state of the others. */
+extern enum low_crashinfo low_crashinfo_mode;
+extern paddr_t crashinfo_maxaddr_bits;
+void kexec_early_calculations(void);
+
 int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
 void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-22 17:36 ` [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory Andrew Cooper
@ 2011-12-22 18:09   ` David Vrabel
  2011-12-22 18:44     ` Andrew Cooper
  2011-12-23  9:06   ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: David Vrabel @ 2011-12-22 18:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On 22/12/11 17:36, Andrew Cooper wrote:
> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
> as the CPU crash notes will go into the xenheap, which tends to be in
> upper memory.  This causes problems on machines with more than 64GB
> (or 4GB if no PAE support) of ram as the crashkernel physically cant
> access the crash notes.

I've never been entirely convinced that this was the correct approach.
It seems that using a 64-bit crash kernel would be an easier solution.

> @@ -320,7 +405,10 @@ static int kexec_init_cpu_notes(const un
>          return ret;
>  
>      nr_bytes = sizeof_cpu_notes(cpu);
> -    note = xmalloc_bytes(nr_bytes);
> +
> +    /* If we dont care about the position of allocation, malloc. */
> +    if ( low_crashinfo_mode == LOW_CRASHINFO_NONE )
> +        note = xmalloc_bytes(nr_bytes);

Suggest refactoring this (and the other similar places) so that the
check for the regular/crash head occurs in one wrapper alloc function.

>      /* Protect the write into crash_notes[] with a spinlock, as this function
>       * is on a hotplug path and a hypercall path. */
> @@ -338,6 +426,11 @@ static int kexec_init_cpu_notes(const un
>      }
>      else
>      {
> +        /* If we care about memory possition, alloc from the crash heap,
> +         * also protected by the crash_notes_lock. */
> +        if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +            note = alloc_from_crash_heap(nr_bytes);
> +
>          crash_notes[cpu].start = note;
>          crash_notes[cpu].size = nr_bytes;
>          spin_unlock(&crash_notes_lock);
> @@ -397,6 +490,24 @@ static struct notifier_block cpu_nfb = {
>      .notifier_call = cpu_callback
>  };
>  
> +void __init kexec_early_calculations(void)
> +{
> +    /* If low_crashinfo_mode is still INVALID, neither "low_crashinfo" nor
> +     * "crashinfo_maxaddr" have been specified on the command line, so
> +     * explicitly set to NONE. */
> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;

Why not initialize low_crash_info_mode to NONE?

> +
> +    crashinfo_maxaddr_bits = 0;
> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +    {
> +        paddr_t tmp = crashinfo_maxaddr;
> +
> +        while ( tmp >>= 1 )
> +            crashinfo_maxaddr_bits++;
> +    }
> +}

Do these during the parsing of the command line?

These will allow you to remove this unhelpfully named function.

> +
>  static int __init kexec_init(void)
>  {
>      void *cpu = (void *)(unsigned long)smp_processor_id();
> @@ -407,6 +518,29 @@ static int __init kexec_init(void)
>  
>      register_keyhandler('C', &crashdump_trigger_keyhandler);
>  
> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +    {
> +        size_t crash_heap_size;
> +
> +        /* This calculation is safe even if the machine is booted in 
> +         * uniprocessor mode. */
> +        crash_heap_size = sizeof_cpu_notes(0) +
> +            sizeof_cpu_notes(1) * (nr_cpu_ids - 1);
> +        crash_heap_size = PAGE_ALIGN(crash_heap_size);
> +
> +        crash_heap_current = alloc_xenheap_pages(
> +            get_order_from_bytes(crash_heap_size),
> +            MEMF_bits(crashinfo_maxaddr_bits) );
> +
> +        if ( crash_heap_current )
> +            crash_heap_end = crash_heap_current + crash_heap_size;
> +        else
> +            return -ENOMEM;
> +    }

Suggest moving this into a crash_heap_setup() function.

> +
> +    /* crash_notes may be allocated anywhere Xen can reach in memory.
> +       Only the individual CPU crash notes themselves must be allocated
> +       in lower memory if requested. */
>      crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
>      if ( ! crash_notes )
>          return -ENOMEM;
> diff -r 3da37c68284f -r 23c31d59ffb1 xen/drivers/char/console.c
> --- a/xen/drivers/char/console.c
> +++ b/xen/drivers/char/console.c
> @@ -584,6 +584,7 @@ void __init console_init_postirq(void)
>  {
>      char *ring;
>      unsigned int i, order;
> +    u64 memflags;
>  
>      serial_init_postirq();
>  
> @@ -591,7 +592,8 @@ void __init console_init_postirq(void)
>          opt_conring_size = num_present_cpus() << (9 + xenlog_lower_thresh);
>  
>      order = get_order_from_bytes(max(opt_conring_size, conring_size));
> -    while ( (ring = alloc_xenheap_pages(order, 0)) == NULL )
> +    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? MEMF_bits(crashinfo_maxaddr_bits) : 0;

If you set crashinfo_maxaddr_bits to 64 if low_crashinfo_mode isn't used
you wouldn't need this test.

David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-22 18:09   ` David Vrabel
@ 2011-12-22 18:44     ` Andrew Cooper
  2011-12-23  9:09       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2011-12-22 18:44 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Tim Deegan

On 22/12/11 18:09, David Vrabel wrote:
> On 22/12/11 17:36, Andrew Cooper wrote:
>> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
>> as the CPU crash notes will go into the xenheap, which tends to be in
>> upper memory.  This causes problems on machines with more than 64GB
>> (or 4GB if no PAE support) of ram as the crashkernel physically cant
>> access the crash notes.
> I've never been entirely convinced that this was the correct approach.
> It seems that using a 64-bit crash kernel would be an easier solution.

In an ideal world yes, but reality is that XS will stay with a 32bit
kernel for a while yet.

>> @@ -320,7 +405,10 @@ static int kexec_init_cpu_notes(const un
>>          return ret;
>>  
>>      nr_bytes = sizeof_cpu_notes(cpu);
>> -    note = xmalloc_bytes(nr_bytes);
>> +
>> +    /* If we dont care about the position of allocation, malloc. */
>> +    if ( low_crashinfo_mode == LOW_CRASHINFO_NONE )
>> +        note = xmalloc_bytes(nr_bytes);
> Suggest refactoring this (and the other similar places) so that the
> check for the regular/crash head occurs in one wrapper alloc function.

Can't refactor this as its other half is specifically inside the lock
while this is outside.

>>      /* Protect the write into crash_notes[] with a spinlock, as this function
>>       * is on a hotplug path and a hypercall path. */
>> @@ -338,6 +426,11 @@ static int kexec_init_cpu_notes(const un
>>      }
>>      else
>>      {
>> +        /* If we care about memory possition, alloc from the crash heap,
>> +         * also protected by the crash_notes_lock. */
>> +        if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +            note = alloc_from_crash_heap(nr_bytes);
>> +
>>          crash_notes[cpu].start = note;
>>          crash_notes[cpu].size = nr_bytes;
>>          spin_unlock(&crash_notes_lock);
>> @@ -397,6 +490,24 @@ static struct notifier_block cpu_nfb = {
>>      .notifier_call = cpu_callback
>>  };
>>  
>> +void __init kexec_early_calculations(void)
>> +{
>> +    /* If low_crashinfo_mode is still INVALID, neither "low_crashinfo" nor
>> +     * "crashinfo_maxaddr" have been specified on the command line, so
>> +     * explicitly set to NONE. */
>> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
>> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;
> Why not initialize low_crash_info_mode to NONE?

So there is not a dependency between the order of low_crashinfo and
crashinfo_maxaddr, while still allowing each option to set a sensible
default for the other if only a single one is specified on the command line.

>> +
>> +    crashinfo_maxaddr_bits = 0;
>> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +    {
>> +        paddr_t tmp = crashinfo_maxaddr;
>> +
>> +        while ( tmp >>= 1 )
>> +            crashinfo_maxaddr_bits++;
>> +    }
>> +}
> Do these during the parsing of the command line?

Argument along the same lines as above w.r.t two command line
arguments.  crashinfo_maxaddr_bits is always used.  Therefore it must be
0 in the case where no limits are applied, meaning that setting it to a
default of lg(4GB) will break things.  Setting it in the parsing
function would prevent a sensible default being set if the user only
specified low_crashinfo without making a combinational explosion of
logic in the parsing function.

> These will allow you to remove this unhelpfully named function.
>
>> +
>>  static int __init kexec_init(void)
>>  {
>>      void *cpu = (void *)(unsigned long)smp_processor_id();
>> @@ -407,6 +518,29 @@ static int __init kexec_init(void)
>>  
>>      register_keyhandler('C', &crashdump_trigger_keyhandler);
>>  
>> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +    {
>> +        size_t crash_heap_size;
>> +
>> +        /* This calculation is safe even if the machine is booted in 
>> +         * uniprocessor mode. */
>> +        crash_heap_size = sizeof_cpu_notes(0) +
>> +            sizeof_cpu_notes(1) * (nr_cpu_ids - 1);
>> +        crash_heap_size = PAGE_ALIGN(crash_heap_size);
>> +
>> +        crash_heap_current = alloc_xenheap_pages(
>> +            get_order_from_bytes(crash_heap_size),
>> +            MEMF_bits(crashinfo_maxaddr_bits) );
>> +
>> +        if ( crash_heap_current )
>> +            crash_heap_end = crash_heap_current + crash_heap_size;
>> +        else
>> +            return -ENOMEM;
>> +    }
> Suggest moving this into a crash_heap_setup() function.

Questionable.  I would argue that it is trading an extra indirection in
the source code for a gain of only a few lines, which will be inlined
back to here by the compiler.

>> +
>> +    /* crash_notes may be allocated anywhere Xen can reach in memory.
>> +       Only the individual CPU crash notes themselves must be allocated
>> +       in lower memory if requested. */
>>      crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
>>      if ( ! crash_notes )
>>          return -ENOMEM;
>> diff -r 3da37c68284f -r 23c31d59ffb1 xen/drivers/char/console.c
>> --- a/xen/drivers/char/console.c
>> +++ b/xen/drivers/char/console.c
>> @@ -584,6 +584,7 @@ void __init console_init_postirq(void)
>>  {
>>      char *ring;
>>      unsigned int i, order;
>> +    u64 memflags;
>>  
>>      serial_init_postirq();
>>  
>> @@ -591,7 +592,8 @@ void __init console_init_postirq(void)
>>          opt_conring_size = num_present_cpus() << (9 + xenlog_lower_thresh);
>>  
>>      order = get_order_from_bytes(max(opt_conring_size, conring_size));
>> -    while ( (ring = alloc_xenheap_pages(order, 0)) == NULL )
>> +    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? MEMF_bits(crashinfo_maxaddr_bits) : 0;
> If you set crashinfo_maxaddr_bits to 64 if low_crashinfo_mode isn't used
> you wouldn't need this test.

I am not familiar enough with the intricacies of alloc_xenheap_pages to
know whether that is safe.  cc'ing Tim for his expertise.

> David

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2
  2011-12-22 17:36 ` [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2 Andrew Cooper
@ 2011-12-23  8:34   ` Jan Beulich
  2011-12-23 11:17     ` Andrew Cooper
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2011-12-23  8:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 22.12.11 at 18:36, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Currently, KEXEC_CMD_kexec_get_range in compat mode, or with 32bit
> Xen, will truncate 64bit pointers to 32bits, leading to problems on
> machines with more than 4GiB of RAM.  As 32bit dom0 kernels tend to be
> PAE capable, this is especially wasteful, as most structures currently
> limited to <4GiB could easily be <64GiB instead.
> 
> Therefore, introduce a new hypercall 'KEXEC_CMD_kexec64_get_range'
> which passes similar information as KEXEC_CMD_kexec_get_range, but
> which uses a structure with explicitly specified field widths, causing
> it to be identical in the compat and regular case.  This new
> structure, xen_kexec64_range_t, will be the same as xen_kexec_range_t
> if Xen is compiled for x86_64, but will still use 64bit integers if
> Xen is compiled for x86_32.
> 
> To fix 32bit Xen which uses 32bit integers for addresses and sizes,
> change the internals to use xen_kexec64_range_t which will use 64bit
> integers instead.  This also involves changing several casts to
> explicitly use uint64_ts rather than unsigned longs.

Just for the record - I continue to be opposed to doing this for the
32-bit hypervisor. All relevant allocations are below 4G there, so
there's simply no need to decrease code readability for no benefit.

Jan

> In addition, the hypercall entry points need updating to be able to
> cater for all possibilities.
> 
> |Xen/dom0 bits|          Bit width of addresses in structure for        |
> +------+------+---------------------------+-----------------------------+
> | Xen  | dom0 | KEXEC_CMD_kexec_get_range | KEXEC_CMD_kexec64_get_range |
> +------+------+---------------------------+-----------------------------+
> |  32  |  32  |             32            |              64             |
> |  64  |  32  |             32            |              64             |
> |  64  |  64  |             64            |              64             |
> +------+------+---------------------------+-----------------------------+
> 
> This has been solved by splitting do_kexec_op_internal back into
> do_kexec_op{,_compat} and changing kexec_get_range{,_compat} to
> kexec_get_range{32,64} which now exist irrespective of CONFIG_COMPAT
> or not.
> 
> The knock-on effect of removing do_kexec_op_internal means that
> kexec_load_unload_compat is only ever called from inside an #ifdef
> CONFIG_COMPAT codepath, which does not exist on Xen x86_32.
> Therefore, move the #ifdef CONFIG_COMPAT guards to include the entire
> function.
> 
> Finally, and a little unrelatedly, fix KEXEC_CMD_kexec_{load,unload}
> to return -EBUSY instead of EOK if a kexec is in progress.
> 
> Changes since v1:
>  *  Fix check for pointer truncation to work when Xen is compiled for
>     32 bit mode as well.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> diff -r e56500f95b6a -r f571dc8e4368 xen/arch/ia64/xen/machine_kexec.c
> --- a/xen/arch/ia64/xen/machine_kexec.c
> +++ b/xen/arch/ia64/xen/machine_kexec.c
> @@ -102,10 +102,10 @@ void machine_reboot_kexec(xen_kexec_imag
>  	machine_kexec(image);
>  }
>  
> -int machine_kexec_get_xen(xen_kexec_range_t *range)
> +int machine_kexec_get_xen(xen_kexec64_range_t *range)
>  {
>  	range->start = ia64_tpa(_text);
> -	range->size = (unsigned long)_end - (unsigned long)_text;
> +	range->size = (uint64_t)_end - (uint64_t)_text;
>  	return 0;
>  }
>  
> @@ -113,31 +113,31 @@ int machine_kexec_get_xen(xen_kexec_rang
>  #define ELF_PAGE_SIZE  (__IA64_UL_CONST(1) << ELF_PAGE_SHIFT)
>  #define ELF_PAGE_MASK  (~(ELF_PAGE_SIZE - 1))
>  
> -static int machine_kexec_get_xenheap(xen_kexec_range_t *range)
> +static int machine_kexec_get_xenheap(xen_kexec64_range_t *range)
>  {
>  	range->start = (ia64_tpa(_end) + (ELF_PAGE_SIZE - 1)) & ELF_PAGE_MASK;
>  	range->size =
> -		(((unsigned long)range->start + KERNEL_TR_PAGE_SIZE) &
> +		(((uint64_t)range->start + KERNEL_TR_PAGE_SIZE) &
>           ~(KERNEL_TR_PAGE_SIZE - 1))
> -		- (unsigned long)range->start;
> +		- (uint64_t)range->start;
>  	return 0;
>  }
>  
> -static int machine_kexec_get_boot_param(xen_kexec_range_t *range)
> +static int machine_kexec_get_boot_param(xen_kexec64_range_t *range)
>  {
>  	range->start = __pa(ia64_boot_param);
>  	range->size = sizeof(*ia64_boot_param);
>  	return 0;
>  }
>  
> -static int machine_kexec_get_efi_memmap(xen_kexec_range_t *range)
> +static int machine_kexec_get_efi_memmap(xen_kexec64_range_t *range)
>  {
>  	range->start = ia64_boot_param->efi_memmap;
>  	range->size = ia64_boot_param->efi_memmap_size;
>  	return 0;
>  }
>  
> -int machine_kexec_get(xen_kexec_range_t *range)
> +int machine_kexec_get(xen_kexec64_range_t *range)
>  {
>  	switch (range->range) {
>  	case KEXEC_RANGE_MA_XEN:
> diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -120,7 +120,7 @@ void machine_kexec(xen_kexec_image_t *im
>      }
>  }
>  
> -int machine_kexec_get(xen_kexec_range_t *range)
> +int machine_kexec_get(xen_kexec64_range_t *range)
>  {
>  	if (range->range != KEXEC_RANGE_MA_XEN)
>  		return -EINVAL;
> diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/x86_32/machine_kexec.c
> --- a/xen/arch/x86/x86_32/machine_kexec.c
> +++ b/xen/arch/x86/x86_32/machine_kexec.c
> @@ -11,11 +11,11 @@
>  #include <asm/page.h>
>  #include <public/kexec.h>
>  
> -int machine_kexec_get_xen(xen_kexec_range_t *range)
> +int machine_kexec_get_xen(xen_kexec64_range_t *range)
>  {
>          range->start = virt_to_maddr(_start);
> -        range->size = (unsigned long)xenheap_phys_end -
> -                      (unsigned long)range->start;
> +        range->size = (uint64_t)xenheap_phys_end -
> +                      (uint64_t)range->start;
>          return 0;
>  }
>  
> diff -r e56500f95b6a -r f571dc8e4368 xen/arch/x86/x86_64/machine_kexec.c
> --- a/xen/arch/x86/x86_64/machine_kexec.c
> +++ b/xen/arch/x86/x86_64/machine_kexec.c
> @@ -11,10 +11,10 @@
>  #include <asm/page.h>
>  #include <public/kexec.h>
>  
> -int machine_kexec_get_xen(xen_kexec_range_t *range)
> +int machine_kexec_get_xen(xen_kexec64_range_t *range)
>  {
>          range->start = virt_to_maddr(_start);
> -        range->size = virt_to_maddr(_end) - (unsigned long)range->start;
> +        range->size = virt_to_maddr(_end) - (uint64_t)range->start;
>          return 0;
>  }
>  
> diff -r e56500f95b6a -r f571dc8e4368 xen/common/kexec.c
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -56,6 +56,14 @@ static struct {
>      unsigned long size;
>  } ranges[16] __initdata;
>  
> +/* This XLAT macro is required now even without CONFIG_COMPAT. */
> +#define TRANSLATE_kexec_range(_d_, _s_) do { \
> +    (_d_)->range = (_s_)->range; \
> +    (_d_)->nr = (_s_)->nr; \
> +    (_d_)->size = (_s_)->size; \
> +    (_d_)->start = (_s_)->start; \
> +} while (0)
> +
>  /*
>   * Parse command lines in the format
>   *
> @@ -280,7 +288,7 @@ static int sizeof_note(const char *name,
>              ELFNOTE_ALIGN(descsz));
>  }
>  
> -static int kexec_get_reserve(xen_kexec_range_t *range)
> +static int kexec_get_reserve(xen_kexec64_range_t *range)
>  {
>      if ( kexec_crash_area.size > 0 && kexec_crash_area.start > 0) {
>          range->start = kexec_crash_area.start;
> @@ -291,7 +299,7 @@ static int kexec_get_reserve(xen_kexec_r
>      return 0;
>  }
>  
> -static int kexec_get_cpu(xen_kexec_range_t *range)
> +static int kexec_get_cpu(xen_kexec64_range_t *range)
>  {
>      int nr = range->nr;
>      int nr_bytes = 0;
> @@ -335,14 +343,14 @@ static int kexec_get_cpu(xen_kexec_range
>      return 0;
>  }
>  
> -static int kexec_get_vmcoreinfo(xen_kexec_range_t *range)
> +static int kexec_get_vmcoreinfo(xen_kexec64_range_t *range)
>  {
>      range->start = __pa((unsigned long)vmcoreinfo_data);
>      range->size = VMCOREINFO_BYTES;
>      return 0;
>  }
>  
> -static int kexec_get_range_internal(xen_kexec_range_t *range)
> +static int kexec_get_range_internal(xen_kexec64_range_t *range)
>  {
>      int ret = -EINVAL;
>  
> @@ -365,9 +373,14 @@ static int kexec_get_range_internal(xen_
>      return ret;
>  }
>  
> -static int kexec_get_range(XEN_GUEST_HANDLE(void) uarg)
> +/* This function can be invoked from either a KEXEC_CMD_kexec64_get_range
> + * hypercall, or from a KEXEC_CMD_kexec_get_range hypercall with 64bit dom0
> + * on 64bit Xen.  In the first case, the guest structure is a 
> + * xen_kexec64_range_t, and in the second case, the xen_kexec_range_t guest
> + * structure is identical to xen_kexec64_range_t. */
> +static int kexec_get_range64(XEN_GUEST_HANDLE(void) uarg)
>  {
> -    xen_kexec_range_t range;
> +    xen_kexec64_range_t range;
>      int ret = -EINVAL;
>  
>      if ( unlikely(copy_from_guest(&range, uarg, 1)) )
> @@ -381,34 +394,40 @@ static int kexec_get_range(XEN_GUEST_HAN
>      return ret;
>  }
>  
> -static int kexec_get_range_compat(XEN_GUEST_HANDLE(void) uarg)
> +/* This function can be invoked from either a KEXEC_CMD_kexec_get_range
> + * compat hypercall for 32bit dom0 on 64bit Xen, or from the same hypercall
> + * on 32bit Xen.  In both cases, the guest argument uses 32bit integers, so 
> 
> + * translate them to 64bit for use by kexec_get_range_internal.  The
> + * preprocessor guards are to choose the correct 32bit structure, as the
> + * compat_* structures dont exist in 32bit Xen. */
> +static int kexec_get_range32(XEN_GUEST_HANDLE(void) uarg)
>  {
> +    xen_kexec64_range_t range64;
>  #ifdef CONFIG_COMPAT
> -    xen_kexec_range_t range;
> -    compat_kexec_range_t compat_range;
> +    compat_kexec_range_t range32;
> +#else
> +    xen_kexec_range_t range32;
> +#endif
>      int ret = -EINVAL;
>  
> -    if ( unlikely(copy_from_guest(&compat_range, uarg, 1)) )
> +    if ( unlikely(copy_from_guest(&range32, uarg, 1)) )
>          return -EFAULT;
>  
> -    XLAT_kexec_range(&range, &compat_range);
> +    TRANSLATE_kexec_range(&range64, &range32);
>  
> -    ret = kexec_get_range_internal(&range);
> +    ret = kexec_get_range_internal(&range64);
>  
>      /* Dont silently truncate physical addresses or sizes. */
> -    if ( (range.start | range.size) & ~(unsigned long)(~0u) )
> +    if ( (range64.start | range64.size) & 0xffffffff00000000ULL )
>          return -ERANGE;
>  
>      if ( ret == 0 ) {
> -        XLAT_kexec_range(&compat_range, &range);
> -        if ( unlikely(copy_to_guest(uarg, &compat_range, 1)) )
> +        TRANSLATE_kexec_range(&range32, &range64);
> +        if ( unlikely(copy_to_guest(uarg, &range32, 1)) )
>               return -EFAULT;
>      }
>  
>      return ret;
> -#else /* CONFIG_COMPAT */
> -    return 0;
> -#endif /* CONFIG_COMPAT */
>  }
>  
>  static int kexec_load_get_bits(int type, int *base, int *bit)
> @@ -543,10 +562,10 @@ static int kexec_load_unload(unsigned lo
>      return kexec_load_unload_internal(op, &load);
>  }
>  
> +#ifdef CONFIG_COMPAT
>  static int kexec_load_unload_compat(unsigned long op,
>                                      XEN_GUEST_HANDLE(void) uarg)
>  {
> -#ifdef CONFIG_COMPAT
>      compat_kexec_load_t compat_load;
>      xen_kexec_load_t load;
>  
> @@ -564,10 +583,8 @@ static int kexec_load_unload_compat(unsi
>      XLAT_kexec_image(&load.image, &compat_load.image);
>  
>      return kexec_load_unload_internal(op, &load);
> -#else /* CONFIG_COMPAT */
> -    return 0;
> +}
>  #endif /* CONFIG_COMPAT */
> -}
>  
>  static int kexec_exec(XEN_GUEST_HANDLE(void) uarg)
>  {
> @@ -601,8 +618,51 @@ static int kexec_exec(XEN_GUEST_HANDLE(v
>      return -EINVAL; /* never reached */
>  }
>  
> -int do_kexec_op_internal(unsigned long op, XEN_GUEST_HANDLE(void) uarg,
> -                           int compat)
> +long do_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
> +{
> +    unsigned long flags;
> +    int ret = -EINVAL;
> +
> +    if ( !IS_PRIV(current->domain) )
> +        return -EPERM;
> +
> +    ret = xsm_kexec();
> +    if ( ret )
> +        return ret;
> +
> +    switch ( op )
> +    {
> +#ifdef __i386__
> +    case KEXEC_CMD_kexec_get_range:
> +        ret = kexec_get_range32(uarg);
> +        break;
> +#else
> +    case KEXEC_CMD_kexec_get_range:
> +#endif
> +    case KEXEC_CMD_kexec64_get_range:
> +        ret = kexec_get_range64(uarg);
> +        break;
> +
> +    case KEXEC_CMD_kexec_load:
> +    case KEXEC_CMD_kexec_unload:
> +        spin_lock_irqsave(&kexec_lock, flags);
> +        if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
> +            ret = kexec_load_unload(op, uarg);
> +        else
> +            ret = -EBUSY;
> +        spin_unlock_irqrestore(&kexec_lock, flags);
> +        break;
> +
> +    case KEXEC_CMD_kexec:
> +        ret = kexec_exec(uarg);
> +        break;
> +    }
> +
> +    return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +int compat_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
>  {
>      unsigned long flags;
>      int ret = -EINVAL;
> @@ -617,23 +677,21 @@ int do_kexec_op_internal(unsigned long o
>      switch ( op )
>      {
>      case KEXEC_CMD_kexec_get_range:
> -        if (compat)
> -                ret = kexec_get_range_compat(uarg);
> -        else
> -                ret = kexec_get_range(uarg);
> +        ret = kexec_get_range32(uarg);
> +        break;
> +    case KEXEC_CMD_kexec64_get_range:
> +        ret = kexec_get_range64(uarg);
>          break;
>      case KEXEC_CMD_kexec_load:
>      case KEXEC_CMD_kexec_unload:
>          spin_lock_irqsave(&kexec_lock, flags);
>          if (!test_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags))
> -        {
> -                if (compat)
> -                        ret = kexec_load_unload_compat(op, uarg);
> -                else
> -                        ret = kexec_load_unload(op, uarg);
> -        }
> +            ret = kexec_load_unload_compat(op, uarg);
> +        else
> +            ret = -EBUSY;
>          spin_unlock_irqrestore(&kexec_lock, flags);
>          break;
> +
>      case KEXEC_CMD_kexec:
>          ret = kexec_exec(uarg);
>          break;
> @@ -641,17 +699,6 @@ int do_kexec_op_internal(unsigned long o
>  
>      return ret;
>  }
> -
> -long do_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
> -{
> -    return do_kexec_op_internal(op, uarg, 0);
> -}
> -
> -#ifdef CONFIG_COMPAT
> -int compat_kexec_op(unsigned long op, XEN_GUEST_HANDLE(void) uarg)
> -{
> -    return do_kexec_op_internal(op, uarg, 1);
> -}
>  #endif
>  
>  /*
> diff -r e56500f95b6a -r f571dc8e4368 xen/include/public/kexec.h
> --- a/xen/include/public/kexec.h
> +++ b/xen/include/public/kexec.h
> @@ -155,6 +155,15 @@ typedef struct xen_kexec_range {
>      unsigned long start;
>  } xen_kexec_range_t;
>  
> +#define KEXEC_CMD_kexec64_get_range      4
> +/* xen_kexec_range_t using explicit sizes for fields. */
> +typedef struct xen_kexec64_range {
> +    int32_t range;
> +    int32_t nr;
> +    uint64_t size;
> +    uint64_t start;
> +} xen_kexec64_range_t;
> +
>  #endif /* _XEN_PUBLIC_KEXEC_H */
>  
>  /*
> diff -r e56500f95b6a -r f571dc8e4368 xen/include/xen/kexec.h
> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -34,8 +34,8 @@ void kexec_crash(void);
>  void kexec_crash_save_cpu(void);
>  crash_xen_info_t *kexec_crash_save_info(void);
>  void machine_crash_shutdown(void);
> -int machine_kexec_get(xen_kexec_range_t *range);
> -int machine_kexec_get_xen(xen_kexec_range_t *range);
> +int machine_kexec_get(xen_kexec64_range_t *range);
> +int machine_kexec_get_xen(xen_kexec64_range_t *range);
>  
>  void compat_machine_kexec(unsigned long rnk,
>                            unsigned long indirection_page,
> diff -r e56500f95b6a -r f571dc8e4368 xen/include/xlat.lst
> --- a/xen/include/xlat.lst
> +++ b/xen/include/xlat.lst
> @@ -50,9 +50,7 @@
>  ?	grant_entry_v1			grant_table.h
>  ?       grant_entry_header              grant_table.h
>  ?	grant_entry_v2			grant_table.h
> -?	kexec_exec			kexec.h
>  !	kexec_image			kexec.h
> -!	kexec_range			kexec.h
>  !	add_to_physmap			memory.h
>  !	foreign_memory_map		memory.h
>  !	memory_exchange			memory.h
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com 
> http://lists.xensource.com/xen-devel 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6
  2011-12-22 17:36 ` [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6 Andrew Cooper
@ 2011-12-23  8:42   ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2011-12-23  8:42 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 22.12.11 at 18:36, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Currently, the buffers for crash notes are allocated per CPU when a
> KEXEC_CMD_kexec_get_range hypercall is made, referencing the CPU in
> question.
> 
> Although it certainly works in general, there are a few edge case problems:
> 
> 1) There appears to be no guarentee that dom0 will make this hypercall
>    for each pcpu on the system.  There appears to be variable
>    behaviour depending on how many cpus dom0 is compiled for, and how
>    many vcpus Xen gives to dom0.
> 
> 2) The allocation of these buffers occur at the whim of dom0.  While
>    this is typically very early on dom0 boot, but not guarenteed.
> 
> 3) It is possible (although not sensible) for a crash
>    kernel to be loaded without these (or some of these) hypercalls
>    being made. Under these circumstances, a crash would cause the
>    crash note code path will suffer a slew of null pointer deferences.
> 
> 4) If the hypercalls are made after late in the day, it is possible
>    for the hypercall to return -ENOMEM.  As code tends to be more
>    fragile once memory is enhausted, the likelyhood of us needing the
>    crash kernel is greater.

Also for the record, I continue to believe that none of these warrant
the changes made here. Better error/NULL checking may instead be
required in a few places, plus possibly some Dom0 code adjustments
(see my recent 2.6.18 kernel changes, on top of which pCPU hot add
support was implemented [which can't go into the 2.6.18 kernel as it
lacks general pCPU hot add code]).

> In addition, my forthcoming code to support 32bit kdump kernels on
> 64bit Xen on large (>64GB) boxes will require some guarentees as to
> where the crash note buffers are actually allocated in physical
> memory.  This is far easier to sort out at boot time, rather than
> after dom0 has been booted and potentially using the physical memory
> required.

This one is the only part of the rationale for these changes that I
view as half way reasonable, though with allocation in Xen going
from top down, not being able to allocate these is about the same
as e.g. Dom0 not being able to set up its swiotlb. That's what the
(by default) 128M reservation is for. So if anything, the number
there may want some tweaking for the case of very many CPUs.

Jan

> Therefore, allocate the crash note buffers at boot time.
> 
> Changes since v5:
>  *  Introduce sizeof_cpu_notes to move calculation of note size into a
>     separate location.
>  *  Tweak sizeof_note() to return size_t rather than int, as it is a
>     function based upon sizeof().
> 
> Changes since v4:
> 
>  *  Replace the current cpu crash note scheme of using void pointers
>     and hand calculating the size each time is needed, by a range
>     structure containing a pointer and a size.  This removes duplicate
>     times where the size is calculated.
>  *  Tweak kexec_get_cpu().  Don't fail if a cpu is offline because it
>     may already have crash notes, and may be up by the time a crash
>     happens.  Split the error conditions up to return ERANGE for an
>     out-of-range cpu request rather than EINVAL.  Finally, returning a
>     range of zeros is acceptable, so do this in preference to failing.
> 
> Changes since v3:
> 
>  *  Alter the spinlocks to avoid calling xmalloc/xfree while holding
>     the lock.
>  *  Tidy up the coding style used.
> 
> Changes since v2:
> 
>  *  Allocate crash_notes dynamically using nr_cpu_ids at boot time,
>     rather than statically using NR_CPUS.
>  *  Fix the incorrect use of signed integers for cpu id.
>  *  Fix collateral damage to do_crashdump_trigger() and
>     crashdump_trigger_handler caused by reordering sizeof_note() and
>     setup_note()
>  *  Tweak the issue about returing -ENOMEM from kexec_init_cpu_note().
>     No functional change.
>  *  Change kexec_get_cpu() to attempt to allocate crash note buffers
>     in case we have more free memory now than when the pcpu came up.
>  *  Now that there are two codepaths possibly allocating crash notes,
>     protect the allocation itself with a spinlock.
> 
> Changes since v1:
> 
>  *  Use cpu hotplug notifiers to handle allocating of the notes
>     buffers rather than assuming the boot state of cpus will be the
>     same as the crash state.
>  *  Move crash_notes from being per_cpu.  This is because the kdump
>     kernel elf binary put in the crash area is hard coded to physical
>     addresses which the dom0 kernel typically obtains at boot time.
>     If a cpu is offlined, its buffer should not be deallocated because
>     the kdump kernel would read junk when trying to get the crash
>     notes.  Similarly, the same problem would occur if the cpu was
>     re-onlined later and its crash notes buffer was allocated elsewhere.
>  *  Only attempt to allocate buffers if a crash area has been
>     specified.  Else, allocating crash note buffers is a waste of
>     space.  Along with this, change the test in kexec_get_cpu to
>     return -EINVAL if no buffers have been allocated.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> diff -r f571dc8e4368 -r 3da37c68284f xen/common/kexec.c
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -25,13 +25,19 @@
>  #include <xen/kexec.h>
>  #include <public/elfnote.h>
>  #include <xsm/xsm.h>
> +#include <xen/cpu.h>
>  #ifdef CONFIG_COMPAT
>  #include <compat/kexec.h>
>  #endif
>  
>  bool_t kexecing = FALSE;
>  
> -static DEFINE_PER_CPU_READ_MOSTLY(void *, crash_notes);
> +/* Memory regions to store the per cpu register state etc. on a crash. */
> +typedef struct { Elf_Note * start; size_t size; } crash_note_range_t;
> +static crash_note_range_t * crash_notes;
> +
> +/* Lock to prevent race conditions when allocating the crash note buffers. 
> */
> +static DEFINE_SPINLOCK(crash_notes_lock);
>  
>  static Elf_Note *xen_crash_note;
>  
> @@ -173,13 +179,17 @@ static void one_cpu_only(void)
>  void kexec_crash_save_cpu(void)
>  {
>      int cpu = smp_processor_id();
> -    Elf_Note *note = per_cpu(crash_notes, cpu);
> +    Elf_Note *note;
>      ELF_Prstatus *prstatus;
>      crash_xen_core_t *xencore;
>  
> +    BUG_ON ( ! crash_notes );
> +
>      if ( cpumask_test_and_set_cpu(cpu, &crash_saved_cpus) )
>          return;
>  
> +    note = crash_notes[cpu].start;
> +
>      prstatus = (ELF_Prstatus *)ELFNOTE_DESC(note);
>  
>      note = ELFNOTE_NEXT(note);
> @@ -265,13 +275,6 @@ static struct keyhandler crashdump_trigg
>      .desc = "trigger a crashdump"
>  };
>  
> -static __init int register_crashdump_trigger(void)
> -{
> -    register_keyhandler('C', &crashdump_trigger_keyhandler);
> -    return 0;
> -}
> -__initcall(register_crashdump_trigger);
> -
>  static void setup_note(Elf_Note *n, const char *name, int type, int descsz)
>  {
>      int l = strlen(name) + 1;
> @@ -281,13 +284,144 @@ static void setup_note(Elf_Note *n, cons
>      n->type = type;
>  }
>  
> -static int sizeof_note(const char *name, int descsz)
> +static size_t sizeof_note(const char *name, int descsz)
>  {
>      return (sizeof(Elf_Note) +
>              ELFNOTE_ALIGN(strlen(name)+1) +
>              ELFNOTE_ALIGN(descsz));
>  }
>  
> +static size_t sizeof_cpu_notes(const unsigned long cpu)
> +{
> +    /* All CPUs present a PRSTATUS and crash_xen_core note. */
> +    size_t bytes =
> +        + sizeof_note("CORE", sizeof(ELF_Prstatus)) +
> +        + sizeof_note("Xen", sizeof(crash_xen_core_t));
> +
> +    /* CPU0 also presents the crash_xen_info note. */
> +    if ( ! cpu )
> +        bytes = bytes +
> +            sizeof_note("Xen", sizeof(crash_xen_info_t));
> +
> +    return bytes;
> +}
> +
> +/* Allocate a crash note buffer for a newly onlined cpu. */
> +static int kexec_init_cpu_notes(const unsigned long cpu)
> +{
> +    Elf_Note * note = NULL;
> +    int ret = 0;
> +    int nr_bytes = 0;
> +
> +    BUG_ON( cpu >= nr_cpu_ids || ! crash_notes );
> +
> +    /* If already allocated, nothing to do. */
> +    if ( crash_notes[cpu].start )
> +        return ret;
> +
> +    nr_bytes = sizeof_cpu_notes(cpu);
> +    note = xmalloc_bytes(nr_bytes);
> +
> +    /* Protect the write into crash_notes[] with a spinlock, as this 
> function
> +     * is on a hotplug path and a hypercall path. */
> +    spin_lock(&crash_notes_lock);
> +
> +    /* If we are racing with another CPU and it has beaten us, give up
> +     * gracefully. */
> +    if ( crash_notes[cpu].start )
> +    {
> +        spin_unlock(&crash_notes_lock);
> +        /* Always return ok, because whether we successfully allocated or 
> not,
> +         * another CPU has successfully allocated. */
> +        if ( note )
> +            xfree(note);
> +    }
> +    else
> +    {
> +        crash_notes[cpu].start = note;
> +        crash_notes[cpu].size = nr_bytes;
> +        spin_unlock(&crash_notes_lock);
> +
> +        /* If the allocation failed, and another CPU did not beat us, give
> +         * up with ENOMEM. */
> +        if ( ! note )
> +            ret = -ENOMEM;
> +        /* else all is good so lets set up the notes. */
> +        else
> +        {
> +            /* Set up CORE note. */
> +            setup_note(note, "CORE", NT_PRSTATUS, sizeof(ELF_Prstatus));
> +            note = ELFNOTE_NEXT(note);
> +
> +            /* Set up Xen CORE note. */
> +            setup_note(note, "Xen", XEN_ELFNOTE_CRASH_REGS,
> +                       sizeof(crash_xen_core_t));
> +
> +            if ( ! cpu )
> +            {
> +                /* Set up Xen Crash Info note. */
> +                xen_crash_note = note = ELFNOTE_NEXT(note);
> +                setup_note(note, "Xen", XEN_ELFNOTE_CRASH_INFO,
> +                           sizeof(crash_xen_info_t));
> +            }
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static int cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned long cpu = (unsigned long)hcpu;
> +
> +    /* Only hook on CPU_UP_PREPARE because once a crash_note has been 
> reported
> +     * to dom0, it must keep it around in case of a crash, as the crash 
> kernel
> +     * will be hard coded to the original physical address reported. */
> +    switch ( action )
> +    {
> +    case CPU_UP_PREPARE:
> +        /* Ignore return value.  If this boot time, -ENOMEM will cause all
> +         * manner of problems elsewhere very soon, and if it is during 
> runtime,
> +         * then failing to allocate crash notes is not a good enough reason 
> to
> +         * fail the CPU_UP_PREPARE */
> +        kexec_init_cpu_notes(cpu);
> +        break;
> +    default:
> +        break;
> +    }
> +    return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block cpu_nfb = {
> +    .notifier_call = cpu_callback
> +};
> +
> +static int __init kexec_init(void)
> +{
> +    void *cpu = (void *)(unsigned long)smp_processor_id();
> +
> +    /* If no crash area, no need to allocate space for notes. */
> +    if ( !kexec_crash_area.size )
> +        return 0;
> +
> +    register_keyhandler('C', &crashdump_trigger_keyhandler);
> +
> +    crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
> +    if ( ! crash_notes )
> +        return -ENOMEM;
> +
> +    memset(crash_notes, 0, sizeof(crash_note_range_t) * nr_cpu_ids);
> +
> +    cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
> +    register_cpu_notifier(&cpu_nfb);
> +    return 0;
> +}
> +/* The reason for this to be a presmp_initcall as opposed to a regular
> + * __initcall is to allow the setup of the cpu hotplug handler before APs 
> are
> + * brought up. */
> +presmp_initcall(kexec_init);
> +
>  static int kexec_get_reserve(xen_kexec64_range_t *range)
>  {
>      if ( kexec_crash_area.size > 0 && kexec_crash_area.start > 0) {
> @@ -302,44 +436,29 @@ static int kexec_get_reserve(xen_kexec64
>  static int kexec_get_cpu(xen_kexec64_range_t *range)
>  {
>      int nr = range->nr;
> -    int nr_bytes = 0;
>  
> -    if ( nr < 0 || nr >= nr_cpu_ids || !cpu_online(nr) )
> +    if ( nr < 0 || nr >= nr_cpu_ids )
> +        return -ERANGE;
> +
> +    if ( ! crash_notes )
>          return -EINVAL;
>  
> -    nr_bytes += sizeof_note("CORE", sizeof(ELF_Prstatus));
> -    nr_bytes += sizeof_note("Xen", sizeof(crash_xen_core_t));
> +    /* Try once again to allocate room for the crash notes.  It is just 
> possible
> +     * that more space has become available since we last tried.  If space 
> has
> +     * already been allocated, kexec_init_cpu_notes() will return early 
> with 0.
> +     */
> +    kexec_init_cpu_notes(nr);
>  
> -    /* The Xen info note is included in CPU0's range. */
> -    if ( nr == 0 )
> -        nr_bytes += sizeof_note("Xen", sizeof(crash_xen_info_t));
> +    /* In the case of still not having enough memory to allocate buffer 
> room,
> +     * returning a range of 0,0 is still valid. */
> +    if ( crash_notes[nr].start )
> +    {
> +        range->start = __pa(crash_notes[nr].start);
> +        range->size = crash_notes[nr].size;
> +    }
> +    else
> +        range->start = range->size = 0;
>  
> -    if ( per_cpu(crash_notes, nr) == NULL )
> -    {
> -        Elf_Note *note;
> -
> -        note = per_cpu(crash_notes, nr) = xmalloc_bytes(nr_bytes);
> -
> -        if ( note == NULL )
> -            return -ENOMEM;
> -
> -        /* Setup CORE note. */
> -        setup_note(note, "CORE", NT_PRSTATUS, sizeof(ELF_Prstatus));
> -
> -        /* Setup Xen CORE note. */
> -        note = ELFNOTE_NEXT(note);
> -        setup_note(note, "Xen", XEN_ELFNOTE_CRASH_REGS, 
> sizeof(crash_xen_core_t));
> -
> -        if (nr == 0)
> -        {
> -            /* Setup system wide Xen info note. */
> -            xen_crash_note = note = ELFNOTE_NEXT(note);
> -            setup_note(note, "Xen", XEN_ELFNOTE_CRASH_INFO, 
> sizeof(crash_xen_info_t));
> -        }
> -    }
> -
> -    range->start = __pa((unsigned long)per_cpu(crash_notes, nr));
> -    range->size = nr_bytes;
>      return 0;
>  }
>  
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com 
> http://lists.xensource.com/xen-devel 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-22 17:36 ` [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory Andrew Cooper
  2011-12-22 18:09   ` David Vrabel
@ 2011-12-23  9:06   ` Jan Beulich
  2011-12-23 11:59     ` Andrew Cooper
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2011-12-23  9:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 22.12.11 at 18:36, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
> as the CPU crash notes will go into the xenheap, which tends to be in
> upper memory.  This causes problems on machines with more than 64GB
> (or 4GB if no PAE support) of ram as the crashkernel physically cant
> access the crash notes.

What use is a crash dump taken with a 32-bit kernel on a machine
with more than 64G (or more than 4G is the crash kernel doesn't
support PAE)?

> The solution is to force Xen to allocate certain structures in lower
> memory.  This is achieved by introducing two new command line
> parameters; low_crashinfo and crashinfo_maxaddr.  Because of the
> potential impact on 32bit PV guests, and that this problem does not
> exist for 64bit dom0 on 64bit Xen, this new functionality defaults to
> the codebase's previous behavior, requiring the user to explicitly
> add extra command line parameters to change the behavior.
> 
> This patch consists of 3 logically distinct but closely related
> changes.
>  1) Add the two new command line parameters.

Do you really need both? Just being able to set the max address
would seem sufficient...

>  2) Change crash note allocation to use lower memory when instructed.
>  3) Change the conring buffer to use lower memory when instructed.

Why does this one need moving, but none of the other Xen data
structures? If anything, I would have expected you to limit the Xen
heap altogether, so that automatically all allocations get done in a
way so that at least Xen's data structures are available in the
(truncated) dump.

> There result is that the crash notes and console ring will be placed
> in lower memory so useful information can be recovered in the case of
> a crash.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> diff -r 3da37c68284f -r 23c31d59ffb1 xen/arch/x86/setup.c
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -586,6 +586,10 @@ void __init __start_xen(unsigned long mb
>      }
>      cmdline_parse(cmdline);
>  
> +    /* Must be after command line argument parsing and before 
> +     * allocing any xenheap structures wanted in lower memory. */
> +    kexec_early_calculations();
> +

But does it really need to be *this* early?

>      parse_video_info();
>  
>      set_current((struct vcpu *)0xfffff000); /* debug sanity */
> diff -r 3da37c68284f -r 23c31d59ffb1 xen/common/kexec.c
> --- a/xen/common/kexec.c
> +++ b/xen/common/kexec.c
> @@ -36,7 +36,9 @@ bool_t kexecing = FALSE;
>  typedef struct { Elf_Note * start; size_t size; } crash_note_range_t;
>  static crash_note_range_t * crash_notes;
>  
> -/* Lock to prevent race conditions when allocating the crash note buffers. 
> */
> +/* Lock to prevent race conditions when allocating the crash note buffers.
> + * It also serves to protect calls to alloc_from_crash_heap when allocating
> + * crash note buffers in lower memory. */
>  static DEFINE_SPINLOCK(crash_notes_lock);
>  
>  static Elf_Note *xen_crash_note;
> @@ -70,6 +72,23 @@ static struct {
>      (_d_)->start = (_s_)->start; \
>  } while (0)
>  
> +/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
> + * defaults without needing to know the state of the others. */
> +enum low_crashinfo low_crashinfo_mode = LOW_CRASHINFO_INVALID;
> +
> +/* This value is only considered if low_crash_mode is set to MIN or ALL, so
> + * setting a default here is safe. Default to 4GB.  This is because the 
> current
> + * KEXEC_CMD_get_range compat hypercall trucates 64bit pointers to 32 bits. 
> The
> + * typical usecase for crashinfo_maxaddr will be for 64bit Xen with 32bit 
> dom0 
> + * and 32bit crash kernel. */
> +static paddr_t crashinfo_maxaddr = 4ULL << 30;

__initdata

> +
> +/* = log base 2 of crashinfo_maxaddr after checking for sanity. */
> +paddr_t crashinfo_maxaddr_bits = 0;
> +
> +/* Pointers to keep track of the crash heap region. */
> +static void *crash_heap_current = NULL, *crash_heap_end = NULL;
> +
>  /*
>   * Parse command lines in the format
>   *
> @@ -148,6 +167,58 @@ static void __init parse_crashkernel(con
>  }
>  custom_param("crashkernel", parse_crashkernel);
>  
> +/* Parse command lines in the format:
> + *
> + *   low_crashinfo=[none,min,all]
> + * 
> + * - none disables the low allocation of crash info.
> + * - min will allocate enough low information for the crash kernel to be 
> able
> + *       to extract the hypervisor and dom0 message ring buffers.
> + * - all will allocate additional structures such as domain and vcpu structs
> + *       low so the crash kernel can perform an extended analysis of state.
> + */
> +static void __init parse_low_crashinfo(const char * str)
> +{
> +
> +    if ( !strlen(str) )
> +        /* default to min if user just specifies "low_crashinfo" */
> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
> +    else if ( !strcmp(str, "none" ) )
> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;
> +    else if ( !strcmp(str, "min" ) )
> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
> +    else if ( !strcmp(str, "all" ) )
> +        low_crashinfo_mode = LOW_CRASHINFO_ALL;
> +    else
> +    {
> +        printk("Unknown low_crashinfo parameter '%s'.  Defaulting to 
> min.\n", str);
> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
> +    }
> +}
> +custom_param("low_crashinfo", parse_low_crashinfo);
> +
> +/* Parse command lines in the format:
> + * 
> + *   crashinfo_maxaddr=<addr>
> + *
> + * <addr> will be rounded down to the nearest power of two.  Defaults to 64G
> + */
> +static void __init parse_crashinfo_maxaddr(const char * str)
> +{
> +    u64 addr;
> +
> +    /* if low_crashinfo_mode is unset, default to min. */
> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
> +
> +    if ( (addr = parse_size_and_unit(str, NULL)) )
> +        crashinfo_maxaddr = addr;
> +    else
> +        printk("Unable to parse crashinfo_maxaddr. Defaulting to %p\n",
> +               (void*)crashinfo_maxaddr);
> +}
> +custom_param("crashinfo_maxaddr", parse_crashinfo_maxaddr);
> +
>  void __init set_kexec_crash_area_size(u64 system_ram)
>  {
>      unsigned int idx;
> @@ -306,6 +377,20 @@ static size_t sizeof_cpu_notes(const uns
>      return bytes;
>  }
>  
> +/* Allocate size_t bytes of space from the previously allocated
> + * crash heap if the user has requested that crash notes be allocated
> + * in lower memory.  There is currently no case where the crash notes
> + * should be free()'d. */
> +static void * alloc_from_crash_heap(const size_t bytes)
> +{
> +    void * ret;
> +    if ( crash_heap_current + bytes > crash_heap_end )
> +        return NULL;
> +    ret = (void*)crash_heap_current;
> +    crash_heap_current += bytes;
> +    return ret;
> +}
> +
>  /* Allocate a crash note buffer for a newly onlined cpu. */
>  static int kexec_init_cpu_notes(const unsigned long cpu)
>  {
> @@ -320,7 +405,10 @@ static int kexec_init_cpu_notes(const un
>          return ret;
>  
>      nr_bytes = sizeof_cpu_notes(cpu);
> -    note = xmalloc_bytes(nr_bytes);
> +
> +    /* If we dont care about the position of allocation, malloc. */
> +    if ( low_crashinfo_mode == LOW_CRASHINFO_NONE )
> +        note = xmalloc_bytes(nr_bytes);
>  
>      /* Protect the write into crash_notes[] with a spinlock, as this 
> function
>       * is on a hotplug path and a hypercall path. */
> @@ -338,6 +426,11 @@ static int kexec_init_cpu_notes(const un
>      }
>      else
>      {
> +        /* If we care about memory possition, alloc from the crash heap,
> +         * also protected by the crash_notes_lock. */
> +        if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +            note = alloc_from_crash_heap(nr_bytes);
> +
>          crash_notes[cpu].start = note;
>          crash_notes[cpu].size = nr_bytes;
>          spin_unlock(&crash_notes_lock);
> @@ -397,6 +490,24 @@ static struct notifier_block cpu_nfb = {
>      .notifier_call = cpu_callback
>  };
>  
> +void __init kexec_early_calculations(void)
> +{
> +    /* If low_crashinfo_mode is still INVALID, neither "low_crashinfo" nor
> +     * "crashinfo_maxaddr" have been specified on the command line, so
> +     * explicitly set to NONE. */
> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;
> +
> +    crashinfo_maxaddr_bits = 0;
> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +    {
> +        paddr_t tmp = crashinfo_maxaddr;
> +
> +        while ( tmp >>= 1 )
> +            crashinfo_maxaddr_bits++;

fls() or some of its cousins

> +    }
> +}
> +
>  static int __init kexec_init(void)
>  {
>      void *cpu = (void *)(unsigned long)smp_processor_id();
> @@ -407,6 +518,29 @@ static int __init kexec_init(void)
>  
>      register_keyhandler('C', &crashdump_trigger_keyhandler);
>  
> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
> +    {
> +        size_t crash_heap_size;
> +
> +        /* This calculation is safe even if the machine is booted in 
> +         * uniprocessor mode. */
> +        crash_heap_size = sizeof_cpu_notes(0) +
> +            sizeof_cpu_notes(1) * (nr_cpu_ids - 1);
> +        crash_heap_size = PAGE_ALIGN(crash_heap_size);
> +
> +        crash_heap_current = alloc_xenheap_pages(
> +            get_order_from_bytes(crash_heap_size),
> +            MEMF_bits(crashinfo_maxaddr_bits) );
> +
> +        if ( crash_heap_current )
> +            crash_heap_end = crash_heap_current + crash_heap_size;
> +        else
> +            return -ENOMEM;

Constructs like this generally look bogus to me, as

        if ( !crash_heap_current )
            return -ENOMEM;
        ...

is shorter and (to me at least) more clear.

> +    }
> +
> +    /* crash_notes may be allocated anywhere Xen can reach in memory.
> +       Only the individual CPU crash notes themselves must be allocated
> +       in lower memory if requested. */
>      crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
>      if ( ! crash_notes )
>          return -ENOMEM;
> diff -r 3da37c68284f -r 23c31d59ffb1 xen/drivers/char/console.c
> --- a/xen/drivers/char/console.c
> +++ b/xen/drivers/char/console.c
> @@ -584,6 +584,7 @@ void __init console_init_postirq(void)
>  {
>      char *ring;
>      unsigned int i, order;
> +    u64 memflags;
>  
>      serial_init_postirq();
>  
> @@ -591,7 +592,8 @@ void __init console_init_postirq(void)
>          opt_conring_size = num_present_cpus() << (9 + xenlog_lower_thresh);
>  
>      order = get_order_from_bytes(max(opt_conring_size, conring_size));
> -    while ( (ring = alloc_xenheap_pages(order, 0)) == NULL )
> +    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? 
> MEMF_bits(crashinfo_maxaddr_bits) : 0;
> +    while ( (ring = alloc_xenheap_pages(order, memflags)) == NULL )
>      {
>          BUG_ON(order == 0);
>          order--;
> diff -r 3da37c68284f -r 23c31d59ffb1 xen/include/xen/kexec.h
> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -25,6 +25,19 @@ void set_kexec_crash_area_size(u64 syste
>  #define KEXEC_IMAGE_CRASH_BASE   2
>  #define KEXEC_IMAGE_NR           4
>  
> +enum low_crashinfo {
> +    LOW_CRASHINFO_INVALID = 0,
> +    LOW_CRASHINFO_NONE = 1,
> +    LOW_CRASHINFO_MIN = 2,
> +    LOW_CRASHINFO_ALL = 3
> +};
> +
> +/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
> + * defaults without needing to know the state of the others. */
> +extern enum low_crashinfo low_crashinfo_mode;
> +extern paddr_t crashinfo_maxaddr_bits;
> +void kexec_early_calculations(void);
> +
>  int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
>  void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
>  void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com 
> http://lists.xensource.com/xen-devel 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-22 18:44     ` Andrew Cooper
@ 2011-12-23  9:09       ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2011-12-23  9:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tim Deegan, David Vrabel

>>> On 22.12.11 at 19:44, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 22/12/11 18:09, David Vrabel wrote:
>> On 22/12/11 17:36, Andrew Cooper wrote:
>>> +    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? 
> MEMF_bits(crashinfo_maxaddr_bits) : 0;
>> If you set crashinfo_maxaddr_bits to 64 if low_crashinfo_mode isn't used
>> you wouldn't need this test.
> 
> I am not familiar enough with the intricacies of alloc_xenheap_pages to
> know whether that is safe.  cc'ing Tim for his expertise.

Passing 0 for the bit width is equivalent to passing a value beyond
what is covering the maximum available address, so you can pick
whatever better serves the purpose.

Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2
  2011-12-23  8:34   ` Jan Beulich
@ 2011-12-23 11:17     ` Andrew Cooper
  2011-12-23 12:56       ` Keir Fraser
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2011-12-23 11:17 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel

On 23/12/11 08:34, Jan Beulich wrote:
>>>> On 22.12.11 at 18:36, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Currently, KEXEC_CMD_kexec_get_range in compat mode, or with 32bit
>> Xen, will truncate 64bit pointers to 32bits, leading to problems on
>> machines with more than 4GiB of RAM.  As 32bit dom0 kernels tend to be
>> PAE capable, this is especially wasteful, as most structures currently
>> limited to <4GiB could easily be <64GiB instead.
>>
>> Therefore, introduce a new hypercall 'KEXEC_CMD_kexec64_get_range'
>> which passes similar information as KEXEC_CMD_kexec_get_range, but
>> which uses a structure with explicitly specified field widths, causing
>> it to be identical in the compat and regular case.  This new
>> structure, xen_kexec64_range_t, will be the same as xen_kexec_range_t
>> if Xen is compiled for x86_64, but will still use 64bit integers if
>> Xen is compiled for x86_32.
>>
>> To fix 32bit Xen which uses 32bit integers for addresses and sizes,
>> change the internals to use xen_kexec64_range_t which will use 64bit
>> integers instead.  This also involves changing several casts to
>> explicitly use uint64_ts rather than unsigned longs.
> Just for the record - I continue to be opposed to doing this for the
> 32-bit hypervisor. All relevant allocations are below 4G there, so
> there's simply no need to decrease code readability for no benefit.
>
> Jan

I don't mind reducing the scope to ignore the 32bit Xen case.  Any
object Keir?

<snip>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-23  9:06   ` Jan Beulich
@ 2011-12-23 11:59     ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2011-12-23 11:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 23/12/11 09:06, Jan Beulich wrote:
>>>> On 22.12.11 at 18:36, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
>> as the CPU crash notes will go into the xenheap, which tends to be in
>> upper memory.  This causes problems on machines with more than 64GB
>> (or 4GB if no PAE support) of ram as the crashkernel physically cant
>> access the crash notes.
> What use is a crash dump taken with a 32-bit kernel on a machine
> with more than 64G (or more than 4G is the crash kernel doesn't
> support PAE)?

Very little use at all, which is the reason for this change.  The
correct solution is indeed to use a 64bit dom0 kernel, but while
politics is preventing that from happening, another solution has to be
found.  I doubt that XenServer is the only product which would benifit,
which is why the changes are presented for upstreaming.

>> The solution is to force Xen to allocate certain structures in lower
>> memory.  This is achieved by introducing two new command line
>> parameters; low_crashinfo and crashinfo_maxaddr.  Because of the
>> potential impact on 32bit PV guests, and that this problem does not
>> exist for 64bit dom0 on 64bit Xen, this new functionality defaults to
>> the codebase's previous behavior, requiring the user to explicitly
>> add extra command line parameters to change the behavior.
>>
>> This patch consists of 3 logically distinct but closely related
>> changes.
>>  1) Add the two new command line parameters.
> Do you really need both? Just being able to set the max address
> would seem sufficient...

The plan, as explained in the comments, is for two (or perhaps more)
levels of low allocation.  "min" means "allocate enough stuff low for
the crash kernel to grab the Xen ring buffer" where "all" will include
extra things such as the page tables so the crash kernel can generate
full stack traces given the PT_STATUS notes.  The differentiation is
because "min" should not impact very much at all on other memory
restricted things such as 32bit PV guests, whereas "all" certainly will.

>>  2) Change crash note allocation to use lower memory when instructed.
>>  3) Change the conring buffer to use lower memory when instructed.
> Why does this one need moving, but none of the other Xen data
> structures? If anything, I would have expected you to limit the Xen
> heap altogether, so that automatically all allocations get done in a
> way so that at least Xen's data structures are available in the
> (truncated) dump.

This is part of the "min" option which is trying to have the least
possible impact.  The idea is to have this in low memory, use the
"console_to_ring" boot option to copy dom0 dmesg into conring, and pass
its physical address and size in a crash note, so that the crash kernel
environment grab it all.

>> There result is that the crash notes and console ring will be placed
>> in lower memory so useful information can be recovered in the case of
>> a crash.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> diff -r 3da37c68284f -r 23c31d59ffb1 xen/arch/x86/setup.c
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -586,6 +586,10 @@ void __init __start_xen(unsigned long mb
>>      }
>>      cmdline_parse(cmdline);
>>  
>> +    /* Must be after command line argument parsing and before 
>> +     * allocing any xenheap structures wanted in lower memory. */
>> +    kexec_early_calculations();
>> +
> But does it really need to be *this* early?

Possibly not, but it needs to be before console_preirq_init() which is
not much later.

>>      parse_video_info();
>>  
>>      set_current((struct vcpu *)0xfffff000); /* debug sanity */
>> diff -r 3da37c68284f -r 23c31d59ffb1 xen/common/kexec.c
>> --- a/xen/common/kexec.c
>> +++ b/xen/common/kexec.c
>> @@ -36,7 +36,9 @@ bool_t kexecing = FALSE;
>>  typedef struct { Elf_Note * start; size_t size; } crash_note_range_t;
>>  static crash_note_range_t * crash_notes;
>>  
>> -/* Lock to prevent race conditions when allocating the crash note buffers. 
>> */
>> +/* Lock to prevent race conditions when allocating the crash note buffers.
>> + * It also serves to protect calls to alloc_from_crash_heap when allocating
>> + * crash note buffers in lower memory. */
>>  static DEFINE_SPINLOCK(crash_notes_lock);
>>  
>>  static Elf_Note *xen_crash_note;
>> @@ -70,6 +72,23 @@ static struct {
>>      (_d_)->start = (_s_)->start; \
>>  } while (0)
>>  
>> +/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
>> + * defaults without needing to know the state of the others. */
>> +enum low_crashinfo low_crashinfo_mode = LOW_CRASHINFO_INVALID;
>> +
>> +/* This value is only considered if low_crash_mode is set to MIN or ALL, so
>> + * setting a default here is safe. Default to 4GB.  This is because the 
>> current
>> + * KEXEC_CMD_get_range compat hypercall trucates 64bit pointers to 32 bits. 
>> The
>> + * typical usecase for crashinfo_maxaddr will be for 64bit Xen with 32bit 
>> dom0 
>> + * and 32bit crash kernel. */
>> +static paddr_t crashinfo_maxaddr = 4ULL << 30;
> __initdata

Good point

>> +
>> +/* = log base 2 of crashinfo_maxaddr after checking for sanity. */
>> +paddr_t crashinfo_maxaddr_bits = 0;
>> +
>> +/* Pointers to keep track of the crash heap region. */
>> +static void *crash_heap_current = NULL, *crash_heap_end = NULL;
>> +
>>  /*
>>   * Parse command lines in the format
>>   *
>> @@ -148,6 +167,58 @@ static void __init parse_crashkernel(con
>>  }
>>  custom_param("crashkernel", parse_crashkernel);
>>  
>> +/* Parse command lines in the format:
>> + *
>> + *   low_crashinfo=[none,min,all]
>> + * 
>> + * - none disables the low allocation of crash info.
>> + * - min will allocate enough low information for the crash kernel to be 
>> able
>> + *       to extract the hypervisor and dom0 message ring buffers.
>> + * - all will allocate additional structures such as domain and vcpu structs
>> + *       low so the crash kernel can perform an extended analysis of state.
>> + */
>> +static void __init parse_low_crashinfo(const char * str)
>> +{
>> +
>> +    if ( !strlen(str) )
>> +        /* default to min if user just specifies "low_crashinfo" */
>> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
>> +    else if ( !strcmp(str, "none" ) )
>> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;
>> +    else if ( !strcmp(str, "min" ) )
>> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
>> +    else if ( !strcmp(str, "all" ) )
>> +        low_crashinfo_mode = LOW_CRASHINFO_ALL;
>> +    else
>> +    {
>> +        printk("Unknown low_crashinfo parameter '%s'.  Defaulting to 
>> min.\n", str);
>> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
>> +    }
>> +}
>> +custom_param("low_crashinfo", parse_low_crashinfo);
>> +
>> +/* Parse command lines in the format:
>> + * 
>> + *   crashinfo_maxaddr=<addr>
>> + *
>> + * <addr> will be rounded down to the nearest power of two.  Defaults to 64G
>> + */
>> +static void __init parse_crashinfo_maxaddr(const char * str)
>> +{
>> +    u64 addr;
>> +
>> +    /* if low_crashinfo_mode is unset, default to min. */
>> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
>> +        low_crashinfo_mode = LOW_CRASHINFO_MIN;
>> +
>> +    if ( (addr = parse_size_and_unit(str, NULL)) )
>> +        crashinfo_maxaddr = addr;
>> +    else
>> +        printk("Unable to parse crashinfo_maxaddr. Defaulting to %p\n",
>> +               (void*)crashinfo_maxaddr);
>> +}
>> +custom_param("crashinfo_maxaddr", parse_crashinfo_maxaddr);
>> +
>>  void __init set_kexec_crash_area_size(u64 system_ram)
>>  {
>>      unsigned int idx;
>> @@ -306,6 +377,20 @@ static size_t sizeof_cpu_notes(const uns
>>      return bytes;
>>  }
>>  
>> +/* Allocate size_t bytes of space from the previously allocated
>> + * crash heap if the user has requested that crash notes be allocated
>> + * in lower memory.  There is currently no case where the crash notes
>> + * should be free()'d. */
>> +static void * alloc_from_crash_heap(const size_t bytes)
>> +{
>> +    void * ret;
>> +    if ( crash_heap_current + bytes > crash_heap_end )
>> +        return NULL;
>> +    ret = (void*)crash_heap_current;
>> +    crash_heap_current += bytes;
>> +    return ret;
>> +}
>> +
>>  /* Allocate a crash note buffer for a newly onlined cpu. */
>>  static int kexec_init_cpu_notes(const unsigned long cpu)
>>  {
>> @@ -320,7 +405,10 @@ static int kexec_init_cpu_notes(const un
>>          return ret;
>>  
>>      nr_bytes = sizeof_cpu_notes(cpu);
>> -    note = xmalloc_bytes(nr_bytes);
>> +
>> +    /* If we dont care about the position of allocation, malloc. */
>> +    if ( low_crashinfo_mode == LOW_CRASHINFO_NONE )
>> +        note = xmalloc_bytes(nr_bytes);
>>  
>>      /* Protect the write into crash_notes[] with a spinlock, as this 
>> function
>>       * is on a hotplug path and a hypercall path. */
>> @@ -338,6 +426,11 @@ static int kexec_init_cpu_notes(const un
>>      }
>>      else
>>      {
>> +        /* If we care about memory possition, alloc from the crash heap,
>> +         * also protected by the crash_notes_lock. */
>> +        if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +            note = alloc_from_crash_heap(nr_bytes);
>> +
>>          crash_notes[cpu].start = note;
>>          crash_notes[cpu].size = nr_bytes;
>>          spin_unlock(&crash_notes_lock);
>> @@ -397,6 +490,24 @@ static struct notifier_block cpu_nfb = {
>>      .notifier_call = cpu_callback
>>  };
>>  
>> +void __init kexec_early_calculations(void)
>> +{
>> +    /* If low_crashinfo_mode is still INVALID, neither "low_crashinfo" nor
>> +     * "crashinfo_maxaddr" have been specified on the command line, so
>> +     * explicitly set to NONE. */
>> +    if ( low_crashinfo_mode == LOW_CRASHINFO_INVALID )
>> +        low_crashinfo_mode = LOW_CRASHINFO_NONE;
>> +
>> +    crashinfo_maxaddr_bits = 0;
>> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +    {
>> +        paddr_t tmp = crashinfo_maxaddr;
>> +
>> +        while ( tmp >>= 1 )
>> +            crashinfo_maxaddr_bits++;
> fls() or some of its cousins

Ok

>> +    }
>> +}
>> +
>>  static int __init kexec_init(void)
>>  {
>>      void *cpu = (void *)(unsigned long)smp_processor_id();
>> @@ -407,6 +518,29 @@ static int __init kexec_init(void)
>>  
>>      register_keyhandler('C', &crashdump_trigger_keyhandler);
>>  
>> +    if ( low_crashinfo_mode > LOW_CRASHINFO_NONE )
>> +    {
>> +        size_t crash_heap_size;
>> +
>> +        /* This calculation is safe even if the machine is booted in 
>> +         * uniprocessor mode. */
>> +        crash_heap_size = sizeof_cpu_notes(0) +
>> +            sizeof_cpu_notes(1) * (nr_cpu_ids - 1);
>> +        crash_heap_size = PAGE_ALIGN(crash_heap_size);
>> +
>> +        crash_heap_current = alloc_xenheap_pages(
>> +            get_order_from_bytes(crash_heap_size),
>> +            MEMF_bits(crashinfo_maxaddr_bits) );
>> +
>> +        if ( crash_heap_current )
>> +            crash_heap_end = crash_heap_current + crash_heap_size;
>> +        else
>> +            return -ENOMEM;
> Constructs like this generally look bogus to me, as
>
>         if ( !crash_heap_current )
>             return -ENOMEM;
>         ...
>
> is shorter and (to me at least) more clear.

True - for a long time I was debating whether to return ENOMEM at this
point at not, so I clearly didn't consider clarity when deciding to add
it in.  I will fix this.

>> +    }
>> +
>> +    /* crash_notes may be allocated anywhere Xen can reach in memory.
>> +       Only the individual CPU crash notes themselves must be allocated
>> +       in lower memory if requested. */
>>      crash_notes = xmalloc_array(crash_note_range_t, nr_cpu_ids);
>>      if ( ! crash_notes )
>>          return -ENOMEM;
>> diff -r 3da37c68284f -r 23c31d59ffb1 xen/drivers/char/console.c
>> --- a/xen/drivers/char/console.c
>> +++ b/xen/drivers/char/console.c
>> @@ -584,6 +584,7 @@ void __init console_init_postirq(void)
>>  {
>>      char *ring;
>>      unsigned int i, order;
>> +    u64 memflags;
>>  
>>      serial_init_postirq();
>>  
>> @@ -591,7 +592,8 @@ void __init console_init_postirq(void)
>>          opt_conring_size = num_present_cpus() << (9 + xenlog_lower_thresh);
>>  
>>      order = get_order_from_bytes(max(opt_conring_size, conring_size));
>> -    while ( (ring = alloc_xenheap_pages(order, 0)) == NULL )
>> +    memflags = low_crashinfo_mode > LOW_CRASHINFO_NONE ? 
>> MEMF_bits(crashinfo_maxaddr_bits) : 0;
>> +    while ( (ring = alloc_xenheap_pages(order, memflags)) == NULL )
>>      {
>>          BUG_ON(order == 0);
>>          order--;
>> diff -r 3da37c68284f -r 23c31d59ffb1 xen/include/xen/kexec.h
>> --- a/xen/include/xen/kexec.h
>> +++ b/xen/include/xen/kexec.h
>> @@ -25,6 +25,19 @@ void set_kexec_crash_area_size(u64 syste
>>  #define KEXEC_IMAGE_CRASH_BASE   2
>>  #define KEXEC_IMAGE_NR           4
>>  
>> +enum low_crashinfo {
>> +    LOW_CRASHINFO_INVALID = 0,
>> +    LOW_CRASHINFO_NONE = 1,
>> +    LOW_CRASHINFO_MIN = 2,
>> +    LOW_CRASHINFO_ALL = 3
>> +};
>> +
>> +/* Low crashinfo mode.  Start as INVALID so serveral codepaths can set up
>> + * defaults without needing to know the state of the others. */
>> +extern enum low_crashinfo low_crashinfo_mode;
>> +extern paddr_t crashinfo_maxaddr_bits;
>> +void kexec_early_calculations(void);
>> +
>>  int machine_kexec_load(int type, int slot, xen_kexec_image_t *image);
>>  void machine_kexec_unload(int type, int slot, xen_kexec_image_t *image);
>>  void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com 
>> http://lists.xensource.com/xen-devel 
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2
  2011-12-23 11:17     ` Andrew Cooper
@ 2011-12-23 12:56       ` Keir Fraser
  0 siblings, 0 replies; 18+ messages in thread
From: Keir Fraser @ 2011-12-23 12:56 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich; +Cc: xen-devel

On 23/12/2011 11:17, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

>> Just for the record - I continue to be opposed to doing this for the
>> 32-bit hypervisor. All relevant allocations are below 4G there, so
>> there's simply no need to decrease code readability for no benefit.
>> 
>> Jan
> 
> I don't mind reducing the scope to ignore the 32bit Xen case.  Any
> object Keir?

Fine by me.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-31  7:38   ` Jan Beulich
@ 2012-01-03 11:58     ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2012-01-03 11:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 31/12/11 07:38, Jan Beulich wrote:
>>>> Andrew Cooper <andrew.cooper3@citrix.com> 12/31/11 1:12 AM >>>
>> The register contents of the pcpus which were running will be available
>> in the PR_STATUS notes which are also deliberately allocated in low
>> memory by this patch.  To the best of my understanding; to get to the
>> dom0 vcpu state, the crashkernel needs access to the domain structs and
>> vcpu structs (which I believe are actually allocated below 4GiB) and the
> The vCPU ones are, while the domain one isn't.

Right - that is useful to know.

>> Xen page tables which dom0 uses (which inspecting CR3 from the crash
>> notes is certainly not in lower memory).  I suspect that there is also
>> more which needs to be allocated in lower memory to get a full register
>> dump, stack dump, stack trace etc.
>>
>> The plan is to also have the "all" option from the command line which
>> will also allocate the page tables (and other structures where relevant)
>> in lower memory, but this is rather more of an overhead than just the
>> console ring and crash notes, which will have a more visible impact to
>> customers running 32bit PV guests.  This is the reason for separating
>> the two via a command line argument.
> I'm wondering whether, with much less impact on existing code, and as
> already pointed out, restricting the allocation range of
> alloc_xenheap_pages() (by way of a command line option) wouldn't get
> you what you want.
>
> Jan

Assuming that all relevant structures do in fact get allocated by
alloc_xenheap_pages() (which is probably a safe assumption looking at
the code, but I am not certain) then yes, it would have much less impact
on the existing code.  However, it would have a much bigger impact on
the other memory restricted items.  It might be possible (and perhaps a
good idea when extreme debugging is needed) to have a 3rd command line
option for low_crashinfo= which does this, but I dont like it as the
general solution to this problem.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-31  0:11 ` Andrew Cooper
@ 2011-12-31  7:38   ` Jan Beulich
  2012-01-03 11:58     ` Andrew Cooper
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2011-12-31  7:38 UTC (permalink / raw)
  To: andrew.cooper3; +Cc: xen-devel

>>> Andrew Cooper <andrew.cooper3@citrix.com> 12/31/11 1:12 AM >>>
>The register contents of the pcpus which were running will be available
>in the PR_STATUS notes which are also deliberately allocated in low
>memory by this patch.  To the best of my understanding; to get to the
>dom0 vcpu state, the crashkernel needs access to the domain structs and
>vcpu structs (which I believe are actually allocated below 4GiB) and the

The vCPU ones are, while the domain one isn't.

>Xen page tables which dom0 uses (which inspecting CR3 from the crash
>notes is certainly not in lower memory).  I suspect that there is also
>more which needs to be allocated in lower memory to get a full register
>dump, stack dump, stack trace etc.
>
>The plan is to also have the "all" option from the command line which
>will also allocate the page tables (and other structures where relevant)
>in lower memory, but this is rather more of an overhead than just the
>console ring and crash notes, which will have a more visible impact to
>customers running 32bit PV guests.  This is the reason for separating
>the two via a command line argument.

I'm wondering whether, with much less impact on existing code, and as
already pointed out, restricting the allocation range of
alloc_xenheap_pages() (by way of a command line option) wouldn't get
you what you want.

Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-29 15:51 Jan Beulich
  2011-12-30 11:19 ` Tim Deegan
@ 2011-12-31  0:11 ` Andrew Cooper
  2011-12-31  7:38   ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2011-12-31  0:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 29/12/2011 15:51, Jan Beulich wrote:
>>>> Andrew Cooper  12/23/11 12:59 PM >>>
>> On 23/12/11 09:06, Jan Beulich wrote:
>>>>>> On 22.12.11 at 18:36, Andrew Cooper  wrote:
>>>> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
>>>> as the CPU crash notes will go into the xenheap, which tends to be in
>>>> upper memory.  This causes problems on machines with more than 64GB
>>>> (or 4GB if no PAE support) of ram as the crashkernel physically cant
>>>> access the crash notes.
>>> What use is a crash dump taken with a 32-bit kernel on a machine
>>> with more than 64G (or more than 4G is the crash kernel doesn't
>>> support PAE)?
>> Very little use at all, which is the reason for this change.
> With this change, the usefulness doesn't significantly increase imo.

But it does make it possible for a 32bit crashkernel to get information
useful for debugging a problem, which is the point of the patch.  If you
honestly thing that this change is not worth the effort then that is
fine, but as far as I am aware, there are still several good reasons to
use a 32bit dom0 over a 64bit dom0.  Following this reasoning is why I
believe that this feature may be useful to other projects, as well as
XenServer.

>> The correct solution is indeed to use a 64bit dom0 kernel, but while
>> politics is preventing that from happening, another solution has to be
>> found.  I doubt that XenServer is the only product which would benifit,
>> which is why the changes are presented for upstreaming.
>>
>>>>  3) Change the conring buffer to use lower memory when instructed.
>>> Why does this one need moving, but none of the other Xen data
>>> structures? If anything, I would have expected you to limit the Xen
>>> heap altogether, so that automatically all allocations get done in a
>>> way so that at least Xen's data structures are available in the
>>> (truncated) dump.
>> This is part of the "min" option which is trying to have the least
>> possible impact.  The idea is to have this in low memory, use the
>> "console_to_ring" boot option to copy dom0 dmesg into conring, and pass
>> its physical address and size in a crash note, so that the crash kernel
>> environment grab it all.
> Why is the console ring *that* important? I would have thought
> that proper register values and stack contents are much more
> significant for analysis of a crash.
>
> Jan

In combination with "console_to_ring" on the Xen command line, the dom0
serial console will be copied into xen console ring.  In the case of a
crash, it is highly likely that one or other of them have panic()'d into
their respective console rings.  It is not a foolproof solution, but in
will work in the general case.

The register contents of the pcpus which were running will be available
in the PR_STATUS notes which are also deliberately allocated in low
memory by this patch.  To the best of my understanding; to get to the
dom0 vcpu state, the crashkernel needs access to the domain structs and
vcpu structs (which I believe are actually allocated below 4GiB) and the
Xen page tables which dom0 uses (which inspecting CR3 from the crash
notes is certainly not in lower memory).  I suspect that there is also
more which needs to be allocated in lower memory to get a full register
dump, stack dump, stack trace etc.

The plan is to also have the "all" option from the command line which
will also allocate the page tables (and other structures where relevant)
in lower memory, but this is rather more of an overhead than just the
console ring and crash notes, which will have a more visible impact to
customers running 32bit PV guests.  This is the reason for separating
the two via a command line argument.

~Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
  2011-12-29 15:51 Jan Beulich
@ 2011-12-30 11:19 ` Tim Deegan
  2011-12-31  0:11 ` Andrew Cooper
  1 sibling, 0 replies; 18+ messages in thread
From: Tim Deegan @ 2011-12-30 11:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

At 15:51 +0000 on 29 Dec (1325173875), Jan Beulich wrote:
> >This is part of the "min" option which is trying to have the least
> >possible impact.  The idea is to have this in low memory, use the
> >"console_to_ring" boot option to copy dom0 dmesg into conring, and pass
> >its physical address and size in a crash note, so that the crash kernel
> >environment grab it all.
> 
> Why is the console ring *that* important? I would have thought
> that proper register values and stack contents are much more
> significant for analysis of a crash.

The console ring has been _very_ useful in diagnosing bugs from field
reports, especially things like guest-level watchdog timeouts and
refcounting errors that cause a crash after the interesting event has
passed.  If nothing else it lets you verify the user's description of
the system, and saves a round-trip of 'please try to set up a serial
logger and reproduce the bug').

(I agree, though, that using a 64-bit crash kernel would be a much
better idea).

Tim.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory
@ 2011-12-29 15:51 Jan Beulich
  2011-12-30 11:19 ` Tim Deegan
  2011-12-31  0:11 ` Andrew Cooper
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2011-12-29 15:51 UTC (permalink / raw)
  To: andrew.cooper3; +Cc: xen-devel

>>> Andrew Cooper  12/23/11 12:59 PM >>>
>On 23/12/11 09:06, Jan Beulich wrote:
>>>>> On 22.12.11 at 18:36, Andrew Cooper  wrote:
>>> On 64bit Xen with 32bit dom0 and crashkernel, xmalloc'ing items such
>>> as the CPU crash notes will go into the xenheap, which tends to be in
>>> upper memory.  This causes problems on machines with more than 64GB
>>> (or 4GB if no PAE support) of ram as the crashkernel physically cant
>>> access the crash notes.
>> What use is a crash dump taken with a 32-bit kernel on a machine
>> with more than 64G (or more than 4G is the crash kernel doesn't
>> support PAE)?
>
>Very little use at all, which is the reason for this change.

With this change, the usefulness doesn't significantly increase imo.

>The correct solution is indeed to use a 64bit dom0 kernel, but while
>politics is preventing that from happening, another solution has to be
>found.  I doubt that XenServer is the only product which would benifit,
>which is why the changes are presented for upstreaming.
>
>>>  3) Change the conring buffer to use lower memory when instructed.
>> Why does this one need moving, but none of the other Xen data
>> structures? If anything, I would have expected you to limit the Xen
>> heap altogether, so that automatically all allocations get done in a
>> way so that at least Xen's data structures are available in the
>> (truncated) dump.
>
>This is part of the "min" option which is trying to have the least
>possible impact.  The idea is to have this in low memory, use the
>"console_to_ring" boot option to copy dom0 dmesg into conring, and pass
>its physical address and size in a crash note, so that the crash kernel
>environment grab it all.

Why is the console ring *that* important? I would have thought
that proper register values and stack contents are much more
significant for analysis of a crash.

Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-01-03 11:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-22 17:36 [PATCH 0 of 3] KEXEC fixes Andrew Cooper
2011-12-22 17:36 ` [PATCH 1 of 3] KEXEC: Add new hypercall to fix 64/32bit truncation errors. v2 Andrew Cooper
2011-12-23  8:34   ` Jan Beulich
2011-12-23 11:17     ` Andrew Cooper
2011-12-23 12:56       ` Keir Fraser
2011-12-22 17:36 ` [PATCH 2 of 3] KEXEC: Allocate crash notes on boot v6 Andrew Cooper
2011-12-23  8:42   ` Jan Beulich
2011-12-22 17:36 ` [PATCH 3 of 3] KEXEC: Allocate crash structures in low memory Andrew Cooper
2011-12-22 18:09   ` David Vrabel
2011-12-22 18:44     ` Andrew Cooper
2011-12-23  9:09       ` Jan Beulich
2011-12-23  9:06   ` Jan Beulich
2011-12-23 11:59     ` Andrew Cooper
2011-12-29 15:51 Jan Beulich
2011-12-30 11:19 ` Tim Deegan
2011-12-31  0:11 ` Andrew Cooper
2011-12-31  7:38   ` Jan Beulich
2012-01-03 11:58     ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.