linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
@ 2007-12-07 15:53 Huang, Ying
  2007-12-08 23:53 ` Pavel Machek
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Huang, Ying @ 2007-12-07 15:53 UTC
  To: Eric W. Biederman, Pavel Machek, nigel, Rafael J. Wysocki,
	Andrew Morton, Jeremy Maitin-Shepard
  Cc: linux-kernel, linux-pm, Kexec Mailing List

This patch implements the functionality of jumping between the kexeced
kernel and the original kernel.

To support jumping between two kernels, the devices are put into a
quiescent state and the state of the devices and CPU is saved before
jumping to (executing) the new kernel and before jumping back to the
original kernel. After jumping back from the kexeced kernel, and after
jumping to the new kernel, the state of the devices and CPU is restored
accordingly. The device/CPU state save/restore code of software
suspend is called to implement this.

To support jumping without reserving memory, one shadow backup page
(source page) is allocated for each page used by the new (kexeced)
kernel (destination page). During kexec_load, the image of the new
kernel is loaded into the source pages. Before it is executed, the
destination pages and the source pages are swapped, so the contents of
the destination pages are backed up. The destination pages and the
source pages are swapped again before each jump to the new (kexeced)
kernel and after each jump back to the original kernel.
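
The swap described above can be sketched in user-space C roughly as
follows. This is only an illustration of the three-copy exchange
through a scratch page; the names sketch_swap_page(),
sketch_swap_pages() and SKETCH_PAGE_SIZE are hypothetical and do not
appear in the patch, which does this work in assembly (swap_pages)
while walking the kexec indirection list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* hypothetical stand-in for PAGE_SIZE */

/*
 * Exchange one destination page with its source (shadow) page through
 * a scratch page, using three page-sized copies.
 */
void sketch_swap_page(unsigned char *dest, unsigned char *src,
		      unsigned char *scratch)
{
	assert(dest != src && dest != scratch && src != scratch);
	memcpy(scratch, src, SKETCH_PAGE_SIZE);	/* scratch <- source page */
	memcpy(src, dest, SKETCH_PAGE_SIZE);	/* source <- destination (backup) */
	memcpy(dest, scratch, SKETCH_PAGE_SIZE);	/* destination <- old source */
}

/* Swap every (destination, source) pair; a second call undoes the first. */
void sketch_swap_pages(unsigned char *dest[], unsigned char *src[],
		       size_t npages, unsigned char *scratch)
{
	size_t i;

	for (i = 0; i < npages; i++)
		sketch_swap_page(dest[i], src[i], scratch);
}
```

Because the operation is a pure exchange, running it a second time
restores both sets of pages, which is why the same routine can serve
both the jump and the jump back.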

A jump back protocol for kexec is defined and documented. It is an
extension of the ordinary function calling protocol, so the facility
provided by this patch can be used to call an ordinary C function in
real mode (strictly, 32-bit protected mode with paging disabled, as
the protocol document specifies).
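
As a rough illustration of the parameter block the protocol places at
offset 2048 of the control page (magic number, parameter count, then
the parameters), a reader could look like the sketch below. The
function sketch_read_kcall_args() and its bounds checks are
hypothetical, and it assumes the buffer is a copy of the page in host
(little-endian x86) byte order; only KCALL_DATA_BASE and
KCALL_MAGIC_NUMBER are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KCALL_DATA_BASE		0x800		/* 2048, as in the patch */
#define KCALL_MAGIC_NUMBER	0xe1b6a57dU	/* as in the patch */

/*
 * Read the parameter block from a copy of the control page.
 * Returns the number of parameters copied into args[], or -1 if the
 * magic number is absent or the block does not fit.
 */
int sketch_read_kcall_args(const unsigned char *page, size_t page_len,
			   uint32_t *args, uint32_t max_args)
{
	uint32_t magic, argc;

	assert(page != NULL && args != NULL);
	if (page_len < KCALL_DATA_BASE + 8)
		return -1;
	memcpy(&magic, page + KCALL_DATA_BASE, 4);	/* offset 2048 */
	memcpy(&argc, page + KCALL_DATA_BASE + 4, 4);	/* offset 2052 */
	if (magic != KCALL_MAGIC_NUMBER || argc > max_args)
		return -1;
	if ((page_len - (KCALL_DATA_BASE + 8)) / 4 < argc)
		return -1;
	/* parameters start at offset 2056, four bytes each */
	memcpy(args, page + KCALL_DATA_BASE + 8, 4 * (size_t)argc);
	return (int)argc;
}
```

Checking the magic number first mirrors the protocol text, which
treats the count and parameters as valid only when the magic matches.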

A set of flags for sys_kexec_load is added to control which state is
saved/restored before/after the real mode code executes. For example,
you can specify that the device state and FPU state are saved/restored
before/after the real mode code executes.
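
A minimal sketch of how these flags map onto the per-image bits is
given below. The flag values are the ones this patch adds to
include/linux/kexec.h; struct sketch_image and sketch_apply_flags()
are hypothetical user-space stand-ins for struct kimage and the
decoding done in sys_kexec_load, and the validity check is an
assumption about how unknown bits are rejected rather than something
shown in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as added to include/linux/kexec.h by this patch */
#define KEXEC_ON_CRASH		0x00000001
#define KEXEC_PRESERVE_CPU	0x00000002
#define KEXEC_PRESERVE_CPU_EXT	0x00000004
#define KEXEC_SINGLE_CPU	0x00000008
#define KEXEC_PRESERVE_DEVICE	0x00000010
#define KEXEC_PRESERVE_CONSOLE	0x00000020
#define KEXEC_ARCH_MASK		0xffff0000

#define KEXEC_FLAGS	(KEXEC_ON_CRASH | KEXEC_PRESERVE_CPU |		\
			 KEXEC_PRESERVE_CPU_EXT | KEXEC_SINGLE_CPU |	\
			 KEXEC_PRESERVE_DEVICE | KEXEC_PRESERVE_CONSOLE)

/* Hypothetical stand-in for the bits this patch adds to struct kimage */
struct sketch_image {
	unsigned int preserve_cpu : 1;
	unsigned int preserve_cpu_ext : 1;
	unsigned int single_cpu : 1;
	unsigned int preserve_device : 1;
	unsigned int preserve_console : 1;
};

/* Returns 0 on success, -1 if a bit outside the known set is present. */
int sketch_apply_flags(unsigned long flags, struct sketch_image *image)
{
	assert(image != NULL);
	/* assumed check: every non-arch bit must be a known flag */
	if ((flags & KEXEC_FLAGS) != (flags & ~KEXEC_ARCH_MASK))
		return -1;
	image->preserve_cpu = !!(flags & KEXEC_PRESERVE_CPU);
	image->preserve_cpu_ext = !!(flags & KEXEC_PRESERVE_CPU_EXT);
	image->single_cpu = !!(flags & KEXEC_SINGLE_CPU);
	image->preserve_device = !!(flags & KEXEC_PRESERVE_DEVICE);
	image->preserve_console = !!(flags & KEXEC_PRESERVE_CONSOLE);
	return 0;
}
```

Which combination a hibernation tool would pass is not specified
here; any particular choice of flags is up to the caller.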

The state save/restore code (excluding CPU state) can be overridden
based on the "command" parameter of kexec jump, because
hibernating/resuming needs to save/restore more state.

Signed-off-by: Huang Ying <ying.huang@intel.com>

---
 Documentation/i386/jump_back_protocol.txt |  103 ++++++++++++++
 arch/powerpc/kernel/machine_kexec.c       |    2 
 arch/ppc/kernel/machine_kexec.c           |    2 
 arch/sh/kernel/machine_kexec.c            |    2 
 arch/x86/kernel/machine_kexec_32.c        |   88 +++++++++---
 arch/x86/kernel/machine_kexec_64.c        |    2 
 arch/x86/kernel/relocate_kernel_32.S      |  214 +++++++++++++++++++++++++++---
 include/asm-x86/kexec_32.h                |   39 ++++-
 include/linux/kexec.h                     |   40 +++++
 kernel/kexec.c                            |  188 ++++++++++++++++++++++++++
 kernel/power/Kconfig                      |    2 
 kernel/sys.c                              |   35 +++-
 12 files changed, 648 insertions(+), 69 deletions(-)

--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -20,6 +20,7 @@
 #include <asm/cpufeature.h>
 #include <asm/desc.h>
 #include <asm/system.h>
+#include <asm/cacheflush.h>
 
 #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
 static u32 kexec_pgd[1024] PAGE_ALIGNED;
@@ -83,10 +84,14 @@ static void load_segments(void)
  * reboot code buffer to allow us to avoid allocations
  * later.
  *
- * Currently nothing.
+ * Turn off NX bit for control page.
  */
 int machine_kexec_prepare(struct kimage *image)
 {
+	if (nx_enabled) {
+		change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
+		global_flush_tlb();
+	}
 	return 0;
 }
 
@@ -96,25 +101,59 @@ int machine_kexec_prepare(struct kimage 
  */
 void machine_kexec_cleanup(struct kimage *image)
 {
+	if (nx_enabled) {
+		change_page_attr(image->control_code_page, 1, PAGE_KERNEL);
+		global_flush_tlb();
+	}
+}
+
+void machine_kexec(struct kimage *image)
+{
+	machine_kexec_call(image, NULL, 0);
 }
 
 /*
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
+			 unsigned int argc, va_list args)
 {
 	unsigned long page_list[PAGES_NR];
 	void *control_page;
+	asmlinkage NORET_TYPE void
+		(*relocate_kernel_ptr)(unsigned long indirection_page,
+				       unsigned long control_page,
+				       unsigned long start_address,
+				       unsigned int has_pae) ATTRIB_NORET;
 
 	/* Interrupts aren't acceptable while we reboot */
 	local_irq_disable();
 
 	control_page = page_address(image->control_code_page);
-	memcpy(control_page, relocate_kernel, PAGE_SIZE);
+	memcpy(control_page, relocate_page, PAGE_SIZE/2);
+	KCALL_MAGIC(control_page) = 0;
 
+	if (image->preserve_cpu) {
+		unsigned int i;
+		KCALL_MAGIC(control_page) = KCALL_MAGIC_NUMBER;
+		KCALL_ARGC(control_page) = argc;
+		for (i = 0; i < argc; i++)
+			KCALL_ARGS(control_page)[i] = \
+				va_arg(args, unsigned long);
+
+		if (kexec_call_save_cpu(control_page)) {
+			image->start = KCALL_ENTRY(control_page);
+			if (ret)
+				*ret = KCALL_ARGS(control_page)[0];
+			return 0;
+		}
+	}
+
+	relocate_kernel_ptr = control_page +
+		((void *)relocate_kernel - (void *)relocate_page);
 	page_list[PA_CONTROL_PAGE] = __pa(control_page);
-	page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
+	page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
 	page_list[PA_PGD] = __pa(kexec_pgd);
 	page_list[VA_PGD] = (unsigned long)kexec_pgd;
 #ifdef CONFIG_X86_PAE
@@ -127,26 +166,33 @@ NORET_TYPE void machine_kexec(struct kim
 	page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
 	page_list[PA_PTE_1] = __pa(kexec_pte1);
 	page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
+	page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);
 
-	/* The segment registers are funny things, they have both a
-	 * visible and an invisible part.  Whenever the visible part is
-	 * set to a specific selector, the invisible part is loaded
-	 * with from a table in memory.  At no other time is the
-	 * descriptor table in memory accessed.
-	 *
-	 * I take advantage of this here by force loading the
-	 * segments, before I zap the gdt with an invalid value.
-	 */
-	load_segments();
-	/* The gdt & idt are now invalid.
-	 * If you want to load them you must set up your own idt & gdt.
-	 */
-	set_gdt(phys_to_virt(0),0);
-	set_idt(phys_to_virt(0),0);
+	if (image->preserve_cpu_ext) {
+		/* The segment registers are funny things, they have
+		 * both a visible and an invisible part.  Whenever the
+		 * visible part is set to a specific selector, the
+		 * invisible part is loaded with from a table in
+		 * memory.  At no other time is the descriptor table
+		 * in memory accessed.
+		 *
+		 * I take advantage of this here by force loading the
+		 * segments, before I zap the gdt with an invalid
+		 * value.
+		 */
+		load_segments();
+		/* The gdt & idt are now invalid.  If you want to load
+		 * them you must set up your own idt & gdt.
+		 */
+		set_gdt(phys_to_virt(0), 0);
+		set_idt(phys_to_virt(0), 0);
+	}
 
 	/* now call it */
-	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
-			image->start, cpu_has_pae);
+	relocate_kernel_ptr((unsigned long)image->head,
+			    (unsigned long)page_list,
+			    image->start, cpu_has_pae);
+	return 0;
 }
 
 void arch_crash_save_vmcoreinfo(void)
--- a/include/asm-x86/kexec_32.h
+++ b/include/asm-x86/kexec_32.h
@@ -9,16 +9,40 @@
 #define VA_PTE_0         5
 #define PA_PTE_1         6
 #define VA_PTE_1         7
+#define PA_SWAP_PAGE     8
 #ifdef CONFIG_X86_PAE
-#define PA_PMD_0         8
-#define VA_PMD_0         9
-#define PA_PMD_1         10
-#define VA_PMD_1         11
-#define PAGES_NR         12
+#define PA_PMD_0         9
+#define VA_PMD_0         10
+#define PA_PMD_1         11
+#define VA_PMD_1         12
+#define PAGES_NR         13
 #else
-#define PAGES_NR         8
+#define PAGES_NR         9
 #endif
 
+#define KCALL_DATA_BASE		0x800
+
+#define KCALL_MAGIC_NUMBER	0xe1b6a57d
+
+#define KCALL_DATA(buf)		((__u8 *)(buf)+KCALL_DATA_BASE)
+#define KCALL_OFF(off)		(KCALL_DATA_BASE+(off))
+
+#define KCALL_MAGIC_OFF		KCALL_OFF(0x0)
+#define KCALL_MAGIC(buf)	(*(__u32 *)(KCALL_DATA(buf)+0x0))
+#define KCALL_ARGC_OFF		KCALL_OFF(0x4)
+#define KCALL_ARGC(buf)		(*(__u32 *)(KCALL_DATA(buf)+0x4))
+#define KCALL_ARGS_OFF		KCALL_OFF(0x8)
+#define KCALL_ARGS(buf)		((__u32 *)(KCALL_DATA(buf)+0x8))
+
+/*
+ * The following are not a part of jump back protocol, for internal
+ * use only
+ */
+#define KCALL_ENTRY_OFF		KCALL_OFF(0x200)
+#define KCALL_ENTRY(buf)	(*(__u32 *)(KCALL_DATA(buf)+0x200))
+/* Other internal data fields base */
+#define KCALL_OTHER_OFF		KCALL_OFF(0x204)
+
 #ifndef __ASSEMBLY__
 
 #include <asm/ptrace.h>
@@ -94,6 +118,9 @@ relocate_kernel(unsigned long indirectio
 		unsigned long start_address,
 		unsigned int has_pae) ATTRIB_NORET;
 
+extern char relocate_page[PAGE_SIZE];
+
+extern asmlinkage int kexec_call_save_cpu(void *buf);
 #endif /* __ASSEMBLY__ */
 
 #endif /* _I386_KEXEC_H */
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -9,6 +9,7 @@
 #include <linux/ioport.h>
 #include <linux/elfcore.h>
 #include <linux/elf.h>
+#include <linux/notifier.h>
 #include <asm/kexec.h>
 
 /* Verify architecture specific macros are defined */
@@ -83,6 +84,7 @@ struct kimage {
 
 	unsigned long start;
 	struct page *control_code_page;
+	struct page *swap_page;
 
 	unsigned long nr_segments;
 	struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -98,18 +100,34 @@ struct kimage {
 	unsigned int type : 1;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
+	unsigned int preserve_cpu : 1;
+	unsigned int preserve_cpu_ext : 1;
+	unsigned int single_cpu : 1;
+	unsigned int preserve_device : 1;
+	unsigned int preserve_console : 1;
 };
 
 
 
 /* kexec interface functions */
-extern NORET_TYPE void machine_kexec(struct kimage *image) ATTRIB_NORET;
+extern void machine_kexec(struct kimage *image);
+extern int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
+			       unsigned int argc, va_list args);
+extern int machine_kexec_call(struct kimage *image, unsigned long *ret,
+			      unsigned int argc, ...);
 extern int machine_kexec_prepare(struct kimage *image);
 extern void machine_kexec_cleanup(struct kimage *image);
 extern asmlinkage long sys_kexec_load(unsigned long entry,
 					unsigned long nr_segments,
 					struct kexec_segment __user *segments,
 					unsigned long flags);
+extern int kexec_call(struct kimage *image, unsigned long *ret,
+		      unsigned int argc, ...);
+extern int kexec_vcall(struct kimage *image, unsigned long *ret,
+		       unsigned int argc, va_list args);
+extern int kexec_jump(struct kimage *image, unsigned long *cmd_ret,
+		      unsigned long cmd);
+#define KJUMP_CMD_NONE 0
 #ifdef CONFIG_COMPAT
 extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
 				unsigned long nr_segments,
@@ -151,13 +169,21 @@ unsigned long paddr_vmcoreinfo_note(void
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
+extern int kexec_lock;
+extern struct blocking_notifier_head kjump_chain_pre;
+extern struct blocking_notifier_head kjump_chain_post;
 
 #ifndef kexec_flush_icache_page
 #define kexec_flush_icache_page(page)
 #endif
 
-#define KEXEC_ON_CRASH  0x00000001
-#define KEXEC_ARCH_MASK 0xffff0000
+#define KEXEC_ON_CRASH		0x00000001
+#define KEXEC_PRESERVE_CPU	0x00000002
+#define KEXEC_PRESERVE_CPU_EXT	0x00000004
+#define KEXEC_SINGLE_CPU	0x00000008
+#define KEXEC_PRESERVE_DEVICE	0x00000010
+#define KEXEC_PRESERVE_CONSOLE	0x00000020
+#define KEXEC_ARCH_MASK		0xffff0000
 
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
@@ -174,7 +200,13 @@ extern struct kimage *kexec_crash_image;
 #define KEXEC_ARCH_MIPS_LE (10 << 16)
 #define KEXEC_ARCH_MIPS    ( 8 << 16)
 
-#define KEXEC_FLAGS    (KEXEC_ON_CRASH)  /* List of defined/legal kexec flags */
+/* List of defined/legal kexec flags */
+#define KEXEC_FLAGS    (KEXEC_ON_CRASH |		\
+			KEXEC_PRESERVE_CPU |		\
+			KEXEC_PRESERVE_CPU_EXT |	\
+			KEXEC_SINGLE_CPU |		\
+			KEXEC_PRESERVE_DEVICE |		\
+			KEXEC_PRESERVE_CONSOLE)
 
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -24,6 +24,11 @@
 #include <linux/utsrelease.h>
 #include <linux/utsname.h>
 #include <linux/numa.h>
+#include <linux/suspend.h>
+#include <linux/freezer.h>
+#include <linux/pm.h>
+#include <linux/cpu.h>
+#include <linux/console.h>
 
 #include <asm/page.h>
 #include <asm/uaccess.h>
@@ -49,6 +54,9 @@ struct resource crashk_res = {
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
 
+BLOCKING_NOTIFIER_HEAD(kjump_chain_pre);
+BLOCKING_NOTIFIER_HEAD(kjump_chain_post);
+
 int kexec_should_crash(struct task_struct *p)
 {
 	if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
@@ -243,6 +251,12 @@ static int kimage_normal_alloc(struct ki
 		goto out;
 	}
 
+	image->swap_page = kimage_alloc_control_pages(image, 0);
+	if (!image->swap_page) {
+		printk(KERN_ERR "Could not allocate swap buffer\n");
+		goto out;
+	}
+
 	result = 0;
  out:
 	if (result == 0)
@@ -920,7 +934,7 @@ struct kimage *kexec_crash_image;
  * Nothing can wait so this mutex is safe to use
  * in interrupt context :)
  */
-static int kexec_lock;
+int kexec_lock;
 
 asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
 				struct kexec_segment __user *segments,
@@ -989,6 +1003,16 @@ asmlinkage long sys_kexec_load(unsigned 
 		if (result)
 			goto out;
 
+		if (flags & KEXEC_PRESERVE_CPU)
+			image->preserve_cpu = 1;
+		if (flags & KEXEC_PRESERVE_CPU_EXT)
+			image->preserve_cpu_ext = 1;
+		if (flags & KEXEC_SINGLE_CPU)
+			image->single_cpu = 1;
+		if (flags & KEXEC_PRESERVE_DEVICE)
+			image->preserve_device = 1;
+		if (flags & KEXEC_PRESERVE_CONSOLE)
+			image->preserve_console = 1;
 		result = machine_kexec_prepare(image);
 		if (result)
 			goto out;
@@ -1413,3 +1437,165 @@ static int __init crash_save_vmcoreinfo_
 }
 
 module_init(crash_save_vmcoreinfo_init)
+
+int machine_kexec_call(struct kimage *image, unsigned long *ret,
+			unsigned int argc, ...)
+{
+	va_list args;
+	int error;
+	va_start(args, argc);
+	error = machine_kexec_vcall(image, ret, argc, args);
+	va_end(args);
+	return error;
+}
+
+int __attribute__ ((weak)) machine_kexec_vcall(struct kimage *image,
+						unsigned long *ret,
+						unsigned int argc,
+						va_list args)
+{
+	machine_kexec(image);
+	if (ret)
+		*ret = 0;
+	return 0;
+}
+
+int kexec_call(struct kimage *image, unsigned long *ret,
+	       unsigned int argc, ...)
+{
+	int retval;
+	va_list args;
+	va_start(args, argc);
+	retval = kexec_vcall(image, ret, argc, args);
+	va_end(args);
+	return retval;
+}
+
+static int kexec_vcall_pre(struct kimage *image)
+{
+	int error;
+
+	if (image->preserve_console)
+		pm_prepare_console();
+	if (image->preserve_device) {
+		error = freeze_processes();
+		if (error) {
+			error = -EBUSY;
+			goto Exit;
+		}
+	}
+	if (image->preserve_console)
+		suspend_console();
+	if (image->preserve_device) {
+		error = device_suspend(PMSG_FREEZE);
+		if (error)
+			goto Resume_console;
+	}
+	if (image->single_cpu) {
+		error = disable_nonboot_cpus();
+		if (error)
+			goto Resume_devices;
+	}
+	local_irq_disable();
+	if (image->preserve_device) {
+		/* At this point, device_suspend() has been called,
+		 * but *not* device_power_down(). We *must*
+		 * device_power_down() now.  Otherwise, drivers for
+		 * some devices (e.g. interrupt controllers) become
+		 * desynchronized with the actual state of the
+		 * hardware at resume time, and evil weirdness ensues.
+		 */
+		error = device_power_down(PMSG_FREEZE);
+		if (error)
+			goto Enable_irqs;
+	}
+	return 0;
+
+ Enable_irqs:
+	local_irq_enable();
+	if (image->single_cpu)
+		enable_nonboot_cpus();
+ Resume_devices:
+	if (image->preserve_device)
+		device_resume();
+ Resume_console:
+	if (image->preserve_console)
+		resume_console();
+	if (image->preserve_device)
+		thaw_processes();
+ Exit:
+	if (image->preserve_console)
+		pm_restore_console();
+	return error;
+}
+
+static int kexec_vcall_post(struct kimage *image)
+{
+	if (image->preserve_device) {
+		/* NOTE:  device_power_up() is just a resume() for devices
+		 * that suspended with irqs off ... no overall powerup.
+		 */
+		device_power_up();
+	}
+	local_irq_enable();
+	if (image->single_cpu)
+		enable_nonboot_cpus();
+	if (image->preserve_device)
+		device_resume();
+	if (image->preserve_console)
+		resume_console();
+	if (image->preserve_device)
+		thaw_processes();
+	if (image->preserve_console)
+		pm_restore_console();
+	return 0;
+}
+
+int kexec_vcall(struct kimage *image, unsigned long *ret,
+		unsigned int argc, va_list args)
+{
+	int error;
+
+	error = kexec_vcall_pre(image);
+	if (error)
+		return error;
+	if (image->preserve_cpu_ext)
+		save_processor_state();
+	error = machine_kexec_vcall(image, ret, argc, args);
+	if (error)
+		return error;
+	if (image->preserve_cpu_ext)
+		restore_processor_state();
+	error = kexec_vcall_post(image);
+	return error;
+}
+
+int kexec_jump(struct kimage *image, unsigned long *pcmd_ret,
+	       unsigned long cmd)
+{
+	int chret, error;
+	unsigned long cmd_ret;
+
+	chret = blocking_notifier_call_chain(&kjump_chain_pre, cmd, image);
+	if (chret == NOTIFY_DONE)
+		error = kexec_vcall_pre(image);
+	else {
+		error = notifier_to_errno(chret);
+		if (error)
+			return error;
+	}
+	if (image->preserve_cpu_ext)
+		save_processor_state();
+	error = machine_kexec_call(image, &cmd_ret, 4, cmd, image->head,
+				   __pa(vmcoreinfo_data), vmcoreinfo_size);
+	if (image->preserve_cpu_ext)
+		restore_processor_state();
+	if (pcmd_ret)
+		*pcmd_ret = cmd_ret;
+	chret = blocking_notifier_call_chain(&kjump_chain_post, cmd_ret, image);
+	if (chret == NOTIFY_DONE)
+		error = kexec_vcall_post(image);
+	else
+		error = notifier_to_errno(chret);
+	return error;
+}
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -301,18 +301,26 @@ EXPORT_SYMBOL_GPL(kernel_restart);
  *	Move into place and start executing a preloaded standalone
  *	executable.  If nothing was preloaded return an error.
  */
-static void kernel_kexec(void)
+static int kernel_kexec(unsigned long cmd)
 {
+	int ret = -ENOSYS;
 #ifdef CONFIG_KEXEC
-	struct kimage *image;
-	image = xchg(&kexec_image, NULL);
-	if (!image)
-		return;
-	kernel_restart_prepare(NULL);
-	printk(KERN_EMERG "Starting new kernel\n");
-	machine_shutdown();
-	machine_kexec(image);
+	if (xchg(&kexec_lock, 1))
+		return -EBUSY;
+	if (!kexec_image) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	if (!kexec_image->preserve_cpu) {
+		kernel_restart_prepare(NULL);
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_shutdown();
+	}
+	ret = kexec_jump(kexec_image, NULL, cmd);
+unlock:
+	xchg(&kexec_lock, 0);
 #endif
+	return ret;
 }
 
 static void kernel_shutdown_prepare(enum system_states state)
@@ -420,9 +428,12 @@ asmlinkage long sys_reboot(int magic1, i
 		break;
 
 	case LINUX_REBOOT_CMD_KEXEC:
-		kernel_kexec();
-		unlock_kernel();
-		return -EINVAL;
+		{
+			int ret;
+			ret = kernel_kexec((unsigned long)arg);
+			unlock_kernel();
+			return ret;
+		}
 
 #ifdef CONFIG_HIBERNATION
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -91,7 +91,7 @@ config PM_SLEEP_SMP
 
 config PM_SLEEP
 	bool
-	depends on SUSPEND || HIBERNATION
+	depends on SUSPEND || HIBERNATION || KEXEC
 	default y
 
 config SUSPEND_UP_POSSIBLE
--- a/arch/x86/kernel/relocate_kernel_32.S
+++ b/arch/x86/kernel/relocate_kernel_32.S
@@ -9,6 +9,7 @@
 #include <linux/linkage.h>
 #include <asm/page.h>
 #include <asm/kexec.h>
+#include <asm/asm-offsets.h>
 
 /*
  * Must be relocatable PIC code callable as a C function
@@ -19,8 +20,89 @@
 #define PAGE_ATTR 0x63 /* _PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_DIRTY */
 #define PAE_PGD_ATTR 0x01 /* _PAGE_PRESENT */
 
+#define STACK_TOP		PAGE_SIZE_asm
+
+#define DATA(offset)		(KCALL_OTHER_OFF+(offset))
+
+/* Minimal CPU state */
+#define EBX			DATA(0x0)
+#define ESI			DATA(0x4)
+#define EDI			DATA(0x8)
+#define EBP			DATA(0xc)
+#define ESP			DATA(0x10)
+#define CR0			DATA(0x14)
+#define CR3			DATA(0x18)
+#define CR4			DATA(0x1c)
+#define FLAG			DATA(0x20)
+#define RET			DATA(0x24)
+
+/* some information saved in control page (CP) for jumping back */
+#define CP_VA_CONTROL_PAGE	DATA(0x30)
+#define CP_PA_PGD		DATA(0x34)
+#define CP_PA_SWAP_PAGE		DATA(0x38)
+#define CP_PA_BACKUP_PAGES_MAP	DATA(0x3c)
+
 	.text
 	.align PAGE_ALIGNED
+	.globl relocate_page
+relocate_page:
+
+/*
+ * Entry point for jumping back from kexeced kernel, the paging is
+ * turned off.
+ */
+kexec_jump_back_entry:
+	call	1f
+1:
+	popl	%ebx
+	subl	$(1b - relocate_page), %ebx
+	movl	%edi, KCALL_ENTRY_OFF(%ebx)
+	movl	$0, %eax
+	andl	%esi, %esi
+	jz	2f
+	movl	4(%esp), %eax
+2:
+	movl	%eax, KCALL_ARGS_OFF(%ebx)
+	movl	CP_VA_CONTROL_PAGE(%ebx), %edi
+	lea	STACK_TOP(%ebx), %esp
+	movl	CP_PA_SWAP_PAGE(%ebx), %eax
+	movl	CP_PA_BACKUP_PAGES_MAP(%ebx), %edx
+	pushl	%eax
+	pushl	%edx
+	call	swap_pages
+	addl	$8, %esp
+	movl	CP_PA_PGD(%ebx), %eax
+	movl	%eax, %cr3
+	movl	%cr0, %eax
+	orl	$(1<<31), %eax
+	movl	%eax, %cr0
+	lea	STACK_TOP(%edi), %esp
+	movl	%edi, %eax
+	addl	$(virtual_mapped - relocate_page), %eax
+	pushl	%eax
+	ret
+
+virtual_mapped:
+	movl	%edi, %edx
+	movl	EBX(%edx), %ebx
+	movl	ESI(%edx), %esi
+	movl	EDI(%edx), %edi
+	movl	EBP(%edx), %ebp
+	movl	FLAG(%edx), %eax
+	pushl	%eax
+	popf
+	movl	ESP(%edx), %esp
+	movl	CR4(%edx), %eax
+	movl	%eax, %cr4
+	movl	CR3(%edx), %eax
+	movl	%eax, %cr3
+	movl	CR0(%edx), %eax
+	movl	%eax, %cr0
+	movl	RET(%edx), %eax
+	movl	%eax, (%esp)
+	mov	$1, %eax
+	ret
+
 	.globl relocate_kernel
 relocate_kernel:
 	movl	8(%esp), %ebp /* list of pages */
@@ -146,6 +228,15 @@ relocate_new_kernel:
 	pushl $0
 	popfl
 
+	/* save some information for jumping back */
+	movl	PTR(VA_CONTROL_PAGE)(%ebp), %edi
+	movl	%edi, CP_VA_CONTROL_PAGE(%edi)
+	movl	PTR(PA_PGD)(%ebp), %eax
+	movl	%eax, CP_PA_PGD(%edi)
+	movl	PTR(PA_SWAP_PAGE)(%ebp), %eax
+	movl	%eax, CP_PA_SWAP_PAGE(%edi)
+	movl	%ebx, CP_PA_BACKUP_PAGES_MAP(%edi)
+
 	/* get physical address of control page now */
 	/* this is impossible after page table switch */
 	movl	PTR(PA_CONTROL_PAGE)(%ebp), %edi
@@ -155,11 +246,11 @@ relocate_new_kernel:
 	movl	%eax, %cr3
 
 	/* setup a new stack at the end of the physical control page */
-	lea	4096(%edi), %esp
+	lea	STACK_TOP(%edi), %esp
 
 	/* jump to identity mapped page */
 	movl    %edi, %eax
-	addl    $(identity_mapped - relocate_kernel), %eax
+	addl    $(identity_mapped - relocate_page), %eax
 	pushl   %eax
 	ret
 
@@ -197,8 +288,68 @@ identity_mapped:
 	xorl	%eax, %eax
 	movl	%eax, %cr3
 
+	movl	CP_PA_SWAP_PAGE(%edi), %eax
+	pushl	%eax
+	pushl	%ebx
+	call	swap_pages
+	addl	$8, %esp
+
+	/* To be certain of avoiding problems with self-modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+
+	movl	KCALL_MAGIC_OFF(%edi), %eax
+	cmpl	$KCALL_MAGIC_NUMBER, %eax
+	jz 1f
+	xorl	%edi, %edi
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %ebp, %ebp
+	ret
+1:
+	popl	%edx
+	movl	CP_PA_SWAP_PAGE(%edi), %esp
+	addl	$PAGE_SIZE_asm, %esp
+	pushl	%edx
+	movl	%edi, %ebp
+	movl	KCALL_ARGC_OFF(%edi), %ecx
+	shll	$2, %ecx
+	movl	%edi, %esi
+	addl	$KCALL_ARGS_OFF, %esi
+	subl	%ecx, %esp
+	movl	%esp, %edi
+	rep ; movsb
+	movl	%ebp, %edi
+	movl	KCALL_ARGC_OFF(%edi), %esi
+2:
+	call	*%edx
+	shll	$2, %esi
+	addl	%esi, %esp
+	movl	%edi, %edx
+	popl	%edi
+	pushl	%edx
+	pushl	%eax
+	movl	$1, %esi
+	jmp	2b
+
 	/* Do the copies */
-	movl	%ebx, %ecx
+swap_pages:
+	movl	8(%esp), %edx
+	movl	4(%esp), %ecx
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%edi
+	pushl	%esi
+	movl	%ecx, %ebx
 	jmp	1f
 
 0:	/* top, read another word from the indirection page */
@@ -226,27 +377,50 @@ identity_mapped:
 	movl    %ecx,   %esi /* For every source page do a copy */
 	andl    $0xfffff000, %esi
 
+	movl	%edi, %eax
+	movl	%esi, %ebp
+
+	movl	%edx, %edi
 	movl    $1024, %ecx
 	rep ; movsl
-	jmp     0b
 
-3:
+	movl	%ebp, %edi
+	movl	%eax, %esi
+	movl	$1024, %ecx
+	rep ; movsl
 
-	/* To be certain of avoiding problems with self-modifying code
-	 * I need to execute a serializing instruction here.
-	 * So I flush the TLB, it's handy, and not processor dependent.
-	 */
-	xorl	%eax, %eax
-	movl	%eax, %cr3
+	movl	%eax, %edi
+	movl	%edx, %esi
+	movl	$1024, %ecx
+	rep ; movsl
 
-	/* set all of the registers to known values */
-	/* leave %esp alone */
+	lea	PAGE_SIZE_asm(%ebp), %esi
+	jmp     0b
+3:
+	popl	%esi
+	popl	%edi
+	popl	%ebx
+	popl	%ebp
+	ret
 
-	xorl	%eax, %eax
-	xorl	%ebx, %ebx
-	xorl    %ecx, %ecx
-	xorl    %edx, %edx
-	xorl    %esi, %esi
-	xorl    %edi, %edi
-	xorl    %ebp, %ebp
+	.globl kexec_call_save_cpu
+kexec_call_save_cpu:
+	movl	4(%esp), %edx
+	movl	%ebx, EBX(%edx)
+	movl	%esi, ESI(%edx)
+	movl	%edi, EDI(%edx)
+	movl	%ebp, EBP(%edx)
+	movl	%esp, ESP(%edx)
+	movl	%cr0, %eax
+	movl	%eax, CR0(%edx)
+	movl	%cr3, %eax
+	movl	%eax, CR3(%edx)
+	movl	%cr4, %eax
+	movl	%eax, CR4(%edx)
+	pushf
+	popl	%eax
+	movl	%eax, FLAG(%edx)
+	movl	(%esp), %eax
+	movl	%eax, RET(%edx)
+	mov	$0, %eax
 	ret
--- /dev/null
+++ b/Documentation/i386/jump_back_protocol.txt
@@ -0,0 +1,103 @@
+		THE LINUX/I386 JUMP BACK PROTOCOL
+		---------------------------------
+
+		Huang Ying <ying.huang@intel.com>
+		    Last update 2007-11-17
+
+Currently, the following versions of the jump back protocol exist.
+
+Protocol 1.00:	Jumping between original kernel and kexeced kernel
+		support. Calling ordinary C function support.
+
+
+*** JUMP BACK ENTRY
+
+At the jump back entry of the callee, the CPU must be in 32-bit protected
+mode with paging disabled; CS, DS, ES and SS must be 4G flat segments;
+CS must have execute/read permission, and DS, ES and SS must have
+read/write permission; interrupts must be disabled; the contents of
+registers and corresponding memory must be as follows:
+
+Offset/Size	Meaning
+
+%edi		Real jump back entry of caller if supported,
+		otherwise 0.
+%esi		Number of parameters, that is, N.
+%esp		Stack top pointer, the size of stack is about 4k.
+(%esp)/4	Helper jump back entry of caller if %edi != 0,
+		otherwise undefined.
+4*n(%esp)/4	nth parameter
+2048(%edi)/4	Optional, if %edi != 0, magic number: 0xe1b6a57d
+2052(%edi)/4	Optional, if %edi != 0 and 2048(%edi) == 0xe1b6a57d,
+		number of parameters, that is, N
+(2056+4*n)(%edi)/4 Optional, if %edi != 0 and 2048(%edi) == 0xe1b6a57d,
+		   nth parameter
+
+If jumping back to caller is supported, %edi is the real jump back
+entry of caller, that is, the callee can jump back to %edi with the
+same protocol.
+
+If jumping back to the caller is supported, (%esp) is the helper jump
+back entry of the caller. At the helper jump back entry, CPU state other
+than the contents of registers must be the same as in the ordinary jump
+back protocol; registers and corresponding memory must be as follows:
+
+Offset/Size	Meaning
+
+%edi,%esi,%ebp,%ebx Original value
+%esp		Original value - 4, that is, the return address is popped.
+%eax		Return value.
+
+This is the same as the function return ABI, and the jump back entry
+protocol conforms to the function calling ABI too. So, if the helper jump
+back entry is used, the jump back entry can be implemented as an ordinary
+C function without register parameters. The function prototype is as
+follows:
+
+unsigned long jump_back_entry(...)
+
+or
+
+unsigned long jump_back_entry(unsigned long arg1,
+			      unsigned long arg2,
+			      unsigned long arg3,
+			      ...
+			      unsigned long argN);
+
+The code at the helper jump back entry of the caller will jump to the real
+jump back entry of the caller, with the contents of registers and
+corresponding memory as follows:
+
+Offset/Size	Meaning
+
+%edi		Real jump back entry of callee (start address of callee)
+%esi		1, number of parameters
+%esp		Stack top pointer, the size of stack is about 4k.
+(%esp)/4	Helper jump back entry of callee
+4(%esp)/4	%eax at helper jump back entry, first parameter
+
+That is, the return value of jump back entry of callee is used as the
+only parameter to call the jump back entry of caller.
+
+If jumping back to caller is supported, and 2048(%edi) == 0xe1b6a57d,
+the parameters information is stored at memory from 2048(%edi) on as
+well. This is used to check the parameters information of jump back
+image.
+
+
+*** LOAD THE JUMP BACK IMAGE
+
+The jump back image is an ordinary ELF64 executable file; it can be loaded
+just like any other ELF64 image. That is, the PT_LOAD segments should be
+loaded at their physical addresses. The entry point of the jump back image
+is called the jump back entry of the image.
+
+Before loading all segments of the jump back image, the jump back header
+can be checked. The contents of the jump back header are the optional part
+from 2048(%edi) on in the jump back entry protocol, detailed as follows:
+
+Offset/Size	Meaning
+
+2048/4		Magic number: 0xe1b6a57d
+2052/4		Number of parameters, that is, N
+2056+4*n/4	nth parameter
--- a/arch/ppc/kernel/machine_kexec.c
+++ b/arch/ppc/kernel/machine_kexec.c
@@ -66,7 +66,7 @@ void machine_kexec_cleanup(struct kimage
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
 	if (ppc_md.machine_kexec)
 		ppc_md.machine_kexec(image);
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -179,7 +179,7 @@ void machine_kexec_cleanup(struct kimage
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
 	unsigned long page_list[PAGES_NR];
 	void *control_page;
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -70,7 +70,7 @@ static void kexec_info(struct kimage *im
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
 
 	unsigned long page_list;
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -47,7 +47,7 @@ void machine_kexec_cleanup(struct kimage
  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
 	if (ppc_md.machine_kexec)
 		ppc_md.machine_kexec(image);


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-07 15:53 [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump Huang, Ying
@ 2007-12-08 23:53 ` Pavel Machek
  2007-12-09  0:19   ` Rafael J. Wysocki
  2007-12-10 19:55 ` Vivek Goyal
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Pavel Machek @ 2007-12-08 23:53 UTC
  To: Huang, Ying
  Cc: Eric W. Biederman, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

Hi!

> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
> 
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state, and the state of devices and CPU is
> saved. After jumping back from kexeced kernel and jumping to the new
> kernel, the state of devices and CPU are restored accordingly. The
> devices/CPU state save/restore code of software suspend is called to
> implement corresponding function.
> 
> To support jumping without reserving memory. One shadow backup page
> (source page) is allocated for each page used by new (kexeced) kernel
> (destination page). When do kexec_load, the image of new kernel is
> loaded into source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of destination pages
> are backupped. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
> 
> A jump back protocol for kexec is defined and documented. It is an
> extension to ordinary function calling protocol. So, the facility
> provided by this patch can be used to call ordinary C function in real
> mode.
> 
> A set of flags for sys_kexec_load are added to control which state are
> saved/restored before/after real mode code executing. For example, you
> can specify the device state and FPU state are saved/restored
> before/after real mode code executing.
> 
> The states (exclude CPU state) save/restore code can be overridden
> based on the "command" parameter of kexec jump. Because more states
> need to be saved/restored by hibernating/resuming.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>

I'm not a kexec hacker... but maybe this is in good enough state to be
merged? It is useful on its own: kexec jump and back means we can dump
the system and then continue running, for example...

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-08 23:53 ` Pavel Machek
@ 2007-12-09  0:19   ` Rafael J. Wysocki
  2007-12-09  1:06     ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2007-12-09  0:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Huang, Ying, Eric W. Biederman, nigel, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

On Sunday, 9 of December 2007, Pavel Machek wrote:
> Hi!
> 
> > This patch implements the functionality of jumping between the kexeced
> > kernel and the original kernel.
> > 
> > To support jumping between two kernels, before jumping to (executing)
> > the new kernel and jumping back to the original kernel, the devices
> > are put into quiescent state, and the state of devices and CPU is
> > saved. After jumping back from kexeced kernel and jumping to the new
> > kernel, the state of devices and CPU are restored accordingly. The
> > devices/CPU state save/restore code of software suspend is called to
> > implement corresponding function.
> > 
> > To support jumping without reserving memory. One shadow backup page
> > (source page) is allocated for each page used by new (kexeced) kernel
> > (destination page). When do kexec_load, the image of new kernel is
> > loaded into source pages, and before executing, the destination pages
> > and the source pages are swapped, so the contents of destination pages
> > are backupped. Before jumping to the new (kexeced) kernel and after
> > jumping back to the original kernel, the destination pages and the
> > source pages are swapped too.
> > 
> > A jump back protocol for kexec is defined and documented. It is an
> > extension to ordinary function calling protocol. So, the facility
> > provided by this patch can be used to call ordinary C function in real
> > mode.
> > 
> > A set of flags for sys_kexec_load are added to control which state are
> > saved/restored before/after real mode code executing. For example, you
> > can specify the device state and FPU state are saved/restored
> > before/after real mode code executing.
> > 
> > The states (exclude CPU state) save/restore code can be overridden
> > based on the "command" parameter of kexec jump. Because more states
> > need to be saved/restored by hibernating/resuming.
> > 
> > Signed-off-by: Huang Ying <ying.huang@intel.com>
> 
> I'm not a kexec hacker... but maybe this is in good enough state to be
> merged? It is useful on its own: kexec jump and back means we can dump
> the system and then continue running, for example...

As far as I'm concerned, patches [1/4] and [2/4] can go.

The other two are not in that shape yet (especially the [3/4] patch).

Greetings,
Rafael


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-09  0:19   ` Rafael J. Wysocki
@ 2007-12-09  1:06     ` Eric W. Biederman
  0 siblings, 0 replies; 13+ messages in thread
From: Eric W. Biederman @ 2007-12-09  1:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Huang, Ying, nigel, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

>> I'm not a kexec hacker... but maybe this is in good enough state to be
>> merged? It is useful on its own: kexec jump and back means we can dump
>> the system and then continue running, for example...
>
> As far as I'm concerned, patches [1/4] and [2/4] can go.
>
> The other two are not in that shape yet (especially the [3/4] patch).

Ok.  Then I will see if I can review these in the next couple days
and give some feedback.

At a quick skim through the code it appears there is more infrastructure
than we need and things can still be simplified.

Since this applies in particular to the user space interface I'm not comfortable
with these patches going in just yet.

The unused KEXEC_PRESERVE_ flags especially give me pause.  Having something
like that, that isn't currently wired up sounds like a bad place to start.

Eric


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-07 15:53 [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump Huang, Ying
  2007-12-08 23:53 ` Pavel Machek
@ 2007-12-10 19:55 ` Vivek Goyal
  2007-12-11  8:51   ` Huang, Ying
  2007-12-10 22:31 ` Vivek Goyal
  2007-12-11  2:25 ` Eric W. Biederman
  3 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2007-12-10 19:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Eric W. Biederman, Pavel Machek, nigel, Rafael J. Wysocki,
	Andrew Morton, Jeremy Maitin-Shepard, linux-pm,
	Kexec Mailing List, linux-kernel

On Fri, Dec 07, 2007 at 03:53:30PM +0000, Huang, Ying wrote:
> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
> 

Hi,

I am just going through your patches and trying to understand them. I don't
understand many things. Asking is easy so here you go...

> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state, and the state of devices and CPU is
> saved. After jumping back from kexeced kernel and jumping to the new
> kernel, the state of devices and CPU are restored accordingly. The
> devices/CPU state save/restore code of software suspend is called to
> implement corresponding function.
> 

Do I need jumping back to restore an already hibernated kernel image? Can
you please tell a little more about jumping back and why it is needed?

> To support jumping without reserving memory. One shadow backup page
> (source page) is allocated for each page used by new (kexeced) kernel
> (destination page). When do kexec_load, the image of new kernel is
> loaded into source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of destination pages
> are backupped. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
> 

Ok, so due to swapping of source and destination pages first kernel's data
is still preserved.  How do I get the dynamic memory required for second
kernel boot (without writing first kernel's data)?
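
For illustration, the swap described in the quoted paragraph can be sketched in user-space C. This is only a sketch under assumptions, not the code in relocate_kernel_32.S: the names sketch_swap_pages and SKETCH_PAGE_SIZE are hypothetical, and the use of a single scratch page is inferred from the PA_SWAP_PAGE entry in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for the sketch */

/*
 * Exchange the contents of a destination page and its shadow (source)
 * page through one scratch page. All names here are hypothetical.
 */
static void sketch_swap_pages(unsigned char *dest, unsigned char *src,
			      unsigned char *scratch)
{
	/* The three pages must be distinct for the copies to be safe. */
	assert(dest != src && dest != scratch && src != scratch);

	memcpy(scratch, dest, SKETCH_PAGE_SIZE);	/* save the first kernel's data */
	memcpy(dest, src, SKETCH_PAGE_SIZE);		/* new kernel image goes live */
	memcpy(src, scratch, SKETCH_PAGE_SIZE);		/* backup lands in the shadow page */
}
```

Because the operation is a plain exchange, running it a second time puts the first kernel's data back in the destination page, which is why the same swap can serve both the jump and the jump back.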

> A jump back protocol for kexec is defined and documented. It is an
> extension to ordinary function calling protocol. So, the facility
> provided by this patch can be used to call ordinary C function in real
> mode.
> 
> A set of flags for sys_kexec_load are added to control which state are
> saved/restored before/after real mode code executing. For example, you
> can specify the device state and FPU state are saved/restored
> before/after real mode code executing.
> 
> The states (exclude CPU state) save/restore code can be overridden
> based on the "command" parameter of kexec jump. Because more states
> need to be saved/restored by hibernating/resuming.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> 
> ---
>  Documentation/i386/jump_back_protocol.txt |  103 ++++++++++++++
>  arch/powerpc/kernel/machine_kexec.c       |    2 
>  arch/ppc/kernel/machine_kexec.c           |    2 
>  arch/sh/kernel/machine_kexec.c            |    2 
>  arch/x86/kernel/machine_kexec_32.c        |   88 +++++++++---
>  arch/x86/kernel/machine_kexec_64.c        |    2 
>  arch/x86/kernel/relocate_kernel_32.S      |  214 +++++++++++++++++++++++++++---
>  include/asm-x86/kexec_32.h                |   39 ++++-
>  include/linux/kexec.h                     |   40 +++++
>  kernel/kexec.c                            |  188 ++++++++++++++++++++++++++
>  kernel/power/Kconfig                      |    2 
>  kernel/sys.c                              |   35 +++-
>  12 files changed, 648 insertions(+), 69 deletions(-)
> 
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -20,6 +20,7 @@
>  #include <asm/cpufeature.h>
>  #include <asm/desc.h>
>  #include <asm/system.h>
> +#include <asm/cacheflush.h>
>  
>  #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
>  static u32 kexec_pgd[1024] PAGE_ALIGNED;
> @@ -83,10 +84,14 @@ static void load_segments(void)
>   * reboot code buffer to allow us to avoid allocations
>   * later.
>   *
> - * Currently nothing.
> + * Turn off NX bit for control page.
>   */
>  int machine_kexec_prepare(struct kimage *image)
>  {
> +	if (nx_enabled) {
> +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
> +		global_flush_tlb();
> +	}
>  	return 0;
>  }
>  
> @@ -96,25 +101,59 @@ int machine_kexec_prepare(struct kimage 
>   */
>  void machine_kexec_cleanup(struct kimage *image)
>  {
> +	if (nx_enabled) {
> +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL);
> +		global_flush_tlb();
> +	}
> +}
> +
> +void machine_kexec(struct kimage *image)
> +{
> +	machine_kexec_call(image, NULL, 0);
>  }
>  
>  /*
>   * Do not allocate memory (or fail in any way) in machine_kexec().
>   * We are past the point of no return, committed to rebooting now.
>   */
> -NORET_TYPE void machine_kexec(struct kimage *image)
> +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> +			 unsigned int argc, va_list args)
>  {
>  	unsigned long page_list[PAGES_NR];
>  	void *control_page;
> +	asmlinkage NORET_TYPE void
> +		(*relocate_kernel_ptr)(unsigned long indirection_page,
> +				       unsigned long control_page,
> +				       unsigned long start_address,
> +				       unsigned int has_pae) ATTRIB_NORET;
>  
>  	/* Interrupts aren't acceptable while we reboot */
>  	local_irq_disable();
>  
>  	control_page = page_address(image->control_code_page);
> -	memcpy(control_page, relocate_kernel, PAGE_SIZE);
> +	memcpy(control_page, relocate_page, PAGE_SIZE/2);
> +	KCALL_MAGIC(control_page) = 0;
>  

Is 2K sufficient for all the code in relocate_kernel_32.S? What's the
current size?

> +	if (image->preserve_cpu) {
> +		unsigned int i;
> +		KCALL_MAGIC(control_page) = KCALL_MAGIC_NUMBER;
> +		KCALL_ARGC(control_page) = argc;
> +		for (i = 0; i < argc; i++)
> +			KCALL_ARGS(control_page)[i] = \
> +				va_arg(args, unsigned long);
> +
> +		if (kexec_call_save_cpu(control_page)) {
> +			image->start = KCALL_ENTRY(control_page);

Who fills the entry point at offset 0x200?


[..]
>  extern int machine_kexec_prepare(struct kimage *image);
>  extern void machine_kexec_cleanup(struct kimage *image);
>  extern asmlinkage long sys_kexec_load(unsigned long entry,
>  					unsigned long nr_segments,
>  					struct kexec_segment __user *segments,
>  					unsigned long flags);
> +extern int kexec_call(struct kimage *image, unsigned long *ret,
> +		      unsigned int argc, ...);

Who is using kexec_call()? I can't seem to locate its caller.


Thanks
Vivek


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-07 15:53 [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump Huang, Ying
  2007-12-08 23:53 ` Pavel Machek
  2007-12-10 19:55 ` Vivek Goyal
@ 2007-12-10 22:31 ` Vivek Goyal
  2007-12-11  8:55   ` Huang, Ying
  2007-12-11  2:25 ` Eric W. Biederman
  3 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2007-12-10 22:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Eric W. Biederman, Pavel Machek, nigel, Rafael J. Wysocki,
	Andrew Morton, Jeremy Maitin-Shepard, linux-pm,
	Kexec Mailing List, linux-kernel

On Fri, Dec 07, 2007 at 03:53:30PM +0000, Huang, Ying wrote:
> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
> 
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state, and the state of devices and CPU is
> saved. After jumping back from kexeced kernel and jumping to the new
> kernel, the state of devices and CPU are restored accordingly. The
> devices/CPU state save/restore code of software suspend is called to
> implement corresponding function.
> 
> To support jumping without reserving memory. One shadow backup page
> (source page) is allocated for each page used by new (kexeced) kernel
> (destination page). When do kexec_load, the image of new kernel is
> loaded into source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of destination pages
> are backupped. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
> 
> A jump back protocol for kexec is defined and documented. It is an
> extension to ordinary function calling protocol. So, the facility
> provided by this patch can be used to call ordinary C function in real
> mode.
> 
> A set of flags for sys_kexec_load are added to control which state are
> saved/restored before/after real mode code executing. For example, you
> can specify the device state and FPU state are saved/restored
> before/after real mode code executing.
> 
> The states (exclude CPU state) save/restore code can be overridden
> based on the "command" parameter of kexec jump. Because more states
> need to be saved/restored by hibernating/resuming.
> 


[..]
>  
> -#define KEXEC_ON_CRASH  0x00000001
> -#define KEXEC_ARCH_MASK 0xffff0000
> +#define KEXEC_ON_CRASH		0x00000001
> +#define KEXEC_PRESERVE_CPU	0x00000002
> +#define KEXEC_PRESERVE_CPU_EXT	0x00000004
> +#define KEXEC_SINGLE_CPU	0x00000008
> +#define KEXEC_PRESERVE_DEVICE	0x00000010
> +#define KEXEC_PRESERVE_CONSOLE	0x00000020

Hi,

Why do we need so many different flags for preserving different types
of state (CPU, CPU_EXT, device, console)? To keep things simple,
can't we create just one flag, KEXEC_PRESERVE_CONTEXT, which will
indicate any special action required for preserving the previous kernel's
context so that one can switch back to the old kernel?
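
To make the question concrete, here is a small sketch of what a caller has to test with the flag set proposed in the quoted hunk. The flag values are copied from the hunk; SKETCH_PRESERVE_MASK and sketch_preserves_state() are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Flag values as proposed in the quoted hunk of include/linux/kexec.h. */
#define KEXEC_ON_CRASH		0x00000001
#define KEXEC_PRESERVE_CPU	0x00000002
#define KEXEC_PRESERVE_CPU_EXT	0x00000004
#define KEXEC_SINGLE_CPU	0x00000008
#define KEXEC_PRESERVE_DEVICE	0x00000010
#define KEXEC_PRESERVE_CONSOLE	0x00000020

/* Hypothetical mask: every bit that asks for some state to be preserved. */
#define SKETCH_PRESERVE_MASK	(KEXEC_PRESERVE_CPU | KEXEC_PRESERVE_CPU_EXT | \
				 KEXEC_PRESERVE_DEVICE | KEXEC_PRESERVE_CONSOLE)

/* Hypothetical helper: true if any state preservation was requested. */
static bool sketch_preserves_state(unsigned long flags)
{
	return (flags & SKETCH_PRESERVE_MASK) != 0;
}
```

With a single KEXEC_PRESERVE_CONTEXT flag, the four-bit mask would collapse to one bit test and user space could no longer request partial combinations.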

Thanks
Vivek


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-07 15:53 [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump Huang, Ying
                   ` (2 preceding siblings ...)
  2007-12-10 22:31 ` Vivek Goyal
@ 2007-12-11  2:25 ` Eric W. Biederman
  2007-12-11 15:50   ` Huang, Ying
  3 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2007-12-11  2:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Pavel Machek, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

"Huang, Ying" <ying.huang@intel.com> writes:

> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
>
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state, and the state of devices and CPU is
> saved. After jumping back from kexeced kernel and jumping to the new
> kernel, the state of devices and CPU are restored accordingly. The
> devices/CPU state save/restore code of software suspend is called to
> implement corresponding function.
>
> To support jumping without reserving memory. One shadow backup page
> (source page) is allocated for each page used by new (kexeced) kernel
> (destination page). When do kexec_load, the image of new kernel is
> loaded into source pages, and before executing, the destination pages
> and the source pages are swapped, so the contents of destination pages
> are backupped. Before jumping to the new (kexeced) kernel and after
> jumping back to the original kernel, the destination pages and the
> source pages are swapped too.
>
> A jump back protocol for kexec is defined and documented. It is an
> extension to ordinary function calling protocol. So, the facility
> provided by this patch can be used to call ordinary C function in real
> mode.
>
> A set of flags for sys_kexec_load are added to control which state are
> saved/restored before/after real mode code executing. For example, you
> can specify the device state and FPU state are saved/restored
> before/after real mode code executing.
>
> The states (exclude CPU state) save/restore code can be overridden
> based on the "command" parameter of kexec jump. Because more states
> need to be saved/restored by hibernating/resuming.
>

> Signed-off-by: Huang Ying <ying.huang@intel.com>
>
> ---
>  Documentation/i386/jump_back_protocol.txt |  103 ++++++++++++++
>  arch/powerpc/kernel/machine_kexec.c       |    2 
>  arch/ppc/kernel/machine_kexec.c           |    2 
>  arch/sh/kernel/machine_kexec.c            |    2 
>  arch/x86/kernel/machine_kexec_32.c        |   88 +++++++++---
>  arch/x86/kernel/machine_kexec_64.c        |    2 
>  arch/x86/kernel/relocate_kernel_32.S      |  214 +++++++++++++++++++++++++++---
>  include/asm-x86/kexec_32.h                |   39 ++++-
>  include/linux/kexec.h                     |   40 +++++
>  kernel/kexec.c                            |  188 ++++++++++++++++++++++++++
>  kernel/power/Kconfig                      |    2 
>  kernel/sys.c                              |   35 +++-
>  12 files changed, 648 insertions(+), 69 deletions(-)
>
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -20,6 +20,7 @@
>  #include <asm/cpufeature.h>
>  #include <asm/desc.h>
>  #include <asm/system.h>
> +#include <asm/cacheflush.h>
>  
>  #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
>  static u32 kexec_pgd[1024] PAGE_ALIGNED;
> @@ -83,10 +84,14 @@ static void load_segments(void)
>   * reboot code buffer to allow us to avoid allocations
>   * later.
>   *
> - * Currently nothing.
> + * Turn off NX bit for control page.
>   */
>  int machine_kexec_prepare(struct kimage *image)
>  {
> +	if (nx_enabled) {
> +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
> +		global_flush_tlb();
> +	}
>  	return 0;
>  }
>  
> @@ -96,25 +101,59 @@ int machine_kexec_prepare(struct kimage 
>   */
>  void machine_kexec_cleanup(struct kimage *image)
>  {
> +	if (nx_enabled) {
> +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL);
> +		global_flush_tlb();
> +	}
> +}
> +
> +void machine_kexec(struct kimage *image)
> +{
> +	machine_kexec_call(image, NULL, 0);
>  }
>  
>  /*
>   * Do not allocate memory (or fail in any way) in machine_kexec().
>   * We are past the point of no return, committed to rebooting now.
>   */
> -NORET_TYPE void machine_kexec(struct kimage *image)
> +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> +			 unsigned int argc, va_list args)
>  {

Why do we need var arg support?
Can't we do that with a shim we load from user space?

>  	unsigned long page_list[PAGES_NR];
>  	void *control_page;
> +	asmlinkage NORET_TYPE void
> +		(*relocate_kernel_ptr)(unsigned long indirection_page,
> +				       unsigned long control_page,
> +				       unsigned long start_address,
> +				       unsigned int has_pae) ATTRIB_NORET;
>  
>  	/* Interrupts aren't acceptable while we reboot */
>  	local_irq_disable();
>  
>  	control_page = page_address(image->control_code_page);
> -	memcpy(control_page, relocate_kernel, PAGE_SIZE);
> +	memcpy(control_page, relocate_page, PAGE_SIZE/2);
> +	KCALL_MAGIC(control_page) = 0;
>  
> +	if (image->preserve_cpu) {
> +		unsigned int i;
> +		KCALL_MAGIC(control_page) = KCALL_MAGIC_NUMBER;
> +		KCALL_ARGC(control_page) = argc;
> +		for (i = 0; i < argc; i++)
> +			KCALL_ARGS(control_page)[i] = \
> +				va_arg(args, unsigned long);
> +
> +		if (kexec_call_save_cpu(control_page)) {
> +			image->start = KCALL_ENTRY(control_page);
> +			if (ret)
> +				*ret = KCALL_ARGS(control_page)[0];
> +			return 0;
> +		}
> +	}
> +
> +	relocate_kernel_ptr = control_page +
> +		((void *)relocate_kernel - (void *)relocate_page);
>  	page_list[PA_CONTROL_PAGE] = __pa(control_page);
> -	page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
> +	page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
>  	page_list[PA_PGD] = __pa(kexec_pgd);
>  	page_list[VA_PGD] = (unsigned long)kexec_pgd;
>  #ifdef CONFIG_X86_PAE
> @@ -127,26 +166,33 @@ NORET_TYPE void machine_kexec(struct kim
>  	page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
>  	page_list[PA_PTE_1] = __pa(kexec_pte1);
>  	page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
> +	page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);
>  
> -	/* The segment registers are funny things, they have both a
> -	 * visible and an invisible part.  Whenever the visible part is
> -	 * set to a specific selector, the invisible part is loaded
> -	 * with from a table in memory.  At no other time is the
> -	 * descriptor table in memory accessed.
> -	 *
> -	 * I take advantage of this here by force loading the
> -	 * segments, before I zap the gdt with an invalid value.
> -	 */
> -	load_segments();
> -	/* The gdt & idt are now invalid.
> -	 * If you want to load them you must set up your own idt & gdt.
> -	 */
> -	set_gdt(phys_to_virt(0),0);
> -	set_idt(phys_to_virt(0),0);
> +	if (image->preserve_cpu_ext) {
> +		/* The segment registers are funny things, they have
> +		 * both a visible and an invisible part.  Whenever the
> +		 * visible part is set to a specific selector, the
> +		 * invisible part is loaded with from a table in
> +		 * memory.  At no other time is the descriptor table
> +		 * in memory accessed.
> +		 *
> +		 * I take advantage of this here by force loading the
> +		 * segments, before I zap the gdt with an invalid
> +		 * value.
> +		 */
> +		load_segments();
> +		/* The gdt & idt are now invalid.  If you want to load
> +		 * them you must set up your own idt & gdt.
> +		 */
> +		set_gdt(phys_to_virt(0), 0);
> +		set_idt(phys_to_virt(0), 0);
> +	}

We can't keep the same idt and gdt as the pages they are on will be
overwritten/reused.  So explicitly stomping on them sounds better
so they never work.  We can restore them on kernel reentry.

>  	/* now call it */
> -	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
> -			image->start, cpu_has_pae);
> +	relocate_kernel_ptr((unsigned long)image->head,
> +			    (unsigned long)page_list,
> +			    image->start, cpu_has_pae);

Why rename relocate_kernel?
Ah.  I see.  You need to make it into a pointer again.  The crazy don't
stop the pgd support strikes again.  It used to be named rnk.

More later.

Eric



* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-10 19:55 ` Vivek Goyal
@ 2007-12-11  8:51   ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2007-12-11  8:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Eric W. Biederman, Pavel Machek, nigel, Rafael J. Wysocki,
	Andrew Morton, Jeremy Maitin-Shepard, linux-pm,
	Kexec Mailing List, linux-kernel

On Mon, 2007-12-10 at 14:55 -0500, Vivek Goyal wrote:
> On Fri, Dec 07, 2007 at 03:53:30PM +0000, Huang, Ying wrote:
> > This patch implements the functionality of jumping between the kexeced
> > kernel and the original kernel.
> > 
> 
> Hi,
> 
> I am just going through your patches and trying to understand them. I don't
> understand many things. Asking is easy so here you go...
> 
> > To support jumping between two kernels, before jumping to (executing)
> > the new kernel and jumping back to the original kernel, the devices
> > are put into quiescent state, and the state of devices and CPU is
> > saved. After jumping back from kexeced kernel and jumping to the new
> > kernel, the state of devices and CPU are restored accordingly. The
> > devices/CPU state save/restore code of software suspend is called to
> > implement corresponding function.
> > 
> 
> Do I need jumping back to restore an already hibernated kernel image? Can
> you please tell a little more about jumping back and why it is needed?

Currently, jumping back is used to implement "kexec based hibernation",
which uses kexec/kdump to save the memory image of the hibernated kernel
while hibernating, then uses /dev/oldmem to restore that memory image
and jumps back to the hibernated kernel so it can continue running.

Other usage models may include:

- Dumping the system memory image and then continuing to run, that is,
  taking a memory snapshot of the system while it is running.
- Cooperative multitasking between different OSes. You can load another
  OS (B) from the current OS (A) and jump between the two as needed.
- Calling some code (such as firmware) in physical mode.

> > To support jumping without reserving memory. One shadow backup page
> > (source page) is allocated for each page used by new (kexeced) kernel
> > (destination page). When do kexec_load, the image of new kernel is
> > loaded into source pages, and before executing, the destination pages
> > and the source pages are swapped, so the contents of destination pages
> > are backupped. Before jumping to the new (kexeced) kernel and after
> > jumping back to the original kernel, the destination pages and the
> > source pages are swapped too.
> > 
> 
> Ok, so due to swapping of source and destination pages first kernel's data
> is still preserved.  How do I get the dynamic memory required for second
> kernel boot (without writing first kernel's data)?

All dynamic memory required by the second kernel should be "loaded" by
sys_kexec_load in the first kernel. For example, not only should the
Linux kernel be loaded at 1M, but the memory 0~16M (excluding the kernel)
should also be "loaded" (as all zeros) by /sbin/kexec via sys_kexec_load.

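As a rough sketch of the 0~16M example in the reply above, the segment list that user space would hand to sys_kexec_load could be laid out as follows. The struct mirrors the buf/bufsz/mem/memsz fields of struct kexec_segment but is a local, hypothetical copy, as are sketch_layout() and SKETCH_PAGE_SIZE. The sketch assumes that a segment with bufsz smaller than memsz has its tail zero-filled; whether the real syscall accepts a segment with no data at all is not something this sketch verifies.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Hypothetical local mirror of the kexec segment descriptor. */
struct sketch_segment {
	const void *buf;	/* data to copy, NULL when bufsz is 0 */
	size_t bufsz;		/* bytes of real data */
	unsigned long mem;	/* physical destination address */
	size_t memsz;		/* bytes covered; the tail past bufsz is zeros */
};

/*
 * Describe [0, limit) as: zeros below the kernel, the kernel image
 * itself, and zeros above it. Returns the number of segments used.
 */
static size_t sketch_layout(struct sketch_segment seg[3],
			    const void *kernel, size_t kernel_size,
			    unsigned long load_addr, unsigned long limit)
{
	unsigned long end;
	size_t n = 0;

	/* The load address is expected to be page aligned. */
	assert((load_addr & (SKETCH_PAGE_SIZE - 1)) == 0);

	/* Round the end of the kernel image up to a page boundary. */
	end = load_addr + ((kernel_size + SKETCH_PAGE_SIZE - 1) &
			   ~(SKETCH_PAGE_SIZE - 1));

	if (load_addr > 0)
		seg[n++] = (struct sketch_segment){ NULL, 0, 0, load_addr };
	seg[n++] = (struct sketch_segment){ kernel, kernel_size,
					    load_addr, end - load_addr };
	if (end < limit)
		seg[n++] = (struct sketch_segment){ NULL, 0, end, limit - end };
	return n;
}
```

Calling sketch_layout() with a load address of 0x100000 and a limit of 0x1000000 yields three segments that together cover the whole 0~16M range, so every page the second kernel may touch gets a shadow page in the first kernel.
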
> > A jump back protocol for kexec is defined and documented. It is an
> > extension to ordinary function calling protocol. So, the facility
> > provided by this patch can be used to call ordinary C function in real
> > mode.
> > 
> > A set of flags for sys_kexec_load are added to control which state are
> > saved/restored before/after real mode code executing. For example, you
> > can specify the device state and FPU state are saved/restored
> > before/after real mode code executing.
> > 
> > The states (exclude CPU state) save/restore code can be overridden
> > based on the "command" parameter of kexec jump. Because more states
> > need to be saved/restored by hibernating/resuming.
> > 
> > Signed-off-by: Huang Ying <ying.huang@intel.com>
> > 
> > ---
> >  Documentation/i386/jump_back_protocol.txt |  103 ++++++++++++++
> >  arch/powerpc/kernel/machine_kexec.c       |    2 
> >  arch/ppc/kernel/machine_kexec.c           |    2 
> >  arch/sh/kernel/machine_kexec.c            |    2 
> >  arch/x86/kernel/machine_kexec_32.c        |   88 +++++++++---
> >  arch/x86/kernel/machine_kexec_64.c        |    2 
> >  arch/x86/kernel/relocate_kernel_32.S      |  214 +++++++++++++++++++++++++++---
> >  include/asm-x86/kexec_32.h                |   39 ++++-
> >  include/linux/kexec.h                     |   40 +++++
> >  kernel/kexec.c                            |  188 ++++++++++++++++++++++++++
> >  kernel/power/Kconfig                      |    2 
> >  kernel/sys.c                              |   35 +++-
> >  12 files changed, 648 insertions(+), 69 deletions(-)
> > 
> > --- a/arch/x86/kernel/machine_kexec_32.c
> > +++ b/arch/x86/kernel/machine_kexec_32.c
> > @@ -20,6 +20,7 @@
> >  #include <asm/cpufeature.h>
> >  #include <asm/desc.h>
> >  #include <asm/system.h>
> > +#include <asm/cacheflush.h>
> >  
> >  #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> >  static u32 kexec_pgd[1024] PAGE_ALIGNED;
> > @@ -83,10 +84,14 @@ static void load_segments(void)
> >   * reboot code buffer to allow us to avoid allocations
> >   * later.
> >   *
> > - * Currently nothing.
> > + * Turn off NX bit for control page.
> >   */
> >  int machine_kexec_prepare(struct kimage *image)
> >  {
> > +	if (nx_enabled) {
> > +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL_EXEC);
> > +		global_flush_tlb();
> > +	}
> >  	return 0;
> >  }
> >  
> > @@ -96,25 +101,59 @@ int machine_kexec_prepare(struct kimage 
> >   */
> >  void machine_kexec_cleanup(struct kimage *image)
> >  {
> > +	if (nx_enabled) {
> > +		change_page_attr(image->control_code_page, 1, PAGE_KERNEL);
> > +		global_flush_tlb();
> > +	}
> > +}
> > +
> > +void machine_kexec(struct kimage *image)
> > +{
> > +	machine_kexec_call(image, NULL, 0);
> >  }
> >  
> >  /*
> >   * Do not allocate memory (or fail in any way) in machine_kexec().
> >   * We are past the point of no return, committed to rebooting now.
> >   */
> > -NORET_TYPE void machine_kexec(struct kimage *image)
> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> > +			 unsigned int argc, va_list args)
> >  {
> >  	unsigned long page_list[PAGES_NR];
> >  	void *control_page;
> > +	asmlinkage NORET_TYPE void
> > +		(*relocate_kernel_ptr)(unsigned long indirection_page,
> > +				       unsigned long control_page,
> > +				       unsigned long start_address,
> > +				       unsigned int has_pae) ATTRIB_NORET;
> >  
> >  	/* Interrupts aren't acceptable while we reboot */
> >  	local_irq_disable();
> >  
> >  	control_page = page_address(image->control_code_page);
> > -	memcpy(control_page, relocate_kernel, PAGE_SIZE);
> > +	memcpy(control_page, relocate_page, PAGE_SIZE/2);
> > +	KCALL_MAGIC(control_page) = 0;
> >  
> 
> Is 2K sufficient for all the code in relocate_kernel_32.S? What's the
> current size?

The current size is 0x2d7 (727) bytes. I got it through objdump, as
machine_crash_shutdown - relocate_page. I think we have enough space.
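
The arithmetic behind that answer can be written down as a compile-time check. This is a sketch with hypothetical SKETCH_ names: the 727-byte figure is the one reported above, and the limit is the PAGE_SIZE/2 used in the memcpy() of the patch, assuming a 4096-byte page.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE	4096			/* assumed i386 page size */
#define SKETCH_CODE_SIZE	0x2d7			/* size reported above: 727 bytes */
#define SKETCH_CODE_LIMIT	(SKETCH_PAGE_SIZE / 2)	/* bytes copied by the patch */

/* Fails the build if the relocation code outgrows the copied half page. */
static_assert(SKETCH_CODE_SIZE <= SKETCH_CODE_LIMIT,
	      "relocation code must fit in half a page");

/* Bytes still free in the half page reserved for code. */
static int sketch_bytes_left(void)
{
	return SKETCH_CODE_LIMIT - SKETCH_CODE_SIZE;
}
```

In the kernel itself a check of this kind would more likely be spelled with BUILD_BUG_ON() against the real symbol addresses; the constant used here is only the number quoted in this thread.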

> > +	if (image->preserve_cpu) {
> > +		unsigned int i;
> > +		KCALL_MAGIC(control_page) = KCALL_MAGIC_NUMBER;
> > +		KCALL_ARGC(control_page) = argc;
> > +		for (i = 0; i < argc; i++)
> > +			KCALL_ARGS(control_page)[i] = \
> > +				va_arg(args, unsigned long);
> > +
> > +		if (kexec_call_save_cpu(control_page)) {
> > +			image->start = KCALL_ENTRY(control_page);
> 
> Who fills the entry point at offset 0x200?

The entry point is filled in by the assembler code in relocate_kernel_32.S
upon jumping back. You can find it with "grep ENTRY relocate_kernel_32.S".

> 
> [..]
> >  extern int machine_kexec_prepare(struct kimage *image);
> >  extern void machine_kexec_cleanup(struct kimage *image);
> >  extern asmlinkage long sys_kexec_load(unsigned long entry,
> >  					unsigned long nr_segments,
> >  					struct kexec_segment __user *segments,
> >  					unsigned long flags);
> > +extern int kexec_call(struct kimage *image, unsigned long *ret,
> > +		      unsigned int argc, ...);
> 
> Who is using kexec_call(). I can't seem to locate the caller of it.

There is no user of kexec_call() now. But I think it may be useful as a
physical mode caller for some firmware code.

Best Regards,
Huang Ying


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-10 22:31 ` Vivek Goyal
@ 2007-12-11  8:55   ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2007-12-11  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Eric W. Biederman, Pavel Machek, nigel, Rafael J. Wysocki,
	Andrew Morton, Jeremy Maitin-Shepard, linux-pm,
	Kexec Mailing List, linux-kernel

On Mon, 2007-12-10 at 17:31 -0500, Vivek Goyal wrote:
> [..]
> >  
> > -#define KEXEC_ON_CRASH  0x00000001
> > -#define KEXEC_ARCH_MASK 0xffff0000
> > +#define KEXEC_ON_CRASH		0x00000001
> > +#define KEXEC_PRESERVE_CPU	0x00000002
> > +#define KEXEC_PRESERVE_CPU_EXT	0x00000004
> > +#define KEXEC_SINGLE_CPU	0x00000008
> > +#define KEXEC_PRESERVE_DEVICE	0x00000010
> > +#define KEXEC_PRESERVE_CONSOLE	0x00000020
> 
> Hi,
> 
> Why do we need so many different flags for preserving different types
> of state (CPU, CPU_EXT, Device, console) ? To keep things simple,
> can't we can create just one flag KEXEC_PRESERVE_CONTEXT, which will
> indicate any special action required for preserving the previous kernel's
> context so that one can swith back to old kernel?

Yes. There are too many flags, especially since these flags have no
users now. It is better to use one flag such as KEXEC_PRESERVE_CONTEXT
for now, and add other flags when they are really needed.
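
A minimal sketch of how a single flag could be checked, assuming
KEXEC_PRESERVE_CONTEXT simply takes the next free bit. The value 0x2,
the KEXEC_FLAGS_SKETCH mask and both helper names are assumptions made
for illustration, not something settled in this thread.

```c
#include <stdbool.h>

/* Existing definitions, values as in the quoted patch. */
#define KEXEC_ON_CRASH         0x00000001UL
#define KEXEC_ARCH_MASK        0xffff0000UL

/* Assumed single replacement flag; the bit value is illustrative. */
#define KEXEC_PRESERVE_CONTEXT 0x00000002UL

/* All non-architecture flag bits this sketch accepts. */
#define KEXEC_FLAGS_SKETCH     (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)

/* Reject any flag bit outside the known set and the arch mask. */
static bool kexec_flags_valid(unsigned long flags)
{
	return (flags & ~(KEXEC_FLAGS_SKETCH | KEXEC_ARCH_MASK)) == 0;
}

/* True when the caller asked for state to be saved for a jump back. */
static bool kexec_wants_jump_back(unsigned long flags)
{
	return (flags & KEXEC_PRESERVE_CONTEXT) != 0;
}
```

With one flag, the load path only has to answer a single question
(preserve context or not) instead of combining five independent bits.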

Best Regards,
Huang Ying


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-11 15:50   ` Huang, Ying
@ 2007-12-11  9:27     ` Eric W. Biederman
  2007-12-12  6:27       ` Huang, Ying
  2007-12-18  8:34       ` Huang, Ying
  0 siblings, 2 replies; 13+ messages in thread
From: Eric W. Biederman @ 2007-12-11  9:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Pavel Machek, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

"Huang, Ying" <ying.huang@intel.com> writes:

> On Mon, 2007-12-10 at 19:25 -0700, Eric W. Biederman wrote:
>> "Huang, Ying" <ying.huang@intel.com> writes:
> [...]
>> >  /*
>> >   * Do not allocate memory (or fail in any way) in machine_kexec().
>> >   * We are past the point of no return, committed to rebooting now.
>> >   */
>> > -NORET_TYPE void machine_kexec(struct kimage *image)
>> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
>> > +			 unsigned int argc, va_list args)
>> >  {
>> 
>> Why do we need var arg support?
>> Can't we do that with a shim we load from user space?
>
> If all parameters are provided in user space, the usage model may be as
> follow:
>
> - sys_kexec_load() /* with executable/data/parameters(A) loaded */
> - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> parameters(A)*/
> - /* jump back */
> - sys_kexec_load() /* with executable/data/parameters(B) loaded */
> - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> parameters(B)*/
> - /* jump back */
>
> That is, the kexec image should be re-loaded if the parameters are
> different, and there can be no state reserved in kexec image. This is OK
> for original kexec implementation, because there is no jumping back.
> But, for kexec with jumping back, another usage model may be useful too.
>
> - sys_kexec_load() /* with executable/data loaded */
> - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(A)) /* execute physical mode
> code with parameters(A)*/
> - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(B)) /* execute physical mode
> code with parameters(B)*/
>
> This way the kexec image need not to be re-loaded, and the state of
> kexec image can be reserved across several invoking.

Interesting.  We wind up preserving the code in between invocations.

I don't know about your particular issue, but I can see that clearly
we need a way to read values back from our target image.

And if we can read everything back, one way to proceed is to read
everything out, modify it, and then write it back.

Amending a kexec image that is already stored may also make sense.

I'm not convinced that the var arg parameters make sense, but you
added them because of a real need.

The kexec function is split into two separate calls so that we can
unmount the filesystem the kexec image comes from before actually
doing the kexec.

If extensive user space shutdown or startup is needed I will argue
that doing the work in the sys_reboot call is the wrong place to
do it.  Although if a jump back is happening we should not need
much restart.

Can you generate a minimal patch with just the minimal necessary
support to return from a kexec operation?

> Another usage model may be useful is invoking the kexec image (such as
> firmware) from kernel space.
>
> - kmalloc the needed memory and loaded the firmware image (if needed)
> - sys_kexec_load() with a fake image (one segment with size 0), the
> entry point of the fake image is the entry point of the firmware image.
> - kexec_call(fake_image, ...) /* maybe change entry point if needed */
>
> This way, some kernel code can invoke the firmware in physical mode just
> like invoking an ordinary function.

That certainly seems interesting.  But that doesn't justify the vararg
part of this.

> [...]
>> > -	/* The segment registers are funny things, they have both a
>> > -	 * visible and an invisible part.  Whenever the visible part is
>> > -	 * set to a specific selector, the invisible part is loaded
>> > -	 * with from a table in memory.  At no other time is the
>> > -	 * descriptor table in memory accessed.
>> > -	 *
>> > -	 * I take advantage of this here by force loading the
>> > -	 * segments, before I zap the gdt with an invalid value.
>> > -	 */
>> > -	load_segments();
>> > -	/* The gdt & idt are now invalid.
>> > -	 * If you want to load them you must set up your own idt & gdt.
>> > -	 */
>> > -	set_gdt(phys_to_virt(0),0);
>> > -	set_idt(phys_to_virt(0),0);
>> > +	if (image->preserve_cpu_ext) {
>> > +		/* The segment registers are funny things, they have
>> > +		 * both a visible and an invisible part.  Whenever the
>> > +		 * visible part is set to a specific selector, the
>> > +		 * invisible part is loaded with from a table in
>> > +		 * memory.  At no other time is the descriptor table
>> > +		 * in memory accessed.
>> > +		 *
>> > +		 * I take advantage of this here by force loading the
>> > +		 * segments, before I zap the gdt with an invalid
>> > +		 * value.
>> > +		 */
>> > +		load_segments();
>> > +		/* The gdt & idt are now invalid.  If you want to load
>> > +		 * them you must set up your own idt & gdt.
>> > +		 */
>> > +		set_gdt(phys_to_virt(0), 0);
>> > +		set_idt(phys_to_virt(0), 0);
>> > +	}
>> 
>> We can't keep the same idt and gdt as the pages they are on will be
>> overwritten/reused.  So explictily stomping on them sounds better
>> so they never work.  We can restore them on kernel reentry.
>
> The original idea about this code is:
>
> If the kexec image is claimed that it need not to "perserving extensive
> CPU state" (such as FPU/MMX/GDT/LDT/IDT/CS/DS/ES/FS/GS/SS etc), the
> IDT/GDT/CS/DS/ES/FS/GS/SS are not touched in kexec image code. So the
> segment registers need not to be set.
>
> But this is not clear. At least more description should be provided for
> each preserve flag.

yes.

>> >  	/* now call it */
>> > -	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
>> > -			image->start, cpu_has_pae);
>> > +	relocate_kernel_ptr((unsigned long)image->head,
>> > +			    (unsigned long)page_list,
>> > +			    image->start, cpu_has_pae);
>> 
>> Why rename relocate_kernel?
>> Ah.  I see.  You need to make it into a pointer again.  The crazy don't
>> stop the pgd support strikes again.  It used to be named rnk.
>
> You mean I should change the function pointer name to rnk to keep
> consistency? I find rnk in IA64 implementation.

You were changing something that used to be a pointer back to a pointer
and I found that confusing.    See the last one or two commits to
machine_kexec_32.c for when this happened.  I get the feeling that we
need to put the page table creation logic into machine_kexec_prepare,
instead of in assembly.

Eric


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-11  2:25 ` Eric W. Biederman
@ 2007-12-11 15:50   ` Huang, Ying
  2007-12-11  9:27     ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2007-12-11 15:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Machek, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

On Mon, 2007-12-10 at 19:25 -0700, Eric W. Biederman wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
[...]
> >  /*
> >   * Do not allocate memory (or fail in any way) in machine_kexec().
> >   * We are past the point of no return, committed to rebooting now.
> >   */
> > -NORET_TYPE void machine_kexec(struct kimage *image)
> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> > +			 unsigned int argc, va_list args)
> >  {
> 
> Why do we need var arg support?
> Can't we do that with a shim we load from user space?

If all parameters are provided in user space, the usage model may be as
follows:

- sys_kexec_load() /* with executable/data/parameters(A) loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(A)*/
- /* jump back */
- sys_kexec_load() /* with executable/data/parameters(B) loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(B)*/
- /* jump back */

That is, the kexec image must be re-loaded whenever the parameters
differ, and no state can be preserved in the kexec image. This is OK
for the original kexec implementation, because there is no jumping back.
But for kexec with jumping back, another usage model may be useful too.

- sys_kexec_load() /* with executable/data loaded */
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(A)) /* execute physical mode code with parameters(A)*/
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(B)) /* execute physical mode code with parameters(B)*/

This way the kexec image need not be re-loaded, and the state of the
kexec image can be preserved across several invocations.


Another usage model that may be useful is invoking the kexec image
(such as firmware) from kernel space.

- kmalloc the needed memory and load the firmware image (if needed)
- sys_kexec_load() with a fake image (one segment with size 0), the
entry point of the fake image is the entry point of the firmware image.
- kexec_call(fake_image, ...) /* maybe change entry point if needed */

This way, some kernel code can invoke the firmware in physical mode just
like invoking an ordinary function.

[...]
> > -	/* The segment registers are funny things, they have both a
> > -	 * visible and an invisible part.  Whenever the visible part is
> > -	 * set to a specific selector, the invisible part is loaded
> > -	 * with from a table in memory.  At no other time is the
> > -	 * descriptor table in memory accessed.
> > -	 *
> > -	 * I take advantage of this here by force loading the
> > -	 * segments, before I zap the gdt with an invalid value.
> > -	 */
> > -	load_segments();
> > -	/* The gdt & idt are now invalid.
> > -	 * If you want to load them you must set up your own idt & gdt.
> > -	 */
> > -	set_gdt(phys_to_virt(0),0);
> > -	set_idt(phys_to_virt(0),0);
> > +	if (image->preserve_cpu_ext) {
> > +		/* The segment registers are funny things, they have
> > +		 * both a visible and an invisible part.  Whenever the
> > +		 * visible part is set to a specific selector, the
> > +		 * invisible part is loaded with from a table in
> > +		 * memory.  At no other time is the descriptor table
> > +		 * in memory accessed.
> > +		 *
> > +		 * I take advantage of this here by force loading the
> > +		 * segments, before I zap the gdt with an invalid
> > +		 * value.
> > +		 */
> > +		load_segments();
> > +		/* The gdt & idt are now invalid.  If you want to load
> > +		 * them you must set up your own idt & gdt.
> > +		 */
> > +		set_gdt(phys_to_virt(0), 0);
> > +		set_idt(phys_to_virt(0), 0);
> > +	}
> 
> We can't keep the same idt and gdt as the pages they are on will be
> overwritten/reused.  So explictily stomping on them sounds better
> so they never work.  We can restore them on kernel reentry.

The original idea behind this code is:

If the kexec image declares that it does not need "preserving extensive
CPU state" (such as FPU/MMX/GDT/LDT/IDT/CS/DS/ES/FS/GS/SS etc), the
IDT/GDT/CS/DS/ES/FS/GS/SS are not touched by the kexec image code. So
the segment registers need not be set.

But this is not clear. At least, more description should be provided
for each preserve flag.

> >  	/* now call it */
> > -	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
> > -			image->start, cpu_has_pae);
> > +	relocate_kernel_ptr((unsigned long)image->head,
> > +			    (unsigned long)page_list,
> > +			    image->start, cpu_has_pae);
> 
> Why rename relocate_kernel?
> Ah.  I see.  You need to make it into a pointer again.  The crazy don't
> stop the pgd support strikes again.  It used to be named rnk.

You mean I should change the function pointer name to rnk for
consistency? I found rnk in the IA64 implementation.

Best Regards,
Huang Ying


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-11  9:27     ` Eric W. Biederman
@ 2007-12-12  6:27       ` Huang, Ying
  2007-12-18  8:34       ` Huang, Ying
  1 sibling, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2007-12-12  6:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Machek, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

On Tue, 2007-12-11 at 02:27 -0700, Eric W. Biederman wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
> > On Mon, 2007-12-10 at 19:25 -0700, Eric W. Biederman wrote:
> >> "Huang, Ying" <ying.huang@intel.com> writes:
> > [...]
> >> >  /*
> >> >   * Do not allocate memory (or fail in any way) in machine_kexec().
> >> >   * We are past the point of no return, committed to rebooting now.
> >> >   */
> >> > -NORET_TYPE void machine_kexec(struct kimage *image)
> >> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> >> > +			 unsigned int argc, va_list args)
> >> >  {
> >> 
> >> Why do we need var arg support?
> >> Can't we do that with a shim we load from user space?
> >
> > If all parameters are provided in user space, the usage model may be as
> > follow:
> >
> > - sys_kexec_load() /* with executable/data/parameters(A) loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> > parameters(A)*/
> > - /* jump back */
> > - sys_kexec_load() /* with executable/data/parameters(B) loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> > parameters(B)*/
> > - /* jump back */
> >
> > That is, the kexec image should be re-loaded if the parameters are
> > different, and there can be no state reserved in kexec image. This is OK
> > for original kexec implementation, because there is no jumping back.
> > But, for kexec with jumping back, another usage model may be useful too.
> >
> > - sys_kexec_load() /* with executable/data loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(A)) /* execute physical mode
> > code with parameters(A)*/
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(B)) /* execute physical mode
> > code with parameters(B)*/
> >
> > This way the kexec image need not to be re-loaded, and the state of
> > kexec image can be reserved across several invoking.
> 
> Interesting.  We wind up preserving the code in between invocations.
> 
> I don't know about your particular issue, but I can see that clearly
> we need a way to read values back from our target image.
> 
> And if we can read everything back one way to proceed is to read
> everything out modify it and then write it back.
> 
> Amending a kexec image that is already stored may also make sense.
> 
> I'm not convinced that the var arg parameters make sense, but you
> added them because of a real need.
> 
> The kexec function is split into two separate calls so that we can
> unmount the filesystem the kexec image comes from before actually
> doing the kexec.

Yes. Reading/modifying the loaded kexec image is another way to do the
necessary communication between the first kernel and the second kernel.
In fact, patch [4/4] of this series, titled:

[PATCH 4/4 -mm] kexec based hibernation -v7 : kimgcore

provides an ELF CORE file in /proc (/proc/kimgcore) for reading the
loaded kexec image. The writing function can be added easily.

But I think communicating between the first kernel and the second kernel
by reading/modifying the loaded kernel image is not very convenient.
The usage model may be as follows:

- sys_kexec_load() /* with executable/data loaded */
- modify the loaded kexec image to set the parameters (A)
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(A)*/
- In physical mode code, check parameters (A) and execute accordingly
- modify the loaded kexec image to set the parameters (B)
- sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with parameters(B)*/
- In physical mode code, check parameters (B) and execute accordingly

There are some issues with this usage model:

- Some kernel parameters need to be exported (such as kimage->head, to
let the second kernel read the contents of the backed-up memory).

- The physical mode code invoker (the first kernel) needs to know where
to write the parameters. Either a common protocol or a case-by-case
protocol should be defined. For example, the memory address after the
entry point of the kexec image is a good candidate. But for the Linux
kernel, there are two types of entry point, the "jump back entry" and
"purgatory". Maybe a different protocol should be defined for each of
these two types of entry point.

- For the user space of the second kernel to get the parameters, an
interface (maybe a file in /proc or /sys) should be provided to export
the parameters to user space.

So I think the current parameter passing mechanism may be simpler and
more convenient (defined in Document/i386/jump_back_protocol.txt in the
patch).

There is only one user of var args. But I think it is simple to
implement and may be used by others.

> If extensive user space shutdown or startup is needed I will argue
> that doing the work in the sys_reboot call is the wrong place to
> do it.  Although if a jump back is happening we should not need
> much restart.

Now, user space is not shut down or started up across kexec/jump
back; the sys_reboot call is just used to trigger the kexec/jump back.
Maybe sys_reboot is not the right place to do this. Can you recommend
a more suitable place?

> Can you generate a minimal patch with just the minimal necessary
> support to return from a kexec operation?

I plan to reduce the KEXEC_PRESERVE_XXX flags to one
KEXEC_PRESERVE_CONTEXT. The var args part may be changed if we reach
consensus.

> > Another usage model may be useful is invoking the kexec image (such as
> > firmware) from kernel space.
> >
> > - kmalloc the needed memory and loaded the firmware image (if needed)
> > - sys_kexec_load() with a fake image (one segment with size 0), the
> > entry point of the fake image is the entry point of the firmware image.
> > - kexec_call(fake_image, ...) /* maybe change entry point if needed */
> >
> > This way, some kernel code can invoke the firmware in physical mode just
> > like invoking an ordinary function.
> 
> That certainly seems interesting.  But that doesn't justify the vararg
> part of this.
> 
> > [...]
> >> > -	/* The segment registers are funny things, they have both a
> >> > -	 * visible and an invisible part.  Whenever the visible part is
> >> > -	 * set to a specific selector, the invisible part is loaded
> >> > -	 * with from a table in memory.  At no other time is the
> >> > -	 * descriptor table in memory accessed.
> >> > -	 *
> >> > -	 * I take advantage of this here by force loading the
> >> > -	 * segments, before I zap the gdt with an invalid value.
> >> > -	 */
> >> > -	load_segments();
> >> > -	/* The gdt & idt are now invalid.
> >> > -	 * If you want to load them you must set up your own idt & gdt.
> >> > -	 */
> >> > -	set_gdt(phys_to_virt(0),0);
> >> > -	set_idt(phys_to_virt(0),0);
> >> > +	if (image->preserve_cpu_ext) {
> >> > +		/* The segment registers are funny things, they have
> >> > +		 * both a visible and an invisible part.  Whenever the
> >> > +		 * visible part is set to a specific selector, the
> >> > +		 * invisible part is loaded with from a table in
> >> > +		 * memory.  At no other time is the descriptor table
> >> > +		 * in memory accessed.
> >> > +		 *
> >> > +		 * I take advantage of this here by force loading the
> >> > +		 * segments, before I zap the gdt with an invalid
> >> > +		 * value.
> >> > +		 */
> >> > +		load_segments();
> >> > +		/* The gdt & idt are now invalid.  If you want to load
> >> > +		 * them you must set up your own idt & gdt.
> >> > +		 */
> >> > +		set_gdt(phys_to_virt(0), 0);
> >> > +		set_idt(phys_to_virt(0), 0);
> >> > +	}
> >> 
> >> We can't keep the same idt and gdt as the pages they are on will be
> >> overwritten/reused.  So explictily stomping on them sounds better
> >> so they never work.  We can restore them on kernel reentry.
> >
> > The original idea about this code is:
> >
> > If the kexec image is claimed that it need not to "perserving extensive
> > CPU state" (such as FPU/MMX/GDT/LDT/IDT/CS/DS/ES/FS/GS/SS etc), the
> > IDT/GDT/CS/DS/ES/FS/GS/SS are not touched in kexec image code. So the
> > segment registers need not to be set.
> >
> > But this is not clear. At least more description should be provided for
> > each preserve flag.
> 
> yes.
> 
> >> >  	/* now call it */
> >> > -	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
> >> > -			image->start, cpu_has_pae);
> >> > +	relocate_kernel_ptr((unsigned long)image->head,
> >> > +			    (unsigned long)page_list,
> >> > +			    image->start, cpu_has_pae);
> >> 
> >> Why rename relocate_kernel?
> >> Ah.  I see.  You need to make it into a pointer again.  The crazy don't
> >> stop the pgd support strikes again.  It used to be named rnk.
> >
> > You mean I should change the function pointer name to rnk to keep
> > consistency? I find rnk in IA64 implementation.
> 
> You were changing something that used to be a pointer back to a pointer
> and I found that confusing.    See the last one or two commits to
> machine_kexec_32.c for when this happened.

After checking the history of machine_kexec_32.c, I find the pointer was
removed because a specific PGD is used. My reason is a little different.
I need to save some information (the preserved registers, some control
registers, the swap page address etc) in control_page, and it is more
convenient to execute on the identity map of control_page. I think it is
not a big issue. In fact, with some changes, my code can work without
making relocate_kernel a pointer.

> I get the feeling that we
> need to put the page table creation logic into machine_kexec_prepare,
> instead of in assembly.

Yes. I think this is a good idea, and kexec_pgd, kexec_pmd0/1 and
kexec_pte0/1 can be allocated in machine_kexec_prepare too, to reduce
static memory usage.

Best Regards,
Huang Ying


* Re: [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump
  2007-12-11  9:27     ` Eric W. Biederman
  2007-12-12  6:27       ` Huang, Ying
@ 2007-12-18  8:34       ` Huang, Ying
  1 sibling, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2007-12-18  8:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Machek, nigel, Rafael J. Wysocki, Andrew Morton,
	Jeremy Maitin-Shepard, linux-kernel, linux-pm,
	Kexec Mailing List

On Tue, 2007-12-11 at 02:27 -0700, Eric W. Biederman wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
> > On Mon, 2007-12-10 at 19:25 -0700, Eric W. Biederman wrote:
> >> "Huang, Ying" <ying.huang@intel.com> writes:
> > [...]
> >> >  /*
> >> >   * Do not allocate memory (or fail in any way) in machine_kexec().
> >> >   * We are past the point of no return, committed to rebooting now.
> >> >   */
> >> > -NORET_TYPE void machine_kexec(struct kimage *image)
> >> > +int machine_kexec_vcall(struct kimage *image, unsigned long *ret,
> >> > +			 unsigned int argc, va_list args)
> >> >  {
> >> 
> >> Why do we need var arg support?
> >> Can't we do that with a shim we load from user space?
> >
> > If all parameters are provided in user space, the usage model may be as
> > follow:
> >
> > - sys_kexec_load() /* with executable/data/parameters(A) loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> > parameters(A)*/
> > - /* jump back */
> > - sys_kexec_load() /* with executable/data/parameters(B) loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,) /* execute physical mode code with
> > parameters(B)*/
> > - /* jump back */
> >
> > That is, the kexec image should be re-loaded if the parameters are
> > different, and there can be no state reserved in kexec image. This is OK
> > for original kexec implementation, because there is no jumping back.
> > But, for kexec with jumping back, another usage model may be useful too.
> >
> > - sys_kexec_load() /* with executable/data loaded */
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(A)) /* execute physical mode
> > code with parameters(A)*/
> > - sys_reboot(,,LINUX_REBOOT_CMD_KEXEC,parameters(B)) /* execute physical mode
> > code with parameters(B)*/
> >
> > This way the kexec image need not to be re-loaded, and the state of
> > kexec image can be reserved across several invoking.
> 
> Interesting.  We wind up preserving the code in between invocations.
> 
> I don't know about your particular issue, but I can see that clearly
> we need a way to read values back from our target image.
> 
> And if we can read everything back one way to proceed is to read
> everything out modify it and then write it back.
> 
> Amending a kexec image that is already stored may also make sense.
> 
> I'm not convinced that the var arg parameters make sense, but you
> added them because of a real need.
> 
> The kexec function is split into two separate calls so that we can
> unmount the filesystem the kexec image comes from before actually
> doing the kexec.

My real issue is that I need a kind of "kernel to kernel" communication
method. The var args are just a convenient way to pass an array of
unsigned longs between two kernels. The reason is as follows:

The kexec based hibernating process is as follows:

h1. put devices in quiescent state
h2. save devices/CPU state
h3. jump to kexeced kernel (kernel B)
*h4. normal kernel boot of kernel B
*h5. save devices/CPU state
*h6. jump back to original kernel (kernel A)
h7. restore devices/CPU state
h8. put devices in quiescent state
h9. put devices in low power state
h10. execute necessary ACPI method (prepare to sleep)
h11. save devices/CPU state
h12. jump to kernel B
*h13. execute necessary ACPI method (wake up)
*h14. restore devices/CPU state
*h15. put devices in normal power state
*h16. write memory image of kernel A into disk
*h17. put system into ACPI S4 state

The kexec based resuming process is as follows:

*r1. boot the resuming kernel (kernel C)
*r2. restore the memory image of kernel A
*r3. put devices in quiescent state
*r4. execute necessary ACPI method (prepare to resume)
*r5. jump to kernel A
r6. execute necessary ACPI method (wake up)
r7. restore devices/CPU state

Here, lines beginning with "*" are executed in kernel B and kernel C;
the others are executed in kernel A.

Kernel A needs to distinguish between h7 and r6, while kernel B/C need
to distinguish between *h13 and a normal jump back. The action each
kernel takes depends on the action of its peer kernel. Currently, this
is solved by kernel-to-kernel communication: a command word is passed to
the peer kernel to tell it which action is required.
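
One way to picture the command word is as a small dispatch on the value
received from the peer kernel. The sketch below is only an illustration:
the KJUMP_* names, the numeric values and the action bits are
hypothetical and do not appear in the patch; only the mapping to steps
h7, h13-h16 and r6-r7 follows the lists above.

```c
/* Hypothetical command words passed from one kernel to its peer. */
enum kjump_cmd {
	KJUMP_CMD_NORMAL    = 0, /* plain jump back, as at h7 */
	KJUMP_CMD_HIBERNATE = 1, /* second jump to kernel B, h13..h16 */
	KJUMP_CMD_RESUME    = 2, /* kernel C jumping to kernel A, r6..r7 */
};

/* Hypothetical action bits the receiving kernel would act on. */
#define KJUMP_ACT_ACPI_WAKE     0x1u /* execute ACPI wake-up method */
#define KJUMP_ACT_RESTORE_STATE 0x2u /* restore devices/CPU state */
#define KJUMP_ACT_WRITE_IMAGE   0x4u /* write kernel A's image to disk */

/* Map a received command word to the work the receiver has to do. */
static unsigned int kjump_actions(unsigned long cmd)
{
	switch (cmd) {
	case KJUMP_CMD_NORMAL:
		return KJUMP_ACT_RESTORE_STATE;
	case KJUMP_CMD_HIBERNATE:
		return KJUMP_ACT_ACPI_WAKE | KJUMP_ACT_RESTORE_STATE |
		       KJUMP_ACT_WRITE_IMAGE;
	case KJUMP_CMD_RESUME:
		return KJUMP_ACT_ACPI_WAKE | KJUMP_ACT_RESTORE_STATE;
	default:
		return 0; /* unknown command: take no action */
	}
}
```

The point of the sketch is that the receiver cannot infer these cases
from its own state alone; the sender has to say which path it took.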

I remember you said before that you think it is better to use only
"user space to user space" communication between kernel A and kernel B.
This is OK for normal kexec. But if the kexec jump is used for multiple
functions with early kernel action involved (normal kexec jump, kexec
jump to hibernate, kexec jump to resume), it is necessary to use "kernel
to kernel" communication.

The var args in the patch are just an array of unsigned longs; they can
be expressed as follows too:

int kexec_call(struct kimage *image, unsigned long *ret,
	       unsigned int argc, unsigned long argv[]);

The var args version is as follows:

int kexec_call(struct kimage *image, unsigned long *ret,
	       unsigned int argc, ...);
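
The equivalence of the two prototypes can be sketched in plain C: the
var args form only collects its arguments into an array and forwards to
the array form. Everything here is a stand-in for illustration. The
names kexec_call_array and kexec_call_varargs, the KCALL_MAX_ARGS limit,
and the body that sums the arguments are assumptions, not the patch's
implementation.

```c
#include <stdarg.h>
#include <stddef.h>

#define KCALL_MAX_ARGS 8 /* assumed limit for this sketch */

struct kimage; /* opaque here; the real definition lives in the kernel */

/*
 * Stand-in for the array form. It does not jump anywhere; it sums the
 * arguments into *ret so the forwarding can be observed.
 */
static int kexec_call_array(struct kimage *image, unsigned long *ret,
			    unsigned int argc, const unsigned long argv[])
{
	unsigned long sum = 0;
	unsigned int i;

	(void)image;
	for (i = 0; i < argc; i++)
		sum += argv[i];
	if (ret)
		*ret = sum;
	return 0;
}

/* Var args form: gather the unsigned longs, then forward to the array form. */
static int kexec_call_varargs(struct kimage *image, unsigned long *ret,
			      unsigned int argc, ...)
{
	unsigned long argv[KCALL_MAX_ARGS];
	va_list args;
	unsigned int i;

	if (argc > KCALL_MAX_ARGS)
		return -1;

	va_start(args, argc);
	for (i = 0; i < argc; i++)
		argv[i] = va_arg(args, unsigned long);
	va_end(args);

	return kexec_call_array(image, ret, argc, argv);
}
```

Callers of the var args form have to pass every argument as an
unsigned long (for example 1UL rather than 1), because va_arg() reads
exactly that type; the array form makes this requirement visible in the
prototype.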

Best Regards,
Huang Ying



Thread overview: 13+ messages
2007-12-07 15:53 [PATCH 1/4 -mm] kexec based hibernation -v7 : kexec jump Huang, Ying
2007-12-08 23:53 ` Pavel Machek
2007-12-09  0:19   ` Rafael J. Wysocki
2007-12-09  1:06     ` Eric W. Biederman
2007-12-10 19:55 ` Vivek Goyal
2007-12-11  8:51   ` Huang, Ying
2007-12-10 22:31 ` Vivek Goyal
2007-12-11  8:55   ` Huang, Ying
2007-12-11  2:25 ` Eric W. Biederman
2007-12-11 15:50   ` Huang, Ying
2007-12-11  9:27     ` Eric W. Biederman
2007-12-12  6:27       ` Huang, Ying
2007-12-18  8:34       ` Huang, Ying
