* [PATCH 0/6] kexec: A new system call to allow in kernel loading
@ 2013-11-20 17:50 Vivek Goyal
  2013-11-20 17:50 ` [PATCH 1/6] kexec: Export vmcoreinfo note size properly Vivek Goyal
                   ` (10 more replies)
  0 siblings, 11 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

The currently proposed secureboot implementation disables kexec/kdump
because it can allow an unsigned kernel to run on a secureboot platform.
The initial idea was to sign the /sbin/kexec binary and let that binary
do the kernel signature verification. I had posted RFC patches for this
approach here:

https://lkml.org/lkml/2013/9/10/560

Later we had a discussion at Plumbers, and most people thought that
signing and trusting /sbin/kexec was becoming too complex. A better idea
might be to let the kernel do the signature verification of the new
kernel being loaded. This calls for implementing a new system call and
moving a lot of user space code into the kernel.

The kexec_load() system call allows loading a kexec/kdump kernel and
jumping to that kernel at the right time. However, a lot of processing
is done in user space, which prepares the list of segments/buffers to be
loaded; kexec_load() only operates on that list of segments and does not
know what is contained in them.

Now a new system call, kexec_file_load(), is implemented which takes a
kernel fd and an initrd fd as parameters. With this, the kernel should
be able to verify the signature of the newly loaded kernel.
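
For illustration, loading a kernel with the new interface from user space
could look roughly like the sketch below. This is only a sketch: the
syscall number (314 on x86_64) and the five-argument prototype are the
ones added later in this series, glibc has no wrapper yet, and the file
paths and command line are made-up example values.

/*
 * Sketch: invoke the proposed kexec_file_load() syscall directly.
 * Paths and command line are example values only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_kexec_file_load 314	/* x86_64 number added in patch 4/6 */

int main(void)
{
	int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
	int initrd_fd = open("/boot/initrd.img", O_RDONLY);
	const char *cmdline = "root=/dev/sda1 ro";
	long ret;

	if (kernel_fd < 0 || initrd_fd < 0)
		return 1;

	/* cmdline_len includes the terminating NUL, as the kernel expects */
	ret = syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		      cmdline, strlen(cmdline) + 1, 0UL);
	if (ret)
		perror("kexec_file_load");
	return ret ? 1 : 0;
}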

This is an early RFC patchset. I have not done the signature handling
part yet; this is more of a minimal patchset to show what the new system
call and functionality will look like. Right now it can only handle a
bzImage with a 64bit entry point on x86_64. No EFI, no x86_32 or any
other architecture. The rest can be added as the need arises. In this
first iteration, I have tried to address the most common use case for
us.

Any feedback is welcome.

Vivek Goyal (6):
  kexec: Export vmcoreinfo note size properly
  kexec: Move segment verification code in a separate function
  resource: Provide new functions to walk through resources
  kexec: A new system call, kexec_file_load, for in kernel kexec
  kexec-bzImage: Support for loading bzImage using 64bit entry
  kexec: Support for Kexec on panic using new system call

 arch/x86/include/asm/crash.h         |    9 +
 arch/x86/include/asm/kexec-bzimage.h |   12 +
 arch/x86/include/asm/kexec.h         |   43 ++
 arch/x86/kernel/Makefile             |    2 +
 arch/x86/kernel/crash.c              |  585 +++++++++++++++++++++++++++
 arch/x86/kernel/kexec-bzimage.c      |  420 +++++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   |   60 +++-
 arch/x86/kernel/purgatory_entry_64.S |  119 ++++++
 arch/x86/syscalls/syscall_64.tbl     |    1 +
 include/linux/ioport.h               |    6 +
 include/linux/kexec.h                |   57 +++
 include/linux/syscalls.h             |    3 +
 include/uapi/linux/kexec.h           |    4 +
 kernel/kexec.c                       |  731 ++++++++++++++++++++++++++++++----
 kernel/ksysfs.c                      |    2 +-
 kernel/resource.c                    |  108 +++++-
 kernel/sys_ni.c                      |    1 +
 17 files changed, 2074 insertions(+), 89 deletions(-)
 create mode 100644 arch/x86/include/asm/crash.h
 create mode 100644 arch/x86/include/asm/kexec-bzimage.h
 create mode 100644 arch/x86/kernel/kexec-bzimage.c
 create mode 100644 arch/x86/kernel/purgatory_entry_64.S

-- 
1.7.7.6



* [PATCH 1/6] kexec: Export vmcoreinfo note size properly
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-21 18:59   ` Greg KH
  2013-11-20 17:50 ` [PATCH 2/6] kexec: Move segment verification code in a separate function Vivek Goyal
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

Right now we seem to be exporting the maximum data size that can be
contained inside the vmcoreinfo note. But this does not include the size
of the metadata around the vmcoreinfo data, such as the name of the note
and the starting and ending elf_note headers.

I think user space expects the total size, and that size is put in the
PT_NOTE elf header.
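
As a rough sketch of my reading of the note layout (illustrative only,
not part of the patch): the size user space needs covers the whole ELF
note, not just the payload, roughly as below. The "VMCOREINFO" name and
4096-byte payload match the definitions in kexec.h.

/* Illustrative breakdown of the full vmcoreinfo ELF note size. */
#include <elf.h>
#include <stdio.h>

#define VMCOREINFO_BYTES 4096	/* payload size, from kexec.h */

int main(void)
{
	size_t name_sz = (sizeof("VMCOREINFO") + 3) & ~3UL; /* 4-byte aligned */
	size_t total = sizeof(Elf64_Nhdr) + name_sz + VMCOREINFO_BYTES
		       + sizeof(Elf64_Nhdr);	/* trailing empty note */

	/* vmcoreinfo_max_size only covered the 4096-byte payload */
	printf("full note size: %zu\n", total);
	return 0;
}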

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kernel/ksysfs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 9659d38..d945a94 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -126,7 +126,7 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj,
 {
 	return sprintf(buf, "%lx %x\n",
 		       paddr_vmcoreinfo_note(),
-		       (unsigned int)vmcoreinfo_max_size);
+		       (unsigned int)sizeof(vmcoreinfo_note));
 }
 KERNEL_ATTR_RO(vmcoreinfo);
 
-- 
1.7.7.6



* [PATCH 2/6] kexec: Move segment verification code in a separate function
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
  2013-11-20 17:50 ` [PATCH 1/6] kexec: Export vmcoreinfo note size properly Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-20 17:50 ` [PATCH 3/6] resource: Provide new functions to walk through resources Vivek Goyal
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

Previously do_kimage_alloc() would allocate a kimage structure, copy the
segment list from user space and then do the segment list sanity
verification.

Break this function down into three parts: do_kimage_alloc_init() to do
the actual allocation and basic initialization of the kimage structure,
copy_user_segment_list() to copy the segment list from user space, and
sanity_check_segment_list() to verify the sanity of the segment list as
passed by user space.

In later patches, I need to only allocate the kimage and not copy the
segment list from user space. Breaking the function down into smaller
pieces enables re-use of the code in other places.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kernel/kexec.c |  182 +++++++++++++++++++++++++++++++-------------------------
 1 files changed, 101 insertions(+), 81 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 490afc0..6238927 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -120,45 +120,27 @@ static struct page *kimage_alloc_page(struct kimage *image,
 				       gfp_t gfp_mask,
 				       unsigned long dest);
 
-static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
-	                    unsigned long nr_segments,
-                            struct kexec_segment __user *segments)
+static int copy_user_segment_list(struct kimage *image,
+				unsigned long nr_segments,
+				struct kexec_segment __user *segments)
 {
+	int ret;
 	size_t segment_bytes;
-	struct kimage *image;
-	unsigned long i;
-	int result;
-
-	/* Allocate a controlling structure */
-	result = -ENOMEM;
-	image = kzalloc(sizeof(*image), GFP_KERNEL);
-	if (!image)
-		goto out;
-
-	image->head = 0;
-	image->entry = &image->head;
-	image->last_entry = &image->head;
-	image->control_page = ~0; /* By default this does not apply */
-	image->start = entry;
-	image->type = KEXEC_TYPE_DEFAULT;
-
-	/* Initialize the list of control pages */
-	INIT_LIST_HEAD(&image->control_pages);
-
-	/* Initialize the list of destination pages */
-	INIT_LIST_HEAD(&image->dest_pages);
-
-	/* Initialize the list of unusable pages */
-	INIT_LIST_HEAD(&image->unuseable_pages);
 
 	/* Read in the segments */
 	image->nr_segments = nr_segments;
 	segment_bytes = nr_segments * sizeof(*segments);
-	result = copy_from_user(image->segment, segments, segment_bytes);
-	if (result) {
-		result = -EFAULT;
-		goto out;
-	}
+	ret = copy_from_user(image->segment, segments, segment_bytes);
+	if (ret)
+		ret = -EFAULT;
+
+	return ret;
+}
+
+static int sanity_check_segment_list(struct kimage *image)
+{
+	int result, i;
+	unsigned long nr_segments = image->nr_segments;
 
 	/*
 	 * Verify we have good destination addresses.  The caller is
@@ -180,9 +162,9 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 		mstart = image->segment[i].mem;
 		mend   = mstart + image->segment[i].memsz;
 		if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
-			goto out;
+			return result;
 		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
-			goto out;
+			return result;
 	}
 
 	/* Verify our destination addresses do not overlap.
@@ -203,7 +185,7 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 			pend   = pstart + image->segment[j].memsz;
 			/* Do the segments overlap ? */
 			if ((mend > pstart) && (mstart < pend))
-				goto out;
+				return result;
 		}
 	}
 
@@ -215,18 +197,61 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 	result = -EINVAL;
 	for (i = 0; i < nr_segments; i++) {
 		if (image->segment[i].bufsz > image->segment[i].memsz)
-			goto out;
+			return result;
 	}
 
-	result = 0;
-out:
-	if (result == 0)
-		*rimage = image;
-	else
-		kfree(image);
+	/*
+	 * Verify we have good destination addresses.  Normally
+	 * the caller is responsible for making certain we don't
+	 * attempt to load the new image into invalid or reserved
+	 * areas of RAM.  But crash kernels are preloaded into a
+	 * reserved area of ram.  We must ensure the addresses
+	 * are in the reserved area otherwise preloading the
+	 * kernel could corrupt things.
+	 */
 
-	return result;
+	if (image->type == KEXEC_TYPE_CRASH) {
+		result = -EADDRNOTAVAIL;
+		for (i = 0; i < nr_segments; i++) {
+			unsigned long mstart, mend;
 
+			mstart = image->segment[i].mem;
+			mend = mstart + image->segment[i].memsz - 1;
+			/* Ensure we are within the crash kernel limits */
+			if ((mstart < crashk_res.start) ||
+			    (mend > crashk_res.end))
+				return result;
+		}
+	}
+
+	return 0;
+}
+
+static struct kimage *do_kimage_alloc_init(void)
+{
+	struct kimage *image;
+
+	/* Allocate a controlling structure */
+	image = kzalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return NULL;
+
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	image->control_page = ~0; /* By default this does not apply */
+	image->type = KEXEC_TYPE_DEFAULT;
+
+	/* Initialize the list of control pages */
+	INIT_LIST_HEAD(&image->control_pages);
+
+	/* Initialize the list of destination pages */
+	INIT_LIST_HEAD(&image->dest_pages);
+
+	/* Initialize the list of unusable pages */
+	INIT_LIST_HEAD(&image->unuseable_pages);
+
+	return image;
 }
 
 static void kimage_free_page_list(struct list_head *list);
@@ -239,10 +264,19 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 	struct kimage *image;
 
 	/* Allocate and initialize a controlling structure */
-	image = NULL;
-	result = do_kimage_alloc(&image, entry, nr_segments, segments);
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->start = entry;
+
+	result = copy_user_segment_list(image, nr_segments, segments);
 	if (result)
-		goto out;
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_image;
 
 	/*
 	 * Find a location for the control code buffer, and add it
@@ -254,22 +288,23 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 					   get_order(KEXEC_CONTROL_PAGE_SIZE));
 	if (!image->control_code_page) {
 		printk(KERN_ERR "Could not allocate control_code_buffer\n");
-		goto out_free;
+		goto out_free_image;
 	}
 
 	image->swap_page = kimage_alloc_control_pages(image, 0);
 	if (!image->swap_page) {
 		printk(KERN_ERR "Could not allocate swap buffer\n");
-		goto out_free;
+		goto out_free_control_pages;
 	}
 
 	*rimage = image;
 	return 0;
 
-out_free:
+
+out_free_control_pages:
 	kimage_free_page_list(&image->control_pages);
+out_free_image:
 	kfree(image);
-out:
 	return result;
 }
 
@@ -279,19 +314,17 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 {
 	int result;
 	struct kimage *image;
-	unsigned long i;
 
-	image = NULL;
 	/* Verify we have a valid entry point */
-	if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
-		result = -EADDRNOTAVAIL;
-		goto out;
-	}
+	if ((entry < crashk_res.start) || (entry > crashk_res.end))
+		return -EADDRNOTAVAIL;
 
 	/* Allocate and initialize a controlling structure */
-	result = do_kimage_alloc(&image, entry, nr_segments, segments);
-	if (result)
-		goto out;
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->start = entry;
 
 	/* Enable the special crash kernel control page
 	 * allocation policy.
@@ -299,25 +332,13 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 	image->control_page = crashk_res.start;
 	image->type = KEXEC_TYPE_CRASH;
 
-	/*
-	 * Verify we have good destination addresses.  Normally
-	 * the caller is responsible for making certain we don't
-	 * attempt to load the new image into invalid or reserved
-	 * areas of RAM.  But crash kernels are preloaded into a
-	 * reserved area of ram.  We must ensure the addresses
-	 * are in the reserved area otherwise preloading the
-	 * kernel could corrupt things.
-	 */
-	result = -EADDRNOTAVAIL;
-	for (i = 0; i < nr_segments; i++) {
-		unsigned long mstart, mend;
+	result = copy_user_segment_list(image, nr_segments, segments);
+	if (result)
+		goto out_free_image;
 
-		mstart = image->segment[i].mem;
-		mend = mstart + image->segment[i].memsz - 1;
-		/* Ensure we are within the crash kernel limits */
-		if ((mstart < crashk_res.start) || (mend > crashk_res.end))
-			goto out_free;
-	}
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_image;
 
 	/*
 	 * Find a location for the control code buffer, and add
@@ -329,15 +350,14 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 					   get_order(KEXEC_CONTROL_PAGE_SIZE));
 	if (!image->control_code_page) {
 		printk(KERN_ERR "Could not allocate control_code_buffer\n");
-		goto out_free;
+		goto out_free_image;
 	}
 
 	*rimage = image;
 	return 0;
 
-out_free:
+out_free_image:
 	kfree(image);
-out:
 	return result;
 }
 
-- 
1.7.7.6



* [PATCH 3/6] resource: Provide new functions to walk through resources
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
  2013-11-20 17:50 ` [PATCH 1/6] kexec: Export vmcoreinfo note size properly Vivek Goyal
  2013-11-20 17:50 ` [PATCH 2/6] kexec: Move segment verification code in a separate function Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal, Yinghai Lu

I have added two more functions to walk through resources.

The current walk_system_ram_range() deals with pfns, and /proc/iomem can
contain partial pages. By dealing in pfns, the callback function loses
the information that the last page of a memory range is a partial page
and not a full page. So I implemented walk_system_ram_res(), which passes
u64 values to the callback functions and thus properly reports the start
and end addresses.

walk_system_ram_range() uses find_next_system_ram() to find the next ram
resource. This in turn only walks the siblings of the top level children
and does not traverse all the nodes of the resource tree. I also need
another function with which I can walk through all the resources, for
example to figure out where the "GART" aperture or the ACPI memory is.

So I wrote another function, walk_ram_res(), which walks through all
/proc/iomem resources and returns the matches asked for by the caller.
The caller can specify the "name" of the resource, and a start and end.
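
As an illustration (a sketch, not taken from the patches) of how a
caller is expected to use the new interface, the callback now receives
the exact u64 start and end of each matching range:

/* Sketch: sum up all "System RAM" using the new walk_system_ram_res(). */
#include <linux/ioport.h>
#include <linux/types.h>

static int count_ram_bytes(u64 start, u64 end, void *arg)
{
	u64 *total = arg;

	*total += end - start + 1;
	return 0;	/* returning 0 moves on to the next resource */
}

static u64 total_system_ram(void)
{
	u64 total = 0;

	/* [0, -1] covers all of physical memory */
	walk_system_ram_res(0, -1, &total, count_ram_bytes);
	return total;
}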

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
---
 include/linux/ioport.h |    6 +++
 kernel/resource.c      |  108 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 89b7c24..0ebf8b0 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -227,6 +227,12 @@ extern int iomem_is_exclusive(u64 addr);
 extern int
 walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 		void *arg, int (*func)(unsigned long, unsigned long, void *));
+extern int
+walk_system_ram_res(u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *));
+extern int
+walk_ram_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *));
 
 /* True if any part of r1 overlaps r2 */
 static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
diff --git a/kernel/resource.c b/kernel/resource.c
index 3f285dc..5e575e8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -59,10 +59,8 @@ static DEFINE_RWLOCK(resource_lock);
 static struct resource *bootmem_resource_free;
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
-static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+static struct resource *next_resource(struct resource *p)
 {
-	struct resource *p = v;
-	(*pos)++;
 	if (p->child)
 		return p->child;
 	while (!p->sibling && p->parent)
@@ -70,6 +68,13 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
 	return p->sibling;
 }
 
+static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct resource *p = v;
+	(*pos)++;
+	return (void *)next_resource(p);
+}
+
 #ifdef CONFIG_PROC_FS
 
 enum { MAX_IORES_LEVEL = 5 };
@@ -322,7 +327,71 @@ int release_resource(struct resource *old)
 
 EXPORT_SYMBOL(release_resource);
 
-#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+/*
+ * Finds the lowest iomem reosurce exists with-in [res->start.res->end)
+ * the caller must specify res->start, res->end, res->flags and "name".
+ * If found, returns 0, res is overwritten, if not found, returns -1.
+ * This walks through whole tree and not just first level children.
+ */
+static int find_next_iomem_res(struct resource *res, char *name)
+{
+	resource_size_t start, end;
+	struct resource *p;
+
+	BUG_ON(!res);
+
+	start = res->start;
+	end = res->end;
+	BUG_ON(start >= end);
+
+	read_lock(&resource_lock);
+	p = &iomem_resource;
+	while ((p = next_resource(p))) {
+		if (p->flags != res->flags)
+			continue;
+		if (name && strcmp(p->name, name))
+			continue;
+		if (p->start > end) {
+			p = NULL;
+			break;
+		}
+		if ((p->end >= start) && (p->start < end))
+			break;
+	}
+
+	read_unlock(&resource_lock);
+	if (!p)
+		return -1;
+	/* copy data */
+	if (res->start < p->start)
+		res->start = p->start;
+	if (res->end > p->end)
+		res->end = p->end;
+	return 0;
+}
+
+int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
+		void *arg, int (*func)(u64, u64, void *))
+{
+	struct resource res;
+	u64 orig_end;
+	int ret = -1;
+
+	res.start = start;
+	res.end = end;
+	res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	orig_end = res.end;
+	while ((res.start < res.end) &&
+		(find_next_iomem_res(&res, name) >= 0)) {
+		ret = (*func)(res.start, res.end, arg);
+		if (ret)
+			break;
+		res.start = res.end + 1;
+		res.end = orig_end;
+	}
+	return ret;
+}
+
 /*
  * Finds the lowest memory reosurce exists within [res->start.res->end)
  * the caller must specify res->start, res->end, res->flags and "name".
@@ -367,6 +436,37 @@ static int find_next_system_ram(struct resource *res, char *name)
 /*
  * This function calls callback against all memory range of "System RAM"
  * which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
+ * Now, this function is only for "System RAM". This function deals with
+ * full ranges and not pfn. If resources are not pfn aligned, dealing
+ * with pfn can truncate ranges.
+ */
+int walk_system_ram_res(u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *))
+{
+	struct resource res;
+	u64 orig_end;
+	int ret = -1;
+
+	res.start = start;
+	res.end = end;
+	res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	orig_end = res.end;
+	while ((res.start < res.end) &&
+		(find_next_system_ram(&res, "System RAM") >= 0)) {
+		ret = (*func)(res.start, res.end, arg);
+		if (ret)
+			break;
+		res.start = res.end + 1;
+		res.end = orig_end;
+	}
+	return ret;
+}
+
+#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+
+/*
+ * This function calls callback against all memory range of "System RAM"
+ * which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  * Now, this function is only for "System RAM".
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
-- 
1.7.7.6



* [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (2 preceding siblings ...)
  2013-11-20 17:50 ` [PATCH 3/6] resource: Provide new functions to walk through resources Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-21 19:03   ` Greg KH
                     ` (3 more replies)
  2013-11-20 17:50 ` [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
                   ` (6 subsequent siblings)
  10 siblings, 4 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

This patch implements the in-kernel kexec functionality. It implements a
new system call, kexec_file_load. I think the parameter list of this
system call will change, as I have not done the kernel image signature
handling yet. I have been told that I might have to pass the detached
signature and its size as part of the system call.

Previously the segment list was prepared in user space. Now user space
just passes a kernel fd, an initrd fd and the command line, and the
kernel creates the segment list internally.

This patch contains the generic part of the code. The actual segment
preparation and loading is done by the arch- and image-specific loader,
which comes in the next patch.
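
As an illustration of that split (a sketch built on the kexec_file_type
hooks defined in this patch; the real bzImage64 handler comes in the next
patch, and the "myfmt_*" names here are placeholders), an arch wires up
an image format roughly like this:

/* Sketch: how an image format plugs into the file based syscall. */
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/kexec.h>

static int myfmt_probe(const char *buf, unsigned long len)
{
	/* return 0 if buf looks like an image this handler can load */
	return -ENOEXEC;
}

static void *myfmt_load(struct kimage *image, char *kernel,
			unsigned long kernel_len, char *initrd,
			unsigned long initrd_len, char *cmdline,
			unsigned long cmdline_len)
{
	/* place segments with kexec_add_buffer() and return loader data */
	return ERR_PTR(-ENOEXEC);
}

static struct kexec_file_type kexec_file_type[] = {
	/* {name, probe, load, prep_entry, cleanup} */
	{"myfmt", myfmt_probe, myfmt_load, NULL, NULL},
};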

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c |   57 ++++-
 arch/x86/syscalls/syscall_64.tbl   |    1 +
 include/linux/kexec.h              |   57 +++++
 include/linux/syscalls.h           |    3 +
 include/uapi/linux/kexec.h         |    4 +
 kernel/kexec.c                     |  486 +++++++++++++++++++++++++++++++++++-
 kernel/sys_ni.c                    |    1 +
 7 files changed, 607 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4eabc16..fb41b73 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,6 +22,13 @@
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
 
+/* arch dependent functionality related to kexec file based syscall */
+static struct kexec_file_type kexec_file_type[]={
+	{"", NULL, NULL, NULL, NULL},
+};
+
+static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
+
 static void free_transition_pgtable(struct kimage *image)
 {
 	free_page((unsigned long)image->arch.pud);
@@ -200,7 +207,7 @@ void machine_kexec(struct kimage *image)
 {
 	unsigned long page_list[PAGES_NR];
 	void *control_page;
-	int save_ftrace_enabled;
+	int save_ftrace_enabled, idx;
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
@@ -226,6 +233,11 @@ void machine_kexec(struct kimage *image)
 #endif
 	}
 
+	/* Call image loader to prepare for entry */
+	idx = image->file_handler_idx;
+	if (kexec_file_type[idx].prep_entry)
+		kexec_file_type[idx].prep_entry(image);
+
 	control_page = page_address(image->control_code_page) + PAGE_SIZE;
 	memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
 
@@ -281,3 +293,46 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
 }
 
+/* arch dependent functionality related to kexec file based syscall */
+
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+					unsigned long buf_len)
+{
+	int i, ret = -ENOEXEC;
+
+	for (i = 0; i < nr_file_types; i++) {
+		if (!kexec_file_type[i].probe)
+			continue;
+
+		ret = kexec_file_type[i].probe(buf, buf_len);
+		if (!ret) {
+			image->file_handler_idx = i;
+			return ret;
+		}
+	}
+
+	return ret;
+}
+
+void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+			unsigned long kernel_len, char *initrd,
+			unsigned long initrd_len, char *cmdline,
+			unsigned long cmdline_len)
+{
+	int idx = image->file_handler_idx;
+
+	if (idx < 0)
+		return ERR_PTR(-ENOEXEC);
+
+	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
+					initrd_len, cmdline, cmdline_len);
+}
+
+int arch_image_file_post_load_cleanup(struct kimage *image)
+{
+	int idx = image->file_handler_idx;
+
+	if (kexec_file_type[idx].cleanup)
+		return kexec_file_type[idx].cleanup(image);
+	return 0;
+}
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..6f37cc9 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	kexec_file_load		sys_kexec_file_load
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d78d28a..a2baf96 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -110,13 +110,60 @@ struct kimage {
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
 	unsigned int preserve_context : 1;
+	/* If set, we are using file mode kexec syscall */
+	unsigned int file_mode : 1;
 
 #ifdef ARCH_HAS_KIMAGE_ARCH
 	struct kimage_arch arch;
 #endif
+
+	/* Additional Fields for file based kexec syscall */
+	void *kernel_buf;
+	unsigned long kernel_buf_len;
+
+	void *initrd_buf;
+	unsigned long initrd_buf_len;
+
+	char *cmdline_buf;
+	unsigned long cmdline_buf_len;
+
+	/* index of file handler in array */
+	int file_handler_idx;
+
+	/* Image loader handling the kernel can store a pointer here */
+	void * image_loader_data;
 };
 
+/*
+ * Keeps a track of buffer parameters as provided by caller for requesting
+ * memory placement of buffer.
+ */
+struct kexec_buf {
+	struct kimage *image;
+	char *buffer;
+	unsigned long bufsz;
+	unsigned long memsz;
+	unsigned long buf_align;
+	unsigned long buf_min;
+	unsigned long buf_max;
+	int top_down;		/* allocate from top of memory hole */
+};
 
+typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
+typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
+				unsigned long kernel_len, char *initrd,
+				unsigned long initrd_len, char *cmdline,
+				unsigned long cmdline_len);
+typedef int (kexec_prep_entry_t)(struct kimage *image);
+typedef int (kexec_cleanup_t)(struct kimage *image);
+
+struct kexec_file_type {
+	const char *name;
+	kexec_probe_t *probe;
+	kexec_load_t *load;
+	kexec_prep_entry_t *prep_entry;
+	kexec_cleanup_t *cleanup;
+};
 
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
@@ -127,6 +174,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
 					struct kexec_segment __user *segments,
 					unsigned long flags);
 extern int kernel_kexec(void);
+extern int kexec_add_buffer(struct kimage *image, char *buffer,
+			unsigned long bufsz, unsigned long memsz,
+			unsigned long buf_align, unsigned long buf_min,
+			unsigned long buf_max, int buf_end,
+			unsigned long *load_addr);
 #ifdef CONFIG_COMPAT
 extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
 				unsigned long nr_segments,
@@ -135,6 +187,8 @@ extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
 #endif
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
+extern void kimage_set_start_addr(struct kimage *image, unsigned long start);
+
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -182,6 +236,9 @@ extern struct kimage *kexec_crash_image;
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
 #endif
 
+/* Listof defined/legal kexec file flags */
+#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
+
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
 #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 94273bb..b712ac7 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -301,6 +301,9 @@ asmlinkage long sys_restart_syscall(void);
 asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
 				struct kexec_segment __user *segments,
 				unsigned long flags);
+asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
+				const char __user * cmdline_ptr,
+				unsigned long cmdline_len, unsigned long flags);
 
 asmlinkage long sys_exit(int error_code);
 asmlinkage long sys_exit_group(int error_code);
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 104838f..cdd666b 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -13,6 +13,10 @@
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_ARCH_MASK		0xffff0000
 
+/* Kexec file load interface flags */
+#define KEXEC_FILE_UNLOAD	0x00000001
+#define KEXEC_FILE_ON_CRASH	0x00000002
+
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
  */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 6238927..50bcaa8 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -120,6 +120,11 @@ static struct page *kimage_alloc_page(struct kimage *image,
 				       gfp_t gfp_mask,
 				       unsigned long dest);
 
+void kimage_set_start_addr(struct kimage *image, unsigned long start)
+{
+	image->start = start;
+}
+
 static int copy_user_segment_list(struct kimage *image,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -256,6 +261,225 @@ static struct kimage *do_kimage_alloc_init(void)
 
 static void kimage_free_page_list(struct list_head *list);
 
+static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
+{
+	struct fd f = fdget(fd);
+	int ret = 0;
+	struct kstat stat;
+	loff_t pos;
+	ssize_t bytes = 0;
+
+	if (!f.file)
+		return -EBADF;
+
+	ret = vfs_getattr(&f.file->f_path, &stat);
+	if (ret)
+		goto out;
+
+	if (stat.size > INT_MAX) {
+		ret = -EFBIG;
+		goto out;
+	}
+
+	/* Don't hand 0 to vmalloc, it whines. */
+	if (stat.size == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	*buf = vmalloc(stat.size);
+        if (!*buf) {
+                ret = -ENOMEM;
+                goto out;
+        }
+
+	pos = 0;
+	while (pos < stat.size) {
+		bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
+                                    stat.size - pos);
+                if (bytes < 0) {
+                        vfree(*buf);
+                        ret = bytes;
+                        goto out;
+                }
+
+                if (bytes == 0)
+                        break;
+                pos += bytes;
+        }
+
+        *buf_len = pos;
+
+out:
+	fdput(f);
+	return ret;
+}
+
+/* Architectures can provide this probe function */
+int __attribute__ ((weak))
+arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+				unsigned long buf_len)
+{
+	return -ENOEXEC;
+}
+
+void * __attribute__ ((weak))
+arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len)
+{
+	return ERR_PTR(-ENOEXEC);
+}
+
+void __attribute__ ((weak))
+arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	return;
+}
+
+/*
+ * Free up tempory buffers allocated which are not needed after image has
+ * been loaded.
+ *
+ * Free up memory used by kernel, initrd, and comand line. This is temporary
+ * memory allocation which is not needed any more after these buffers have
+ * been loaded into separate segments and have been copied elsewhere
+ */
+static void kimage_file_post_load_cleanup(struct kimage *image)
+{
+	if (image->kernel_buf) {
+		vfree(image->kernel_buf);
+		image->kernel_buf = NULL;
+	}
+
+	if (image->initrd_buf) {
+		vfree(image->initrd_buf);
+		image->initrd_buf = NULL;
+	}
+
+	if (image->cmdline_buf) {
+		vfree(image->cmdline_buf);
+		image->cmdline_buf = NULL;
+	}
+
+	/* See if architcture has anything to cleanup post load */
+	arch_kimage_file_post_load_cleanup(image);
+}
+
+/*
+ * In file mode list of segments is prepared by kernel. Copy relevant
+ * data from user space, do error checking, prepare segment list
+ */
+static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int ret = 0;
+	void *ldata;
+
+	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
+					&image->kernel_buf_len);
+	if (ret)
+		goto out;
+
+	/* Call arch image probe handlers */
+	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
+						image->kernel_buf_len);
+
+	if (ret)
+		goto out;
+
+	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
+					&image->initrd_buf_len);
+	if (ret)
+		goto out;
+
+	image->cmdline_buf = vzalloc(cmdline_len);
+	if (!image->cmdline_buf)
+		goto out;
+
+	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	image->cmdline_buf_len = cmdline_len;
+
+	/* command line should be a string with last byte null */
+	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Call arch image load handlers */
+	ldata = arch_kexec_kernel_image_load(image,
+			image->kernel_buf, image->kernel_buf_len,
+			image->initrd_buf, image->initrd_buf_len,
+			image->cmdline_buf, image->cmdline_buf_len);
+
+	if (IS_ERR(ldata)) {
+		ret = PTR_ERR(ldata);
+		goto out;
+	}
+
+	image->image_loader_data = ldata;
+out:
+	return ret;
+}
+
+static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		printk(KERN_ERR "Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	image->swap_page = kimage_alloc_control_pages(image, 0);
+	if (!image->swap_page) {
+		printk(KERN_ERR "Could not allocate swap buffer\n");
+		goto out_free_control_pages;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_control_pages:
+	kimage_free_page_list(&image->control_pages);
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
 static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -679,6 +903,14 @@ static void kimage_free(struct kimage *image)
 
 	/* Free the kexec control pages... */
 	kimage_free_page_list(&image->control_pages);
+
+	kfree(image->image_loader_data);
+
+	/*
+	 * Free up any temporary buffers allocated. This might hit if
+	 * error occurred much later after buffer allocation.
+	 */
+	kimage_file_post_load_cleanup(image);
 	kfree(image);
 }
 
@@ -843,7 +1075,11 @@ static int kimage_load_normal_segment(struct kimage *image,
 				PAGE_SIZE - (maddr & ~PAGE_MASK));
 		uchunk = min(ubytes, mchunk);
 
-		result = copy_from_user(ptr, buf, uchunk);
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, buf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kunmap(page);
 		if (result) {
 			result = -EFAULT;
@@ -1093,6 +1329,72 @@ asmlinkage long compat_sys_kexec_load(unsigned long entry,
 }
 #endif
 
+SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __user *, cmdline_ptr, unsigned long, cmdline_len, unsigned long, flags)
+{
+	int ret = 0, i;
+	struct kimage **dest_image, *image;
+
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_BOOT))
+		return -EPERM;
+
+	pr_debug("kexec_file_load: kernel_fd=%d initrd_fd=%d cmdline=0x%p"
+			" cmdline_len=%lu flags=0x%lx\n", kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len, flags);
+
+	/* Make sure we have a legal set of flags */
+	if (flags != (flags & KEXEC_FILE_FLAGS))
+		return -EINVAL;
+
+	image = NULL;
+
+	if (!mutex_trylock(&kexec_mutex))
+		return -EBUSY;
+
+	dest_image = &kexec_image;
+	if (flags & KEXEC_FILE_ON_CRASH)
+		dest_image = &kexec_crash_image;
+
+	if (flags & KEXEC_FILE_UNLOAD)
+		goto exchange;
+
+	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	if (ret)
+		goto out;
+
+	ret = machine_kexec_prepare(image);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		struct kexec_segment *ksegment;
+
+		ksegment = &image->segment[i];
+		pr_debug("Loading segment %d: buf=0x%p bufsz=0x%lx mem=0x%lx"
+			" memsz=0x%lx\n", i, ksegment->buf, ksegment->bufsz,
+			ksegment->mem, ksegment->memsz);
+		ret = kimage_load_segment(image, &image->segment[i]);
+		if (ret)
+			goto out;
+		pr_debug("Done loading segment %d\n", i);
+	}
+
+	kimage_terminate(image);
+
+	/*
+	 * Free up any temporary buffers allocated which are not needed
+	 * after image has been loaded
+	 */
+	kimage_file_post_load_cleanup(image);
+exchange:
+	image = xchg(dest_image, image);
+out:
+	mutex_unlock(&kexec_mutex);
+	kimage_free(image);
+	return ret;
+}
+
 void crash_kexec(struct pt_regs *regs)
 {
 	/* Take the kexec_mutex here to prevent sys_kexec_load
@@ -1647,6 +1949,188 @@ static int __init crash_save_vmcoreinfo_init(void)
 
 module_init(crash_save_vmcoreinfo_init)
 
+static int kexec_add_segment(struct kimage *image, char *buf,
+		unsigned long bufsz, unsigned long mem, unsigned long memsz)
+{
+	struct kexec_segment *ksegment;
+
+	ksegment = &image->segment[image->nr_segments];
+	ksegment->buf = buf;
+	ksegment->bufsz = bufsz;
+	ksegment->mem = mem;
+	ksegment->memsz = memsz;
+	image->nr_segments++;
+
+	return 0;
+}
+
+static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_end = min(end, kbuf->buf_max);
+	temp_start = temp_end - kbuf->memsz;
+
+	do {
+		/* align down start */
+		temp_start = temp_start & (~ (kbuf->buf_align - 1));
+
+		if (temp_start < start || temp_start < kbuf->buf_min)
+			return 0;
+
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		/*
+		 * Make sure this does not conflict with any of existing
+		 * segments
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start - PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while(1);
+
+	/* If we are here, we found a suitable memory range */
+	kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_start = max(start, kbuf->buf_min);
+
+	do {
+		temp_start = ALIGN(temp_start, kbuf->buf_align);
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		if (temp_end > end || temp_end > kbuf->buf_max)
+			return 0;
+		/*
+		 * Make sure this does not conflict with any of existing
+		 * segments
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start + PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while(1);
+
+	/* If we are here, we found a suitable memory range */
+	kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+{
+	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
+	unsigned long sz = end - start + 1;
+
+	/* Returning 0 will take to next memory range */
+	if (sz < kbuf->memsz)
+		return 0;
+
+	if (end < kbuf->buf_min || start > kbuf->buf_max)
+		return 0;
+
+	/*
+	 * Allocate memory top down with-in ram range. Otherwise bottom up
+	 * allocation.
+	 */
+	if (kbuf->top_down)
+		return locate_mem_hole_top_down(start, end, kbuf);
+	else
+		return locate_mem_hole_bottom_up(start, end, kbuf);
+}
+
+/*
+ * Helper functions for placing a buffer in a kexec segment. This assumes
+ * that kexec_mutex is held.
+ */
+int kexec_add_buffer(struct kimage *image, char *buffer,
+		unsigned long bufsz, unsigned long memsz,
+		unsigned long buf_align, unsigned long buf_min,
+		unsigned long buf_max, int top_down, unsigned long *load_addr)
+{
+
+	unsigned long nr_segments = image->nr_segments, new_nr_segments;
+	struct kexec_segment *ksegment;
+	struct kexec_buf *kbuf;
+
+	/* Currently adding segment this way is allowed only in file mode */
+	if (!image->file_mode)
+		return -EINVAL;
+
+	if (nr_segments >= KEXEC_SEGMENT_MAX)
+		return -EINVAL;
+
+	/*
+	 * Make sure we are not trying to add buffer after allocating
+	 * control pages. All segments need to be placed first before
+	 * any control pages are allocated. As control page allocation
+	 * logic goes through list of segments to make sure there are
+	 * no destination overlaps.
+	 */
+	WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec buffer"
+			" after allocating control pages\n");
+
+	kbuf = kzalloc(sizeof(struct kexec_buf), GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	kbuf->image = image;
+	kbuf->buffer = buffer;
+	kbuf->bufsz = bufsz;
+	/* Align memsz to next page boundary */
+	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
+
+	/* Align to atleast page size boundary */
+	kbuf->buf_align = max(buf_align, PAGE_SIZE);
+	kbuf->buf_min = buf_min;
+	kbuf->buf_max = buf_max;
+	kbuf->top_down = top_down;
+
+	/* Walk the RAM ranges and allocate a suitable range for the buffer */
+	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+
+	kbuf->image = NULL;
+	kfree(kbuf);
+
+	/*
+	 * If range could be found successfully, it would have incremented
+	 * the nr_segments value.
+	 */
+	new_nr_segments = image->nr_segments;
+
+	/* A suitable memory range could not be found for buffer */
+	if (new_nr_segments == nr_segments)
+		return -EADDRNOTAVAIL;
+
+	/* Found a suitable memory range */
+
+	ksegment = &image->segment[new_nr_segments - 1];
+	*load_addr = ksegment->mem;
+	return 0;
+}
+
+
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..7e1e13d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -25,6 +25,7 @@ cond_syscall(sys_swapon);
 cond_syscall(sys_swapoff);
 cond_syscall(sys_kexec_load);
 cond_syscall(compat_sys_kexec_load);
+cond_syscall(sys_kexec_file_load);
 cond_syscall(sys_init_module);
 cond_syscall(sys_finit_module);
 cond_syscall(sys_delete_module);
-- 
1.7.7.6



* [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (3 preceding siblings ...)
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-21 19:07   ` Greg KH
  2013-11-28 11:35   ` Baoquan He
  2013-11-20 17:50 ` [PATCH 6/6] kexec: Support for Kexec on panic using new system call Vivek Goyal
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

This is loader-specific code which can load a bzImage and set it up for
64bit entry. It does not take care of 32bit entry or real mode entry
yet.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kexec-bzimage.h |   12 +
 arch/x86/include/asm/kexec.h         |   26 +++
 arch/x86/kernel/Makefile             |    2 +
 arch/x86/kernel/kexec-bzimage.c      |  375 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   |    4 +-
 arch/x86/kernel/purgatory_entry_64.S |  119 +++++++++++
 6 files changed, 537 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/include/asm/kexec-bzimage.h
 create mode 100644 arch/x86/kernel/kexec-bzimage.c
 create mode 100644 arch/x86/kernel/purgatory_entry_64.S

diff --git a/arch/x86/include/asm/kexec-bzimage.h b/arch/x86/include/asm/kexec-bzimage.h
new file mode 100644
index 0000000..d556727
--- /dev/null
+++ b/arch/x86/include/asm/kexec-bzimage.h
@@ -0,0 +1,12 @@
+#ifndef _ASM_BZIMAGE_H
+#define _ASM_BZIMAGE_H
+
+extern int bzImage64_probe(const char *buf, unsigned long len);
+extern void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len);
+extern int bzImage64_prep_entry(struct kimage *image);
+extern int bzImage64_cleanup(struct kimage *image);
+
+#endif  /* _ASM_BZIMAGE_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 17483a4..94f1257 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -15,6 +15,9 @@
 # define PAGES_NR		4
 #endif
 
+#define KEXEC_PURGATORY_PAGE_SIZE	4096
+#define KEXEC_PURGATORY_CODE_MAX_SIZE	2048
+
 # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
 
 #ifndef __ASSEMBLY__
@@ -141,6 +144,9 @@ relocate_kernel(unsigned long indirection_page,
 		unsigned long page_list,
 		unsigned long start_address,
 		unsigned int preserve_context);
+void purgatory_entry64(void);
+extern unsigned long purgatory_entry64_regs;
+extern struct desc_struct entry64_gdt;
 #endif
 
 #define ARCH_HAS_KIMAGE_ARCH
@@ -161,6 +167,26 @@ struct kimage_arch {
 	pmd_t *pmd;
 	pte_t *pte;
 };
+
+struct kexec_entry64_regs {
+	uint64_t rax;
+	uint64_t rbx;
+	uint64_t rcx;
+	uint64_t rdx;
+	uint64_t rsi;
+	uint64_t rdi;
+	uint64_t rsp;
+	uint64_t rbp;
+	uint64_t r8;
+	uint64_t r9;
+	uint64_t r10;
+	uint64_t r11;
+	uint64_t r12;
+	uint64_t r13;
+	uint64_t r14;
+	uint64_t r15;
+	uint64_t rip;
+};
 #endif
 
 typedef void crash_vmclear_fn(void);
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9b0a34e..5d074c2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
 obj-$(CONFIG_X86_TSC)		+= trace_clock.o
 obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o crash.o
+obj-$(CONFIG_KEXEC)		+= kexec-bzimage.o
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o
 obj-y				+= kprobes/
 obj-$(CONFIG_MODULES)		+= module.o
@@ -122,4 +123,5 @@ ifeq ($(CONFIG_X86_64),y)
 
 	obj-$(CONFIG_PCI_MMCONFIG)	+= mmconf-fam10h_64.o
 	obj-y				+= vsmp_64.o
+	obj-$(CONFIG_KEXEC)		+= purgatory_entry_64.o
 endif
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
new file mode 100644
index 0000000..a1032d4
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -0,0 +1,375 @@
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+#ifdef CONFIG_X86_64
+
+struct bzimage64_data {
+	unsigned long kernel_load_addr;
+	unsigned long bootparams_load_addr;
+
+	/*
+	 * Temporary buffer to hold bootparams buffer. This should be
+	 * freed once the bootparam segment has been loaded.
+	 */
+	void *bootparams_buf;
+	struct page *purgatory_page;
+};
+
+int bzImage64_probe(const char *buf, unsigned long len)
+{
+	int ret = -ENOEXEC;
+	struct setup_header *header;
+
+	if (len < 2 * 512) {
+		pr_debug("File is too short to be a bzImage\n");
+		return ret;
+	}
+
+	header = (struct setup_header *)(buf + 0x1F1);
+	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+		pr_debug("Not a bzImage\n");
+		return ret;
+	}
+
+	if (header->boot_flag != 0xAA55) {
+                /* No x86 boot sector present */
+		pr_debug("No x86 boot sector present\n");
+		return ret;
+	}
+
+	if (header->version < 0x020C) {
+                /* Must be at least protocol version 2.12 */
+		pr_debug("Must be at least protocol version 2.12\n");
+		return ret;
+	}
+
+	if ((header->loadflags & 1) == 0) {
+		/* Not a bzImage */
+		pr_debug("zImage not a bzImage\n");
+		return ret;
+	}
+
+	if ((header->xloadflags & 3) != 3) {
+		/* XLF_KERNEL_64 and XLF_CAN_BE_LOADED_ABOVE_4G should be set */
+		pr_debug("Not a relocatable bzImage64\n");
+		return ret;
+	}
+
+        /* I've got a bzImage */
+	pr_debug("It's a relocatable bzImage64\n");
+	ret = 0;
+
+	return ret;
+}
+
+static int setup_memory_map_entries(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+
+	/* TODO: What about EFI */
+	nr_e820_entries = e820_saved.nr_map;
+	if (nr_e820_entries > E820MAX)
+		nr_e820_entries = E820MAX;
+
+	params->e820_entries = nr_e820_entries;
+	memcpy(&params->e820_map, &e820_saved.map,
+			nr_e820_entries * sizeof(struct e820entry));
+
+	return 0;
+}
+
+static void setup_linux_system_parameters(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+	unsigned long long mem_k, start, end;
+	int i;
+
+	/* Get subarch from existing bootparams */
+	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+	/* Copying screen_info will do? */
+	memcpy(&params->screen_info, &boot_params.screen_info,
+				sizeof(struct screen_info));
+
+	/* Fill in memsize later */
+	params->screen_info.ext_mem_k = 0;
+	params->alt_mem_k = 0;
+
+	/* Default APM info */
+	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+	/* Default drive info */
+	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+	/* Default sysdesc table */
+	params->sys_desc_table.length = 0;
+
+	setup_memory_map_entries(params);
+	nr_e820_entries = params->e820_entries;
+
+	for(i = 0; i < nr_e820_entries; i++) {
+		if (params->e820_map[i].type != E820_RAM)
+			continue;
+		start = params->e820_map[i].addr;
+		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
+
+		if ((start <= 0x100000) && end > 0x100000) {
+			mem_k = (end >> 10) - (0x100000 >> 10);
+			params->screen_info.ext_mem_k = mem_k;
+			params->alt_mem_k = mem_k;
+			if (mem_k > 0xfc00)
+				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
+			if (mem_k > 0xffffffff)
+				params->alt_mem_k = 0xffffffff;
+		}
+	}
+
+	/* Setup EDD info */
+	memcpy(params->eddbuf, boot_params.eddbuf,
+				EDDMAXNR * sizeof(struct edd_info));
+	params->eddbuf_entries = boot_params.eddbuf_entries;
+
+	memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+			EDD_MBR_SIG_MAX * sizeof(unsigned int));
+}
+
+static void setup_initrd(struct boot_params *boot_params, unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+	boot_params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+	boot_params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+	boot_params->ext_ramdisk_image = initrd_load_addr >> 32;
+	boot_params->ext_ramdisk_size = initrd_len >> 32;
+}
+
+static void setup_cmdline(struct boot_params *boot_params,
+		unsigned long bootparams_load_addr,
+		unsigned long cmdline_offset, char *cmdline,
+		unsigned long cmdline_len)
+{
+	char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
+	unsigned long cmdline_ptr_phys;
+	uint32_t cmdline_low_32, cmdline_ext_32;
+
+	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	cmdline_ptr[cmdline_len - 1] = '\0';
+
+	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+	cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+	boot_params->hdr.cmd_line_ptr = cmdline_low_32;
+	if (cmdline_ext_32)
+		boot_params->ext_cmd_line_ptr = cmdline_ext_32;
+}
+
+void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len,
+		char *initrd, unsigned long initrd_len,
+		char *cmdline, unsigned long cmdline_len)
+{
+
+	struct setup_header *header;
+	int setup_sects, kern16_size_needed, kern16_size, ret = 0;
+	unsigned long setup_size, setup_header_size;
+	struct boot_params *params;
+	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
+	char *kernel_buf;
+	struct bzimage64_data *ldata;
+
+	header = (struct setup_header *)(kernel + 0x1F1);
+	setup_sects = header->setup_sects;
+	if (setup_sects == 0)
+		setup_sects = 4;
+
+	kern16_size = (setup_sects + 1) * 512;
+	if (kernel_len < kern16_size) {
+		pr_debug("bzImage truncated\n");
+		return ERR_PTR(-ENOEXEC);
+	}
+
+	if (cmdline_len > header->cmdline_size) {
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Allocate loader specific data */
+	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+	if (!ldata)
+		return ERR_PTR(-ENOMEM);
+
+	/* Argument/parameter segment */
+	kern16_size_needed = kern16_size;
+	if (kern16_size_needed < 4096)
+		kern16_size_needed = 4096;
+
+	setup_size = kern16_size_needed + cmdline_len;
+	params = kzalloc(setup_size, GFP_KERNEL);
+	if (!params) {
+		ret = -ENOMEM;
+		goto out_free_loader_data;
+	}
+
+	/* Copy setup header onto bootparams. */
+	setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;
+
+	/* Is there a limit on setup header size? */
+	memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
+	ret = kexec_add_buffer(image, (char *)params, setup_size,
+			setup_size, 16, 0x3000, -1, 1, &bootparam_load_addr);
+	if (ret)
+		goto out_free_params;
+	pr_debug("Loaded boot_param and command line at 0x%lx\n",
+			bootparam_load_addr);
+
+	/* Load kernel */
+	kernel_buf = kernel + kern16_size;
+	kernel_bufsz =  kernel_len - kern16_size;
+	kernel_memsz = ALIGN(header->init_size, 4096);
+	kernel_align = header->kernel_alignment;
+
+	ret = kexec_add_buffer(image, kernel_buf,
+			kernel_bufsz, kernel_memsz, kernel_align, 0x100000,
+			-1, 1, &kernel_load_addr);
+	if (ret)
+		goto out_free_params;
+
+	pr_debug("Loaded 64bit kernel at 0x%lx sz = 0x%lx\n", kernel_load_addr,
+				kernel_memsz);
+
+	/* Load initrd high */
+	if (initrd) {
+		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+			4096, 0x10000000, ULONG_MAX, 1, &initrd_load_addr);
+		if (ret)
+			goto out_free_params;
+
+		pr_debug("Loaded initrd at 0x%lx sz = 0x%lx\n",
+					initrd_load_addr, initrd_len);
+		setup_initrd(params, initrd_load_addr, initrd_len);
+	}
+
+	setup_cmdline(params, bootparam_load_addr, kern16_size_needed,
+			cmdline, cmdline_len);
+
+	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
+	params->hdr.type_of_loader = 0x0D << 4;
+	params->hdr.loadflags = 0;
+
+	setup_linux_system_parameters(params);
+
+	/*
+	 * Allocate a purgatory page. For 64bit entry point, purgatory
+	 * code can be anywhere.
+	 *
+	 * Control page allocation logic goes through segment list to
+	 * make sure allocated page is not destination page. So allocate
+	 * control page after all required segment have been prepared.
+	 */
+	ldata->purgatory_page = kimage_alloc_control_pages(image,
+					get_order(KEXEC_PURGATORY_PAGE_SIZE));
+
+	if (!ldata->purgatory_page) {
+		printk(KERN_ERR "Could not allocate purgatory page\n");
+		ret = -ENOMEM;
+		goto out_free_params;
+	}
+
+	/*
+	 * Store pointer to params so that it can be freed once the params
+	 * segment has been loaded and its contents have been copied
+	 * somewhere else.
+	 */
+	ldata->bootparams_buf = params;
+	ldata->kernel_load_addr = kernel_load_addr;
+	ldata->bootparams_load_addr = bootparam_load_addr;
+	return ldata;
+
+out_free_params:
+	kfree(params);
+out_free_loader_data:
+	kfree(ldata);
+	return ERR_PTR(ret);
+}
+
+int bzImage64_prep_entry(struct kimage *image)
+{
+	struct bzimage64_data *ldata;
+	char *purgatory_page;
+	unsigned long regs_offset, gdt_offset, purgatory_page_phys;
+	struct kexec_entry64_regs *regs;
+	char *gdt_ptr;
+	unsigned long long *gdt_addr;
+
+	if (!image->file_mode)
+		return 0;
+
+	ldata = image->image_loader_data;
+	if (!ldata)
+		return -EINVAL;
+
+	/* Copy purgatory code to its control page */
+	purgatory_page = page_address(ldata->purgatory_page);
+
+	/* Physical address of purgatory page */
+	purgatory_page_phys = PFN_PHYS(page_to_pfn(ldata->purgatory_page));
+
+	memcpy(purgatory_page, purgatory_entry64,
+			KEXEC_PURGATORY_CODE_MAX_SIZE);
+
+	/* Set registers appropriately */
+	regs_offset =  (unsigned long)&purgatory_entry64_regs -
+			(unsigned long)purgatory_entry64;
+	regs = (struct kexec_entry64_regs *) (purgatory_page + regs_offset);
+
+	regs->rbx = 0; /* Bootstrap Processor */
+	regs->rsi = ldata->bootparams_load_addr;
+	regs->rip = ldata->kernel_load_addr + 0x200;
+
+	/* Fix up gdt */
+	gdt_offset = (unsigned long)&entry64_gdt -
+			(unsigned long)purgatory_entry64;
+
+	gdt_ptr = purgatory_page + gdt_offset;
+
+	/* Skip a word which contains size of gdt table */
+	gdt_addr = (unsigned long long *)(gdt_ptr + 2);
+
+	*gdt_addr = (unsigned long long)gdt_ptr;
+
+	/*
+	 * Update the relocated address of gdt. By the time we load gdt
+	 * in purgatory, we are running using identity mapped tables.
+	 * Load identity mapped address here.
+	 */
+	*gdt_addr = (unsigned long long)(purgatory_page_phys + gdt_offset);
+
+	/*
+	 * Jump to the purgatory code in the control page. By the time we
+	 * jump to purgatory, we are using identity mapped page tables
+	 */
+	kimage_set_start_addr(image, purgatory_page_phys);
+	return 0;
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int bzImage64_cleanup(struct kimage *image)
+{
+	struct bzimage64_data *ldata = image->image_loader_data;
+
+	kfree(ldata->bootparams_buf);
+	ldata->bootparams_buf = NULL;
+	return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index fb41b73..a66ce1d 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -21,10 +21,12 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
+#include <asm/kexec-bzimage.h>
 
 /* arch dependent functionality related to kexec file based syscall */
 static struct kexec_file_type kexec_file_type[]={
-	{"", NULL, NULL, NULL, NULL},
+	{"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_prep_entry,
+	 bzImage64_cleanup},
 };
 
 static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
diff --git a/arch/x86/kernel/purgatory_entry_64.S b/arch/x86/kernel/purgatory_entry_64.S
new file mode 100644
index 0000000..12a235f
--- /dev/null
+++ b/arch/x86/kernel/purgatory_entry_64.S
@@ -0,0 +1,119 @@
+/*
+ * Copyright (C) 2013  Red Hat Inc.
+ *
+ * Author(s): Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+
+/*
+ * One page for purgatory. Code occupies first KEXEC_PURGATORY_CODE_MAX_SIZE
+ * bytes. Rest is for data/stack etc.
+ */
+#include <asm/page.h>
+
+	.text
+	.align PAGE_SIZE
+	.code64
+	.globl purgatory_entry64, purgatory_entry64_regs, entry64_gdt
+
+
+purgatory_entry64:
+	/* Setup a gdt that should be preserved */
+	lgdt entry64_gdt(%rip)
+
+	/* load the data segments */
+	movl    $0x18, %eax     /* data segment */
+	movl    %eax, %ds
+	movl    %eax, %es
+	movl    %eax, %ss
+	movl    %eax, %fs
+	movl    %eax, %gs
+
+	/* Setup new stack */
+	leaq    stack_init(%rip), %rsp
+	pushq   $0x10 /* CS */
+	leaq    new_cs_exit(%rip), %rax
+	pushq   %rax
+	lretq
+new_cs_exit:
+
+	/*
+	 * Load the registers except rsp. rsp is already loaded with stack
+	 * at the end of this page
+	 */
+	movq	rax(%rip), %rax
+	movq	rbx(%rip), %rbx
+	movq	rcx(%rip), %rcx
+	movq	rdx(%rip), %rdx
+	movq	rsi(%rip), %rsi
+	movq	rdi(%rip), %rdi
+	movq	rbp(%rip), %rbp
+	movq	r8(%rip), %r8
+	movq	r9(%rip), %r9
+	movq	r10(%rip), %r10
+	movq	r11(%rip), %r11
+	movq	r12(%rip), %r12
+	movq	r13(%rip), %r13
+	movq	r14(%rip), %r14
+	movq	r15(%rip), %r15
+
+	/* Jump to the new code... */
+	jmpq	*rip(%rip)
+
+	.balign 16
+purgatory_entry64_regs:
+rax:	.quad 0x00000000
+rbx:	.quad 0x00000000
+rcx:	.quad 0x00000000
+rdx:	.quad 0x00000000
+rsi:	.quad 0x00000000
+rdi:	.quad 0x00000000
+rsp:	.quad 0x00000000
+rbp:	.quad 0x00000000
+r8:	.quad 0x00000000
+r9:	.quad 0x00000000
+r10:	.quad 0x00000000
+r11:	.quad 0x00000000
+r12:	.quad 0x00000000
+r13:	.quad 0x00000000
+r14:	.quad 0x00000000
+r15:	.quad 0x00000000
+rip:	.quad 0x00000000
+
+	/* GDT */
+	.balign 16
+entry64_gdt:
+	/* 0x00 unusable segment
+	 * 0x08 unused
+	 * so use them as gdt ptr
+	 */
+	.word gdt_end - entry64_gdt - 1
+	.quad entry64_gdt
+	.word 0, 0, 0
+
+	/* 0x10 4GB flat code segment */
+	.word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+	/* 0x18 4GB flat data segment */
+	.word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+
+	.globl kexec_purgatory_code_size
+.set kexec_purgatory_code_size, . - purgatory_entry64
+
+/* Fill rest of the page with zeros to be used as stack */
+stack: .fill purgatory_entry64 + PAGE_SIZE - ., 1, 0
+stack_init:
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (4 preceding siblings ...)
  2013-11-20 17:50 ` [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
@ 2013-11-20 17:50 ` Vivek Goyal
  2013-11-28 11:28   ` Baoquan He
  2013-12-04  1:41   ` Baoquan He
  2013-11-21 18:58 ` [PATCH 0/6] kexec: A new system call to allow in kernel loading Greg KH
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-20 17:50 UTC (permalink / raw)
  To: linux-kernel, kexec; +Cc: ebiederm, hpa, mjg59, greg, Vivek Goyal

This patch adds support for loading a kexec on panic (kdump) kernel using the
new system call.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/crash.h       |    9 +
 arch/x86/include/asm/kexec.h       |   17 +
 arch/x86/kernel/crash.c            |  585 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kexec-bzimage.c    |   63 ++++-
 arch/x86/kernel/machine_kexec_64.c |    1 +
 kernel/kexec.c                     |   69 ++++-
 6 files changed, 731 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/crash.h

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
new file mode 100644
index 0000000..2dd2eb8
--- /dev/null
+++ b/arch/x86/include/asm/crash.h
@@ -0,0 +1,9 @@
+#ifndef _ASM_X86_CRASH_H
+#define _ASM_X86_CRASH_H
+
+int load_crashdump_segments(struct kimage *image);
+int crash_copy_backup_region(struct kimage *image);
+int crash_setup_memmap_entries(struct kimage *image,
+		struct boot_params *params);
+
+#endif /* _ASM_X86_CRASH_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 94f1257..9dc19fe 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -64,6 +64,10 @@
 # define KEXEC_ARCH KEXEC_ARCH_X86_64
 #endif
 
+/* Memory to backup during crash kdump */
+#define KEXEC_BACKUP_SRC_START	(0UL)
+#define KEXEC_BACKUP_SRC_END	(655360UL)	/* 640K */
+
 /*
  * CPU does not save ss and sp on stack if execution is already
  * running in kernel mode at the time of NMI occurrence. This code
@@ -166,8 +170,21 @@ struct kimage_arch {
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	/* Details of backup region */
+	unsigned long backup_src_start;
+	unsigned long backup_src_sz;
+
+	/* Physical address of backup segment */
+	unsigned long backup_load_addr;
+
+	/* Core ELF header buffer */
+	unsigned long elf_headers;
+	unsigned long elf_headers_sz;
+	unsigned long elf_load_addr;
 };
+#endif /* CONFIG_X86_32 */
 
+#ifdef CONFIG_X86_64
 struct kexec_entry64_regs {
 	uint64_t rax;
 	uint64_t rbx;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 18677a9..d5d3118 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -4,6 +4,9 @@
  * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
  *
  * Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2013. All rights reserved.
+ * Authors:
+ * 	Vivek Goyal <vgoyal@redhat.com>
  *
  */
 
@@ -17,6 +20,7 @@
 #include <linux/elf.h>
 #include <linux/elfcore.h>
 #include <linux/module.h>
+#include <linux/slab.h>
 
 #include <asm/processor.h>
 #include <asm/hardirq.h>
@@ -29,6 +33,45 @@
 #include <asm/reboot.h>
 #include <asm/virtext.h>
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+/* This primarily represents the number of split ranges due to exclusion */
+#define CRASH_MAX_RANGES	16
+
+struct crash_mem_range {
+	unsigned long long start, end;
+};
+
+struct crash_mem {
+	unsigned int nr_ranges;
+	struct crash_mem_range ranges[CRASH_MAX_RANGES];
+};
+
+/* Misc data about ram ranges needed to prepare elf headers */
+struct crash_elf_data {
+	struct kimage *image;
+	/*
+	 * Total number of ram ranges we have after various adjustments for
+	 * GART, crash reserved region etc.
+	 */
+	unsigned int max_nr_ranges;
+	unsigned long gart_start, gart_end;
+
+	/* Pointer to elf header */
+	void *ehdr;
+	/* Pointer to next phdr */
+	void *bufp;
+	struct crash_mem mem;
+};
+
+/* Used while preparing memory map entries for second kernel */
+struct crash_memmap_data {
+	struct boot_params *params;
+	/* Type of memory */
+	unsigned int type;
+};
+
 int in_crash_kexec;
 
 /*
@@ -138,3 +181,545 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
+
+#ifdef CONFIG_X86_64
+static int get_nr_ram_ranges_callback(unsigned long start_pfn,
+				unsigned long nr_pfn, void *arg)
+{
+	int *nr_ranges = arg;
+
+	(*nr_ranges)++;
+	return 0;
+}
+
+static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+
+	ced->gart_start = start;
+	ced->gart_end = end;
+
+	/* Not expecting more than 1 gart aperture */
+	return 1;
+}
+
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static int fill_up_ced(struct crash_elf_data *ced, struct kimage *image)
+{
+	unsigned int nr_ranges = 0;
+
+	ced->image = image;
+
+	walk_system_ram_range(0, -1, &nr_ranges,
+				get_nr_ram_ranges_callback);
+
+	ced->max_nr_ranges = nr_ranges;
+
+	/*
+	 * We don't create ELF headers for GART aperture as an attempt
+	 * to dump this memory in second kernel leads to hang/crash.
+	 * If gart aperture is present, one needs to exclude that region
+	 * and that could lead to the need for an extra phdr.
+	 */
+
+	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
+				ced, get_gart_ranges_callback);
+
+	/*
+	 * If we have gart region, excluding that could potentially split
+	 * a memory range, resulting in an extra header. Account for that.
+	 */
+	if (ced->gart_end)
+		ced->max_nr_ranges++;
+
+	/* Exclusion of crash region could split memory ranges */
+	ced->max_nr_ranges++;
+
+	/* If crashk_low_res is there, another range split possible */
+	if (crashk_low_res.end != 0)
+		ced->max_nr_ranges++;
+
+	return 0;
+}
+
+static int exclude_mem_range(struct crash_mem *mem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	int i, j;
+	unsigned long long start, end;
+	struct crash_mem_range temp_range = {0, 0};
+
+	for (i = 0; i < mem->nr_ranges; i++) {
+		start = mem->ranges[i].start;
+		end = mem->ranges[i].end;
+
+		if (mstart > end || mend < start)
+			continue;
+
+		/* Truncate any area outside of range */
+		if (mstart < start)
+			mstart = start;
+		if (mend > end)
+			mend = end;
+
+		/* Found completely overlapping range */
+		if (mstart == start && mend == end) {
+			mem->ranges[i].start = 0;
+			mem->ranges[i].end = 0;
+			if (i < mem->nr_ranges - 1) {
+				/* Shift rest of the ranges to left */
+				for(j = i; j < mem->nr_ranges - 1; j++) {
+					mem->ranges[j].start =
+						mem->ranges[j+1].start;
+					mem->ranges[j].end =
+							mem->ranges[j+1].end;
+				}
+			}
+			mem->nr_ranges--;
+			return 0;
+		}
+
+		if (mstart > start && mend < end) {
+			/* Split original range */
+			mem->ranges[i].end = mstart - 1;
+			temp_range.start = mend + 1;
+			temp_range.end = end;
+		} else if (mstart != start)
+			mem->ranges[i].end = mstart - 1;
+		else
+			mem->ranges[i].start = mend + 1;
+		break;
+	}
+
+	/* If a split happened, add the split range to the array */
+	if (!temp_range.end)
+		return 0;
+
+	/* Split happened */
+	if (i == CRASH_MAX_RANGES - 1) {
+		pr_err("Too many crash ranges after split\n");
+		return -ENOMEM;
+	}
+
+	/* Location where new range should go */
+	j = i + 1;
+	if (j < mem->nr_ranges) {
+		/* Move over all ranges one place */
+		for (i = mem->nr_ranges - 1; i >= j; i--)
+			mem->ranges[i + 1] = mem->ranges[i];
+	}
+
+	mem->ranges[j].start = temp_range.start;
+	mem->ranges[j].end = temp_range.end;
+	mem->nr_ranges++;
+	return 0;
+}
+
+/*
+ * Look for any unwanted ranges between mstart and mend and remove them. This
+ * might split ranges; the resulting ranges are put in ced->mem.ranges[] array
+ */
+static int elf_header_exclude_ranges(struct crash_elf_data *ced,
+		unsigned long long mstart, unsigned long long mend)
+{
+	struct crash_mem *cmem = &ced->mem;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude crashkernel region */
+	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+	if (ret)
+		return ret;
+
+	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
+	if (ret)
+		return ret;
+
+	/* Exclude GART region */
+	if (ced->gart_end) {
+		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long mstart, mend;
+	struct kimage *image = ced->image;
+	struct crash_mem *cmem;
+	int ret, i;
+
+	ehdr = ced->ehdr;
+
+	/* Exclude unwanted mem ranges */
+	ret = elf_header_exclude_ranges(ced, start, end);
+	if (ret)
+		return ret;
+
+	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
+	cmem = &ced->mem;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		mstart = cmem->ranges[i].start;
+		mend = cmem->ranges[i].end;
+
+		phdr = ced->bufp;
+		ced->bufp += sizeof(Elf64_Phdr);
+
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_offset  = mstart;
+
+		/*
+		 * If a range matches backup region, adjust offset to backup
+		 * segment.
+		 */
+		if (mstart == image->arch.backup_src_start &&
+		    (mend - mstart + 1) == image->arch.backup_src_sz)
+			phdr->p_offset = image->arch.backup_load_addr;
+
+		phdr->p_paddr = mstart;
+		phdr->p_vaddr = (unsigned long long) __va(mstart);
+		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+		phdr->p_align = 0;
+		ehdr->e_phnum++;
+		pr_debug("Crash PT_LOAD elf header. phdr=%p"
+			" vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d"
+			" p_offset=0x%llx\n", phdr, phdr->p_vaddr,
+			phdr->p_paddr, phdr->p_filesz, ehdr->e_phnum,
+			phdr->p_offset);
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_headers(struct crash_elf_data *ced,
+		unsigned long *addr, unsigned long *sz)
+{
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long nr_cpus = NR_CPUS, nr_phdr, elf_sz;
+	unsigned char *buf, *bufp;
+	unsigned int cpu;
+	unsigned long long notes_addr;
+	int ret;
+
+	/* extra phdr for vmcoreinfo elf note */
+	nr_phdr = nr_cpus + 1;
+	nr_phdr += ced->max_nr_ranges;
+
+	/*
+	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
+	 * I think this is required by tools like gdb. So same physical
+	 * memory will be mapped in two elf headers. One will contain kernel
+	 * text virtual addresses and the other will have __va(physical) addresses.
+	 */
+
+	nr_phdr++;
+	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+	buf = vzalloc(elf_sz);
+	if (!buf)
+		return -ENOMEM;
+
+	bufp = buf;
+	ehdr = (Elf64_Ehdr *)bufp;
+	bufp += sizeof(Elf64_Ehdr);
+	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+	ehdr->e_type = ET_CORE;
+	ehdr->e_machine = ELF_ARCH;
+	ehdr->e_version = EV_CURRENT;
+	ehdr->e_entry = 0;
+	ehdr->e_phoff = sizeof(Elf64_Ehdr);
+	ehdr->e_shoff = 0;
+	ehdr->e_flags = 0;
+	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+	ehdr->e_phentsize = sizeof(Elf64_Phdr);
+	ehdr->e_phnum = 0;
+	ehdr->e_shentsize = 0;
+	ehdr->e_shnum = 0;
+	ehdr->e_shstrndx = 0;
+
+	/* Prepare one phdr of type PT_NOTE for each present cpu */
+	for_each_present_cpu(cpu) {
+		phdr = (Elf64_Phdr *)bufp;
+		bufp += sizeof(Elf64_Phdr);
+		phdr->p_type = PT_NOTE;
+		phdr->p_flags = 0;
+		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+		phdr->p_offset = phdr->p_paddr = notes_addr;
+		phdr->p_vaddr = 0;
+		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+		phdr->p_align = 0;
+		(ehdr->e_phnum)++;
+	}
+
+	/* Prepare one PT_NOTE header for vmcoreinfo */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_NOTE;
+	phdr->p_flags = 0;
+	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+	phdr->p_vaddr = 0;
+	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+
+#ifdef CONFIG_X86_64
+	/* Prepare PT_LOAD type program header for kernel text region */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_LOAD;
+	phdr->p_flags = PF_R|PF_W|PF_X;
+	phdr->p_vaddr = (Elf64_Addr)_text;
+	phdr->p_filesz = phdr->p_memsz = _end - _text;
+	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+#endif
+
+	/* Prepare PT_LOAD headers for system ram chunks. */
+	ced->ehdr = ehdr;
+	ced->bufp = bufp;
+	ret = walk_system_ram_res(0, -1, ced,
+			prepare_elf64_ram_headers_callback);
+	if (ret < 0)
+		return ret;
+
+	*addr = (unsigned long)buf;
+	*sz = elf_sz;
+	return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(struct kimage *image, unsigned long *addr,
+					unsigned long *sz)
+{
+	struct crash_elf_data *ced;
+	int ret;
+
+	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
+	if (!ced)
+		return -ENOMEM;
+
+	ret = fill_up_ced(ced, image);
+	if (ret)
+		goto out;
+
+	/* By default prepare 64bit headers */
+	ret =  prepare_elf64_headers(ced, addr, sz);
+out:
+	kfree(ced);
+	return ret;
+}
+
+static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
+{
+	unsigned int nr_e820_entries;
+
+	nr_e820_entries = params->e820_entries;
+	if (nr_e820_entries >= E820MAX)
+		return 1;
+
+	memcpy(&params->e820_map[nr_e820_entries], entry,
+                        sizeof(struct e820entry));
+	params->e820_entries++;
+
+	pr_debug("Add e820 entry to bootparams. addr=0x%llx size=0x%llx"
+		" type=%d\n", entry->addr, entry->size, entry->type);
+	return 0;
+}
+
+static int memmap_entry_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_memmap_data *cmd = arg;
+	struct boot_params *params = cmd->params;
+	struct e820entry ei;
+
+	ei.addr = start;
+	ei.size = end - start + 1;
+	ei.type = cmd->type;
+	add_e820_entry(params, &ei);
+
+	return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	unsigned long start, end;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude Backup region */
+	start = image->arch.backup_load_addr;
+	end = start + image->arch.backup_src_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	if (ret)
+		return ret;
+
+	/* Exclude elf header region */
+	start = image->arch.elf_load_addr;
+	end = start + image->arch.elf_headers_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	return ret;
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+	int i, ret = 0;
+	unsigned long flags;
+	struct e820entry ei;
+	struct crash_memmap_data cmd;
+	struct crash_mem *cmem;
+
+	cmem = vzalloc(sizeof(struct crash_mem));
+	if (!cmem)
+		return -ENOMEM;
+
+	memset(&cmd, 0, sizeof(struct crash_memmap_data));
+	cmd.params = params;
+
+	/* Add first 640K segment */
+	ei.addr = image->arch.backup_src_start;
+	ei.size = image->arch.backup_src_sz;
+	ei.type = E820_RAM;
+	add_e820_entry(params, &ei);
+
+	/* Add ACPI tables */
+	cmd.type = E820_ACPI;
+	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
+
+	/* Add ACPI Non-volatile Storage */
+	cmd.type = E820_NVS;
+	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+			memmap_entry_callback);
+
+	/* Add crashk_low_res region */
+	if (crashk_low_res.end) {
+		ei.addr = crashk_low_res.start;
+		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
+		ei.type = E820_RAM;
+		add_e820_entry(params, &ei);
+	}
+
+	/* Exclude some ranges from crashk_res and add rest to memmap */
+	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
+						crashk_res.end);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		ei.addr = cmem->ranges[i].start;
+		ei.size = cmem->ranges[i].end - ei.addr + 1;
+		ei.type = E820_RAM;
+
+		/* If entry is less than a page, skip it */
+		if (ei.size < PAGE_SIZE) {
+			continue;
+		}
+		add_e820_entry(params, &ei);
+	}
+
+out:
+	vfree(cmem);
+	return ret;
+}
+
+static int determine_backup_region(u64 start, u64 end, void *arg)
+{
+	struct kimage *image = arg;
+
+	image->arch.backup_src_start = start;
+	image->arch.backup_src_sz = end - start + 1;
+
+	/* Expecting only one range for backup region */
+	return 1;
+}
+
+int load_crashdump_segments(struct kimage *image)
+{
+	unsigned long src_start, src_sz;
+	unsigned long elf_addr, elf_sz;
+	int ret;
+
+	/*
+	 * Determine and load a segment for backup area. First 640K RAM
+	 * region is backup source
+	 */
+
+	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
+				image, determine_backup_region);
+
+	/* Zero or positive return values are ok */
+	if (ret < 0)
+		return ret;
+
+	src_start = image->arch.backup_src_start;
+	src_sz = image->arch.backup_src_sz;
+
+	/* Add backup segment. */
+	if (src_sz) {
+		ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
+					PAGE_SIZE, 0, -1, 0,
+					&image->arch.backup_load_addr);
+		if (ret)
+			return ret;
+	}
+
+	/* Prepare elf headers and add a segment */
+	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+	if (ret)
+		return ret;
+
+	image->arch.elf_headers = elf_addr;
+	image->arch.elf_headers_sz = elf_sz;
+
+	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
+			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
+			&image->arch.elf_load_addr);
+	if (ret)
+		vfree((void *)image->arch.elf_headers);
+
+	return ret;
+}
+
+int crash_copy_backup_region(struct kimage *image)
+{
+	unsigned long dest_start, src_start, src_sz;
+
+	dest_start = image->arch.backup_load_addr;
+	src_start = image->arch.backup_src_start;
+	src_sz = image->arch.backup_src_sz;
+
+	memcpy(__va(dest_start), __va(src_start), src_sz);
+
+	return 0;
+}
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
index a1032d4..606942c 100644
--- a/arch/x86/kernel/kexec-bzimage.c
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -8,6 +8,9 @@
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
+#include <asm/crash.h>
+
+#define MAX_ELFCOREHDR_STR_LEN	30 	/* elfcorehdr=0x<64bit-value> */
 
 #ifdef CONFIG_X86_64
 
@@ -86,7 +89,8 @@ static int setup_memory_map_entries(struct boot_params *params)
 	return 0;
 }
 
-static void setup_linux_system_parameters(struct boot_params *params)
+static void setup_linux_system_parameters(struct kimage *image,
+			struct boot_params *params)
 {
 	unsigned int nr_e820_entries;
 	unsigned long long mem_k, start, end;
@@ -113,7 +117,10 @@ static void setup_linux_system_parameters(struct boot_params *params)
 	/* Default sysdesc table */
 	params->sys_desc_table.length = 0;
 
-	setup_memory_map_entries(params);
+	if (image->type == KEXEC_TYPE_CRASH)
+		crash_setup_memmap_entries(image, params);
+	else
+		setup_memory_map_entries(params);
 	nr_e820_entries = params->e820_entries;
 
 	for(i = 0; i < nr_e820_entries; i++) {
@@ -151,18 +158,23 @@ static void setup_initrd(struct boot_params *boot_params, unsigned long initrd_l
 	boot_params->ext_ramdisk_size = initrd_len >> 32;
 }
 
-static void setup_cmdline(struct boot_params *boot_params,
+static void setup_cmdline(struct kimage *image, struct boot_params *boot_params,
 		unsigned long bootparams_load_addr,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len)
 {
 	char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
-	unsigned long cmdline_ptr_phys;
+	unsigned long cmdline_ptr_phys, len;
 	uint32_t cmdline_low_32, cmdline_ext_32;
 
 	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	if (image->type == KEXEC_TYPE_CRASH) {
+		len = sprintf(cmdline_ptr + cmdline_len - 1,
+			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
+		cmdline_len += len;
+	}
 	cmdline_ptr[cmdline_len - 1] = '\0';
-
+	pr_debug("Final command line is:%s\n", cmdline_ptr);
 	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
 	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
 	cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -203,17 +215,34 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 		return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * In case of crash dump, we will append elfcorehdr=<addr> to
+	 * command line. Make sure it does not overflow
+	 */
+	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+		ret = -EINVAL;
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	/* Allocate loader specific data */
 	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
 	if (!ldata)
 		return ERR_PTR(-ENOMEM);
 
+	/* Allocate and load backup region */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = load_crashdump_segments(image);
+		if (ret)
+			goto out_free_loader_data;
+	}
+
 	/* Argument/parameter segment */
 	kern16_size_needed = kern16_size;
 	if (kern16_size_needed < 4096)
 		kern16_size_needed = 4096;
 
-	setup_size = kern16_size_needed + cmdline_len;
+	setup_size = kern16_size_needed + cmdline_len + MAX_ELFCOREHDR_STR_LEN;
 	params = kzalloc(setup_size, GFP_KERNEL);
 	if (!params) {
 		ret = -ENOMEM;
@@ -259,14 +288,14 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 		setup_initrd(params, initrd_load_addr, initrd_len);
 	}
 
-	setup_cmdline(params, bootparam_load_addr, kern16_size_needed,
+	setup_cmdline(image, params, bootparam_load_addr, kern16_size_needed,
 			cmdline, cmdline_len);
 
 	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
 	params->hdr.type_of_loader = 0x0D << 4;
 	params->hdr.loadflags = 0;
 
-	setup_linux_system_parameters(params);
+	setup_linux_system_parameters(image, params);
 
 	/*
 	 * Allocate a purgatory page. For 64bit entry point, purgatory
@@ -302,7 +331,7 @@ out_free_loader_data:
 	return ERR_PTR(ret);
 }
 
-int bzImage64_prep_entry(struct kimage *image)
+static int prepare_purgatory(struct kimage *image)
 {
 	struct bzimage64_data *ldata;
 	char *purgatory_page;
@@ -362,6 +391,22 @@ int bzImage64_prep_entry(struct kimage *image)
 	return 0;
 }
 
+int bzImage64_prep_entry(struct kimage *image)
+{
+	if (!image->file_mode)
+		return 0;
+
+	if (!image->image_loader_data)
+		return -EINVAL;
+
+	prepare_purgatory(image);
+
+	if (image->type == KEXEC_TYPE_CRASH)
+		crash_copy_backup_region(image);
+
+	return 0;
+}
+
 /* This cleanup function is called after various segments have been loaded */
 int bzImage64_cleanup(struct kimage *image)
 {
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a66ce1d..9d7a42d 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ int arch_image_file_post_load_cleanup(struct kimage *image)
 {
 	int idx = image->file_handler_idx;
 
+	vfree((void *)image->arch.elf_headers);
 	if (kexec_file_type[idx].cleanup)
 		return kexec_file_type[idx].cleanup(image);
 	return 0;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 50bcaa8..64184a7 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -524,7 +524,6 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 	*rimage = image;
 	return 0;
 
-
 out_free_control_pages:
 	kimage_free_page_list(&image->control_pages);
 out_free_image:
@@ -532,6 +531,54 @@ out_free_image:
 	return result;
 }
 
+static int kimage_file_crash_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	/* Enable the special crash kernel control page allocation policy. */
+	image->control_page = crashk_res.start;
+	image->type = KEXEC_TYPE_CRASH;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		printk(KERN_ERR "Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
+
 static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -1130,7 +1177,12 @@ static int kimage_load_crash_segment(struct kimage *image,
 			/* Zero the trailing part of the page */
 			memset(ptr + uchunk, 0, mchunk - uchunk);
 		}
-		result = copy_from_user(ptr, buf, uchunk);
+
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, buf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kexec_flush_icache_page(page);
 		kunmap(page);
 		if (result) {
@@ -1358,7 +1410,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __us
 	if (flags & KEXEC_FILE_UNLOAD)
 		goto exchange;
 
-	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+	if (flags & KEXEC_FILE_ON_CRASH)
+		ret = kimage_file_crash_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	else
+		ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
 				cmdline_ptr, cmdline_len);
 	if (ret)
 		goto out;
@@ -2108,7 +2164,12 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
 	kbuf->top_down = top_down;
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+	if (image->type == KEXEC_TYPE_CRASH)
+		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
+				crashk_res.start, crashk_res.end, kbuf,
+				walk_ram_range_callback);
+	else
+		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
 	kbuf->image = NULL;
 	kfree(kbuf);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (5 preceding siblings ...)
  2013-11-20 17:50 ` [PATCH 6/6] kexec: Support for Kexec on panic using new system call Vivek Goyal
@ 2013-11-21 18:58 ` Greg KH
  2013-11-21 19:07   ` Vivek Goyal
  2013-11-21 19:06 ` Geert Uytterhoeven
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 90+ messages in thread
From: Greg KH @ 2013-11-21 18:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Wed, Nov 20, 2013 at 12:50:45PM -0500, Vivek Goyal wrote:
> Current proposed secureboot implementation disables kexec/kdump because
> it can allow unsigned kernel to run on a secureboot platform. Intial
> idea was to sign /sbin/kexec binary and let that binary do the kernel
> signature verification. I had posted RFC patches for this apparoach
> here.
> 
> https://lkml.org/lkml/2013/9/10/560
> 
> Later we had discussion at Plumbers and most of the people thought
> that signing and trusting /sbin/kexec is becoming complex. So a 
> better idea might be let kernel do the signature verification of
> new kernel being loaded. This calls for implementing a new system call
> and moving lot of user space code in kernel.
> 
> kexec_load() system call allows loading a kexec/kdump kernel and jump
> to that kernel at right time. Though a lot of processing is done in
> user space which prepares a list of segments/buffers to be loaded and
> kexec_load() works on that list of segments. It does not know what's
> contained in those segments.
> 
> Now a new system call kexec_file_load() is implemented which takes
> kernel fd and initrd fd as parameters. Now kernel should be able
> to verify signature of newly loaded kernel. 
> 
> This is an early RFC patchset. I have not done signature handling
> part yet. This is more of a minimal patch to show how new system
> call and functionality will look like. Right now it can only handle
> bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> other architecture. Rest of the things can be added slowly as need
> arises. In first iteration, I have tried to address most common use case
> for us.

Very good stuff, thanks for working on this.  How have you been testing
this on the userspace side?  Are there patches to kexec, or are you just
using a small test program with the new syscall?

I'll comment on the patches separately.

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 1/6] kexec: Export vmcoreinfo note size properly
  2013-11-20 17:50 ` [PATCH 1/6] kexec: Export vmcoreinfo note size properly Vivek Goyal
@ 2013-11-21 18:59   ` Greg KH
  2013-11-21 19:08     ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Greg KH @ 2013-11-21 18:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Wed, Nov 20, 2013 at 12:50:46PM -0500, Vivek Goyal wrote:
> Right now we seem to be exporting the max data size contained inside
> vmcoreinfo note. But this does not include the size of meta data around
> vmcore info data. Like name of the note and starting and ending elf_note.
> 
> I think user space expects total size and that size is put in PT_NOTE
> elf header.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  kernel/ksysfs.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)

This should go into 3.13, right?

Nice fix.

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
@ 2013-11-21 19:03   ` Greg KH
  2013-11-21 19:06     ` Matthew Garrett
                       ` (2 more replies)
  2013-11-22 20:42   ` Jiri Kosina
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 90+ messages in thread
From: Greg KH @ 2013-11-21 19:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Wed, Nov 20, 2013 at 12:50:49PM -0500, Vivek Goyal wrote:
> This patch implements the in kernel kexec functionality. It implements a
> new system call kexec_file_load. I think parameter list of this system
> call will change as I have not done the kernel image signature handling
> yet. I have been told that I might have to pass the detached signature
> and size as part of system call.

This could be done as we do with modules, and just tack the signature
onto the end of the 'blob' of the image.  That way we could use the same
tool to sign the binary as we do for modules, and save the need for
extra parameters in the syscall.
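
For what it's worth, a rough, untested sketch of what that detection could
look like on the kernel side. This is not from the posted patchset: the
helper name is made up, only the marker string is the one module signing
already appends, and the signature metadata (struct module_signature) would
sit just before it.

#define KEXEC_SIG_MARKER "~Module signature appended~\n"

static int kexec_strip_appended_sig(const char *buf, unsigned long *len)
{
	const unsigned long marker_len = sizeof(KEXEC_SIG_MARKER) - 1;

	if (*len < marker_len ||
	    memcmp(buf + *len - marker_len, KEXEC_SIG_MARKER, marker_len))
		return -ENOENT;	/* no appended signature */

	/* caller would parse the struct module_signature placed just
	 * before the marker and then verify the stripped image */
	*len -= marker_len;
	return 0;
}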

> +/*
> + * Free up tempory buffers allocated which are not needed after image has
> + * been loaded.
> + *
> + * Free up memory used by kernel, initrd, and comand line. This is temporary
> + * memory allocation which is not needed any more after these buffers have
> + * been loaded into separate segments and have been copied elsewhere
> + */
> +static void kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	if (image->kernel_buf) {
> +		vfree(image->kernel_buf);
> +		image->kernel_buf = NULL;
> +	}
> +
> +	if (image->initrd_buf) {
> +		vfree(image->initrd_buf);
> +		image->initrd_buf = NULL;
> +	}
> +
> +	if (image->cmdline_buf) {
> +		vfree(image->cmdline_buf);
> +		image->cmdline_buf = NULL;
> +	}

No need to check the buffer before calling vfree(), it can handle NULL
just fine.
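
For illustration, the same cleanup with the checks dropped would just be the
following (sketch only, reusing the field names from the quoted hunk):

static void kimage_file_post_load_cleanup(struct kimage *image)
{
	/* vfree() is a no-op for a NULL pointer, so no checks are needed */
	vfree(image->kernel_buf);
	image->kernel_buf = NULL;

	vfree(image->initrd_buf);
	image->initrd_buf = NULL;

	vfree(image->cmdline_buf);
	image->cmdline_buf = NULL;
}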

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (6 preceding siblings ...)
  2013-11-21 18:58 ` [PATCH 0/6] kexec: A new system call to allow in kernel loading Greg KH
@ 2013-11-21 19:06 ` Geert Uytterhoeven
  2013-11-21 19:14   ` Vivek Goyal
  2013-11-21 23:07 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 90+ messages in thread
From: Geert Uytterhoeven @ 2013-11-21 19:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, Eric W. Biederman, H. Peter Anvin,
	Matthew Garrett, Greg Kroah-Hartman

On Wed, Nov 20, 2013 at 6:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> Now a new system call kexec_file_load() is implemented which takes
> kernel fd and initrd fd as parameters. Now kernel should be able
> to verify signature of newly loaded kernel.

Only kernel fd and initrd fd?
What about other fds (e.g. device tree, bootinfo)?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:03   ` Greg KH
@ 2013-11-21 19:06     ` Matthew Garrett
  2013-11-21 19:13       ` Vivek Goyal
  2013-11-21 19:16     ` Vivek Goyal
  2013-11-22  1:03     ` Kees Cook
  2 siblings, 1 reply; 90+ messages in thread
From: Matthew Garrett @ 2013-11-21 19:06 UTC (permalink / raw)
  To: Greg KH; +Cc: Vivek Goyal, linux-kernel, kexec, ebiederm, hpa

On Thu, Nov 21, 2013 at 11:03:50AM -0800, Greg KH wrote:

> This could be done as we do with modules, and just tack the signature
> onto the end of the 'blob' of the image.  That way we could use the same
> tool to sign the binary as we do for modules, and save the need for
> extra parameters in the syscall.

That would require a certain degree of massaging from userspace if we 
want to be able to use the existing Authenticode signatures. Otherwise 
we need to sign kernels twice.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-20 17:50 ` [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
@ 2013-11-21 19:07   ` Greg KH
  2013-11-21 19:21     ` Vivek Goyal
  2013-11-28 11:35   ` Baoquan He
  1 sibling, 1 reply; 90+ messages in thread
From: Greg KH @ 2013-11-21 19:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Wed, Nov 20, 2013 at 12:50:50PM -0500, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry
> yet.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kexec-bzimage.h |   12 +
>  arch/x86/include/asm/kexec.h         |   26 +++
>  arch/x86/kernel/Makefile             |    2 +
>  arch/x86/kernel/kexec-bzimage.c      |  375 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   |    4 +-
>  arch/x86/kernel/purgatory_entry_64.S |  119 +++++++++++
>  6 files changed, 537 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-bzimage.h
>  create mode 100644 arch/x86/kernel/kexec-bzimage.c
>  create mode 100644 arch/x86/kernel/purgatory_entry_64.S

Wow, that's surprisingly small, nice job.

What do you mean by the "real mode entry"?  Do we need to care about
that because we aren't falling back to real mode when executing this,
are we?  Or does that just happen for 32bit kernels?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-21 18:58 ` [PATCH 0/6] kexec: A new system call to allow in kernel loading Greg KH
@ 2013-11-21 19:07   ` Vivek Goyal
  2013-11-21 19:46     ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:07 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 10:58:28AM -0800, Greg KH wrote:
> On Wed, Nov 20, 2013 at 12:50:45PM -0500, Vivek Goyal wrote:
> > Current proposed secureboot implementation disables kexec/kdump because
> > it can allow unsigned kernel to run on a secureboot platform. Intial
> > idea was to sign /sbin/kexec binary and let that binary do the kernel
> > signature verification. I had posted RFC patches for this apparoach
> > here.
> > 
> > https://lkml.org/lkml/2013/9/10/560
> > 
> > Later we had discussion at Plumbers and most of the people thought
> > that signing and trusting /sbin/kexec is becoming complex. So a 
> > better idea might be let kernel do the signature verification of
> > new kernel being loaded. This calls for implementing a new system call
> > and moving lot of user space code in kernel.
> > 
> > kexec_load() system call allows loading a kexec/kdump kernel and jump
> > to that kernel at right time. Though a lot of processing is done in
> > user space which prepares a list of segments/buffers to be loaded and
> > kexec_load() works on that list of segments. It does not know what's
> > contained in those segments.
> > 
> > Now a new system call kexec_file_load() is implemented which takes
> > kernel fd and initrd fd as parameters. Now kernel should be able
> > to verify signature of newly loaded kernel. 
> > 
> > This is an early RFC patchset. I have not done signature handling
> > part yet. This is more of a minimal patch to show how new system
> > call and functionality will look like. Right now it can only handle
> > bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> > other architecture. Rest of the things can be added slowly as need
> > arises. In first iteration, I have tried to address most common use case
> > for us.
> 
> Very good stuff, thanks for working on this.  How have you been testing
> this on the userspace side?  Are there patches to kexec, or are you just
> using a small test program with the new syscall?

I wrote a patch for kexec-tools. One can choose to use new system call
by passing command line option --use-kexec2-syscall. I will post
that patch soon in this mail thread.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 1/6] kexec: Export vmcoreinfo note size properly
  2013-11-21 18:59   ` Greg KH
@ 2013-11-21 19:08     ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:08 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 10:59:00AM -0800, Greg KH wrote:
> On Wed, Nov 20, 2013 at 12:50:46PM -0500, Vivek Goyal wrote:
> > Right now we seem to be exporting the max data size contained inside
> > vmcoreinfo note. But this does not include the size of meta data around
> > vmcore info data. Like name of the note and starting and ending elf_note.
> > 
> > I think user space expects total size and that size is put in PT_NOTE
> > elf header.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  kernel/ksysfs.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> This should go into 3.13, right?
> 
> Nice fix.

Yes. This is just a general fix for kexec. I noticed it while going
through the code. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:06     ` Matthew Garrett
@ 2013-11-21 19:13       ` Vivek Goyal
  2013-11-21 19:19         ` Matthew Garrett
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:13 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: Greg KH, linux-kernel, kexec, ebiederm, hpa, Peter Jones

On Thu, Nov 21, 2013 at 07:06:20PM +0000, Matthew Garrett wrote:
> On Thu, Nov 21, 2013 at 11:03:50AM -0800, Greg KH wrote:
> 
> > This could be done as we do with modules, and just tack the signature
> > onto the end of the 'blob' of the image.  That way we could use the same
> > tool to sign the binary as we do for modules, and save the need for
> > extra parameters in the syscall.
> 
> That would require a certain degree of massaging from userspace if we 
> want to be able to use the existing Authenticode signatures. Otherwise 
> we need to sign kernels twice.

I was thinking of signing the same kernel twice. Can I sign authenticode
signed kernel again (using RSA signature as we do for modules) and append
the signature to bzImage. 

I am wondering if authenticode signature verification will fail due
to this extra signature at the end of bzImage. pjones thought that it
will break authenticode signature verification. CCing him.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-21 19:06 ` Geert Uytterhoeven
@ 2013-11-21 19:14   ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:14 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: linux-kernel, kexec, Eric W. Biederman, H. Peter Anvin,
	Matthew Garrett, Greg Kroah-Hartman

On Thu, Nov 21, 2013 at 08:06:02PM +0100, Geert Uytterhoeven wrote:
> On Wed, Nov 20, 2013 at 6:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Now a new system call kexec_file_load() is implemented which takes
> > kernel fd and initrd fd as parameters. Now kernel should be able
> > to verify signature of newly loaded kernel.
> 
> Only kernel fd and initrd fd?
> What about other fds (e.g. device tree, bootinfo)?

Now bootparams and any other stuff is prepared by the kernel. So like a
bootloader we just need to know 3 things: kernel, initrd and command line,
and the kernel should do the rest.

In future we can define a few flags to alter the loading behavior, like
forcing 32bit or 16bit entry instead of the default 64bit for bzImage.
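
To make the interface concrete, a minimal userspace sketch of invoking the
proposed syscall is below. The argument order follows the kexec-tools patch
later in this thread; __NR_kexec_file_load would come from the syscall table
entry added by this patchset, and cmdline_len is assumed to include the
trailing NUL, as the in-kernel code that terminates the string suggests.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Illustration only; error handling omitted. */
static long do_kexec_file_load(const char *kernel, const char *initrd,
			       const char *cmdline, unsigned long flags)
{
	int kernel_fd = open(kernel, O_RDONLY);
	int initrd_fd = initrd ? open(initrd, O_RDONLY) : -1;

	return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline, strlen(cmdline) + 1, flags);
}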

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:03   ` Greg KH
  2013-11-21 19:06     ` Matthew Garrett
@ 2013-11-21 19:16     ` Vivek Goyal
  2013-11-22  1:03     ` Kees Cook
  2 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:16 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 11:03:50AM -0800, Greg KH wrote:
> On Wed, Nov 20, 2013 at 12:50:49PM -0500, Vivek Goyal wrote:
> > This patch implements the in kernel kexec functionality. It implements a
> > new system call kexec_file_load. I think parameter list of this system
> > call will change as I have not done the kernel image signature handling
> > yet. I have been told that I might have to pass the detached signature
> > and size as part of system call.
> 
> This could be done as we do with modules, and just tack the signature
> onto the end of the 'blob' of the image.  That way we could use the same
> tool to sign the binary as we do for modules, and save the need for
> extra parameters in the syscall.

I was hoping to do that. Just that we will have to somehow resolve the
conflict with PE/COFF authenticode signature of kernel.

> 
> > +/*
> > + * Free up tempory buffers allocated which are not needed after image has
> > + * been loaded.
> > + *
> > + * Free up memory used by kernel, initrd, and comand line. This is temporary
> > + * memory allocation which is not needed any more after these buffers have
> > + * been loaded into separate segments and have been copied elsewhere
> > + */
> > +static void kimage_file_post_load_cleanup(struct kimage *image)
> > +{
> > +	if (image->kernel_buf) {
> > +		vfree(image->kernel_buf);
> > +		image->kernel_buf = NULL;
> > +	}
> > +
> > +	if (image->initrd_buf) {
> > +		vfree(image->initrd_buf);
> > +		image->initrd_buf = NULL;
> > +	}
> > +
> > +	if (image->cmdline_buf) {
> > +		vfree(image->cmdline_buf);
> > +		image->cmdline_buf = NULL;
> > +	}
> 
> No need to check the buffer before calling vfree(), it can handle NULL
> just fine.

Ok, I will remove this extra non-null check.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:13       ` Vivek Goyal
@ 2013-11-21 19:19         ` Matthew Garrett
  2013-11-21 19:24           ` Vivek Goyal
  2013-11-22 18:57           ` Vivek Goyal
  0 siblings, 2 replies; 90+ messages in thread
From: Matthew Garrett @ 2013-11-21 19:19 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Greg KH, linux-kernel, kexec, ebiederm, hpa, Peter Jones

On Thu, Nov 21, 2013 at 02:13:05PM -0500, Vivek Goyal wrote:
> On Thu, Nov 21, 2013 at 07:06:20PM +0000, Matthew Garrett wrote:
> > That would require a certain degree of massaging from userspace if we 
> > want to be able to use the existing Authenticode signatures. Otherwise 
> > we need to sign kernels twice.
> 
> > I was thinking of signing the same kernel twice. Can I sign authenticode
> signed kernel again (using RSA signature as we do for modules) and append
> the signature to bzImage. 

No, you'd need to do it the other way around.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-21 19:07   ` Greg KH
@ 2013-11-21 19:21     ` Vivek Goyal
  2013-11-22 15:24       ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:21 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 11:07:10AM -0800, Greg KH wrote:
> On Wed, Nov 20, 2013 at 12:50:50PM -0500, Vivek Goyal wrote:
> > This is loader specific code which can load bzImage and set it up for
> > 64bit entry. This does not take care of 32bit entry or real mode entry
> > yet.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  arch/x86/include/asm/kexec-bzimage.h |   12 +
> >  arch/x86/include/asm/kexec.h         |   26 +++
> >  arch/x86/kernel/Makefile             |    2 +
> >  arch/x86/kernel/kexec-bzimage.c      |  375 ++++++++++++++++++++++++++++++++++
> >  arch/x86/kernel/machine_kexec_64.c   |    4 +-
> >  arch/x86/kernel/purgatory_entry_64.S |  119 +++++++++++
> >  6 files changed, 537 insertions(+), 1 deletions(-)
> >  create mode 100644 arch/x86/include/asm/kexec-bzimage.h
> >  create mode 100644 arch/x86/kernel/kexec-bzimage.c
> >  create mode 100644 arch/x86/kernel/purgatory_entry_64.S
> 
> Wow, that's surprisingly small, nice job.
> 
> What do you mean by the "real mode entry"?  Do we need to care about
> that because we aren't falling back to real mode when executing this,
> are we?  Or does that just happen for 32bit kernels?

The original kexec offers a real mode entry choice too. So we fall back
to real mode and jump to the kernel, and the kernel makes a bunch of BIOS
calls. I don't think this is a commonly used option.

For 32bit kernels, currently by default we use 32bit entry point of
bzImage and we don't drop to real mode. Real mode entry is forced
by user using command line option --real-mode to kexec-tools.

So this is one of the features of existing kexec-tools. I just wanted
to make it explicitly clear that I have not taken care of that feature,
as it is not a commonly used one. If somebody needs it, they can implement
it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:19         ` Matthew Garrett
@ 2013-11-21 19:24           ` Vivek Goyal
  2013-11-22 18:57           ` Vivek Goyal
  1 sibling, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:24 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: Greg KH, linux-kernel, kexec, ebiederm, hpa, Peter Jones

On Thu, Nov 21, 2013 at 07:19:07PM +0000, Matthew Garrett wrote:
> On Thu, Nov 21, 2013 at 02:13:05PM -0500, Vivek Goyal wrote:
> > On Thu, Nov 21, 2013 at 07:06:20PM +0000, Matthew Garrett wrote:
> > > That would require a certain degree of massaging from userspace if we 
> > > want to be able to use the existing Authenticode signatures. Otherwise 
> > > we need to sign kernels twice.
> > 
> > I was thinking of signing the same kernel twice. Can I sign authenticode
> > signed kernel again (using RSA signature as we do for modules) and append
> > the signature to bzImage. 
> 
> No, you'd need to do it the other way around.

Then I can't assume that RSA signatures are appended to bzImage, as we
do for modules.

Also I am assuming that authenticode signing will change something in
PE/COFF header and that would invalidate the bzImage signature.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-21 19:07   ` Vivek Goyal
@ 2013-11-21 19:46     ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-21 19:46 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 02:07:29PM -0500, Vivek Goyal wrote:

[..]
> > Very good stuff, thanks for working on this.  How have you been testing
> > this on the userspace side?  Are there patches to kexec, or are you just
> > using a small test program with the new syscall?
> 
> I wrote a patch for kexec-tools. One can choose to use new system call
> by passing command line option --use-kexec2-syscall. I will post
> that patch soon in this mail thread.

I have been using the following kexec-tools patch to test this code.

Thanks
Vivek


kexec-tools: Provide an option to make use of new system call

This patch provides an option, --use-kexec2-syscall, to force use of the
new system call for kexec. The default is to continue using the old
syscall.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kexec/arch/x86_64/kexec-bzImage64.c |   86 ++++++++++++++++++++++++++
 kexec/kexec-syscall.h               |   24 +++++++
 kexec/kexec.c                       |  118 +++++++++++++++++++++++++++++++++++-
 kexec/kexec.h                       |    9 ++
 4 files changed, 234 insertions(+), 3 deletions(-)

Index: kexec-tools/kexec/kexec.c
===================================================================
--- kexec-tools.orig/kexec/kexec.c	2013-11-20 10:28:26.652380051 -0500
+++ kexec-tools/kexec/kexec.c	2013-11-20 11:31:44.906046486 -0500
@@ -51,6 +51,7 @@
 unsigned long long mem_min = 0;
 unsigned long long mem_max = ULONG_MAX;
 static unsigned long kexec_flags = 0;
+static unsigned long kexec2_flags = 0;
 int kexec_debug = 0;
 
 void die(const char *fmt, ...)
@@ -781,6 +782,19 @@ static int my_load(const char *type, int
 	return result;
 }
 
+static int kexec2_unload(unsigned long kexec2_flags)
+{
+	int ret = 0;
+
+	ret = kexec_file_load(-1, -1, NULL, 0, kexec2_flags);
+	if (ret != 0) {
+		/* The unload failed, print some debugging information */
+		fprintf(stderr, "kexec_file_load(unload) failed\n: %s\n",
+			strerror(errno));
+	}
+	return ret;
+}
+
 static int k_unload (unsigned long kexec_flags)
 {
 	int result;
@@ -919,6 +933,7 @@ void usage(void)
 	       "                      (0 means it's not jump back or\n"
 	       "                      preserve context)\n"
 	       "                      to original kernel.\n"
+	       " -s --use-kexec2-syscall Use new syscall for kexec operation\n"
 	       " -d, --debug           Enable debugging to help spot a failure.\n"
 	       "\n"
 	       "Supported kernel file types and options: \n");
@@ -1066,6 +1081,70 @@ char *concat_cmdline(const char *base, c
 	return cmdline;
 }
 
+/* New file based kexec system call related code */
+static int kexec2_load(int fileind, int argc, char **argv,
+			unsigned long flags) {
+
+	char *kernel;
+	int kernel_fd, i;
+	struct kexec_info info;
+	int ret = 0;
+	char *kernel_buf;
+	off_t kernel_size;
+
+	memset(&info, 0, sizeof(info));
+	info.segment = NULL;
+	info.nr_segments = 0;
+	info.entry = NULL;
+	info.backup_start = 0;
+	info.kexec_flags = flags;
+
+	info.file_mode = 1;
+	info.initrd_fd = -1;
+
+	if (argc - fileind <= 0) {
+		fprintf(stderr, "No kernel specified\n");
+		usage();
+		return -1;
+	}
+
+	kernel = argv[fileind];
+
+	kernel_fd = open(kernel, O_RDONLY);
+	if (kernel_fd == -1) {
+		fprintf(stderr, "Failed to open file %s:%s\n", kernel,
+				strerror(errno));
+		return -1;
+	}
+
+	/* slurp in the input kernel */
+	kernel_buf = slurp_decompress_file(kernel, &kernel_size);
+
+	for (i = 0; i < file_types; i++) {
+		if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
+			break;
+	}
+
+	if (i == file_types) {
+		fprintf(stderr, "Cannot determine the file type " "of %s\n",
+				kernel);
+		return -1;
+	}
+
+	ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
+	if (ret < 0) {
+		fprintf(stderr, "Cannot load %s\n", kernel);
+		return ret;
+	}
+
+	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
+			info.command_line_len, info.kexec_flags);
+	if (ret != 0)
+		fprintf(stderr, "kexec_file_load failed: %s\n",
+					strerror(errno));
+	return ret;
+}
+
 
 int main(int argc, char *argv[])
 {
@@ -1077,6 +1156,7 @@ int main(int argc, char *argv[])
 	int do_ifdown = 0;
 	int do_unload = 0;
 	int do_reuse_initrd = 0;
+	int do_use_kexec2_syscall = 0;
 	void *entry = 0;
 	char *type = 0;
 	char *endptr;
@@ -1089,6 +1169,23 @@ int main(int argc, char *argv[])
 	};
 	static const char short_options[] = KEXEC_ALL_OPT_STR;
 
+	/*
+	 * First check if --use-kexec2-syscall is set. That changes lot of
+	 * things
+	 */
+	while ((opt = getopt_long(argc, argv, short_options,
+				  options, 0)) != -1) {
+		switch(opt) {
+		case OPT_USE_KEXEC2_SYSCALL:
+			do_use_kexec2_syscall = 1;
+			break;
+		}
+	}
+
+	/* Reset getopt for the next pass. */
+	opterr = 1;
+	optind = 1;
+
 	while ((opt = getopt_long(argc, argv, short_options,
 				  options, 0)) != -1) {
 		switch(opt) {
@@ -1121,6 +1218,8 @@ int main(int argc, char *argv[])
 			do_shutdown = 0;
 			do_sync = 0;
 			do_unload = 1;
+			if (do_use_kexec2_syscall)
+				kexec2_flags |= KEXEC_FILE_UNLOAD;
 			break;
 		case OPT_EXEC:
 			do_load = 0;
@@ -1163,7 +1262,10 @@ int main(int argc, char *argv[])
 			do_exec = 0;
 			do_shutdown = 0;
 			do_sync = 0;
-			kexec_flags = KEXEC_ON_CRASH;
+			if (do_use_kexec2_syscall)
+				kexec2_flags |= KEXEC_FILE_ON_CRASH;
+			else
+				kexec_flags = KEXEC_ON_CRASH;
 			break;
 		case OPT_MEM_MIN:
 			mem_min = strtoul(optarg, &endptr, 0);
@@ -1188,6 +1290,9 @@ int main(int argc, char *argv[])
 		case OPT_REUSE_INITRD:
 			do_reuse_initrd = 1;
 			break;
+		case OPT_USE_KEXEC2_SYSCALL:
+			/* We already parsed it. Nothing to do. */
+			break;
 		default:
 			break;
 		}
@@ -1232,10 +1337,17 @@ int main(int argc, char *argv[])
 	}
 
 	if (do_unload) {
-		result = k_unload(kexec_flags);
+		if (do_use_kexec2_syscall)
+			result = kexec2_unload(kexec2_flags);
+		else
+			result = k_unload(kexec_flags);
 	}
 	if (do_load && (result == 0)) {
-		result = my_load(type, fileind, argc, argv, kexec_flags, entry);
+		if (do_use_kexec2_syscall)
+			result = kexec2_load(fileind, argc, argv, kexec2_flags);
+		else
+			result = my_load(type, fileind, argc, argv,
+						kexec_flags, entry);
 	}
 	/* Don't shutdown unless there is something to reboot to! */
 	if ((result == 0) && (do_shutdown || do_exec) && !kexec_loaded()) {
Index: kexec-tools/kexec/kexec.h
===================================================================
--- kexec-tools.orig/kexec/kexec.h	2013-11-20 10:28:26.652380051 -0500
+++ kexec-tools/kexec/kexec.h	2013-11-20 10:28:47.806336565 -0500
@@ -154,6 +154,13 @@ struct kexec_info {
 	unsigned long kexec_flags;
 	unsigned long backup_src_start;
 	unsigned long backup_src_size;
+	/* Set to 1 if we are using kexec2 syscall */
+	unsigned long file_mode :1;
+
+	/* Filled by kernel image processing code */
+	int initrd_fd;
+	char *command_line;
+	int command_line_len;
 };
 
 struct arch_map_entry {
@@ -205,6 +212,7 @@ extern int file_types;
 #define OPT_UNLOAD		'u'
 #define OPT_TYPE		't'
 #define OPT_PANIC		'p'
+#define OPT_USE_KEXEC2_SYSCALL	's'
 #define OPT_MEM_MIN             256
 #define OPT_MEM_MAX             257
 #define OPT_REUSE_INITRD	258
@@ -228,6 +236,7 @@ extern int file_types;
 	{ "mem-min",		1, 0, OPT_MEM_MIN }, \
 	{ "mem-max",		1, 0, OPT_MEM_MAX }, \
 	{ "reuseinitrd",	0, 0, OPT_REUSE_INITRD }, \
+	{ "use-kexec2-syscall",	0, 0, OPT_USE_KEXEC2_SYSCALL }, \
 	{ "debug",		0, 0, OPT_DEBUG }, \
 
 #define KEXEC_OPT_STR "h?vdfxluet:p"
Index: kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c
===================================================================
--- kexec-tools.orig/kexec/arch/x86_64/kexec-bzImage64.c	2013-11-20 10:28:26.652380051 -0500
+++ kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c	2013-11-20 10:28:47.807336610 -0500
@@ -229,6 +229,89 @@ static int do_bzImage64_load(struct kexe
 	return 0;
 }
 
+/* This assumes file is being loaded using file based kexec2 syscall */
+int bzImage64_load_file(int argc, char **argv, struct kexec_info *info)
+{
+	int ret = 0;
+	char *command_line = NULL, *tmp_cmdline = NULL;
+	const char *ramdisk = NULL, *append = NULL;
+	int entry_16bit = 0, entry_32bit = 0;
+	int opt;
+	int command_line_len;
+
+	/* See options.h -- add any more there, too. */
+	static const struct option options[] = {
+		KEXEC_ARCH_OPTIONS
+		{ "command-line",	1, 0, OPT_APPEND },
+		{ "append",		1, 0, OPT_APPEND },
+		{ "reuse-cmdline",	0, 0, OPT_REUSE_CMDLINE },
+		{ "initrd",		1, 0, OPT_RAMDISK },
+		{ "ramdisk",		1, 0, OPT_RAMDISK },
+		{ "real-mode",		0, 0, OPT_REAL_MODE },
+		{ "entry-32bit",	0, 0, OPT_ENTRY_32BIT },
+		{ 0,			0, 0, 0 },
+	};
+	static const char short_options[] = KEXEC_ARCH_OPT_STR "d";
+
+	while ((opt = getopt_long(argc, argv, short_options, options, 0)) != -1) {
+		switch (opt) {
+		default:
+			/* Ignore core options */
+			if (opt < OPT_ARCH_MAX)
+				break;
+		case OPT_APPEND:
+			append = optarg;
+			break;
+		case OPT_REUSE_CMDLINE:
+			tmp_cmdline = get_command_line();
+			break;
+		case OPT_RAMDISK:
+			ramdisk = optarg;
+			break;
+		case OPT_REAL_MODE:
+			entry_16bit = 1;
+			break;
+		case OPT_ENTRY_32BIT:
+			entry_32bit = 1;
+			break;
+		}
+	}
+	command_line = concat_cmdline(tmp_cmdline, append);
+	if (tmp_cmdline)
+		free(tmp_cmdline);
+	command_line_len = 0;
+	if (command_line) {
+		command_line_len = strlen(command_line) + 1;
+	} else {
+		command_line = strdup("\0");
+		command_line_len = 1;
+	}
+
+	if (entry_16bit || entry_32bit) {
+		fprintf(stderr, "Kexec2 syscall does not support 16bit"
+			" or 32bit entry yet\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (ramdisk) {
+		info->initrd_fd = open(ramdisk, O_RDONLY);
+		if (info->initrd_fd == -1) {
+			fprintf(stderr, "Could not open initrd file %s:%s\n",
+					ramdisk, strerror(errno));
+			ret = -1;
+			goto out;
+		}
+	}
+
+	info->command_line = command_line;
+	info->command_line_len = command_line_len;
+	return ret;
+out:
+	free(command_line);
+	return ret;
+}
+
 int bzImage64_load(int argc, char **argv, const char *buf, off_t len,
 	struct kexec_info *info)
 {
@@ -241,6 +324,9 @@ int bzImage64_load(int argc, char **argv
 	int opt;
 	int result;
 
+	if (info->file_mode)
+		return bzImage64_load_file(argc, argv, info);
+
 	/* See options.h -- add any more there, too. */
 	static const struct option options[] = {
 		KEXEC_ARCH_OPTIONS
Index: kexec-tools/kexec/kexec-syscall.h
===================================================================
--- kexec-tools.orig/kexec/kexec-syscall.h	2013-11-20 10:28:26.652380051 -0500
+++ kexec-tools/kexec/kexec-syscall.h	2013-11-20 10:28:47.808336655 -0500
@@ -50,6 +50,18 @@
 #endif
 #endif /*ifndef __NR_kexec_load*/
 
+#ifndef __NR_kexec_file_load
+
+#ifdef __x86_64__
+#define __NR_kexec_file_load	314
+#endif
+
+#ifndef __NR_kexec_file_load
+#error Unknown processor architecture.  Needs a kexec_load syscall number.
+#endif
+
+#endif /*ifndef __NR_kexec_file_load*/
+
 struct kexec_segment;
 
 static inline long kexec_load(void *entry, unsigned long nr_segments,
@@ -58,10 +70,22 @@ static inline long kexec_load(void *entr
 	return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
 }
 
+static inline long kexec_file_load(int kernel_fd, int initrd_fd,
+			const char *cmdline_ptr, unsigned long cmdline_len,
+			unsigned long flags)
+{
+	return (long) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len, flags);
+}
+
 #define KEXEC_ON_CRASH		0x00000001
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_ARCH_MASK		0xffff0000
 
+/* Flags for kexec file based system call */
+#define KEXEC_FILE_UNLOAD	0x00000001
+#define KEXEC_FILE_ON_CRASH	0x00000002
+
 /* These values match the ELF architecture values. 
  * Unless there is a good reason that should continue to be the case.
  */
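
For completeness, the new syscall can also be exercised without the
kexec-tools patch; a minimal stand-alone test program along these lines
should be enough (syscall number 314 is the provisional x86_64 number from
the patch, and the command line below is just an example):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_kexec_file_load
#define __NR_kexec_file_load 314	/* provisional x86_64 number */
#endif

int main(int argc, char *argv[])
{
	const char *cmdline = "root=/dev/sda1 console=ttyS0";
	int kernel_fd, initrd_fd;
	long ret;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <bzImage> <initrd>\n", argv[0]);
		return 1;
	}

	kernel_fd = open(argv[1], O_RDONLY);
	initrd_fd = open(argv[2], O_RDONLY);
	if (kernel_fd < 0 || initrd_fd < 0) {
		perror("open");
		return 1;
	}

	/* cmdline_len includes the trailing NUL, as in the patch above */
	ret = syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		      cmdline, strlen(cmdline) + 1, 0UL);
	if (ret)
		fprintf(stderr, "kexec_file_load: %s\n", strerror(errno));

	return ret ? 1 : 0;
}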

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (7 preceding siblings ...)
  2013-11-21 19:06 ` Geert Uytterhoeven
@ 2013-11-21 23:07 ` Eric W. Biederman
  2013-11-22  1:28   ` H. Peter Anvin
  2013-11-22  1:55   ` Vivek Goyal
  2013-11-22  0:55 ` HATAYAMA Daisuke
  2013-12-03 13:23 ` Baoquan He
  10 siblings, 2 replies; 90+ messages in thread
From: Eric W. Biederman @ 2013-11-21 23:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, hpa, mjg59, greg

Vivek Goyal <vgoyal@redhat.com> writes:

> Current proposed secureboot implementation disables kexec/kdump because
> it can allow unsigned kernel to run on a secureboot platform. Intial
> idea was to sign /sbin/kexec binary and let that binary do the kernel
> signature verification. I had posted RFC patches for this apparoach
> here.
>
> https://lkml.org/lkml/2013/9/10/560
>
> Later we had discussion at Plumbers and most of the people thought
> that signing and trusting /sbin/kexec is becoming complex. So a 
> better idea might be let kernel do the signature verification of
> new kernel being loaded. This calls for implementing a new system call
> and moving lot of user space code in kernel.
>
> kexec_load() system call allows loading a kexec/kdump kernel and jump
> to that kernel at right time. Though a lot of processing is done in
> user space which prepares a list of segments/buffers to be loaded and
> kexec_load() works on that list of segments. It does not know what's
> contained in those segments.
>
> Now a new system call kexec_file_load() is implemented which takes
> kernel fd and initrd fd as parameters. Now kernel should be able
> to verify signature of newly loaded kernel. 
>
> This is an early RFC patchset. I have not done signature handling
> part yet. This is more of a minimal patch to show how new system
> call and functionality will look like. Right now it can only handle
> bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> other architecture. Rest of the things can be added slowly as need
> arises. In first iteration, I have tried to address most common use case
> for us.
>
> Any feedback is welcome.

Before you are done we need an ELF loader.  bzImage really is very
uninteresting.  To the point I am not at all convinced that an in kernel
loader should support it.

There is also a huge missing piece of this in that your purgatory is not
checking a hash of the loaded image before jumping to it.  Without that
this is a huge regression, at least for the kexec on panic case.  We
absolutely need to check that the kernel sitting around in memory has
not been corrupted before we let it run very far.

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (8 preceding siblings ...)
  2013-11-21 23:07 ` Eric W. Biederman
@ 2013-11-22  0:55 ` HATAYAMA Daisuke
  2013-11-22  2:03   ` Vivek Goyal
  2013-12-03 13:23 ` Baoquan He
  10 siblings, 1 reply; 90+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-22  0:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

(2013/11/21 2:50), Vivek Goyal wrote:
> Current proposed secureboot implementation disables kexec/kdump because
> it can allow unsigned kernel to run on a secureboot platform. Intial
> idea was to sign /sbin/kexec binary and let that binary do the kernel
> signature verification. I had posted RFC patches for this apparoach
> here.
>
> https://lkml.org/lkml/2013/9/10/560
>
> Later we had discussion at Plumbers and most of the people thought
> that signing and trusting /sbin/kexec is becoming complex. So a
> better idea might be let kernel do the signature verification of
> new kernel being loaded. This calls for implementing a new system call
> and moving lot of user space code in kernel.
>
> kexec_load() system call allows loading a kexec/kdump kernel and jump
> to that kernel at right time. Though a lot of processing is done in
> user space which prepares a list of segments/buffers to be loaded and
> kexec_load() works on that list of segments. It does not know what's
> contained in those segments.
>
> Now a new system call kexec_file_load() is implemented which takes
> kernel fd and initrd fd as parameters. Now kernel should be able
> to verify signature of newly loaded kernel.
>
> This is an early RFC patchset. I have not done signature handling
> part yet. This is more of a minimal patch to show how new system
> call and functionality will look like. Right now it can only handle
> bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> other architecture. Rest of the things can be added slowly as need
> arises. In first iteration, I have tried to address most common use case
> for us.
>
> Any feedback is welcome.
>

So, ultimately, on this design direction, will the user-land kexec command
someday no longer be used at all? Or is there any feature you will keep on
the user-land side?

I think it is a big change if one component of kdump disappears.

-- 
Thanks.
HATAYAMA, Daisuke


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:03   ` Greg KH
  2013-11-21 19:06     ` Matthew Garrett
  2013-11-21 19:16     ` Vivek Goyal
@ 2013-11-22  1:03     ` Kees Cook
  2013-11-22  2:13       ` Vivek Goyal
  2 siblings, 1 reply; 90+ messages in thread
From: Kees Cook @ 2013-11-22  1:03 UTC (permalink / raw)
  To: Greg KH; +Cc: Vivek Goyal, linux-kernel, kexec, ebiederm, hpa, mjg59

Hi,

On Thu, Nov 21, 2013 at 11:03:50AM -0800, Greg KH wrote:
> On Wed, Nov 20, 2013 at 12:50:49PM -0500, Vivek Goyal wrote:
> > This patch implements the in kernel kexec functionality. It implements a
> > new system call kexec_file_load. I think parameter list of this system
> > call will change as I have not done the kernel image signature handling
> > yet. I have been told that I might have to pass the detached signature
> > and size as part of system call.
> 
> This could be done as we do with modules, and just tack the signature
> onto the end of the 'blob' of the image.  That way we could use the same
> tool to sign the binary as we do for modules, and save the need for
> extra parameters in the syscall.

As long as the system call is passing in an fd, I'm all good. For those
of us who run from verified filesystems, we don't need the additional
signing overhead, but we do need the file descriptor to validate the
origin of the kernel.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-21 23:07 ` Eric W. Biederman
@ 2013-11-22  1:28   ` H. Peter Anvin
  2013-11-22  2:35     ` Vivek Goyal
  2013-11-22  1:55   ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2013-11-22  1:28 UTC (permalink / raw)
  To: ebiederm, Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg

What do you need from ELF?

ebiederm@xmission.com wrote:
>Vivek Goyal <vgoyal@redhat.com> writes:
>
>> Current proposed secureboot implementation disables kexec/kdump
>because
>> it can allow unsigned kernel to run on a secureboot platform. Intial
>> idea was to sign /sbin/kexec binary and let that binary do the kernel
>> signature verification. I had posted RFC patches for this apparoach
>> here.
>>
>> https://lkml.org/lkml/2013/9/10/560
>>
>> Later we had discussion at Plumbers and most of the people thought
>> that signing and trusting /sbin/kexec is becoming complex. So a 
>> better idea might be let kernel do the signature verification of
>> new kernel being loaded. This calls for implementing a new system
>call
>> and moving lot of user space code in kernel.
>>
>> kexec_load() system call allows loading a kexec/kdump kernel and jump
>> to that kernel at right time. Though a lot of processing is done in
>> user space which prepares a list of segments/buffers to be loaded and
>> kexec_load() works on that list of segments. It does not know what's
>> contained in those segments.
>>
>> Now a new system call kexec_file_load() is implemented which takes
>> kernel fd and initrd fd as parameters. Now kernel should be able
>> to verify signature of newly loaded kernel. 
>>
>> This is an early RFC patchset. I have not done signature handling
>> part yet. This is more of a minimal patch to show how new system
>> call and functionality will look like. Right now it can only handle
>> bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
>> other architecture. Rest of the things can be added slowly as need
>> arises. In first iteration, I have tried to address most common use
>case
>> for us.
>>
>> Any feedback is welcome.
>
>Before you are done we need an ELF loader.  bzImage really is very
>uninteresting.  To the point I am not at all convinced that an in
>kernel
>loader should support it.
>
>There is also a huge missing piece of this in that your purgatory is
>not
>checking a hash of the loaded image before jumping too it.  Without
>that
>this is a huge regression at least for the kexec on panic case.  We
>absolutely need to check that the kernel sitting around in memory has
>not been corrupted before we let it run very far.
>
>Eric

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-21 23:07 ` Eric W. Biederman
  2013-11-22  1:28   ` H. Peter Anvin
@ 2013-11-22  1:55   ` Vivek Goyal
  2013-11-22  9:09     ` Geert Uytterhoeven
  2013-11-22 13:34     ` Eric W. Biederman
  1 sibling, 2 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22  1:55 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, kexec, hpa, mjg59, greg

On Thu, Nov 21, 2013 at 03:07:04PM -0800, Eric W. Biederman wrote:

[..]
> 
> Before you are done we need an ELF loader.  bzImage really is very
> uninteresting.  To the point I am not at all convinced that an in kernel
> loader should support it.

Hi Eric,

Why is the ELF case so interesting? I have not used kexec to boot ELF
images in years and have not seen others using it either. In fact bzImage
seems to be the most common kernel image format for x86; it is what most
of the distros ship and use.

So I first did the loader for the common use case. There is no reason
that one can't write another loader for ELF images; it just bloats
the code. Hence I thought that other image loaders can follow slowly. I am
not sure why you say that bzImage is uninteresting.

> 
> There is also a huge missing piece of this in that your purgatory is not
> checking a hash of the loaded image before jumping too it.  Without that
> this is a huge regression at least for the kexec on panic case.  We
> absolutely need to check that the kernel sitting around in memory has
> not been corrupted before we let it run very far.

Agreed. This should not be hard. It is just a matter of calculating a
digest of the segments. I will store it in the kimage and verify the
digest again before passing control to the control page. Will fix it in
the next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  0:55 ` HATAYAMA Daisuke
@ 2013-11-22  2:03   ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22  2:03 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Fri, Nov 22, 2013 at 09:55:15AM +0900, HATAYAMA Daisuke wrote:
> (2013/11/21 2:50), Vivek Goyal wrote:
> >Current proposed secureboot implementation disables kexec/kdump because
> >it can allow unsigned kernel to run on a secureboot platform. Intial
> >idea was to sign /sbin/kexec binary and let that binary do the kernel
> >signature verification. I had posted RFC patches for this apparoach
> >here.
> >
> >https://lkml.org/lkml/2013/9/10/560
> >
> >Later we had discussion at Plumbers and most of the people thought
> >that signing and trusting /sbin/kexec is becoming complex. So a
> >better idea might be let kernel do the signature verification of
> >new kernel being loaded. This calls for implementing a new system call
> >and moving lot of user space code in kernel.
> >
> >kexec_load() system call allows loading a kexec/kdump kernel and jump
> >to that kernel at right time. Though a lot of processing is done in
> >user space which prepares a list of segments/buffers to be loaded and
> >kexec_load() works on that list of segments. It does not know what's
> >contained in those segments.
> >
> >Now a new system call kexec_file_load() is implemented which takes
> >kernel fd and initrd fd as parameters. Now kernel should be able
> >to verify signature of newly loaded kernel.
> >
> >This is an early RFC patchset. I have not done signature handling
> >part yet. This is more of a minimal patch to show how new system
> >call and functionality will look like. Right now it can only handle
> >bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> >other architecture. Rest of the things can be added slowly as need
> >arises. In first iteration, I have tried to address most common use case
> >for us.
> >
> >Any feedback is welcome.
> >
> 
> So, ultimately on this design direction, user-land kexec command someday
> will no longer be used at all? Or is there any feature you will keep in
> user-land side?
> 

The current user land is huge and implements lots of image formats on
different architectures with tons of options.

I doubt that the kernel implementation will be a complete replacement of
the existing implementation anytime soon.  If the kernel implementation
works well, then maybe at some point in the future we can completely
move away from the user space implementation.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-22  1:03     ` Kees Cook
@ 2013-11-22  2:13       ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22  2:13 UTC (permalink / raw)
  To: Kees Cook; +Cc: Greg KH, linux-kernel, kexec, ebiederm, hpa, mjg59

On Thu, Nov 21, 2013 at 05:03:11PM -0800, Kees Cook wrote:
> Hi,
> 
> On Thu, Nov 21, 2013 at 11:03:50AM -0800, Greg KH wrote:
> > On Wed, Nov 20, 2013 at 12:50:49PM -0500, Vivek Goyal wrote:
> > > This patch implements the in kernel kexec functionality. It implements a
> > > new system call kexec_file_load. I think parameter list of this system
> > > call will change as I have not done the kernel image signature handling
> > > yet. I have been told that I might have to pass the detached signature
> > > and size as part of system call.
> > 
> > This could be done as we do with modules, and just tack the signature
> > onto the end of the 'blob' of the image.  That way we could use the same
> > tool to sign the binary as we do for modules, and save the need for
> > extra parameters in the syscall.
> 
> As long as the system call passing in an fd, I'm all good. For those
> of us that run from verified filesystems, we don't need the additional
> signing overhead, but we do need the file descriptor to validate the
> origin of the kernel.

Yep, Greg had mentioned that keep the interface file descriptor based so
that it can work well with LSM hooks and that's why I went with it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  1:28   ` H. Peter Anvin
@ 2013-11-22  2:35     ` Vivek Goyal
  2013-11-22  2:40       ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22  2:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: ebiederm, linux-kernel, kexec, mjg59, greg

On Thu, Nov 21, 2013 at 05:28:38PM -0800, H. Peter Anvin wrote:
> What do you need from ELF?

Sorry, I did not understand the question. Is it for me or for Eric? I am
assuming you are asking Eric why ELF is so important.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  2:35     ` Vivek Goyal
@ 2013-11-22  2:40       ` H. Peter Anvin
  0 siblings, 0 replies; 90+ messages in thread
From: H. Peter Anvin @ 2013-11-22  2:40 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: ebiederm, linux-kernel, kexec, mjg59, greg

Yes.

Vivek Goyal <vgoyal@redhat.com> wrote:
>On Thu, Nov 21, 2013 at 05:28:38PM -0800, H. Peter Anvin wrote:
>> What do you need from ELF?
>
>Sorry, I did not understand the question. Is it for me or Eric. I am
>assuming you are asking Eric that why ELF is so important?
>
>Thanks
>Vivek

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  1:55   ` Vivek Goyal
@ 2013-11-22  9:09     ` Geert Uytterhoeven
  2013-11-22 13:30       ` Jiri Kosina
  2013-11-22 13:43       ` Vivek Goyal
  2013-11-22 13:34     ` Eric W. Biederman
  1 sibling, 2 replies; 90+ messages in thread
From: Geert Uytterhoeven @ 2013-11-22  9:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Eric W. Biederman, linux-kernel, kexec, H. Peter Anvin,
	Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 2:55 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> Before you are done we need an ELF loader.  bzImage really is very
>> uninteresting.  To the point I am not at all convinced that an in kernel
>> loader should support it.
>
> Hi Eric,
>
> Why ELF case is so interesting. I have not use kexec to boot ELF
> images in years and have not seen others using it too. In fact bzImage
> seems to be the most common kernel image format for x86, most of the distros
> ship and use.
>
> So first I did the loader for the common use case. There is no reason
> that one can't write another loader for ELF images. It just bloats
> the code. Hence I thought that other image loaders can follow slowly. I am
> not sure why do you say that bzImage is uninteresting.

Welcome to the non-x86-centric world ;-)

Looking at kexec-tools, all of arm, cris, i386, ia64, m68k, mips, ppc, ppc64,
s390, sh, and x86_64 support ELF.
Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
It's not clear to me what alpha supports (if it supports anything at all?).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  9:09     ` Geert Uytterhoeven
@ 2013-11-22 13:30       ` Jiri Kosina
  2013-11-22 13:46         ` Vivek Goyal
  2013-11-22 13:43       ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2013-11-22 13:30 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Vivek Goyal, Eric W. Biederman, linux-kernel, kexec,
	H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, 22 Nov 2013, Geert Uytterhoeven wrote:

> > Why ELF case is so interesting. I have not use kexec to boot ELF
> > images in years and have not seen others using it too. In fact bzImage
> > seems to be the most common kernel image format for x86, most of the distros
> > ship and use.
> >
> > So first I did the loader for the common use case. There is no reason
> > that one can't write another loader for ELF images. It just bloats
> > the code. Hence I thought that other image loaders can follow slowly. I am
> > not sure why do you say that bzImage is uninteresting.
> 
> Welcome to the non-x86-centric world ;-)
> 
> Looking at kexec-tools, all of arm, cris, i386, ia64, m68k, mips, ppc, ppc64,
> s390, sh, and x86_64 support ELF.
> Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
> It's not clear to me what alpha supports (if it supports anything at all?).

OTOH, does this feature make any sense whatsoever on architectures that
don't support secure boot anyway?

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  1:55   ` Vivek Goyal
  2013-11-22  9:09     ` Geert Uytterhoeven
@ 2013-11-22 13:34     ` Eric W. Biederman
  2013-11-22 14:19       ` Vivek Goyal
  2013-11-25 10:04       ` Michael Holzheu
  1 sibling, 2 replies; 90+ messages in thread
From: Eric W. Biederman @ 2013-11-22 13:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, hpa, mjg59, greg

Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Nov 21, 2013 at 03:07:04PM -0800, Eric W. Biederman wrote:
>
> [..]
>> 
>> Before you are done we need an ELF loader.  bzImage really is very
>> uninteresting.  To the point I am not at all convinced that an in kernel
>> loader should support it.
>
> Hi Eric,
>
> Why ELF case is so interesting. I have not use kexec to boot ELF
> images in years and have not seen others using it too. In fact bzImage
> seems to be the most common kernel image format for x86, most of the distros
> ship and use.

ELF is interesting because it is the minimal file format that does
everything you need.  So especially for a proof of concept ELF needs to
come first.  There is an extra virtual address field in the ELF segment
header, but otherwise ELF does not have any unnecessary fields.

ELF is interesting because it is the native kernel file format on all
architectures Linux supports, including x86.

ELF is interesting because producing an ELF image in practice requires
only a trivial amount of tooling, so it is a good general purpose format
to support.
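
To make that concrete, everything an ELF loader has to look at is in the
program headers; a user-space sketch, with error handling and the actual
copy into kexec segments left out:

#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Walk the PT_LOAD segments of an ELF64 image already read into memory. */
int walk_load_segments(const unsigned char *buf, size_t len)
{
	const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)buf;
	const Elf64_Phdr *phdr;
	int i;

	if (len < sizeof(*ehdr) ||
	    memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0 ||
	    ehdr->e_ident[EI_CLASS] != ELFCLASS64)
		return -1;

	phdr = (const Elf64_Phdr *)(buf + ehdr->e_phoff);
	for (i = 0; i < ehdr->e_phnum; i++) {
		if (phdr[i].p_type != PT_LOAD)
			continue;
		/* p_paddr, p_filesz and p_memsz are all a loader needs */
		printf("segment %d: paddr=%#llx filesz=%#llx memsz=%#llx\n", i,
		       (unsigned long long)phdr[i].p_paddr,
		       (unsigned long long)phdr[i].p_filesz,
		       (unsigned long long)phdr[i].p_memsz);
	}
	printf("entry point: %#llx\n", (unsigned long long)ehdr->e_entry);
	return 0;
}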

> So first I did the loader for the common use case. There is no reason 
> that one can't write another loader for ELF images. It just bloats
> the code. Hence I thought that other image loaders can follow slowly. I am
> not sure why do you say that bzImage is uninteresting. 

If you boot anything that isn't a Linux kernel bzImage on x86, bzImage is
not the solution you are using.  Furthermore, because bzImage is a bunch
of hacks thrown together, it keeps evolving in weird and strange ways.
The complexity of supporting bzImage only grows through the years.

At the end of the day we will probably need to support bzImage in some
form (possibly just going so far as extracting the embedded ELF image in
userspace), as there are support benefits to only having one blob you
sling around.

But let's first start with the sane general case before worrying about x86
legacy weirdness.

For a long term stable ABI that supports booting things other than the
Linux kernel, bzImage is not my first choice.

>> There is also a huge missing piece of this in that your purgatory is not
>> checking a hash of the loaded image before jumping too it.  Without that
>> this is a huge regression at least for the kexec on panic case.  We
>> absolutely need to check that the kernel sitting around in memory has
>> not been corrupted before we let it run very far.
>
> Agreed. This should not be hard. It is just a matter of calcualting
> digest of segments. I will store it in kimge and verify digest again
> before passing control to control page. Will fix it in next version.

Nak.  The verification needs to happen in purgatory. 

The verification needs to happen in code whose runtime environment does
not depend on random parts of the kernel.  Anything else is a regression
in maintainability and reliability.

It is the wrong direction to add any code to what needs to run in the
known broken environment of the kernel when a panic happens.

Which means that you almost certainly need to go to the trouble of
supporting the complexity needed for purgatory code written in C.

(For those just tuning in, purgatory is our term for the code that runs
between the kernels to do those things that can not happen a priori.)
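
To give a feel for the shape of that, a rough sketch of such a check as
freestanding C; the sha256_* helpers, the sha256_state layout and the
region table are placeholders for a self-contained SHA-256 implementation
and for data the loading kernel would fill in, none of which exists yet:

/* Placeholder layout; a real purgatory would link a self-contained
 * SHA-256 implementation providing these. */
struct sha256_state {
	unsigned int h[8];
	unsigned long long count;
	unsigned char buf[64];
};

void sha256_init(struct sha256_state *state);
void sha256_update(struct sha256_state *state, const void *data,
		   unsigned long len);
void sha256_final(struct sha256_state *state, unsigned char digest[32]);

#define SHA_MAX_REGIONS 16

struct sha_region {
	unsigned long start;	/* physical address of a loaded segment */
	unsigned long len;
};

/* Both filled in by the loading kernel, long before any crash. */
struct sha_region sha_regions[SHA_MAX_REGIONS];
unsigned char expected_digest[32];

/* Return 0 if the loaded segments still match the recorded digest. */
int verify_sha256_digest(void)
{
	struct sha256_state state;
	unsigned char digest[32];
	int i, j;

	sha256_init(&state);
	for (i = 0; i < SHA_MAX_REGIONS && sha_regions[i].len; i++)
		sha256_update(&state, (void *)sha_regions[i].start,
			      sha_regions[i].len);
	sha256_final(&state, digest);

	for (j = 0; j < 32; j++)
		if (digest[j] != expected_digest[j])
			return 1;	/* corrupted, do not jump to it */
	return 0;
}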

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22  9:09     ` Geert Uytterhoeven
  2013-11-22 13:30       ` Jiri Kosina
@ 2013-11-22 13:43       ` Vivek Goyal
  2013-11-22 15:25         ` Geert Uytterhoeven
  1 sibling, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 13:43 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Eric W. Biederman, linux-kernel, kexec, H. Peter Anvin,
	Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 10:09:17AM +0100, Geert Uytterhoeven wrote:
> On Fri, Nov 22, 2013 at 2:55 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> Before you are done we need an ELF loader.  bzImage really is very
> >> uninteresting.  To the point I am not at all convinced that an in kernel
> >> loader should support it.
> >
> > Hi Eric,
> >
> > Why ELF case is so interesting. I have not use kexec to boot ELF
> > images in years and have not seen others using it too. In fact bzImage
> > seems to be the most common kernel image format for x86, most of the distros
> > ship and use.
> >
> > So first I did the loader for the common use case. There is no reason
> > that one can't write another loader for ELF images. It just bloats
> > the code. Hence I thought that other image loaders can follow slowly. I am
> > not sure why do you say that bzImage is uninteresting.
> 
> Welcome to the non-x86-centric world ;-)
> 
> Looking at kexec-tools, all of arm, cris, i386, ia64, m68k, mips, ppc, ppc64,
> s390, sh, and x86_64 support ELF.

How many of them use ELF to boot in the real world? Also, one can easily
add an ELF loader. I am just not able to see why an ELF loader should be
a requirement for this patchset.

> Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
> It's not clear to me what alpha supports (if it supports anything at all?).

The motivation behind this patchset is secureboot. That is x86 specific
only, and bzImage is the most commonly used format on that platform. So
it makes sense to implement the bzImage loader first, IMO.

One should be able to add support for more image loaders later as need
arises across different architectures.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:30       ` Jiri Kosina
@ 2013-11-22 13:46         ` Vivek Goyal
  2013-11-22 13:50           ` Jiri Kosina
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 13:46 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Geert Uytterhoeven, Eric W. Biederman, linux-kernel, kexec,
	H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 02:30:17PM +0100, Jiri Kosina wrote:
> On Fri, 22 Nov 2013, Geert Uytterhoeven wrote:
> 
> > > Why ELF case is so interesting. I have not use kexec to boot ELF
> > > images in years and have not seen others using it too. In fact bzImage
> > > seems to be the most common kernel image format for x86, most of the distros
> > > ship and use.
> > >
> > > So first I did the loader for the common use case. There is no reason
> > > that one can't write another loader for ELF images. It just bloats
> > > the code. Hence I thought that other image loaders can follow slowly. I am
> > > not sure why do you say that bzImage is uninteresting.
> > 
> > Welcome to the non-x86-centric world ;-)
> > 
> > Looking at kexec-tools, all of arm, cris, i386, ia64, m68k, mips, ppc, ppc64,
> > s390, sh, and x86_64 support ELF.
> > Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
> > It's not clear to me what alpha supports (if it supports anything at all?).
> 
> OTOH, does this feature make any sense whatsover on architectures that 
> don't support secure boot anyway?

I guess if signed modules make sense, then being able to kexec signed
kernel images should make sense too, in general.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:46         ` Vivek Goyal
@ 2013-11-22 13:50           ` Jiri Kosina
  2013-11-22 15:33             ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2013-11-22 13:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Geert Uytterhoeven, Eric W. Biederman, linux-kernel, kexec,
	H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, 22 Nov 2013, Vivek Goyal wrote:

> > OTOH, does this feature make any sense whatsover on architectures that 
> > don't support secure boot anyway?
> 
> I guess if signed modules makes sense, then being able to kexec signed
> kernel images should make sense too, in general.

Well, that's really a grey zone, I'd say.

In a non-secureboot environment, if you are root, you are able to issue 
reboot into a completely different, self-made kernel anyway, independent 
on whether signed modules are used or not.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:34     ` Eric W. Biederman
@ 2013-11-22 14:19       ` Vivek Goyal
  2013-11-22 19:48         ` Greg KH
  2013-11-23  3:23         ` Eric W. Biederman
  2013-11-25 10:04       ` Michael Holzheu
  1 sibling, 2 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 14:19 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, kexec, hpa, mjg59, greg

On Fri, Nov 22, 2013 at 05:34:03AM -0800, Eric W. Biederman wrote:

[..]
> > Why ELF case is so interesting. I have not use kexec to boot ELF
> > images in years and have not seen others using it too. In fact bzImage
> > seems to be the most common kernel image format for x86, most of the distros
> > ship and use.
> 
> ELF is interesting because it is the minimal file format that does
> everything you need.   So especially for a proof of concept ELF needs to
> come first.  There is an extra virtual address field in the ELF segment
> header but otherwise ELF does not have any unnecessary fields.
> 
> ELF is interesting because it is the native kernel file format on all
> architectures linux supports including x86.
> 
> ELF is interesting because producing an ELF image in practice requires
> a trivial amount of tooling so it is a good general purpose format to
> support.

Ok. I will have a look at ELF loader too. I was hoping to keep only one
loader in initial patch. But looks like that's not acceptable.

[..]
> >> There is also a huge missing piece of this in that your purgatory is not
> >> checking a hash of the loaded image before jumping too it.  Without that
> >> this is a huge regression at least for the kexec on panic case.  We
> >> absolutely need to check that the kernel sitting around in memory has
> >> not been corrupted before we let it run very far.
> >
> > Agreed. This should not be hard. It is just a matter of calcualting
> > digest of segments. I will store it in kimge and verify digest again
> > before passing control to control page. Will fix it in next version.
> 
> Nak.  The verification needs to happen in purgatory. 
> 
> The verification needs to happen in code whose runtime environment is
> does not depend on random parts of the kernel.  Anything else is a
> regression in maintainability and reliability.
> 
> It is the wrong direction to add any code to what needs to run in the
> known broken environment of the kernel when a panic happens.
> 
> Which means that you almost certainly need to go to the trouble of
> supporting the complexity needed to support purgatory code written in C.
> 
> (For those just tuning in purgatory is our term for the code that runs
> between the kernels to do those things that can not happen a priori).

In general, I agree with not using kernel parts after a crash.

But what protects against the purgatory itself having been scribbled over?
IOW, how different is purgatory memory compared to the kernel memory where
the digest routines are stored? They have an equal probability of being
scribbled over, and if that's the case, one is not better than the other.

And if they both have an equal probability of getting corrupted, then there
does not seem to be an advantage in moving digest verification inside
purgatory.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-21 19:21     ` Vivek Goyal
@ 2013-11-22 15:24       ` H. Peter Anvin
  0 siblings, 0 replies; 90+ messages in thread
From: H. Peter Anvin @ 2013-11-22 15:24 UTC (permalink / raw)
  To: Vivek Goyal, Greg KH; +Cc: linux-kernel, kexec, ebiederm, mjg59

On 11/21/2013 11:21 AM, Vivek Goyal wrote:
>>
>> What do you mean by the "real mode entry"?  Do we need to care about
>> that because we aren't falling back to real mode when executing this,
>> are we?  Or does that just happen for 32bit kernels?
> 
> Original kexec offers real mode entry choice too. So we fall back
> to real mode and jump to kernel and kernel makes bunch of BIOS calls. I
> don't think this is a commonly used option.
> 

I know some users of it (who I shall not name).  In general it is a bad
option, because after running the first kernel the state of the hardware
is not guaranteed to be such that executing the BIOS is safe.  I do think
they do it just because they tried it at some point and it happened to
work; discouraging its use is probably for the better.

	-hpa


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:43       ` Vivek Goyal
@ 2013-11-22 15:25         ` Geert Uytterhoeven
  2013-11-22 15:33           ` Jiri Kosina
  0 siblings, 1 reply; 90+ messages in thread
From: Geert Uytterhoeven @ 2013-11-22 15:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Eric W. Biederman, linux-kernel, kexec, H. Peter Anvin,
	Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 2:43 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> Looking at kexec-tools, all of arm, cris, i386, ia64, m68k, mips, ppc, ppc64,
>> s390, sh, and x86_64 support ELF.
>
> How many of them use ELF to boot in real world? Also one can easily
> add ELF loader. I am just not able to see why ELF loader should be
> a requirement for this patchset.

Many bootloaders support ELF.

>> Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
>> It's not clear to me what alpha supports (if it supports anything at all?).
>
> Motiviation behind this patchset is secureboot. That is x86 specific
> only and bzImage is most commonly used format on that platform. So it
> makes sense to implement bzImage loader first, IMO.

While secureboot(TM) may be x86-centric, IIRC actually loading signed kernels
and modules didn't originate on x86. Anything can have a bootloader that
accepts signed kernel images only.

Even without the signing, I like the simplicity of the new syscall, moving
some bookkeeping into the kernel and keeping some info there (e.g. the
kernel no longer needs to export system RAM chunks and the device tree or
bootinfo).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 15:25         ` Geert Uytterhoeven
@ 2013-11-22 15:33           ` Jiri Kosina
  2013-11-22 15:57             ` Eric Paris
  0 siblings, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2013-11-22 15:33 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Vivek Goyal, Eric W. Biederman, linux-kernel, kexec,
	H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, 22 Nov 2013, Geert Uytterhoeven wrote:

> >> Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
> >> It's not clear to me what alpha supports (if it supports anything at all?).
> >
> > Motiviation behind this patchset is secureboot. That is x86 specific
> > only and bzImage is most commonly used format on that platform. So it
> > makes sense to implement bzImage loader first, IMO.
> 
> While secureboot(TM) may be x86-centric

And ARM, right?

> IIRC actually loading signed kernels and modules didn't originate on 
> x86. Anything can have a bootloader that accepts signed kernel images 
> only.

Yes, but if you don't have the whole secure boot security model (i.e. root 
is implicitly untrusted), it's all just a game really.

If you are playing this "signed kernel and modules" game, but have a
trusted root, he's free to replace the bootloader with one that doesn't
verify the kernel signature, and reboot into an arbitrary kernel.

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:50           ` Jiri Kosina
@ 2013-11-22 15:33             ` Vivek Goyal
  2013-11-22 17:45               ` Kees Cook
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 15:33 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Geert Uytterhoeven, Eric W. Biederman, linux-kernel, kexec,
	H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman, Kees Cook

On Fri, Nov 22, 2013 at 02:50:43PM +0100, Jiri Kosina wrote:
> On Fri, 22 Nov 2013, Vivek Goyal wrote:
> 
> > > OTOH, does this feature make any sense whatsover on architectures that 
> > > don't support secure boot anyway?
> > 
> > I guess if signed modules makes sense, then being able to kexec signed
> > kernel images should make sense too, in general.
> 
> Well, that's really a grey zone, I'd say.
> 
> In a non-secureboot environment, if you are root, you are able to issue 
> reboot into a completely different, self-made kernel anyway, independent 
> on whether signed modules are used or not.

That's a good point. Frankly speaking, I don't know whether there is a
good use case for allowing only signed kernels to be loaded.

Kees mentioned that he would like to know where the kernel came from and
whether it came from a trusted disk or not. So he does seem to have a use
case where he wants to launch only a trusted kernel or deny execution.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 15:33           ` Jiri Kosina
@ 2013-11-22 15:57             ` Eric Paris
  2013-11-22 16:04               ` Jiri Kosina
  0 siblings, 1 reply; 90+ messages in thread
From: Eric Paris @ 2013-11-22 15:57 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Geert Uytterhoeven, Vivek Goyal, Eric W. Biederman, linux-kernel,
	kexec, H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 10:33 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Fri, 22 Nov 2013, Geert Uytterhoeven wrote:
>
>> >> Only arm, i386, ppc, ppc64, sh, and x86_64 support zImage.
>> >> It's not clear to me what alpha supports (if it supports anything at all?).
>> >
>> > Motiviation behind this patchset is secureboot. That is x86 specific
>> > only and bzImage is most commonly used format on that platform. So it
>> > makes sense to implement bzImage loader first, IMO.
>>
>> While secureboot(TM) may be x86-centric
>
> And ARM, right?
>
>> IIRC actually loading signed kernels and modules didn't originate on
>> x86. Anything can have a bootloader that accepts signed kernel images
>> only.
>
> Yes, but if you don't have the whole secure boot security model (i.e. root
> is implicitly untrusted), it's all just a game really.
>
> If you are playing this "signed kernel and modules" game, but have trusted
> root, he's free to replace the bootloader by one that wouldn't be
> verifying the kernel signature, and reboot into arbitrary kernel.

Ignore secureboot completely.

Consider a cloud provider who gives their customer a machine where they,
the cloud provider, specify the kernel and initrd.  This is a real thing
that people do today.  Root on the machine has ZERO control over the
kernel, bootloader, and initrd.  Check it out, qemu/kvm can do this.  But
there is no way to disable kexec if the distro configures it in (well,
there is in RHEL at least).  I've brought this up before with little
useful response from the kexec maintainers.  What I'd like is for only a
kernel trusted by me, the cloud operator, to be able to be kexec'd.  I'd
rather not have to completely turn off kexec...

Does it make sense how this is useful for things other than secureboot?
And we have users who want this.  This is not a speculative, completely
made-up, maybe-someday-someone-will-want-this kind of idea....

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 15:57             ` Eric Paris
@ 2013-11-22 16:04               ` Jiri Kosina
  2013-11-22 16:08                 ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2013-11-22 16:04 UTC (permalink / raw)
  To: Eric Paris
  Cc: Geert Uytterhoeven, Vivek Goyal, Eric W. Biederman, linux-kernel,
	kexec, H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, 22 Nov 2013, Eric Paris wrote:

> Consider a cloud provider who gives their customer a machine where
> they, the cloud provider, is specifying the kernel and initrd.  This
> is a real thing that people do today.  Root on the machine has ZERO
> control over the kernel, bootloader, and initrd.  Check it out,
> qemu/kvm can do this.  But, there is no way to disable kexec if the
> distro configures it in (well, there is in RHEL at least).  

If that root can load LKMs, access /dev/mem, or whatever else, there is
not really a point in disabling kexec anyway, as the same thing can be
implemented (although with more hassle, of course) through these channels
as well.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 16:04               ` Jiri Kosina
@ 2013-11-22 16:08                 ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 16:08 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Eric Paris, Geert Uytterhoeven, Eric W. Biederman, linux-kernel,
	kexec, H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 05:04:04PM +0100, Jiri Kosina wrote:
> On Fri, 22 Nov 2013, Eric Paris wrote:
> 
> > Consider a cloud provider who gives their customer a machine where
> > they, the cloud provider, is specifying the kernel and initrd.  This
> > is a real thing that people do today.  Root on the machine has ZERO
> > control over the kernel, bootloader, and initrd.  Check it out,
> > qemu/kvm can do this.  But, there is no way to disable kexec if the
> > distro configures it in (well, there is in RHEL at least).  
> 
> If that root can load LKMs, access /dev/mem, or whatever else, there is 
> not really a point disabling kexec anyway, is the same thing can be 
> implemented (although with more hassle, of course) through these channels 
> as well.

I am assuming that in the above scenario, the kernel will run in
locked-down mode (something like what Matthew implemented for secureboot),
where /dev/mem write access will be disabled and only signed modules will
be loaded.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 15:33             ` Vivek Goyal
@ 2013-11-22 17:45               ` Kees Cook
  0 siblings, 0 replies; 90+ messages in thread
From: Kees Cook @ 2013-11-22 17:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jiri Kosina, Geert Uytterhoeven, Eric W. Biederman, linux-kernel,
	kexec, H. Peter Anvin, Matthew Garrett, Greg Kroah-Hartman

On Fri, Nov 22, 2013 at 7:33 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Nov 22, 2013 at 02:50:43PM +0100, Jiri Kosina wrote:
>> On Fri, 22 Nov 2013, Vivek Goyal wrote:
>>
>> > > OTOH, does this feature make any sense whatsover on architectures that
>> > > don't support secure boot anyway?
>> >
>> > I guess if signed modules makes sense, then being able to kexec signed
>> > kernel images should make sense too, in general.
>>
>> Well, that's really a grey zone, I'd say.
>>
>> In a non-secureboot environment, if you are root, you are able to issue
>> reboot into a completely different, self-made kernel anyway, independent
>> on whether signed modules are used or not.
>
> That's a good poing. Frankly speaking I don't know if there is a good
> use case to allow loading signed kernels only or not.
>
> Kees mentioned that he would like to know where the kernel came from
> and whether it came from trusted disk or not. So he does seem to have
> a use case where he wants to launch only trusted kernel or deny execution.

Correct. Though to clarify, Chrome OS doesn't use UEFI SecureBoot: we
have a different solution that uses dm-verity to give us a trusted
read-only root filesystem. As long as things live on that filesystem,
we trust them. (This is why finit_module was added, and why I wanted
to make sure kexec used an fd instead of "just" a memory blob.)
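
For illustration, the fd-based flow from a verified root looks roughly
like this (sketch only; the syscall number and the exact argument order
were still being settled in this thread, so both are assumptions here):

#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_kexec_file_load
#define __NR_kexec_file_load 320	/* hypothetical number */
#endif

int main(void)
{
	/* Both files live on the dm-verity protected, read-only root. */
	int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
	int initrd_fd = open("/boot/initrd.img", O_RDONLY);
	static const char cmdline[] = "root=/dev/dm-0 ro";

	if (kernel_fd < 0 || initrd_fd < 0)
		return 1;

	return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline, sizeof(cmdline), 0UL) ? 1 : 0;
}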

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-21 19:19         ` Matthew Garrett
  2013-11-21 19:24           ` Vivek Goyal
@ 2013-11-22 18:57           ` Vivek Goyal
  2013-11-23  3:39             ` Eric W. Biederman
  1 sibling, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-22 18:57 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: Greg KH, linux-kernel, kexec, ebiederm, hpa, Peter Jones

On Thu, Nov 21, 2013 at 07:19:07PM +0000, Matthew Garrett wrote:
> On Thu, Nov 21, 2013 at 02:13:05PM -0500, Vivek Goyal wrote:
> > On Thu, Nov 21, 2013 at 07:06:20PM +0000, Matthew Garrett wrote:
> > > That would require a certain degree of massaging from userspace if we 
> > > want to be able to use the existing Authenticode signatures. Otherwise 
> > > we need to sign kernels twice.
> > 
> > I was thinking oof signing the same kernel twice. Can I sign authenticode
> > signed kernel again (using RSA signature as we do for modules) and append
> > the signature to bzImage. 
> 
> No, you'd need to do it the other way around.

Hmm..., I am running out of ideas here. This is what I understand.

- If I sign the bzImage (using a PKCS1.5 signature) and it is later signed
  with an authenticode format signature, then the PKCS1.5 signature will no
  longer be valid, as PE/COFF signing modifies the PE/COFF header in the
  bzImage. Another problem is that I then don't have a way to locate the
  PKCS1.5 signature.

- If the bzImage is first signed with an authenticode format signature and
  then signed using a PKCS1.5 signature, the authenticode format signature
  becomes invalid, as it also hashes the data appended at the end of the
  file.

So it looks like both signatures can't co-exist in the same file. That means
one signature has to be detached.

I am beginning to think we should create a kernel option which allows
choosing between attached and detached signatures, and extend the kexec
syscall with a parameter to pass in a detached signature. If no detached
signature is passed, look for a signature at the end of the file. That way,
those who sign kernels using a platform-specific format (authenticode in
this case) can generate a detached signature, while others can just use
attached signatures.
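
For the attached-signature case I am imagining something loosely modeled
on how module signing appends its signature (sketch only; the marker
string and the trailer layout below are made up for illustration):

#include <stdint.h>
#include <string.h>

#define SIG_MARKER	"~Kernel signature appended~\n"

/*
 * Hypothetical trailer layout: [image][signature][u32 sig_len][marker].
 * Returns 0 and fills *sig/*sig_len if an attached signature is found,
 * -1 if the caller should fall back to expecting a detached signature.
 */
static int find_attached_sig(const unsigned char *buf, size_t len,
			     const unsigned char **sig, uint32_t *sig_len)
{
	size_t marker_len = strlen(SIG_MARKER);
	uint32_t slen;

	if (len < marker_len + sizeof(slen))
		return -1;
	if (memcmp(buf + len - marker_len, SIG_MARKER, marker_len))
		return -1;

	memcpy(&slen, buf + len - marker_len - sizeof(slen), sizeof(slen));
	if (slen > len - marker_len - sizeof(slen))
		return -1;	/* corrupt length field */

	*sig = buf + len - marker_len - sizeof(slen) - slen;
	*sig_len = slen;
	return 0;
}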

Any thoughts on how this should be handled?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 14:19       ` Vivek Goyal
@ 2013-11-22 19:48         ` Greg KH
  2013-11-23  3:23         ` Eric W. Biederman
  1 sibling, 0 replies; 90+ messages in thread
From: Greg KH @ 2013-11-22 19:48 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Eric W. Biederman, linux-kernel, kexec, hpa, mjg59

On Fri, Nov 22, 2013 at 09:19:46AM -0500, Vivek Goyal wrote:
> On Fri, Nov 22, 2013 at 05:34:03AM -0800, Eric W. Biederman wrote:
> 
> [..]
> > > Why ELF case is so interesting. I have not use kexec to boot ELF
> > > images in years and have not seen others using it too. In fact bzImage
> > > seems to be the most common kernel image format for x86, most of the distros
> > > ship and use.
> > 
> > ELF is interesting because it is the minimal file format that does
> > everything you need.   So especially for a proof of concept ELF needs to
> > come first.  There is an extra virtual address field in the ELF segment
> > header but otherwise ELF does not have any unnecessary fields.
> > 
> > ELF is interesting because it is the native kernel file format on all
> > architectures linux supports including x86.
> > 
> > ELF is interesting because producing an ELF image in practice requires
> > a trivial amount of tooling so it is a good general purpose format to
> > support.
> 
> Ok. I will have a look at ELF loader too. I was hoping to keep only one
> loader in initial patch. But looks like that's not acceptable.

I totally disagree.  I think what you have done now is fine.  If it
works for bzImage, that's a good sign that this is usable as-is.

And, if someone else cares about signed ELF images, hey, let them
implement the loader for it :)

Either way, the syscall interface wouldn't change, which is the
important thing to get right, so you should be fine for now.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
  2013-11-21 19:03   ` Greg KH
@ 2013-11-22 20:42   ` Jiri Kosina
  2014-01-17 19:17     ` Vivek Goyal
  2013-11-29  3:10   ` Baoquan He
  2013-12-04  1:56   ` Baoquan He
  3 siblings, 1 reply; 90+ messages in thread
From: Jiri Kosina @ 2013-11-22 20:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg

On Wed, 20 Nov 2013, Vivek Goyal wrote:

> This patch implements the in kernel kexec functionality. It implements a
> new system call kexec_file_load. I think parameter list of this system
> call will change as I have not done the kernel image signature handling
> yet. I have been told that I might have to pass the detached signature
> and size as part of system call.
> 
> Previously segment list was prepared in user space. Now user space just
> passes kernel fd, initrd fd and command line and kernel will create a
> segment list internally.
> 
> This patch contains generic part of the code. Actual segment preparation
> and loading is done by arch and image specific loader. Which comes in
> next patch.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
[ ... snip ... ]
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 6238927..50bcaa8 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
[ ... snip ... ]
> @@ -843,7 +1075,11 @@ static int kimage_load_normal_segment(struct kimage *image,
>  				PAGE_SIZE - (maddr & ~PAGE_MASK));
>  		uchunk = min(ubytes, mchunk);
>  
> -		result = copy_from_user(ptr, buf, uchunk);
> +		/* For file based kexec, source pages are in kernel memory */
> +		if (image->file_mode)
> +			memcpy(ptr, buf, uchunk);

Very minor nit I came across when going through the patchset -- can't we 
use some different buffer for the file-based kexec that's not marked 
__user here? This really causes some eye-pain when looking at the code.
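
Something along these lines would avoid it (just a sketch, assuming the
segment structure from this patchset; the union member names are made up):

#include <stddef.h>

#ifndef __user
#define __user		/* sparse annotation, a no-op for the compiler */
#endif

struct kexec_segment {
	union {
		void __user	*buf;	/* kexec_load(): source in user space */
		void		*kbuf;	/* kexec_file_load(): source in kernel memory */
	};
	size_t		bufsz;
	unsigned long	mem;	/* destination physical address */
	size_t		memsz;
};

Then the file-mode path can memcpy() from ->kbuf and the user-mode path
can copy_from_user() from ->buf, without any casting between the two.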

Thanks a lot for pushing this patchset forward,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 14:19       ` Vivek Goyal
  2013-11-22 19:48         ` Greg KH
@ 2013-11-23  3:23         ` Eric W. Biederman
  2013-12-04 19:34           ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: Eric W. Biederman @ 2013-11-23  3:23 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, hpa, mjg59, greg

Vivek Goyal <vgoyal@redhat.com> writes:

> On Fri, Nov 22, 2013 at 05:34:03AM -0800, Eric W. Biederman wrote:
>
> [..]
>> > Why ELF case is so interesting. I have not use kexec to boot ELF
>> > images in years and have not seen others using it too. In fact bzImage
>> > seems to be the most common kernel image format for x86, most of the distros
>> > ship and use.
>> 
>> ELF is interesting because it is the minimal file format that does
>> everything you need.   So especially for a proof of concept ELF needs to
>> come first.  There is an extra virtual address field in the ELF segment
>> header but otherwise ELF does not have any unnecessary fields.
>> 
>> ELF is interesting because it is the native kernel file format on all
>> architectures linux supports including x86.
>> 
>> ELF is interesting because producing an ELF image in practice requires
>> a trivial amount of tooling so it is a good general purpose format to
>> support.
>
> Ok. I will have a look at ELF loader too. I was hoping to keep only one
> loader in initial patch. But looks like that's not acceptable.

Thank you.  We really need an ELF loader to see that we have handled the
general case and to see that we have a solution that is portable to
non-x86.
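
To show how little a loader actually needs from the format, a rough
sketch (illustration only -- no error handling, and the segment struct
is just a stand-in for whatever the kernel ends up using):

#include <elf.h>
#include <stddef.h>

struct segment {
	const void	*buf;	/* source bytes in the file */
	size_t		bufsz;
	unsigned long	mem;	/* destination physical address */
	size_t		memsz;	/* p_memsz, i.e. including .bss */
};

static size_t elf64_to_segments(const unsigned char *file, struct segment *segs)
{
	const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)file;
	const Elf64_Phdr *phdr = (const Elf64_Phdr *)(file + ehdr->e_phoff);
	size_t i, n = 0;

	for (i = 0; i < ehdr->e_phnum; i++) {
		if (phdr[i].p_type != PT_LOAD)
			continue;
		segs[n].buf   = file + phdr[i].p_offset;
		segs[n].bufsz = phdr[i].p_filesz;
		segs[n].mem   = phdr[i].p_paddr;
		segs[n].memsz = phdr[i].p_memsz;
		n++;
	}
	return n;
}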

> [..]
>> >> There is also a huge missing piece of this in that your purgatory is not
>> >> checking a hash of the loaded image before jumping too it.  Without that
>> >> this is a huge regression at least for the kexec on panic case.  We
>> >> absolutely need to check that the kernel sitting around in memory has
>> >> not been corrupted before we let it run very far.
>> >
>> > Agreed. This should not be hard. It is just a matter of calcualting
>> > digest of segments. I will store it in kimge and verify digest again
>> > before passing control to control page. Will fix it in next version.
>> 
>> Nak.  The verification needs to happen in purgatory. 
>> 
>> The verification needs to happen in code whose runtime environment is
>> does not depend on random parts of the kernel.  Anything else is a
>> regression in maintainability and reliability.
>> 
>> It is the wrong direction to add any code to what needs to run in the
>> known broken environment of the kernel when a panic happens.
>> 
>> Which means that you almost certainly need to go to the trouble of
>> supporting the complexity needed to support purgatory code written in C.
>> 
>> (For those just tuning in purgatory is our term for the code that runs
>> between the kernels to do those things that can not happen a priori).
>
> In general, I agree with not using kernel parts after crash.
>
> But what protects against that purgatory itself has been scribbled over.
> IOW, how different purgatory memory is as compared to kernel memory where
> digest routines are stored. They have got equal probably of being scribbled
> over and if that's the case one is not better than other?
>
> And if they both got equal probability to getting corrupted, then there does
> not seem to be an advantage in moving digest verification inside
> purgatory.

The primary reason is that maintaining code in the kernel that is
safe during a crash dump is hard.  That is why we boot a second kernel
after all.  If the code to do the signature verification resides in
machine_kexec on the kexec-on-panic code path, in the kernel that has
called panic, it is almost a given that at some point or other someone
will add an option that introduces a weird dependency and makes the code
unsafe when the kernel is crashing.  I have seen it happen several times
on the existing kexec-on-panic code path.  I have seen it on other code
paths like netconsole, which on some kernels I have running can
currently cause the kernel to go into an endless printk loop if you call
printk from interrupt context.  So what we really gain by moving the
verification into purgatory is protection from inappropriate code reuse.

So having a completely separate piece of code may be a little harder to
write initially, but the code is much simpler and more reliable to
maintain, essentially requiring no maintenance effort.  Further, getting
to the point where purgatory is written in C makes small changes much
more approachable.
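
As a rough illustration of what freestanding means here (sketch only --
the region table layout is an assumption, and the digest is a trivial
placeholder for something like sha256 carried into purgatory):

struct purgatory_region {
	unsigned long	start;		/* physical address of the segment */
	unsigned long	len;
	unsigned long	digest;		/* placeholder for a real hash value */
};

/* Filled in by the first kernel at load time, long before any crash. */
extern struct purgatory_region regions[];
extern unsigned long nr_regions;

/* No libc, no kernel calls -- nothing here depends on the crashed kernel. */
static unsigned long toy_digest(const unsigned char *p, unsigned long len)
{
	unsigned long d = 0;

	while (len--)
		d = d * 131 + *p++;
	return d;
}

int verify_loaded_segments(void)
{
	unsigned long i;

	for (i = 0; i < nr_regions; i++)
		if (toy_digest((const unsigned char *)regions[i].start,
			       regions[i].len) != regions[i].digest)
			return 1;	/* corrupted: do not jump to the new kernel */
	return 0;
}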

Eric


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-22 18:57           ` Vivek Goyal
@ 2013-11-23  3:39             ` Eric W. Biederman
  2013-11-25 16:39               ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Eric W. Biederman @ 2013-11-23  3:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Matthew Garrett, Greg KH, linux-kernel, kexec, hpa, Peter Jones

Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Nov 21, 2013 at 07:19:07PM +0000, Matthew Garrett wrote:
>> On Thu, Nov 21, 2013 at 02:13:05PM -0500, Vivek Goyal wrote:
>> > On Thu, Nov 21, 2013 at 07:06:20PM +0000, Matthew Garrett wrote:
>> > > That would require a certain degree of massaging from userspace if we 
>> > > want to be able to use the existing Authenticode signatures. Otherwise 
>> > > we need to sign kernels twice.
>> > 
>> > I was thinking oof signing the same kernel twice. Can I sign authenticode
>> > signed kernel again (using RSA signature as we do for modules) and append
>> > the signature to bzImage. 
>> 
>> No, you'd need to do it the other way around.
>
> Hmm..., I am running out of ideas here. This is what I understand.
>
> - If I sign the bzImage (using PKCS1.5 signature), and later it is signed
>   with authenticode format signatures, then PKCS1.5 signatures will not be
>   valid as PE/COFF signing will do some modification to PE/COFF header in
>   bzImage. And another problem is that then I don't have a way to find
>   PKCS1.5 signature.
>
> - If bzImage is first signed with authenticode format signature and then
>   signed using PKCS1.5 signature, then  authenticode format signature
>   will become invalid as it will also hash the data appened at the end
>   of file.
>
> So looks like both signatures can't co-exist on same file. That means
> one signature has to be detached.
>
> I am beginning to think that create a kernel option which allows to choose
> between attached and detached signatures. Extend kexec syscall to allow
> a parameter to pass in detached signatures. If detached signatures are
> not passed, then look for signatures at the end of file. That way, those
> who are signing kernels using platform specific format (authenticode) in 
> this case, they can generate detached signature while others can just
> use attached signatures.
>
> Any thoughts on how this should be handled?

Inside of a modern bzImage there is an embedded ELF image.  How about in
userspace we just strip out the embedded ELF image and write that to a
file?  Then we can use the same signature checking scheme as we do for
kernel modules, and you only have to support one file format.

As I recall, there are already some platforms on x86, like Xen, that
need to strip out the embedded ELF image for their loaders to have all
of the information they need to load the image.

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-22 13:34     ` Eric W. Biederman
  2013-11-22 14:19       ` Vivek Goyal
@ 2013-11-25 10:04       ` Michael Holzheu
  2013-11-25 15:36         ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: Michael Holzheu @ 2013-11-25 10:04 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Eric W. Biederman, mjg59, kexec, linux-kernel, greg, hpa

On Fri, 22 Nov 2013 05:34:03 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Vivek Goyal <vgoyal@redhat.com> writes:

> >> There is also a huge missing piece of this in that your purgatory is not
> >> checking a hash of the loaded image before jumping too it.  Without that
> >> this is a huge regression at least for the kexec on panic case.  We
> >> absolutely need to check that the kernel sitting around in memory has
> >> not been corrupted before we let it run very far.
> >
> > Agreed. This should not be hard. It is just a matter of calcualting
> > digest of segments. I will store it in kimge and verify digest again
> > before passing control to control page. Will fix it in next version.
> 
> Nak.  The verification needs to happen in purgatory. 
> 
> The verification needs to happen in code whose runtime environment is
> does not depend on random parts of the kernel.  Anything else is a
> regression in maintainability and reliability.

Hello Vivek,

Just to be sure that you have not forgotten the following s390 detail:

On s390 we first call purgatory with parameter "0" to do the
checksum test. If this fails, we still have our traditional
stand-alone dump as a backup solution. If the checksum test was ok,
we call purgatory a second time with parameter "1", which then
starts kdump.

Could you please ensure that this mechanism still works after
your rework?

Best Regards,
Michael


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-25 10:04       ` Michael Holzheu
@ 2013-11-25 15:36         ` Vivek Goyal
  2013-11-25 16:15           ` Michael Holzheu
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-25 15:36 UTC (permalink / raw)
  To: Michael Holzheu; +Cc: Eric W. Biederman, mjg59, kexec, linux-kernel, greg, hpa

On Mon, Nov 25, 2013 at 11:04:28AM +0100, Michael Holzheu wrote:
> On Fri, 22 Nov 2013 05:34:03 -0800
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> 
> > Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > >> There is also a huge missing piece of this in that your purgatory is not
> > >> checking a hash of the loaded image before jumping too it.  Without that
> > >> this is a huge regression at least for the kexec on panic case.  We
> > >> absolutely need to check that the kernel sitting around in memory has
> > >> not been corrupted before we let it run very far.
> > >
> > > Agreed. This should not be hard. It is just a matter of calcualting
> > > digest of segments. I will store it in kimge and verify digest again
> > > before passing control to control page. Will fix it in next version.
> > 
> > Nak.  The verification needs to happen in purgatory. 
> > 
> > The verification needs to happen in code whose runtime environment is
> > does not depend on random parts of the kernel.  Anything else is a
> > regression in maintainability and reliability.
> 
> Hello Vivek,
> 
> Just to be sure that you have not forgotten the following s390 detail:
> 
> On s390 we first call purgatory with parameter "0" for doing the
> checksum test. If this fails, we can have as backup solution our
> traditional stand-alone dump. In case tha checksum test was ok,
> we call purgatory a second time with parameter "1" which then
> starts kdump.
> 
> Could you please ensure that this mechanism also works after
> your rework.

Hi Michael,

Is all that logic in the arch-dependent portion of s390? If yes, I am not
touching any arch-dependent part of s390 yet and am only doing the x86
implementation.

The generic changes should be usable by s390 and you should be able to do
the same thing there, though we are still debating whether the segment
checksum verification logic should be part of purgatory or of the core
kernel.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-25 15:36         ` Vivek Goyal
@ 2013-11-25 16:15           ` Michael Holzheu
  0 siblings, 0 replies; 90+ messages in thread
From: Michael Holzheu @ 2013-11-25 16:15 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Eric W. Biederman, mjg59, kexec, linux-kernel, greg, hpa

On Mon, 25 Nov 2013 10:36:20 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Mon, Nov 25, 2013 at 11:04:28AM +0100, Michael Holzheu wrote:
> > On Fri, 22 Nov 2013 05:34:03 -0800
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> > 
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > >> There is also a huge missing piece of this in that your purgatory is not
> > > >> checking a hash of the loaded image before jumping too it.  Without that
> > > >> this is a huge regression at least for the kexec on panic case.  We
> > > >> absolutely need to check that the kernel sitting around in memory has
> > > >> not been corrupted before we let it run very far.
> > > >
> > > > Agreed. This should not be hard. It is just a matter of calcualting
> > > > digest of segments. I will store it in kimge and verify digest again
> > > > before passing control to control page. Will fix it in next version.
> > > 
> > > Nak.  The verification needs to happen in purgatory. 
> > > 
> > > The verification needs to happen in code whose runtime environment is
> > > does not depend on random parts of the kernel.  Anything else is a
> > > regression in maintainability and reliability.
> > 
> > Hello Vivek,
> > 
> > Just to be sure that you have not forgotten the following s390 detail:
> > 
> > On s390 we first call purgatory with parameter "0" for doing the
> > checksum test. If this fails, we can have as backup solution our
> > traditional stand-alone dump. In case tha checksum test was ok,
> > we call purgatory a second time with parameter "1" which then
> > starts kdump.
> > 
> > Could you please ensure that this mechanism also works after
> > your rework.
> 
> Hi Michael,
> 
> All that logic in in arch dependent portion of s390? If yes, I am not
> touching any arch dependent part of s390 yet and only doing implementation
> of x86.

Yes, part of s390 architecture code (kernel and kexec purgatory).

kernel:
-------
arch/s390/kernel/machine_kexec.c:
 kdump_csum_valid() -> rc = start_kdump(0);
 __do_machine_kdump() -> start_kdump(1)

kexec tools:
------------
purgatory/arch/s390/setup-s390.S
  cghi %r2,0
  je verify_checksums

> Generic changes should be usable by s390 and you should be able to do
> same thing there. Though we are still detating whether segment checksum
> verification logic should be part of purgatory or core kernel.

Yes, that was my concern. If you move the purgatory checksum logic to
the kernel, we will probably have to take our s390 checksum test into
account.
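
For reference, on the kernel side this boils down to roughly the
following (simplified sketch of the existing flow referenced above;
return-value conventions are glossed over, and these helper names are
only illustrative):

/* start_kdump() is the purgatory entry point loaded by kexec-tools. */
typedef int (*start_kdump_t)(int phase);

static int kdump_checksums_ok(void *purgatory_entry)
{
	start_kdump_t start_kdump = purgatory_entry;

	/* Phase 0: purgatory only runs the checksum test and returns. */
	return start_kdump(0);
}

static void start_kdump_kernel(void *purgatory_entry)
{
	start_kdump_t start_kdump = purgatory_entry;

	/* Phase 1: checksums were ok, really hand control to kdump. */
	start_kdump(1);
}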

Thanks!
Michael


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-23  3:39             ` Eric W. Biederman
@ 2013-11-25 16:39               ` Vivek Goyal
  2013-11-26 12:23                 ` Eric W. Biederman
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-25 16:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Garrett, Greg KH, linux-kernel, kexec, hpa, Peter Jones

On Fri, Nov 22, 2013 at 07:39:14PM -0800, Eric W. Biederman wrote:

[..]
> > Hmm..., I am running out of ideas here. This is what I understand.
> >
> > - If I sign the bzImage (using PKCS1.5 signature), and later it is signed
> >   with authenticode format signatures, then PKCS1.5 signatures will not be
> >   valid as PE/COFF signing will do some modification to PE/COFF header in
> >   bzImage. And another problem is that then I don't have a way to find
> >   PKCS1.5 signature.
> >
> > - If bzImage is first signed with authenticode format signature and then
> >   signed using PKCS1.5 signature, then  authenticode format signature
> >   will become invalid as it will also hash the data appened at the end
> >   of file.
> >
> > So looks like both signatures can't co-exist on same file. That means
> > one signature has to be detached.
> >
> > I am beginning to think that create a kernel option which allows to choose
> > between attached and detached signatures. Extend kexec syscall to allow
> > a parameter to pass in detached signatures. If detached signatures are
> > not passed, then look for signatures at the end of file. That way, those
> > who are signing kernels using platform specific format (authenticode) in 
> > this case, they can generate detached signature while others can just
> > use attached signatures.
> >
> > Any thoughts on how this should be handled?
> 
> Inside of a modern bzImage there is an embedded ELF image.  How about in
> userspace we just strip out the embedded ELF image and write that to a
> file.  Then we can use the same signature checking scheme as we do for
> kernel modules.  And you only have to support one file format.

I think there is a problem with that: we lose the additional
metadata present in the bzImage, which is important.

For example, knowing how much memory the kernel will consume before it
sets up its own GDT and page tables (init_size) is very important. That
gives the image loader a lot of flexibility in figuring out where to place
the rest of the components safely (initrd, GDT, page tables, ELF header
segment, backup region, etc.).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-25 16:39               ` Vivek Goyal
@ 2013-11-26 12:23                 ` Eric W. Biederman
  2013-11-26 14:27                   ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Eric W. Biederman @ 2013-11-26 12:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Matthew Garrett, Greg KH, linux-kernel, kexec, hpa, Peter Jones

Vivek Goyal <vgoyal@redhat.com> writes:

> On Fri, Nov 22, 2013 at 07:39:14PM -0800, Eric W. Biederman wrote:
>
> [..]
>> > Hmm..., I am running out of ideas here. This is what I understand.
>> >
>> > - If I sign the bzImage (using PKCS1.5 signature), and later it is signed
>> >   with authenticode format signatures, then PKCS1.5 signatures will not be
>> >   valid as PE/COFF signing will do some modification to PE/COFF header in
>> >   bzImage. And another problem is that then I don't have a way to find
>> >   PKCS1.5 signature.
>> >
>> > - If bzImage is first signed with authenticode format signature and then
>> >   signed using PKCS1.5 signature, then  authenticode format signature
>> >   will become invalid as it will also hash the data appened at the end
>> >   of file.
>> >
>> > So looks like both signatures can't co-exist on same file. That means
>> > one signature has to be detached.
>> >
>> > I am beginning to think that create a kernel option which allows to choose
>> > between attached and detached signatures. Extend kexec syscall to allow
>> > a parameter to pass in detached signatures. If detached signatures are
>> > not passed, then look for signatures at the end of file. That way, those
>> > who are signing kernels using platform specific format (authenticode) in 
>> > this case, they can generate detached signature while others can just
>> > use attached signatures.
>> >
>> > Any thoughts on how this should be handled?
>> 
>> Inside of a modern bzImage there is an embedded ELF image.  How about in
>> userspace we just strip out the embedded ELF image and write that to a
>> file.  Then we can use the same signature checking scheme as we do for
>> kernel modules.  And you only have to support one file format.
>
> I think there is a problem with that. And that we lose the additional
> metadata info present in bzImage which is important.
>
> For example, knowing how much memory kernel will consume before it
> sets up its own GDT and page tables (init_size) is very important. That
> gives image loader lot of flexibility in figuring out where to place rest
> of the components safely (initrd, GDT, page tables, ELF header segment, 
> backup region etc).

The init_size should be reflected in the .bss of the ELF segments.  If
not, it is a bug in generating the kernel ELF headers and should be
fixed.

For use by kexec I don't see any issue with just signing the embedded
ELF image.

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-26 12:23                 ` Eric W. Biederman
@ 2013-11-26 14:27                   ` Vivek Goyal
  2013-12-19 12:54                     ` Torsten Duwe
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-11-26 14:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Garrett, Greg KH, linux-kernel, kexec, hpa, Peter Jones,
	Kees Cook

On Tue, Nov 26, 2013 at 04:23:36AM -0800, Eric W. Biederman wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Fri, Nov 22, 2013 at 07:39:14PM -0800, Eric W. Biederman wrote:
> >
> > [..]
> >> > Hmm..., I am running out of ideas here. This is what I understand.
> >> >
> >> > - If I sign the bzImage (using PKCS1.5 signature), and later it is signed
> >> >   with authenticode format signatures, then PKCS1.5 signatures will not be
> >> >   valid as PE/COFF signing will do some modification to PE/COFF header in
> >> >   bzImage. And another problem is that then I don't have a way to find
> >> >   PKCS1.5 signature.
> >> >
> >> > - If bzImage is first signed with authenticode format signature and then
> >> >   signed using PKCS1.5 signature, then  authenticode format signature
> >> >   will become invalid as it will also hash the data appened at the end
> >> >   of file.
> >> >
> >> > So looks like both signatures can't co-exist on same file. That means
> >> > one signature has to be detached.
> >> >
> >> > I am beginning to think that create a kernel option which allows to choose
> >> > between attached and detached signatures. Extend kexec syscall to allow
> >> > a parameter to pass in detached signatures. If detached signatures are
> >> > not passed, then look for signatures at the end of file. That way, those
> >> > who are signing kernels using platform specific format (authenticode) in 
> >> > this case, they can generate detached signature while others can just
> >> > use attached signatures.
> >> >
> >> > Any thoughts on how this should be handled?
> >> 
> >> Inside of a modern bzImage there is an embedded ELF image.  How about in
> >> userspace we just strip out the embedded ELF image and write that to a
> >> file.  Then we can use the same signature checking scheme as we do for
> >> kernel modules.  And you only have to support one file format.
> >
> > I think there is a problem with that. And that we lose the additional
> > metadata info present in bzImage which is important.
> >
> > For example, knowing how much memory kernel will consume before it
> > sets up its own GDT and page tables (init_size) is very important. That
> > gives image loader lot of flexibility in figuring out where to place rest
> > of the components safely (initrd, GDT, page tables, ELF header segment, 
> > backup region etc).
> 
> The init_size should be reflected in the .bss of the ELF segments.  If
> not it is a bug when generating the kernel ELF headers and should be
> fixed.
> 
> For use by kexec I don't see any issues with just signing the embedded
> ELF image.

Hmm..., I will check this.

I have another concern though. If we extract the ELF image and write it to
a file, this assumes that we have a good destination where the file can be
written. That at least does not seem to be useful to the Chrome OS people,
where the root is read-only, and if a kernel comes from that root they know
it can be trusted.

Being able to write the kernel to a file and then load it feels a little
odd to me. This should be allowed, but it should not be mandatory.

I think that if we allow passing a detached signature in the kexec system
call, it makes things much more flexible. We should be able to do what you
are suggesting, and at the same time it keeps the possibility open for what
the Chrome OS developers are looking for.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-11-20 17:50 ` [PATCH 6/6] kexec: Support for Kexec on panic using new system call Vivek Goyal
@ 2013-11-28 11:28   ` Baoquan He
  2013-12-02 15:30     ` Vivek Goyal
  2013-12-04  1:41   ` Baoquan He
  1 sibling, 1 reply; 90+ messages in thread
From: Baoquan He @ 2013-11-28 11:28 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> This patch adds support for loading a kexec on panic (kdump) kernel usning
> new system call.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/crash.h       |    9 +
>  arch/x86/include/asm/kexec.h       |   17 +
>  arch/x86/kernel/crash.c            |  585 ++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/kexec-bzimage.c    |   63 ++++-
>  arch/x86/kernel/machine_kexec_64.c |    1 +
>  kernel/kexec.c                     |   69 ++++-
>  6 files changed, 731 insertions(+), 13 deletions(-)
>  create mode 100644 arch/x86/include/asm/crash.h
> 
> diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
> new file mode 100644
> index 0000000..2dd2eb8
> --- /dev/null
> +++ b/arch/x86/include/asm/crash.h
> @@ -0,0 +1,9 @@
> +#ifndef _ASM_X86_CRASH_H
> +#define _ASM_X86_CRASH_H
> +
> +int load_crashdump_segments(struct kimage *image);
> +int crash_copy_backup_region(struct kimage *image);
> +int crash_setup_memmap_entries(struct kimage *image,
> +		struct boot_params *params);
> +
> +#endif /* _ASM_X86_CRASH_H */
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 94f1257..9dc19fe 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -64,6 +64,10 @@
>  # define KEXEC_ARCH KEXEC_ARCH_X86_64
>  #endif
>  
> +/* Memory to backup during crash kdump */
> +#define KEXEC_BACKUP_SRC_START	(0UL)
> +#define KEXEC_BACKUP_SRC_END	(655360UL)	/* 640K */
> +
>  /*
>   * CPU does not save ss and sp on stack if execution is already
>   * running in kernel mode at the time of NMI occurrence. This code
> @@ -166,8 +170,21 @@ struct kimage_arch {
>  	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
> +	/* Details of backup region */
> +	unsigned long backup_src_start;
> +	unsigned long backup_src_sz;
> +
> +	/* Physical address of backup segment */
> +	unsigned long backup_load_addr;
> +
> +	/* Core ELF header buffer */
> +	unsigned long elf_headers;
> +	unsigned long elf_headers_sz;
> +	unsigned long elf_load_addr;
>  };
> +#endif /* CONFIG_X86_32 */
>  
> +#ifdef CONFIG_X86_64
>  struct kexec_entry64_regs {
>  	uint64_t rax;
>  	uint64_t rbx;
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 18677a9..d5d3118 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -4,6 +4,9 @@
>   * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
>   *
>   * Copyright (C) IBM Corporation, 2004. All rights reserved.
> + * Copyright (C) Red Hat Inc., 2013. All rights reserved.
> + * Authors:
> + * 	Vivek Goyal <vgoyal@redhat.com>
>   *
>   */
>  
> @@ -17,6 +20,7 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/module.h>
> +#include <linux/slab.h>
>  
>  #include <asm/processor.h>
>  #include <asm/hardirq.h>
> @@ -29,6 +33,45 @@
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
>  
> +/* Alignment required for elf header segment */
> +#define ELF_CORE_HEADER_ALIGN   4096
> +
> +/* This primarily reprsents number of split ranges due to exclusion */
> +#define CRASH_MAX_RANGES	16
> +
> +struct crash_mem_range {
> +	unsigned long long start, end;
> +};
> +
> +struct crash_mem {
> +	unsigned int nr_ranges;
> +	struct crash_mem_range ranges[CRASH_MAX_RANGES];
> +};
> +
> +/* Misc data about ram ranges needed to prepare elf headers */
> +struct crash_elf_data {
> +	struct kimage *image;
> +	/*
> +	 * Total number of ram ranges we have after various ajustments for
> +	 * GART, crash reserved region etc.
> +	 */
> +	unsigned int max_nr_ranges;
> +	unsigned long gart_start, gart_end;
> +
> +	/* Pointer to elf header */
> +	void *ehdr;
> +	/* Pointer to next phdr */
> +	void *bufp;
> +	struct crash_mem mem;
> +};
> +
> +/* Used while prepareing memory map entries for second kernel */
> +struct crash_memmap_data {
> +	struct boot_params *params;
> +	/* Type of memory */
> +	unsigned int type;
> +};
> +
>  int in_crash_kexec;
>  
>  /*
> @@ -138,3 +181,545 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  #endif
>  	crash_save_cpu(regs, safe_smp_processor_id());
>  }
> +
> +#ifdef CONFIG_X86_64
> +static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> +				unsigned long nr_pfn, void *arg)
> +{
> +	int *nr_ranges = arg;
> +
> +	(*nr_ranges)++;
> +	return 0;
> +}
> +
> +static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +
> +	ced->gart_start = start;
> +	ced->gart_end = end;
> +
> +	/* Not expecting more than 1 gart aperture */
> +	return 1;
> +}
> +
> +
> +/* Gather all the required information to prepare elf headers for ram regions */
> +static int fill_up_ced(struct crash_elf_data *ced, struct kimage *image)
> +{
> +	unsigned int nr_ranges = 0;
> +
> +	ced->image = image;
> +
> +	walk_system_ram_range(0, -1, &nr_ranges,
> +				get_nr_ram_ranges_callback);
> +
> +	ced->max_nr_ranges = nr_ranges;
> +
> +	/*
> +	 * We don't create ELF headers for GART aperture as an attempt
> +	 * to dump this memory in second kernel leads to hang/crash.
> +	 * If gart aperture is present, one needs to exclude that region
> +	 * and that could lead to need of extra phdr.
> +	 */
> +
> +	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
> +				ced, get_gart_ranges_callback);
> +
> +	/*
> +	 * If we have gart region, excluding that could potentially split
> +	 * a memory range, resulting in extra header. Account for  that.
> +	 */
> +	if (ced->gart_end)
> +		ced->max_nr_ranges++;
> +
> +	/* Exclusion of crash region could split memory ranges */
> +	ced->max_nr_ranges++;
> +
> +	/* If crashk_low_res is there, another range split possible */
> +	if (crashk_low_res.end != 0)
> +		ced->max_nr_ranges++;
> +
> +	return 0;
> +}
> +
> +static int exclude_mem_range(struct crash_mem *mem,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	int i, j;
> +	unsigned long long start, end;
> +	struct crash_mem_range temp_range = {0, 0};
> +
> +	for (i = 0; i < mem->nr_ranges; i++) {
> +		start = mem->ranges[i].start;
> +		end = mem->ranges[i].end;
> +
> +		if (mstart > end || mend < start)
> +			continue;
> +
> +		/* Truncate any area outside of range */
> +		if (mstart < start)
> +			mstart = start;
> +		if (mend > end)
> +			mend = end;
> +
> +		/* Found completely overlapping range */
> +		if (mstart == start && mend == end) {
> +			mem->ranges[i].start = 0;
> +			mem->ranges[i].end = 0;
> +			if (i < mem->nr_ranges - 1) {
> +				/* Shift rest of the ranges to left */
> +				for(j = i; j < mem->nr_ranges - 1; j++) {
> +					mem->ranges[j].start =
> +						mem->ranges[j+1].start;
> +					mem->ranges[j].end =
> +							mem->ranges[j+1].end;
> +				}
> +			}
> +			mem->nr_ranges--;
> +			return 0;
> +		}
> +
> +		if (mstart > start && mend < end) {
> +			/* Split original range */
> +			mem->ranges[i].end = mstart - 1;
> +			temp_range.start = mend + 1;
> +			temp_range.end = end;
> +		} else if (mstart != start)
> +			mem->ranges[i].end = mstart - 1;
> +		else
> +			mem->ranges[i].start = mend + 1;
> +		break;
> +	}
> +
> +	/* If a split happend, add the split in array */
> +	if (!temp_range.end)
> +		return 0;
> +
> +	/* Split happened */
> +	if (i == CRASH_MAX_RANGES - 1) {
> +		printk("Too many crash ranges after split\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* Location where new range should go */
> +	j = i + 1;
> +	if (j < mem->nr_ranges) {
> +		/* Move over all ranges one place */
> +		for (i = mem->nr_ranges - 1; i >= j; i--)
> +			mem->ranges[i + 1] = mem->ranges[i];
> +	}
> +
> +	mem->ranges[j].start = temp_range.start;
> +	mem->ranges[j].end = temp_range.end;
> +	mem->nr_ranges++;
> +	return 0;
> +}
> +
> +/*
> + * Look for any unwanted ranges between mstart, mend and remove them. This
> + * might lead to split and split ranges are put in ced->mem.ranges[] array
> + */
> +static int elf_header_exclude_ranges(struct crash_elf_data *ced,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	struct crash_mem *cmem = &ced->mem;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude crashkernel region */
> +	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
> +	if (ret)
> +		return ret;
> +
> +	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude GART region */
> +	if (ced->gart_end) {
> +		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long mstart, mend;
> +	struct kimage *image = ced->image;
> +	struct crash_mem *cmem;
> +	int ret, i;
> +
> +	ehdr = ced->ehdr;
> +
> +	/* Exclude unwanted mem ranges */
> +	ret = elf_header_exclude_ranges(ced, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
> +	cmem = &ced->mem;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		mstart = cmem->ranges[i].start;
> +		mend = cmem->ranges[i].end;
> +
> +		phdr = ced->bufp;
> +		ced->bufp += sizeof(Elf64_Phdr);
> +
> +		phdr->p_type = PT_LOAD;
> +		phdr->p_flags = PF_R|PF_W|PF_X;
> +		phdr->p_offset  = mstart;
> +
> +		/*
> +		 * If a range matches backup region, adjust offset to backup
> +		 * segment.
> +		 */
> +		if (mstart == image->arch.backup_src_start &&
> +		    (mend - mstart + 1) == image->arch.backup_src_sz)
> +			phdr->p_offset = image->arch.backup_load_addr;
> +
> +		phdr->p_paddr = mstart;
> +		phdr->p_vaddr = (unsigned long long) __va(mstart);
> +		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
> +		phdr->p_align = 0;
> +		ehdr->e_phnum++;
> +		pr_debug("Crash PT_LOAD elf header. phdr=%p"
> +			" vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d"
> +			" p_offset=0x%llx\n", phdr, phdr->p_vaddr,
> +			phdr->p_paddr, phdr->p_filesz, ehdr->e_phnum,
> +			phdr->p_offset);
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_headers(struct crash_elf_data *ced,
> +		unsigned long *addr, unsigned long *sz)
> +{
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long nr_cpus = NR_CPUS, nr_phdr, elf_sz;
> +	unsigned char *buf, *bufp;
> +	unsigned int cpu;
> +	unsigned long long notes_addr;
> +	int ret;
> +
> +	/* extra phdr for vmcoreinfo elf note */
> +	nr_phdr = nr_cpus + 1;
> +	nr_phdr += ced->max_nr_ranges;
> +
> +	/*
> +	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
> +	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
> +	 * I think this is required by tools like gdb. So same physical
> +	 * memory will be mapped in two elf headers. One will contain kernel
> +	 * text virtual addresses and other will have __va(physical) addresses.
> +	 */
> +
> +	nr_phdr++;
> +	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
> +	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
> +
> +	buf = vzalloc(elf_sz);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	bufp = buf;
> +	ehdr = (Elf64_Ehdr *)bufp;
> +	bufp += sizeof(Elf64_Ehdr);
> +	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
> +	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
> +	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
> +	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
> +	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
> +	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
> +	ehdr->e_type = ET_CORE;
> +	ehdr->e_machine = ELF_ARCH;
> +	ehdr->e_version = EV_CURRENT;
> +	ehdr->e_entry = 0;
> +	ehdr->e_phoff = sizeof(Elf64_Ehdr);
> +	ehdr->e_shoff = 0;
> +	ehdr->e_flags = 0;
> +	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
> +	ehdr->e_phentsize = sizeof(Elf64_Phdr);
> +	ehdr->e_phnum = 0;
> +	ehdr->e_shentsize = 0;
> +	ehdr->e_shnum = 0;
> +	ehdr->e_shstrndx = 0;
> +
> +	/* Prepare one phdr of type PT_NOTE for each present cpu */
> +	for_each_present_cpu(cpu) {
> +		phdr = (Elf64_Phdr *)bufp;
> +		bufp += sizeof(Elf64_Phdr);
> +		phdr->p_type = PT_NOTE;
> +		phdr->p_flags = 0;
> +		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
> +		phdr->p_offset = phdr->p_paddr = notes_addr;
> +		phdr->p_vaddr = 0;
> +		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
> +		phdr->p_align = 0;
> +		(ehdr->e_phnum)++;
> +	}
> +
> +	/* Prepare one PT_NOTE header for vmcoreinfo */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_NOTE;
> +	phdr->p_flags = 0;
> +	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> +	phdr->p_vaddr = 0;
> +	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +
> +#ifdef CONFIG_X86_64
> +	/* Prepare PT_LOAD type program header for kernel text region */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_LOAD;
> +	phdr->p_flags = PF_R|PF_W|PF_X;
> +	phdr->p_vaddr = (Elf64_Addr)_text;
> +	phdr->p_filesz = phdr->p_memsz = _end - _text;
> +	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +#endif
> +
> +	/* Prepare PT_LOAD headers for system ram chunks. */
> +	ced->ehdr = ehdr;
> +	ced->bufp = bufp;
> +	ret = walk_system_ram_res(0, -1, ced,
> +			prepare_elf64_ram_headers_callback);
> +	if (ret < 0)
> +		return ret;
> +
> +	*addr = (unsigned long)buf;
> +	*sz = elf_sz;
> +	return 0;
> +}
> +
> +/* Prepare elf headers. Return addr and size */
> +static int prepare_elf_headers(struct kimage *image, unsigned long *addr,
> +					unsigned long *sz)
> +{
> +	struct crash_elf_data *ced;
> +	int ret;
> +
> +	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
> +	if (!ced)
> +		return -ENOMEM;
> +
> +	ret = fill_up_ced(ced, image);
> +	if (ret)
> +		goto out;
> +
> +	/* By default prepare 64bit headers */
> +	ret =  prepare_elf64_headers(ced, addr, sz);
> +out:
> +	kfree(ced);
> +	return ret;
> +}
> +
> +static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	nr_e820_entries = params->e820_entries;
> +	if (nr_e820_entries >= E820MAX)
> +		return 1;
> +
> +	memcpy(&params->e820_map[nr_e820_entries], entry,
> +                        sizeof(struct e820entry));
> +	params->e820_entries++;
> +
> +	pr_debug("Add e820 entry to bootparams. addr=0x%llx size=0x%llx"
> +		" type=%d\n", entry->addr, entry->size, entry->type);
> +	return 0;
> +}
> +
> +static int memmap_entry_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_memmap_data *cmd = arg;
> +	struct boot_params *params = cmd->params;
> +	struct e820entry ei;
> +
> +	ei.addr = start;
> +	ei.size = end - start + 1;
> +	ei.type = cmd->type;
> +	add_e820_entry(params, &ei);
> +
> +	return 0;
> +}
> +
> +static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	unsigned long start, end;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude Backup region */
> +	start = image->arch.backup_load_addr;
> +	end = start + image->arch.backup_src_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude elf header region */
> +	start = image->arch.elf_load_addr;
> +	end = start + image->arch.elf_headers_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	return ret;
> +}
> +
> +/* Prepare memory map for crash dump kernel */
> +int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> +{
> +	int i, ret = 0;
> +	unsigned long flags;
> +	struct e820entry ei;
> +	struct crash_memmap_data cmd;
> +	struct crash_mem *cmem;
> +
> +	cmem = vzalloc(sizeof(struct crash_mem));
> +	if (!cmem)
> +		return -ENOMEM;
> +
> +	memset(&cmd, 0, sizeof(struct crash_memmap_data));
> +	cmd.params = params;
> +
> +	/* Add first 640K segment */
> +	ei.addr = image->arch.backup_src_start;
> +	ei.size = image->arch.backup_src_sz;
> +	ei.type = E820_RAM;
> +	add_e820_entry(params, &ei);
> +
> +	/* Add ACPI tables */
> +	cmd.type = E820_ACPI;
> +	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> +	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
> +
> +	/* Add ACPI Non-volatile Storage */
> +	cmd.type = E820_NVS;
> +	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
> +			memmap_entry_callback);
> +
> +	/* Add crashk_low_res region */
> +	if (crashk_low_res.end) {
> +		ei.addr = crashk_low_res.start;
> +		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
> +		ei.type = E820_RAM;
> +		add_e820_entry(params, &ei);
> +	}
> +
> +	/* Exclude some ranges from crashk_res and add rest to memmap */
> +	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
> +						crashk_res.end);
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		ei.addr = cmem->ranges[i].start;
> +		ei.size = cmem->ranges[i].end - ei.addr + 1;
> +		ei.type = E820_RAM;
> +
> +		/* If entry is less than a page, skip it */
> +		if (ei.size < PAGE_SIZE) {
> +			continue;
> +		}
> +		add_e820_entry(params, &ei);
> +	}
> +
> +out:
> +	vfree(cmem);
> +	return ret;
> +}
> +
> +static int determine_backup_region(u64 start, u64 end, void *arg)
> +{
> +	struct kimage *image = arg;
> +
> +	image->arch.backup_src_start = start;
> +	image->arch.backup_src_sz = end - start + 1;
> +
> +	/* Expecting only one range for backup region */
> +	return 1;
> +}
> +
> +int load_crashdump_segments(struct kimage *image)
> +{
> +	unsigned long src_start, src_sz;
> +	unsigned long elf_addr, elf_sz;
> +	int ret;
> +
> +	/*
> +	 * Determine and load a segment for backup area. First 640K RAM
> +	 * region is backup source
> +	 */
> +
> +	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> +				image, determine_backup_region);
> +
> +	/* Zero of postive return values are ok */
> +	if (ret < 0)
> +		return ret;
> +
> +	src_start = image->arch.backup_src_start;
> +	src_sz = image->arch.backup_src_sz;
> +
> +	/* Add backup segment. */
> +	if (src_sz) {
> +		ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
> +					PAGE_SIZE, 0, -1, 0,
> +					&image->arch.backup_load_addr);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/* Prepare elf headers and add a segment */
> +	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
> +	if (ret)
> +		return ret;
> +
> +	image->arch.elf_headers = elf_addr;
> +	image->arch.elf_headers_sz = elf_sz;
> +
> +	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
> +			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
> +			&image->arch.elf_load_addr);
> +	if (ret)
> +		kfree((void *)image->arch.elf_headers);
> +
> +	return ret;
> +}
> +
> +int crash_copy_backup_region(struct kimage *image)
> +{

Why does this function need to be called? The backup region has already
been added as a crash segment by kexec_add_buffer(), and the buffer copy
is then done in kimage_load_crash_segment(). I think this copy is handled
twice. Please correct me if I am wrong.



> +	unsigned long dest_start, src_start, src_sz;
> +
> +	dest_start = image->arch.backup_load_addr;
> +	src_start = image->arch.backup_src_start;
> +	src_sz = image->arch.backup_src_sz;
> +
> +	memcpy(__va(dest_start), __va(src_start), src_sz);
> +
> +	return 0;
> +}
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> index a1032d4..606942c 100644
> --- a/arch/x86/kernel/kexec-bzimage.c
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -8,6 +8,9 @@
>  
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
> +#include <asm/crash.h>
> +
> +#define MAX_ELFCOREHDR_STR_LEN	30 	/* elfcorehdr=0x<64bit-value> */
>  
>  #ifdef CONFIG_X86_64
>  
> @@ -86,7 +89,8 @@ static int setup_memory_map_entries(struct boot_params *params)
>  	return 0;
>  }
>  
> -static void setup_linux_system_parameters(struct boot_params *params)
> +static void setup_linux_system_parameters(struct kimage *image,
> +			struct boot_params *params)
>  {
>  	unsigned int nr_e820_entries;
>  	unsigned long long mem_k, start, end;
> @@ -113,7 +117,10 @@ static void setup_linux_system_parameters(struct boot_params *params)
>  	/* Default sysdesc table */
>  	params->sys_desc_table.length = 0;
>  
> -	setup_memory_map_entries(params);
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		crash_setup_memmap_entries(image, params);
> +	else
> +		setup_memory_map_entries(params);
>  	nr_e820_entries = params->e820_entries;
>  
>  	for(i = 0; i < nr_e820_entries; i++) {
> @@ -151,18 +158,23 @@ static void setup_initrd(struct boot_params *boot_params, unsigned long initrd_l
>  	boot_params->ext_ramdisk_size = initrd_len >> 32;
>  }
>  
> -static void setup_cmdline(struct boot_params *boot_params,
> +static void setup_cmdline(struct kimage *image, struct boot_params *boot_params,
>  		unsigned long bootparams_load_addr,
>  		unsigned long cmdline_offset, char *cmdline,
>  		unsigned long cmdline_len)
>  {
>  	char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
> -	unsigned long cmdline_ptr_phys;
> +	unsigned long cmdline_ptr_phys, len;
>  	uint32_t cmdline_low_32, cmdline_ext_32;
>  
>  	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		len = sprintf(cmdline_ptr + cmdline_len - 1,
> +			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
> +		cmdline_len += len;
> +	}
>  	cmdline_ptr[cmdline_len - 1] = '\0';
> -
> +	pr_debug("Final command line is:%s\n", cmdline_ptr);
>  	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
>  	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
>  	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> @@ -203,17 +215,34 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	/*
> +	 * In case of crash dump, we will append elfcorehdr=<addr> to
> +	 * command line. Make sure it does not overflow
> +	 */
> +	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> +		ret = -EINVAL;
> +		pr_debug("Kernel command line too long\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	/* Allocate loader specific data */
>  	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
>  	if (!ldata)
>  		return ERR_PTR(-ENOMEM);
>  
> +	/* Allocate and load backup region */
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		ret = load_crashdump_segments(image);
> +		if (ret)
> +			goto out_free_loader_data;
> +	}
> +
>  	/* Argument/parameter segment */
>  	kern16_size_needed = kern16_size;
>  	if (kern16_size_needed < 4096)
>  		kern16_size_needed = 4096;
>  
> -	setup_size = kern16_size_needed + cmdline_len;
> +	setup_size = kern16_size_needed + cmdline_len + MAX_ELFCOREHDR_STR_LEN;
>  	params = kzalloc(setup_size, GFP_KERNEL);
>  	if (!params) {
>  		ret = -ENOMEM;
> @@ -259,14 +288,14 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  		setup_initrd(params, initrd_load_addr, initrd_len);
>  	}
>  
> -	setup_cmdline(params, bootparam_load_addr, kern16_size_needed,
> +	setup_cmdline(image, params, bootparam_load_addr, kern16_size_needed,
>  			cmdline, cmdline_len);
>  
>  	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
>  	params->hdr.type_of_loader = 0x0D << 4;
>  	params->hdr.loadflags = 0;
>  
> -	setup_linux_system_parameters(params);
> +	setup_linux_system_parameters(image, params);
>  
>  	/*
>  	 * Allocate a purgatory page. For 64bit entry point, purgatory
> @@ -302,7 +331,7 @@ out_free_loader_data:
>  	return ERR_PTR(ret);
>  }
>  
> -int bzImage64_prep_entry(struct kimage *image)
> +static int prepare_purgatory(struct kimage *image)
>  {
>  	struct bzimage64_data *ldata;
>  	char *purgatory_page;
> @@ -362,6 +391,22 @@ int bzImage64_prep_entry(struct kimage *image)
>  	return 0;
>  }
>  
> +int bzImage64_prep_entry(struct kimage *image)
> +{
> +	if (!image->file_mode)
> +		return 0;
> +
> +	if (!image->image_loader_data)
> +		return -EINVAL;
> +
> +	prepare_purgatory(image);
> +
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		crash_copy_backup_region(image);
> +
> +	return 0;
> +}
> +
>  /* This cleanup function is called after various segments have been loaded */
>  int bzImage64_cleanup(struct kimage *image)
>  {
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index a66ce1d..9d7a42d 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -334,6 +334,7 @@ int arch_image_file_post_load_cleanup(struct kimage *image)
>  {
>  	int idx = image->file_handler_idx;
>  
> +	vfree((void *)image->arch.elf_headers);
>  	if (kexec_file_type[idx].cleanup)
>  		return kexec_file_type[idx].cleanup(image);
>  	return 0;
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 50bcaa8..64184a7 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -524,7 +524,6 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
>  	*rimage = image;
>  	return 0;
>  
> -
>  out_free_control_pages:
>  	kimage_free_page_list(&image->control_pages);
>  out_free_image:
> @@ -532,6 +531,54 @@ out_free_image:
>  	return result;
>  }
>  
> +static int kimage_file_crash_alloc(struct kimage **rimage, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)
> +{
> +	int result;
> +	struct kimage *image;
> +
> +	/* Allocate and initialize a controlling structure */
> +	image = do_kimage_alloc_init();
> +	if (!image)
> +		return -ENOMEM;
> +
> +	image->file_mode = 1;
> +	image->file_handler_idx = -1;
> +
> +	/* Enable the special crash kernel control page allocation policy. */
> +	image->control_page = crashk_res.start;
> +	image->type = KEXEC_TYPE_CRASH;
> +
> +	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
> +			cmdline_ptr, cmdline_len);
> +	if (result)
> +		goto out_free_image;
> +
> +	result = sanity_check_segment_list(image);
> +	if (result)
> +		goto out_free_post_load_bufs;
> +
> +	result = -ENOMEM;
> +	image->control_code_page = kimage_alloc_control_pages(image,
> +					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> +	if (!image->control_code_page) {
> +		printk(KERN_ERR "Could not allocate control_code_buffer\n");
> +		goto out_free_post_load_bufs;
> +	}
> +
> +	*rimage = image;
> +	return 0;
> +
> +out_free_post_load_bufs:
> +	kimage_file_post_load_cleanup(image);
> +	kfree(image->image_loader_data);
> +out_free_image:
> +	kfree(image);
> +	return result;
> +}
> +
> +
>  static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
>  				unsigned long nr_segments,
>  				struct kexec_segment __user *segments)
> @@ -1130,7 +1177,12 @@ static int kimage_load_crash_segment(struct kimage *image,
>  			/* Zero the trailing part of the page */
>  			memset(ptr + uchunk, 0, mchunk - uchunk);
>  		}
> -		result = copy_from_user(ptr, buf, uchunk);
> +
> +		/* For file based kexec, source pages are in kernel memory */
> +		if (image->file_mode)
> +			memcpy(ptr, buf, uchunk);
> +		else
> +			result = copy_from_user(ptr, buf, uchunk);
>  		kexec_flush_icache_page(page);
>  		kunmap(page);
>  		if (result) {
> @@ -1358,7 +1410,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __us
>  	if (flags & KEXEC_FILE_UNLOAD)
>  		goto exchange;
>  
> -	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
> +	if (flags & KEXEC_FILE_ON_CRASH)
> +		ret = kimage_file_crash_alloc(&image, kernel_fd, initrd_fd,
> +				cmdline_ptr, cmdline_len);
> +	else
> +		ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
>  				cmdline_ptr, cmdline_len);
>  	if (ret)
>  		goto out;
> @@ -2108,7 +2164,12 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
>  	kbuf->top_down = top_down;
>  
>  	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> -	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
> +				crashk_res.start, crashk_res.end, kbuf,
> +				walk_ram_range_callback);
> +	else
> +		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
>  
>  	kbuf->image = NULL;
>  	kfree(kbuf);
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-20 17:50 ` [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
  2013-11-21 19:07   ` Greg KH
@ 2013-11-28 11:35   ` Baoquan He
  2013-12-02 15:36     ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: Baoquan He @ 2013-11-28 11:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry
> yet.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kexec-bzimage.h |   12 +
>  arch/x86/include/asm/kexec.h         |   26 +++
>  arch/x86/kernel/Makefile             |    2 +
>  arch/x86/kernel/kexec-bzimage.c      |  375 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   |    4 +-
>  arch/x86/kernel/purgatory_entry_64.S |  119 +++++++++++
>  6 files changed, 537 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-bzimage.h
>  create mode 100644 arch/x86/kernel/kexec-bzimage.c
>  create mode 100644 arch/x86/kernel/purgatory_entry_64.S
> 
> diff --git a/arch/x86/include/asm/kexec-bzimage.h b/arch/x86/include/asm/kexec-bzimage.h
> new file mode 100644
> index 0000000..d556727
> --- /dev/null
> +++ b/arch/x86/include/asm/kexec-bzimage.h
> @@ -0,0 +1,12 @@
> +#ifndef _ASM_BZIMAGE_H
> +#define _ASM_BZIMAGE_H
> +
> +extern int bzImage64_probe(const char *buf, unsigned long len);
> +extern void *bzImage64_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len, char *initrd,
> +		unsigned long initrd_len, char *cmdline,
> +		unsigned long cmdline_len);
> +extern int bzImage64_prep_entry(struct kimage *image);
> +extern int bzImage64_cleanup(struct kimage *image);
> +
> +#endif  /* _ASM_BZIMAGE_H */
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 17483a4..94f1257 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -15,6 +15,9 @@
>  # define PAGES_NR		4
>  #endif
>  
> +#define KEXEC_PURGATORY_PAGE_SIZE	4096
> +#define KEXEC_PURGATORY_CODE_MAX_SIZE	2048
> +
>  # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
>  
>  #ifndef __ASSEMBLY__
> @@ -141,6 +144,9 @@ relocate_kernel(unsigned long indirection_page,
>  		unsigned long page_list,
>  		unsigned long start_address,
>  		unsigned int preserve_context);
> +void purgatory_entry64(void);
> +extern unsigned long purgatory_entry64_regs;
> +extern struct desc_struct entry64_gdt;
>  #endif
>  
>  #define ARCH_HAS_KIMAGE_ARCH
> @@ -161,6 +167,26 @@ struct kimage_arch {
>  	pmd_t *pmd;
>  	pte_t *pte;
>  };
> +
> +struct kexec_entry64_regs {
> +	uint64_t rax;
> +	uint64_t rbx;
> +	uint64_t rcx;
> +	uint64_t rdx;
> +	uint64_t rsi;
> +	uint64_t rdi;
> +	uint64_t rsp;
> +	uint64_t rbp;
> +	uint64_t r8;
> +	uint64_t r9;
> +	uint64_t r10;
> +	uint64_t r11;
> +	uint64_t r12;
> +	uint64_t r13;
> +	uint64_t r14;
> +	uint64_t r15;
> +	uint64_t rip;
> +};
>  #endif
>  
>  typedef void crash_vmclear_fn(void);
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 9b0a34e..5d074c2 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -68,6 +68,7 @@ obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
>  obj-$(CONFIG_X86_TSC)		+= trace_clock.o
>  obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
>  obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o crash.o
> +obj-$(CONFIG_KEXEC)		+= kexec-bzimage.o
>  obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o
>  obj-y				+= kprobes/
>  obj-$(CONFIG_MODULES)		+= module.o
> @@ -122,4 +123,5 @@ ifeq ($(CONFIG_X86_64),y)
>  
>  	obj-$(CONFIG_PCI_MMCONFIG)	+= mmconf-fam10h_64.o
>  	obj-y				+= vsmp_64.o
> +	obj-$(CONFIG_KEXEC)		+= purgatory_entry_64.o
>  endif
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> new file mode 100644
> index 0000000..a1032d4
> --- /dev/null
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -0,0 +1,375 @@
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/kexec.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +#ifdef CONFIG_X86_64
> +
> +struct bzimage64_data {
> +	unsigned long kernel_load_addr;
> +	unsigned long bootparams_load_addr;
> +
> +	/*
> +	 * Temporary buffer to hold bootparams buffer. This should be
> +	 * freed once the bootparam segment has been loaded.
> +	 */
> +	void *bootparams_buf;
> +	struct page *purgatory_page;
> +};
> +
> +int bzImage64_probe(const char *buf, unsigned long len)
> +{
> +	int ret = -ENOEXEC;
> +	struct setup_header *header;
> +
> +	if (len < 2 * 512) {
> +		pr_debug("File is too short to be a bzImage\n");
> +		return ret;
> +	}
> +
> +	header = (struct setup_header *)(buf + 0x1F1);
> +	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
> +		pr_debug("Not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if (header->boot_flag != 0xAA55) {
> +                /* No x86 boot sector present */
> +		pr_debug("No x86 boot sector present\n");
> +		return ret;
> +	}
> +
> +	if (header->version < 0x020C) {
> +                /* Must be at least protocol version 2.12 */
> +		pr_debug("Must be at least protocol version 2.12\n");
> +		return ret;
> +	}
> +
> +	if ((header->loadflags & 1) == 0) {
> +		/* Not a bzImage */
> +		pr_debug("zImage not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if ((header->xloadflags & 3) != 3) {
> +		/* XLF_KERNEL_64 and XLF_CAN_BE_LOADED_ABOVE_4G should be set */
> +		pr_debug("Not a relocatable bzImage64\n");
> +		return ret;
> +	}
> +
> +        /* I've got a bzImage */
> +	pr_debug("It's a relocatable bzImage64\n");
> +	ret = 0;
> +
> +	return ret;
> +}
> +
> +static int setup_memory_map_entries(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	/* TODO: What about EFI */
> +	nr_e820_entries = e820_saved.nr_map;
> +	if (nr_e820_entries > E820MAX)
> +		nr_e820_entries = E820MAX;
> +
> +	params->e820_entries = nr_e820_entries;
> +	memcpy(&params->e820_map, &e820_saved.map,
> +			nr_e820_entries * sizeof(struct e820entry));
> +
> +	return 0;
> +}
> +
> +static void setup_linux_system_parameters(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +	unsigned long long mem_k, start, end;
> +	int i;
> +
> +	/* Get subarch from existing bootparams */
> +	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> +
> +	/* Copying screen_info will do? */
> +	memcpy(&params->screen_info, &boot_params.screen_info,
> +				sizeof(struct screen_info));
> +
> +	/* Fill in memsize later */
> +	params->screen_info.ext_mem_k = 0;
> +	params->alt_mem_k = 0;
> +
> +	/* Default APM info */
> +	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
> +
> +	/* Default drive info */
> +	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
> +	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
> +
> +	/* Default sysdesc table */
> +	params->sys_desc_table.length = 0;
> +
> +	setup_memory_map_entries(params);
> +	nr_e820_entries = params->e820_entries;
> +
> +	for(i = 0; i < nr_e820_entries; i++) {
> +		if (params->e820_map[i].type != E820_RAM)
> +			continue;
> +		start = params->e820_map[i].addr;
> +		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
> +
> +		if ((start <= 0x100000) && end > 0x100000) {
> +			mem_k = (end >> 10) - (0x100000 >> 10);
> +			params->screen_info.ext_mem_k = mem_k;
> +			params->alt_mem_k = mem_k;
> +			if (mem_k > 0xfc00)
> +				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
> +			if (mem_k > 0xffffffff)
> +				params->alt_mem_k = 0xffffffff;
> +		}
> +	}
> +
> +	/* Setup EDD info */
> +	memcpy(params->eddbuf, boot_params.eddbuf,
> +				EDDMAXNR * sizeof(struct edd_info));
> +	params->eddbuf_entries = boot_params.eddbuf_entries;
> +
> +	memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
> +			EDD_MBR_SIG_MAX * sizeof(unsigned int));
> +}
> +
> +static void setup_initrd(struct boot_params *boot_params, unsigned long initrd_load_addr, unsigned long initrd_len)
> +{
> +	boot_params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> +	boot_params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
> +
> +	boot_params->ext_ramdisk_image = initrd_load_addr >> 32;
> +	boot_params->ext_ramdisk_size = initrd_len >> 32;
> +}
> +
> +static void setup_cmdline(struct boot_params *boot_params,
> +		unsigned long bootparams_load_addr,
> +		unsigned long cmdline_offset, char *cmdline,
> +		unsigned long cmdline_len)
> +{
> +	char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
> +	unsigned long cmdline_ptr_phys;
> +	uint32_t cmdline_low_32, cmdline_ext_32;
> +
> +	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	cmdline_ptr[cmdline_len - 1] = '\0';
> +
> +	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
> +	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
> +	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> +
> +	boot_params->hdr.cmd_line_ptr = cmdline_low_32;
> +	if (cmdline_ext_32)
> +		boot_params->ext_cmd_line_ptr = cmdline_ext_32;
> +}
> +
> +void *bzImage64_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len,
> +		char *initrd, unsigned long initrd_len,
> +		char *cmdline, unsigned long cmdline_len)
> +{
> +
> +	struct setup_header *header;
> +	int setup_sects, kern16_size_needed, kern16_size, ret = 0;
> +	unsigned long setup_size, setup_header_size;
> +	struct boot_params *params;
> +	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> +	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> +	char *kernel_buf;
> +	struct bzimage64_data *ldata;
> +
> +	header = (struct setup_header *)(kernel + 0x1F1);
> +	setup_sects = header->setup_sects;
> +	if (setup_sects == 0)
> +		setup_sects = 4;
> +
> +	kern16_size = (setup_sects + 1) * 512;
> +	if (kernel_len < kern16_size) {
> +		pr_debug("bzImage truncated\n");
> +		return ERR_PTR(-ENOEXEC);
> +	}
> +
> +	if (cmdline_len > header->cmdline_size) {
> +		pr_debug("Kernel command line too long\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* Allocate loader specific data */
> +	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> +	if (!ldata)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Argument/parameter segment */
> +	kern16_size_needed = kern16_size;
> +	if (kern16_size_needed < 4096)
> +		kern16_size_needed = 4096;
> +
> +	setup_size = kern16_size_needed + cmdline_len;
> +	params = kzalloc(setup_size, GFP_KERNEL);
> +	if (!params) {
> +		ret = -ENOMEM;
> +		goto out_free_loader_data;
> +	}
> +
> +	/* Copy setup header onto bootparams. */
> +	setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;
> +
> +	/* Is there a limit on setup header size? */
> +	memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
> +	ret = kexec_add_buffer(image, (char *)params, setup_size,
> +			setup_size, 16, 0x3000, -1, 1, &bootparam_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +	pr_debug("Loaded boot_param and command line at 0x%lx\n",
> +			bootparam_load_addr);
> +
> +	/* Load kernel */
> +	kernel_buf = kernel + kern16_size;
> +	kernel_bufsz =  kernel_len - kern16_size;
> +	kernel_memsz = ALIGN(header->init_size, 4096);
> +	kernel_align = header->kernel_alignment;
> +
> +	ret = kexec_add_buffer(image, kernel_buf,
> +			kernel_bufsz, kernel_memsz, kernel_align, 0x100000,
> +			-1, 1, &kernel_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +
> +	pr_debug("Loaded 64bit kernel at 0x%lx sz = 0x%lx\n", kernel_load_addr,
> +				kernel_memsz);
> +
> +	/* Load initrd high */
> +	if (initrd) {
> +		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
> +			4096, 0x10000000, ULONG_MAX, 1, &initrd_load_addr);

Here the minimum starting address for placing the initrd is 256M. Even
though the buffer is allocated top-down, I am still wondering why this
lower bound is there.


> +		if (ret)
> +			goto out_free_params;
> +
> +		pr_debug("Loaded initrd at 0x%lx sz = 0x%lx\n",
> +					initrd_load_addr, initrd_len);
> +		setup_initrd(params, initrd_load_addr, initrd_len);
> +	}
> +
> +	setup_cmdline(params, bootparam_load_addr, kern16_size_needed,
> +			cmdline, cmdline_len);
> +
> +	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
> +	params->hdr.type_of_loader = 0x0D << 4;
> +	params->hdr.loadflags = 0;
> +
> +	setup_linux_system_parameters(params);
> +
> +	/*
> +	 * Allocate a purgatory page. For 64bit entry point, purgatory
> +	 * code can be anywhere.
> +	 *
> +	 * Control page allocation logic goes through segment list to
> +	 * make sure allocated page is not destination page. So allocate
> +	 * control page after all required segment have been prepared.
> +	 */
> +	ldata->purgatory_page = kimage_alloc_control_pages(image,
> +					get_order(KEXEC_PURGATORY_PAGE_SIZE));
> +
> +	if (!ldata->purgatory_page) {
> +		printk(KERN_ERR "Could not allocate purgatory page\n");
> +		ret = -ENOMEM;
> +		goto out_free_params;
> +	}
> +
> +	/*
> +	 * Store pointer to params so that it could be freed after loading
> +	 * params segment has been loaded and contents have been copied
> +	 * somewhere else.
> +	 */
> +	ldata->bootparams_buf = params;
> +	ldata->kernel_load_addr = kernel_load_addr;
> +	ldata->bootparams_load_addr = bootparam_load_addr;
> +	return ldata;
> +
> +out_free_params:
> +	kfree(params);
> +out_free_loader_data:
> +	kfree(ldata);
> +	return ERR_PTR(ret);
> +}
> +
> +int bzImage64_prep_entry(struct kimage *image)
> +{
> +	struct bzimage64_data *ldata;
> +	char *purgatory_page;
> +	unsigned long regs_offset, gdt_offset, purgatory_page_phys;
> +	struct kexec_entry64_regs *regs;
> +	char *gdt_ptr;
> +	unsigned long long *gdt_addr;
> +
> +	if (!image->file_mode)
> +		return 0;
> +
> +	ldata = image->image_loader_data;
> +	if (!ldata)
> +		return -EINVAL;
> +
> +	/* Copy purgatory code to its control page */
> +	purgatory_page = page_address(ldata->purgatory_page);
> +
> +	/* Physical address of purgatory page */
> +	purgatory_page_phys = PFN_PHYS(page_to_pfn(ldata->purgatory_page));
> +
> +	memcpy(purgatory_page, purgatory_entry64,
> +			KEXEC_PURGATORY_CODE_MAX_SIZE);
> +
> +	/* Set registers appropriately */
> +	regs_offset =  (unsigned long)&purgatory_entry64_regs -
> +			(unsigned long)purgatory_entry64;
> +	regs = (struct kexec_entry64_regs *) (purgatory_page + regs_offset);
> +
> +	regs->rbx = 0; /* Bootstrap Processor */
> +	regs->rsi = ldata->bootparams_load_addr;
> +	regs->rip = ldata->kernel_load_addr + 0x200;
> +
> +	/* Fix up gdt */
> +	gdt_offset = (unsigned long)&entry64_gdt -
> +			(unsigned long)purgatory_entry64;
> +
> +	gdt_ptr = purgatory_page + gdt_offset;
> +
> +	/* Skip a word which contains size of gdt table */
> +	gdt_addr = (unsigned long long *)(gdt_ptr + 2);
> +
> +	*gdt_addr = (unsigned long long)gdt_ptr;
> +
> +	/*
> +	 * Update the relocated address of gdt. By the time we load gdt
> +	 * in purgatory, we are running using identity mapped tables.
> +	 * Load identity mapped address here.
> +	 */
> +	*gdt_addr = (unsigned long long)(purgatory_page_phys + gdt_offset);
> +
> +	/*
> +	 * Jump to purgatory after control page. By the time we jump to
> +	 * purgatory, we are using identity mapped page tables
> +	 */
> +	kimage_set_start_addr(image, purgatory_page_phys);
> +	return 0;
> +}
> +
> +/* This cleanup function is called after various segments have been loaded */
> +int bzImage64_cleanup(struct kimage *image)
> +{
> +	struct bzimage64_data *ldata = image->image_loader_data;
> +
> +	kfree(ldata->bootparams_buf);
> +	ldata->bootparams_buf = NULL;
> +	return 0;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index fb41b73..a66ce1d 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -21,10 +21,12 @@
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
>  #include <asm/debugreg.h>
> +#include <asm/kexec-bzimage.h>
>  
>  /* arch dependent functionality related to kexec file based syscall */
>  static struct kexec_file_type kexec_file_type[]={
> -	{"", NULL, NULL, NULL, NULL},
> +	{"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_prep_entry,
> +	 bzImage64_cleanup},
>  };
>  
>  static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
> diff --git a/arch/x86/kernel/purgatory_entry_64.S b/arch/x86/kernel/purgatory_entry_64.S
> new file mode 100644
> index 0000000..12a235f
> --- /dev/null
> +++ b/arch/x86/kernel/purgatory_entry_64.S
> @@ -0,0 +1,119 @@
> +/*
> + * Copyright (C) 2013  Red Hat Inc.
> + *
> + * Author(s): Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation (version 2 of the License).
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +
> +
> +/*
> + * One page for purgatory. Code occupies first KEXEC_PURGATORY_CODE_MAX_SIZE
> + * bytes. Rest is for data/stack etc.
> + */
> +#include <asm/page.h>
> +
> +	.text
> +	.align PAGE_SIZE
> +	.code64
> +	.globl purgatory_entry64, purgatory_entry64_regs, entry64_gdt
> +
> +
> +purgatory_entry64:
> +	/* Setup a gdt that should be preserved */
> +	lgdt entry64_gdt(%rip)
> +
> +	/* load the data segments */
> +	movl    $0x18, %eax     /* data segment */
> +	movl    %eax, %ds
> +	movl    %eax, %es
> +	movl    %eax, %ss
> +	movl    %eax, %fs
> +	movl    %eax, %gs
> +
> +	/* Setup new stack */
> +	leaq    stack_init(%rip), %rsp
> +	pushq   $0x10 /* CS */
> +	leaq    new_cs_exit(%rip), %rax
> +	pushq   %rax
> +	lretq
> +new_cs_exit:
> +
> +	/*
> +	 * Load the registers except rsp. rsp is already loaded with stack
> +	 * at the end of this page
> +	 */
> +	movq	rax(%rip), %rax
> +	movq	rbx(%rip), %rbx
> +	movq	rcx(%rip), %rcx
> +	movq	rdx(%rip), %rdx
> +	movq	rsi(%rip), %rsi
> +	movq	rdi(%rip), %rdi
> +	movq	rbp(%rip), %rbp
> +	movq	r8(%rip), %r8
> +	movq	r9(%rip), %r9
> +	movq	r10(%rip), %r10
> +	movq	r11(%rip), %r11
> +	movq	r12(%rip), %r12
> +	movq	r13(%rip), %r13
> +	movq	r14(%rip), %r14
> +	movq	r15(%rip), %r15
> +
> +	/* Jump to the new code... */
> +	jmpq	*rip(%rip)
> +
> +	.balign 16
> +purgatory_entry64_regs:
> +rax:	.quad 0x00000000
> +rbx:	.quad 0x00000000
> +rcx:	.quad 0x00000000
> +rdx:	.quad 0x00000000
> +rsi:	.quad 0x00000000
> +rdi:	.quad 0x00000000
> +rsp:	.quad 0x00000000
> +rbp:	.quad 0x00000000
> +r8:	.quad 0x00000000
> +r9:	.quad 0x00000000
> +r10:	.quad 0x00000000
> +r11:	.quad 0x00000000
> +r12:	.quad 0x00000000
> +r13:	.quad 0x00000000
> +r14:	.quad 0x00000000
> +r15:	.quad 0x00000000
> +rip:	.quad 0x00000000
> +
> +	/* GDT */
> +	.balign 16
> +entry64_gdt:
> +	/* 0x00 unusable segment
> +	 * 0x08 unused
> +	 * so use them as gdt ptr
> +	 */
> +	.word gdt_end - entry64_gdt - 1
> +	.quad entry64_gdt
> +	.word 0, 0, 0
> +
> +	/* 0x10 4GB flat code segment */
> +	.word 0xFFFF, 0x0000, 0x9A00, 0x00AF
> +
> +	/* 0x18 4GB flat data segment */
> +	.word 0xFFFF, 0x0000, 0x9200, 0x00CF
> +gdt_end:
> +
> +	.globl kexec_purgatory_code_size
> +.set kexec_purgatory_code_size, . - purgatory_entry64
> +
> +/* Fill rest of the page with zeros to be used as stack */
> +stack: .fill purgatory_entry64 + PAGE_SIZE - ., 1, 0
> +stack_init:
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
  2013-11-21 19:03   ` Greg KH
  2013-11-22 20:42   ` Jiri Kosina
@ 2013-11-29  3:10   ` Baoquan He
  2013-12-02 15:27     ` WANG Chao
  2013-12-02 15:44     ` Vivek Goyal
  2013-12-04  1:56   ` Baoquan He
  3 siblings, 2 replies; 90+ messages in thread
From: Baoquan He @ 2013-11-29  3:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4eabc16..fb41b73 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -22,6 +22,13 @@
>  #include <asm/mmu_context.h>
>  #include <asm/debugreg.h>
>  
> +/* arch dependent functionality related to kexec file based syscall */
> +static struct kexec_file_type kexec_file_type[]={
> +	{"", NULL, NULL, NULL, NULL},
> +};
> +
> +
> +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> +			unsigned long kernel_len, char *initrd,
> +			unsigned long initrd_len, char *cmdline,
> +			unsigned long cmdline_len)
> +{
> +	int idx = image->file_handler_idx;
> +
> +	if (idx < 0)
> +		return ERR_PTR(-ENOEXEC);
> +
> +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> +					initrd_len, cmdline, cmdline_len);
> +}
> +
> +int arch_image_file_post_load_cleanup(struct kimage *image)
> +{

Hi Vivek,

This function is defined as part of the arch-specific function set, so why
don't we name it with the same unified prefix as the others?

Also, the name of the default dummy function in kernel/kexec.c is not
consistent with the arch-specific one, so the x86
arch_image_file_post_load_cleanup() is currently never called. Please
consider which one needs to be changed (a small stand-alone sketch of the
weak-symbol mechanics follows the quoted hunk below).

> +
> +void __attribute__ ((weak))
> +arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	return;
> +}
> +
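To make the naming problem concrete, here is a minimal, stand-alone
user-space sketch (not kernel code; simplified void/void signatures) of how
a __attribute__((weak)) default is only overridden when the strong
definition uses exactly the same symbol name. With the mismatched spelling
in the patch, the weak stub keeps being called and the x86 cleanup never
runs:

#include <stdio.h>

/* Weak default, analogous to the stub in kernel/kexec.c. */
void __attribute__((weak)) arch_kimage_file_post_load_cleanup(void)
{
	printf("weak default called\n");
}

/*
 * The x86 implementation in the patch is spelled without the 'k'
 * (arch_image_file_post_load_cleanup), so it defines a brand-new symbol
 * instead of overriding the weak one above. Nothing reaches it through
 * the arch_kimage_... name.
 */
void arch_image_file_post_load_cleanup(void)
{
	printf("x86 cleanup called\n");
}

int main(void)
{
	/*
	 * Prints "weak default called". Rename the second function to
	 * arch_kimage_file_post_load_cleanup (in its own object file) and
	 * the strong definition would win instead.
	 */
	arch_kimage_file_post_load_cleanup();
	return 0;
}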



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-29  3:10   ` Baoquan He
@ 2013-12-02 15:27     ` WANG Chao
  2013-12-02 15:44     ` Vivek Goyal
  1 sibling, 0 replies; 90+ messages in thread
From: WANG Chao @ 2013-12-02 15:27 UTC (permalink / raw)
  To: Baoquan He; +Cc: Vivek Goyal, mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On 11/29/13 at 11:10am, Baoquan He wrote:
> On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 4eabc16..fb41b73 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -22,6 +22,13 @@
> >  #include <asm/mmu_context.h>
> >  #include <asm/debugreg.h>
> >  
> > +/* arch dependent functionality related to kexec file based syscall */
> > +static struct kexec_file_type kexec_file_type[]={
> > +	{"", NULL, NULL, NULL, NULL},
> > +};
> > +
> > +
> > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > +			unsigned long kernel_len, char *initrd,
> > +			unsigned long initrd_len, char *cmdline,
> > +			unsigned long cmdline_len)
> > +{
> > +	int idx = image->file_handler_idx;
> > +
> > +	if (idx < 0)
> > +		return ERR_PTR(-ENOEXEC);
> > +
> > +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > +					initrd_len, cmdline, cmdline_len);
> > +}
> > +
> > +int arch_image_file_post_load_cleanup(struct kimage *image)
> > +{
> 
> Hi Vivek,
> 
> This function is defined as one of the arch specific function set, why don't
> we name it with a unified prefix as the others.
> 
> And the name of the default dummy function in kernel/kexec.c is not consistent
> with the arch specific one, so currently
> arch_image_file_post_load_cleanup of the x86 arch is not called. Please
> consider which one needs to be changed.

I think arch_kimage_file_post_load_cleanup() should be used here.

And good catch!

Thanks
WANG Chao

> 
> > +
> > +void __attribute__ ((weak))
> > +arch_kimage_file_post_load_cleanup(struct kimage *image)
> > +{
> > +	return;
> > +}
> > +
> 
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-11-28 11:28   ` Baoquan He
@ 2013-12-02 15:30     ` Vivek Goyal
  2013-12-04  1:51       ` Baoquan He
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-12-02 15:30 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Thu, Nov 28, 2013 at 07:28:16PM +0800, Baoquan He wrote:

[..]
> > +int crash_copy_backup_region(struct kimage *image)
> > +{
> 
> Why does this function need to be called? The backup region has already been
> added as a crash segment by kexec_add_buffer, and the buffer copy is then
> done in kimage_load_crash_segment. I think this copy is handled twice. Please
> correct me if I am wrong.
> 

Hi Bao,

kexec_add_buffer() will copy the backup region, but that copy happens when
the crash kernel is loaded. We want a snapshot of the backup region at the
time of the crash, and that is why there is this second call to copy the
backup region; it is executed after the crash.
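As a rough illustration (not the actual patch code), the late copy can be
as simple as a memcpy from the live low 640K into the area that was
reserved for it at load time, using the fields the patch records in
image->arch:

/* Sketch only: snapshot the backup source region at crash time. */
static int crash_copy_backup_region_sketch(struct kimage *image)
{
	unsigned long dest = image->arch.backup_load_addr; /* reserved at load */
	unsigned long src  = image->arch.backup_src_start; /* typically 0 */
	unsigned long size = image->arch.backup_src_sz;    /* up to 640K */

	if (!size)
		return 0;

	/* Both ranges are RAM covered by the kernel direct mapping. */
	memcpy(__va(dest), __va(src), size);
	return 0;
}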

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry
  2013-11-28 11:35   ` Baoquan He
@ 2013-12-02 15:36     ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-12-02 15:36 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Thu, Nov 28, 2013 at 07:35:14PM +0800, Baoquan He wrote:

[..]
> > +void *bzImage64_load(struct kimage *image, char *kernel,
> > +		unsigned long kernel_len,
> > +		char *initrd, unsigned long initrd_len,
> > +		char *cmdline, unsigned long cmdline_len)
> > +{
> > +
> > +	struct setup_header *header;
> > +	int setup_sects, kern16_size_needed, kern16_size, ret = 0;
> > +	unsigned long setup_size, setup_header_size;
> > +	struct boot_params *params;
> > +	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> > +	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> > +	char *kernel_buf;
> > +	struct bzimage64_data *ldata;
> > +
> > +	header = (struct setup_header *)(kernel + 0x1F1);
> > +	setup_sects = header->setup_sects;
> > +	if (setup_sects == 0)
> > +		setup_sects = 4;
> > +
> > +	kern16_size = (setup_sects + 1) * 512;
> > +	if (kernel_len < kern16_size) {
> > +		pr_debug("bzImage truncated\n");
> > +		return ERR_PTR(-ENOEXEC);
> > +	}
> > +
> > +	if (cmdline_len > header->cmdline_size) {
> > +		pr_debug("Kernel command line too long\n");
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	/* Allocate loader specific data */
> > +	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> > +	if (!ldata)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* Argument/parameter segment */
> > +	kern16_size_needed = kern16_size;
> > +	if (kern16_size_needed < 4096)
> > +		kern16_size_needed = 4096;
> > +
> > +	setup_size = kern16_size_needed + cmdline_len;
> > +	params = kzalloc(setup_size, GFP_KERNEL);
> > +	if (!params) {
> > +		ret = -ENOMEM;
> > +		goto out_free_loader_data;
> > +	}
> > +
> > +	/* Copy setup header onto bootparams. */
> > +	setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;
> > +
> > +	/* Is there a limit on setup header size? */
> > +	memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
> > +	ret = kexec_add_buffer(image, (char *)params, setup_size,
> > +			setup_size, 16, 0x3000, -1, 1, &bootparam_load_addr);
> > +	if (ret)
> > +		goto out_free_params;
> > +	pr_debug("Loaded boot_param and command line at 0x%lx\n",
> > +			bootparam_load_addr);
> > +
> > +	/* Load kernel */
> > +	kernel_buf = kernel + kern16_size;
> > +	kernel_bufsz =  kernel_len - kern16_size;
> > +	kernel_memsz = ALIGN(header->init_size, 4096);
> > +	kernel_align = header->kernel_alignment;
> > +
> > +	ret = kexec_add_buffer(image, kernel_buf,
> > +			kernel_bufsz, kernel_memsz, kernel_align, 0x100000,
> > +			-1, 1, &kernel_load_addr);
> > +	if (ret)
> > +		goto out_free_params;
> > +
> > +	pr_debug("Loaded 64bit kernel at 0x%lx sz = 0x%lx\n", kernel_load_addr,
> > +				kernel_memsz);
> > +
> > +	/* Load initrd high */
> > +	if (initrd) {
> > +		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
> > +			4096, 0x10000000, ULONG_MAX, 1, &initrd_load_addr);
> 
> Here the minimum starting address for placing the initrd is 256M. Even
> though the buffer is allocated top-down, I am still wondering why this
> lower bound is there.

Hi Bao,

Good catch. It is a vestige of some hard coding I had done initially to
test my patches. I forgot to remove the hardcoding here. Will fix it in the
next version.
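Presumably the fix is just to drop the hard-coded 256M lower bound,
something along these lines (illustrative only, not the posted code):

/* Sketch: no artificial minimum address; still placed top-down. */
ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
		       4096, 0, ULONG_MAX, 1, &initrd_load_addr);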

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-29  3:10   ` Baoquan He
  2013-12-02 15:27     ` WANG Chao
@ 2013-12-02 15:44     ` Vivek Goyal
  2013-12-04  1:35       ` Baoquan He
  1 sibling, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-12-02 15:44 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Fri, Nov 29, 2013 at 11:10:48AM +0800, Baoquan He wrote:

[..]
> > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > +			unsigned long kernel_len, char *initrd,
> > +			unsigned long initrd_len, char *cmdline,
> > +			unsigned long cmdline_len)
> > +{
> > +	int idx = image->file_handler_idx;
> > +
> > +	if (idx < 0)
> > +		return ERR_PTR(-ENOEXEC);
> > +
> > +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > +					initrd_len, cmdline, cmdline_len);
> > +}
> > +
> > +int arch_image_file_post_load_cleanup(struct kimage *image)
> > +{
> 
> Hi Vivek,
> 
> This function is defined as one of the arch specific function set, why don't
> we name it with a unified prefix as the others.

I am using "arch_" prefix. What else to use?

> 
> And name of the default dummy function in kernel/kexec.c is not consistent
> with the arch specific one, so currently
> arch_image_file_post_load_cleanup of the x86 arch is not called. Please
> consider which one needs to be changed.

Good catch Bao. I should change arch_image_file_post_load_cleanup() to
arch_kimage_file_post_load_cleanup(), otherwise it never gets called and
memory leaks.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
                   ` (9 preceding siblings ...)
  2013-11-22  0:55 ` HATAYAMA Daisuke
@ 2013-12-03 13:23 ` Baoquan He
  10 siblings, 0 replies; 90+ messages in thread
From: Baoquan He @ 2013-12-03 13:23 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

Tested kdump and kexec using --use-kexec2-syscall on kernel 3.13.0-rc2+;
they work very well.


On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> Current proposed secureboot implementation disables kexec/kdump because
> it can allow unsigned kernel to run on a secureboot platform. Intial
> idea was to sign /sbin/kexec binary and let that binary do the kernel
> signature verification. I had posted RFC patches for this apparoach
> here.
> 
> https://lkml.org/lkml/2013/9/10/560
> 
> Later we had discussion at Plumbers and most of the people thought
> that signing and trusting /sbin/kexec is becoming complex. So a 
> better idea might be let kernel do the signature verification of
> new kernel being loaded. This calls for implementing a new system call
> and moving lot of user space code in kernel.
> 
> kexec_load() system call allows loading a kexec/kdump kernel and jump
> to that kernel at right time. Though a lot of processing is done in
> user space which prepares a list of segments/buffers to be loaded and
> kexec_load() works on that list of segments. It does not know what's
> contained in those segments.
> 
> Now a new system call kexec_file_load() is implemented which takes
> kernel fd and initrd fd as parameters. Now kernel should be able
> to verify signature of newly loaded kernel. 
> 
> This is an early RFC patchset. I have not done signature handling
> part yet. This is more of a minimal patch to show how new system
> call and functionality will look like. Right now it can only handle
> bzImage with 64bit entry point on x86_64. No EFI, no x86_32  or any
> other architecture. Rest of the things can be added slowly as need
> arises. In first iteration, I have tried to address most common use case
> for us.
> 
> Any feedback is welcome.
> 
> Vivek Goyal (6):
>   kexec: Export vmcoreinfo note size properly
>   kexec: Move segment verification code in a separate function
>   resource: Provide new functions to walk through resources
>   kexec: A new system call, kexec_file_load, for in kernel kexec
>   kexec-bzImage: Support for loading bzImage using 64bit entry
>   kexec: Support for Kexec on panic using new system call
> 
>  arch/x86/include/asm/crash.h         |    9 +
>  arch/x86/include/asm/kexec-bzimage.h |   12 +
>  arch/x86/include/asm/kexec.h         |   43 ++
>  arch/x86/kernel/Makefile             |    2 +
>  arch/x86/kernel/crash.c              |  585 +++++++++++++++++++++++++++
>  arch/x86/kernel/kexec-bzimage.c      |  420 +++++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   |   60 +++-
>  arch/x86/kernel/purgatory_entry_64.S |  119 ++++++
>  arch/x86/syscalls/syscall_64.tbl     |    1 +
>  include/linux/ioport.h               |    6 +
>  include/linux/kexec.h                |   57 +++
>  include/linux/syscalls.h             |    3 +
>  include/uapi/linux/kexec.h           |    4 +
>  kernel/kexec.c                       |  731 ++++++++++++++++++++++++++++++----
>  kernel/ksysfs.c                      |    2 +-
>  kernel/resource.c                    |  108 +++++-
>  kernel/sys_ni.c                      |    1 +
>  17 files changed, 2074 insertions(+), 89 deletions(-)
>  create mode 100644 arch/x86/include/asm/crash.h
>  create mode 100644 arch/x86/include/asm/kexec-bzimage.h
>  create mode 100644 arch/x86/kernel/kexec-bzimage.c
>  create mode 100644 arch/x86/kernel/purgatory_entry_64.S
> 
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-02 15:44     ` Vivek Goyal
@ 2013-12-04  1:35       ` Baoquan He
  2013-12-04 17:19         ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2013-12-04  1:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On 12/02/13 at 10:44am, Vivek Goyal wrote:
> On Fri, Nov 29, 2013 at 11:10:48AM +0800, Baoquan He wrote:
> 
> [..]
> > > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > > +			unsigned long kernel_len, char *initrd,
> > > +			unsigned long initrd_len, char *cmdline,
> > > +			unsigned long cmdline_len)
> > > +{
> > > +	int idx = image->file_handler_idx;
> > > +
> > > +	if (idx < 0)
> > > +		return ERR_PTR(-ENOEXEC);
> > > +
> > > +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > > +					initrd_len, cmdline, cmdline_len);
> > > +}
> > > +
> > > +int arch_image_file_post_load_cleanup(struct kimage *image)
> > > +{
> > 
> > Hi Vivek,
> > 
> > This function is defined as one of the arch specific function set, why don't
> > we name it with a unified prefix as the others.
> 
> I am using "arch_" prefix. What else to use?

I mean that in this function series the other functions have names like
arch_kexec_kernel_image_xxx, so why is this one the odd one out, named
arch_kimage_xxx? And what does the 'k' in "kimage" mean here, kexec
image or kernel image? I am confused.

> 
> > 
> > And name of the default dummy function in kernel/kexec.c is not consistent
> > with the arch specific one, so currently
> > arch_image_file_post_load_cleanup of the x86 arch is not called. Please
> > consider which one needs to be changed.
> 
> Good catch Bao. I should change arch_image_file_post_load_cleanup() to
> arch_kimage_file_post_load_cleanup(), otherwise it never gets called and
> memory leaks.
> 
> Thanks
> Vivek
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-11-20 17:50 ` [PATCH 6/6] kexec: Support for Kexec on panic using new system call Vivek Goyal
  2013-11-28 11:28   ` Baoquan He
@ 2013-12-04  1:41   ` Baoquan He
  2013-12-04 17:19     ` Vivek Goyal
  1 sibling, 1 reply; 90+ messages in thread
From: Baoquan He @ 2013-12-04  1:41 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> This patch adds support for loading a kexec on panic (kdump) kernel usning
> new system call.
> +int load_crashdump_segments(struct kimage *image)
> +{
> +	unsigned long src_start, src_sz;
> +	unsigned long elf_addr, elf_sz;
> +	int ret;
> +
> +	/*
> +	 * Determine and load a segment for backup area. First 640K RAM
> +	 * region is backup source
> +	 */
> +
> +	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> +				image, determine_backup_region);
> +
> +	/* Zero of postive return values are ok */
                ^
Here I guess it is "or".

> +	if (ret < 0)
> +		return ret;
> +
> +	src_start = image->arch.backup_src_start;
> +	src_sz = image->arch.backup_src_sz;
> +
> +	/* Add backup segment. */
> +	if (src_sz) {
> +		ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
> +					PAGE_SIZE, 0, -1, 0,
> +					&image->arch.backup_load_addr);
> +		if (ret)
> +			return ret;
> +	}
> +


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-12-02 15:30     ` Vivek Goyal
@ 2013-12-04  1:51       ` Baoquan He
  2013-12-04 17:20         ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2013-12-04  1:51 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On 12/02/13 at 10:30am, Vivek Goyal wrote:
> On Thu, Nov 28, 2013 at 07:28:16PM +0800, Baoquan He wrote:
> 
> [..]
> > > +int crash_copy_backup_region(struct kimage *image)
> > > +{
> > 
> > Why need this func be called, backup region has been added into crash
> > segment by kexec_add_buffer, and then buffer copy is done in
> > kimage_load_crash_segment. I think this copy is handled twice. Please
> > correct me if I am wrong.
> > 
> 
> Hi Bao,
> 
> kexec_add_buffer() will copy the backup region but it will happen when
> crash kernel is loaded. We want snapshot of backup region at the time
> of crash and that's why this second call to copy backup region and it
> is executed after the crash.

Did we do this before? And will this region be changed after the 1st
kernel boots?


> 
> Thanks
> Vivek
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
                     ` (2 preceding siblings ...)
  2013-11-29  3:10   ` Baoquan He
@ 2013-12-04  1:56   ` Baoquan He
  2013-12-04  8:19     ` Baoquan He
  2013-12-04 17:32     ` Vivek Goyal
  3 siblings, 2 replies; 90+ messages in thread
From: Baoquan He @ 2013-12-04  1:56 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> + * that kexec_mutex is held.
> + */

I think kexec_add_buffer is guaranteed to be called before control pages
are allocated, so why not update image->control_page each time
kexec_add_buffer is called? Then, when a control page is needed, an
effective address in the crash kernel region can be handed out directly.
This could be a little more efficient.

> +int kexec_add_buffer(struct kimage *image, char *buffer,
> +		unsigned long bufsz, unsigned long memsz,
> +		unsigned long buf_align, unsigned long buf_min,
> +		unsigned long buf_max, int top_down, unsigned long *load_addr)
> +{
> +
> +	unsigned long nr_segments = image->nr_segments, new_nr_segments;
> +	struct kexec_segment *ksegment;
> +	struct kexec_buf *kbuf;
> +
> +	/* Currently adding segment this way is allowed only in file mode */
> +	if (!image->file_mode)
> +		return -EINVAL;
> +
> +	if (nr_segments >= KEXEC_SEGMENT_MAX)
> +		return -EINVAL;
> +
> +	/*
> +	 * Make sure we are not trying to add buffer after allocating
> +	 * control pages. All segments need to be placed first before
> +	 * any control pages are allocated. As control page allocation
> +	 * logic goes through list of segments to make sure there are
> +	 * no destination overlaps.
> +	 */
> +	WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec buffer"
> +			" after allocating control pages\n");
> +
> +	kbuf = kzalloc(sizeof(struct kexec_buf), GFP_KERNEL);
> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	kbuf->image = image;
> +	kbuf->buffer = buffer;
> +	kbuf->bufsz = bufsz;
> +	/* Align memsz to next page boundary */
> +	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> +
> +	/* Align to atleast page size boundary */
> +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> +	kbuf->buf_min = buf_min;
> +	kbuf->buf_max = buf_max;
> +	kbuf->top_down = top_down;
> +
> +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> +
> +	kbuf->image = NULL;
> +	kfree(kbuf);
> +
> +	/*
> +	 * If range could be found successfully, it would have incremented
> +	 * the nr_segments value.
> +	 */
> +	new_nr_segments = image->nr_segments;
> +
> +	/* A suitable memory range could not be found for buffer */
> +	if (new_nr_segments == nr_segments)
> +		return -EADDRNOTAVAIL;
> +
> +	/* Found a suitable memory range */
> +
> +	ksegment = &image->segment[new_nr_segments - 1];
> +	*load_addr = ksegment->mem;
> +	return 0;
> +}
> +
> +


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-04  1:56   ` Baoquan He
@ 2013-12-04  8:19     ` Baoquan He
  2013-12-04 17:32     ` Vivek Goyal
  1 sibling, 0 replies; 90+ messages in thread
From: Baoquan He @ 2013-12-04  8:19 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On 12/04/13 at 09:56am, Baoquan He wrote:
> On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> > + * that kexec_mutex is held.
> > + */
> 
> I think kexec_add_buffer is guaranteed to be called before control pages
> are allocated, so why not update image->control_page each time
> kexec_add_buffer is called? Then, when a control page is needed, an
> effective address in the crash kernel region can be handed out directly.
> This could be a little more efficient.

Checking again, it's not a good idea, because buf_align differs between
kexec_add_buffer calls. Sometimes buf_align is larger than 4k. Please
ignore this comment.

> 
> > +int kexec_add_buffer(struct kimage *image, char *buffer,
> > +		unsigned long bufsz, unsigned long memsz,
> > +		unsigned long buf_align, unsigned long buf_min,
> > +		unsigned long buf_max, int top_down, unsigned long *load_addr)
> > +{
> > +
> > +	unsigned long nr_segments = image->nr_segments, new_nr_segments;
> > +	struct kexec_segment *ksegment;
> > +	struct kexec_buf *kbuf;
> > +
> > +	/* Currently adding segment this way is allowed only in file mode */
> > +	if (!image->file_mode)
> > +		return -EINVAL;
> > +
> > +	if (nr_segments >= KEXEC_SEGMENT_MAX)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Make sure we are not trying to add buffer after allocating
> > +	 * control pages. All segments need to be placed first before
> > +	 * any control pages are allocated. As control page allocation
> > +	 * logic goes through list of segments to make sure there are
> > +	 * no destination overlaps.
> > +	 */
> > +	WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec buffer"
> > +			" after allocating control pages\n");
> > +
> > +	kbuf = kzalloc(sizeof(struct kexec_buf), GFP_KERNEL);
> > +	if (!kbuf)
> > +		return -ENOMEM;
> > +


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-04  1:35       ` Baoquan He
@ 2013-12-04 17:19         ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-12-04 17:19 UTC (permalink / raw)
  To: Baoquan He; +Cc: mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On Wed, Dec 04, 2013 at 09:35:29AM +0800, Baoquan He wrote:
> On 12/02/13 at 10:44am, Vivek Goyal wrote:
> > On Fri, Nov 29, 2013 at 11:10:48AM +0800, Baoquan He wrote:
> > 
> > [..]
> > > > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > > > +			unsigned long kernel_len, char *initrd,
> > > > +			unsigned long initrd_len, char *cmdline,
> > > > +			unsigned long cmdline_len)
> > > > +{
> > > > +	int idx = image->file_handler_idx;
> > > > +
> > > > +	if (idx < 0)
> > > > +		return ERR_PTR(-ENOEXEC);
> > > > +
> > > > +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > > > +					initrd_len, cmdline, cmdline_len);
> > > > +}
> > > > +
> > > > +int arch_image_file_post_load_cleanup(struct kimage *image)
> > > > +{
> > > 
> > > Hi Vivek,
> > > 
> > > This function is defined as one of the arch specific function set, why don't
> > > we name it with a unified prefix as the others.
> > 
> > I am using "arch_" prefix. What else to use?
> 
> I mean in this function series, other functions have name like
> arch_kexec_kernel_image_xxx, why this function is lonely, and is named
> as arch_kimage_xxx. And here what does the 'k' mean in "kimage", kexec
> image or kernel image, I am confused.

kimage is the name of the structure that contains all the data relevant
to loading the file.

I guess I can use arch_kimage_file_post_load_cleanup().

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-12-04  1:41   ` Baoquan He
@ 2013-12-04 17:19     ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-12-04 17:19 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Wed, Dec 04, 2013 at 09:41:05AM +0800, Baoquan He wrote:
> On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> > This patch adds support for loading a kexec on panic (kdump) kernel usning
> > new system call.
> > +int load_crashdump_segments(struct kimage *image)
> > +{
> > +	unsigned long src_start, src_sz;
> > +	unsigned long elf_addr, elf_sz;
> > +	int ret;
> > +
> > +	/*
> > +	 * Determine and load a segment for backup area. First 640K RAM
> > +	 * region is backup source
> > +	 */
> > +
> > +	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> > +				image, determine_backup_region);
> > +
> > +	/* Zero of postive return values are ok */
>                 ^
> Here I guess it is "or".

Yep. Will fix it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 6/6] kexec: Support for Kexec on panic using new system call
  2013-12-04  1:51       ` Baoquan He
@ 2013-12-04 17:20         ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-12-04 17:20 UTC (permalink / raw)
  To: Baoquan He; +Cc: mjg59, greg, kexec, linux-kernel, ebiederm, hpa

On Wed, Dec 04, 2013 at 09:51:27AM +0800, Baoquan He wrote:
> On 12/02/13 at 10:30am, Vivek Goyal wrote:
> > On Thu, Nov 28, 2013 at 07:28:16PM +0800, Baoquan He wrote:
> > 
> > [..]
> > > > +int crash_copy_backup_region(struct kimage *image)
> > > > +{
> > > 
> > > Why need this func be called, backup region has been added into crash
> > > segment by kexec_add_buffer, and then buffer copy is done in
> > > kimage_load_crash_segment. I think this copy is handled twice. Please
> > > correct me if I am wrong.
> > > 
> > 
> > Hi Bao,
> > 
> > kexec_add_buffer() will copy the backup region but it will happen when
> > crash kernel is loaded. We want snapshot of backup region at the time
> > of crash and that's why this second call to copy backup region and it
> > is executed after the crash.
> 
> Did we do this before? And will this region  be changed after 1st kernel
> boot

The backup region is the first 640K and it can change after load. Some
memory allocations might come from there and the kernel might write to
that memory. So copying of the backup region needs to take place as late
as possible.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-04  1:56   ` Baoquan He
  2013-12-04  8:19     ` Baoquan He
@ 2013-12-04 17:32     ` Vivek Goyal
  1 sibling, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2013-12-04 17:32 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, kexec, mjg59, greg, ebiederm, hpa

On Wed, Dec 04, 2013 at 09:56:57AM +0800, Baoquan He wrote:
> On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> > + * that kexec_mutex is held.
> > + */
> 
> I think kexec_add_buffer is guaranteed to be called before allocating
> control pages, why not updating image->control_page after each time
> kexec_add_buffer is called. Then when control page is needed, effective
> address in crash_kernel region can be given. This can be a little more
> efficient.

image->control_page controls the lowest address available for control
pages in the crash kernel region. When we do kexec_add_buffer, we don't
necessarily know whether there is an empty page available between
segments or not.

Also, the existing kexec logic does not update image->control_page when
segments are being copied.

So I think this does not offer any huge benefit, and it is not a
performance-critical path. I will just leave it as it is.
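For readers following the control-page discussion, a simplified sketch of
the destination-overlap test that the control page allocator has to
perform against the segment list (names and structure are illustrative,
not the exact kernel code):

/* Sketch: would the page range [addr, addr + size) clobber a segment? */
static int hits_destination_sketch(struct kimage *image,
				   unsigned long addr, unsigned long size)
{
	unsigned long i;

	for (i = 0; i < image->nr_segments; i++) {
		unsigned long mstart = image->segment[i].mem;
		unsigned long mend = mstart + image->segment[i].memsz;

		if (addr + size > mstart && addr < mend)
			return 1; /* overlaps a destination page */
	}
	return 0;
}

This is why all kexec_add_buffer() calls have to come first: a control
page picked before the segment list is complete could later turn out to
be someone's destination.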

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-11-23  3:23         ` Eric W. Biederman
@ 2013-12-04 19:34           ` Vivek Goyal
  2013-12-05  4:10             ` Eric W. Biederman
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-12-04 19:34 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, kexec, hpa, mjg59, greg

On Fri, Nov 22, 2013 at 07:23:39PM -0800, Eric W. Biederman wrote:
> 
> > [..]
> >> >> There is also a huge missing piece of this in that your purgatory is not
> >> >> checking a hash of the loaded image before jumping to it.  Without that
> >> >> this is a huge regression at least for the kexec on panic case.  We
> >> >> absolutely need to check that the kernel sitting around in memory has
> >> >> not been corrupted before we let it run very far.
> >> >
> >> > Agreed. This should not be hard. It is just a matter of calculating
> >> > digest of segments. I will store it in kimage and verify digest again
> >> > before passing control to control page. Will fix it in next version.
> >> 
> >> Nak.  The verification needs to happen in purgatory. 
> >> 
> >> The verification needs to happen in code whose runtime environment is
> >> does not depend on random parts of the kernel.  Anything else is a
> >> regression in maintainability and reliability.
> >> 
> >> It is the wrong direction to add any code to what needs to run in the
> >> known broken environment of the kernel when a panic happens.
> >> 
> >> Which means that you almost certainly need to go to the trouble of
> >> supporting the complexity needed to support purgatory code written in C.
> >> 
> >> (For those just tuning in purgatory is our term for the code that runs
> >> between the kernels to do those things that can not happen a priori).
> >
> > In general, I agree with not using kernel parts after crash.
> >
> > But what protects against that purgatory itself has been scribbled over.
> > IOW, how different purgatory memory is as compared to kernel memory where
> > digest routines are stored. They have an equal probability of being scribbled
> > over, and if that's the case one is not better than the other?
> >
> > And if they both have an equal probability of getting corrupted, then there does
> > not seem to be an advantage in moving digest verification inside
> > purgatory.
> 
> The primary reason is that maintenance of code in the kernel that is
> safe during a crash dump is hard.  That is why we boot a second kernel
> after all.  If the code to do the signature verification resides in
> machine_kexec on the kexec on panic code path in the kernel that has
> called panic it is almost a given that at some point or other someone
> will add an option that will add a weird dependency that makes the code
> unsafe when the kernel is crashing.  I have seen it happen several times
> on the existing kexec on panic code path.  I have seen it on other code
> paths like netconsole, which on some kernels I have running can
> currently cause the kernel to go into an endless printk loop if you call
> printk from interrupt context.  So what we really gain by moving the
> verification into purgatory is protection from inappropriate code reuse.
> 
> So having a completely separate piece of code may be a little harder to
> write initially, but the code is much simpler and more reliable to
> maintain, essentially requiring no maintenance effort.  Further, getting
> to the point where purgatory is written in C makes small changes much
> more approachable.

Hi Eric,

So you want separate purgatory code, and that purgatory should be
self-contained and should not share any code with the rest of the kernel.
No inclusion of header files, no linking against kernel libraries? That
means even re-implementing the sha256 functions separately (as user space
does)?

If code maintenance is a concern, then I think I can reimplement some
of the functions to calculate sha256 in separate crash files and invoke
those to reduce code sharing with the rest of the kernel. And we should
be able to link against the kernel and not have to create a separate
relocatable purgatory object and relocate it.

IOW, does purgatory still have to be a relocatable object? I think
user space had no choice, but given that we are implementing this in the
kernel, I should be able to implement my own hash calculation and segment
verification code, link it to the existing kernel, and invoke it outside
purgatory. Anyway, we call so many other functions after a crash to stop
cpus, save registers, etc.
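
To be concrete, this is roughly what I have in mind, as a sketch only
(the sha256_init/update/final helpers and the kbuf field are assumptions
about how the final code might look, not existing interfaces):

static int kexec_calculate_segment_digest(struct kimage *image, u8 *digest)
{
        struct sha256_state state;      /* assumed self-contained sha256 */
        unsigned long i;

        sha256_init(&state);
        for (i = 0; i < image->nr_segments; i++) {
                struct kexec_segment *seg = &image->segment[i];

                /* kbuf: in-kernel copy of the segment data (assumption) */
                sha256_update(&state, seg->kbuf, seg->bufsz);
        }
        sha256_final(&state, digest);   /* compared again before jumping */
        return 0;
}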

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading
  2013-12-04 19:34           ` Vivek Goyal
@ 2013-12-05  4:10             ` Eric W. Biederman
  0 siblings, 0 replies; 90+ messages in thread
From: Eric W. Biederman @ 2013-12-05  4:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, kexec, hpa, mjg59, greg

Vivek Goyal <vgoyal@redhat.com> writes:

> Hi Eric,
>
> So you want separate purgatory code, and that purgatory should be
> self-contained and should not share any code with the rest of the kernel.
> No inclusion of header files, no linking against kernel libraries? That
> means even re-implementing the sha256 functions separately (as user space
> does)?

Only trivial sharing of code with the rest of the kernel.  But only
as much as, say, the kernel decompressor has.

> If code maintenance is a concern, then I think I can reimplement some
> of the functions to calculate sha256 in separate crash files and invoke
> those to reduce code sharing with the rest of the kernel. And we should
> be able to link against the kernel and not have to create a separate
> relocatable purgatory object and relocate it.

It is both code maintenance and the fact that we have a strong
expectation that where purgatory lives should not be corrupted.
Plus, from what I have seen, maintenance becomes much simpler if there is
a little bit of C code that lives between the two kernels.  At that
point people don't have to grok assembly to be able to touch anything,
and by simply living there it enforces separation of concerns
from the kernel in a way that is trivial and obvious.

Plus we already have all of the code in userspace to do all of this work,
so it is not something you would need to write from scratch, merely
something that you would need to adapt.

> IOW, does purgatory still have to be a relocatable object?

Fundamentally purgatory does need to be a relocatable object.

> I think
> user space had no choice, but given that we are implementing this in the
> kernel, I should be able to implement my own hash calculation and segment
> verification code, link it to the existing kernel, and invoke it outside
> purgatory.

Doing this outside of purgatory and linking to the rest of the kernel is
almost certainly enough to get someone to perform an obvious cleanup
that will undermine the purpose of the code.  Or perhaps it will be
merely a reference to the GOT table behind our backs in the C code
generated by the compiler that will undermine this checking.

Linking our sanity checks to the rest of the kernel leaves me
profoundly uncomfortable.

> Anyway, we call so many other functions after
> crash to stop cpus, save registers, etc.

There is no other possible place to stop cpus and save the cpu
registers.  As much as possible that should be the only justification
for the code we run on the kexec on panic code path.

Honestly, calling so many other functions on that code path is a good
reason to see about removing them.  In addition to my other reasons,
adding the hash calculation on that path will likely confuse issues
more than help them.

Eric


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-26 14:27                   ` Vivek Goyal
@ 2013-12-19 12:54                     ` Torsten Duwe
  2013-12-20 14:19                       ` Vivek Goyal
  0 siblings, 1 reply; 90+ messages in thread
From: Torsten Duwe @ 2013-12-19 12:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Eric W. Biederman, Matthew Garrett, Greg KH, linux-kernel, kexec,
	hpa, Peter Jones, Kees Cook

On Tue, Nov 26, 2013 at 09:27:59AM -0500, Vivek Goyal wrote:
> On Tue, Nov 26, 2013 at 04:23:36AM -0800, Eric W. Biederman wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > On Fri, Nov 22, 2013 at 07:39:14PM -0800, Eric W. Biederman wrote:
> > >
> > 
> > The init_size should be reflected in the .bss of the ELF segments.  If
> > not it is a bug when generating the kernel ELF headers and should be
> > fixed.
> > 
> > For use by kexec I don't see any issues with just signing the embedded
> > ELF image.
> 
> Being able to write the kernel to a file and then load it feels a little
> odd to me. This should be allowed, but it should not be mandatory.
> 
> I think if we allow passing a detached signature in the kexec system call,
> it makes things much more flexible. We should be able to do what you are
> suggesting, and at the same time it keeps the possibility open for what
> the ChromeOS developers are looking for.
> 
Support for detached signatures would be a big plus for kexec_file_load.

First I thought, some vendors are already shipping signed bzImages, why not
verify these? But as it turns out, parsing MS-DOS, PE/COFF headers just to
find the gaps is a lot of bloat for this little functionality. And David Howells
got flamed quite badly when he suggested adding pkcs#7 to the kernel.
IMO it's up to user land to search lists of certificates, and present
only the final chain of trust to the kernel for checking.

ELF is the preferred format for most sane OSes and firmware, and a detached
signature would probably be simplest to check. If we have the choice,
without restrictions from braindead boot loaders, ELF should be first.
And if the pesigning isn't usable and another sig is needed anyway,
why not apply that to vmlinux(.gz) ?

Another remaining issue is the root of trust. Should the kernel solely depend
on UEFI to check certificates? I'd rather vote for a compile-time fallback key.

	Torsten


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-19 12:54                     ` Torsten Duwe
@ 2013-12-20 14:19                       ` Vivek Goyal
  2013-12-20 23:11                         ` Eric W. Biederman
  0 siblings, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2013-12-20 14:19 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Eric W. Biederman, Matthew Garrett, Greg KH, linux-kernel, kexec,
	hpa, Peter Jones, Kees Cook

On Thu, Dec 19, 2013 at 01:54:39PM +0100, Torsten Duwe wrote:
> On Tue, Nov 26, 2013 at 09:27:59AM -0500, Vivek Goyal wrote:
> > On Tue, Nov 26, 2013 at 04:23:36AM -0800, Eric W. Biederman wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > 
> > > > On Fri, Nov 22, 2013 at 07:39:14PM -0800, Eric W. Biederman wrote:
> > > >
> > > 
> > > The init_size should be reflected in the .bss of the ELF segments.  If
> > > not it is a bug when generating the kernel ELF headers and should be
> > > fixed.
> > > 
> > > For use by kexec I don't see any issues with just signing the embedded
> > > ELF image.
> > 
> > Being able to write the kernel to a file and then load it feels a little
> > odd to me. This should be allowed, but it should not be mandatory.
> > 
> > I think if we allow passing a detached signature in the kexec system call,
> > it makes things much more flexible. We should be able to do what you are
> > suggesting, and at the same time it keeps the possibility open for what
> > the ChromeOS developers are looking for.
> > 
> Support for detached signatures would be a big plus for kexec_file_load.
> 
> First I thought, some vendors are already shipping signed bzImages, why not
> verify these? But as it turns out, parsing MS-DOS, PE/COFF headers just to
> find the gaps is a lot of bloat for this little functionality. And David Howells
> got flamed quite badly when he suggested adding pkcs#7 to the kernel.

Yep, that's the reason I am not proposing parsing and verifying PE/COFF
signatures.

> IMO it's up to user land to search lists of certificates, and present
> only the final chain of trust to the kernel for checking.
> 
> ELF is the preferred format for most sane OSes and firmware, and a detached
> signature would probably be simplest to check. If we have the choice,
> without restrictions from braindead boot loaders, ELF should be first.
> And if the pesigning isn't usable and another sig is needed anyway,
> why not apply that to vmlinux(.gz) ?

I have yet to look deeper into whether we can sign ELF images and just
use the ELF loader, and whether user space can extract the ELF image out
of a bzImage and pass it to the kernel.

Even if it is doable, one disadvantage seems to be that the extracted
ELF image will have to be written to a file so that its file descriptor
can be passed to the kernel. And that assumes a writable root, and the
Chrome folks seem to have setups where root is not writable.

> 
> Another remaining issue is the root of trust. Should the kernel solely depend
> on UEFI to check certificates? I'd rather vote for a compile-time fallback key.

I was thinking of signing the kernel along the lines of modules, that is,
RSA PKCS#1 signatures. The signature does not contain a certificate chain.
The signing key should be in the kernel, and that key can come from the
UEFI db, or the user might have added it to MOK during boot, or we can
embed one during the kernel build, etc. The verification code will not
care how the key showed up in the kernel; there could be multiple ways.
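
As a rough sketch only (kexec_verify_detached_sig(), the sha256() helper,
find_trusted_kexec_key() and verify_rsa_pkcs1_digest() are placeholder
names, not existing kernel interfaces), the verification could look
something like this:

static int kexec_verify_detached_sig(const void *kernel, size_t kernel_len,
                                     const void *sig, size_t sig_len)
{
        u8 digest[SHA256_DIGEST_SIZE];
        struct key *key;

        /* Hash the image read in via the kernel fd (assumed helper) */
        sha256(kernel, kernel_len, digest);

        /* Key may have come from the UEFI db, MOK, or a built-in cert */
        key = find_trusted_kexec_key();                 /* assumed helper */
        if (IS_ERR(key))
                return PTR_ERR(key);

        /* RSA PKCS#1 check of the detached signature (assumed helper) */
        return verify_rsa_pkcs1_digest(key, digest, sizeof(digest),
                                       sig, sig_len);
}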

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 14:19                       ` Vivek Goyal
@ 2013-12-20 23:11                         ` Eric W. Biederman
  2013-12-20 23:20                           ` Kees Cook
  2013-12-20 23:20                           ` H. Peter Anvin
  0 siblings, 2 replies; 90+ messages in thread
From: Eric W. Biederman @ 2013-12-20 23:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Torsten Duwe, Matthew Garrett, Greg KH, linux-kernel, kexec, hpa,
	Peter Jones, Kees Cook

Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Dec 19, 2013 at 01:54:39PM +0100, Torsten Duwe wrote:
>> On Tue, Nov 26, 2013 at 09:27:59AM -0500, Vivek Goyal wrote:

>> IMO it's up to user land to search lists of certificates, and present
>> only the final chain of trust to the kernel for checking.
>> 
>> ELF is the preferred format for most sane OSes and firmware, and a detached
>> signature would probably be simplest to check. If we have the choice,
>> without restrictions from braindead boot loaders, ELF should be first.
>> And if the pesigning isn't usable and another sig is needed anyway,
>> why not apply that to vmlinux(.gz) ?
>
> I have yet to look deeper into whether we can sign ELF images and just
> use the ELF loader, and whether user space can extract the ELF image out
> of a bzImage and pass it to the kernel.
>
> Even if it is doable, one disadvantage seems to be that the extracted
> ELF image will have to be written to a file so that its file descriptor
> can be passed to the kernel. And that assumes a writable root, and the
> Chrome folks seem to have setups where root is not writable.

In that case the chrome folks would simply have to use an ELF format
kernel and not a bzImage.

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 23:11                         ` Eric W. Biederman
@ 2013-12-20 23:20                           ` Kees Cook
  2013-12-21 11:38                             ` Torsten Duwe
  2014-01-02 20:39                             ` Vivek Goyal
  2013-12-20 23:20                           ` H. Peter Anvin
  1 sibling, 2 replies; 90+ messages in thread
From: Kees Cook @ 2013-12-20 23:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Vivek Goyal, Torsten Duwe, Matthew Garrett, Greg KH, LKML, kexec,
	H. Peter Anvin, Peter Jones

On Fri, Dec 20, 2013 at 3:11 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
>> On Thu, Dec 19, 2013 at 01:54:39PM +0100, Torsten Duwe wrote:
>>> On Tue, Nov 26, 2013 at 09:27:59AM -0500, Vivek Goyal wrote:
>
>>> IMO it's up to user land to search lists of certificates, and present
>>> only the final chain of trust to the kernel for checking.
>>>
>>> ELF is the preferred format for most sane OSes and firmware, and a detached
>>> signature would probably be simplest to check. If we have the choice,
>>> without restrictions from braindead boot loaders, ELF should be first.
>>> And if the pesigning isn't usable and another sig is needed anyway,
>>> why not apply that to vmlinux(.gz) ?
>>
>> I have yet to look deeper into whether we can sign ELF images and just
>> use the ELF loader, and whether user space can extract the ELF image out
>> of a bzImage and pass it to the kernel.
>>
>> Even if it is doable, one disadvantage seems to be that the extracted
>> ELF image will have to be written to a file so that its file descriptor
>> can be passed to the kernel. And that assumes a writable root, and the
>> Chrome folks seem to have setups where root is not writable.
>
> In that case the chrome folks would simply have to use an ELF format
> kernel and not a bzImage.

If we're doing fd origin verification (not signatures), can't we
continue to use a regular bzImage?

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 23:11                         ` Eric W. Biederman
  2013-12-20 23:20                           ` Kees Cook
@ 2013-12-20 23:20                           ` H. Peter Anvin
  2013-12-21  1:32                             ` Eric W. Biederman
  1 sibling, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2013-12-20 23:20 UTC (permalink / raw)
  To: Eric W. Biederman, Vivek Goyal
  Cc: Torsten Duwe, Matthew Garrett, Greg KH, linux-kernel, kexec,
	Peter Jones, Kees Cook

On 12/20/2013 03:11 PM, Eric W. Biederman wrote:
> 
> In that case the chrome folks would simply have to use an ELF format
> kernel and not a bzImage.
> 

This is starting to feel like everything is going in the direction of a
massive feature regression.  bzImage may be weird (it has definitely
grown organically), but the features that have been added to it have
generally been for a reason, e.g. kernel relocation and so on.

	-hpa



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 23:20                           ` H. Peter Anvin
@ 2013-12-21  1:32                             ` Eric W. Biederman
  2013-12-21  3:32                               ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: Eric W. Biederman @ 2013-12-21  1:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vivek Goyal, Torsten Duwe, Matthew Garrett, Greg KH,
	linux-kernel, kexec, Peter Jones, Kees Cook

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 12/20/2013 03:11 PM, Eric W. Biederman wrote:
>> 
>> In that case the chrome folks would simply have to use an ELF format
>> kernel and not a bzImage.
>> 
>
> This is starting to feel like everything is going in the direction of a
> massive feature regression.  bzImage may be weird (it has definitely
> grown organically), but the features that have been added to it have
> generally been for a reason, e.g. kernel relocation and so on.

Stuff and nonsense.  bzImage is just an ugly wrapper around an ELF
image.

I am just arguing that we expose the clean portable underpinnings and
make that work.

It absolutely does not make sense to make a solution that only works for
x86.  ELF is what every other architecture uses, so we absolutely have to
make any feature we build work with ELF.

At a very basic level for this feature ELF is good enough.  bzImage
isn't.

In the worst case distros will have to package a second binary of the
same kernel in their kernel rpm.  I don't know that there is any point
in supporting anything except ELF in the kernel.  Given that the package
and distribution are going to have to change anyway to include signing,
a change in file format hardly seems scary.

But my point above was really that ELF is sufficient for the use case of
doing file-based verification based on fds, in addition to the use case
of using detached signatures.  Which is really a long-winded way of
saying the argument "But but but my distro only ships a bzImage today"
is a horrible technical argument.

I am not fundamentally opposed to supporting other file formats, but
given that ELF wins on both practical and technical grounds, ELF should
be the primary file format for kexec_file_load.  We can worry about
other file formats once ELF is shown to work.

Eric

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-21  1:32                             ` Eric W. Biederman
@ 2013-12-21  3:32                               ` H. Peter Anvin
  2013-12-21 12:15                                 ` Torsten Duwe
  0 siblings, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2013-12-21  3:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Vivek Goyal, Torsten Duwe, Matthew Garrett, Greg KH,
	linux-kernel, kexec, Peter Jones, Kees Cook

On 12/20/2013 05:32 PM, Eric W. Biederman wrote:
> 
> Stuff and nonsense.  bzImage is just an ugly wrapper around an ELF
> image.
> 

Not really.  We put the ELF image in there to help Xen and presumably
kexec, but there are actually quite a few issues with it... for one
thing, as currently built there are megabytes of zeroes in it for no
good reason.

Even if you don't need the entry code, the additional metadata is
meaningful.

	-hpa



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 23:20                           ` Kees Cook
@ 2013-12-21 11:38                             ` Torsten Duwe
  2014-01-02 20:39                             ` Vivek Goyal
  1 sibling, 0 replies; 90+ messages in thread
From: Torsten Duwe @ 2013-12-21 11:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Vivek Goyal, Matthew Garrett, Greg KH, LKML,
	kexec, H. Peter Anvin, Peter Jones

On Fri, Dec 20, 2013 at 03:20:16PM -0800, Kees Cook wrote:
> 
> If we're doing fd origin verification (not signatures), can't we
> continue to use a regular bzImage?

I imagine the verification function being separate from the new kexec
code. It gets passed the proposed new kernel file*, an optional signature
file*, and maybe hints about the kernel format and system security state.

With that function, you can do whatever you want. If you want to take
an early positive exit due to the location of the new kernel, you're fine.
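
Purely as an illustration (none of these names exist in the kernel), such
a hook might have a shape like:

/* Illustrative only: a format-agnostic verification hook. */
enum kexec_img_format {
        KEXEC_IMG_BZIMAGE,
        KEXEC_IMG_ELF,
};

int kexec_verify_image(struct file *kernel_file,
                       struct file *sig_file,           /* may be NULL */
                       enum kexec_img_format format,
                       bool secure_boot_enabled);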

	Torsten


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-21  3:32                               ` H. Peter Anvin
@ 2013-12-21 12:15                                 ` Torsten Duwe
  0 siblings, 0 replies; 90+ messages in thread
From: Torsten Duwe @ 2013-12-21 12:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Eric W. Biederman, Vivek Goyal, Matthew Garrett, Greg KH,
	linux-kernel, kexec, Peter Jones, Kees Cook

On Fri, Dec 20, 2013 at 07:32:11PM -0800, H. Peter Anvin wrote:
> thing, as currently built there are megabytes of zeroes in it for no
> good reason.

Then remove them ;) AFAICS, that's x86 only? What a waste!

What's the reason? ALIGN_RODATA? Even if so, vmlinux.gz might be
a fair trade-off.

> 
> Even if you don't need the entry code, the additional metadata is
> meaningful.

Any idea, or maybe even a list of features that would get lost?
What are the blossoms of this organically grown structure?
Many architectures, even embedded x86, happily boot any ELF kernel.

I'm with Eric here: this is not about _not_ supporting bzImage, it's
about supporting ELF _first_. As I wrote: if the existing signature
is, let's say, impractical, and a new one is needed anyway, why not
(detached-) sign vmlinux or vmlinux.gz?

Every architecture can benefit from a secure boot or secure kexec
that is done right.

	Torsten


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-12-20 23:20                           ` Kees Cook
  2013-12-21 11:38                             ` Torsten Duwe
@ 2014-01-02 20:39                             ` Vivek Goyal
  2014-01-02 20:56                               ` H. Peter Anvin
  1 sibling, 1 reply; 90+ messages in thread
From: Vivek Goyal @ 2014-01-02 20:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Torsten Duwe, Matthew Garrett, Greg KH, LKML,
	kexec, H. Peter Anvin, Peter Jones

On Fri, Dec 20, 2013 at 03:20:16PM -0800, Kees Cook wrote:
> On Fri, Dec 20, 2013 at 3:11 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> >> On Thu, Dec 19, 2013 at 01:54:39PM +0100, Torsten Duwe wrote:
> >>> On Tue, Nov 26, 2013 at 09:27:59AM -0500, Vivek Goyal wrote:
> >
> >>> IMO it's up to user land to search lists of certificates, and present
> >>> only the final chain of trust to the kernel for checking.
> >>>
> >>> ELF is the preferred format for most sane OSes and firmware, and a detached
> >>> signature would probably be simplest to check. If we have the choice,
> >>> without restrictions from braindead boot loaders, ELF should be first.
> >>> And if the pesigning isn't usable and another sig is needed anyway,
> >>> why not apply that to vmlinux(.gz) ?
> >>
> >> I have yet to look deeper into whether we can sign ELF images and just
> >> use the ELF loader, and whether user space can extract the ELF image out
> >> of a bzImage and pass it to the kernel.
> >>
> >> Even if it is doable, one disadvantage seems to be that the extracted
> >> ELF image will have to be written to a file so that its file descriptor
> >> can be passed to the kernel. And that assumes a writable root, and the
> >> Chrome folks seem to have setups where root is not writable.
> >
> > In that case the chrome folks would simply have to use an ELF format
> > kernel and not a bzImage.
> 
> If we're doing fd origin verification (not signatures), can't we
> continue to use a regular bzImage?

If secureboot is enabled, it enforces module signature verification. I
think something similar will happen for kexec too. How would the kernel
know that, on a secureboot platform, fd origin verification will happen
and that it is sufficient?

I personally want to support bzImage as well (apart from ELF), because
distributions have been shipping bzImage for a long time and I don't
want to enforce a change there because of secureboot. It is not necessary.
Right now I am thinking more about storing detached bzImage signatures
and passing those signatures to the kexec system call.
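
One purely hypothetical shape for this at the syscall boundary (both the
sig_fd argument and the rest of the argument list are my illustration
here, not the interface proposed in this series):

/*
 * Hypothetical extension: a third fd carrying a detached signature of
 * the kernel image; -1 would mean "no signature supplied".
 */
long kexec_file_load(int kernel_fd, int initrd_fd, int sig_fd,
                     unsigned long cmdline_len, const char *cmdline,
                     unsigned long flags);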

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2014-01-02 20:39                             ` Vivek Goyal
@ 2014-01-02 20:56                               ` H. Peter Anvin
  2014-01-06 21:33                                 ` Josh Boyer
  0 siblings, 1 reply; 90+ messages in thread
From: H. Peter Anvin @ 2014-01-02 20:56 UTC (permalink / raw)
  To: Vivek Goyal, Kees Cook
  Cc: Eric W. Biederman, Torsten Duwe, Matthew Garrett, Greg KH, LKML,
	kexec, Peter Jones

On 01/02/2014 12:39 PM, Vivek Goyal wrote:
> 
> If secureboot is enabled, it enforces module signature verification. I
> think something similar will happen for kexec too. How would the kernel
> know that, on a secureboot platform, fd origin verification will happen
> and that it is sufficient?
> 
> I personally want to support bzImage as well (apart from ELF), because
> distributions have been shipping bzImage for a long time and I don't
> want to enforce a change there because of secureboot. It is not necessary.
> Right now I am thinking more about storing detached bzImage signatures
> and passing those signatures to the kexec system call.
> 

Since the secureboot scenario probably means people will be signing
those kernels, and those kernels are going to be EFI images, in order
to have "one kernel, one signature" there will be a desire to support
signed PE images.  Yes, PE is ugly, but it shouldn't be too bad.
However, it is probably one of those things that can be dealt with one
bit at a time.

	-hpa



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2014-01-02 20:56                               ` H. Peter Anvin
@ 2014-01-06 21:33                                 ` Josh Boyer
  2014-01-07  4:22                                   ` H. Peter Anvin
  0 siblings, 1 reply; 90+ messages in thread
From: Josh Boyer @ 2014-01-06 21:33 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vivek Goyal, Kees Cook, Eric W. Biederman, Torsten Duwe,
	Matthew Garrett, Greg KH, LKML, kexec, Peter Jones

On Thu, Jan 2, 2014 at 3:56 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/02/2014 12:39 PM, Vivek Goyal wrote:
>>
>> If secureboot is enabled, it enforces module signature verification. I
>> think something similar will happen for kexec too. How would the kernel
>> know that, on a secureboot platform, fd origin verification will happen
>> and that it is sufficient?
>>
>> I personally want to support bzImage as well (apart from ELF), because
>> distributions have been shipping bzImage for a long time and I don't
>> want to enforce a change there because of secureboot. It is not necessary.
>> Right now I am thinking more about storing detached bzImage signatures
>> and passing those signatures to the kexec system call.
>>
>
> Since the secureboot scenario probably means people will be signing
> those kernels, and those kernels are going to be EFI images, in order
> to have "one kernel, one signature" there will be a desire to support
> signed PE images.  Yes, PE is ugly, but it shouldn't be too bad.
> However, it is probably one of those things that can be dealt with one
> bit at a time.

David Howells posted patches to support signed PE binaries early last
year.  They were rejected rather quickly.

https://lkml.org/lkml/2013/2/21/196

That was for loading keys via PE binaries, but the parser is needed
either way.  Unless I'm misunderstanding what you're suggesting?

josh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2014-01-06 21:33                                 ` Josh Boyer
@ 2014-01-07  4:22                                   ` H. Peter Anvin
  0 siblings, 0 replies; 90+ messages in thread
From: H. Peter Anvin @ 2014-01-07  4:22 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Vivek Goyal, Kees Cook, Eric W. Biederman, Torsten Duwe,
	Matthew Garrett, Greg KH, LKML, kexec, Peter Jones

On 01/06/2014 01:33 PM, Josh Boyer wrote:
> On Thu, Jan 2, 2014 at 3:56 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 01/02/2014 12:39 PM, Vivek Goyal wrote:
>>>
>>> If secureboot is enabled, it enforces module signature verification. I
>>> think something similar will happen for kexec too. How would the kernel
>>> know that, on a secureboot platform, fd origin verification will happen
>>> and that it is sufficient?
>>>
>>> I personally want to support bzImage as well (apart from ELF), because
>>> distributions have been shipping bzImage for a long time and I don't
>>> want to enforce a change there because of secureboot. It is not necessary.
>>> Right now I am thinking more about storing detached bzImage signatures
>>> and passing those signatures to the kexec system call.
>>>
>>
>> Since the secureboot scenario probably means people will be signing
>> those kernels, and those kernels are going to be EFI images, in order
>> to have "one kernel, one signature" there will be a desire to support
>> signed PE images.  Yes, PE is ugly, but it shouldn't be too bad.
>> However, it is probably one of those things that can be dealt with one
>> bit at a time.
> 
> David Howells posted patches to support signed PE binaries early last
> year.  They were rejected rather quickly.
> 
> https://lkml.org/lkml/2013/2/21/196
> 
> That was for loading keys via PE binaries, but the parser is needed
> either way.  Unless I'm misunderstanding what you're suggesting?
> 

I know.  I think kexec is a better motivation, though.

	-hpa



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec
  2013-11-22 20:42   ` Jiri Kosina
@ 2014-01-17 19:17     ` Vivek Goyal
  0 siblings, 0 replies; 90+ messages in thread
From: Vivek Goyal @ 2014-01-17 19:17 UTC (permalink / raw)
  To: Jiri Kosina; +Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg

On Fri, Nov 22, 2013 at 09:42:52PM +0100, Jiri Kosina wrote:

[..]
> > @@ -843,7 +1075,11 @@ static int kimage_load_normal_segment(struct kimage *image,
> >  				PAGE_SIZE - (maddr & ~PAGE_MASK));
> >  		uchunk = min(ubytes, mchunk);
> >  
> > -		result = copy_from_user(ptr, buf, uchunk);
> > +		/* For file based kexec, source pages are in kernel memory */
> > +		if (image->file_mode)
> > +			memcpy(ptr, buf, uchunk);
> 
> Very minor nit I came across when going through the patchset -- can't we 
> use some different buffer for the file-based kexec that's not marked 
> __user here? This really causes some eye-pain when looking at the code.

Hi Jiri,

Sorry, responding to your comment after a very long time.

Now I have made the buf field a union, as it can be either a user pointer
or a kernel pointer depending on which kexec syscall has been used. The
caller needs to use either segment->buf or segment->kbuf based on the
context of the code.

	struct kexec_segment {
		union {
			void __user *buf;
			void *kbuf;
		};
	};
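
So the copy loop in kimage_load_normal_segment() can pick the right
member based on image->file_mode, roughly like this sketch (not the
exact hunk from the next version):

		if (image->file_mode)
			memcpy(ptr, kbuf, uchunk);	/* kernel pointer from segment->kbuf */
		else
			result = copy_from_user(ptr, buf, uchunk);	/* __user pointer from segment->buf */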

Thanks
Vivek

^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2014-01-17 19:17 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-20 17:50 [PATCH 0/6] kexec: A new system call to allow in kernel loading Vivek Goyal
2013-11-20 17:50 ` [PATCH 1/6] kexec: Export vmcoreinfo note size properly Vivek Goyal
2013-11-21 18:59   ` Greg KH
2013-11-21 19:08     ` Vivek Goyal
2013-11-20 17:50 ` [PATCH 2/6] kexec: Move segment verification code in a separate function Vivek Goyal
2013-11-20 17:50 ` [PATCH 3/6] resource: Provide new functions to walk through resources Vivek Goyal
2013-11-20 17:50 ` [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec Vivek Goyal
2013-11-21 19:03   ` Greg KH
2013-11-21 19:06     ` Matthew Garrett
2013-11-21 19:13       ` Vivek Goyal
2013-11-21 19:19         ` Matthew Garrett
2013-11-21 19:24           ` Vivek Goyal
2013-11-22 18:57           ` Vivek Goyal
2013-11-23  3:39             ` Eric W. Biederman
2013-11-25 16:39               ` Vivek Goyal
2013-11-26 12:23                 ` Eric W. Biederman
2013-11-26 14:27                   ` Vivek Goyal
2013-12-19 12:54                     ` Torsten Duwe
2013-12-20 14:19                       ` Vivek Goyal
2013-12-20 23:11                         ` Eric W. Biederman
2013-12-20 23:20                           ` Kees Cook
2013-12-21 11:38                             ` Torsten Duwe
2014-01-02 20:39                             ` Vivek Goyal
2014-01-02 20:56                               ` H. Peter Anvin
2014-01-06 21:33                                 ` Josh Boyer
2014-01-07  4:22                                   ` H. Peter Anvin
2013-12-20 23:20                           ` H. Peter Anvin
2013-12-21  1:32                             ` Eric W. Biederman
2013-12-21  3:32                               ` H. Peter Anvin
2013-12-21 12:15                                 ` Torsten Duwe
2013-11-21 19:16     ` Vivek Goyal
2013-11-22  1:03     ` Kees Cook
2013-11-22  2:13       ` Vivek Goyal
2013-11-22 20:42   ` Jiri Kosina
2014-01-17 19:17     ` Vivek Goyal
2013-11-29  3:10   ` Baoquan He
2013-12-02 15:27     ` WANG Chao
2013-12-02 15:44     ` Vivek Goyal
2013-12-04  1:35       ` Baoquan He
2013-12-04 17:19         ` Vivek Goyal
2013-12-04  1:56   ` Baoquan He
2013-12-04  8:19     ` Baoquan He
2013-12-04 17:32     ` Vivek Goyal
2013-11-20 17:50 ` [PATCH 5/6] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
2013-11-21 19:07   ` Greg KH
2013-11-21 19:21     ` Vivek Goyal
2013-11-22 15:24       ` H. Peter Anvin
2013-11-28 11:35   ` Baoquan He
2013-12-02 15:36     ` Vivek Goyal
2013-11-20 17:50 ` [PATCH 6/6] kexec: Support for Kexec on panic using new system call Vivek Goyal
2013-11-28 11:28   ` Baoquan He
2013-12-02 15:30     ` Vivek Goyal
2013-12-04  1:51       ` Baoquan He
2013-12-04 17:20         ` Vivek Goyal
2013-12-04  1:41   ` Baoquan He
2013-12-04 17:19     ` Vivek Goyal
2013-11-21 18:58 ` [PATCH 0/6] kexec: A new system call to allow in kernel loading Greg KH
2013-11-21 19:07   ` Vivek Goyal
2013-11-21 19:46     ` Vivek Goyal
2013-11-21 19:06 ` Geert Uytterhoeven
2013-11-21 19:14   ` Vivek Goyal
2013-11-21 23:07 ` Eric W. Biederman
2013-11-22  1:28   ` H. Peter Anvin
2013-11-22  2:35     ` Vivek Goyal
2013-11-22  2:40       ` H. Peter Anvin
2013-11-22  1:55   ` Vivek Goyal
2013-11-22  9:09     ` Geert Uytterhoeven
2013-11-22 13:30       ` Jiri Kosina
2013-11-22 13:46         ` Vivek Goyal
2013-11-22 13:50           ` Jiri Kosina
2013-11-22 15:33             ` Vivek Goyal
2013-11-22 17:45               ` Kees Cook
2013-11-22 13:43       ` Vivek Goyal
2013-11-22 15:25         ` Geert Uytterhoeven
2013-11-22 15:33           ` Jiri Kosina
2013-11-22 15:57             ` Eric Paris
2013-11-22 16:04               ` Jiri Kosina
2013-11-22 16:08                 ` Vivek Goyal
2013-11-22 13:34     ` Eric W. Biederman
2013-11-22 14:19       ` Vivek Goyal
2013-11-22 19:48         ` Greg KH
2013-11-23  3:23         ` Eric W. Biederman
2013-12-04 19:34           ` Vivek Goyal
2013-12-05  4:10             ` Eric W. Biederman
2013-11-25 10:04       ` Michael Holzheu
2013-11-25 15:36         ` Vivek Goyal
2013-11-25 16:15           ` Michael Holzheu
2013-11-22  0:55 ` HATAYAMA Daisuke
2013-11-22  2:03   ` Vivek Goyal
2013-12-03 13:23 ` Baoquan He
