* [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Hi,

This is V3 of the patchset. Previous versions were posted here.

V1: https://lkml.org/lkml/2013/11/20/540
V2: https://lkml.org/lkml/2014/1/27/331

Changes since v2:

- Took care of most of the review comments from V2.
- Added support for kexec/kdump on EFI systems.
- Dropped support for loading ELF vmlinux.

This patch series is generated on top of 3.15.0-rc8. It also requires a
two-patch cleanup series that is currently in the -tip tree:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot

This patch series does not do kernel signature verification yet. I plan
to post another patch series for that. Since bzImage can already be signed
with a PKCS7 signature, I plan to parse and verify those signatures.

The primary goal of this patchset is to lay the groundwork so that the
kernel image can be signed and its signature verified during kexec load.
This should help with two things:

- It should allow kexec/kdump on secureboot-enabled machines.

- It can help even without secureboot. Being able to verify the kernel
  image signature in kexec should help keep module signing restrictions
  from being bypassed. Matthew Garrett showed how to boot into a custom
  kernel, modify the first kernel's memory, jump back to the old kernel,
  and bypass any policy one wants to.

Any feedback is welcome.

Thanks
Vivek

Vivek Goyal (13):
  bin2c: Move bin2c in scripts/basic
  kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C
  kexec: Move segment verification code in a separate function
  resource: Provide new functions to walk through resources
  kexec: Make kexec_segment user buffer pointer a union
  kexec: New syscall kexec_file_load() declaration
  kexec: Implementation of new syscall kexec_file_load
  purgatory/sha256: Provide implementation of sha256 in purgatory
    context
  purgatory: Core purgatory functionality
  kexec: Load and Relocate purgatory at kernel load time
  kexec-bzImage: Support for loading bzImage using 64bit entry
  kexec: Support for Kexec on panic using new system call
  kexec: Support kexec/kdump on EFI systems

 arch/x86/Kbuild                      |    1 +
 arch/x86/Kconfig                     |    3 +
 arch/x86/Makefile                    |    6 +
 arch/x86/include/asm/crash.h         |    9 +
 arch/x86/include/asm/kexec-bzimage.h |   11 +
 arch/x86/include/asm/kexec.h         |   53 ++
 arch/x86/kernel/Makefile             |    3 +-
 arch/x86/kernel/crash.c              |  581 ++++++++++++++++
 arch/x86/kernel/kexec-bzimage.c      |  314 +++++++++
 arch/x86/kernel/machine_kexec.c      |  232 +++++++
 arch/x86/kernel/machine_kexec_64.c   |  177 +++++
 arch/x86/purgatory/Makefile          |   35 +
 arch/x86/purgatory/entry64.S         |  101 +++
 arch/x86/purgatory/purgatory.c       |   71 ++
 arch/x86/purgatory/setup-x86_32.S    |   17 +
 arch/x86/purgatory/setup-x86_64.S    |   58 ++
 arch/x86/purgatory/sha256.c          |  284 ++++++++
 arch/x86/purgatory/sha256.h          |   22 +
 arch/x86/purgatory/stack.S           |   19 +
 arch/x86/purgatory/string.c          |   13 +
 arch/x86/syscalls/syscall_64.tbl     |    1 +
 drivers/firmware/efi/runtime-map.c   |   21 +
 include/linux/efi.h                  |   19 +
 include/linux/ioport.h               |    6 +
 include/linux/kexec.h                |   97 ++-
 include/linux/syscalls.h             |    3 +
 include/uapi/linux/kexec.h           |    4 +
 init/Kconfig                         |    5 +
 kernel/Makefile                      |    2 +-
 kernel/kexec.c                       | 1239 +++++++++++++++++++++++++++++++---
 kernel/resource.c                    |  108 ++-
 kernel/sys_ni.c                      |    1 +
 scripts/Makefile                     |    1 -
 scripts/basic/Makefile               |    1 +
 scripts/basic/bin2c.c                |   35 +
 scripts/bin2c.c                      |   36 -
 36 files changed, 3452 insertions(+), 137 deletions(-)
 create mode 100644 arch/x86/include/asm/crash.h
 create mode 100644 arch/x86/include/asm/kexec-bzimage.h
 create mode 100644 arch/x86/kernel/kexec-bzimage.c
 create mode 100644 arch/x86/kernel/machine_kexec.c
 create mode 100644 arch/x86/purgatory/Makefile
 create mode 100644 arch/x86/purgatory/entry64.S
 create mode 100644 arch/x86/purgatory/purgatory.c
 create mode 100644 arch/x86/purgatory/setup-x86_32.S
 create mode 100644 arch/x86/purgatory/setup-x86_64.S
 create mode 100644 arch/x86/purgatory/sha256.c
 create mode 100644 arch/x86/purgatory/sha256.h
 create mode 100644 arch/x86/purgatory/stack.S
 create mode 100644 arch/x86/purgatory/string.c
 create mode 100644 scripts/basic/bin2c.c
 delete mode 100644 scripts/bin2c.c

-- 
1.9.0


* [PATCH 01/13] bin2c: Move bin2c in scripts/basic
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Kexec needs bin2c, and needs it very early in the build process (see the
arch/x86/purgatory/ code in later patches).

So move bin2c into scripts/basic/ so that it is built early and is usable
by arch/x86/purgatory/.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kernel/Makefile        |  2 +-
 scripts/Makefile       |  1 -
 scripts/basic/Makefile |  1 +
 scripts/basic/bin2c.c  | 35 +++++++++++++++++++++++++++++++++++
 scripts/bin2c.c        | 36 ------------------------------------
 5 files changed, 37 insertions(+), 38 deletions(-)
 create mode 100644 scripts/basic/bin2c.c
 delete mode 100644 scripts/bin2c.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b62..9b07bb7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -105,7 +105,7 @@ targets += config_data.gz
 $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
 	$(call if_changed,gzip)
 
-      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
+      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
 targets += config_data.h
 $(obj)/config_data.h: $(obj)/config_data.gz FORCE
 	$(call filechk,ikconfiggz)
diff --git a/scripts/Makefile b/scripts/Makefile
index 1d07860..e9d56fb 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -13,7 +13,6 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include
 hostprogs-$(CONFIG_KALLSYMS)     += kallsyms
 hostprogs-$(CONFIG_LOGO)         += pnmtologo
 hostprogs-$(CONFIG_VT)           += conmakehash
-hostprogs-$(CONFIG_IKCONFIG)     += bin2c
 hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
 hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
 hostprogs-$(CONFIG_ASN1)	 += asn1_compiler
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index 4fcef87..afbc1cd 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,6 +9,7 @@
 # fixdep: 	 Used to generate dependency information during build process
 
 hostprogs-y	:= fixdep
+hostprogs-$(CONFIG_IKCONFIG)     += bin2c
 always		:= $(hostprogs-y)
 
 # fixdep is needed to compile other host programs
diff --git a/scripts/basic/bin2c.c b/scripts/basic/bin2c.c
new file mode 100644
index 0000000..af187e6
--- /dev/null
+++ b/scripts/basic/bin2c.c
@@ -0,0 +1,35 @@
+/*
+ * Unloved program to convert a binary on stdin to a C include on stdout
+ *
+ * Jan 1999 Matt Mackall <mpm@selenic.com>
+ *
+ * This software may be used and distributed according to the terms
+ * of the GNU General Public License, incorporated herein by reference.
+ */
+
+#include <stdio.h>
+
+int main(int argc, char *argv[])
+{
+	int ch, total = 0;
+
+	if (argc > 1)
+		printf("const char %s[] %s=\n",
+			argv[1], argc > 2 ? argv[2] : "");
+
+	do {
+		printf("\t\"");
+		while ((ch = getchar()) != EOF) {
+			total++;
+			printf("\\x%02x", ch);
+			if (total % 16 == 0)
+				break;
+		}
+		printf("\"\n");
+	} while (ch != EOF);
+
+	if (argc > 1)
+		printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
+
+	return 0;
+}
diff --git a/scripts/bin2c.c b/scripts/bin2c.c
deleted file mode 100644
index 96dd2bc..0000000
--- a/scripts/bin2c.c
+++ /dev/null
@@ -1,36 +0,0 @@
-/*
- * Unloved program to convert a binary on stdin to a C include on stdout
- *
- * Jan 1999 Matt Mackall <mpm@selenic.com>
- *
- * This software may be used and distributed according to the terms
- * of the GNU General Public License, incorporated herein by reference.
- */
-
-#include <stdio.h>
-
-int main(int argc, char *argv[])
-{
-	int ch, total=0;
-
-	if (argc > 1)
-		printf("const char %s[] %s=\n",
-			argv[1], argc > 2 ? argv[2] : "");
-
-	do {
-		printf("\t\"");
-		while ((ch = getchar()) != EOF)
-		{
-			total++;
-			printf("\\x%02x",ch);
-			if (total % 16 == 0)
-				break;
-		}
-		printf("\"\n");
-	} while (ch != EOF);
-
-	if (argc > 1)
-		printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
-
-	return 0;
-}
-- 
1.9.0
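To make bin2c's output format concrete, here is a minimal sketch (not part of the patch) that moves the loop from scripts/basic/bin2c.c above into a function over FILE streams. The names bin2c_stream() and bin2c_to_buffer() are hypothetical, introduced only for illustration. With a name argument the sketch emits a `const char name[]` array plus `name_size`; with no name, which is how the kernel/Makefile ikconfig rule runs bin2c, it emits only the quoted lines, which that rule wraps between MAGIC_START and MAGIC_END.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * The main loop of scripts/basic/bin2c.c, moved into a function.
 * name/attr play the roles of argv[1]/argv[2]; with name == NULL only
 * the quoted lines are written, as when bin2c runs without arguments.
 * Returns the number of input bytes converted.
 */
static int bin2c_stream(FILE *in, FILE *out, const char *name,
			const char *attr)
{
	int ch, total = 0;

	if (name)
		fprintf(out, "const char %s[] %s=\n", name, attr ? attr : "");

	do {
		/* Each quoted line carries at most 16 escaped bytes. */
		fprintf(out, "\t\"");
		while ((ch = getc(in)) != EOF) {
			total++;
			fprintf(out, "\\x%02x", ch);
			if (total % 16 == 0)
				break;
		}
		fprintf(out, "\"\n");
	} while (ch != EOF);

	if (name)
		fprintf(out, "\t;\n\nconst int %s_size = %d;\n", name, total);

	return total;
}

/*
 * Hypothetical convenience wrapper: run bin2c_stream() over an in-memory
 * byte array using two tmpfile()s and copy the generated text into buf
 * (always NUL-terminated). Returns the byte count reported by
 * bin2c_stream(), or -1 on I/O failure.
 */
static int bin2c_to_buffer(const unsigned char *data, size_t len,
			   const char *name, char *buf, size_t bufsz)
{
	FILE *in = tmpfile();
	FILE *out = tmpfile();
	int total = -1;

	assert(buf != NULL && bufsz > 0);
	buf[0] = '\0';

	if (in && out && fwrite(data, 1, len, in) == len) {
		size_t n;

		rewind(in);
		total = bin2c_stream(in, out, name, NULL);
		rewind(out);
		n = fread(buf, 1, bufsz - 1, out);
		buf[n] = '\0';
	}
	if (in)
		fclose(in);
	if (out)
		fclose(out);
	return total;
}
```

For example, the two input bytes "AB" with the name "blob" produce `const char blob[] =`, a single quoted line `"\x41\x42"`, and `const int blob_size = 2;`. An input whose length is an exact multiple of 16 ends with an extra empty `""` line; this is harmless because adjacent C string literals are concatenated.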


* [PATCH 02/13] kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Currently bin2c is built only if CONFIG_IKCONFIG=y, but bin2c will now be
used by kexec too. So make its compilation depend on a new option,
CONFIG_BUILD_BIN2C, which can be selected by both CONFIG_KEXEC and
CONFIG_IKCONFIG.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/Kconfig       | 1 +
 init/Kconfig           | 5 +++++
 scripts/basic/Makefile | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..213308a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1555,6 +1555,7 @@ source kernel/Kconfig.hz
 
 config KEXEC
 	bool "kexec system call"
+	select BUILD_BIN2C
 	---help---
 	  kexec is a system call that implements the ability to shutdown your
 	  current kernel, and to start another kernel.  It is like a reboot
diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..de59f0e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -773,8 +773,13 @@ endchoice
 
 endmenu # "RCU Subsystem"
 
+config BUILD_BIN2C
+	bool
+	default n
+
 config IKCONFIG
 	tristate "Kernel .config support"
+	select BUILD_BIN2C
 	---help---
 	  This option enables the complete Linux kernel ".config" file
 	  contents to be saved in the kernel. It provides documentation
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index afbc1cd..ec10d93 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,7 +9,7 @@
 # fixdep: 	 Used to generate dependency information during build process
 
 hostprogs-y	:= fixdep
-hostprogs-$(CONFIG_IKCONFIG)     += bin2c
+hostprogs-$(CONFIG_BUILD_BIN2C)     += bin2c
 always		:= $(hostprogs-y)
 
 # fixdep is needed to compile other host programs
-- 
1.9.0


* [PATCH 03/13] kexec: Move segment verification code in a separate function
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Previously, do_kimage_alloc() allocated a kimage structure, copied the
segment list from user space and then sanity-checked the segment list.

Break this function down into three parts: do_kimage_alloc_init() to
allocate and do basic initialization of the kimage structure,
copy_user_segment_list() to copy the segment list from user space, and
sanity_check_segment_list() to verify the sanity of the segment list
passed in by user space. The crash kernel address range check from
kimage_crash_alloc() also moves into sanity_check_segment_list().

In later patches, I need to allocate a kimage without copying a segment
list from user space. Breaking the code down into smaller functions
allows it to be reused elsewhere.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kernel/kexec.c | 182 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 101 insertions(+), 81 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 28c5706..c435c5f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -124,45 +124,27 @@ static struct page *kimage_alloc_page(struct kimage *image,
 				       gfp_t gfp_mask,
 				       unsigned long dest);
 
-static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
-	                    unsigned long nr_segments,
-                            struct kexec_segment __user *segments)
+static int copy_user_segment_list(struct kimage *image,
+				unsigned long nr_segments,
+				struct kexec_segment __user *segments)
 {
+	int ret;
 	size_t segment_bytes;
-	struct kimage *image;
-	unsigned long i;
-	int result;
-
-	/* Allocate a controlling structure */
-	result = -ENOMEM;
-	image = kzalloc(sizeof(*image), GFP_KERNEL);
-	if (!image)
-		goto out;
-
-	image->head = 0;
-	image->entry = &image->head;
-	image->last_entry = &image->head;
-	image->control_page = ~0; /* By default this does not apply */
-	image->start = entry;
-	image->type = KEXEC_TYPE_DEFAULT;
-
-	/* Initialize the list of control pages */
-	INIT_LIST_HEAD(&image->control_pages);
-
-	/* Initialize the list of destination pages */
-	INIT_LIST_HEAD(&image->dest_pages);
-
-	/* Initialize the list of unusable pages */
-	INIT_LIST_HEAD(&image->unuseable_pages);
 
 	/* Read in the segments */
 	image->nr_segments = nr_segments;
 	segment_bytes = nr_segments * sizeof(*segments);
-	result = copy_from_user(image->segment, segments, segment_bytes);
-	if (result) {
-		result = -EFAULT;
-		goto out;
-	}
+	ret = copy_from_user(image->segment, segments, segment_bytes);
+	if (ret)
+		ret = -EFAULT;
+
+	return ret;
+}
+
+static int sanity_check_segment_list(struct kimage *image)
+{
+	int result, i;
+	unsigned long nr_segments = image->nr_segments;
 
 	/*
 	 * Verify we have good destination addresses.  The caller is
@@ -184,9 +166,9 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 		mstart = image->segment[i].mem;
 		mend   = mstart + image->segment[i].memsz;
 		if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
-			goto out;
+			return result;
 		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
-			goto out;
+			return result;
 	}
 
 	/* Verify our destination addresses do not overlap.
@@ -207,7 +189,7 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 			pend   = pstart + image->segment[j].memsz;
 			/* Do the segments overlap ? */
 			if ((mend > pstart) && (mstart < pend))
-				goto out;
+				return result;
 		}
 	}
 
@@ -219,18 +201,61 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
 	result = -EINVAL;
 	for (i = 0; i < nr_segments; i++) {
 		if (image->segment[i].bufsz > image->segment[i].memsz)
-			goto out;
+			return result;
 	}
 
-	result = 0;
-out:
-	if (result == 0)
-		*rimage = image;
-	else
-		kfree(image);
+	/*
+	 * Verify we have good destination addresses.  Normally
+	 * the caller is responsible for making certain we don't
+	 * attempt to load the new image into invalid or reserved
+	 * areas of RAM.  But crash kernels are preloaded into a
+	 * reserved area of ram.  We must ensure the addresses
+	 * are in the reserved area otherwise preloading the
+	 * kernel could corrupt things.
+	 */
 
-	return result;
+	if (image->type == KEXEC_TYPE_CRASH) {
+		result = -EADDRNOTAVAIL;
+		for (i = 0; i < nr_segments; i++) {
+			unsigned long mstart, mend;
 
+			mstart = image->segment[i].mem;
+			mend = mstart + image->segment[i].memsz - 1;
+			/* Ensure we are within the crash kernel limits */
+			if ((mstart < crashk_res.start) ||
+			    (mend > crashk_res.end))
+				return result;
+		}
+	}
+
+	return 0;
+}
+
+static struct kimage *do_kimage_alloc_init(void)
+{
+	struct kimage *image;
+
+	/* Allocate a controlling structure */
+	image = kzalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return NULL;
+
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	image->control_page = ~0; /* By default this does not apply */
+	image->type = KEXEC_TYPE_DEFAULT;
+
+	/* Initialize the list of control pages */
+	INIT_LIST_HEAD(&image->control_pages);
+
+	/* Initialize the list of destination pages */
+	INIT_LIST_HEAD(&image->dest_pages);
+
+	/* Initialize the list of unusable pages */
+	INIT_LIST_HEAD(&image->unuseable_pages);
+
+	return image;
 }
 
 static void kimage_free_page_list(struct list_head *list);
@@ -243,10 +268,19 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 	struct kimage *image;
 
 	/* Allocate and initialize a controlling structure */
-	image = NULL;
-	result = do_kimage_alloc(&image, entry, nr_segments, segments);
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->start = entry;
+
+	result = copy_user_segment_list(image, nr_segments, segments);
 	if (result)
-		goto out;
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_image;
 
 	/*
 	 * Find a location for the control code buffer, and add it
@@ -258,22 +292,23 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 					   get_order(KEXEC_CONTROL_PAGE_SIZE));
 	if (!image->control_code_page) {
 		printk(KERN_ERR "Could not allocate control_code_buffer\n");
-		goto out_free;
+		goto out_free_image;
 	}
 
 	image->swap_page = kimage_alloc_control_pages(image, 0);
 	if (!image->swap_page) {
 		printk(KERN_ERR "Could not allocate swap buffer\n");
-		goto out_free;
+		goto out_free_control_pages;
 	}
 
 	*rimage = image;
 	return 0;
 
-out_free:
+
+out_free_control_pages:
 	kimage_free_page_list(&image->control_pages);
+out_free_image:
 	kfree(image);
-out:
 	return result;
 }
 
@@ -283,19 +318,17 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 {
 	int result;
 	struct kimage *image;
-	unsigned long i;
 
-	image = NULL;
 	/* Verify we have a valid entry point */
-	if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
-		result = -EADDRNOTAVAIL;
-		goto out;
-	}
+	if ((entry < crashk_res.start) || (entry > crashk_res.end))
+		return -EADDRNOTAVAIL;
 
 	/* Allocate and initialize a controlling structure */
-	result = do_kimage_alloc(&image, entry, nr_segments, segments);
-	if (result)
-		goto out;
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->start = entry;
 
 	/* Enable the special crash kernel control page
 	 * allocation policy.
@@ -303,25 +336,13 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 	image->control_page = crashk_res.start;
 	image->type = KEXEC_TYPE_CRASH;
 
-	/*
-	 * Verify we have good destination addresses.  Normally
-	 * the caller is responsible for making certain we don't
-	 * attempt to load the new image into invalid or reserved
-	 * areas of RAM.  But crash kernels are preloaded into a
-	 * reserved area of ram.  We must ensure the addresses
-	 * are in the reserved area otherwise preloading the
-	 * kernel could corrupt things.
-	 */
-	result = -EADDRNOTAVAIL;
-	for (i = 0; i < nr_segments; i++) {
-		unsigned long mstart, mend;
+	result = copy_user_segment_list(image, nr_segments, segments);
+	if (result)
+		goto out_free_image;
 
-		mstart = image->segment[i].mem;
-		mend = mstart + image->segment[i].memsz - 1;
-		/* Ensure we are within the crash kernel limits */
-		if ((mstart < crashk_res.start) || (mend > crashk_res.end))
-			goto out_free;
-	}
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_image;
 
 	/*
 	 * Find a location for the control code buffer, and add
@@ -333,15 +354,14 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 					   get_order(KEXEC_CONTROL_PAGE_SIZE));
 	if (!image->control_code_page) {
 		printk(KERN_ERR "Could not allocate control_code_buffer\n");
-		goto out_free;
+		goto out_free_image;
 	}
 
 	*rimage = image;
 	return 0;
 
-out_free:
+out_free_image:
 	kfree(image);
-out:
 	return result;
 }
 
-- 
1.9.0


* [PATCH 04/13] resource: Provide new functions to walk through resources
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal, Yinghai Lu

I have added two more functions to walk through resources.
The current walk_system_ram_range() deals in pfns, but /proc/iomem can
contain partial pages. By dealing in pfns, the callback function loses the
information that the last page of a memory range is a partial page and
not a full page. So I implemented walk_system_ram_res(), which passes u64
values to the callback function, so it now properly reports the start and
end addresses.
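
As a rough illustration of the truncation, the userspace sketch below
converts an inclusive byte range into a pfn range the way a pfn-based
walker has to, rounding the start up and the end down to page boundaries.
It assumes 4 KiB pages (PAGE_SHIFT of 12); bytes_to_pfns() and
bytes_lost() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* assumption: 4 KiB pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/*
 * Convert the inclusive byte range [start, end] into the pfn range
 * [*pfn, *pfn + *nr_pages): round the start up and the end down, so
 * only pages lying entirely inside the range are reported.
 */
static void bytes_to_pfns(uint64_t start, uint64_t end,
			  uint64_t *pfn, uint64_t *nr_pages)
{
	uint64_t first = (start + PAGE_SIZE - 1) >> PAGE_SHIFT;
	uint64_t end_pfn = (end + 1) >> PAGE_SHIFT;   /* exclusive */

	assert(start <= end);
	*pfn = first;
	*nr_pages = end_pfn > first ? end_pfn - first : 0;
}

/* Bytes of [start, end] that cannot be expressed as whole pfns. */
static uint64_t bytes_lost(uint64_t start, uint64_t end)
{
	uint64_t pfn, nr_pages;

	bytes_to_pfns(start, end, &pfn, &nr_pages);
	return (end - start + 1) - (nr_pages << PAGE_SHIFT);
}
```

For a "System RAM" entry of 0x1000-0x27ff, only pfn 1 (0x1000-0x1fff)
survives the conversion, and bytes_lost() reports the trailing 0x800
bytes, which is the information a u64-based callback keeps.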

walk_system_ram_range() uses find_next_system_ram() to find the next
RAM resource. That function only walks the top-level children of
iomem_resource (following sibling pointers) and does not traverse all
the nodes of the resource tree. I also need a function that can walk
through all the resources, for example to figure out where the "GART"
aperture is, or where ACPI memory is.

So I wrote another function, walk_ram_res(), which walks through all
/proc/iomem resources and returns the matches the caller asks for. The
caller can specify the "name" of the resource, as well as start and end.
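
The clamp-and-advance loop shared by walk_system_ram_res() and
walk_ram_res() can be modelled in userspace as below. This is a sketch,
not kernel code: a flat table sorted by start address stands in for the
resource tree walk, and struct demo_res, demo_find_next(), demo_walk() and
sum_bytes() are hypothetical names. One deliberate difference: this
version treats end as inclusive when testing overlap, whereas
find_next_iomem_res() tests p->start < end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flattened view of /proc/iomem, sorted by start address. */
struct demo_res {
	const char *name;
	uint64_t start, end;            /* inclusive, like struct resource */
};

/*
 * Find the lowest entry named @name overlapping [*start, *end] and clamp
 * the range to it. Returns 0 if found, -1 otherwise.
 */
static int demo_find_next(const struct demo_res *tab, size_t n,
			  const char *name, uint64_t *start, uint64_t *end)
{
	assert(tab != NULL || n == 0);  /* mirrors BUG_ON(!res) */

	for (size_t i = 0; i < n; i++) {
		const struct demo_res *p = &tab[i];

		if (name && strcmp(p->name, name))
			continue;
		if (p->start > *end)
			return -1;      /* sorted: nothing later can match */
		if (p->end >= *start) {
			if (*start < p->start)
				*start = p->start;
			if (*end > p->end)
				*end = p->end;
			return 0;
		}
	}
	return -1;
}

/* Same loop shape as walk_system_ram_res(): clamp, call back, advance. */
static int demo_walk(const struct demo_res *tab, size_t n, const char *name,
		     uint64_t start, uint64_t end, void *arg,
		     int (*func)(uint64_t, uint64_t, void *))
{
	uint64_t orig_end = end;
	int ret = -1;                   /* -1 if nothing matched at all */

	while (start < end &&
	       demo_find_next(tab, n, name, &start, &end) >= 0) {
		ret = func(start, end, arg);
		if (ret)
			break;          /* callback asked to stop */
		start = end + 1;
		end = orig_end;
	}
	return ret;
}

/* Example callback: accumulate the number of bytes visited. */
static int sum_bytes(uint64_t start, uint64_t end, void *arg)
{
	*(uint64_t *)arg += end - start + 1;
	return 0;
}
```

Because the range is clamped to each match and then reset to orig_end,
a caller asking for 0x80000-0x10ffff only sees the parts of each "System
RAM" entry that fall inside that window.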

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
---
 include/linux/ioport.h |   6 +++
 kernel/resource.c      | 108 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 5e3a906..a15f7f6 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -237,6 +237,12 @@ extern int iomem_is_exclusive(u64 addr);
 extern int
 walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 		void *arg, int (*func)(unsigned long, unsigned long, void *));
+extern int
+walk_system_ram_res(u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *));
+extern int
+walk_ram_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *));
 
 /* True if any part of r1 overlaps r2 */
 static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
diff --git a/kernel/resource.c b/kernel/resource.c
index 8957d68..b3e3f95 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -59,10 +59,8 @@ static DEFINE_RWLOCK(resource_lock);
 static struct resource *bootmem_resource_free;
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
-static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+static struct resource *next_resource(struct resource *p)
 {
-	struct resource *p = v;
-	(*pos)++;
 	if (p->child)
 		return p->child;
 	while (!p->sibling && p->parent)
@@ -70,6 +68,13 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
 	return p->sibling;
 }
 
+static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct resource *p = v;
+	(*pos)++;
+	return (void *)next_resource(p);
+}
+
 #ifdef CONFIG_PROC_FS
 
 enum { MAX_IORES_LEVEL = 5 };
@@ -322,7 +327,71 @@ int release_resource(struct resource *old)
 
 EXPORT_SYMBOL(release_resource);
 
-#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+/*
+ * Finds the lowest iomem resource that exists within [res->start, res->end).
+ * The caller must specify res->start, res->end, res->flags and "name".
+ * If found, returns 0, res is overwritten, if not found, returns -1.
+ * This walks through whole tree and not just first level children.
+ */
+static int find_next_iomem_res(struct resource *res, char *name)
+{
+	resource_size_t start, end;
+	struct resource *p;
+
+	BUG_ON(!res);
+
+	start = res->start;
+	end = res->end;
+	BUG_ON(start >= end);
+
+	read_lock(&resource_lock);
+	p = &iomem_resource;
+	while ((p = next_resource(p))) {
+		if (p->flags != res->flags)
+			continue;
+		if (name && strcmp(p->name, name))
+			continue;
+		if (p->start > end) {
+			p = NULL;
+			break;
+		}
+		if ((p->end >= start) && (p->start < end))
+			break;
+	}
+
+	read_unlock(&resource_lock);
+	if (!p)
+		return -1;
+	/* copy data */
+	if (res->start < p->start)
+		res->start = p->start;
+	if (res->end > p->end)
+		res->end = p->end;
+	return 0;
+}
+
+int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
+		void *arg, int (*func)(u64, u64, void *))
+{
+	struct resource res;
+	u64 orig_end;
+	int ret = -1;
+
+	res.start = start;
+	res.end = end;
+	res.flags = flags;
+	orig_end = res.end;
+	while ((res.start < res.end) &&
+		(find_next_iomem_res(&res, name) >= 0)) {
+		ret = (*func)(res.start, res.end, arg);
+		if (ret)
+			break;
+		res.start = res.end + 1;
+		res.end = orig_end;
+	}
+	return ret;
+}
+
 /*
  * Finds the lowest memory reosurce exists within [res->start.res->end)
  * the caller must specify res->start, res->end, res->flags and "name".
@@ -367,6 +436,37 @@ static int find_next_system_ram(struct resource *res, char *name)
 /*
  * This function calls callback against all memory range of "System RAM"
  * which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
+ * Now, this function is only for "System RAM". This function deals with
+ * full ranges and not pfn. If resources are not pfn aligned, dealing
+ * with pfn can truncate ranges.
+ */
+int walk_system_ram_res(u64 start, u64 end, void *arg,
+				int (*func)(u64, u64, void *))
+{
+	struct resource res;
+	u64 orig_end;
+	int ret = -1;
+
+	res.start = start;
+	res.end = end;
+	res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	orig_end = res.end;
+	while ((res.start < res.end) &&
+		(find_next_system_ram(&res, "System RAM") >= 0)) {
+		ret = (*func)(res.start, res.end, arg);
+		if (ret)
+			break;
+		res.start = res.end + 1;
+		res.end = orig_end;
+	}
+	return ret;
+}
+
+#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+
+/*
+ * This function calls callback against all memory range of "System RAM"
+ * which are marked as IORESOURCE_MEM and IORESOURCE_BUSY.
  * Now, this function is only for "System RAM".
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
-- 
1.9.0


* [PATCH 05/13] kexec: Make kexec_segment user buffer pointer a union
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

So far kexec_segment->buf has always been a user space pointer, because
user space passed the array of kexec_segment structures and the kernel
copied it.

But with the new system call, the list of kexec segments will be prepared
by the kernel, and kexec_segment->buf will point to kernel memory.

So when I added code that assumes ->buf points to kernel memory, sparse
started emitting warnings.

Make ->buf a union. Where a user space pointer is expected, access it
using ->buf, and where a kernel space pointer is expected, access it
using ->kbuf. That takes care of the sparse warnings.
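
A userspace sketch of the same idea: both pointers share storage, so
writing one member and reading the other yields the same address, and the
__user annotation only matters to sparse (outside a sparse run the kernel
defines it as nothing, which the sketch imitates). struct demo_segment and
demo_copy_from_kbuf() are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outside sparse, __user expands to nothing; do the same here. */
#ifndef __user
#define __user
#endif

/* Userspace model of the patched struct kexec_segment. */
struct demo_segment {
	union {
		void __user *buf;   /* kexec_load(): user memory */
		void *kbuf;         /* kexec_file_load(): kernel memory */
	};
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

/* A kernel-side consumer only ever touches ->kbuf. */
static size_t demo_copy_from_kbuf(const struct demo_segment *seg,
				  void *dst, size_t len)
{
	size_t n = len < seg->bufsz ? len : seg->bufsz;

	assert(seg != NULL && dst != NULL);
	memcpy(dst, seg->kbuf, n);
	return n;
}
```

The union costs no space and changes no layout, so the user-visible
struct passed to kexec_load() is unaffected; only the name used at each
access site tells sparse which address space is meant.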

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 include/linux/kexec.h | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a756419..d0285cc 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -69,7 +69,18 @@ typedef unsigned long kimage_entry_t;
 #define IND_SOURCE       0x8
 
 struct kexec_segment {
-	void __user *buf;
+	/*
+	 * This pointer can point to user memory if kexec_load() system
+	 * call is used or will point to kernel memory if
+	 * kexec_file_load() system call is used.
+	 *
+	 * Use ->buf when expecting to deal with user memory and use ->kbuf
+	 * when expecting to deal with kernel memory.
+	 */
+	union {
+		void __user *buf;
+		void *kbuf;
+	};
 	size_t bufsz;
 	unsigned long mem;
 	size_t memsz;
-- 
1.9.0


* [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

This is the declaration/interface of the new syscall kexec_file_load().
So far I have reserved a syscall number only for x86_64. Other
architectures (including i386) can reserve a syscall number when they
enable support for this new syscall.
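
The kernel/sys_ni.c hunk in this patch adds
cond_syscall(sys_kexec_file_load), so the new syscall table entry still
links when kexec support is compiled out. The generic cond_syscall() is,
as far as we can tell, an assembler weak symbol aliased to
sys_ni_syscall(), which returns -ENOSYS. A userspace approximation using
the GCC/Clang weak and alias attributes (ELF target assumed; the demo_*
names are hypothetical):

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for sys_ni_syscall(): the common "not implemented" target. */
long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/*
 * Roughly what cond_syscall() arranges: a weak symbol that resolves to
 * the fallback unless another object file supplies a strong definition
 * (as kernel/kexec.c does when kexec is built in).
 */
long demo_sys_kexec_file_load(void)
	__attribute__((weak, alias("demo_ni_syscall")));
```

Linking in another object file with a strong demo_sys_kexec_file_load()
overrides the weak alias, which mirrors what happens when kernel/kexec.c
is part of the build.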

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h         | 3 +++
 kernel/kexec.c                   | 7 +++++++
 kernel/sys_ni.c                  | 1 +
 4 files changed, 12 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..94482db 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	kexec_file_load		sys_kexec_file_load
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a4a0588..9db7555 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -317,6 +317,9 @@ asmlinkage long sys_restart_syscall(void);
 asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
 				struct kexec_segment __user *segments,
 				unsigned long flags);
+asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
+				const char __user *cmdline_ptr,
+				unsigned long cmdline_len, unsigned long flags);
 
 asmlinkage long sys_exit(int error_code);
 asmlinkage long sys_exit_group(int error_code);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c435c5f..a3044e6 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
 }
 #endif
 
+SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
+		const char __user *, cmdline_ptr, unsigned long,
+		cmdline_len, unsigned long, flags)
+{
+	return -ENOSYS;
+}
+
 void crash_kexec(struct pt_regs *regs)
 {
 	/* Take the kexec_mutex here to prevent sys_kexec_load
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bc8d1b7..4534626 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -25,6 +25,7 @@ cond_syscall(sys_swapon);
 cond_syscall(sys_swapoff);
 cond_syscall(sys_kexec_load);
 cond_syscall(compat_sys_kexec_load);
+cond_syscall(sys_kexec_file_load);
 cond_syscall(sys_init_module);
 cond_syscall(sys_finit_module);
 cond_syscall(sys_delete_module);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 214+ messages in thread

* [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
@ 2014-06-03 13:06   ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

This is the new syscall kexec_file_load() declaration/interface. I have
reserved the syscall number only for x86_64 so far. Other architectures
(including i386) can reserve syscall number when they enable the support
for this new syscall.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h         | 3 +++
 kernel/kexec.c                   | 7 +++++++
 kernel/sys_ni.c                  | 1 +
 4 files changed, 12 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..94482db 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	kexec_file_load		sys_kexec_file_load
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a4a0588..9db7555 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -317,6 +317,9 @@ asmlinkage long sys_restart_syscall(void);
 asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
 				struct kexec_segment __user *segments,
 				unsigned long flags);
+asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
+				const char __user *cmdline_ptr,
+				unsigned long cmdline_len, unsigned long flags);
 
 asmlinkage long sys_exit(int error_code);
 asmlinkage long sys_exit_group(int error_code);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c435c5f..a3044e6 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
 }
 #endif
 
+SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
+		const char __user *, cmdline_ptr, unsigned long,
+		cmdline_len, unsigned long, flags)
+{
+	return -ENOSYS;
+}
+
 void crash_kexec(struct pt_regs *regs)
 {
 	/* Take the kexec_mutex here to prevent sys_kexec_load
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bc8d1b7..4534626 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -25,6 +25,7 @@ cond_syscall(sys_swapon);
 cond_syscall(sys_swapoff);
 cond_syscall(sys_kexec_load);
 cond_syscall(compat_sys_kexec_load);
+cond_syscall(sys_kexec_file_load);
 cond_syscall(sys_init_module);
 cond_syscall(sys_finit_module);
 cond_syscall(sys_delete_module);
-- 
1.9.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

The previous patch provided the interface definition; this patch provides
the implementation of the new syscall.

Previously, the segment list was prepared in user space. Now user space
just passes the kernel fd, the initrd fd and the command line, and the
kernel creates the segment list internally.
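A minimal user-space sketch of the new calling convention, for illustration only: the caller opens the kernel and initrd itself and passes the two descriptors plus a command line whose length includes the trailing NUL (the kernel rejects a command line whose last byte is not NUL). The helper name load_kernel_file() is hypothetical, the flag values are copied from this patch's uapi header, and the fallback syscall number 317 is the x86_64 entry added to syscall_64.tbl by this series, assumed here for C libraries that do not yet define SYS_kexec_file_load. The call still requires CAP_SYS_BOOT.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values as defined in include/uapi/linux/kexec.h by this patch. */
#define KEXEC_FILE_UNLOAD	0x00000001UL
#define KEXEC_FILE_ON_CRASH	0x00000002UL

/* Assumption: use the x86_64 number from syscall_64.tbl in this series
 * when the C library headers do not know about the new syscall. */
#if !defined(SYS_kexec_file_load) && defined(__x86_64__)
#define SYS_kexec_file_load 317
#endif

/*
 * Hypothetical helper: open kernel and initrd, then let the kernel build
 * the segment list. Returns 0 on success, -1 with errno set on failure.
 */
int load_kernel_file(const char *kernel_path, const char *initrd_path,
		     const char *cmdline, unsigned long flags)
{
	int kernel_fd, initrd_fd, ret, saved_errno;

	kernel_fd = open(kernel_path, O_RDONLY | O_CLOEXEC);
	if (kernel_fd < 0)
		return -1;

	initrd_fd = open(initrd_path, O_RDONLY | O_CLOEXEC);
	if (initrd_fd < 0) {
		saved_errno = errno;
		close(kernel_fd);
		errno = saved_errno;
		return -1;
	}

#ifdef SYS_kexec_file_load
	/* The length must include the trailing '\0' the kernel checks for. */
	ret = (int)syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
			   cmdline, strlen(cmdline) + 1, flags);
#else
	(void)cmdline;
	(void)flags;
	errno = ENOSYS;
	ret = -1;
#endif
	/* Close both fds but report the syscall's errno, not close()'s. */
	saved_errno = errno;
	close(initrd_fd);
	close(kernel_fd);
	errno = saved_errno;
	return ret;
}
```

With hypothetical paths, load_kernel_file("/boot/vmlinuz", "/boot/initrd.img", "console=ttyS0", KEXEC_FILE_ON_CRASH) would ask the kernel to load a crash kernel. In this version of the series the initrd fd is read unconditionally, so the helper always opens one.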

This patch contains the generic part of the code. The actual segment
preparation and loading is done by an arch- and image-specific loader,
which comes in the next patch.
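To illustrate the split between generic and image-specific code, the following user-space model mimics how arch_kexec_kernel_image_probe() walks a table of handlers and records the index of the first one whose probe accepts the buffer. The "TOY!" magic and toy_probe() loader are hypothetical stand-ins for the real image loader added in the next patch; the entry with a NULL probe mirrors the placeholder in the x86 table here, which is why every probe currently fails with -ENOEXEC.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Same shape as kexec_probe_t in this patch: 0 means "I can load this". */
typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);

/* Cut-down, user-space stand-in for struct kexec_file_type. */
struct file_type {
	const char *name;
	kexec_probe_t *probe;
};

/* Hypothetical loader that recognises buffers starting with "TOY!". */
static int toy_probe(const char *buf, unsigned long len)
{
	if (len < 4 || memcmp(buf, "TOY!", 4) != 0)
		return -ENOEXEC;
	return 0;
}

static const struct file_type file_types[] = {
	{ "", NULL },		/* placeholder entry, skipped by the probe loop */
	{ "toy", toy_probe },
};

/*
 * Try each handler in order; the first probe returning 0 wins and its
 * index is stored, like image->file_handler_idx, for the later load step.
 */
int probe_image(const char *buf, unsigned long len, int *handler_idx)
{
	size_t i;
	int ret = -ENOEXEC;

	*handler_idx = -1;
	for (i = 0; i < sizeof(file_types) / sizeof(file_types[0]); i++) {
		if (!file_types[i].probe)
			continue;
		ret = file_types[i].probe(buf, len);
		if (ret == 0) {
			*handler_idx = (int)i;
			return 0;
		}
	}
	return ret;
}
```

A buffer that no handler recognises returns -ENOEXEC and leaves the index at -1, which is the case the idx < 0 checks in arch_kexec_kernel_image_load() and arch_kimage_file_post_load_cleanup() guard against.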

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c |  54 +++++
 include/linux/kexec.h              |  53 ++++
 include/uapi/linux/kexec.h         |   4 +
 kernel/kexec.c                     | 483 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 589 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 679cef0..d9c5cf0 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,6 +22,13 @@
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
 
+/* arch dependent functionality related to kexec file based syscall */
+static struct kexec_file_type kexec_file_type[] = {
+	{"", NULL, NULL, NULL},
+};
+
+static int nr_file_types = ARRAY_SIZE(kexec_file_type);
+
 static void free_transition_pgtable(struct kimage *image)
 {
 	free_page((unsigned long)image->arch.pud);
@@ -283,3 +290,50 @@ void arch_crash_save_vmcoreinfo(void)
 			      (unsigned long)&_text - __START_KERNEL);
 }
 
+/* arch dependent functionality related to kexec file based syscall */
+
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+					unsigned long buf_len)
+{
+	int i, ret = -ENOEXEC;
+
+	for (i = 0; i < nr_file_types; i++) {
+		if (!kexec_file_type[i].probe)
+			continue;
+
+		ret = kexec_file_type[i].probe(buf, buf_len);
+		if (!ret) {
+			image->file_handler_idx = i;
+			return ret;
+		}
+	}
+
+	return ret;
+}
+
+void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+			unsigned long kernel_len, char *initrd,
+			unsigned long initrd_len, char *cmdline,
+			unsigned long cmdline_len)
+{
+	int idx = image->file_handler_idx;
+
+	if (idx < 0)
+		return ERR_PTR(-ENOEXEC);
+
+	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
+					initrd_len, cmdline, cmdline_len);
+}
+
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	int idx = image->file_handler_idx;
+
+	/* This can be called even before the image handler has been set */
+	if (idx < 0)
+		return 0;
+
+	if (kexec_file_type[idx].cleanup)
+		return kexec_file_type[idx].cleanup(image);
+	return 0;
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0285cc..3790519 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -121,13 +121,58 @@ struct kimage {
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
 	unsigned int preserve_context : 1;
+	/* If set, we are using file mode kexec syscall */
+	unsigned int file_mode:1;
 
 #ifdef ARCH_HAS_KIMAGE_ARCH
 	struct kimage_arch arch;
 #endif
+
+	/* Additional Fields for file based kexec syscall */
+	void *kernel_buf;
+	unsigned long kernel_buf_len;
+
+	void *initrd_buf;
+	unsigned long initrd_buf_len;
+
+	char *cmdline_buf;
+	unsigned long cmdline_buf_len;
+
+	/* index of file handler in array */
+	int file_handler_idx;
+
+	/* Image loader handling the kernel can store a pointer here */
+	void *image_loader_data;
 };
 
+/*
+ * Keeps track of buffer parameters as provided by the caller for requesting
+ * memory placement of a buffer.
+ */
+struct kexec_buf {
+	struct kimage *image;
+	char *buffer;
+	unsigned long bufsz;
+	unsigned long memsz;
+	unsigned long buf_align;
+	unsigned long buf_min;
+	unsigned long buf_max;
+	bool top_down;		/* allocate from top of memory hole */
+};
 
+typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
+typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
+				unsigned long kernel_len, char *initrd,
+				unsigned long initrd_len, char *cmdline,
+				unsigned long cmdline_len);
+typedef int (kexec_cleanup_t)(struct kimage *image);
+
+struct kexec_file_type {
+	const char *name;
+	kexec_probe_t *probe;
+	kexec_load_t *load;
+	kexec_cleanup_t *cleanup;
+};
 
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
@@ -138,6 +183,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
 					struct kexec_segment __user *segments,
 					unsigned long flags);
 extern int kernel_kexec(void);
+extern int kexec_add_buffer(struct kimage *image, char *buffer,
+			unsigned long bufsz, unsigned long memsz,
+			unsigned long buf_align, unsigned long buf_min,
+			unsigned long buf_max, bool top_down,
+			unsigned long *load_addr);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
 extern void crash_kexec(struct pt_regs *);
@@ -188,6 +238,9 @@ extern int kexec_load_disabled;
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
 #endif
 
+/* List of defined/legal kexec file flags */
+#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
+
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
 #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index d6629d4..5fddb1b 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -13,6 +13,10 @@
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_ARCH_MASK		0xffff0000
 
+/* Kexec file load interface flags */
+#define KEXEC_FILE_UNLOAD	0x00000001
+#define KEXEC_FILE_ON_CRASH	0x00000002
+
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
  */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index a3044e6..1ad4d60 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -260,6 +260,221 @@ static struct kimage *do_kimage_alloc_init(void)
 
 static void kimage_free_page_list(struct list_head *list);
 
+static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
+{
+	struct fd f = fdget(fd);
+	int ret = 0;
+	struct kstat stat;
+	loff_t pos;
+	ssize_t bytes = 0;
+
+	if (!f.file)
+		return -EBADF;
+
+	ret = vfs_getattr(&f.file->f_path, &stat);
+	if (ret)
+		goto out;
+
+	if (stat.size > INT_MAX) {
+		ret = -EFBIG;
+		goto out;
+	}
+
+	/* Don't hand 0 to vmalloc, it whines. */
+	if (stat.size == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	*buf = vmalloc(stat.size);
+	if (!*buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	pos = 0;
+	while (pos < stat.size) {
+		bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
+					stat.size - pos);
+		if (bytes < 0) {
+			vfree(*buf);
+			ret = bytes;
+			goto out;
+		}
+
+		if (bytes == 0)
+			break;
+		pos += bytes;
+	}
+
+	*buf_len = pos;
+out:
+	fdput(f);
+	return ret;
+}
+
+/* Architectures can provide this probe function */
+int __attribute__ ((weak))
+arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+				unsigned long buf_len)
+{
+	return -ENOEXEC;
+}
+
+void *__attribute__ ((weak))
+arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len)
+{
+	return ERR_PTR(-ENOEXEC);
+}
+
+int __attribute__ ((weak))
+arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	return 0;
+}
+
+/*
+ * Free up temporary buffers that are not needed after the image has
+ * been loaded.
+ *
+ * Free up the memory used by the kernel, initrd and command line. This is
+ * a temporary allocation which is no longer needed once these buffers have
+ * been loaded into separate segments and copied elsewhere.
+ */
+static void kimage_file_post_load_cleanup(struct kimage *image)
+{
+	vfree(image->kernel_buf);
+	image->kernel_buf = NULL;
+
+	vfree(image->initrd_buf);
+	image->initrd_buf = NULL;
+
+	vfree(image->cmdline_buf);
+	image->cmdline_buf = NULL;
+
+	/* See if the architecture has anything to clean up post load */
+	arch_kimage_file_post_load_cleanup(image);
+}
+
+/*
+ * In file mode, the list of segments is prepared by the kernel. Copy the
+ * relevant data from user space, check it and prepare the segment list.
+ */
+static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int ret = 0;
+	void *ldata;
+
+	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
+					&image->kernel_buf_len);
+	if (ret)
+		return ret;
+
+	/* Call arch image probe handlers */
+	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
+						image->kernel_buf_len);
+
+	if (ret)
+		goto out;
+
+	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
+					&image->initrd_buf_len);
+	if (ret)
+		goto out;
+	ret = -ENOMEM;
+	image->cmdline_buf = vzalloc(cmdline_len);
+	if (!image->cmdline_buf)
+		goto out;
+
+	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	image->cmdline_buf_len = cmdline_len;
+
+	/* command line should be a string with last byte null */
+	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Call arch image load handlers */
+	ldata = arch_kexec_kernel_image_load(image,
+			image->kernel_buf, image->kernel_buf_len,
+			image->initrd_buf, image->initrd_buf_len,
+			image->cmdline_buf, image->cmdline_buf_len);
+
+	if (IS_ERR(ldata)) {
+		ret = PTR_ERR(ldata);
+		goto out;
+	}
+
+	image->image_loader_data = ldata;
+out:
+	/* In case of error, free up all allocated memory in this function */
+	if (ret)
+		kimage_file_post_load_cleanup(image);
+	return ret;
+}
+
+static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		pr_err("Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	image->swap_page = kimage_alloc_control_pages(image, 0);
+	if (!image->swap_page) {
+		pr_err("Could not allocate swap buffer\n");
+		goto out_free_control_pages;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_control_pages:
+	kimage_free_page_list(&image->control_pages);
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
 static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -683,6 +898,16 @@ static void kimage_free(struct kimage *image)
 
 	/* Free the kexec control pages... */
 	kimage_free_page_list(&image->control_pages);
+
+	kfree(image->image_loader_data);
+
+	/*
+	 * Free up any temporary buffers still allocated. This path is
+	 * taken if an error occurs long after the buffers were allocated.
+	 */
+	if (image->file_mode)
+		kimage_file_post_load_cleanup(image);
+
 	kfree(image);
 }
 
@@ -812,10 +1037,14 @@ static int kimage_load_normal_segment(struct kimage *image,
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
-	unsigned char __user *buf;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
 
 	result = 0;
-	buf = segment->buf;
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
 	ubytes = segment->bufsz;
 	mbytes = segment->memsz;
 	maddr = segment->mem;
@@ -847,7 +1076,11 @@ static int kimage_load_normal_segment(struct kimage *image,
 				PAGE_SIZE - (maddr & ~PAGE_MASK));
 		uchunk = min(ubytes, mchunk);
 
-		result = copy_from_user(ptr, buf, uchunk);
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, kbuf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kunmap(page);
 		if (result) {
 			result = -EFAULT;
@@ -855,7 +1088,10 @@ static int kimage_load_normal_segment(struct kimage *image,
 		}
 		ubytes -= uchunk;
 		maddr  += mchunk;
-		buf    += mchunk;
+		if (image->file_mode)
+			kbuf += mchunk;
+		else
+			buf += mchunk;
 		mbytes -= mchunk;
 	}
 out:
@@ -1102,7 +1338,64 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 		const char __user *, cmdline_ptr, unsigned long,
 		cmdline_len, unsigned long, flags)
 {
-	return -ENOSYS;
+	int ret = 0, i;
+	struct kimage **dest_image, *image;
+
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_BOOT))
+		return -EPERM;
+
+	/* Make sure we have a legal set of flags */
+	if (flags != (flags & KEXEC_FILE_FLAGS))
+		return -EINVAL;
+
+	image = NULL;
+
+	if (!mutex_trylock(&kexec_mutex))
+		return -EBUSY;
+
+	dest_image = &kexec_image;
+	if (flags & KEXEC_FILE_ON_CRASH)
+		dest_image = &kexec_crash_image;
+
+	if (flags & KEXEC_FILE_UNLOAD)
+		goto exchange;
+
+	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	if (ret)
+		goto out;
+
+	ret = machine_kexec_prepare(image);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		struct kexec_segment *ksegment;
+
+		ksegment = &image->segment[i];
+		pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
+			 i, ksegment->buf, ksegment->bufsz, ksegment->mem,
+			 ksegment->memsz);
+
+		ret = kimage_load_segment(image, &image->segment[i]);
+		if (ret)
+			goto out;
+	}
+
+	kimage_terminate(image);
+
+	/*
+	 * Free up any temporary buffers allocated which are not needed
+	 * after image has been loaded
+	 */
+	kimage_file_post_load_cleanup(image);
+exchange:
+	image = xchg(dest_image, image);
+out:
+	mutex_unlock(&kexec_mutex);
+	kimage_free(image);
+	return ret;
 }
 
 void crash_kexec(struct pt_regs *regs)
@@ -1659,6 +1952,186 @@ static int __init crash_save_vmcoreinfo_init(void)
 
 subsys_initcall(crash_save_vmcoreinfo_init);
 
+static int __kexec_add_segment(struct kimage *image, char *buf,
+		unsigned long bufsz, unsigned long mem, unsigned long memsz)
+{
+	struct kexec_segment *ksegment;
+
+	ksegment = &image->segment[image->nr_segments];
+	ksegment->kbuf = buf;
+	ksegment->bufsz = bufsz;
+	ksegment->mem = mem;
+	ksegment->memsz = memsz;
+	image->nr_segments++;
+
+	return 0;
+}
+
+static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_end = min(end, kbuf->buf_max);
+	temp_start = temp_end - kbuf->memsz;
+
+	do {
+		/* align down start */
+		temp_start = temp_start & (~(kbuf->buf_align - 1));
+
+		if (temp_start < start || temp_start < kbuf->buf_min)
+			return 0;
+
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		/*
+		 * Make sure this does not conflict with any of existing
+		 * segments
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start - PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while (1);
+
+	/* If we are here, we found a suitable memory range */
+	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_start = max(start, kbuf->buf_min);
+
+	do {
+		temp_start = ALIGN(temp_start, kbuf->buf_align);
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		if (temp_end > end || temp_end > kbuf->buf_max)
+			return 0;
+		/*
+		 * Make sure this does not conflict with any of existing
+		 * segments
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start + PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while (1);
+
+	/* If we are here, we found a suitable memory range */
+	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+{
+	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
+	unsigned long sz = end - start + 1;
+
+	/* Returning 0 moves on to the next memory range */
+	if (sz < kbuf->memsz)
+		return 0;
+
+	if (end < kbuf->buf_min || start > kbuf->buf_max)
+		return 0;
+
+	/*
+	 * Allocate memory top down within the RAM range if requested;
+	 * otherwise allocate bottom up.
+	 */
+	if (kbuf->top_down)
+		return locate_mem_hole_top_down(start, end, kbuf);
+	else
+		return locate_mem_hole_bottom_up(start, end, kbuf);
+}
+
+/*
+ * Helper function for placing a buffer in a kexec segment. This assumes
+ * that kexec_mutex is held.
+ */
+int kexec_add_buffer(struct kimage *image, char *buffer,
+		unsigned long bufsz, unsigned long memsz,
+		unsigned long buf_align, unsigned long buf_min,
+		unsigned long buf_max, bool top_down, unsigned long *load_addr)
+{
+
+	unsigned long nr_segments = image->nr_segments, new_nr_segments;
+	struct kexec_segment *ksegment;
+	struct kexec_buf buf, *kbuf;
+
+	/* Currently adding segment this way is allowed only in file mode */
+	if (!image->file_mode)
+		return -EINVAL;
+
+	if (nr_segments >= KEXEC_SEGMENT_MAX)
+		return -EINVAL;
+
+	/*
+	 * Make sure we are not trying to add buffer after allocating
+	 * control pages. All segments need to be placed first before
+	 * any control pages are allocated, because the control page
+	 * allocation logic goes through the list of segments to make
+	 * sure there are no destination overlaps.
+	 */
+	if (!list_empty(&image->control_pages)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	memset(&buf, 0, sizeof(struct kexec_buf));
+	kbuf = &buf;
+	kbuf->image = image;
+	kbuf->buffer = buffer;
+	kbuf->bufsz = bufsz;
+
+	/* Align memsz to next page boundary */
+	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
+
+	/* Align to at least a page size boundary */
+	kbuf->buf_align = max(buf_align, PAGE_SIZE);
+	kbuf->buf_min = buf_min;
+	kbuf->buf_max = buf_max;
+	kbuf->top_down = top_down;
+
+	/* Walk the RAM ranges and allocate a suitable range for the buffer */
+	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+
+	/*
+	 * If range could be found successfully, it would have incremented
+	 * the nr_segments value.
+	 */
+	new_nr_segments = image->nr_segments;
+
+	/* A suitable memory range could not be found for buffer */
+	if (new_nr_segments == nr_segments)
+		return -EADDRNOTAVAIL;
+
+	/* Found a suitable memory range */
+
+	ksegment = &image->segment[new_nr_segments - 1];
+	*load_addr = ksegment->mem;
+	return 0;
+}
+
+
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
-- 
1.9.0

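copy_file_from_fd() above sizes the buffer from vfs_getattr() and then calls kernel_read() in a loop, because one read may return fewer bytes than requested. A user-space analogue might look like this, assuming fstat(), pread() and malloc() are acceptable stand-ins for vfs_getattr(), kernel_read() and vmalloc(); read_whole_fd() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical user-space analogue of copy_file_from_fd(): read the whole
 * file behind fd into a new heap buffer. Returns 0 or a negative errno.
 */
int read_whole_fd(int fd, void **buf, unsigned long *buf_len)
{
	struct stat st;
	off_t pos = 0;
	char *p;

	if (fstat(fd, &st) < 0)
		return -errno;		/* e.g. -EBADF for a bad descriptor */
	if (st.st_size > INT_MAX)
		return -EFBIG;
	if (st.st_size == 0)		/* the patch rejects empty files too */
		return -EINVAL;

	p = malloc(st.st_size);		/* stands in for vmalloc() */
	if (!p)
		return -ENOMEM;

	/* A single read may return fewer bytes than requested, so loop. */
	while (pos < st.st_size) {
		ssize_t n = pread(fd, p + pos, st.st_size - pos, pos);

		if (n < 0) {
			int err = errno;

			free(p);
			return -err;
		}
		if (n == 0)		/* file shrank: stop at what we have */
			break;
		pos += n;
	}

	*buf = p;
	*buf_len = pos;
	return 0;
}
```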

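The flag handling at the top of the kexec_file_load implementation above can be modelled in user space as follows. Only the flag values and the validity test come from the patch; the enum and struct are hypothetical names for the two outcomes: which image slot (kexec_image or kexec_crash_image) is targeted, and whether the request is an unload.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values as defined in include/uapi/linux/kexec.h by this patch. */
#define KEXEC_FILE_UNLOAD	0x00000001UL
#define KEXEC_FILE_ON_CRASH	0x00000002UL
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/* Hypothetical names for kexec_image and kexec_crash_image. */
enum kexec_slot { SLOT_NORMAL, SLOT_CRASH };

struct kexec_request {
	enum kexec_slot slot;
	bool unload;	/* swap NULL into the slot instead of a new image */
};

/*
 * Decode flags in the same order as the syscall (after the CAP_SYS_BOOT
 * check): reject unknown bits, then pick the slot and the unload path.
 */
int decode_kexec_file_flags(unsigned long flags, struct kexec_request *req)
{
	/* Any bit outside KEXEC_FILE_FLAGS is lost by the mask. */
	if (flags != (flags & KEXEC_FILE_FLAGS))
		return -EINVAL;

	req->slot = (flags & KEXEC_FILE_ON_CRASH) ? SLOT_CRASH : SLOT_NORMAL;
	req->unload = (flags & KEXEC_FILE_UNLOAD) != 0;
	return 0;
}
```

An unload combined with KEXEC_FILE_ON_CRASH therefore jumps to the exchange label with image == NULL, swaps NULL into kexec_crash_image, and the previous crash image is released by kimage_free().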

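The top-down search performed by locate_mem_hole_top_down() in the patch above can be sketched as a stand-alone function for a single System RAM range. This is a simplified model, not the patch's code: overlaps() stands in for kimage_is_destination_range(), a 4 KiB page size is assumed, align is assumed to be a power of two (kexec_add_buffer() also raises it to at least PAGE_SIZE), the first candidate is computed as temp_end - memsz + 1 where the patch uses temp_end - memsz, and guards against unsigned wrap-around have been added.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* An already-placed segment; bounds are inclusive, as in the patch. */
struct range {
	unsigned long start, end;
};

/* Stand-in for kimage_is_destination_range(). */
static bool overlaps(const struct range *segs, size_t nsegs,
		     unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < nsegs; i++)
		if (start <= segs[i].end && end >= segs[i].start)
			return true;
	return false;
}

/*
 * Search the RAM range [start, end] from the top for an align-aligned slot
 * of memsz bytes inside [buf_min, buf_max] that avoids existing segments.
 * Returns true and sets *addr on success.
 */
bool find_hole_top_down(unsigned long start, unsigned long end,
			unsigned long memsz, unsigned long align,
			unsigned long buf_min, unsigned long buf_max,
			const struct range *segs, size_t nsegs,
			unsigned long *addr)
{
	unsigned long temp_end = end < buf_max ? end : buf_max;
	unsigned long temp_start;

	/* Added guard: the clipped range must be able to hold memsz bytes. */
	if (memsz == 0 || temp_end < start || temp_end - start < memsz - 1)
		return false;
	temp_start = temp_end - (memsz - 1);	/* highest possible start */

	for (;;) {
		temp_start &= ~(align - 1);		/* align down */
		if (temp_start < start || temp_start < buf_min)
			return false;
		if (!overlaps(segs, nsegs, temp_start, temp_start + memsz - 1))
			break;				/* free slot found */
		if (temp_start < SKETCH_PAGE_SIZE)	/* added: avoid wrap-around */
			return false;
		temp_start -= SKETCH_PAGE_SIZE;		/* step down a page, retry */
	}

	*addr = temp_start;
	return true;
}
```

For a RAM range 0x100000-0x8fffff with memsz and align both 2 MiB, the sketch returns 0x600000; if [0x600000, 0x7fffff] is already taken by a segment, it steps down a page at a time and settles on 0x400000. The bottom-up variant in the patch is the mirror image: align up from max(start, buf_min) and step up by a page.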
* [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
@ 2014-06-03 13:06   ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

The previous patch provided the interface definition; this patch provides
the implementation of the new syscall.

Previously, the segment list was prepared in user space. Now user space
just passes the kernel fd, the initrd fd and the command line, and the
kernel creates the segment list internally.

This patch contains the generic part of the code. The actual segment
preparation and loading is done by an arch- and image-specific loader,
which comes in the next patch.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c |  54 +++++
 include/linux/kexec.h              |  53 ++++
 include/uapi/linux/kexec.h         |   4 +
 kernel/kexec.c                     | 483 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 589 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 679cef0..d9c5cf0 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,6 +22,13 @@
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
 
+/* arch dependent functionality related to kexec file based syscall */
+static struct kexec_file_type kexec_file_type[] = {
+	{"", NULL, NULL, NULL},
+};
+
+static int nr_file_types = ARRAY_SIZE(kexec_file_type);
+
 static void free_transition_pgtable(struct kimage *image)
 {
 	free_page((unsigned long)image->arch.pud);
@@ -283,3 +290,50 @@ void arch_crash_save_vmcoreinfo(void)
 			      (unsigned long)&_text - __START_KERNEL);
 }
 
+/* arch dependent functionality related to kexec file based syscall */
+
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+					unsigned long buf_len)
+{
+	int i, ret = -ENOEXEC;
+
+	for (i = 0; i < nr_file_types; i++) {
+		if (!kexec_file_type[i].probe)
+			continue;
+
+		ret = kexec_file_type[i].probe(buf, buf_len);
+		if (!ret) {
+			image->file_handler_idx = i;
+			return ret;
+		}
+	}
+
+	return ret;
+}
+
+void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+			unsigned long kernel_len, char *initrd,
+			unsigned long initrd_len, char *cmdline,
+			unsigned long cmdline_len)
+{
+	int idx = image->file_handler_idx;
+
+	if (idx < 0)
+		return ERR_PTR(-ENOEXEC);
+
+	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
+					initrd_len, cmdline, cmdline_len);
+}
+
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	int idx = image->file_handler_idx;
+
+	/* This can be called even before the image handler has been set */
+	if (idx < 0)
+		return 0;
+
+	if (kexec_file_type[idx].cleanup)
+		return kexec_file_type[idx].cleanup(image);
+	return 0;
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0285cc..3790519 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -121,13 +121,58 @@ struct kimage {
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
 	unsigned int preserve_context : 1;
+	/* If set, we are using file mode kexec syscall */
+	unsigned int file_mode:1;
 
 #ifdef ARCH_HAS_KIMAGE_ARCH
 	struct kimage_arch arch;
 #endif
+
+	/* Additional Fields for file based kexec syscall */
+	void *kernel_buf;
+	unsigned long kernel_buf_len;
+
+	void *initrd_buf;
+	unsigned long initrd_buf_len;
+
+	char *cmdline_buf;
+	unsigned long cmdline_buf_len;
+
+	/* index of file handler in array */
+	int file_handler_idx;
+
+	/* Image loader handling the kernel can store a pointer here */
+	void *image_loader_data;
 };
 
+/*
+ * Keeps track of buffer parameters as provided by the caller for requesting
+ * memory placement of a buffer.
+ */
+struct kexec_buf {
+	struct kimage *image;
+	char *buffer;
+	unsigned long bufsz;
+	unsigned long memsz;
+	unsigned long buf_align;
+	unsigned long buf_min;
+	unsigned long buf_max;
+	bool top_down;		/* allocate from top of memory hole */
+};
 
+typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
+typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
+				unsigned long kernel_len, char *initrd,
+				unsigned long initrd_len, char *cmdline,
+				unsigned long cmdline_len);
+typedef int (kexec_cleanup_t)(struct kimage *image);
+
+struct kexec_file_type {
+	const char *name;
+	kexec_probe_t *probe;
+	kexec_load_t *load;
+	kexec_cleanup_t *cleanup;
+};
 
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
@@ -138,6 +183,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
 					struct kexec_segment __user *segments,
 					unsigned long flags);
 extern int kernel_kexec(void);
+extern int kexec_add_buffer(struct kimage *image, char *buffer,
+			unsigned long bufsz, unsigned long memsz,
+			unsigned long buf_align, unsigned long buf_min,
+			unsigned long buf_max, bool top_down,
+			unsigned long *load_addr);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
 extern void crash_kexec(struct pt_regs *);
@@ -188,6 +238,9 @@ extern int kexec_load_disabled;
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
 #endif
 
+/* List of defined/legal kexec file flags */
+#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
+
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
 #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index d6629d4..5fddb1b 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -13,6 +13,10 @@
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_ARCH_MASK		0xffff0000
 
+/* Kexec file load interface flags */
+#define KEXEC_FILE_UNLOAD	0x00000001
+#define KEXEC_FILE_ON_CRASH	0x00000002
+
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
  */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index a3044e6..1ad4d60 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -260,6 +260,221 @@ static struct kimage *do_kimage_alloc_init(void)
 
 static void kimage_free_page_list(struct list_head *list);
 
+static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
+{
+	struct fd f = fdget(fd);
+	int ret = 0;
+	struct kstat stat;
+	loff_t pos;
+	ssize_t bytes = 0;
+
+	if (!f.file)
+		return -EBADF;
+
+	ret = vfs_getattr(&f.file->f_path, &stat);
+	if (ret)
+		goto out;
+
+	if (stat.size > INT_MAX) {
+		ret = -EFBIG;
+		goto out;
+	}
+
+	/* Don't hand 0 to vmalloc, it whines. */
+	if (stat.size == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	*buf = vmalloc(stat.size);
+	if (!*buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	pos = 0;
+	while (pos < stat.size) {
+		bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
+					stat.size - pos);
+		if (bytes < 0) {
+			vfree(*buf);
+			ret = bytes;
+			goto out;
+		}
+
+		if (bytes == 0)
+			break;
+		pos += bytes;
+	}
+
+	*buf_len = pos;
+out:
+	fdput(f);
+	return ret;
+}
+
+/* Architectures can provide this probe function */
+int __attribute__ ((weak))
+arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+				unsigned long buf_len)
+{
+	return -ENOEXEC;
+}
+
+void *__attribute__ ((weak))
+arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len)
+{
+	return ERR_PTR(-ENOEXEC);
+}
+
+int __attribute__ ((weak))
+arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	return 0;
+}
+
+/*
+ * Free up temporary buffers that are not needed after the image has
+ * been loaded.
+ *
+ * Free up the memory used by the kernel, initrd and command line. This is
+ * a temporary allocation which is no longer needed once these buffers have
+ * been loaded into separate segments and copied elsewhere.
+ */
+static void kimage_file_post_load_cleanup(struct kimage *image)
+{
+	vfree(image->kernel_buf);
+	image->kernel_buf = NULL;
+
+	vfree(image->initrd_buf);
+	image->initrd_buf = NULL;
+
+	vfree(image->cmdline_buf);
+	image->cmdline_buf = NULL;
+
+	/* See if the architecture has anything to clean up post load */
+	arch_kimage_file_post_load_cleanup(image);
+}
+
+/*
+ * In file mode, the list of segments is prepared by the kernel. Copy the
+ * relevant data from user space, check it and prepare the segment list.
+ */
+static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int ret = 0;
+	void *ldata;
+
+	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
+					&image->kernel_buf_len);
+	if (ret)
+		return ret;
+
+	/* Call arch image probe handlers */
+	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
+						image->kernel_buf_len);
+
+	if (ret)
+		goto out;
+
+	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
+					&image->initrd_buf_len);
+	if (ret)
+		goto out;
+	ret = -ENOMEM;
+	image->cmdline_buf = vzalloc(cmdline_len);
+	if (!image->cmdline_buf)
+		goto out;
+
+	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	image->cmdline_buf_len = cmdline_len;
+
+	/* command line should be a string with last byte null */
+	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Call arch image load handlers */
+	ldata = arch_kexec_kernel_image_load(image,
+			image->kernel_buf, image->kernel_buf_len,
+			image->initrd_buf, image->initrd_buf_len,
+			image->cmdline_buf, image->cmdline_buf_len);
+
+	if (IS_ERR(ldata)) {
+		ret = PTR_ERR(ldata);
+		goto out;
+	}
+
+	image->image_loader_data = ldata;
+out:
+	/* In case of error, free up all allocated memory in this function */
+	if (ret)
+		kimage_file_post_load_cleanup(image);
+	return ret;
+}
+
+static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		pr_err("Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	image->swap_page = kimage_alloc_control_pages(image, 0);
+	if (!image->swap_page) {
+		pr_err("Could not allocate swap buffer\n");
+		goto out_free_control_pages;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_control_pages:
+	kimage_free_page_list(&image->control_pages);
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
 static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -683,6 +898,16 @@ static void kimage_free(struct kimage *image)
 
 	/* Free the kexec control pages... */
 	kimage_free_page_list(&image->control_pages);
+
+	kfree(image->image_loader_data);
+
+	/*
+	 * Free up any temporary buffers still allocated. This path is
+	 * taken if an error occurs long after the buffers were allocated.
+	 */
+	if (image->file_mode)
+		kimage_file_post_load_cleanup(image);
+
 	kfree(image);
 }
 
@@ -812,10 +1037,14 @@ static int kimage_load_normal_segment(struct kimage *image,
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
-	unsigned char __user *buf;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
 
 	result = 0;
-	buf = segment->buf;
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
 	ubytes = segment->bufsz;
 	mbytes = segment->memsz;
 	maddr = segment->mem;
@@ -847,7 +1076,11 @@ static int kimage_load_normal_segment(struct kimage *image,
 				PAGE_SIZE - (maddr & ~PAGE_MASK));
 		uchunk = min(ubytes, mchunk);
 
-		result = copy_from_user(ptr, buf, uchunk);
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, kbuf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kunmap(page);
 		if (result) {
 			result = -EFAULT;
@@ -855,7 +1088,10 @@ static int kimage_load_normal_segment(struct kimage *image,
 		}
 		ubytes -= uchunk;
 		maddr  += mchunk;
-		buf    += mchunk;
+		if (image->file_mode)
+			kbuf += mchunk;
+		else
+			buf += mchunk;
 		mbytes -= mchunk;
 	}
 out:
@@ -1102,7 +1338,64 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 		const char __user *, cmdline_ptr, unsigned long,
 		cmdline_len, unsigned long, flags)
 {
-	return -ENOSYS;
+	int ret = 0, i;
+	struct kimage **dest_image, *image;
+
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_BOOT))
+		return -EPERM;
+
+	/* Make sure we have a legal set of flags */
+	if (flags != (flags & KEXEC_FILE_FLAGS))
+		return -EINVAL;
+
+	image = NULL;
+
+	if (!mutex_trylock(&kexec_mutex))
+		return -EBUSY;
+
+	dest_image = &kexec_image;
+	if (flags & KEXEC_FILE_ON_CRASH)
+		dest_image = &kexec_crash_image;
+
+	if (flags & KEXEC_FILE_UNLOAD)
+		goto exchange;
+
+	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	if (ret)
+		goto out;
+
+	ret = machine_kexec_prepare(image);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		struct kexec_segment *ksegment;
+
+		ksegment = &image->segment[i];
+		pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
+			 i, ksegment->buf, ksegment->bufsz, ksegment->mem,
+			 ksegment->memsz);
+
+		ret = kimage_load_segment(image, &image->segment[i]);
+		if (ret)
+			goto out;
+	}
+
+	kimage_terminate(image);
+
+	/*
+	 * Free up any temporary buffers that are no longer needed once the
+	 * image has been loaded.
+	 */
+	kimage_file_post_load_cleanup(image);
+exchange:
+	image = xchg(dest_image, image);
+out:
+	mutex_unlock(&kexec_mutex);
+	kimage_free(image);
+	return ret;
 }
 
 void crash_kexec(struct pt_regs *regs)
@@ -1659,6 +1952,186 @@ static int __init crash_save_vmcoreinfo_init(void)
 
 subsys_initcall(crash_save_vmcoreinfo_init);
 
+static int __kexec_add_segment(struct kimage *image, char *buf,
+		unsigned long bufsz, unsigned long mem, unsigned long memsz)
+{
+	struct kexec_segment *ksegment;
+
+	ksegment = &image->segment[image->nr_segments];
+	ksegment->kbuf = buf;
+	ksegment->bufsz = bufsz;
+	ksegment->mem = mem;
+	ksegment->memsz = memsz;
+	image->nr_segments++;
+
+	return 0;
+}
+
+static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_end = min(end, kbuf->buf_max);
+	temp_start = temp_end - kbuf->memsz;
+
+	do {
+		/* align down start */
+		temp_start = temp_start & (~(kbuf->buf_align - 1));
+
+		if (temp_start < start || temp_start < kbuf->buf_min)
+			return 0;
+
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		/*
+		 * Make sure this does not conflict with any of the existing
+		 * segments.
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start - PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while (1);
+
+	/* If we are here, we found a suitable memory range */
+	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
+					struct kexec_buf *kbuf)
+{
+	struct kimage *image = kbuf->image;
+	unsigned long temp_start, temp_end;
+
+	temp_start = max(start, kbuf->buf_min);
+
+	do {
+		temp_start = ALIGN(temp_start, kbuf->buf_align);
+		temp_end = temp_start + kbuf->memsz - 1;
+
+		if (temp_end > end || temp_end > kbuf->buf_max)
+			return 0;
+		/*
+		 * Make sure this does not conflict with any of the existing
+		 * segments.
+		 */
+		if (kimage_is_destination_range(image, temp_start, temp_end)) {
+			temp_start = temp_start + PAGE_SIZE;
+			continue;
+		}
+
+		/* We found a suitable memory range */
+		break;
+	} while (1);
+
+	/* If we are here, we found a suitable memory range */
+	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+				kbuf->memsz);
+
+	/* Stop navigating through remaining System RAM ranges */
+	return 1;
+}
+
+static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+{
+	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
+	unsigned long sz = end - start + 1;
+
+	/* Returning 0 moves the walk on to the next memory range */
+	if (sz < kbuf->memsz)
+		return 0;
+
+	if (end < kbuf->buf_min || start > kbuf->buf_max)
+		return 0;
+
+	/*
+	 * Allocate memory top down within the RAM range; otherwise allocate
+	 * bottom up.
+	 */
+	if (kbuf->top_down)
+		return locate_mem_hole_top_down(start, end, kbuf);
+	else
+		return locate_mem_hole_bottom_up(start, end, kbuf);
+}
+
+/*
+ * Helper function for placing a buffer in a kexec segment. This assumes
+ * that kexec_mutex is held.
+ */
+int kexec_add_buffer(struct kimage *image, char *buffer,
+		unsigned long bufsz, unsigned long memsz,
+		unsigned long buf_align, unsigned long buf_min,
+		unsigned long buf_max, bool top_down, unsigned long *load_addr)
+{
+
+	unsigned long nr_segments = image->nr_segments, new_nr_segments;
+	struct kexec_segment *ksegment;
+	struct kexec_buf buf, *kbuf;
+
+	/* Currently, adding a segment this way is allowed only in file mode */
+	if (!image->file_mode)
+		return -EINVAL;
+
+	if (nr_segments >= KEXEC_SEGMENT_MAX)
+		return -EINVAL;
+
+	/*
+	 * Make sure we are not trying to add a buffer after allocating
+	 * control pages. All segments need to be placed before any
+	 * control pages are allocated, because the control page
+	 * allocation logic walks the list of segments to make sure
+	 * there are no destination overlaps.
+	 */
+	if (!list_empty(&image->control_pages)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	memset(&buf, 0, sizeof(struct kexec_buf));
+	kbuf = &buf;
+	kbuf->image = image;
+	kbuf->buffer = buffer;
+	kbuf->bufsz = bufsz;
+
+	/* Align memsz to next page boundary */
+	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
+
+	/* Align to at least a page size boundary */
+	kbuf->buf_align = max(buf_align, PAGE_SIZE);
+	kbuf->buf_min = buf_min;
+	kbuf->buf_max = buf_max;
+	kbuf->top_down = top_down;
+
+	/* Walk the RAM ranges and allocate a suitable range for the buffer */
+	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+
+	/*
+	 * If a suitable range was found, the RAM walk callback will have
+	 * incremented nr_segments.
+	 */
+	new_nr_segments = image->nr_segments;
+
+	/* A suitable memory range could not be found for buffer */
+	if (new_nr_segments == nr_segments)
+		return -EADDRNOTAVAIL;
+
+	/* Found a suitable memory range */
+
+	ksegment = &image->segment[new_nr_segments - 1];
+	*load_addr = ksegment->mem;
+	return 0;
+}
+
+
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
-- 
1.9.0

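kimage_load_normal_segment() above copies a segment one destination page at a time: each step covers at most the rest of the current page (mchunk), of which at most the remaining source bytes (uchunk) come from the buffer; in file mode the source is kernel memory, so memcpy() replaces copy_from_user(). A minimal userspace sketch of that chunking, assuming a flat array stands in for physical memory and a hypothetical 16-byte page size (load_segment() and SKETCH_PAGE_SIZE are illustrative names, not part of the patch). The sketch zeroes the part of each chunk not covered by source bytes, so a memsz larger than bufsz ends up zero-filled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16u	/* hypothetical tiny page size, easy to test */

/*
 * Copy bufsz bytes from src into the memsz-byte destination that starts at
 * offset maddr of mem[], one page-bounded chunk at a time. Bytes of the
 * destination beyond bufsz are zeroed. Returns the number of chunks used.
 */
size_t load_segment(uint8_t *mem, size_t maddr, const uint8_t *src,
		    size_t bufsz, size_t memsz)
{
	size_t chunks = 0;

	while (memsz) {
		size_t page_off = maddr % SKETCH_PAGE_SIZE;
		size_t mchunk = SKETCH_PAGE_SIZE - page_off;	/* rest of this page */
		size_t uchunk;

		if (mchunk > memsz)
			mchunk = memsz;
		uchunk = bufsz < mchunk ? bufsz : mchunk;	/* source bytes left */

		/* File mode: the source is already in (kernel) memory. */
		memcpy(mem + maddr, src, uchunk);
		memset(mem + maddr + uchunk, 0, mchunk - uchunk);

		bufsz -= uchunk;
		maddr += mchunk;
		src += uchunk;
		memsz -= mchunk;
		chunks++;
	}
	return chunks;
}
```

For example, loading 20 source bytes into a 32-byte segment at offset 16 takes two chunks: the first page receives 16 source bytes, the second receives the last 4 followed by 12 zero bytes.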

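kexec_file_load() above rejects any flag bit outside KEXEC_FILE_FLAGS with the `flags != (flags & KEXEC_FILE_FLAGS)` test, picks kexec_crash_image or kexec_image as the destination, and swaps the new image (or NULL, for KEXEC_FILE_UNLOAD) into that slot, freeing whatever was there before. A sketch of that logic in userspace C, assuming hypothetical flag values (the SKETCH_FILE_* names are illustrative, not the uapi constants) and a plain pointer swap standing in for xchg() under kexec_mutex.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for illustration only. */
#define SKETCH_FILE_UNLOAD	0x1UL
#define SKETCH_FILE_ON_CRASH	0x2UL
#define SKETCH_FILE_FLAGS	(SKETCH_FILE_UNLOAD | SKETCH_FILE_ON_CRASH)

/* True iff flags has no bits outside the known mask. */
bool flags_valid(unsigned long flags)
{
	return flags == (flags & SKETCH_FILE_FLAGS);
}

struct image { int id; };

/* Stand-ins for kexec_image and kexec_crash_image. */
static struct image *normal_slot, *crash_slot;

/*
 * Install new_img into the slot selected by flags (NULL when unloading)
 * and return the previous image, which the caller is expected to free.
 * Not thread safe: the patch does the swap with xchg() under kexec_mutex.
 */
struct image *install_image(unsigned long flags, struct image *new_img)
{
	struct image **dest = (flags & SKETCH_FILE_ON_CRASH) ? &crash_slot
							      : &normal_slot;
	struct image *old = *dest;

	if (flags & SKETCH_FILE_UNLOAD)
		new_img = NULL;
	*dest = new_img;
	return old;
}
```

Returning the old image rather than freeing it inside the swap mirrors the patch, where kimage_free() runs after mutex_unlock() on whatever xchg() handed back.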
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

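kexec_add_buffer() above walks the System RAM ranges via walk_system_ram_res(); the callback stops the walk (returns 1) at the first range where it finds an aligned start address whose [start, start + memsz - 1] span does not overlap an already placed segment, searching downward from the top of the range (top_down) or upward from the bottom. A userspace sketch of that search, assuming RAM and existing segments are passed as arrays of inclusive ranges; find_hole(), struct range and NO_HOLE are hypothetical names. Unlike the patch, which checks the unclamped RAM range size first, the sketch clamps each range to [buf_min, buf_max] before the size check; align must be a power of two and memsz non-zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_HOLE UINT64_MAX

struct range { uint64_t start, end; };	/* inclusive bounds */

/* Does [s, e] intersect any already placed segment? */
static bool hits_segment(const struct range *segs, size_t nsegs,
			 uint64_t s, uint64_t e)
{
	for (size_t i = 0; i < nsegs; i++)
		if (s <= segs[i].end && e >= segs[i].start)
			return true;
	return false;
}

/* Return the first suitable start address in walk order, or NO_HOLE. */
uint64_t find_hole(const struct range *ram, size_t nram,
		   const struct range *segs, size_t nsegs,
		   uint64_t memsz, uint64_t align,
		   uint64_t buf_min, uint64_t buf_max, bool top_down)
{
	for (size_t r = 0; r < nram; r++) {
		/* Clamp this RAM range to the caller's [buf_min, buf_max]. */
		uint64_t lo = ram[r].start > buf_min ? ram[r].start : buf_min;
		uint64_t hi = ram[r].end < buf_max ? ram[r].end : buf_max;

		if (lo > hi || hi - lo + 1 < memsz)
			continue;	/* too small: move on to the next range */

		if (top_down) {
			/* Highest aligned start that still fits below hi. */
			uint64_t s = (hi - memsz + 1) & ~(align - 1);

			while (s >= lo) {
				if (!hits_segment(segs, nsegs, s, s + memsz - 1))
					return s;
				if (s < align)
					break;	/* stepping down would wrap */
				s -= align;
			}
		} else {
			/* Lowest aligned start at or above lo. */
			uint64_t s = (lo + align - 1) & ~(align - 1);

			while (s <= hi && hi - s + 1 >= memsz) {
				if (!hits_segment(segs, nsegs, s, s + memsz - 1))
					return s;
				s += align;
			}
		}
	}
	return NO_HOLE;
}
```

With RAM at 0x100000-0x1fffff and a segment already occupying 0x1f0000-0x1fffff, a 64 KiB top-down request lands at 0x1e0000, while the same request bottom-up lands at 0x100000.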

* [PATCH 08/13] purgatory/sha256: Provide implementation of sha256 in purgatory context
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

The next two patches provide code for purgatory. This is code which does
not link against the kernel and runs standalone, between the two kernels.
One of its primary purposes is to verify the digest of the newly loaded
kernel and make sure it matches the digest computed at kernel load time.
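A minimal userspace sketch of that load-time/purgatory-time check, assuming each segment can be treated as a plain (buffer, length) pair. The names struct seg, digest_segments() and verify_segments() are hypothetical, and FNV-1a is swapped in for SHA-256 only to keep the example short; it is not a cryptographic hash and must not be used for real verification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kexec segment: just the bytes to digest. */
struct seg { const uint8_t *buf; size_t len; };

/* FNV-1a 64-bit: a NON-cryptographic stand-in for SHA-256. */
static uint64_t fnv1a_update(uint64_t h, const uint8_t *p, size_t n)
{
	while (n--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Load time: digest all segments, in order, as one continuous stream. */
uint64_t digest_segments(const struct seg *segs, size_t nsegs)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */

	for (size_t i = 0; i < nsegs; i++)
		h = fnv1a_update(h, segs[i].buf, segs[i].len);
	return h;
}

/* Purgatory time: recompute and compare with the digest saved at load. */
bool verify_segments(const struct seg *segs, size_t nsegs, uint64_t expected)
{
	return digest_segments(segs, nsegs) == expected;
}
```

The point of the split is that the expected digest is computed while the first kernel is still trusted, so any later modification of the loaded segments shows up as a mismatch when purgatory recomputes it.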

We use sha256 for calculating the digest of kexec segments. Purgatory can't
use the standard crypto API, as that API is not available in purgatory
context.

Hence, I have copied code from crypto/sha256_generic.c and compiled it
with the purgatory code so that it can be used there. I could not
#include the sha256_generic.c file here, as some of the function signatures
required a little tweaking: the original functions work with the crypto
API, but these ones don't.

So instead of doing #include on sha256_generic.c, I just copied the relevant
portions of code into arch/x86/purgatory/sha256.c. We should not need to
touch this code again. Do let me know if there are better ways to handle it.

This patch does not enable compiling of this code; that happens in the next
patch. I wanted to highlight this change in a separate patch for easy
review.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/purgatory/sha256.c | 284 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/purgatory/sha256.h |  22 ++++
 2 files changed, 306 insertions(+)
 create mode 100644 arch/x86/purgatory/sha256.c
 create mode 100644 arch/x86/purgatory/sha256.h

diff --git a/arch/x86/purgatory/sha256.c b/arch/x86/purgatory/sha256.c
new file mode 100644
index 0000000..1e814ca
--- /dev/null
+++ b/arch/x86/purgatory/sha256.c
@@ -0,0 +1,284 @@
+/*
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke <jlcooke@certainkey.com>.
+ *
+ * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
+ * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
+ * Copyright (c) 2002 James Morris <jmorris@intercode.com.au>
+ * Copyright (c) 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+#include "sha256.h"
+#include "../boot/string.h"
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+	return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+	return (x & y) | (z & (x | y));
+}
+
+#define e0(x)       (ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22))
+#define e1(x)       (ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25))
+#define s0(x)       (ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3))
+#define s1(x)       (ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+	W[I] = __be32_to_cpu(((__be32 *)(input))[I]);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+	u32 a, b, c, d, e, f, g, h, t1, t2;
+	u32 W[64];
+	int i;
+
+	/* load the input */
+	for (i = 0; i < 16; i++)
+		LOAD_OP(i, W, input);
+
+	/* now blend */
+	for (i = 16; i < 64; i++)
+		BLEND_OP(i, W);
+
+	/* load the state into our registers */
+	a = state[0];  b = state[1];  c = state[2];  d = state[3];
+	e = state[4];  f = state[5];  g = state[6];  h = state[7];
+
+	/* now iterate */
+	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
+	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
+
+	/* clear any sensitive info... */
+	a = b = c = d = e = f = g = h = t1 = t2 = 0;
+	memset(W, 0, 64 * sizeof(u32));
+}
+
+int sha256_init(struct sha256_state *sctx)
+{
+	sctx->state[0] = SHA256_H0;
+	sctx->state[1] = SHA256_H1;
+	sctx->state[2] = SHA256_H2;
+	sctx->state[3] = SHA256_H3;
+	sctx->state[4] = SHA256_H4;
+	sctx->state[5] = SHA256_H5;
+	sctx->state[6] = SHA256_H6;
+	sctx->state[7] = SHA256_H7;
+	sctx->count = 0;
+
+	return 0;
+}
+
+int sha256_update(struct sha256_state *sctx, const u8 *data,
+					unsigned int len)
+{
+	unsigned int partial, done;
+	const u8 *src;
+
+	partial = sctx->count & 0x3f;
+	sctx->count += len;
+	done = 0;
+	src = data;
+
+	if ((partial + len) > 63) {
+		if (partial) {
+			done = -partial;
+			memcpy(sctx->buf + partial, data, done + 64);
+			src = sctx->buf;
+		}
+
+		do {
+			sha256_transform(sctx->state, src);
+			done += 64;
+			src = data + done;
+		} while (done + 63 < len);
+
+		partial = 0;
+	}
+	memcpy(sctx->buf + partial, src, len - done);
+
+	return 0;
+}
+
+int sha256_final(struct sha256_state *sctx, u8 *out)
+{
+	__be32 *dst = (__be32 *)out;
+	__be64 bits;
+	unsigned int index, pad_len;
+	int i;
+	static const u8 padding[64] = { 0x80, };
+
+	/* Save number of bits */
+	bits = cpu_to_be64(sctx->count << 3);
+
+	/* Pad out to 56 mod 64. */
+	index = sctx->count & 0x3f;
+	pad_len = (index < 56) ? (56 - index) : ((64+56) - index);
+	sha256_update(sctx, padding, pad_len);
+
+	/* Append length (before padding) */
+	sha256_update(sctx, (const u8 *)&bits, sizeof(bits));
+
+	/* Store state in digest */
+	for (i = 0; i < 8; i++)
+		dst[i] = cpu_to_be32(sctx->state[i]);
+
+	/* Zeroize sensitive information. */
+	memset(sctx, 0, sizeof(*sctx));
+
+	return 0;
+}
diff --git a/arch/x86/purgatory/sha256.h b/arch/x86/purgatory/sha256.h
new file mode 100644
index 0000000..bd15a41
--- /dev/null
+++ b/arch/x86/purgatory/sha256.h
@@ -0,0 +1,22 @@
+/*
+ *  Copyright (C) 2014 Red Hat Inc.
+ *
+ *  Author: Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#ifndef SHA256_H
+#define SHA256_H
+
+
+#include <linux/types.h>
+#include <crypto/sha.h>
+
+extern int sha256_init(struct sha256_state *sctx);
+extern int sha256_update(struct sha256_state *sctx, const u8 *input,
+				unsigned int length);
+extern int sha256_final(struct sha256_state *sctx, u8 *hash);
+
+#endif /* SHA256_H */
-- 
1.9.0

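The 64 unrolled rounds in sha256_transform() above apply the same step each time with a different round constant and message word, renaming a..h from round to round. As a cross-check, here is a sketch of the same compression function written as a loop that moves the values instead of renaming them, plus one-shot padding matching what sha256_final() does (0x80, zeros up to 56 mod 64, then the 64-bit big-endian bit count). It is standalone userspace C; sha256_oneshot(), sha256_hex() and transform() are hypothetical names, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* Round constants: the values the unrolled code adds in rounds 0..63. */
static const uint32_t K[64] = {
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

/* Compress one 64-byte block into st[]; v[0..7] play the roles of a..h. */
static void transform(uint32_t st[8], const uint8_t blk[64])
{
	uint32_t w[64], v[8];
	int i;

	for (i = 0; i < 16; i++)	/* big-endian load, like LOAD_OP() */
		w[i] = (uint32_t)blk[4 * i] << 24 | (uint32_t)blk[4 * i + 1] << 16 |
		       (uint32_t)blk[4 * i + 2] << 8 | blk[4 * i + 3];
	for (i = 16; i < 64; i++)	/* message schedule, like BLEND_OP() */
		w[i] = (ROR32(w[i - 2], 17) ^ ROR32(w[i - 2], 19) ^ (w[i - 2] >> 10)) +
		       w[i - 7] +
		       (ROR32(w[i - 15], 7) ^ ROR32(w[i - 15], 18) ^ (w[i - 15] >> 3)) +
		       w[i - 16];

	memcpy(v, st, sizeof(v));
	for (i = 0; i < 64; i++) {
		uint32_t t1 = v[7] + (ROR32(v[4], 6) ^ ROR32(v[4], 11) ^ ROR32(v[4], 25)) +
			      (v[6] ^ (v[4] & (v[5] ^ v[6]))) + K[i] + w[i];
		uint32_t t2 = (ROR32(v[0], 2) ^ ROR32(v[0], 13) ^ ROR32(v[0], 22)) +
			      ((v[0] & v[1]) | (v[2] & (v[0] | v[1])));

		/* Shift a..g down into b..h, then update e and a. */
		memmove(&v[1], &v[0], 7 * sizeof(v[0]));
		v[4] += t1;
		v[0] = t1 + t2;
	}
	for (i = 0; i < 8; i++)
		st[i] += v[i];
}

/* One-shot SHA-256 of len bytes at data, digest written to out[32]. */
void sha256_oneshot(const uint8_t *data, size_t len, uint8_t out[32])
{
	uint32_t st[8] = { 0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
			   0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 };
	uint64_t bits = (uint64_t)len * 8;
	uint8_t blk[64];
	size_t i, rem;

	for (i = 0; i + 64 <= len; i += 64)
		transform(st, data + i);

	rem = len - i;			/* 0..63 leftover bytes */
	memset(blk, 0, sizeof(blk));
	memcpy(blk, data + i, rem);
	blk[rem] = 0x80;
	if (rem >= 56) {		/* no room left for the length field */
		transform(st, blk);
		memset(blk, 0, sizeof(blk));
	}
	for (i = 0; i < 8; i++)		/* big-endian bit count in bytes 56..63 */
		blk[63 - i] = (uint8_t)(bits >> (8 * i));
	transform(st, blk);

	for (i = 0; i < 8; i++) {	/* big-endian digest output */
		out[4 * i]     = (uint8_t)(st[i] >> 24);
		out[4 * i + 1] = (uint8_t)(st[i] >> 16);
		out[4 * i + 2] = (uint8_t)(st[i] >> 8);
		out[4 * i + 3] = (uint8_t)st[i];
	}
}

/* Convenience wrapper: lowercase hex digest of a NUL-terminated string. */
void sha256_hex(const char *msg, char hex[65])
{
	uint8_t d[32];

	sha256_oneshot((const uint8_t *)msg, strlen(msg), d);
	for (int i = 0; i < 32; i++)
		snprintf(hex + 2 * i, 3, "%02x", d[i]);
}
```

Checking such a port against the standard test vectors (for instance "abc", the empty string, and the 56-byte two-block message, which exercises the extra padding block) is a cheap way to confirm nothing was lost while adapting the crypto/sha256_generic.c code.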


* [PATCH 08/13] purgatory/sha256: Provide implementation of sha256 in purgatory context
@ 2014-06-03 13:06   ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

The next two patches provide code for purgatory. This is code which does
not link against the kernel and runs standalone, between the two kernels.
One of its primary purposes is to verify the digest of the newly loaded
kernel and make sure it matches the digest computed at kernel load time.

We use sha256 for calculating the digest of kexec segments. Purgatory can't
use the standard crypto API, as that API is not available in purgatory
context.

Hence, I have copied code from crypto/sha256_generic.c and compiled it
with the purgatory code so that it can be used there. I could not
#include the sha256_generic.c file here, as some of the function signatures
required a little tweaking: the original functions work with the crypto
API, but these ones don't.

So instead of doing #include on sha256_generic.c, I just copied the relevant
portions of code into arch/x86/purgatory/sha256.c. We should not need to
touch this code again. Do let me know if there are better ways to handle it.

This patch does not enable compiling of this code; that happens in the next
patch. I wanted to highlight this change in a separate patch for easy
review.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/purgatory/sha256.c | 284 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/purgatory/sha256.h |  22 ++++
 2 files changed, 306 insertions(+)
 create mode 100644 arch/x86/purgatory/sha256.c
 create mode 100644 arch/x86/purgatory/sha256.h

diff --git a/arch/x86/purgatory/sha256.c b/arch/x86/purgatory/sha256.c
new file mode 100644
index 0000000..1e814ca
--- /dev/null
+++ b/arch/x86/purgatory/sha256.c
@@ -0,0 +1,284 @@
+/*
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke <jlcooke@certainkey.com>.
+ *
+ * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
+ * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
+ * Copyright (c) 2002 James Morris <jmorris@intercode.com.au>
+ * Copyright (c) 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+#include "sha256.h"
+#include "../boot/string.h"
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+	return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+	return (x & y) | (z & (x | y));
+}
+
+#define e0(x)       (ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22))
+#define e1(x)       (ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25))
+#define s0(x)       (ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3))
+#define s1(x)       (ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+	W[I] = __be32_to_cpu(((__be32 *)(input))[I]);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+	u32 a, b, c, d, e, f, g, h, t1, t2;
+	u32 W[64];
+	int i;
+
+	/* load the input */
+	for (i = 0; i < 16; i++)
+		LOAD_OP(i, W, input);
+
+	/* now blend */
+	for (i = 16; i < 64; i++)
+		BLEND_OP(i, W);
+
+	/* load the state into our registers */
+	a = state[0];  b = state[1];  c = state[2];  d = state[3];
+	e = state[4];  f = state[5];  g = state[6];  h = state[7];
+
+	/* now iterate */
+	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
+	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1+t2;
+	t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
+	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1+t2;
+	t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
+	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1+t2;
+	t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
+	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1+t2;
+	t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
+	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1+t2;
+	t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
+	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1+t2;
+	t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
+	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1+t2;
+	t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
+	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1+t2;
+
+	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
+	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
+
+	/* clear any sensitive info... */
+	a = b = c = d = e = f = g = h = t1 = t2 = 0;
+	memset(W, 0, 64 * sizeof(u32));
+}
+
+int sha256_init(struct sha256_state *sctx)
+{
+	sctx->state[0] = SHA256_H0;
+	sctx->state[1] = SHA256_H1;
+	sctx->state[2] = SHA256_H2;
+	sctx->state[3] = SHA256_H3;
+	sctx->state[4] = SHA256_H4;
+	sctx->state[5] = SHA256_H5;
+	sctx->state[6] = SHA256_H6;
+	sctx->state[7] = SHA256_H7;
+	sctx->count = 0;
+
+	return 0;
+}
+
+int sha256_update(struct sha256_state *sctx, const u8 *data,
+					unsigned int len)
+{
+	unsigned int partial, done;
+	const u8 *src;
+
+	partial = sctx->count & 0x3f;
+	sctx->count += len;
+	done = 0;
+	src = data;
+
+	if ((partial + len) > 63) {
+		if (partial) {
+			done = -partial;
+			memcpy(sctx->buf + partial, data, done + 64);
+			src = sctx->buf;
+		}
+
+		do {
+			sha256_transform(sctx->state, src);
+			done += 64;
+			src = data + done;
+		} while (done + 63 < len);
+
+		partial = 0;
+	}
+	memcpy(sctx->buf + partial, src, len - done);
+
+	return 0;
+}
+
+int sha256_final(struct sha256_state *sctx, u8 *out)
+{
+	__be32 *dst = (__be32 *)out;
+	__be64 bits;
+	unsigned int index, pad_len;
+	int i;
+	static const u8 padding[64] = { 0x80, };
+
+	/* Save number of bits */
+	bits = cpu_to_be64(sctx->count << 3);
+
+	/* Pad out to 56 mod 64. */
+	index = sctx->count & 0x3f;
+	pad_len = (index < 56) ? (56 - index) : ((64+56) - index);
+	sha256_update(sctx, padding, pad_len);
+
+	/* Append length (before padding) */
+	sha256_update(sctx, (const u8 *)&bits, sizeof(bits));
+
+	/* Store state in digest */
+	for (i = 0; i < 8; i++)
+		dst[i] = cpu_to_be32(sctx->state[i]);
+
+	/* Zeroize sensitive information. */
+	memset(sctx, 0, sizeof(*sctx));
+
+	return 0;
+}
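The byte bookkeeping in `sha256_update()` and `sha256_final()` is easy to misread. `done = -partial` relies on unsigned wraparound, so the first `memcpy()` copies `done + 64 == 64 - partial` bytes. The padding length is chosen so that the byte count reaches 56 mod 64 before the 8-byte bit length is appended. Below is a minimal userspace sketch (the `sketch_*` names are hypothetical) that reproduces only this counting, with no hashing, so the arithmetic can be checked in isolation.

```c
#include <assert.h>

/* Tracks only what sha256_update()/sha256_final() count; no hashing. */
struct sha_bookkeeping {
	unsigned long long count; /* total bytes fed, like sctx->count */
	unsigned int blocks;      /* blocks that would go to sha256_transform() */
};

/* Same control flow as sha256_update(); returns how many of the new
 * bytes end up buffered in sctx->buf after this call. */
static unsigned int sketch_update(struct sha_bookkeeping *s, unsigned int len)
{
	unsigned int partial = s->count & 0x3f;
	unsigned int done = 0;

	s->count += len;
	if (partial + len > 63) {
		if (partial)
			done = -partial; /* wraps: first copy is done + 64 == 64 - partial */
		do {
			s->blocks++;
			done += 64;
		} while (done + 63 < len);
		partial = 0;
	}
	/* The tail copied to sctx->buf + partial always fits the 64-byte buffer. */
	assert(partial + (len - done) < 64);
	return len - done;
}

/* Same expression as sha256_final(): pad so that count == 56 (mod 64). */
static unsigned int sketch_pad_len(unsigned long long count)
{
	unsigned int index = count & 0x3f;

	return (index < 56) ? (56 - index) : ((64 + 56) - index);
}
```

For example, feeding 10 bytes and then 100 bytes transforms one block and leaves 46 bytes buffered. `sketch_pad_len(110)` is 10, so the 8-byte length appended by `sha256_final()` lands exactly on the 128-byte boundary.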
diff --git a/arch/x86/purgatory/sha256.h b/arch/x86/purgatory/sha256.h
new file mode 100644
index 0000000..bd15a41
--- /dev/null
+++ b/arch/x86/purgatory/sha256.h
@@ -0,0 +1,22 @@
+/*
+ *  Copyright (C) 2014 Red Hat Inc.
+ *
+ *  Author: Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#ifndef SHA256_H
+#define SHA256_H
+
+
+#include <linux/types.h>
+#include <crypto/sha.h>
+
+extern int sha256_init(struct sha256_state *sctx);
+extern int sha256_update(struct sha256_state *sctx, const u8 *input,
+				unsigned int length);
+extern int sha256_final(struct sha256_state *sctx, u8 *hash);
+
+#endif /* SHA256_H */
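`sha256_transform()` unrolls all 64 rounds and rotates the roles of `a`..`h` by renaming them on each line. As a cross-check, the same compression can be written as a loop that physically shifts the eight working variables every round. This is a self-contained userspace sketch, not the kernel's implementation. The `sketch_*` names and the one-shot interface are hypothetical, and unlike the kernel code it does not wipe intermediate state. It is checked against well-known SHA-256 test vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* SHA-256 round constants; K256[42..63] match the literals used above. */
static const uint32_t K256[64] = {
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

/* One 64-byte block: same round function as sha256_transform(), but the
 * working variables a..h live in v[0..7] and are shifted each round. */
static void sketch_compress(uint32_t st[8], const uint8_t blk[64])
{
	uint32_t W[64], v[8], t1, t2;
	int t;

	for (t = 0; t < 16; t++)	/* big-endian message words */
		W[t] = (uint32_t)blk[4 * t] << 24 | (uint32_t)blk[4 * t + 1] << 16 |
		       (uint32_t)blk[4 * t + 2] << 8 | (uint32_t)blk[4 * t + 3];
	for (t = 16; t < 64; t++) {	/* message schedule */
		uint32_t s0 = ROR32(W[t - 15], 7) ^ ROR32(W[t - 15], 18) ^ (W[t - 15] >> 3);
		uint32_t s1 = ROR32(W[t - 2], 17) ^ ROR32(W[t - 2], 19) ^ (W[t - 2] >> 10);

		W[t] = W[t - 16] + s0 + W[t - 7] + s1;
	}
	memcpy(v, st, sizeof(v));
	for (t = 0; t < 64; t++) {
		uint32_t a = v[0], e = v[4];

		t1 = v[7] + (ROR32(e, 6) ^ ROR32(e, 11) ^ ROR32(e, 25)) +
		     ((e & v[5]) ^ (~e & v[6])) + K256[t] + W[t];
		t2 = (ROR32(a, 2) ^ ROR32(a, 13) ^ ROR32(a, 22)) +
		     ((a & v[1]) ^ (a & v[2]) ^ (v[1] & v[2]));
		memmove(&v[1], &v[0], 7 * sizeof(v[0]));	/* h=g ... b=a */
		v[4] += t1;	/* e = d + t1 (old d is now in v[4]) */
		v[0] = t1 + t2;	/* a = t1 + t2 */
	}
	for (t = 0; t < 8; t++)
		st[t] += v[t];
}

/* One-shot SHA-256 of msg[0..len); writes a 32-byte big-endian digest. */
static void sketch_sha256(const uint8_t *msg, size_t len, uint8_t out[32])
{
	uint32_t st[8] = { 0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
			   0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 };
	uint64_t bits = (uint64_t)len * 8;
	uint8_t blk[64];
	size_t off = 0, rem;
	int i;

	for (; len - off >= 64; off += 64)
		sketch_compress(st, msg + off);
	rem = len - off;
	assert(rem < 64);
	memset(blk, 0, sizeof(blk));
	if (rem)
		memcpy(blk, msg + off, rem);
	blk[rem] = 0x80;
	if (rem >= 56) {	/* no room left for the 8-byte length */
		sketch_compress(st, blk);
		memset(blk, 0, sizeof(blk));
	}
	for (i = 0; i < 8; i++)
		blk[63 - i] = (uint8_t)(bits >> (8 * i));
	sketch_compress(st, blk);
	for (i = 0; i < 8; i++) {
		out[4 * i] = (uint8_t)(st[i] >> 24);
		out[4 * i + 1] = (uint8_t)(st[i] >> 16);
		out[4 * i + 2] = (uint8_t)(st[i] >> 8);
		out[4 * i + 3] = (uint8_t)st[i];
	}
}
```

The 56-byte test vector exercises the branch where the length no longer fits after the 0x80 byte, so a second padding block is needed.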
-- 
1.9.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

* [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Create a standalone relocatable object, purgatory, which runs between two
kernels. The name, the concept and some of the code have been taken from
kexec-tools. The idea is that this code runs after a crash, in a minimal
environment, so keep it separate from the rest of the kernel; in the long
term this code should need practically no maintenance.

This code also has the logic to verify the sha256 hashes of the various
segments which have been loaded into memory. So first we verify that the
kernel we are jumping to is fine and has not been corrupted, and make
progress only if the checksums are verified.

This code also takes care of copying some memory contents to the backup
region.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/Kbuild                   |   1 +
 arch/x86/Makefile                 |   6 +++
 arch/x86/purgatory/Makefile       |  35 +++++++++++++
 arch/x86/purgatory/entry64.S      | 101 ++++++++++++++++++++++++++++++++++++++
 arch/x86/purgatory/purgatory.c    |  71 +++++++++++++++++++++++++++
 arch/x86/purgatory/setup-x86_32.S |  17 +++++++
 arch/x86/purgatory/setup-x86_64.S |  58 ++++++++++++++++++++++
 arch/x86/purgatory/stack.S        |  19 +++++++
 arch/x86/purgatory/string.c       |  13 +++++
 9 files changed, 321 insertions(+)
 create mode 100644 arch/x86/purgatory/Makefile
 create mode 100644 arch/x86/purgatory/entry64.S
 create mode 100644 arch/x86/purgatory/purgatory.c
 create mode 100644 arch/x86/purgatory/setup-x86_32.S
 create mode 100644 arch/x86/purgatory/setup-x86_64.S
 create mode 100644 arch/x86/purgatory/stack.S
 create mode 100644 arch/x86/purgatory/string.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index e5287d8..faaeee7 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -16,3 +16,4 @@ obj-$(CONFIG_IA32_EMULATION) += ia32/
 
 obj-y += platform/
 obj-y += net/
+obj-$(CONFIG_KEXEC) += purgatory/
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 33f71b0..0b25c6c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -186,6 +186,11 @@ archscripts: scripts_basic
 archheaders:
 	$(Q)$(MAKE) $(build)=arch/x86/syscalls all
 
+archprepare:
+ifeq ($(CONFIG_KEXEC),y)
+	$(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
+endif
+
 ###
 # Kernel objects
 
@@ -249,6 +254,7 @@ archclean:
 	$(Q)rm -rf $(objtree)/arch/x86_64
 	$(Q)$(MAKE) $(clean)=$(boot)
 	$(Q)$(MAKE) $(clean)=arch/x86/tools
+	$(Q)$(MAKE) $(clean)=arch/x86/purgatory
 
 PHONY += kvmconfig
 kvmconfig:
diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
new file mode 100644
index 0000000..8dbf8f5
--- /dev/null
+++ b/arch/x86/purgatory/Makefile
@@ -0,0 +1,35 @@
+ifeq ($(CONFIG_X86_64),y)
+	purgatory-y := purgatory.o entry64.o stack.o setup-x86_64.o sha256.o string.o
+else
+	purgatory-y := purgatory.o stack.o sha256.o setup-x86_32.o
+endif
+
+targets += $(purgatory-y)
+PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
+
+LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z nodefaultlib
+targets += purgatory.ro
+
+# Default KBUILD_CFLAGS can have the -pg option set when FTRACE is enabled.
+# That in turn leaves undefined symbols like __fentry__ in purgatory, and it
+# is not clear how to relocate those. So, like kexec-tools, use custom flags.
+
+ifeq ($(CONFIG_X86_64),y)
+KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -mcmodel=large -Os -fno-builtin -ffreestanding -c -MD
+else
+KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -Os -fno-builtin -ffreestanding -c -MD -m32
+endif
+
+$(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
+		$(call if_changed,ld)
+
+targets += kexec-purgatory.c
+
+quiet_cmd_bin2c = BIN2C   $@
+      cmd_bin2c = cat $(obj)/purgatory.ro | $(srctree)/scripts/basic/bin2c kexec_purgatory > $(obj)/kexec-purgatory.c
+
+$(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
+	$(call if_changed,bin2c)
+
+
+obj-$(CONFIG_KEXEC)	+= kexec-purgatory.o
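The `cmd_bin2c` rule embeds the relocatable `purgatory.ro` into the kernel as a C array named `kexec_purgatory`, plus a `kexec_purgatory_size` length. These override the weak definitions added to kernel/kexec.c later in this series. The sketch below only approximates the shape of such a generated file. It is not the code of `scripts/basic/bin2c`, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append printf-style text at buf + *off; -1 if it would not fit. */
static int sketch_append(char *buf, size_t cap, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + *off, cap - *off, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= cap - *off)
		return -1;
	*off += (size_t)n;
	return 0;
}

/*
 * Render `data` as C source defining `<name>[]` (16 escaped bytes per
 * string-literal line) and `<name>_size`. Returns 0 on success.
 */
static int sketch_bin2c(char *buf, size_t cap, const char *name,
			const unsigned char *data, size_t len)
{
	size_t off = 0, i;

	assert(cap > 0);
	if (sketch_append(buf, cap, &off, "const char %s[] =\n", name))
		return -1;
	if (len == 0 && sketch_append(buf, cap, &off, "\t\"\"\n"))
		return -1;
	for (i = 0; i < len; i++) {
		if (i % 16 == 0 && sketch_append(buf, cap, &off, "\t\""))
			return -1;
		if (sketch_append(buf, cap, &off, "\\x%02x", data[i]))
			return -1;
		if ((i % 16 == 15 || i + 1 == len) &&
		    sketch_append(buf, cap, &off, "\"\n"))
			return -1;
	}
	return sketch_append(buf, cap, &off,
			     "\t;\n\n#include <stddef.h>\n\nconst size_t %s_size = %zu;\n",
			     name, len);
}
```

Escaping every byte as `\xNN` keeps the output valid regardless of content, since each escape is terminated by the next backslash or quote.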
diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
new file mode 100644
index 0000000..219b50b
--- /dev/null
+++ b/arch/x86/purgatory/entry64.S
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2003,2004  Eric Biederman (ebiederm@xmission.com)
+ * Copyright (C) 2014  Red Hat Inc.
+
+ * Author(s): Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+	.text
+	.balign 16
+	.code64
+	.globl entry64, entry64_regs
+
+
+entry64:
+	/* Setup a gdt that should be preserved */
+	lgdt gdt(%rip)
+
+	/* load the data segments */
+	movl    $0x18, %eax     /* data segment */
+	movl    %eax, %ds
+	movl    %eax, %es
+	movl    %eax, %ss
+	movl    %eax, %fs
+	movl    %eax, %gs
+
+	/* Setup new stack */
+	leaq    stack_init(%rip), %rsp
+	pushq   $0x10 /* CS */
+	leaq    new_cs_exit(%rip), %rax
+	pushq   %rax
+	lretq
+new_cs_exit:
+
+	/* Load the registers */
+	movq	rax(%rip), %rax
+	movq	rbx(%rip), %rbx
+	movq	rcx(%rip), %rcx
+	movq	rdx(%rip), %rdx
+	movq	rsi(%rip), %rsi
+	movq	rdi(%rip), %rdi
+	movq    rsp(%rip), %rsp
+	movq	rbp(%rip), %rbp
+	movq	r8(%rip), %r8
+	movq	r9(%rip), %r9
+	movq	r10(%rip), %r10
+	movq	r11(%rip), %r11
+	movq	r12(%rip), %r12
+	movq	r13(%rip), %r13
+	movq	r14(%rip), %r14
+	movq	r15(%rip), %r15
+
+	/* Jump to the new code... */
+	jmpq	*rip(%rip)
+
+	.section ".rodata"
+	.balign 4
+entry64_regs:
+rax:	.quad 0x00000000
+rbx:	.quad 0x00000000
+rcx:	.quad 0x00000000
+rdx:	.quad 0x00000000
+rsi:	.quad 0x00000000
+rdi:	.quad 0x00000000
+rsp:	.quad 0x00000000
+rbp:	.quad 0x00000000
+r8:	.quad 0x00000000
+r9:	.quad 0x00000000
+r10:	.quad 0x00000000
+r11:	.quad 0x00000000
+r12:	.quad 0x00000000
+r13:	.quad 0x00000000
+r14:	.quad 0x00000000
+r15:	.quad 0x00000000
+rip:	.quad 0x00000000
+	.size entry64_regs, . - entry64_regs
+
+	/* GDT */
+	.section ".rodata"
+	.balign 16
+gdt:
+	/* 0x00 unusable segment
+	 * 0x08 unused
+	 * so use them as gdt ptr
+	 */
+	.word gdt_end - gdt - 1
+	.quad gdt
+	.word 0, 0, 0
+
+	/* 0x10 4GB flat code segment */
+	.word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+	/* 0x18 4GB flat data segment */
+	.word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+stack:	.quad   0, 0
+stack_init:
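`entry64` loads every general purpose register from the `entry64_regs` table and then jumps through the saved `rip`, so whoever loads purgatory must fill that table first. Below is a hypothetical C mirror of the table layout: 17 quads, in the same order as the labels above. The values the helper sets are assumptions for illustration, relying on the x86-64 boot protocol convention that `%rsi` holds the address of `struct boot_params`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the order of the .quad slots under entry64_regs in entry64.S. */
struct sketch_entry64_regs {
	uint64_t rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
	uint64_t rip;
};

_Static_assert(sizeof(struct sketch_entry64_regs) == 17 * 8, "17 quads, no padding");
_Static_assert(offsetof(struct sketch_entry64_regs, rip) == 16 * 8, "rip is last");

/* Hypothetical helper: zero everything, then set the jump target, the
 * boot_params pointer in rsi and the initial stack pointer. */
static void sketch_fill_regs(struct sketch_entry64_regs *r, uint64_t entry,
			     uint64_t boot_params, uint64_t stack_top)
{
	assert(entry != 0);	/* a zero rip would jump to address 0 */
	*r = (struct sketch_entry64_regs){ 0 };
	r->rip = entry;
	r->rsi = boot_params;
	r->rsp = stack_top;
}
```

Presumably a loader would then copy such a struct into purgatory through `kexec_purgatory_get_set_symbol()`, declared later in this series. That usage is an assumption here; this patch does not show it.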
diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
new file mode 100644
index 0000000..3a808db
--- /dev/null
+++ b/arch/x86/purgatory/purgatory.c
@@ -0,0 +1,71 @@
+/*
+ * purgatory: Runs between two kernels
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author:
+ *       Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include "sha256.h"
+#include "../boot/string.h"
+
+struct sha_region {
+	unsigned long start;
+	unsigned long len;
+};
+
+unsigned long backup_dest = 0;
+unsigned long backup_src = 0;
+unsigned long backup_sz = 0;
+
+u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
+
+struct sha_region sha_regions[16] = {};
+
+/*
+ * On x86, the second kernel requires the first 640K of memory to boot.
+ * Copy the first 640K to a backup region in the reserved memory range so
+ * that the second kernel can use the first 640K.
+ */
+static int copy_backup_region(void)
+{
+	if (backup_dest)
+		memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
+
+	return 0;
+}
+
+int verify_sha256_digest(void)
+{
+	struct sha_region *ptr, *end;
+	u8 digest[SHA256_DIGEST_SIZE];
+	struct sha256_state sctx;
+
+	sha256_init(&sctx);
+	end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
+	for (ptr = sha_regions; ptr < end; ptr++)
+		sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
+
+	sha256_final(&sctx, digest);
+
+	if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)
+		return 1;
+
+	return 0;
+}
+
+void purgatory(void)
+{
+	int ret;
+
+	ret = verify_sha256_digest();
+	if (ret) {
+		/* loop forever */
+		for (;;);
+	}
+	copy_backup_region();
+}
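The control flow of `purgatory()` works as follows. It hashes every `sha_regions[]` entry in order and compares the result with `sha256_digest`. On a mismatch it hangs; otherwise it copies the backup region. The userspace sketch below models that ordering. To stay short and self-contained, it substitutes 64-bit FNV-1a for SHA-256. FNV-1a is not collision resistant and is not what purgatory uses. Where the real code spins forever, the sketch returns an error. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Like struct sha_region, but with a real pointer for userspace. */
struct sketch_region {
	const uint8_t *start;
	size_t len;
};

/* Stand-in digest: 64-bit FNV-1a chained over all regions in order. */
static uint64_t sketch_digest(const struct sketch_region *r, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i, j;

	assert(r != NULL || n == 0);
	for (i = 0; i < n; i++)
		for (j = 0; j < r[i].len; j++) {
			h ^= r[i].start[j];
			h *= 0x100000001b3ULL;	/* FNV prime */
		}
	return h;
}

/* 0 if the regions still match `expected`, 1 otherwise, as in
 * verify_sha256_digest(). */
static int sketch_verify(const struct sketch_region *r, size_t n,
			 uint64_t expected)
{
	return sketch_digest(r, n) == expected ? 0 : 1;
}

/* Same ordering as purgatory(): verify first, then copy the backup region. */
static int sketch_purgatory(const struct sketch_region *r, size_t n,
			    uint64_t expected, uint8_t *backup_dest,
			    const uint8_t *backup_src, size_t backup_sz)
{
	if (sketch_verify(r, n, expected))
		return -1;	/* the real purgatory loops forever instead */
	if (backup_dest)
		memcpy(backup_dest, backup_src, backup_sz);
	return 0;
}
```

Zero-length entries, like the unused slots in the 16-entry `sha_regions[]` array, contribute nothing to the digest. That is why the array can be iterated to its end without a separate count.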
diff --git a/arch/x86/purgatory/setup-x86_32.S b/arch/x86/purgatory/setup-x86_32.S
new file mode 100644
index 0000000..cfcff31
--- /dev/null
+++ b/arch/x86/purgatory/setup-x86_32.S
@@ -0,0 +1,17 @@
+/*
+ * purgatory:  setup code
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+	.text
+	.globl purgatory_start
+	.balign 16
+purgatory_start:
+	.code32
+
+	/* This is just a stub. Write code when 32bit support comes along */
+	call purgatory
diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
new file mode 100644
index 0000000..fe3c91b
--- /dev/null
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -0,0 +1,58 @@
+/*
+ * purgatory:  setup code
+ *
+ * Copyright (C) 2003,2004  Eric Biederman (ebiederm@xmission.com)
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+	.text
+	.globl purgatory_start
+	.balign 16
+purgatory_start:
+	.code64
+
+	/* Load a gdt so I know what the segment registers are */
+	lgdt	gdt(%rip)
+
+	/* load the data segments */
+	movl	$0x18, %eax	/* data segment */
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	%eax, %ss
+	movl	%eax, %fs
+	movl	%eax, %gs
+
+	/* Setup a stack */
+	leaq	lstack_end(%rip), %rsp
+
+	/* Call the C code */
+	call purgatory
+	jmp	entry64
+
+	.section ".rodata"
+	.balign 16
+gdt:	/* 0x00 unusable segment
+	 * 0x08 unused
+	 * so use them as the gdt ptr
+	 */
+	.word	gdt_end - gdt - 1
+	.quad	gdt
+	.word	0, 0, 0
+
+	/* 0x10 4GB flat code segment */
+	.word	0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+	/* 0x18 4GB flat data segment */
+	.word	0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+
+	.bss
+	.balign 4096
+lstack:
+	.skip 4096
+lstack_end:
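The GDT in `entry64.S` and `setup-x86_64.S` reuses the null selector (0x00) and the unused 0x08 slot to hold the 10-byte pointer that `lgdt` reads. The code descriptor sits at selector 0x10 and the data descriptor at 0x18, each written as four little-endian `.word` values. The sketch below (hypothetical names) decodes those words using the standard x86 segment descriptor layout. The code segment's flags nibble is 0xA (G and L set, i.e. a 64-bit code segment). The data segment's is 0xC (G and D/B set). G=1 with a raw limit of 0xFFFFF is what the "4GB flat" comments refer to.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_FLAG_G  0x8	/* limit counted in 4 KiB units */
#define SEG_FLAG_DB 0x4	/* default operand size 32-bit */
#define SEG_FLAG_L  0x2	/* 64-bit code segment */

struct sketch_seg {
	uint32_t base;
	uint32_t limit;	/* in bytes, after applying granularity */
	uint8_t access;	/* P, DPL, S and type bits */
	uint8_t flags;	/* G, D/B, L, AVL (upper nibble of byte 6) */
};

/* Decode a descriptor given as the four .word values used in the GDT. */
static struct sketch_seg sketch_decode(const uint16_t w[4])
{
	uint64_t d;
	uint32_t raw_limit;
	struct sketch_seg s;

	assert(w != 0);
	d = (uint64_t)w[0] | (uint64_t)w[1] << 16 |
	    (uint64_t)w[2] << 32 | (uint64_t)w[3] << 48;
	raw_limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 48) & 0xf) << 16;
	s.base = (uint32_t)((d >> 16) & 0xffffff) | (uint32_t)((d >> 56) & 0xff) << 24;
	s.access = (uint8_t)(d >> 40);
	s.flags = (uint8_t)((d >> 52) & 0xf);
	/* With G=1 the 20-bit limit 0xfffff covers the full 4 GiB. */
	s.limit = (s.flags & SEG_FLAG_G) ? (raw_limit << 12) | 0xfff : raw_limit;
	return s;
}
```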
diff --git a/arch/x86/purgatory/stack.S b/arch/x86/purgatory/stack.S
new file mode 100644
index 0000000..3cefba1
--- /dev/null
+++ b/arch/x86/purgatory/stack.S
@@ -0,0 +1,19 @@
+/*
+ * purgatory:  stack
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+	/* A stack for the loaded kernel.
+	 * Separate and in the data section so it can be prepopulated.
+	 */
+	.data
+	.balign 4096
+	.globl stack, stack_end
+
+stack:
+	.skip 4096
+stack_end:
diff --git a/arch/x86/purgatory/string.c b/arch/x86/purgatory/string.c
new file mode 100644
index 0000000..d886b1f
--- /dev/null
+++ b/arch/x86/purgatory/string.c
@@ -0,0 +1,13 @@
+/*
+ * Simple string functions.
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author:
+ *       Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include "../boot/string.c"
-- 
1.9.0


* [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:06   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:06 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

Load purgatory code in RAM and relocate it based on its load address.
The relocation code has been inspired by the module relocation code and
the purgatory relocation code in kexec-tools.

Also compute the checksums of loaded kexec segments and store them in
purgatory.

Arch independent code provides this functionality so that arch dependent
bootloaders can make use of it.

Helper functions are provided to get/set symbol values in purgatory,
which are used by bootloaders later to set things like the stack and the
entry point of the second kernel.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/Kconfig                   |   2 +
 arch/x86/kernel/machine_kexec_64.c |  82 +++++++
 include/linux/kexec.h              |  31 +++
 kernel/kexec.c                     | 484 +++++++++++++++++++++++++++++++++++++
 4 files changed, 599 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 213308a..0f24b61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1556,6 +1556,8 @@ source kernel/Kconfig.hz
 config KEXEC
 	bool "kexec system call"
 	select BUILD_BIN2C
+	select CRYPTO
+	select CRYPTO_SHA256
 	---help---
 	  kexec is a system call that implements the ability to shutdown your
 	  current kernel, and to start another kernel.  It is like a reboot
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index d9c5cf0..711c1fb 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -337,3 +337,85 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 		return kexec_file_type[idx].cleanup(image);
 	return 0;
 }
+
+/* Apply purgatory relocations */
+int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,
+				unsigned int nr_sections, unsigned int relsec)
+{
+	unsigned int i;
+	Elf64_Rela *rel = (void *)sechdrs[relsec].sh_offset;
+	Elf64_Sym *sym;
+	void *location;
+	Elf64_Shdr *section, *symtab;
+	unsigned long address, sec_base, value;
+
+	/* Section to which relocations apply */
+	section = &sechdrs[sechdrs[relsec].sh_info];
+
+	/* Associated symbol table */
+	symtab = &sechdrs[sechdrs[relsec].sh_link];
+
+	for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+
+		/*
+		 * This is the location (->sh_offset) to update. It is the
+		 * temporary buffer where the section is currently loaded. The
+		 * section will finally be moved to a different address (pointed
+		 * to by ->sh_addr); kexec takes care of that move
+		 * (kexec_load_segment()).
+		 */
+		location = (void *)(section->sh_offset + rel[i].r_offset);
+
+		/* Final address of the location */
+		address = section->sh_addr + rel[i].r_offset;
+
+		sym = (Elf64_Sym *)symtab->sh_offset +
+				ELF64_R_SYM(rel[i].r_info);
+
+		if (sym->st_shndx == SHN_UNDEF || sym->st_shndx == SHN_COMMON)
+			return -ENOEXEC;
+
+		if (sym->st_shndx == SHN_ABS)
+			sec_base = 0;
+		else if (sym->st_shndx >= nr_sections)
+			return -ENOEXEC;
+		else
+			sec_base = sechdrs[sym->st_shndx].sh_addr;
+
+		value = sym->st_value;
+		value += sec_base;
+		value += rel[i].r_addend;
+
+		switch (ELF64_R_TYPE(rel[i].r_info)) {
+		case R_X86_64_NONE:
+			break;
+		case R_X86_64_64:
+			*(u64 *)location = value;
+			break;
+		case R_X86_64_32:
+			*(u32 *)location = value;
+			if (value != *(u32 *)location)
+				goto overflow;
+			break;
+		case R_X86_64_32S:
+			*(s32 *)location = value;
+			if ((s64)value != *(s32 *)location)
+				goto overflow;
+			break;
+		case R_X86_64_PC32:
+			value -= (u64)address;
+			*(u32 *)location = value;
+			break;
+		default:
+			pr_err("kexec: Unknown rela relocation: %llu\n",
+					ELF64_R_TYPE(rel[i].r_info));
+			return -ENOEXEC;
+		}
+	}
+	return 0;
+
+overflow:
+	pr_err("kexec: overflow in relocation type %d value 0x%lx\n",
+		(int)ELF64_R_TYPE(rel[i].r_info), value);
+	return -ENOEXEC;
+}
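`arch_kexec_apply_relocations_add()` computes `value` as S + A: the symbol value, plus the base of its section, plus the addend. For `R_X86_64_PC32` it then subtracts P, the final run-time address of the patched location. The userspace sketch below isolates that per-entry patching step. The type numbers are the x86-64 psABI values (also found in `<elf.h>`); the `SK_`/`sketch_` names are hypothetical. Unlike the patch above, it checks for overflow before storing, and it also range-checks `R_X86_64_PC32`, which the patch does not. It stores in host byte order, so it assumes a little-endian host like x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Relocation type numbers from the x86-64 psABI. */
enum {
	SK_R_X86_64_NONE = 0,
	SK_R_X86_64_64 = 1,
	SK_R_X86_64_PC32 = 2,
	SK_R_X86_64_32 = 10,
	SK_R_X86_64_32S = 11,
};

/*
 * Patch one relocation into `loc` (the temporary copy). `value` is S + A
 * and `address` is P, the final run-time address of `loc`.
 * Returns 0 on success, -1 on overflow or an unknown type.
 */
static int sketch_apply_rela(uint8_t *loc, uint32_t type, uint64_t value,
			     uint64_t address)
{
	uint32_t u32;
	int32_t s32;
	int64_t pcrel;

	assert(loc != NULL);
	switch (type) {
	case SK_R_X86_64_NONE:
		return 0;
	case SK_R_X86_64_64:
		memcpy(loc, &value, sizeof(value));
		return 0;
	case SK_R_X86_64_32:	/* zero-extended 32-bit */
		if (value > UINT32_MAX)
			return -1;
		u32 = (uint32_t)value;
		memcpy(loc, &u32, sizeof(u32));
		return 0;
	case SK_R_X86_64_32S:	/* sign-extended 32-bit */
		if ((int64_t)value < INT32_MIN || (int64_t)value > INT32_MAX)
			return -1;
		s32 = (int32_t)(int64_t)value;
		memcpy(loc, &s32, sizeof(s32));
		return 0;
	case SK_R_X86_64_PC32:	/* S + A - P, must fit in signed 32 bits */
		pcrel = (int64_t)(value - address);
		if (pcrel < INT32_MIN || pcrel > INT32_MAX)
			return -1;
		s32 = (int32_t)pcrel;
		memcpy(loc, &s32, sizeof(s32));
		return 0;
	default:
		return -1;
	}
}
```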
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 3790519..7228873 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -10,6 +10,7 @@
 #include <linux/ioport.h>
 #include <linux/elfcore.h>
 #include <linux/elf.h>
+#include <linux/module.h>
 #include <asm/kexec.h>
 
 /* Verify architecture specific macros are defined */
@@ -95,6 +96,27 @@ struct compat_kexec_segment {
 };
 #endif
 
+struct kexec_sha_region {
+	unsigned long start;
+	unsigned long len;
+};
+
+struct purgatory_info {
+	/* Pointer to elf header of read only purgatory */
+	Elf_Ehdr *ehdr;
+
+	/* Pointer to purgatory sechdrs which are modifiable */
+	Elf_Shdr *sechdrs;
+	/*
+	 * Temporary buffer location where purgatory is loaded and relocated.
+	 * This memory can be freed after the image has been loaded.
+	 */
+	void *purgatory_buf;
+
+	/* Address where purgatory is finally loaded and is executed from */
+	unsigned long purgatory_load_addr;
+};
+
 struct kimage {
 	kimage_entry_t head;
 	kimage_entry_t *entry;
@@ -143,6 +165,9 @@ struct kimage {
 
 	/* Image loader handling the kernel can store a pointer here */
 	void *image_loader_data;
+
+	/* Information for loading purgatory */
+	struct purgatory_info purgatory_info;
 };
 
 /*
@@ -190,6 +215,12 @@ extern int kexec_add_buffer(struct kimage *image, char *buffer,
 			unsigned long *load_addr);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
+extern int kexec_load_purgatory(struct kimage *image, unsigned long min,
+		unsigned long max, int top_down, unsigned long *load_addr);
+extern int kexec_purgatory_get_set_symbol(struct kimage *image,
+		const char *name, void *buf, unsigned int size, bool get_value);
+extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
+					const char *name);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1ad4d60..7f2e393 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -39,6 +39,9 @@
 #include <asm/io.h>
 #include <asm/sections.h>
 
+#include <crypto/hash.h>
+#include <crypto/sha.h>
+
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
@@ -51,6 +54,15 @@ size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
 
+/*
+ * Declare these symbols weak so that if architecture provides a purgatory,
+ * these will be overridden.
+ */
+char __weak kexec_purgatory[0];
+size_t __weak kexec_purgatory_size = 0;
+
+static int kexec_calculate_store_digests(struct kimage *image);
+
 /* Location of the reserved area for the crash kernel */
 struct resource crashk_res = {
 	.name  = "Crash kernel",
@@ -336,6 +348,15 @@ arch_kimage_file_post_load_cleanup(struct kimage *image)
 	return;
 }
 
+/* Apply relocations for rela section */
+int __attribute__ ((weak))
+arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
+					unsigned int relsec)
+{
+	pr_err("kexec: RELA relocation unsupported\n");
+	return -ENOEXEC;
+}
+
 /*
  * Free up tempory buffers allocated which are not needed after image has
  * been loaded.
@@ -355,6 +376,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
 	vfree(image->cmdline_buf);
 	image->cmdline_buf = NULL;
 
+	vfree(image->purgatory_info.purgatory_buf);
+	image->purgatory_info.purgatory_buf = NULL;
+
+	vfree(image->purgatory_info.sechdrs);
+	image->purgatory_info.sechdrs = NULL;
+
 	/* See if architcture has anything to cleanup post load */
 	arch_kimage_file_post_load_cleanup(image);
 }
@@ -1370,6 +1397,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (ret)
 		goto out;
 
+	ret = kexec_calculate_store_digests(image);
+	if (ret)
+		goto out;
+
 	for (i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
@@ -2131,6 +2162,459 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
 	return 0;
 }
 
+/* Calculate and store the digest of segments */
+static int kexec_calculate_store_digests(struct kimage *image)
+{
+	struct crypto_shash *tfm;
+	struct shash_desc *desc;
+	int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;
+	size_t desc_size, nullsz;
+	char *digest = NULL;
+	void *zero_buf;
+	struct kexec_sha_region *sha_regions;
+
+	tfm = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(tfm)) {
+		ret = PTR_ERR(tfm);
+		goto out;
+	}
+
+	desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
+	desc = kzalloc(desc_size, GFP_KERNEL);
+	if (!desc) {
+		ret = -ENOMEM;
+		goto out_free_tfm;
+	}
+
+	zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
+	if (!zero_buf) {
+		ret = -ENOMEM;
+		goto out_free_desc;
+	}
+	ret = -ENOMEM;
+	sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
+	sha_regions = vzalloc(sha_region_sz);
+	if (!sha_regions)
+		goto out_free_zero_buf;
+
+	desc->tfm   = tfm;
+	desc->flags = 0;
+
+	ret = crypto_shash_init(desc);
+	if (ret < 0)
+		goto out_free_sha_regions;
+
+	digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto out_free_sha_regions;
+	}
+
+	/* Traverse through all segments */
+	for (j = i = 0; i < image->nr_segments; i++) {
+		struct kexec_segment *ksegment;
+		ksegment = &image->segment[i];
+
+		/*
+		 * Skip purgatory as it will be modified once we put digest
+		 * info in purgatory
+		 */
+		if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
+			continue;
+
+		ret = crypto_shash_update(desc, ksegment->kbuf,
+						ksegment->bufsz);
+		if (ret)
+			break;
+
+		nullsz = ksegment->memsz - ksegment->bufsz;
+		while (nullsz) {
+			unsigned long bytes = nullsz;
+			if (bytes > zero_buf_sz)
+				bytes = zero_buf_sz;
+			ret = crypto_shash_update(desc, zero_buf, bytes);
+			if (ret)
+				break;
+			nullsz -= bytes;
+		}
+
+		if (ret)
+			break;
+
+		sha_regions[j].start = ksegment->mem;
+		sha_regions[j].len = ksegment->memsz;
+		j++;
+	}
+
+	if (!ret) {
+		ret = crypto_shash_final(desc, digest);
+		if (ret)
+			goto out_free_sha_regions;
+		ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
+				sha_regions, sha_region_sz, 0);
+		if (ret)
+			goto out_free_sha_regions;
+
+		ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
+				digest, SHA256_DIGEST_SIZE, 0);
+		if (ret)
+			goto out_free_sha_regions;
+	}
+
+out_free_sha_regions:
+	vfree(sha_regions);
+out_free_zero_buf:
+	kfree(zero_buf);
+out_free_desc:
+	kfree(desc);
+out_free_tfm:
+	crypto_free_shash(tfm);
+out:
+	kfree(digest);
+	return ret;
+}
+
+/* Actually load and relocate purgatory. A lot of code is taken from kexec-tools. */
+static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
+				unsigned long max, int top_down)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
+	unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
+	unsigned char *buf_addr, *src;
+	int i, ret = 0, entry_sidx = -1;
+	Elf_Shdr *sechdrs = NULL, *sechdrs_c;
+	void *purgatory_buf = NULL;
+
+	/*
+	 * sechdrs_c points to the section headers in purgatory, which are
+	 * read only. No modifications are allowed.
+	 */
+	sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
+
+	/*
+	 * We cannot modify sechdrs_c[] or its fields; they are read only.
+	 * Copy them to a local copy where we can store some temporary
+	 * data and free it at the end. We need to modify ->sh_addr and
+	 * ->sh_offset to keep track of the permanent and temporary
+	 * locations of sections.
+	 */
+	sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+	if (!sechdrs)
+		return -ENOMEM;
+
+	memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+
+	/*
+	 * There are multiple copies of sections. The first copy is the one
+	 * embedded in the kernel in a read-only section. Some of these
+	 * sections will be copied to a temporary buffer and relocated, and
+	 * those will finally be copied to their final destination at
+	 * segment load time.
+	 *
+	 * Use ->sh_offset to reflect section address in memory. It will
+	 * point to original read only copy if section is not allocatable.
+	 * Otherwise it will point to temporary copy which will be relocated.
+	 *
+	 * Use ->sh_addr to contain final address of the section where it
+	 * will go during execution time.
+	 */
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (sechdrs[i].sh_type == SHT_NOBITS)
+			continue;
+
+		sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
+						sechdrs[i].sh_offset;
+	}
+
+	entry = pi->ehdr->e_entry;
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
+			continue;
+
+		/* Make entry section relative */
+		if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
+		    ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
+		     pi->ehdr->e_entry)) {
+			entry_sidx = i;
+			entry -= sechdrs[i].sh_addr;
+			break;
+		}
+	}
+
+	/* Find the RAM size requirements of relocatable object */
+	buf_align = 1;
+	bss_align = 1;
+	buf_sz = 0;
+	bss_sz = 0;
+
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		align = sechdrs[i].sh_addralign;
+		if (sechdrs[i].sh_type != SHT_NOBITS) {
+			if (buf_align < align)
+				buf_align = align;
+			buf_sz = ALIGN(buf_sz, align);
+			buf_sz += sechdrs[i].sh_size;
+		} else {
+			if (bss_align < align)
+				bss_align = align;
+			bss_sz = ALIGN(bss_sz, align);
+			bss_sz += sechdrs[i].sh_size;
+		}
+	}
+
+	if (buf_align < bss_align)
+		buf_align = bss_align;
+	bss_pad = 0;
+	if (buf_sz & (bss_align - 1))
+		bss_pad = bss_align - (buf_sz & (bss_align - 1));
+
+	memsz = buf_sz + bss_pad + bss_sz;
+
+	/* Allocate buffer for purgatory */
+	purgatory_buf = vzalloc(buf_sz);
+	if (!purgatory_buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Add buffer to segment list */
+	ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
+				buf_align, min, max, top_down,
+				&pi->purgatory_load_addr);
+	if (ret)
+		goto out;
+
+	/* Load SHF_ALLOC sections */
+	buf_addr = purgatory_buf;
+	load_addr = pi->purgatory_load_addr;
+	data_addr = load_addr;
+	bss_addr = load_addr + buf_sz + bss_pad;
+
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		align = sechdrs[i].sh_addralign;
+		if (sechdrs[i].sh_type != SHT_NOBITS) {
+			data_addr = ALIGN(data_addr, align);
+			off = data_addr - load_addr;
+			/* We have already modified ->sh_offset to keep addr */
+			src = (char *) sechdrs[i].sh_offset;
+			memcpy(buf_addr + off, src, sechdrs[i].sh_size);
+
+			/* Store load address and source address of section */
+			sechdrs[i].sh_addr = data_addr;
+
+			/*
+			 * This section got copied to temporary buffer. Update
+			 * ->sh_offset accordingly.
+			 */
+			sechdrs[i].sh_offset = (unsigned long)(buf_addr + off);
+
+			/* Advance to the next address */
+			data_addr += sechdrs[i].sh_size;
+		} else {
+			bss_addr = ALIGN(bss_addr, align);
+			sechdrs[i].sh_addr = bss_addr;
+			bss_addr += sechdrs[i].sh_size;
+		}
+	}
+
+	/* update entry based on entry section position */
+	if (entry_sidx >= 0)
+		entry += sechdrs[entry_sidx].sh_addr;
+
+	/* Set the entry point of purgatory */
+	image->start = entry;
+
+	/* Apply relocations */
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		Elf_Shdr *section, *symtab;
+
+		if (sechdrs[i].sh_type != SHT_RELA &&
+		    sechdrs[i].sh_type != SHT_REL)
+			continue;
+
+		if (sechdrs[i].sh_info >= pi->ehdr->e_shnum ||
+		    sechdrs[i].sh_link >= pi->ehdr->e_shnum) {
+			ret = -ENOEXEC;
+			goto out;
+		}
+
+		section = &sechdrs[sechdrs[i].sh_info];
+		symtab = &sechdrs[sechdrs[i].sh_link];
+
+		if (!(section->sh_flags & SHF_ALLOC))
+			continue;
+
+		if (symtab->sh_link >= pi->ehdr->e_shnum)
+			/* Invalid section number? */
+			continue;
+
+		ret = -EOPNOTSUPP;
+		if (sechdrs[i].sh_type == SHT_RELA)
+			ret = arch_kexec_apply_relocations_add(sechdrs,
+							pi->ehdr->e_shnum, i);
+		if (ret)
+			goto out;
+	}
+
+	pi->sechdrs = sechdrs;
+	pi->purgatory_buf = purgatory_buf;
+	return ret;
+out:
+	vfree(sechdrs);
+	vfree(purgatory_buf);
+	return ret;
+}
+
+/* Load relocatable purgatory object and relocate it appropriately */
+int kexec_load_purgatory(struct kimage *image, unsigned long min,
+		unsigned long max, int top_down, unsigned long *load_addr)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	int ret;
+
+	if (kexec_purgatory_size == 0)
+		return -EINVAL;
+
+	if (kexec_purgatory_size < sizeof(Elf_Ehdr))
+		return -ENOEXEC;
+
+	pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
+
+	if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
+	    || pi->ehdr->e_type != ET_REL
+	    || !elf_check_arch(pi->ehdr)
+	    || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
+		return -ENOEXEC;
+
+	if (pi->ehdr->e_shoff >= kexec_purgatory_size
+	    || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
+	    kexec_purgatory_size - pi->ehdr->e_shoff))
+		return -ENOEXEC;
+
+	ret = elf_rel_load_relocate(image, min, max, top_down);
+	if (ret)
+		return ret;
+
+	*load_addr = image->purgatory_info.purgatory_load_addr;
+	return 0;
+}
+
+static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
+						const char *name)
+{
+	Elf_Sym *syms;
+	Elf_Shdr *sechdrs;
+	Elf_Ehdr *ehdr;
+	int i, k;
+	const char *strtab;
+
+	if (!pi->sechdrs || !pi->ehdr)
+		return NULL;
+
+	sechdrs = pi->sechdrs;
+	ehdr = pi->ehdr;
+
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		if (sechdrs[i].sh_type != SHT_SYMTAB)
+			continue;
+
+		if (sechdrs[i].sh_link >= ehdr->e_shnum)
+			/* Invalid strtab section number */
+			continue;
+		strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
+		syms = (Elf_Sym *)sechdrs[i].sh_offset;
+
+		/* Go through symbols for a match */
+		for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
+			if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
+				continue;
+
+			if (strcmp(strtab + syms[k].st_name, name) != 0)
+				continue;
+
+			if (syms[k].st_shndx == SHN_UNDEF ||
+			    syms[k].st_shndx >= ehdr->e_shnum) {
+				pr_debug("Symbol: %s has bad section index %d.\n",
+						name, syms[k].st_shndx);
+				return NULL;
+			}
+
+			/* Found the symbol we are looking for */
+			return &syms[k];
+		}
+	}
+
+	return NULL;
+}
+
+void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	Elf_Sym *sym;
+	Elf_Shdr *sechdr;
+
+	sym = kexec_purgatory_find_symbol(pi, name);
+	if (!sym)
+		return ERR_PTR(-EINVAL);
+
+	sechdr = &pi->sechdrs[sym->st_shndx];
+
+	/*
+	 * Returns the address where symbol will finally be loaded after
+	 * kexec_load_segment()
+	 */
+	return (void *)(sechdr->sh_addr + sym->st_value);
+}
+
+/*
+ * Get or set value of a symbol. If "get_value" is true, symbol value is
+ * returned in buf otherwise symbol value is set based on value in buf.
+ */
+int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
+				void *buf, unsigned int size, bool get_value)
+{
+	Elf_Sym *sym;
+	Elf_Shdr *sechdrs;
+	struct purgatory_info *pi = &image->purgatory_info;
+	char *sym_buf;
+
+	sym = kexec_purgatory_find_symbol(pi, name);
+	if (!sym)
+		return -EINVAL;
+
+	if (sym->st_size != size) {
+		pr_debug("Symbol: %s size is not right\n", name);
+		return -EINVAL;
+	}
+
+	sechdrs = pi->sechdrs;
+
+	if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
+		pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",
+					name);
+		return -EINVAL;
+	}
+
+	sym_buf = (char *)sechdrs[sym->st_shndx].sh_offset +
+					sym->st_value;
+
+	if (get_value)
+		memcpy((void *)buf, sym_buf, size);
+	else
+		memcpy((void *)sym_buf, buf, size);
+
+	return 0;
+}
 
 /*
  * Move into place and start executing a preloaded standalone
-- 
1.9.0



* [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
@ 2014-06-03 13:06   ` Vivek Goyal
From: Vivek Goyal @ 2014-06-03 13:06 UTC
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

Load purgatory code into RAM and relocate it based on its load location. The
relocation code is modeled on the kernel's module relocation code and on the
purgatory relocation code in kexec-tools.
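The following is a minimal userspace sketch of the RELA arithmetic that arch_kexec_apply_relocations_add() performs on x86_64. It is not kernel code. apply_rela() and the RT_* names are hypothetical. The type numbers come from the x86-64 psABI. Values are stored in host byte order, which matches the target only on a little-endian host such as x86_64. The point to notice is that bytes are patched in the temporary buffer, while the PC-relative case (S + A - P) uses the final load address P.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x86-64 psABI relocation type numbers (subset handled by the patch). */
enum { RT_NONE = 0, RT_64 = 1, RT_PC32 = 2, RT_32 = 10, RT_32S = 11 };

/*
 * Hypothetical helper: apply one RELA entry.
 *   buf        temporary buffer holding the section right now
 *   final_base address the section will occupy after it is copied
 *   offset     r_offset within the section
 *   sym_value  S, the symbol's final address
 *   addend     A, the explicit addend
 * Returns 0 on success, -1 on overflow or an unsupported type.
 */
static int apply_rela(unsigned char *buf, uint64_t final_base,
		      uint64_t offset, int type,
		      uint64_t sym_value, int64_t addend)
{
	unsigned char *location = buf + offset;	/* patched now */
	uint64_t address = final_base + offset;	/* P: where it will run */
	uint64_t value = sym_value + (uint64_t)addend;	/* S + A */

	switch (type) {
	case RT_NONE:
		return 0;
	case RT_64:
		memcpy(location, &value, sizeof(value));
		return 0;
	case RT_32: {
		uint32_t v = (uint32_t)value;

		if ((uint64_t)v != value)	/* must zero-extend */
			return -1;
		memcpy(location, &v, sizeof(v));
		return 0;
	}
	case RT_32S: {
		int32_t v = (int32_t)value;

		if ((int64_t)v != (int64_t)value)	/* must sign-extend */
			return -1;
		memcpy(location, &v, sizeof(v));
		return 0;
	}
	case RT_PC32: {
		uint32_t v = (uint32_t)(value - address);	/* S + A - P */

		memcpy(location, &v, sizeof(v));
		return 0;
	}
	default:
		return -1;
	}
}
```

The overflow checks mirror the ones in the patch. R_X86_64_32 rejects a value that does not zero-extend from 32 bits, and R_X86_64_32S rejects one that does not sign-extend.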

Also compute a SHA-256 digest of the loaded kexec segments and store it in
purgatory, along with the list of memory regions that were digested.
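The digest step can be pictured with the hedged sketch below. To stay self-contained, it uses a toy FNV-1a hash in place of SHA-256. That hash is not cryptographic and serves only as a stand-in. struct seg, struct region and digest_segments() are hypothetical names.

The sketch does mirror the patch's structure:
- The purgatory buffer is skipped, because it is modified after the digest is stored.
- The trailing memsz - bufsz bytes are hashed as zeros, fed in chunks from a 256-byte zero buffer.
- The memory range of each hashed segment is recorded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kexec_segment. */
struct seg {
	const unsigned char *kbuf;
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

/* Mirrors struct kexec_sha_region. */
struct region {
	unsigned long start;
	unsigned long len;
};

/* Toy FNV-1a update, standing in for crypto_shash_update(). */
static void fnv_update(uint64_t *h, const unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		*h ^= p[i];
		*h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
}

/*
 * Hash every segment except the one whose buffer is 'skip', then
 * record the covered ranges. Returns the number of regions written.
 */
static size_t digest_segments(const struct seg *segs, size_t nr,
			      const unsigned char *skip,
			      struct region *regions, uint64_t *out)
{
	static const unsigned char zero_buf[256];
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t j = 0;

	for (size_t i = 0; i < nr; i++) {
		if (segs[i].kbuf == skip)
			continue;

		fnv_update(&h, segs[i].kbuf, segs[i].bufsz);

		/* Hash the zero-filled tail without allocating memsz bytes. */
		for (size_t nullsz = segs[i].memsz - segs[i].bufsz; nullsz;) {
			size_t bytes = nullsz < sizeof(zero_buf) ?
				       nullsz : sizeof(zero_buf);

			fnv_update(&h, zero_buf, bytes);
			nullsz -= bytes;
		}

		regions[j].start = segs[i].mem;
		regions[j].len = segs[i].memsz;
		j++;
	}
	*out = h;
	return j;
}
```

A segment whose buffer already holds the trailing zeros produces the same digest as one that relies on the memsz padding. This is why the small zero buffer is enough.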

This functionality is provided by arch-independent code so that arch-dependent
bootloaders can make use of it.

Helper functions are also provided to get and set symbol values in purgatory.
Bootloaders use them later to set things such as the stack and the entry point
of the second kernel.
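Below is a simplified model of the symbol lookup and get/set helpers, under stated assumptions. struct msym and struct msec are hypothetical stand-ins for Elf_Sym and Elf_Shdr. msec.data plays the role of the rewritten ->sh_offset, which points at the temporary copy. Section index 0 stands for SHN_UNDEF.

The sketch follows the same rules as the patch:
- Only global symbols match.
- The caller's size must equal st_size.
- Symbols in NOBITS (bss) sections cannot be read or written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for Elf_Sym. */
struct msym {
	uint32_t st_name;	/* offset into the string table */
	uint64_t st_value;	/* offset within its section */
	uint64_t st_size;
	uint16_t st_shndx;	/* index of the section holding the symbol */
	bool global;		/* like ELF_ST_BIND() == STB_GLOBAL */
};

/* Hypothetical, simplified stand-in for Elf_Shdr. */
struct msec {
	unsigned char *data;	/* temporary copy, like rewritten ->sh_offset */
	bool nobits;		/* like SHT_NOBITS: no backing bytes */
};

static const struct msym *find_sym(const struct msym *syms, size_t nsyms,
				   const char *strtab, size_t nsecs,
				   const char *name)
{
	for (size_t k = 0; k < nsyms; k++) {
		if (!syms[k].global || strcmp(strtab + syms[k].st_name, name))
			continue;
		/* Undefined or out-of-range section index: reject. */
		if (syms[k].st_shndx == 0 || syms[k].st_shndx >= nsecs)
			return NULL;
		return &syms[k];
	}
	return NULL;
}

/* Copy size bytes out of (get) or into (!get) the named symbol. */
static int get_set_sym(const struct msym *syms, size_t nsyms,
		       const char *strtab, struct msec *secs, size_t nsecs,
		       const char *name, void *buf, size_t size, bool get)
{
	const struct msym *sym = find_sym(syms, nsyms, strtab, nsecs, name);

	if (!sym || sym->st_size != size || secs[sym->st_shndx].nobits)
		return -1;

	unsigned char *p = secs[sym->st_shndx].data + sym->st_value;

	if (get)
		memcpy(buf, p, size);
	else
		memcpy(p, buf, size);
	return 0;
}
```

Writes land in the temporary copy of the section. They reach their final address only when the segment is later copied into place.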

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/Kconfig                   |   2 +
 arch/x86/kernel/machine_kexec_64.c |  82 +++++++
 include/linux/kexec.h              |  31 +++
 kernel/kexec.c                     | 484 +++++++++++++++++++++++++++++++++++++
 4 files changed, 599 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 213308a..0f24b61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1556,6 +1556,8 @@ source kernel/Kconfig.hz
 config KEXEC
 	bool "kexec system call"
 	select BUILD_BIN2C
+	select CRYPTO
+	select CRYPTO_SHA256
 	---help---
 	  kexec is a system call that implements the ability to shutdown your
 	  current kernel, and to start another kernel.  It is like a reboot
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index d9c5cf0..711c1fb 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -337,3 +337,85 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 		return kexec_file_type[idx].cleanup(image);
 	return 0;
 }
+
+/* Apply purgatory relocations */
+int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,
+				unsigned int nr_sections, unsigned int relsec)
+{
+	unsigned int i;
+	Elf64_Rela *rel = (void *)sechdrs[relsec].sh_offset;
+	Elf64_Sym *sym;
+	void *location;
+	Elf64_Shdr *section, *symtab;
+	unsigned long address, sec_base, value;
+
+	/* Section to which relocations apply */
+	section = &sechdrs[sechdrs[relsec].sh_info];
+
+	/* Associated symbol table */
+	symtab = &sechdrs[sechdrs[relsec].sh_link];
+
+	for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+
+		/*
+		 * This is the location (->sh_offset) to update. It lies in the
+		 * temporary buffer where the section is currently loaded. The
+		 * section will later be moved to a different address (pointed
+		 * to by ->sh_addr); kexec takes care of moving it
+		 * (kexec_load_segment()).
+		 */
+		location = (void *)(section->sh_offset + rel[i].r_offset);
+
+		/* Final address of the location */
+		address = section->sh_addr + rel[i].r_offset;
+
+		sym = (Elf64_Sym *)symtab->sh_offset +
+				ELF64_R_SYM(rel[i].r_info);
+
+		if (sym->st_shndx == SHN_UNDEF || sym->st_shndx == SHN_COMMON)
+			return -ENOEXEC;
+
+		if (sym->st_shndx == SHN_ABS)
+			sec_base = 0;
+		else if (sym->st_shndx >= nr_sections)
+			return -ENOEXEC;
+		else
+			sec_base = sechdrs[sym->st_shndx].sh_addr;
+
+		value = sym->st_value;
+		value += sec_base;
+		value += rel[i].r_addend;
+
+		switch (ELF64_R_TYPE(rel[i].r_info)) {
+		case R_X86_64_NONE:
+			break;
+		case R_X86_64_64:
+			*(u64 *)location = value;
+			break;
+		case R_X86_64_32:
+			*(u32 *)location = value;
+			if (value != *(u32 *)location)
+				goto overflow;
+			break;
+		case R_X86_64_32S:
+			*(s32 *)location = value;
+			if ((s64)value != *(s32 *)location)
+				goto overflow;
+			break;
+		case R_X86_64_PC32:
+			value -= (u64)address;
+			*(u32 *)location = value;
+			break;
+		default:
+			pr_err("kexec: Unknown rela relocation: %llu\n",
+					ELF64_R_TYPE(rel[i].r_info));
+			return -ENOEXEC;
+		}
+	}
+	return 0;
+
+overflow:
+	pr_err("kexec: overflow in relocation type %d value 0x%lx\n",
+		(int)ELF64_R_TYPE(rel[i].r_info), value);
+	return -ENOEXEC;
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 3790519..7228873 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -10,6 +10,7 @@
 #include <linux/ioport.h>
 #include <linux/elfcore.h>
 #include <linux/elf.h>
+#include <linux/module.h>
 #include <asm/kexec.h>
 
 /* Verify architecture specific macros are defined */
@@ -95,6 +96,27 @@ struct compat_kexec_segment {
 };
 #endif
 
+struct kexec_sha_region {
+	unsigned long start;
+	unsigned long len;
+};
+
+struct purgatory_info {
+	/* Pointer to elf header of read only purgatory */
+	Elf_Ehdr *ehdr;
+
+	/* Pointer to purgatory sechdrs which are modifiable */
+	Elf_Shdr *sechdrs;
+	/*
+	 * Temporary buffer where purgatory is loaded and relocated.
+	 * This memory can be freed after the image is loaded.
+	 */
+	void *purgatory_buf;
+
+	/* Address where purgatory is finally loaded and is executed from */
+	unsigned long purgatory_load_addr;
+};
+
 struct kimage {
 	kimage_entry_t head;
 	kimage_entry_t *entry;
@@ -143,6 +165,9 @@ struct kimage {
 
 	/* Image loader handling the kernel can store a pointer here */
 	void *image_loader_data;
+
+	/* Information for loading purgatory */
+	struct purgatory_info purgatory_info;
 };
 
 /*
@@ -190,6 +215,12 @@ extern int kexec_add_buffer(struct kimage *image, char *buffer,
 			unsigned long *load_addr);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
+extern int kexec_load_purgatory(struct kimage *image, unsigned long min,
+		unsigned long max, int top_down, unsigned long *load_addr);
+extern int kexec_purgatory_get_set_symbol(struct kimage *image,
+		const char *name, void *buf, unsigned int size, bool get_value);
+extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
+					const char *name);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1ad4d60..7f2e393 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -39,6 +39,9 @@
 #include <asm/io.h>
 #include <asm/sections.h>
 
+#include <crypto/hash.h>
+#include <crypto/sha.h>
+
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
@@ -51,6 +54,15 @@ size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
 
+/*
+ * Declare these symbols weak so that if architecture provides a purgatory,
+ * these will be overridden.
+ */
+char __weak kexec_purgatory[0];
+size_t __weak kexec_purgatory_size = 0;
+
+static int kexec_calculate_store_digests(struct kimage *image);
+
 /* Location of the reserved area for the crash kernel */
 struct resource crashk_res = {
 	.name  = "Crash kernel",
@@ -336,6 +348,15 @@ arch_kimage_file_post_load_cleanup(struct kimage *image)
 	return;
 }
 
+/* Apply relocations for rela section */
+int __attribute__ ((weak))
+arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
+					unsigned int relsec)
+{
+	pr_err("kexec: RELA relocation unsupported\n");
+	return -ENOEXEC;
+}
+
 /*
  * Free up tempory buffers allocated which are not needed after image has
  * been loaded.
@@ -355,6 +376,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
 	vfree(image->cmdline_buf);
 	image->cmdline_buf = NULL;
 
+	vfree(image->purgatory_info.purgatory_buf);
+	image->purgatory_info.purgatory_buf = NULL;
+
+	vfree(image->purgatory_info.sechdrs);
+	image->purgatory_info.sechdrs = NULL;
+
 	/* See if architcture has anything to cleanup post load */
 	arch_kimage_file_post_load_cleanup(image);
 }
@@ -1370,6 +1397,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (ret)
 		goto out;
 
+	ret = kexec_calculate_store_digests(image);
+	if (ret)
+		goto out;
+
 	for (i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
@@ -2131,6 +2162,459 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
 	return 0;
 }
 
+/* Calculate and store the digest of segments */
+static int kexec_calculate_store_digests(struct kimage *image)
+{
+	struct crypto_shash *tfm;
+	struct shash_desc *desc;
+	int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;
+	size_t desc_size, nullsz;
+	char *digest = NULL;
+	void *zero_buf;
+	struct kexec_sha_region *sha_regions;
+
+	tfm = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(tfm)) {
+		ret = PTR_ERR(tfm);
+		goto out;
+	}
+
+	desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
+	desc = kzalloc(desc_size, GFP_KERNEL);
+	if (!desc) {
+		ret = -ENOMEM;
+		goto out_free_tfm;
+	}
+
+	zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
+	if (!zero_buf) {
+		ret = -ENOMEM;
+		goto out_free_desc;
+	}
+	ret = -ENOMEM;
+	sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
+	sha_regions = vzalloc(sha_region_sz);
+	if (!sha_regions)
+		goto out_free_zero_buf;
+
+	desc->tfm   = tfm;
+	desc->flags = 0;
+
+	ret = crypto_shash_init(desc);
+	if (ret < 0)
+		goto out_free_sha_regions;
+
+	digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto out_free_sha_regions;
+	}
+
+	/* Traverse through all segments */
+	for (j = i = 0; i < image->nr_segments; i++) {
+		struct kexec_segment *ksegment;
+		ksegment = &image->segment[i];
+
+		/*
+		 * Skip purgatory as it will be modified once we put digest
+		 * info in purgatory
+		 */
+		if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
+			continue;
+
+		ret = crypto_shash_update(desc, ksegment->kbuf,
+						ksegment->bufsz);
+		if (ret)
+			break;
+
+		nullsz = ksegment->memsz - ksegment->bufsz;
+		while (nullsz) {
+			unsigned long bytes = nullsz;
+			if (bytes > zero_buf_sz)
+				bytes = zero_buf_sz;
+			ret = crypto_shash_update(desc, zero_buf, bytes);
+			if (ret)
+				break;
+			nullsz -= bytes;
+		}
+
+		if (ret)
+			break;
+
+		sha_regions[j].start = ksegment->mem;
+		sha_regions[j].len = ksegment->memsz;
+		j++;
+	}
+
+	if (!ret) {
+		ret = crypto_shash_final(desc, digest);
+		if (ret)
+			goto out_free_sha_regions;
+		ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
+				sha_regions, sha_region_sz, 0);
+		if (ret)
+			goto out_free_sha_regions;
+
+		ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
+				digest, SHA256_DIGEST_SIZE, 0);
+		if (ret)
+			goto out_free_sha_regions;
+	}
+
+out_free_sha_regions:
+	vfree(sha_regions);
+out_free_zero_buf:
+	kfree(zero_buf);
+out_free_desc:
+	kfree(desc);
+out_free_tfm:
+	crypto_free_shash(tfm);
+out:
+	kfree(digest);
+	return ret;
+}
+
+/* Actually load and relocate purgatory. A lot of code is taken from kexec-tools. */
+static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
+				unsigned long max, int top_down)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
+	unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
+	unsigned char *buf_addr, *src;
+	int i, ret = 0, entry_sidx = -1;
+	Elf_Shdr *sechdrs = NULL, *sechdrs_c;
+	void *purgatory_buf = NULL;
+
+	/*
+	 * sechdrs_c points to the section headers in purgatory, which are
+	 * read only. No modifications are allowed.
+	 */
+	sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
+
+	/*
+	 * We cannot modify sechdrs_c[] or its fields; they are read only.
+	 * Copy them to a local copy where we can store some temporary
+	 * data and free it at the end. We need to modify ->sh_addr and
+	 * ->sh_offset to keep track of the permanent and temporary
+	 * locations of sections.
+	 */
+	sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+	if (!sechdrs)
+		return -ENOMEM;
+
+	memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+
+	/*
+	 * There are multiple copies of sections. The first copy is the one
+	 * embedded in the kernel in a read-only section. Some of these
+	 * sections will be copied to a temporary buffer and relocated, and
+	 * those will finally be copied to their final destination at
+	 * segment load time.
+	 *
+	 * Use ->sh_offset to reflect section address in memory. It will
+	 * point to original read only copy if section is not allocatable.
+	 * Otherwise it will point to temporary copy which will be relocated.
+	 *
+	 * Use ->sh_addr to contain final address of the section where it
+	 * will go during execution time.
+	 */
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (sechdrs[i].sh_type == SHT_NOBITS)
+			continue;
+
+		sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
+						sechdrs[i].sh_offset;
+	}
+
+	entry = pi->ehdr->e_entry;
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
+			continue;
+
+		/* Make entry section relative */
+		if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
+		    ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
+		     pi->ehdr->e_entry)) {
+			entry_sidx = i;
+			entry -= sechdrs[i].sh_addr;
+			break;
+		}
+	}
+
+	/* Find the RAM size requirements of relocatable object */
+	buf_align = 1;
+	bss_align = 1;
+	buf_sz = 0;
+	bss_sz = 0;
+
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		align = sechdrs[i].sh_addralign;
+		if (sechdrs[i].sh_type != SHT_NOBITS) {
+			if (buf_align < align)
+				buf_align = align;
+			buf_sz = ALIGN(buf_sz, align);
+			buf_sz += sechdrs[i].sh_size;
+		} else {
+			if (bss_align < align)
+				bss_align = align;
+			bss_sz = ALIGN(bss_sz, align);
+			bss_sz += sechdrs[i].sh_size;
+		}
+	}
+
+	if (buf_align < bss_align)
+		buf_align = bss_align;
+	bss_pad = 0;
+	if (buf_sz & (bss_align - 1))
+		bss_pad = bss_align - (buf_sz & (bss_align - 1));
+
+	memsz = buf_sz + bss_pad + bss_sz;
+
+	/* Allocate buffer for purgatory */
+	purgatory_buf = vzalloc(buf_sz);
+	if (!purgatory_buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Add buffer to segment list */
+	ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
+				buf_align, min, max, top_down,
+				&pi->purgatory_load_addr);
+	if (ret)
+		goto out;
+
+	/* Load SHF_ALLOC sections */
+	buf_addr = purgatory_buf;
+	load_addr = pi->purgatory_load_addr;
+	data_addr = load_addr;
+	bss_addr = load_addr + buf_sz + bss_pad;
+
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+			continue;
+
+		align = sechdrs[i].sh_addralign;
+		if (sechdrs[i].sh_type != SHT_NOBITS) {
+			data_addr = ALIGN(data_addr, align);
+			off = data_addr - load_addr;
+			/* We have already modified ->sh_offset to keep addr */
+			src = (char *) sechdrs[i].sh_offset;
+			memcpy(buf_addr + off, src, sechdrs[i].sh_size);
+
+			/* Store load address and source address of section */
+			sechdrs[i].sh_addr = data_addr;
+
+			/*
+			 * This section got copied to temporary buffer. Update
+			 * ->sh_offset accordingly.
+			 */
+			sechdrs[i].sh_offset = (unsigned long)(buf_addr + off);
+
+			/* Advance to the next address */
+			data_addr += sechdrs[i].sh_size;
+		} else {
+			bss_addr = ALIGN(bss_addr, align);
+			sechdrs[i].sh_addr = bss_addr;
+			bss_addr += sechdrs[i].sh_size;
+		}
+	}
+
+	/* update entry based on entry section position */
+	if (entry_sidx >= 0)
+		entry += sechdrs[entry_sidx].sh_addr;
+
+	/* Set the entry point of purgatory */
+	image->start = entry;
+
+	/* Apply relocations */
+	for (i = 0; i < pi->ehdr->e_shnum; i++) {
+		Elf_Shdr *section, *symtab;
+
+		if (sechdrs[i].sh_type != SHT_RELA &&
+		    sechdrs[i].sh_type != SHT_REL)
+			continue;
+
+		if (sechdrs[i].sh_info >= pi->ehdr->e_shnum ||
+		    sechdrs[i].sh_link >= pi->ehdr->e_shnum) {
+			ret = -ENOEXEC;
+			goto out;
+		}
+
+		section = &sechdrs[sechdrs[i].sh_info];
+		symtab = &sechdrs[sechdrs[i].sh_link];
+
+		if (!(section->sh_flags & SHF_ALLOC))
+			continue;
+
+		if (symtab->sh_link >= pi->ehdr->e_shnum)
+			/* Invalid section number? */
+			continue;
+
+		ret = -EOPNOTSUPP;
+		if (sechdrs[i].sh_type == SHT_RELA)
+			ret = arch_kexec_apply_relocations_add(sechdrs,
+							pi->ehdr->e_shnum, i);
+		if (ret)
+			goto out;
+	}
+
+	pi->sechdrs = sechdrs;
+	pi->purgatory_buf = purgatory_buf;
+	return ret;
+out:
+	vfree(sechdrs);
+	vfree(purgatory_buf);
+	return ret;
+}
+
+/* Load relocatable purgatory object and relocate it appropriately */
+int kexec_load_purgatory(struct kimage *image, unsigned long min,
+		unsigned long max, int top_down, unsigned long *load_addr)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	int ret;
+
+	if (kexec_purgatory_size == 0)
+		return -EINVAL;
+
+	if (kexec_purgatory_size < sizeof(Elf_Ehdr))
+		return -ENOEXEC;
+
+	pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
+
+	if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
+	    || pi->ehdr->e_type != ET_REL
+	    || !elf_check_arch(pi->ehdr)
+	    || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
+		return -ENOEXEC;
+
+	if (pi->ehdr->e_shoff >= kexec_purgatory_size
+	    || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
+	    kexec_purgatory_size - pi->ehdr->e_shoff))
+		return -ENOEXEC;
+
+	ret = elf_rel_load_relocate(image, min, max, top_down);
+	if (ret)
+		return ret;
+
+	*load_addr = image->purgatory_info.purgatory_load_addr;
+	return 0;
+}
+
+static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
+						const char *name)
+{
+	Elf_Sym *syms;
+	Elf_Shdr *sechdrs;
+	Elf_Ehdr *ehdr;
+	int i, k;
+	const char *strtab;
+
+	if (!pi->sechdrs || !pi->ehdr)
+		return NULL;
+
+	sechdrs = pi->sechdrs;
+	ehdr = pi->ehdr;
+
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		if (sechdrs[i].sh_type != SHT_SYMTAB)
+			continue;
+
+		if (sechdrs[i].sh_link >= ehdr->e_shnum)
+			/* Invalid strtab section number */
+			continue;
+		strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
+		syms = (Elf_Sym *)sechdrs[i].sh_offset;
+
+		/* Go through symbols for a match */
+		for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
+			if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
+				continue;
+
+			if (strcmp(strtab + syms[k].st_name, name) != 0)
+				continue;
+
+			if (syms[k].st_shndx == SHN_UNDEF ||
+			    syms[k].st_shndx >= ehdr->e_shnum) {
+				pr_debug("Symbol: %s has bad section index %d.\n",
+						name, syms[k].st_shndx);
+				return NULL;
+			}
+
+			/* Found the symbol we are looking for */
+			return &syms[k];
+		}
+	}
+
+	return NULL;
+}
+
+void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
+{
+	struct purgatory_info *pi = &image->purgatory_info;
+	Elf_Sym *sym;
+	Elf_Shdr *sechdr;
+
+	sym = kexec_purgatory_find_symbol(pi, name);
+	if (!sym)
+		return ERR_PTR(-EINVAL);
+
+	sechdr = &pi->sechdrs[sym->st_shndx];
+
+	/*
+	 * Returns the address where symbol will finally be loaded after
+	 * kexec_load_segment()
+	 */
+	return (void *)(sechdr->sh_addr + sym->st_value);
+}
+
+/*
+ * Get or set value of a symbol. If "get_value" is true, symbol value is
+ * returned in buf; otherwise symbol value is set based on value in buf.
+ */
+int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
+				void *buf, unsigned int size, bool get_value)
+{
+	Elf_Sym *sym;
+	Elf_Shdr *sechdrs;
+	struct purgatory_info *pi = &image->purgatory_info;
+	char *sym_buf;
+
+	sym = kexec_purgatory_find_symbol(pi, name);
+	if (!sym)
+		return -EINVAL;
+
+	if (sym->st_size != size) {
+		pr_debug("Symbol: %s size is not right\n", name);
+		return -EINVAL;
+	}
+
+	sechdrs = pi->sechdrs;
+
+	if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
+		pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",
+					name);
+		return -EINVAL;
+	}
+
+	sym_buf = (char *)sechdrs[sym->st_shndx].sh_offset +
+					sym->st_value;
+
+	if (get_value)
+		memcpy((void *)buf, sym_buf, size);
+	else
+		memcpy((void *)sym_buf, buf, size);
+
+	return 0;
+}
 
 /*
  * Move into place and start executing a preloaded standalone
-- 
1.9.0

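For readers following the purgatory loader above, here is a minimal userspace sketch of the ELF header sanity checks that kexec_load_purgatory() applies before relocating the blob. It assumes an x86_64 target (EM_X86_64 stands in for elf_check_arch()) and uses the glibc <elf.h> types; purgatory_hdr_ok() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the header checks in kexec_load_purgatory(). */
static bool purgatory_hdr_ok(const unsigned char *blob, size_t size)
{
	Elf64_Ehdr ehdr;

	assert(blob != NULL);
	if (size < sizeof(ehdr))
		return false;
	memcpy(&ehdr, blob, sizeof(ehdr));	/* avoid unaligned access */

	/* Must be a relocatable ELF object for this architecture. */
	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
	    ehdr.e_type != ET_REL ||
	    ehdr.e_machine != EM_X86_64 ||	/* assumption: x86_64 only */
	    ehdr.e_shentsize != sizeof(Elf64_Shdr))
		return false;

	/* The entire section header table must lie inside the blob. */
	if (ehdr.e_shoff >= size ||
	    (size_t)ehdr.e_shnum * sizeof(Elf64_Shdr) > size - ehdr.e_shoff)
		return false;

	return true;
}
```

Checking `e_shnum * sizeof(Elf64_Shdr)` against `size - e_shoff` (rather than adding to `e_shoff`) keeps the bound check free of overflow once `e_shoff < size` is known.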

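A similar userspace sketch of the lookup in kexec_purgatory_find_symbol(), under the same <elf.h> assumptions. find_global_sym() is hypothetical and takes the symbol and string tables directly instead of reaching them through section headers. Valid section indices run from 0 to e_shnum - 1, so the sketch rejects any st_shndx that is SHN_UNDEF or >= shnum.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: return the global symbol called `name`, or NULL. */
static const Elf64_Sym *find_global_sym(const Elf64_Sym *syms, size_t nsyms,
					const char *strtab, Elf64_Half shnum,
					const char *name)
{
	assert(syms != NULL && strtab != NULL && name != NULL);

	for (size_t k = 0; k < nsyms; k++) {
		/* Only global symbols are visible to the loader. */
		if (ELF64_ST_BIND(syms[k].st_info) != STB_GLOBAL)
			continue;
		if (strcmp(strtab + syms[k].st_name, name) != 0)
			continue;
		/* Name matched: its section index must be a real section. */
		if (syms[k].st_shndx == SHN_UNDEF || syms[k].st_shndx >= shnum)
			return NULL;
		return &syms[k];
	}
	return NULL;
}
```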

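Once the symbol is found, kexec_purgatory_get_set_symbol() copies bytes to or from its storage at section data + st_value. A sketch of that step, assuming a hypothetical struct sym_view that bundles what the kernel reads from Elf_Sym and Elf_Shdr: the requested size must equal st_size exactly, and SHT_NOBITS (.bss) symbols are refused because they have no backing bytes in the loaded section.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one symbol inside an already loaded section. */
struct sym_view {
	unsigned char *section_data;	/* loaded copy of the symbol's section */
	uint64_t st_value;		/* offset of the symbol in that section */
	uint64_t st_size;		/* size of the symbol in bytes */
	bool nobits;			/* section is SHT_NOBITS */
};

/* Read (get_value == true) or overwrite the symbol's bytes. */
static int sym_get_set(const struct sym_view *s, void *buf, size_t size,
		       bool get_value)
{
	unsigned char *sym_buf;

	assert(s != NULL && buf != NULL);
	if (s->st_size != size || s->nobits)
		return -EINVAL;

	sym_buf = s->section_data + s->st_value;
	if (get_value)
		memcpy(buf, sym_buf, size);
	else
		memcpy(sym_buf, buf, size);
	return 0;
}
```

This is the read-modify-write pattern bzImage64_load() uses on "entry64_regs": fetch the struct, fill in registers, write it back.
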
* [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:07   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:07 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

This is loader specific code which can load bzImage and set it up for
64bit entry. This does not take care of 32bit entry or real mode entry.

32bit mode entry can be implemented if somebody needs it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kexec-bzimage.h |  11 ++
 arch/x86/include/asm/kexec.h         |  30 ++++
 arch/x86/kernel/Makefile             |   3 +-
 arch/x86/kernel/kexec-bzimage.c      | 269 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/machine_kexec.c      | 136 ++++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   |   3 +-
 6 files changed, 450 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/kexec-bzimage.h
 create mode 100644 arch/x86/kernel/kexec-bzimage.c
 create mode 100644 arch/x86/kernel/machine_kexec.c

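The probe step of this loader can be illustrated with a small userspace sketch that reads the setup header at the fixed offsets listed in Documentation/x86/boot.txt rather than through struct setup_header. bzimage64_probe_sketch() and get16() are hypothetical names; the flag values are the ones defined in asm/bootparam.h, and a little-endian image is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOADED_HIGH			(1 << 0)	/* loadflags */
#define XLF_KERNEL_64			(1 << 0)	/* xloadflags */
#define XLF_CAN_BE_LOADED_ABOVE_4G	(1 << 1)

/* Little-endian 16-bit read. */
static uint16_t get16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical mirror of the checks in bzImage64_probe(). */
static int bzimage64_probe_sketch(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	if (len < 2 * 512)				/* at least two sectors */
		return -ENOEXEC;
	if (memcmp(buf + 0x202, "HdrS", 4) != 0)	/* header magic */
		return -ENOEXEC;
	if (get16(buf + 0x1fe) != 0xAA55)		/* boot_flag */
		return -ENOEXEC;
	if (get16(buf + 0x206) < 0x020C)		/* protocol >= 2.12 */
		return -ENOEXEC;
	if (!(buf[0x211] & LOADED_HIGH))		/* loadflags */
		return -ENOEXEC;
	if (!(get16(buf + 0x236) & XLF_KERNEL_64) ||	/* xloadflags */
	    !(get16(buf + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G))
		return -ENOEXEC;
	return 0;
}
```

Protocol 2.12 is required because xloadflags (offset 0x236) only exists from that version on.
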
diff --git a/arch/x86/include/asm/kexec-bzimage.h b/arch/x86/include/asm/kexec-bzimage.h
new file mode 100644
index 0000000..9e83961
--- /dev/null
+++ b/arch/x86/include/asm/kexec-bzimage.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_BZIMAGE_H
+#define _ASM_BZIMAGE_H
+
+extern int bzImage64_probe(const char *buf, unsigned long len);
+extern void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len);
+extern int bzImage64_cleanup(struct kimage *image);
+
+#endif  /* _ASM_BZIMAGE_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 17483a4..9bd6fec 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -23,6 +23,7 @@
 
 #include <asm/page.h>
 #include <asm/ptrace.h>
+#include <asm/bootparam.h>
 
 /*
  * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
@@ -161,11 +162,40 @@ struct kimage_arch {
 	pmd_t *pmd;
 	pte_t *pte;
 };
+
+struct kexec_entry64_regs {
+	uint64_t rax;
+	uint64_t rbx;
+	uint64_t rcx;
+	uint64_t rdx;
+	uint64_t rsi;
+	uint64_t rdi;
+	uint64_t rsp;
+	uint64_t rbp;
+	uint64_t r8;
+	uint64_t r9;
+	uint64_t r10;
+	uint64_t r11;
+	uint64_t r12;
+	uint64_t r13;
+	uint64_t r14;
+	uint64_t r15;
+	uint64_t rip;
+};
 #endif
 
 typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 
+extern int kexec_setup_initrd(struct boot_params *boot_params,
+		unsigned long initrd_load_addr, unsigned long initrd_len);
+extern int kexec_setup_cmdline(struct boot_params *boot_params,
+		unsigned long bootparams_load_addr,
+		unsigned long cmdline_offset, char *cmdline,
+		unsigned long cmdline_len);
+extern int kexec_setup_boot_parameters(struct boot_params *params);
+
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d9600..b2d0cfa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -67,8 +67,9 @@ obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
 obj-$(CONFIG_X86_TSC)		+= trace_clock.o
-obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o crash.o
+obj-$(CONFIG_KEXEC)		+= kexec-bzimage.o
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o
 obj-y				+= kprobes/
 obj-$(CONFIG_MODULES)		+= module.o
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
new file mode 100644
index 0000000..0750784
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -0,0 +1,269 @@
+/*
+ * Kexec bzImage loader
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Defines the lowest physical address for various segments. Not sure where
+ * exactly these limits came from. The current bzimage64 loader in kexec-tools
+ * uses them, so I am retaining them. They can be changed over time as we gain
+ * more insight.
+ */
+#define MIN_PURGATORY_ADDR	0x3000
+#define MIN_BOOTPARAM_ADDR	0x3000
+#define MIN_KERNEL_LOAD_ADDR	0x100000
+#define MIN_INITRD_LOAD_ADDR	0x1000000
+
+#ifdef CONFIG_X86_64
+
+/*
+ * This is a placeholder for all boot-loader-specific data structures which
+ * get allocated in one call but are freed much later during cleanup
+ * time. Right now there is only one field but it can grow as needed.
+ */
+struct bzimage64_data {
+	/*
+	 * Temporary buffer to hold bootparams buffer. This should be
+	 * freed once the bootparam segment has been loaded.
+	 */
+	void *bootparams_buf;
+};
+
+int bzImage64_probe(const char *buf, unsigned long len)
+{
+	int ret = -ENOEXEC;
+	struct setup_header *header;
+
+	/* kernel should be at least two sectors long */
+	if (len < 2 * 512) {
+		pr_debug("File is too short to be a bzImage\n");
+		return ret;
+	}
+
+	header = (struct setup_header *)(buf + offsetof(struct boot_params,
+								hdr));
+	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+		pr_debug("Not a bzImage\n");
+		return ret;
+	}
+
+	if (header->boot_flag != 0xAA55) {
+		pr_debug("No x86 boot sector present\n");
+		return ret;
+	}
+
+	if (header->version < 0x020C) {
+		pr_debug("Must be at least protocol version 2.12\n");
+		return ret;
+	}
+
+	if ((header->loadflags & LOADED_HIGH) == 0) {
+		pr_debug("zImage not a bzImage\n");
+		return ret;
+	}
+
+	if (!(header->xloadflags & XLF_KERNEL_64)) {
+		pr_debug("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
+		return ret;
+	}
+
+	if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
+		pr_debug("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
+		return ret;
+	}
+
+	/* I've got a bzImage */
+	pr_debug("It's a relocatable bzImage64\n");
+	ret = 0;
+
+	return ret;
+}
+
+void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len,
+		char *initrd, unsigned long initrd_len,
+		char *cmdline, unsigned long cmdline_len)
+{
+
+	struct setup_header *header;
+	int setup_sects, kern16_size, ret = 0;
+	unsigned long setup_header_size, params_cmdline_sz;
+	struct boot_params *params;
+	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+	unsigned long purgatory_load_addr;
+	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
+	char *kernel_buf;
+	struct bzimage64_data *ldata;
+	struct kexec_entry64_regs regs64;
+	void *stack;
+	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+
+	header = (struct setup_header *)(kernel + setup_hdr_offset);
+	setup_sects = header->setup_sects;
+	if (setup_sects == 0)
+		setup_sects = 4;
+
+	kern16_size = (setup_sects + 1) * 512;
+	if (kernel_len < kern16_size) {
+		pr_debug("bzImage truncated\n");
+		return ERR_PTR(-ENOEXEC);
+	}
+
+	if (cmdline_len > header->cmdline_size) {
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Allocate loader specific data */
+	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+	if (!ldata)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Load purgatory. For 64bit entry point, purgatory code can be
+	 * anywhere.
+	 */
+	ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
+					&purgatory_load_addr);
+	if (ret) {
+		pr_debug("Loading purgatory failed\n");
+		goto out_free_loader_data;
+	}
+
+	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+
+	/* Load Bootparams and cmdline */
+	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+	if (!params) {
+		ret = -ENOMEM;
+		goto out_free_loader_data;
+	}
+
+	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
+	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
+
+	/* Is there a limit on setup header size? */
+	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
+
+	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
+			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
+			       ULONG_MAX, 1, &bootparam_load_addr);
+	if (ret)
+		goto out_free_params;
+	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
+
+	/* Load kernel */
+	kernel_buf = kernel + kern16_size;
+	kernel_bufsz = kernel_len - kern16_size;
+	kernel_memsz = ALIGN(header->init_size, 4096);
+	kernel_align = header->kernel_alignment;
+
+	ret = kexec_add_buffer(image, kernel_buf,
+			       kernel_bufsz, kernel_memsz, kernel_align,
+			       MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
+			       &kernel_load_addr);
+	if (ret)
+		goto out_free_params;
+
+	pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 kernel_load_addr, kernel_bufsz, kernel_memsz);
+
+	/* Load initrd high */
+	if (initrd) {
+		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+				       PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
+				       ULONG_MAX, 1, &initrd_load_addr);
+		if (ret)
+			goto out_free_params;
+
+		pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+				initrd_load_addr, initrd_len, initrd_len);
+		ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
+		if (ret)
+			goto out_free_params;
+	}
+
+	ret = kexec_setup_cmdline(params, bootparam_load_addr,
+				  sizeof(struct boot_params), cmdline,
+				  cmdline_len);
+	if (ret)
+		goto out_free_params;
+
+	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
+	params->hdr.type_of_loader = 0x0D << 4;
+	params->hdr.loadflags = 0;
+
+	/* Setup purgatory regs for entry */
+	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+					     sizeof(regs64), 1);
+	if (ret)
+		goto out_free_params;
+
+	regs64.rbx = 0; /* Bootstrap Processor */
+	regs64.rsi = bootparam_load_addr;
+	regs64.rip = kernel_load_addr + 0x200;
+	stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+	if (IS_ERR(stack)) {
+		pr_debug("Could not find address of symbol stack_end\n");
+		ret = -EINVAL;
+		goto out_free_params;
+	}
+
+	regs64.rsp = (unsigned long)stack;
+	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+					     sizeof(regs64), 0);
+	if (ret)
+		goto out_free_params;
+
+	ret = kexec_setup_boot_parameters(params);
+	if (ret)
+		goto out_free_params;
+
+	/*
+	 * Store a pointer to params so that it can be freed after the params
+	 * segment has been loaded and its contents have been copied
+	 * somewhere else.
+	 */
+	ldata->bootparams_buf = params;
+	return ldata;
+
+out_free_params:
+	kfree(params);
+out_free_loader_data:
+	kfree(ldata);
+	return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int bzImage64_cleanup(struct kimage *image)
+{
+	struct bzimage64_data *ldata = image->image_loader_data;
+
+	if (!ldata)
+		return 0;
+
+	kfree(ldata->bootparams_buf);
+	ldata->bootparams_buf = NULL;
+
+	return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
new file mode 100644
index 0000000..7de3239
--- /dev/null
+++ b/arch/x86/kernel/machine_kexec.c
@@ -0,0 +1,136 @@
+/*
+ * handle transition of Linux booting another kernel
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Common code for x86 and x86_64 used for kexec.
+ *
+ * For the time being it compiles only for x86_64 as there are no image
+ * loaders implemented for x86. This #ifdef can be removed once somebody
+ * decides to write an image loader on CONFIG_X86_32.
+ */
+
+#ifdef CONFIG_X86_64
+
+int kexec_setup_initrd(struct boot_params *params,
+		unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+	params->ext_ramdisk_image = initrd_load_addr >> 32;
+	params->ext_ramdisk_size = initrd_len >> 32;
+
+	return 0;
+}
+
+int kexec_setup_cmdline(struct boot_params *params,
+		unsigned long bootparams_load_addr,
+		unsigned long cmdline_offset, char *cmdline,
+		unsigned long cmdline_len)
+{
+	char *cmdline_ptr = ((char *)params) + cmdline_offset;
+	unsigned long cmdline_ptr_phys;
+	uint32_t cmdline_low_32, cmdline_ext_32;
+
+	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	cmdline_ptr[cmdline_len - 1] = '\0';
+
+	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+	cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+	params->hdr.cmd_line_ptr = cmdline_low_32;
+	if (cmdline_ext_32)
+		params->ext_cmd_line_ptr = cmdline_ext_32;
+
+	return 0;
+}
+
+static int setup_memory_map_entries(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+
+	/* TODO: What about EFI */
+	nr_e820_entries = e820_saved.nr_map;
+	if (nr_e820_entries > E820MAX)
+		nr_e820_entries = E820MAX;
+
+	params->e820_entries = nr_e820_entries;
+	memcpy(&params->e820_map, &e820_saved.map,
+			nr_e820_entries * sizeof(struct e820entry));
+
+	return 0;
+}
+
+int kexec_setup_boot_parameters(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+	unsigned long long mem_k, start, end;
+	int i;
+
+	/* Get subarch from existing bootparams */
+	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+	/* Copying screen_info will do? */
+	memcpy(&params->screen_info, &boot_params.screen_info,
+				sizeof(struct screen_info));
+
+	/* Fill in memsize later */
+	params->screen_info.ext_mem_k = 0;
+	params->alt_mem_k = 0;
+
+	/* Default APM info */
+	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+	/* Default drive info */
+	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+	/* Default sysdesc table */
+	params->sys_desc_table.length = 0;
+
+	setup_memory_map_entries(params);
+	nr_e820_entries = params->e820_entries;
+
+	for (i = 0; i < nr_e820_entries; i++) {
+		if (params->e820_map[i].type != E820_RAM)
+			continue;
+		start = params->e820_map[i].addr;
+		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
+
+		if ((start <= 0x100000) && end > 0x100000) {
+			mem_k = (end >> 10) - (0x100000 >> 10);
+			params->screen_info.ext_mem_k = mem_k;
+			params->alt_mem_k = mem_k;
+			if (mem_k > 0xfc00)
+				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
+			if (mem_k > 0xffffffff)
+				params->alt_mem_k = 0xffffffff;
+		}
+	}
+
+	/* Setup EDD info */
+	memcpy(params->eddbuf, boot_params.eddbuf,
+				EDDMAXNR * sizeof(struct edd_info));
+	params->eddbuf_entries = boot_params.eddbuf_entries;
+
+	memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+			EDD_MBR_SIG_MAX * sizeof(unsigned int));
+
+	return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 711c1fb..a66fae3 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -21,10 +21,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
+#include <asm/kexec-bzimage.h>
 
 /* arch dependent functionality related to kexec file based syscall */
 static struct kexec_file_type kexec_file_type[] = {
-	{"", NULL, NULL, NULL},
+	{"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_cleanup},
 };
 
 static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
-- 
1.9.0

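A sketch of the size arithmetic bzImage64_load() relies on: the real-mode part occupies setup_sects + 1 sectors (a setup_sects of 0 means 4), and the setup header ends at 0x202 plus the byte at offset 0x201. The helper names are hypothetical; the 0x1f1 constant is offsetof(struct boot_params, hdr). The sketch reads the jump byte as unsigned char so the sum can never go negative.

```c
#include <assert.h>
#include <stddef.h>

#define SETUP_HDR_OFFSET 0x1f1	/* offsetof(struct boot_params, hdr) */

/* Bytes of real-mode code (boot sector + setup) before the 64-bit kernel. */
static size_t kern16_size(unsigned char setup_sects)
{
	if (setup_sects == 0)	/* boot protocol: 0 means 4 */
		setup_sects = 4;
	return ((size_t)setup_sects + 1) * 512;
}

/* Bytes of setup header to copy from the image into boot_params.hdr. */
static size_t setup_header_size(const unsigned char *kernel)
{
	assert(kernel != NULL);
	return 0x0202 + (size_t)kernel[0x0201] - SETUP_HDR_OFFSET;
}
```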

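boot_params still carries 32-bit legacy fields, so kexec_setup_initrd() and kexec_setup_cmdline() split each 64-bit value: the low half goes into the setup_header field (ramdisk_image, ramdisk_size, cmd_line_ptr) and the high half into the matching ext_* field. A minimal sketch of that split and its inverse; struct split32, split_u64() and join_u64() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair: legacy low field plus its ext_* high field. */
struct split32 {
	uint32_t lo;	/* e.g. hdr.ramdisk_image */
	uint32_t ext;	/* e.g. ext_ramdisk_image */
};

static struct split32 split_u64(uint64_t v)
{
	struct split32 s = { (uint32_t)(v & 0xffffffffULL), (uint32_t)(v >> 32) };
	return s;
}

/* What a 64-bit-aware consumer reconstructs from the two fields. */
static uint64_t join_u64(struct split32 s)
{
	return ((uint64_t)s.ext << 32) | s.lo;
}
```

When everything sits below 4 GiB the ext half is simply 0, which is why the loader can leave the zeroed ext fields alone in that case.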

* [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
@ 2014-06-03 13:07   ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:07 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

This is loader specific code which can load bzImage and set it up for
64bit entry. This does not take care of 32bit entry or real mode entry.

32bit mode entry can be implemented if somebody needs it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kexec-bzimage.h |  11 ++
 arch/x86/include/asm/kexec.h         |  30 ++++
 arch/x86/kernel/Makefile             |   3 +-
 arch/x86/kernel/kexec-bzimage.c      | 269 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/machine_kexec.c      | 136 ++++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   |   3 +-
 6 files changed, 450 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/kexec-bzimage.h
 create mode 100644 arch/x86/kernel/kexec-bzimage.c
 create mode 100644 arch/x86/kernel/machine_kexec.c

diff --git a/arch/x86/include/asm/kexec-bzimage.h b/arch/x86/include/asm/kexec-bzimage.h
new file mode 100644
index 0000000..9e83961
--- /dev/null
+++ b/arch/x86/include/asm/kexec-bzimage.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_BZIMAGE_H
+#define _ASM_BZIMAGE_H
+
+extern int bzImage64_probe(const char *buf, unsigned long len);
+extern void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len, char *initrd,
+		unsigned long initrd_len, char *cmdline,
+		unsigned long cmdline_len);
+extern int bzImage64_cleanup(struct kimage *image);
+
+#endif  /* _ASM_BZIMAGE_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 17483a4..9bd6fec 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -23,6 +23,7 @@
 
 #include <asm/page.h>
 #include <asm/ptrace.h>
+#include <asm/bootparam.h>
 
 /*
  * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
@@ -161,11 +162,40 @@ struct kimage_arch {
 	pmd_t *pmd;
 	pte_t *pte;
 };
+
+struct kexec_entry64_regs {
+	uint64_t rax;
+	uint64_t rbx;
+	uint64_t rcx;
+	uint64_t rdx;
+	uint64_t rsi;
+	uint64_t rdi;
+	uint64_t rsp;
+	uint64_t rbp;
+	uint64_t r8;
+	uint64_t r9;
+	uint64_t r10;
+	uint64_t r11;
+	uint64_t r12;
+	uint64_t r13;
+	uint64_t r14;
+	uint64_t r15;
+	uint64_t rip;
+};
 #endif
 
 typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 
+extern int kexec_setup_initrd(struct boot_params *boot_params,
+		unsigned long initrd_load_addr, unsigned long initrd_len);
+extern int kexec_setup_cmdline(struct boot_params *boot_params,
+		unsigned long bootparams_load_addr,
+		unsigned long cmdline_offset, char *cmdline,
+		unsigned long cmdline_len);
+extern int kexec_setup_boot_parameters(struct boot_params *params);
+
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d9600..b2d0cfa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -67,8 +67,9 @@ obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
 obj-$(CONFIG_X86_TSC)		+= trace_clock.o
-obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o crash.o
+obj-$(CONFIG_KEXEC)		+= kexec-bzimage.o
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump_$(BITS).o
 obj-y				+= kprobes/
 obj-$(CONFIG_MODULES)		+= module.o
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
new file mode 100644
index 0000000..0750784
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -0,0 +1,269 @@
+/*
+ * Kexec bzImage loader
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Defines the lowest physical address for various segments. Not sure where
+ * exactly these limits came from. The current bzimage64 loader in kexec-tools
+ * uses them, so I am retaining them. They can be changed over time as we gain
+ * more insight.
+ */
+#define MIN_PURGATORY_ADDR	0x3000
+#define MIN_BOOTPARAM_ADDR	0x3000
+#define MIN_KERNEL_LOAD_ADDR	0x100000
+#define MIN_INITRD_LOAD_ADDR	0x1000000
+
+#ifdef CONFIG_X86_64
+
+/*
+ * This is a placeholder for all boot-loader-specific data structures which
+ * get allocated in one call but are freed much later during cleanup
+ * time. Right now there is only one field but it can grow as needed.
+ */
+struct bzimage64_data {
+	/*
+	 * Temporary buffer to hold bootparams buffer. This should be
+	 * freed once the bootparam segment has been loaded.
+	 */
+	void *bootparams_buf;
+};
+
+int bzImage64_probe(const char *buf, unsigned long len)
+{
+	int ret = -ENOEXEC;
+	struct setup_header *header;
+
+	/* kernel should be at least two sectors long */
+	if (len < 2 * 512) {
+		pr_debug("File is too short to be a bzImage\n");
+		return ret;
+	}
+
+	header = (struct setup_header *)(buf + offsetof(struct boot_params,
+								hdr));
+	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+		pr_debug("Not a bzImage\n");
+		return ret;
+	}
+
+	if (header->boot_flag != 0xAA55) {
+		pr_debug("No x86 boot sector present\n");
+		return ret;
+	}
+
+	if (header->version < 0x020C) {
+		pr_debug("Must be at least protocol version 2.12\n");
+		return ret;
+	}
+
+	if ((header->loadflags & LOADED_HIGH) == 0) {
+		pr_debug("zImage not a bzImage\n");
+		return ret;
+	}
+
+	if (!(header->xloadflags & XLF_KERNEL_64)) {
+		pr_debug("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
+		return ret;
+	}
+
+	if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
+		pr_debug("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
+		return ret;
+	}
+
+	/* I've got a bzImage */
+	pr_debug("It's a relocatable bzImage64\n");
+	ret = 0;
+
+	return ret;
+}
+
+void *bzImage64_load(struct kimage *image, char *kernel,
+		unsigned long kernel_len,
+		char *initrd, unsigned long initrd_len,
+		char *cmdline, unsigned long cmdline_len)
+{
+
+	struct setup_header *header;
+	int setup_sects, kern16_size, ret = 0;
+	unsigned long setup_header_size, params_cmdline_sz;
+	struct boot_params *params;
+	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+	unsigned long purgatory_load_addr;
+	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
+	char *kernel_buf;
+	struct bzimage64_data *ldata;
+	struct kexec_entry64_regs regs64;
+	void *stack;
+	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+
+	header = (struct setup_header *)(kernel + setup_hdr_offset);
+	setup_sects = header->setup_sects;
+	if (setup_sects == 0)
+		setup_sects = 4;
+
+	kern16_size = (setup_sects + 1) * 512;
+	if (kernel_len < kern16_size) {
+		pr_debug("bzImage truncated\n");
+		return ERR_PTR(-ENOEXEC);
+	}
+
+	if (cmdline_len > header->cmdline_size) {
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Allocate loader specific data */
+	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+	if (!ldata)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Load purgatory. For 64bit entry point, purgatory code can be
+	 * anywhere.
+	 */
+	ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
+					&purgatory_load_addr);
+	if (ret) {
+		pr_debug("Loading purgatory failed\n");
+		goto out_free_loader_data;
+	}
+
+	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+
+	/* Load Bootparams and cmdline */
+	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+	if (!params) {
+		ret = -ENOMEM;
+		goto out_free_loader_data;
+	}
+
+	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
+	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
+
+	/* Is there a limit on setup header size? */
+	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
+
+	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
+			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
+			       ULONG_MAX, 1, &bootparam_load_addr);
+	if (ret)
+		goto out_free_params;
+	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
+
+	/* Load kernel */
+	kernel_buf = kernel + kern16_size;
+	kernel_bufsz = kernel_len - kern16_size;
+	kernel_memsz = ALIGN(header->init_size, 4096);
+	kernel_align = header->kernel_alignment;
+
+	ret = kexec_add_buffer(image, kernel_buf,
+			       kernel_bufsz, kernel_memsz, kernel_align,
+			       MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
+			       &kernel_load_addr);
+	if (ret)
+		goto out_free_params;
+
+	pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 kernel_load_addr, kernel_bufsz, kernel_memsz);
+
+	/* Load initrd high */
+	if (initrd) {
+		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+				       PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
+				       ULONG_MAX, 1, &initrd_load_addr);
+		if (ret)
+			goto out_free_params;
+
+		pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+				initrd_load_addr, initrd_len, initrd_len);
+		ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
+		if (ret)
+			goto out_free_params;
+	}
+
+	ret = kexec_setup_cmdline(params, bootparam_load_addr,
+				  sizeof(struct boot_params), cmdline,
+				  cmdline_len);
+	if (ret)
+		goto out_free_params;
+
+	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
+	params->hdr.type_of_loader = 0x0D << 4;
+	params->hdr.loadflags = 0;
+
+	/* Setup purgatory regs for entry */
+	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+					     sizeof(regs64), 1);
+	if (ret)
+		goto out_free_params;
+
+	regs64.rbx = 0; /* Bootstrap Processor */
+	regs64.rsi = bootparam_load_addr;
+	regs64.rip = kernel_load_addr + 0x200;
+	stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+	if (IS_ERR(stack)) {
+		pr_debug("Could not find address of symbol stack_end\n");
+		ret = -EINVAL;
+		goto out_free_params;
+	}
+
+	regs64.rsp = (unsigned long)stack;
+	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+					     sizeof(regs64), 0);
+	if (ret)
+		goto out_free_params;
+
+	ret = kexec_setup_boot_parameters(params);
+	if (ret)
+		goto out_free_params;
+
+	/*
+	 * Store a pointer to params so that it can be freed after the params
+	 * segment has been loaded and its contents have been copied
+	 * somewhere else.
+	 */
+	ldata->bootparams_buf = params;
+	return ldata;
+
+out_free_params:
+	kfree(params);
+out_free_loader_data:
+	kfree(ldata);
+	return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int bzImage64_cleanup(struct kimage *image)
+{
+	struct bzimage64_data *ldata = image->image_loader_data;
+
+	if (!ldata)
+		return 0;
+
+	kfree(ldata->bootparams_buf);
+	ldata->bootparams_buf = NULL;
+
+	return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
new file mode 100644
index 0000000..7de3239
--- /dev/null
+++ b/arch/x86/kernel/machine_kexec.c
@@ -0,0 +1,136 @@
+/*
+ * handle transition of Linux booting another kernel
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Common code for x86 and x86_64 used for kexec.
+ *
+ * For the time being it compiles only for x86_64 as there are no image
+ * loaders implemented for x86. This #ifdef can be removed once somebody
+ * decides to write an image loader on CONFIG_X86_32.
+ */
+
+#ifdef CONFIG_X86_64
+
+int kexec_setup_initrd(struct boot_params *params,
+		unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+	params->ext_ramdisk_image = initrd_load_addr >> 32;
+	params->ext_ramdisk_size = initrd_len >> 32;
+
+	return 0;
+}
+
+int kexec_setup_cmdline(struct boot_params *params,
+		unsigned long bootparams_load_addr,
+		unsigned long cmdline_offset, char *cmdline,
+		unsigned long cmdline_len)
+{
+	char *cmdline_ptr = ((char *)params) + cmdline_offset;
+	unsigned long cmdline_ptr_phys;
+	uint32_t cmdline_low_32, cmdline_ext_32;
+
+	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	cmdline_ptr[cmdline_len - 1] = '\0';
+
+	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+	cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+	params->hdr.cmd_line_ptr = cmdline_low_32;
+	if (cmdline_ext_32)
+		params->ext_cmd_line_ptr = cmdline_ext_32;
+
+	return 0;
+}
+
+static int setup_memory_map_entries(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+
+	/* TODO: What about EFI */
+	nr_e820_entries = e820_saved.nr_map;
+	if (nr_e820_entries > E820MAX)
+		nr_e820_entries = E820MAX;
+
+	params->e820_entries = nr_e820_entries;
+	memcpy(&params->e820_map, &e820_saved.map,
+			nr_e820_entries * sizeof(struct e820entry));
+
+	return 0;
+}
+
+int kexec_setup_boot_parameters(struct boot_params *params)
+{
+	unsigned int nr_e820_entries;
+	unsigned long long mem_k, start, end;
+	int i;
+
+	/* Get subarch from existing bootparams */
+	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+	/* Copying screen_info will do? */
+	memcpy(&params->screen_info, &boot_params.screen_info,
+				sizeof(struct screen_info));
+
+	/* Fill in memsize later */
+	params->screen_info.ext_mem_k = 0;
+	params->alt_mem_k = 0;
+
+	/* Default APM info */
+	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+	/* Default drive info */
+	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+	/* Default sysdesc table */
+	params->sys_desc_table.length = 0;
+
+	setup_memory_map_entries(params);
+	nr_e820_entries = params->e820_entries;
+
+	for (i = 0; i < nr_e820_entries; i++) {
+		if (params->e820_map[i].type != E820_RAM)
+			continue;
+		start = params->e820_map[i].addr;
+		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
+
+		if ((start <= 0x100000) && end > 0x100000) {
+			mem_k = (end >> 10) - (0x100000 >> 10);
+			params->screen_info.ext_mem_k = mem_k;
+			params->alt_mem_k = mem_k;
+			if (mem_k > 0xfc00)
+				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
+			if (mem_k > 0xffffffff)
+				params->alt_mem_k = 0xffffffff;
+		}
+	}
+
+	/* Setup EDD info */
+	memcpy(params->eddbuf, boot_params.eddbuf,
+				EDDMAXNR * sizeof(struct edd_info));
+	params->eddbuf_entries = boot_params.eddbuf_entries;
+
+	memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+			EDD_MBR_SIG_MAX * sizeof(unsigned int));
+
+	return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 711c1fb..a66fae3 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -21,10 +21,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/debugreg.h>
+#include <asm/kexec-bzimage.h>
 
 /* arch dependent functionality related to kexec file based syscall */
 static struct kexec_file_type kexec_file_type[] = {
-	{"", NULL, NULL, NULL},
+	{"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_cleanup},
 };
 
 static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
-- 
1.9.0




* [PATCH 12/13] kexec: Support for Kexec on panic using new system call
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:07   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:07 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

This patch adds support for loading a kexec on panic (kdump) kernel using
the new system call. Right now this works only with the bzImage loader.
Changes to the ELF loader should be minimal, as all the core infrastructure
is in place.

The only thing preventing the ELF loader from loading into crash reserved
memory is that kernel vmlinux is of type ET_EXEC and expects to be loaded
at the address it has been compiled for. The current kernel is already
running at that location. One first needs to make vmlinux fully
relocatable and export it as type ET_DYN, and then modify the ELF loader
to support images of type ET_DYN.

I am leaving this as a future TODO item.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/crash.h       |   9 +
 arch/x86/include/asm/kexec.h       |  25 +-
 arch/x86/kernel/crash.c            | 581 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kexec-bzimage.c    |  27 +-
 arch/x86/kernel/machine_kexec.c    |  21 +-
 arch/x86/kernel/machine_kexec_64.c |  40 +++
 kernel/kexec.c                     |  83 +++++-
 7 files changed, 770 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/include/asm/crash.h

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
new file mode 100644
index 0000000..2dd2eb8
--- /dev/null
+++ b/arch/x86/include/asm/crash.h
@@ -0,0 +1,9 @@
+#ifndef _ASM_X86_CRASH_H
+#define _ASM_X86_CRASH_H
+
+int load_crashdump_segments(struct kimage *image);
+int crash_copy_backup_region(struct kimage *image);
+int crash_setup_memmap_entries(struct kimage *image,
+		struct boot_params *params);
+
+#endif /* _ASM_X86_CRASH_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9bd6fec..4cbe5f7 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -25,6 +25,8 @@
 #include <asm/ptrace.h>
 #include <asm/bootparam.h>
 
+struct kimage;
+
 /*
  * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
  * I.e. Maximum page that is mapped directly into kernel memory,
@@ -62,6 +64,10 @@
 # define KEXEC_ARCH KEXEC_ARCH_X86_64
 #endif
 
+/* Memory to backup during crash kdump */
+#define KEXEC_BACKUP_SRC_START	(0UL)
+#define KEXEC_BACKUP_SRC_END	(640 * 1024UL)	/* 640K */
+
 /*
  * CPU does not save ss and sp on stack if execution is already
  * running in kernel mode at the time of NMI occurrence. This code
@@ -161,8 +167,21 @@ struct kimage_arch {
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	/* Details of backup region */
+	unsigned long backup_src_start;
+	unsigned long backup_src_sz;
+
+	/* Physical address of backup segment */
+	unsigned long backup_load_addr;
+
+	/* Core ELF header buffer */
+	void *elf_headers;
+	unsigned long elf_headers_sz;
+	unsigned long elf_load_addr;
 };
+#endif /* CONFIG_X86_32 */
 
+#ifdef CONFIG_X86_64
 struct kexec_entry64_regs {
 	uint64_t rax;
 	uint64_t rbx;
@@ -189,11 +208,13 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 
 extern int kexec_setup_initrd(struct boot_params *boot_params,
 		unsigned long initrd_load_addr, unsigned long initrd_len);
-extern int kexec_setup_cmdline(struct boot_params *boot_params,
+extern int kexec_setup_cmdline(struct kimage *image,
+		struct boot_params *boot_params,
 		unsigned long bootparams_load_addr,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len);
-extern int kexec_setup_boot_parameters(struct boot_params *params);
+extern int kexec_setup_boot_parameters(struct kimage *image,
+					struct boot_params *params);
 
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 507de80..b6a0974 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -4,6 +4,9 @@
  * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
  *
  * Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2014. All rights reserved.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
  *
  */
 
@@ -16,6 +19,7 @@
 #include <linux/elf.h>
 #include <linux/elfcore.h>
 #include <linux/module.h>
+#include <linux/slab.h>
 
 #include <asm/processor.h>
 #include <asm/hardirq.h>
@@ -28,6 +32,45 @@
 #include <asm/reboot.h>
 #include <asm/virtext.h>
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+/* This primarily represents the number of split ranges due to exclusion */
+#define CRASH_MAX_RANGES	16
+
+struct crash_mem_range {
+	u64 start, end;
+};
+
+struct crash_mem {
+	unsigned int nr_ranges;
+	struct crash_mem_range ranges[CRASH_MAX_RANGES];
+};
+
+/* Misc data about ram ranges needed to prepare elf headers */
+struct crash_elf_data {
+	struct kimage *image;
+	/*
+	 * Total number of ram ranges we have after various adjustments for
+	 * GART, crash reserved region etc.
+	 */
+	unsigned int max_nr_ranges;
+	unsigned long gart_start, gart_end;
+
+	/* Pointer to elf header */
+	void *ehdr;
+	/* Pointer to next phdr */
+	void *bufp;
+	struct crash_mem mem;
+};
+
+/* Used while preparing memory map entries for second kernel */
+struct crash_memmap_data {
+	struct boot_params *params;
+	/* Type of memory */
+	unsigned int type;
+};
+
 int in_crash_kexec;
 
 /*
@@ -39,6 +82,7 @@ int in_crash_kexec;
  */
 crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
 EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
+unsigned long crash_zero_bytes;
 
 static inline void cpu_crash_vmclear_loaded_vmcss(void)
 {
@@ -135,3 +179,540 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
+
+#ifdef CONFIG_X86_64
+
+static int get_nr_ram_ranges_callback(unsigned long start_pfn,
+				unsigned long nr_pfn, void *arg)
+{
+	int *nr_ranges = arg;
+
+	(*nr_ranges)++;
+	return 0;
+}
+
+static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+
+	ced->gart_start = start;
+	ced->gart_end = end;
+
+	/* Not expecting more than 1 gart aperture */
+	return 1;
+}
+
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static int fill_up_crash_elf_data(struct crash_elf_data *ced,
+					struct kimage *image)
+{
+	unsigned int nr_ranges = 0;
+
+	ced->image = image;
+
+	walk_system_ram_range(0, -1, &nr_ranges,
+				get_nr_ram_ranges_callback);
+
+	ced->max_nr_ranges = nr_ranges;
+
+	/*
+	 * We don't create ELF headers for GART aperture as an attempt
+	 * to dump this memory in second kernel leads to hang/crash.
+	 * If gart aperture is present, one needs to exclude that region
+	 * and that could lead to need of extra phdr.
+	 */
+	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
+				ced, get_gart_ranges_callback);
+
+	/*
+	 * If we have gart region, excluding that could potentially split
+	 * a memory range, resulting in an extra header. Account for that.
+	 */
+	if (ced->gart_end)
+		ced->max_nr_ranges++;
+
+	/* Exclusion of crash region could split memory ranges */
+	ced->max_nr_ranges++;
+
+	/* If crashk_low_res is there, another range split possible */
+	if (crashk_low_res.end != 0)
+		ced->max_nr_ranges++;
+
+	return 0;
+}
+
+static int exclude_mem_range(struct crash_mem *mem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	int i, j;
+	unsigned long long start, end;
+	struct crash_mem_range temp_range = {0, 0};
+
+	for (i = 0; i < mem->nr_ranges; i++) {
+		start = mem->ranges[i].start;
+		end = mem->ranges[i].end;
+
+		if (mstart > end || mend < start)
+			continue;
+
+		/* Truncate any area outside of range */
+		if (mstart < start)
+			mstart = start;
+		if (mend > end)
+			mend = end;
+
+		/* Found completely overlapping range */
+		if (mstart == start && mend == end) {
+			mem->ranges[i].start = 0;
+			mem->ranges[i].end = 0;
+			if (i < mem->nr_ranges - 1) {
+				/* Shift rest of the ranges to left */
+				for (j = i; j < mem->nr_ranges - 1; j++) {
+					mem->ranges[j].start =
+						mem->ranges[j+1].start;
+					mem->ranges[j].end =
+							mem->ranges[j+1].end;
+				}
+			}
+			mem->nr_ranges--;
+			return 0;
+		}
+
+		if (mstart > start && mend < end) {
+			/* Split original range */
+			mem->ranges[i].end = mstart - 1;
+			temp_range.start = mend + 1;
+			temp_range.end = end;
+		} else if (mstart != start)
+			mem->ranges[i].end = mstart - 1;
+		else
+			mem->ranges[i].start = mend + 1;
+		break;
+	}
+
+	/* If a split happened, add the split range to the array */
+	if (!temp_range.end)
+		return 0;
+
+	/* Split happened */
+	if (i == CRASH_MAX_RANGES - 1) {
+		pr_err("Too many crash ranges after split\n");
+		return -ENOMEM;
+	}
+
+	/* Location where new range should go */
+	j = i + 1;
+	if (j < mem->nr_ranges) {
+		/* Move over all ranges one place */
+		for (i = mem->nr_ranges - 1; i >= j; i--)
+			mem->ranges[i + 1] = mem->ranges[i];
+	}
+
+	mem->ranges[j].start = temp_range.start;
+	mem->ranges[j].end = temp_range.end;
+	mem->nr_ranges++;
+	return 0;
+}
+
+/*
+ * Look for any unwanted ranges between mstart, mend and remove them. This
+ * might lead to split and split ranges are put in ced->mem.ranges[] array
+ */
+static int elf_header_exclude_ranges(struct crash_elf_data *ced,
+		unsigned long long mstart, unsigned long long mend)
+{
+	struct crash_mem *cmem = &ced->mem;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude crashkernel region */
+	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+	if (ret)
+		return ret;
+
+	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
+	if (ret)
+		return ret;
+
+	/* Exclude GART region */
+	if (ced->gart_end) {
+		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long mstart, mend;
+	struct kimage *image = ced->image;
+	struct crash_mem *cmem;
+	int ret, i;
+
+	ehdr = ced->ehdr;
+
+	/* Exclude unwanted mem ranges */
+	ret = elf_header_exclude_ranges(ced, start, end);
+	if (ret)
+		return ret;
+
+	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
+	cmem = &ced->mem;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		mstart = cmem->ranges[i].start;
+		mend = cmem->ranges[i].end;
+
+		phdr = ced->bufp;
+		ced->bufp += sizeof(Elf64_Phdr);
+
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_offset  = mstart;
+
+		/*
+		 * If a range matches backup region, adjust offset to backup
+		 * segment.
+		 */
+		if (mstart == image->arch.backup_src_start &&
+		    (mend - mstart + 1) == image->arch.backup_src_sz)
+			phdr->p_offset = image->arch.backup_load_addr;
+
+		phdr->p_paddr = mstart;
+		phdr->p_vaddr = (unsigned long long) __va(mstart);
+		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+		phdr->p_align = 0;
+		ehdr->e_phnum++;
+		pr_debug("Crash PT_LOAD elf header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
+			ehdr->e_phnum, phdr->p_offset);
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_headers(struct crash_elf_data *ced,
+		void **addr, unsigned long *sz)
+{
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+	unsigned char *buf, *bufp;
+	unsigned int cpu;
+	unsigned long long notes_addr;
+	int ret;
+
+	/* extra phdr for vmcoreinfo elf note */
+	nr_phdr = nr_cpus + 1;
+	nr_phdr += ced->max_nr_ranges;
+
+	/*
+	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
+	 * I think this is required by tools like gdb. So same physical
+	 * memory will be mapped in two elf headers. One will contain kernel
+	 * text virtual addresses and other will have __va(physical) addresses.
+	 */
+
+	nr_phdr++;
+	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+	buf = vzalloc(elf_sz);
+	if (!buf)
+		return -ENOMEM;
+
+	bufp = buf;
+	ehdr = (Elf64_Ehdr *)bufp;
+	bufp += sizeof(Elf64_Ehdr);
+	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+	ehdr->e_type = ET_CORE;
+	ehdr->e_machine = ELF_ARCH;
+	ehdr->e_version = EV_CURRENT;
+	ehdr->e_entry = 0;
+	ehdr->e_phoff = sizeof(Elf64_Ehdr);
+	ehdr->e_shoff = 0;
+	ehdr->e_flags = 0;
+	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+	ehdr->e_phentsize = sizeof(Elf64_Phdr);
+	ehdr->e_phnum = 0;
+	ehdr->e_shentsize = 0;
+	ehdr->e_shnum = 0;
+	ehdr->e_shstrndx = 0;
+
+	/* Prepare one phdr of type PT_NOTE for each present cpu */
+	for_each_present_cpu(cpu) {
+		phdr = (Elf64_Phdr *)bufp;
+		bufp += sizeof(Elf64_Phdr);
+		phdr->p_type = PT_NOTE;
+		phdr->p_flags = 0;
+		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+		phdr->p_offset = phdr->p_paddr = notes_addr;
+		phdr->p_vaddr = 0;
+		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+		phdr->p_align = 0;
+		(ehdr->e_phnum)++;
+	}
+
+	/* Prepare one PT_NOTE header for vmcoreinfo */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_NOTE;
+	phdr->p_flags = 0;
+	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+	phdr->p_vaddr = 0;
+	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+
+#ifdef CONFIG_X86_64
+	/* Prepare PT_LOAD type program header for kernel text region */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_LOAD;
+	phdr->p_flags = PF_R|PF_W|PF_X;
+	phdr->p_vaddr = (Elf64_Addr)_text;
+	phdr->p_filesz = phdr->p_memsz = _end - _text;
+	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+#endif
+
+	/* Prepare PT_LOAD headers for system ram chunks. */
+	ced->ehdr = ehdr;
+	ced->bufp = bufp;
+	ret = walk_system_ram_res(0, -1, ced,
+			prepare_elf64_ram_headers_callback);
+	if (ret < 0)
+		return ret;
+
+	*addr = buf;
+	*sz = elf_sz;
+	return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(struct kimage *image, void **addr,
+					unsigned long *sz)
+{
+	struct crash_elf_data *ced;
+	int ret;
+
+	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
+	if (!ced)
+		return -ENOMEM;
+
+	ret = fill_up_crash_elf_data(ced, image);
+	if (ret)
+		goto out;
+
+	/* By default prepare 64bit headers */
+	ret =  prepare_elf64_headers(ced, addr, sz);
+out:
+	kfree(ced);
+	return ret;
+}
+
+static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
+{
+	unsigned int nr_e820_entries;
+
+	nr_e820_entries = params->e820_entries;
+	if (nr_e820_entries >= E820MAX)
+		return 1;
+
+	memcpy(&params->e820_map[nr_e820_entries], entry,
+			sizeof(struct e820entry));
+	params->e820_entries++;
+	return 0;
+}
+
+static int memmap_entry_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_memmap_data *cmd = arg;
+	struct boot_params *params = cmd->params;
+	struct e820entry ei;
+
+	ei.addr = start;
+	ei.size = end - start + 1;
+	ei.type = cmd->type;
+	add_e820_entry(params, &ei);
+
+	return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	unsigned long start, end;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude Backup region */
+	start = image->arch.backup_load_addr;
+	end = start + image->arch.backup_src_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	if (ret)
+		return ret;
+
+	/* Exclude elf header region */
+	start = image->arch.elf_load_addr;
+	end = start + image->arch.elf_headers_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	return ret;
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+	int i, ret = 0;
+	unsigned long flags;
+	struct e820entry ei;
+	struct crash_memmap_data cmd;
+	struct crash_mem *cmem;
+
+	cmem = vzalloc(sizeof(struct crash_mem));
+	if (!cmem)
+		return -ENOMEM;
+
+	memset(&cmd, 0, sizeof(struct crash_memmap_data));
+	cmd.params = params;
+
+	/* Add first 640K segment */
+	ei.addr = image->arch.backup_src_start;
+	ei.size = image->arch.backup_src_sz;
+	ei.type = E820_RAM;
+	add_e820_entry(params, &ei);
+
+	/* Add ACPI tables */
+	cmd.type = E820_ACPI;
+	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
+
+	/* Add ACPI Non-volatile Storage */
+	cmd.type = E820_NVS;
+	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+			memmap_entry_callback);
+
+	/* Add crashk_low_res region */
+	if (crashk_low_res.end) {
+		ei.addr = crashk_low_res.start;
+		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
+		ei.type = E820_RAM;
+		add_e820_entry(params, &ei);
+	}
+
+	/* Exclude some ranges from crashk_res and add rest to memmap */
+	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
+						crashk_res.end);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		ei.addr = cmem->ranges[i].start;
+		ei.size = cmem->ranges[i].end - ei.addr + 1;
+		ei.type = E820_RAM;
+
+		/* If entry is less than a page, skip it */
+		if (ei.size < PAGE_SIZE)
+			continue;
+		add_e820_entry(params, &ei);
+	}
+
+out:
+	vfree(cmem);
+	return ret;
+}
+
+static int determine_backup_region(u64 start, u64 end, void *arg)
+{
+	struct kimage *image = arg;
+
+	image->arch.backup_src_start = start;
+	image->arch.backup_src_sz = end - start + 1;
+
+	/* Expecting only one range for backup region */
+	return 1;
+}
+
+int load_crashdump_segments(struct kimage *image)
+{
+	unsigned long src_start, src_sz, elf_sz;
+	void *elf_addr;
+	int ret;
+
+	/*
+	 * Determine and load a segment for backup area. First 640K RAM
+	 * region is backup source
+	 */
+
+	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
+				image, determine_backup_region);
+
+	/* Zero or positive return values are ok */
+	if (ret < 0)
+		return ret;
+
+	src_start = image->arch.backup_src_start;
+	src_sz = image->arch.backup_src_sz;
+
+	/* Add backup segment. */
+	if (src_sz) {
+		/*
+		 * Ideally there is no source for backup segment. This is
+		 * copied in purgatory after crash. Just add a zero filled
+		 * segment for now to make sure checksum logic works fine.
+		 */
+		ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
+				       sizeof(crash_zero_bytes), src_sz,
+				       PAGE_SIZE, 0, -1, 0,
+				       &image->arch.backup_load_addr);
+		if (ret)
+			return ret;
+		pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
+			 image->arch.backup_load_addr, src_start, src_sz);
+	}
+
+	/* Prepare elf headers and add a segment */
+	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+	if (ret)
+		return ret;
+
+	image->arch.elf_headers = elf_addr;
+	image->arch.elf_headers_sz = elf_sz;
+
+	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
+			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
+			&image->arch.elf_load_addr);
+	if (ret) {
+		vfree((void *)image->arch.elf_headers);
+		return ret;
+	}
+	pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 image->arch.elf_load_addr, elf_sz, elf_sz);
+
+	return ret;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
index 0750784..8e762d3 100644
--- a/arch/x86/kernel/kexec-bzimage.c
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -18,6 +18,9 @@
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
+#include <asm/crash.h>
+
+#define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
 
 /*
  * Defines lowest physical address for various segments. Not sure where
@@ -130,11 +133,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 		return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * In case of crash dump, we will append elfcorehdr=<addr> to
+	 * command line. Make sure it does not overflow
+	 */
+	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+		ret = -EINVAL;
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	/* Allocate loader specific data */
 	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
 	if (!ldata)
 		return ERR_PTR(-ENOMEM);
 
+	/* Allocate and load backup region and ELF core headers */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = load_crashdump_segments(image);
+		if (ret)
+			goto out_free_loader_data;
+	}
+
 	/*
 	 * Load purgatory. For 64bit entry point, purgatory  code can be
 	 * anywhere.
@@ -149,7 +169,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
 
 	/* Load Bootparams and cmdline */
-	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
+				MAX_ELFCOREHDR_STR_LEN;
 	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
 	if (!params) {
 		ret = -ENOMEM;
@@ -201,7 +222,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 			goto out_free_params;
 	}
 
-	ret = kexec_setup_cmdline(params, bootparam_load_addr,
+	ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
 				  sizeof(struct boot_params), cmdline,
 				  cmdline_len);
 	if (ret)
@@ -233,7 +254,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	if (ret)
 		goto out_free_params;
 
-	ret = kexec_setup_boot_parameters(params);
+	ret = kexec_setup_boot_parameters(image, params);
 	if (ret)
 		goto out_free_params;
 
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
index 7de3239..6a3821b 100644
--- a/arch/x86/kernel/machine_kexec.c
+++ b/arch/x86/kernel/machine_kexec.c
@@ -10,9 +10,11 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/kexec.h>
 #include <linux/string.h>
 #include <asm/bootparam.h>
 #include <asm/setup.h>
+#include <asm/crash.h>
 
 /*
  * Common code for x86 and x86_64 used for kexec.
@@ -36,18 +38,24 @@ int kexec_setup_initrd(struct boot_params *params,
 	return 0;
 }
 
-int kexec_setup_cmdline(struct boot_params *params,
+int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
 		unsigned long bootparams_load_addr,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len)
 {
 	char *cmdline_ptr = ((char *)params) + cmdline_offset;
-	unsigned long cmdline_ptr_phys;
+	unsigned long cmdline_ptr_phys, len;
 	uint32_t cmdline_low_32, cmdline_ext_32;
 
 	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	if (image->type == KEXEC_TYPE_CRASH) {
+		len = sprintf(cmdline_ptr + cmdline_len - 1,
+			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
+		cmdline_len += len;
+	}
 	cmdline_ptr[cmdline_len - 1] = '\0';
 
+	pr_debug("Final command line is: %s\n", cmdline_ptr);
 	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
 	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
 	cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -75,7 +83,8 @@ static int setup_memory_map_entries(struct boot_params *params)
 	return 0;
 }
 
-int kexec_setup_boot_parameters(struct boot_params *params)
+int kexec_setup_boot_parameters(struct kimage *image,
+				struct boot_params *params)
 {
 	unsigned int nr_e820_entries;
 	unsigned long long mem_k, start, end;
@@ -102,7 +111,11 @@ int kexec_setup_boot_parameters(struct boot_params *params)
 	/* Default sysdesc table */
 	params->sys_desc_table.length = 0;
 
-	setup_memory_map_entries(params);
+	if (image->type == KEXEC_TYPE_CRASH)
+		crash_setup_memmap_entries(image, params);
+	else
+		setup_memory_map_entries(params);
+
 	nr_e820_entries = params->e820_entries;
 
 	for (i = 0; i < nr_e820_entries; i++) {
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a66fae3..07e4b60 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -179,6 +179,38 @@ static void load_segments(void)
 		);
 }
 
+/* Update purgatory as needed after various image segments have been prepared */
+static int arch_update_purgatory(struct kimage *image)
+{
+	int ret = 0;
+
+	if (!image->file_mode)
+		return 0;
+
+	/* Setup copying of backup region */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
+				&image->arch.backup_load_addr,
+				sizeof(image->arch.backup_load_addr), 0);
+		if (ret)
+			return ret;
+
+		ret = kexec_purgatory_get_set_symbol(image, "backup_src",
+				&image->arch.backup_src_start,
+				sizeof(image->arch.backup_src_start), 0);
+		if (ret)
+			return ret;
+
+		ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
+				&image->arch.backup_src_sz,
+				sizeof(image->arch.backup_src_sz), 0);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
 int machine_kexec_prepare(struct kimage *image)
 {
 	unsigned long start_pgtable;
@@ -192,6 +224,11 @@ int machine_kexec_prepare(struct kimage *image)
 	if (result)
 		return result;
 
+	/* update purgatory as needed */
+	result = arch_update_purgatory(image);
+	if (result)
+		return result;
+
 	return 0;
 }
 
@@ -334,6 +371,9 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 	if (idx < 0)
 		return 0;
 
+	vfree(image->arch.elf_headers);
+	image->arch.elf_headers = NULL;
+
 	if (kexec_file_type[idx].cleanup)
 		return kexec_file_type[idx].cleanup(image);
 	return 0;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7f2e393..52e45cb 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -546,7 +546,6 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 	*rimage = image;
 	return 0;
 
-
 out_free_control_pages:
 	kimage_free_page_list(&image->control_pages);
 out_free_image:
@@ -554,6 +553,54 @@ out_free_image:
 	return result;
 }
 
+static int kimage_file_crash_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	/* Enable the special crash kernel control page allocation policy. */
+	image->control_page = crashk_res.start;
+	image->type = KEXEC_TYPE_CRASH;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		pr_err("Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
+
 static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -1135,10 +1182,15 @@ static int kimage_load_crash_segment(struct kimage *image,
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
-	unsigned char __user *buf;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
 
 	result = 0;
-	buf = segment->buf;
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
+
 	ubytes = segment->bufsz;
 	mbytes = segment->memsz;
 	maddr = segment->mem;
@@ -1161,7 +1213,12 @@ static int kimage_load_crash_segment(struct kimage *image,
 			/* Zero the trailing part of the page */
 			memset(ptr + uchunk, 0, mchunk - uchunk);
 		}
-		result = copy_from_user(ptr, buf, uchunk);
+
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, kbuf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kexec_flush_icache_page(page);
 		kunmap(page);
 		if (result) {
@@ -1170,7 +1227,10 @@ static int kimage_load_crash_segment(struct kimage *image,
 		}
 		ubytes -= uchunk;
 		maddr  += mchunk;
-		buf    += mchunk;
+		if (image->file_mode)
+			kbuf += mchunk;
+		else
+			buf += mchunk;
 		mbytes -= mchunk;
 	}
 out:
@@ -1388,7 +1448,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (flags & KEXEC_FILE_UNLOAD)
 		goto exchange;
 
-	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+	if (flags & KEXEC_FILE_ON_CRASH)
+		ret = kimage_file_crash_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	else
+		ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
 				cmdline_ptr, cmdline_len);
 	if (ret)
 		goto out;
@@ -2143,7 +2207,12 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
 	kbuf->top_down = top_down;
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+	if (image->type == KEXEC_TYPE_CRASH)
+		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
+				crashk_res.start, crashk_res.end, kbuf,
+				walk_ram_range_callback);
+	else
+		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
 	/*
 	 * If range could be found successfully, it would have incremented
-- 
1.9.0
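In this patch, prepare_elf64_headers() sizes the ELF core header buffer as one Elf64_Ehdr plus a set of Elf64_Phdr entries: one PT_NOTE per possible CPU, one PT_NOTE for vmcoreinfo, one PT_LOAD for the kernel text mapping, and max_nr_ranges PT_LOADs for system RAM. The total is rounded up to ELF_CORE_HEADER_ALIGN (4096). Because Elf64_Ehdr is 64 bytes and Elf64_Phdr is 56 bytes, one page holds up to 72 program headers. The sketch below reproduces only that arithmetic in user space. The helper name elf_core_hdr_size() and the ALIGN_UP() macro are hypothetical stand-ins for the kernel code.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

#define ELF_CORE_HEADER_ALIGN 4096UL   /* value used in the patch */

/* Round x up to a multiple of the power-of-two a (same idea as ALIGN()). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/*
 * Hypothetical helper: buffer size for the ELF core header, counting
 * program headers the same way prepare_elf64_headers() does:
 *   nr_cpus        PT_NOTE (per-cpu crash notes)
 *   + 1            PT_NOTE (vmcoreinfo)
 *   + max_nr_ranges PT_LOAD (system RAM, after exclusions/splits)
 *   + 1            PT_LOAD (kernel text mapping)
 */
static unsigned long elf_core_hdr_size(unsigned long nr_cpus,
				       unsigned long max_nr_ranges)
{
	unsigned long nr_phdr, sz;

	assert(nr_cpus > 0);
	nr_phdr = nr_cpus + 1 + max_nr_ranges + 1;
	sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);

	return ALIGN_UP(sz, ELF_CORE_HEADER_ALIGN);
}
```

For example, 4 CPUs and 10 RAM ranges need 16 headers (64 + 16 * 56 = 960 bytes), which still rounds up to a single 4096-byte page. The buffer only grows to a second page once the total exceeds 72 program headers.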



* [PATCH 12/13] kexec: Support for Kexec on panic using new system call
@ 2014-06-03 13:07   ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:07 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: mjg59, bhe, jkosina, hpa, Vivek Goyal, bp, ebiederm, greg, akpm,
	dyoung, chaowang

This patch adds support for loading a kexec-on-panic (kdump) kernel using
the new system call. Right now this works only with the bzImage loader,
but changes to the ELF loader should be minimal as all the core
infrastructure is in place.

The only thing preventing the ELF loader from loading into crash reserved
memory is that the kernel vmlinux is of type ET_EXEC and expects to be
loaded at the address it was compiled for. The currently running kernel
already occupies that location. One first needs to make vmlinux fully
relocatable and export it as type ET_DYN, and then modify the ELF loader
to support images of type ET_DYN.

I am leaving it as a future TODO item.
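A loader can tell the two cases apart from the e_type field of the ELF header. Below is a minimal user-space sketch of that check, assuming glibc's <elf.h>. The helper elf64_is_relocatable() is hypothetical and is not part of this patch series.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this patch series).
 * Returns 1 for ET_DYN (a relocatable image that could be placed at another
 * address, e.g. inside crash reserved memory, once the loader applies its
 * relocations), 0 for ET_EXEC (must be loaded at its link-time address),
 * and -1 if buf does not hold a valid 64-bit ELF header.
 */
static int elf64_is_relocatable(const unsigned char *buf, size_t len)
{
	Elf64_Ehdr ehdr;

	assert(buf != NULL);
	if (len < sizeof(ehdr))
		return -1;
	/* Copy out to avoid unaligned access through a cast pointer. */
	memcpy(&ehdr, buf, sizeof(ehdr));

	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
	    ehdr.e_ident[EI_CLASS] != ELFCLASS64)
		return -1;

	switch (ehdr.e_type) {
	case ET_DYN:
		return 1;
	case ET_EXEC:
		return 0;
	default:
		return -1;
	}
}
```

This only classifies the header. An ET_DYN loader would still have to walk the image's relocation entries and fix them up for the chosen load address.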

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/crash.h       |   9 +
 arch/x86/include/asm/kexec.h       |  25 +-
 arch/x86/kernel/crash.c            | 581 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kexec-bzimage.c    |  27 +-
 arch/x86/kernel/machine_kexec.c    |  21 +-
 arch/x86/kernel/machine_kexec_64.c |  40 +++
 kernel/kexec.c                     |  83 +++++-
 7 files changed, 770 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/include/asm/crash.h

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
new file mode 100644
index 0000000..2dd2eb8
--- /dev/null
+++ b/arch/x86/include/asm/crash.h
@@ -0,0 +1,9 @@
+#ifndef _ASM_X86_CRASH_H
+#define _ASM_X86_CRASH_H
+
+int load_crashdump_segments(struct kimage *image);
+int crash_copy_backup_region(struct kimage *image);
+int crash_setup_memmap_entries(struct kimage *image,
+		struct boot_params *params);
+
+#endif /* _ASM_X86_CRASH_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9bd6fec..4cbe5f7 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -25,6 +25,8 @@
 #include <asm/ptrace.h>
 #include <asm/bootparam.h>
 
+struct kimage;
+
 /*
  * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
  * I.e. Maximum page that is mapped directly into kernel memory,
@@ -62,6 +64,10 @@
 # define KEXEC_ARCH KEXEC_ARCH_X86_64
 #endif
 
+/* Memory to backup during crash kdump */
+#define KEXEC_BACKUP_SRC_START	(0UL)
+#define KEXEC_BACKUP_SRC_END	(640 * 1024UL)	/* 640K */
+
 /*
  * CPU does not save ss and sp on stack if execution is already
  * running in kernel mode at the time of NMI occurrence. This code
@@ -161,8 +167,21 @@ struct kimage_arch {
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	/* Details of backup region */
+	unsigned long backup_src_start;
+	unsigned long backup_src_sz;
+
+	/* Physical address of backup segment */
+	unsigned long backup_load_addr;
+
+	/* Core ELF header buffer */
+	void *elf_headers;
+	unsigned long elf_headers_sz;
+	unsigned long elf_load_addr;
 };
+#endif /* CONFIG_X86_32 */
 
+#ifdef CONFIG_X86_64
 struct kexec_entry64_regs {
 	uint64_t rax;
 	uint64_t rbx;
@@ -189,11 +208,13 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 
 extern int kexec_setup_initrd(struct boot_params *boot_params,
 		unsigned long initrd_load_addr, unsigned long initrd_len);
-extern int kexec_setup_cmdline(struct boot_params *boot_params,
+extern int kexec_setup_cmdline(struct kimage *image,
+		struct boot_params *boot_params,
 		unsigned long bootparams_load_addr,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len);
-extern int kexec_setup_boot_parameters(struct boot_params *params);
+extern int kexec_setup_boot_parameters(struct kimage *image,
+					struct boot_params *params);
 
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 507de80..b6a0974 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -4,6 +4,9 @@
  * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
  *
  * Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2014. All rights reserved.
+ * Authors:
+ *      Vivek Goyal <vgoyal@redhat.com>
  *
  */
 
@@ -16,6 +19,7 @@
 #include <linux/elf.h>
 #include <linux/elfcore.h>
 #include <linux/module.h>
+#include <linux/slab.h>
 
 #include <asm/processor.h>
 #include <asm/hardirq.h>
@@ -28,6 +32,45 @@
 #include <asm/reboot.h>
 #include <asm/virtext.h>
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+/* This primarily represents the number of split ranges due to exclusion */
+#define CRASH_MAX_RANGES	16
+
+struct crash_mem_range {
+	u64 start, end;
+};
+
+struct crash_mem {
+	unsigned int nr_ranges;
+	struct crash_mem_range ranges[CRASH_MAX_RANGES];
+};
+
+/* Misc data about ram ranges needed to prepare elf headers */
+struct crash_elf_data {
+	struct kimage *image;
+	/*
+	 * Total number of ram ranges we have after various adjustments for
+	 * GART, crash reserved region etc.
+	 */
+	unsigned int max_nr_ranges;
+	unsigned long gart_start, gart_end;
+
+	/* Pointer to elf header */
+	void *ehdr;
+	/* Pointer to next phdr */
+	void *bufp;
+	struct crash_mem mem;
+};
+
+/* Used while preparing memory map entries for second kernel */
+struct crash_memmap_data {
+	struct boot_params *params;
+	/* Type of memory */
+	unsigned int type;
+};
+
 int in_crash_kexec;
 
 /*
@@ -39,6 +82,7 @@ int in_crash_kexec;
  */
 crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
 EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
+unsigned long crash_zero_bytes;
 
 static inline void cpu_crash_vmclear_loaded_vmcss(void)
 {
@@ -135,3 +179,540 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
+
+#ifdef CONFIG_X86_64
+
+static int get_nr_ram_ranges_callback(unsigned long start_pfn,
+				unsigned long nr_pfn, void *arg)
+{
+	int *nr_ranges = arg;
+
+	(*nr_ranges)++;
+	return 0;
+}
+
+static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+
+	ced->gart_start = start;
+	ced->gart_end = end;
+
+	/* Not expecting more than 1 gart aperture */
+	return 1;
+}
+
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static int fill_up_crash_elf_data(struct crash_elf_data *ced,
+					struct kimage *image)
+{
+	unsigned int nr_ranges = 0;
+
+	ced->image = image;
+
+	walk_system_ram_range(0, -1, &nr_ranges,
+				get_nr_ram_ranges_callback);
+
+	ced->max_nr_ranges = nr_ranges;
+
+	/*
+	 * We don't create ELF headers for GART aperture as an attempt
+	 * to dump this memory in second kernel leads to hang/crash.
+	 * If gart aperture is present, one needs to exclude that region
+	 * and that could lead to need of extra phdr.
+	 */
+	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
+				ced, get_gart_ranges_callback);
+
+	/*
+	 * If we have gart region, excluding that could potentially split
+	 * a memory range, resulting in an extra header. Account for that.
+	 */
+	if (ced->gart_end)
+		ced->max_nr_ranges++;
+
+	/* Exclusion of crash region could split memory ranges */
+	ced->max_nr_ranges++;
+
+	/* If crashk_low_res is there, another range split possible */
+	if (crashk_low_res.end != 0)
+		ced->max_nr_ranges++;
+
+	return 0;
+}
+
+static int exclude_mem_range(struct crash_mem *mem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	int i, j;
+	unsigned long long start, end;
+	struct crash_mem_range temp_range = {0, 0};
+
+	for (i = 0; i < mem->nr_ranges; i++) {
+		start = mem->ranges[i].start;
+		end = mem->ranges[i].end;
+
+		if (mstart > end || mend < start)
+			continue;
+
+		/* Truncate any area outside of range */
+		if (mstart < start)
+			mstart = start;
+		if (mend > end)
+			mend = end;
+
+		/* Found completely overlapping range */
+		if (mstart == start && mend == end) {
+			mem->ranges[i].start = 0;
+			mem->ranges[i].end = 0;
+			if (i < mem->nr_ranges - 1) {
+				/* Shift rest of the ranges to left */
+				for (j = i; j < mem->nr_ranges - 1; j++) {
+					mem->ranges[j].start =
+						mem->ranges[j+1].start;
+					mem->ranges[j].end =
+							mem->ranges[j+1].end;
+				}
+			}
+			mem->nr_ranges--;
+			return 0;
+		}
+
+		if (mstart > start && mend < end) {
+			/* Split original range */
+			mem->ranges[i].end = mstart - 1;
+			temp_range.start = mend + 1;
+			temp_range.end = end;
+		} else if (mstart != start)
+			mem->ranges[i].end = mstart - 1;
+		else
+			mem->ranges[i].start = mend + 1;
+		break;
+	}
+
+	/* If a split happened, add the split to the array */
+	if (!temp_range.end)
+		return 0;
+
+	/* Split happened */
+	if (i == CRASH_MAX_RANGES - 1) {
+		pr_err("Too many crash ranges after split\n");
+		return -ENOMEM;
+	}
+
+	/* Location where new range should go */
+	j = i + 1;
+	if (j < mem->nr_ranges) {
+		/* Move over all ranges one place */
+		for (i = mem->nr_ranges - 1; i >= j; i--)
+			mem->ranges[i + 1] = mem->ranges[i];
+	}
+
+	mem->ranges[j].start = temp_range.start;
+	mem->ranges[j].end = temp_range.end;
+	mem->nr_ranges++;
+	return 0;
+}
+
+/*
+ * Look for any unwanted ranges between mstart, mend and remove them. This
+ * might split a range; the resulting ranges go in ced->mem.ranges[].
+ */
+static int elf_header_exclude_ranges(struct crash_elf_data *ced,
+		unsigned long long mstart, unsigned long long mend)
+{
+	struct crash_mem *cmem = &ced->mem;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude crashkernel region */
+	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+	if (ret)
+		return ret;
+
+	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
+	if (ret)
+		return ret;
+
+	/* Exclude GART region */
+	if (ced->gart_end) {
+		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_elf_data *ced = arg;
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long mstart, mend;
+	struct kimage *image = ced->image;
+	struct crash_mem *cmem;
+	int ret, i;
+
+	ehdr = ced->ehdr;
+
+	/* Exclude unwanted mem ranges */
+	ret = elf_header_exclude_ranges(ced, start, end);
+	if (ret)
+		return ret;
+
+	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
+	cmem = &ced->mem;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		mstart = cmem->ranges[i].start;
+		mend = cmem->ranges[i].end;
+
+		phdr = ced->bufp;
+		ced->bufp += sizeof(Elf64_Phdr);
+
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_offset  = mstart;
+
+		/*
+		 * If a range matches backup region, adjust offset to backup
+		 * segment.
+		 */
+		if (mstart == image->arch.backup_src_start &&
+		    (mend - mstart + 1) == image->arch.backup_src_sz)
+			phdr->p_offset = image->arch.backup_load_addr;
+
+		phdr->p_paddr = mstart;
+		phdr->p_vaddr = (unsigned long long) __va(mstart);
+		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+		phdr->p_align = 0;
+		ehdr->e_phnum++;
+		pr_debug("Crash PT_LOAD elf header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
+			ehdr->e_phnum, phdr->p_offset);
+	}
+
+	return ret;
+}
+
+static int prepare_elf64_headers(struct crash_elf_data *ced,
+		void **addr, unsigned long *sz)
+{
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+	unsigned char *buf, *bufp;
+	unsigned int cpu;
+	unsigned long long notes_addr;
+	int ret;
+
+	/* extra phdr for vmcoreinfo elf note */
+	nr_phdr = nr_cpus + 1;
+	nr_phdr += ced->max_nr_ranges;
+
+	/*
+	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
+	 * I think this is required by tools like gdb. So same physical
+	 * memory will be mapped in two elf headers. One will contain kernel
+	 * text virtual addresses and other will have __va(physical) addresses.
+	 */
+
+	nr_phdr++;
+	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+	buf = vzalloc(elf_sz);
+	if (!buf)
+		return -ENOMEM;
+
+	bufp = buf;
+	ehdr = (Elf64_Ehdr *)bufp;
+	bufp += sizeof(Elf64_Ehdr);
+	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+	ehdr->e_type = ET_CORE;
+	ehdr->e_machine = ELF_ARCH;
+	ehdr->e_version = EV_CURRENT;
+	ehdr->e_entry = 0;
+	ehdr->e_phoff = sizeof(Elf64_Ehdr);
+	ehdr->e_shoff = 0;
+	ehdr->e_flags = 0;
+	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+	ehdr->e_phentsize = sizeof(Elf64_Phdr);
+	ehdr->e_phnum = 0;
+	ehdr->e_shentsize = 0;
+	ehdr->e_shnum = 0;
+	ehdr->e_shstrndx = 0;
+
+	/* Prepare one phdr of type PT_NOTE for each present cpu */
+	for_each_present_cpu(cpu) {
+		phdr = (Elf64_Phdr *)bufp;
+		bufp += sizeof(Elf64_Phdr);
+		phdr->p_type = PT_NOTE;
+		phdr->p_flags = 0;
+		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+		phdr->p_offset = phdr->p_paddr = notes_addr;
+		phdr->p_vaddr = 0;
+		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+		phdr->p_align = 0;
+		(ehdr->e_phnum)++;
+	}
+
+	/* Prepare one PT_NOTE header for vmcoreinfo */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_NOTE;
+	phdr->p_flags = 0;
+	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+	phdr->p_vaddr = 0;
+	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+
+#ifdef CONFIG_X86_64
+	/* Prepare PT_LOAD type program header for kernel text region */
+	phdr = (Elf64_Phdr *)bufp;
+	bufp += sizeof(Elf64_Phdr);
+	phdr->p_type = PT_LOAD;
+	phdr->p_flags = PF_R|PF_W|PF_X;
+	phdr->p_vaddr = (Elf64_Addr)_text;
+	phdr->p_filesz = phdr->p_memsz = _end - _text;
+	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+	phdr->p_align = 0;
+	(ehdr->e_phnum)++;
+#endif
+
+	/* Prepare PT_LOAD headers for system ram chunks. */
+	ced->ehdr = ehdr;
+	ced->bufp = bufp;
+	ret = walk_system_ram_res(0, -1, ced,
+			prepare_elf64_ram_headers_callback);
+	if (ret < 0)
+		return ret;
+
+	*addr = buf;
+	*sz = elf_sz;
+	return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(struct kimage *image, void **addr,
+					unsigned long *sz)
+{
+	struct crash_elf_data *ced;
+	int ret;
+
+	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
+	if (!ced)
+		return -ENOMEM;
+
+	ret = fill_up_crash_elf_data(ced, image);
+	if (ret)
+		goto out;
+
+	/* By default prepare 64bit headers */
+	ret = prepare_elf64_headers(ced, addr, sz);
+out:
+	kfree(ced);
+	return ret;
+}
+
+static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
+{
+	unsigned int nr_e820_entries;
+
+	nr_e820_entries = params->e820_entries;
+	if (nr_e820_entries >= E820MAX)
+		return 1;
+
+	memcpy(&params->e820_map[nr_e820_entries], entry,
+			sizeof(struct e820entry));
+	params->e820_entries++;
+	return 0;
+}
+
+static int memmap_entry_callback(u64 start, u64 end, void *arg)
+{
+	struct crash_memmap_data *cmd = arg;
+	struct boot_params *params = cmd->params;
+	struct e820entry ei;
+
+	ei.addr = start;
+	ei.size = end - start + 1;
+	ei.type = cmd->type;
+	add_e820_entry(params, &ei);
+
+	return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+		unsigned long long mstart, unsigned long long mend)
+{
+	unsigned long start, end;
+	int ret = 0;
+
+	memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+	cmem->ranges[0].start = mstart;
+	cmem->ranges[0].end = mend;
+	cmem->nr_ranges = 1;
+
+	/* Exclude Backup region */
+	start = image->arch.backup_load_addr;
+	end = start + image->arch.backup_src_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	if (ret)
+		return ret;
+
+	/* Exclude elf header region */
+	start = image->arch.elf_load_addr;
+	end = start + image->arch.elf_headers_sz - 1;
+	ret = exclude_mem_range(cmem, start, end);
+	return ret;
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+	int i, ret = 0;
+	unsigned long flags;
+	struct e820entry ei;
+	struct crash_memmap_data cmd;
+	struct crash_mem *cmem;
+
+	cmem = vzalloc(sizeof(struct crash_mem));
+	if (!cmem)
+		return -ENOMEM;
+
+	memset(&cmd, 0, sizeof(struct crash_memmap_data));
+	cmd.params = params;
+
+	/* Add first 640K segment */
+	ei.addr = image->arch.backup_src_start;
+	ei.size = image->arch.backup_src_sz;
+	ei.type = E820_RAM;
+	add_e820_entry(params, &ei);
+
+	/* Add ACPI tables */
+	cmd.type = E820_ACPI;
+	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
+
+	/* Add ACPI Non-volatile Storage */
+	cmd.type = E820_NVS;
+	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+			memmap_entry_callback);
+
+	/* Add crashk_low_res region */
+	if (crashk_low_res.end) {
+		ei.addr = crashk_low_res.start;
+		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
+		ei.type = E820_RAM;
+		add_e820_entry(params, &ei);
+	}
+
+	/* Exclude some ranges from crashk_res and add rest to memmap */
+	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
+						crashk_res.end);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < cmem->nr_ranges; i++) {
+		ei.addr = cmem->ranges[i].start;
+		ei.size = cmem->ranges[i].end - ei.addr + 1;
+		ei.type = E820_RAM;
+
+		/* If entry is less than a page, skip it */
+		if (ei.size < PAGE_SIZE)
+			continue;
+		add_e820_entry(params, &ei);
+	}
+
+out:
+	vfree(cmem);
+	return ret;
+}
+
+static int determine_backup_region(u64 start, u64 end, void *arg)
+{
+	struct kimage *image = arg;
+
+	image->arch.backup_src_start = start;
+	image->arch.backup_src_sz = end - start + 1;
+
+	/* Expecting only one range for backup region */
+	return 1;
+}
+
+int load_crashdump_segments(struct kimage *image)
+{
+	unsigned long src_start, src_sz, elf_sz;
+	void *elf_addr;
+	int ret;
+
+	/*
+	 * Determine and load a segment for backup area. First 640K RAM
+	 * region is backup source
+	 */
+
+	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
+				image, determine_backup_region);
+
+	/* Zero or positive return values are ok */
+	if (ret < 0)
+		return ret;
+
+	src_start = image->arch.backup_src_start;
+	src_sz = image->arch.backup_src_sz;
+
+	/* Add backup segment. */
+	if (src_sz) {
+		/*
+		 * Ideally there is no source for backup segment. This is
+		 * copied in purgatory after crash. Just add a zero filled
+		 * segment for now to make sure checksum logic works fine.
+		 */
+		ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
+				       sizeof(crash_zero_bytes), src_sz,
+				       PAGE_SIZE, 0, -1, 0,
+				       &image->arch.backup_load_addr);
+		if (ret)
+			return ret;
+		pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
+			 image->arch.backup_load_addr, src_start, src_sz);
+	}
+
+	/* Prepare elf headers and add a segment */
+	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+	if (ret)
+		return ret;
+
+	image->arch.elf_headers = elf_addr;
+	image->arch.elf_headers_sz = elf_sz;
+
+	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
+			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
+			&image->arch.elf_load_addr);
+	if (ret) {
+		vfree((void *)image->arch.elf_headers);
+		return ret;
+	}
+	pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 image->arch.elf_load_addr, elf_sz, elf_sz);
+
+	return ret;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
index 0750784..8e762d3 100644
--- a/arch/x86/kernel/kexec-bzimage.c
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -18,6 +18,9 @@
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
+#include <asm/crash.h>
+
+#define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
 
 /*
  * Defines lowest physical address for various segments. Not sure where
@@ -130,11 +133,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 		return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * In case of crash dump, we will append elfcorehdr=<addr> to
+	 * command line. Make sure it does not overflow
+	 */
+	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+		ret = -EINVAL;
+		pr_debug("Kernel command line too long\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	/* Allocate loader specific data */
 	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
 	if (!ldata)
 		return ERR_PTR(-ENOMEM);
 
+	/* Allocate and load backup region */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = load_crashdump_segments(image);
+		if (ret)
+			goto out_free_loader_data;
+	}
+
 	/*
 	 * Load purgatory. For 64bit entry point, purgatory  code can be
 	 * anywhere.
@@ -149,7 +169,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
 
 	/* Load Bootparams and cmdline */
-	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
+				MAX_ELFCOREHDR_STR_LEN;
 	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
 	if (!params) {
 		ret = -ENOMEM;
@@ -201,7 +222,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 			goto out_free_params;
 	}
 
-	ret = kexec_setup_cmdline(params, bootparam_load_addr,
+	ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
 				  sizeof(struct boot_params), cmdline,
 				  cmdline_len);
 	if (ret)
@@ -233,7 +254,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	if (ret)
 		goto out_free_params;
 
-	ret = kexec_setup_boot_parameters(params);
+	ret = kexec_setup_boot_parameters(image, params);
 	if (ret)
 		goto out_free_params;
 
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
index 7de3239..6a3821b 100644
--- a/arch/x86/kernel/machine_kexec.c
+++ b/arch/x86/kernel/machine_kexec.c
@@ -10,9 +10,11 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/kexec.h>
 #include <linux/string.h>
 #include <asm/bootparam.h>
 #include <asm/setup.h>
+#include <asm/crash.h>
 
 /*
  * Common code for x86 and x86_64 used for kexec.
@@ -36,18 +38,24 @@ int kexec_setup_initrd(struct boot_params *params,
 	return 0;
 }
 
-int kexec_setup_cmdline(struct boot_params *params,
+int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
 		unsigned long bootparams_load_addr,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len)
 {
 	char *cmdline_ptr = ((char *)params) + cmdline_offset;
-	unsigned long cmdline_ptr_phys;
+	unsigned long cmdline_ptr_phys, len;
 	uint32_t cmdline_low_32, cmdline_ext_32;
 
 	memcpy(cmdline_ptr, cmdline, cmdline_len);
+	if (image->type == KEXEC_TYPE_CRASH) {
+		len = sprintf(cmdline_ptr + cmdline_len - 1,
+			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
+		cmdline_len += len;
+	}
 	cmdline_ptr[cmdline_len - 1] = '\0';
 
+	pr_debug("Final command line is: %s\n", cmdline_ptr);
 	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
 	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
 	cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -75,7 +83,8 @@ static int setup_memory_map_entries(struct boot_params *params)
 	return 0;
 }
 
-int kexec_setup_boot_parameters(struct boot_params *params)
+int kexec_setup_boot_parameters(struct kimage *image,
+				struct boot_params *params)
 {
 	unsigned int nr_e820_entries;
 	unsigned long long mem_k, start, end;
@@ -102,7 +111,11 @@ int kexec_setup_boot_parameters(struct boot_params *params)
 	/* Default sysdesc table */
 	params->sys_desc_table.length = 0;
 
-	setup_memory_map_entries(params);
+	if (image->type == KEXEC_TYPE_CRASH)
+		crash_setup_memmap_entries(image, params);
+	else
+		setup_memory_map_entries(params);
+
 	nr_e820_entries = params->e820_entries;
 
 	for (i = 0; i < nr_e820_entries; i++) {
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a66fae3..07e4b60 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -179,6 +179,38 @@ static void load_segments(void)
 		);
 }
 
+/* Update purgatory as needed after various image segments have been prepared */
+static int arch_update_purgatory(struct kimage *image)
+{
+	int ret = 0;
+
+	if (!image->file_mode)
+		return 0;
+
+	/* Setup copying of backup region */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
+				&image->arch.backup_load_addr,
+				sizeof(image->arch.backup_load_addr), 0);
+		if (ret)
+			return ret;
+
+		ret = kexec_purgatory_get_set_symbol(image, "backup_src",
+				&image->arch.backup_src_start,
+				sizeof(image->arch.backup_src_start), 0);
+		if (ret)
+			return ret;
+
+		ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
+				&image->arch.backup_src_sz,
+				sizeof(image->arch.backup_src_sz), 0);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
 int machine_kexec_prepare(struct kimage *image)
 {
 	unsigned long start_pgtable;
@@ -192,6 +224,11 @@ int machine_kexec_prepare(struct kimage *image)
 	if (result)
 		return result;
 
+	/* update purgatory as needed */
+	result = arch_update_purgatory(image);
+	if (result)
+		return result;
+
 	return 0;
 }
 
@@ -334,6 +371,9 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 	if (idx < 0)
 		return 0;
 
+	vfree(image->arch.elf_headers);
+	image->arch.elf_headers = NULL;
+
 	if (kexec_file_type[idx].cleanup)
 		return kexec_file_type[idx].cleanup(image);
 	return 0;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7f2e393..52e45cb 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -546,7 +546,6 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
 	*rimage = image;
 	return 0;
 
-
 out_free_control_pages:
 	kimage_free_page_list(&image->control_pages);
 out_free_image:
@@ -554,6 +553,54 @@ out_free_image:
 	return result;
 }
 
+static int kimage_file_crash_alloc(struct kimage **rimage, int kernel_fd,
+		int initrd_fd, const char __user *cmdline_ptr,
+		unsigned long cmdline_len)
+{
+	int result;
+	struct kimage *image;
+
+	/* Allocate and initialize a controlling structure */
+	image = do_kimage_alloc_init();
+	if (!image)
+		return -ENOMEM;
+
+	image->file_mode = 1;
+	image->file_handler_idx = -1;
+
+	/* Enable the special crash kernel control page allocation policy. */
+	image->control_page = crashk_res.start;
+	image->type = KEXEC_TYPE_CRASH;
+
+	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+			cmdline_ptr, cmdline_len);
+	if (result)
+		goto out_free_image;
+
+	result = sanity_check_segment_list(image);
+	if (result)
+		goto out_free_post_load_bufs;
+
+	result = -ENOMEM;
+	image->control_code_page = kimage_alloc_control_pages(image,
+					   get_order(KEXEC_CONTROL_PAGE_SIZE));
+	if (!image->control_code_page) {
+		pr_err("Could not allocate control_code_buffer\n");
+		goto out_free_post_load_bufs;
+	}
+
+	*rimage = image;
+	return 0;
+
+out_free_post_load_bufs:
+	kimage_file_post_load_cleanup(image);
+	kfree(image->image_loader_data);
+out_free_image:
+	kfree(image);
+	return result;
+}
+
+
 static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
 				unsigned long nr_segments,
 				struct kexec_segment __user *segments)
@@ -1135,10 +1182,15 @@ static int kimage_load_crash_segment(struct kimage *image,
 	unsigned long maddr;
 	size_t ubytes, mbytes;
 	int result;
-	unsigned char __user *buf;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
 
 	result = 0;
-	buf = segment->buf;
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
+
 	ubytes = segment->bufsz;
 	mbytes = segment->memsz;
 	maddr = segment->mem;
@@ -1161,7 +1213,12 @@ static int kimage_load_crash_segment(struct kimage *image,
 			/* Zero the trailing part of the page */
 			memset(ptr + uchunk, 0, mchunk - uchunk);
 		}
-		result = copy_from_user(ptr, buf, uchunk);
+
+		/* For file based kexec, source pages are in kernel memory */
+		if (image->file_mode)
+			memcpy(ptr, kbuf, uchunk);
+		else
+			result = copy_from_user(ptr, buf, uchunk);
 		kexec_flush_icache_page(page);
 		kunmap(page);
 		if (result) {
@@ -1170,7 +1227,10 @@ static int kimage_load_crash_segment(struct kimage *image,
 		}
 		ubytes -= uchunk;
 		maddr  += mchunk;
-		buf    += mchunk;
+		if (image->file_mode)
+			kbuf += mchunk;
+		else
+			buf += mchunk;
 		mbytes -= mchunk;
 	}
 out:
@@ -1388,7 +1448,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	if (flags & KEXEC_FILE_UNLOAD)
 		goto exchange;
 
-	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+	if (flags & KEXEC_FILE_ON_CRASH)
+		ret = kimage_file_crash_alloc(&image, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len);
+	else
+		ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
 				cmdline_ptr, cmdline_len);
 	if (ret)
 		goto out;
@@ -2143,7 +2207,12 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
 	kbuf->top_down = top_down;
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+	if (image->type == KEXEC_TYPE_CRASH)
+		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
+				crashk_res.start, crashk_res.end, kbuf,
+				walk_ram_range_callback);
+	else
+		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
 	/*
 	 * If range could be found successfully, it would have incremented
-- 
1.9.0
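Both elf_header_exclude_ranges() and memmap_exclude_ranges() in this patch rely on the same exclusion step. It removes a window (crashk_res, crashk_low_res, the GART aperture, or the backup and ELF header segments) from a list of inclusive [start, end] ranges. When the window falls strictly inside a range, that range is split in two. The sketch below is standalone and uses hypothetical type and function names.

There is one deliberate difference from the patch. The patch's exclude_mem_range() stops after the first overlapping range. This version keeps scanning, so it also handles a window that spans several ranges.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 16   /* mirrors CRASH_MAX_RANGES in the patch */

struct mem_range { unsigned long long start, end; };   /* inclusive */
struct mem_set { unsigned int nr; struct mem_range r[MAX_RANGES]; };

/*
 * Hypothetical helper: remove [xs, xe] (inclusive) from every range in m.
 * Returns 0 on success, -1 if a split would overflow the fixed array.
 */
static int exclude_range(struct mem_set *m, unsigned long long xs,
			 unsigned long long xe)
{
	unsigned int i = 0;

	assert(xs <= xe);
	while (i < m->nr) {
		struct mem_range *cur = &m->r[i];

		if (xe < cur->start || xs > cur->end) {
			i++;                    /* no overlap */
			continue;
		}
		if (xs <= cur->start && xe >= cur->end) {
			/* Fully covered: drop it and shift the tail left. */
			for (unsigned int j = i; j + 1 < m->nr; j++)
				m->r[j] = m->r[j + 1];
			m->nr--;
			continue;               /* re-examine slot i */
		}
		if (xs > cur->start && xe < cur->end) {
			/* Hole in the middle: split into two ranges. */
			if (m->nr == MAX_RANGES)
				return -1;
			for (unsigned int j = m->nr; j > i + 1; j--)
				m->r[j] = m->r[j - 1];
			m->r[i + 1].start = xe + 1;
			m->r[i + 1].end = cur->end;
			cur->end = xs - 1;
			m->nr++;
			i += 2;
			continue;
		}
		if (xs > cur->start)
			cur->end = xs - 1;      /* trim the tail */
		else
			cur->start = xe + 1;    /* trim the head */
		i++;
	}
	return 0;
}
```

For example, removing [100, 199] from [0, 999] leaves [0, 99] and [200, 999]. This is why fill_up_crash_elf_data() reserves one extra program header for each region it may exclude.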


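For crash images, bzImage64_load() reserves MAX_ELFCOREHDR_STR_LEN (30) extra bytes. kexec_setup_cmdline() then writes " elfcorehdr=0x<addr>" starting at the position of the old terminating NUL. The value 30 is an exact fit:

- the fixed text " elfcorehdr=0x" is 14 characters;
- a 64-bit address needs at most 16 hex digits;
- the new NUL reuses the old terminator's slot.

The sketch below does the same append in user space, with a size check done up front. The helper name append_elfcorehdr() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ELFCOREHDR_STR_LEN 30   /* value used in the patch */

/*
 * Hypothetical helper: append " elfcorehdr=0x<addr>" to the NUL-terminated
 * command line in buf (cmdline_len counts the trailing NUL, as in the patch).
 * Returns the new length including the NUL, or 0 if it would not fit in
 * bufsz bytes (buf is left untouched in that case).
 */
static size_t append_elfcorehdr(char *buf, size_t bufsz, size_t cmdline_len,
				unsigned long long addr)
{
	int n;

	assert(cmdline_len >= 1 && strlen(buf) + 1 == cmdline_len);

	/* Measure first: snprintf(NULL, 0, ...) returns the length needed. */
	n = snprintf(NULL, 0, " elfcorehdr=0x%llx", addr);
	if (n < 0 || cmdline_len + (size_t)n > bufsz)
		return 0;

	/* Overwrite the old NUL; sprintf writes a new one after the text. */
	sprintf(buf + cmdline_len - 1, " elfcorehdr=0x%llx", addr);
	return cmdline_len + (size_t)n;
}
```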


* [PATCH 13/13] kexec: Support kexec/kdump on EFI systems
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:07   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:07 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe,
	akpm, Vivek Goyal

This patch does two things. First, it passes the EFI runtime mappings to
the second kernel in bootparams efi_info. The second kernel parses this
info and creates the same mappings, so the mappings in the first and
second kernels will be identical. This paves the way to enable EFI in the
kexec kernel.

Second, it prepares and passes EFI setup data through bootparams. This
contains information about various EFI tables and their addresses.

This information gathering and passing has been written along the lines
of what kexec-tools currently does to make kexec work with UEFI.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kexec.h       |  4 +-
 arch/x86/kernel/kexec-bzimage.c    | 40 ++++++++++++----
 arch/x86/kernel/machine_kexec.c    | 93 ++++++++++++++++++++++++++++++++++++--
 drivers/firmware/efi/runtime-map.c | 21 +++++++++
 include/linux/efi.h                | 19 ++++++++
 5 files changed, 163 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 4cbe5f7..d8461cf 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -214,7 +214,9 @@ extern int kexec_setup_cmdline(struct kimage *image,
 		unsigned long cmdline_offset, char *cmdline,
 		unsigned long cmdline_len);
 extern int kexec_setup_boot_parameters(struct kimage *image,
-					struct boot_params *params);
+		struct boot_params *params, unsigned long params_load_addr,
+		unsigned int efi_map_offset, unsigned int efi_map_sz,
+		unsigned int efi_setup_data_offset);
 
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
index 8e762d3..55716e1 100644
--- a/arch/x86/kernel/kexec-bzimage.c
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -15,10 +15,12 @@
 #include <linux/kexec.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/efi.h>
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
 #include <asm/crash.h>
+#include <asm/efi.h>
 
 #define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
 
@@ -106,7 +108,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 
 	struct setup_header *header;
 	int setup_sects, kern16_size, ret = 0;
-	unsigned long setup_header_size, params_cmdline_sz;
+	unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
 	struct boot_params *params;
 	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
 	unsigned long purgatory_load_addr;
@@ -116,6 +118,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	struct kexec_entry64_regs regs64;
 	void *stack;
 	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+	unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
 
 	header = (struct setup_header *)(kernel + setup_hdr_offset);
 	setup_sects = header->setup_sects;
@@ -168,28 +171,47 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 
 	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
 
-	/* Load Bootparams and cmdline */
+
+	/*
+	 * Load bootparams, cmdline and space for EFI data.
+	 *
+	 * Allocate memory for multiple data structures together so
+	 * that they can all go in a single area/segment and we don't
+	 * have to create a separate segment for each. This keeps things
+	 * a little bit simpler.
+	 */
+	efi_map_sz = get_efi_runtime_map_size();
+	efi_map_sz = ALIGN(efi_map_sz, 16);
+
 	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
 				MAX_ELFCOREHDR_STR_LEN;
-	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+	params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
+	params_misc_sz = params_cmdline_sz + efi_map_sz +
+				sizeof(struct setup_data) +
+				sizeof(struct efi_setup_data);
+
+	params = kzalloc(params_misc_sz, GFP_KERNEL);
 	if (!params) {
 		ret = -ENOMEM;
 		goto out_free_loader_data;
 	}
 
+	efi_map_offset = params_cmdline_sz;
+	efi_setup_data_offset = efi_map_offset + efi_map_sz;
+
 	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
 	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
 
 	/* Is there a limit on setup header size? */
 	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
 
-	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
-			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
+	ret = kexec_add_buffer(image, (char *)params, params_misc_sz,
+			       params_misc_sz, 16, MIN_BOOTPARAM_ADDR,
 			       ULONG_MAX, 1, &bootparam_load_addr);
 	if (ret)
 		goto out_free_params;
-	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
+	pr_debug("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		 bootparam_load_addr, params_misc_sz, params_misc_sz);
 
 	/* Load kernel */
 	kernel_buf = kernel + kern16_size;
@@ -254,7 +276,9 @@ void *bzImage64_load(struct kimage *image, char *kernel,
 	if (ret)
 		goto out_free_params;
 
-	ret = kexec_setup_boot_parameters(image, params);
+	ret = kexec_setup_boot_parameters(image, params, bootparam_load_addr,
+					  efi_map_offset, efi_map_sz,
+					  efi_setup_data_offset);
 	if (ret)
 		goto out_free_params;
 
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
index 6a3821b..f31a4b5 100644
--- a/arch/x86/kernel/machine_kexec.c
+++ b/arch/x86/kernel/machine_kexec.c
@@ -12,9 +12,11 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/string.h>
+#include <linux/efi.h>
 #include <asm/bootparam.h>
 #include <asm/setup.h>
 #include <asm/crash.h>
+#include <asm/efi.h>
 
 /*
  * Common code for x86 and x86_64 used for kexec.
@@ -67,11 +69,10 @@ int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
 	return 0;
 }
 
-static int setup_memory_map_entries(struct boot_params *params)
+static int setup_e820_entries(struct boot_params *params)
 {
 	unsigned int nr_e820_entries;
 
-	/* TODO: What about EFI */
 	nr_e820_entries = e820_saved.nr_map;
 	if (nr_e820_entries > E820MAX)
 		nr_e820_entries = E820MAX;
@@ -83,8 +84,85 @@ static int setup_memory_map_entries(struct boot_params *params)
 	return 0;
 }
 
-int kexec_setup_boot_parameters(struct kimage *image,
-				struct boot_params *params)
+#ifdef CONFIG_EFI
+static int setup_efi_info_memmap(struct boot_params *params,
+				  unsigned long params_load_addr,
+				  unsigned int efi_map_offset,
+				  unsigned int efi_map_sz)
+{
+	void *efi_map = (void *)params + efi_map_offset;
+	unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
+	struct efi_info *ei = &params->efi_info;
+
+	if (!efi_map_sz)
+		return 0;
+
+	efi_runtime_map_copy(efi_map, efi_map_sz);
+
+	ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
+	ei->efi_memmap_hi = efi_map_phys_addr >> 32;
+	ei->efi_memmap_size = efi_map_sz;
+
+	return 0;
+}
+
+static int
+prepare_add_efi_setup_data(struct boot_params *params,
+		       unsigned long params_load_addr,
+		       unsigned int efi_setup_data_offset)
+{
+	unsigned long setup_data_phys;
+	struct setup_data *sd = (void *)params + efi_setup_data_offset;
+	struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
+
+	esd->fw_vendor = efi.fw_vendor;
+	esd->runtime = efi.runtime;
+	esd->tables = efi.config_table;
+	esd->smbios = efi.smbios;
+
+	sd->type = SETUP_EFI;
+	sd->len = sizeof(struct efi_setup_data);
+
+	/* Add setup data */
+	setup_data_phys = params_load_addr + efi_setup_data_offset;
+	sd->next = params->hdr.setup_data;
+	params->hdr.setup_data = setup_data_phys;
+
+	return 0;
+}
+
+static int setup_efi_state(struct boot_params *params,
+			unsigned long params_load_addr,
+			unsigned int efi_map_offset, unsigned int efi_map_sz,
+			unsigned int efi_setup_data_offset)
+{
+	struct efi_info *current_ei = &boot_params.efi_info;
+	struct efi_info *ei = &params->efi_info;
+
+	if (!current_ei->efi_memmap_size)
+		return 0;
+
+	ei->efi_loader_signature = current_ei->efi_loader_signature;
+	ei->efi_systab = current_ei->efi_systab;
+	ei->efi_systab_hi = current_ei->efi_systab_hi;
+
+	ei->efi_memdesc_version = current_ei->efi_memdesc_version;
+	ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
+
+	setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
+			      efi_map_sz);
+	prepare_add_efi_setup_data(params, params_load_addr,
+				   efi_setup_data_offset);
+	return 0;
+}
+#endif /* CONFIG_EFI */
+
+int
+kexec_setup_boot_parameters(struct kimage *image, struct boot_params *params,
+			    unsigned long params_load_addr,
+			    unsigned int efi_map_offset,
+			    unsigned int efi_map_sz,
+			    unsigned int efi_setup_data_offset)
 {
 	unsigned int nr_e820_entries;
 	unsigned long long mem_k, start, end;
@@ -114,7 +192,7 @@ int kexec_setup_boot_parameters(struct kimage *image,
 	if (image->type == KEXEC_TYPE_CRASH)
 		crash_setup_memmap_entries(image, params);
 	else
-		setup_memory_map_entries(params);
+		setup_e820_entries(params);
 
 	nr_e820_entries = params->e820_entries;
 
@@ -135,6 +213,11 @@ int kexec_setup_boot_parameters(struct kimage *image,
 		}
 	}
 
+#ifdef CONFIG_EFI
+	/* Setup EFI state */
+	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
+			efi_setup_data_offset);
+#endif
 	/* Setup EDD info */
 	memcpy(params->eddbuf, boot_params.eddbuf,
 				EDDMAXNR * sizeof(struct edd_info));
diff --git a/drivers/firmware/efi/runtime-map.c b/drivers/firmware/efi/runtime-map.c
index 97cdd16..40f2213 100644
--- a/drivers/firmware/efi/runtime-map.c
+++ b/drivers/firmware/efi/runtime-map.c
@@ -138,6 +138,27 @@ add_sysfs_runtime_map_entry(struct kobject *kobj, int nr)
 	return entry;
 }
 
+int get_efi_runtime_map_size(void)
+{
+	return nr_efi_runtime_map * efi_memdesc_size;
+}
+
+int get_efi_runtime_map_desc_size(void)
+{
+	return efi_memdesc_size;
+}
+
+int efi_runtime_map_copy(void *buf, size_t bufsz)
+{
+	size_t sz = get_efi_runtime_map_size();
+
+	if (sz > bufsz)
+		sz = bufsz;
+
+	memcpy(buf, efi_runtime_map, sz);
+	return 0;
+}
+
 void efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size)
 {
 	efi_runtime_map = map;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 6c100ff..c2e5c4a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -1131,6 +1131,9 @@ int efivars_sysfs_init(void);
 #ifdef CONFIG_EFI_RUNTIME_MAP
 int efi_runtime_map_init(struct kobject *);
 void efi_runtime_map_setup(void *, int, u32);
+int get_efi_runtime_map_size(void);
+int get_efi_runtime_map_desc_size(void);
+int efi_runtime_map_copy(void *buf, size_t bufsz);
 #else
 static inline int efi_runtime_map_init(struct kobject *kobj)
 {
@@ -1139,6 +1142,22 @@ static inline int efi_runtime_map_init(struct kobject *kobj)
 
 static inline void
 efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}
+
+static inline int get_efi_runtime_map_size(void)
+{
+	return 0;
+}
+
+static inline int get_efi_runtime_map_desc_size(void)
+{
+	return 0;
+}
+
+static inline int efi_runtime_map_copy(void *buf, size_t bufsz)
+{
+	return 0;
+}
+
 #endif
 
 #endif /* _LINUX_EFI_H */
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-03 13:12   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 13:12 UTC (permalink / raw)
  To: linux-kernel, kexec
  Cc: ebiederm, hpa, mjg59, greg, bp, jkosina, dyoung, chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> Hi,
> 
> This is V3 of the patchset. Previous versions were posted here.
> 
> V1: https://lkml.org/lkml/2013/11/20/540
> V2: https://lkml.org/lkml/2014/1/27/331
> 
> Changes since v2:
> 
> - Took care of most of the review comments from V2.
> - Added support for kexec/kdump on EFI systems.
> - Dropped support for loading ELF vmlinux.
> 
> This patch series is generated on top of 3.15.0-rc8. It also requires a
> two patch cleanup series which is sitting in -tip tree here.

I used the following kexec-tools patches to test the kernel changes.

Thanks
Vivek


kexec-tools: Provide an option to make use of new system call

This patch provides an option, --use-kexec2-syscall, to force use of the
new system call for kexec. The default is to continue using the old syscall.
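
Because -u or -p may appear on the command line before --use-kexec2-syscall,
main() in this patch scans the options twice: a first getopt_long() pass only
records the mode, then optind is reset and the real pass runs. A minimal
sketch of that pattern, with hypothetical option constants and a hypothetical
demo_parse() helper that returns which kind of unload the second pass would
request. Unlike the patch, it sets opterr = 0 in the first pass so unknown
options are reported only once.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option values for this sketch only. */
#define DEMO_OPT_KEXEC2 's'
#define DEMO_OPT_UNLOAD 'u'

static const struct option demo_options[] = {
	{ "use-kexec2-syscall", no_argument, NULL, DEMO_OPT_KEXEC2 },
	{ "unload",             no_argument, NULL, DEMO_OPT_UNLOAD },
	{ NULL, 0, NULL, 0 },
};

/* Returns 0 = no unload, 1 = old-syscall unload, 2 = new-syscall unload. */
static int demo_parse(int argc, char **argv)
{
	int opt, use_kexec2 = 0, unload = 0;

	/* Pass 1: only look for the mode switch and stay quiet otherwise. */
	opterr = 0;
	optind = 1;
	while ((opt = getopt_long(argc, argv, "su", demo_options, NULL)) != -1)
		if (opt == DEMO_OPT_KEXEC2)
			use_kexec2 = 1;

	/* Pass 2: the real parse, which can now depend on the mode. */
	opterr = 1;
	optind = 1;
	while ((opt = getopt_long(argc, argv, "su", demo_options, NULL)) != -1) {
		switch (opt) {
		case DEMO_OPT_UNLOAD:
			unload = use_kexec2 ? 2 : 1;
			break;
		default: /* DEMO_OPT_KEXEC2 was already handled in pass 1 */
			break;
		}
	}
	return unload;
}
```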

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 kexec/arch/x86_64/kexec-bzImage64.c |   86 +++++++++++++++++++++++++
 kexec/kexec-syscall.h               |   31 +++++++++
 kexec/kexec.c                       |  123 +++++++++++++++++++++++++++++++++++-
 kexec/kexec.h                       |    9 ++
 4 files changed, 246 insertions(+), 3 deletions(-)

Index: kexec-tools/kexec/kexec.c
===================================================================
--- kexec-tools.orig/kexec/kexec.c	2014-06-02 14:34:16.719774316 -0400
+++ kexec-tools/kexec/kexec.c	2014-06-02 14:34:42.009036315 -0400
@@ -51,6 +51,7 @@
 unsigned long long mem_min = 0;
 unsigned long long mem_max = ULONG_MAX;
 static unsigned long kexec_flags = 0;
+static unsigned long kexec2_flags = 0;
 int kexec_debug = 0;
 
 void dbgprint_mem_range(const char *prefix, struct memory_range *mr, int nr_mr)
@@ -787,6 +788,19 @@ static int my_load(const char *type, int
 	return result;
 }
 
+static int kexec2_unload(unsigned long kexec2_flags)
+{
+	int ret = 0;
+
+	ret = kexec_file_load(-1, -1, NULL, 0, kexec2_flags);
+	if (ret != 0) {
+		/* The unload failed, print some debugging information */
+		fprintf(stderr, "kexec_file_load(unload) failed: %s\n",
+			strerror(errno));
+	}
+	return ret;
+}
+
 static int k_unload (unsigned long kexec_flags)
 {
 	int result;
@@ -925,6 +939,7 @@ void usage(void)
 	       "                      (0 means it's not jump back or\n"
 	       "                      preserve context)\n"
 	       "                      to original kernel.\n"
+	       " -s --use-kexec2-syscall Use new syscall for kexec operation\n"
 	       " -d, --debug           Enable debugging to help spot a failure.\n"
 	       "\n"
 	       "Supported kernel file types and options: \n");
@@ -1072,6 +1087,75 @@ char *concat_cmdline(const char *base, c
 	return cmdline;
 }
 
+/* New file based kexec system call related code */
+static int kexec2_load(int fileind, int argc, char **argv,
+			unsigned long flags) {
+
+	char *kernel;
+	int kernel_fd, i;
+	struct kexec_info info;
+	int ret = 0;
+	char *kernel_buf;
+	off_t kernel_size;
+
+	memset(&info, 0, sizeof(info));
+	info.segment = NULL;
+	info.nr_segments = 0;
+	info.entry = NULL;
+	info.backup_start = 0;
+	info.kexec_flags = flags;
+
+	info.file_mode = 1;
+	info.initrd_fd = -1;
+
+	if (argc - fileind <= 0) {
+		fprintf(stderr, "No kernel specified\n");
+		usage();
+		return -1;
+	}
+
+	kernel = argv[fileind];
+
+	kernel_fd = open(kernel, O_RDONLY);
+	if (kernel_fd == -1) {
+		fprintf(stderr, "Failed to open file %s:%s\n", kernel,
+				strerror(errno));
+		return -1;
+	}
+
+	/* slurp in the input kernel */
+	kernel_buf = slurp_decompress_file(kernel, &kernel_size);
+
+	for (i = 0; i < file_types; i++) {
+		if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
+			break;
+	}
+
+	if (i == file_types) {
+		fprintf(stderr, "Cannot determine the file type " "of %s\n",
+				kernel);
+		return -1;
+	}
+
+	ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
+	if (ret < 0) {
+		fprintf(stderr, "Cannot load %s\n", kernel);
+		return ret;
+	}
+
+	if (!is_kexec_file_load_implemented()) {
+		fprintf(stderr, "syscall kexec_file_load not available.\n");
+		return -1;
+	}
+
+	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
+			info.command_line_len, info.kexec_flags);
+	if (ret != 0)
+		fprintf(stderr, "kexec_file_load failed: %s\n",
+					strerror(errno));
+	return ret;
+}
+
 
 int main(int argc, char *argv[])
 {
@@ -1083,6 +1167,7 @@ int main(int argc, char *argv[])
 	int do_ifdown = 0;
 	int do_unload = 0;
 	int do_reuse_initrd = 0;
+	int do_use_kexec2_syscall = 0;
 	void *entry = 0;
 	char *type = 0;
 	char *endptr;
@@ -1095,6 +1180,23 @@ int main(int argc, char *argv[])
 	};
 	static const char short_options[] = KEXEC_ALL_OPT_STR;
 
+	/*
+	 * First check if --use-kexec2-syscall is set. That changes a lot
+	 * of things.
+	 */
+	while ((opt = getopt_long(argc, argv, short_options,
+				  options, 0)) != -1) {
+		switch(opt) {
+		case OPT_USE_KEXEC2_SYSCALL:
+			do_use_kexec2_syscall = 1;
+			break;
+		}
+	}
+
+	/* Reset getopt for the next pass. */
+	opterr = 1;
+	optind = 1;
+
 	while ((opt = getopt_long(argc, argv, short_options,
 				  options, 0)) != -1) {
 		switch(opt) {
@@ -1127,6 +1229,8 @@ int main(int argc, char *argv[])
 			do_shutdown = 0;
 			do_sync = 0;
 			do_unload = 1;
+			if (do_use_kexec2_syscall)
+				kexec2_flags |= KEXEC_FILE_UNLOAD;
 			break;
 		case OPT_EXEC:
 			do_load = 0;
@@ -1169,7 +1273,10 @@ int main(int argc, char *argv[])
 			do_exec = 0;
 			do_shutdown = 0;
 			do_sync = 0;
-			kexec_flags = KEXEC_ON_CRASH;
+			if (do_use_kexec2_syscall)
+				kexec2_flags |= KEXEC_FILE_ON_CRASH;
+			else
+				kexec_flags = KEXEC_ON_CRASH;
 			break;
 		case OPT_MEM_MIN:
 			mem_min = strtoul(optarg, &endptr, 0);
@@ -1194,6 +1301,9 @@ int main(int argc, char *argv[])
 		case OPT_REUSE_INITRD:
 			do_reuse_initrd = 1;
 			break;
+		case OPT_USE_KEXEC2_SYSCALL:
+			/* We already parsed it. Nothing to do. */
+			break;
 		default:
 			break;
 		}
@@ -1238,10 +1348,17 @@ int main(int argc, char *argv[])
 	}
 
 	if (do_unload) {
-		result = k_unload(kexec_flags);
+		if (do_use_kexec2_syscall)
+			result = kexec2_unload(kexec2_flags);
+		else
+			result = k_unload(kexec_flags);
 	}
 	if (do_load && (result == 0)) {
-		result = my_load(type, fileind, argc, argv, kexec_flags, entry);
+		if (do_use_kexec2_syscall)
+			result = kexec2_load(fileind, argc, argv, kexec2_flags);
+		else
+			result = my_load(type, fileind, argc, argv,
+						kexec_flags, entry);
 	}
 	/* Don't shutdown unless there is something to reboot to! */
 	if ((result == 0) && (do_shutdown || do_exec) && !kexec_loaded()) {
Index: kexec-tools/kexec/kexec.h
===================================================================
--- kexec-tools.orig/kexec/kexec.h	2014-06-02 14:34:16.719774316 -0400
+++ kexec-tools/kexec/kexec.h	2014-06-02 14:34:42.010036325 -0400
@@ -156,6 +156,13 @@ struct kexec_info {
 	unsigned long kexec_flags;
 	unsigned long backup_src_start;
 	unsigned long backup_src_size;
+	/* Set to 1 if we are using kexec2 syscall */
+	unsigned long file_mode :1;
+
+	/* Filled by kernel image processing code */
+	int initrd_fd;
+	char *command_line;
+	int command_line_len;
 };
 
 struct arch_map_entry {
@@ -207,6 +214,7 @@ extern int file_types;
 #define OPT_UNLOAD		'u'
 #define OPT_TYPE		't'
 #define OPT_PANIC		'p'
+#define OPT_USE_KEXEC2_SYSCALL	's'
 #define OPT_MEM_MIN             256
 #define OPT_MEM_MAX             257
 #define OPT_REUSE_INITRD	258
@@ -230,6 +238,7 @@ extern int file_types;
 	{ "mem-min",		1, 0, OPT_MEM_MIN }, \
 	{ "mem-max",		1, 0, OPT_MEM_MAX }, \
 	{ "reuseinitrd",	0, 0, OPT_REUSE_INITRD }, \
+	{ "use-kexec2-syscall",	0, 0, OPT_USE_KEXEC2_SYSCALL }, \
 	{ "debug",		0, 0, OPT_DEBUG }, \
 
 #define KEXEC_OPT_STR "h?vdfxluet:p"
Index: kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c
===================================================================
--- kexec-tools.orig/kexec/arch/x86_64/kexec-bzImage64.c	2014-06-02 14:34:16.719774316 -0400
+++ kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c	2014-06-02 14:34:42.011036336 -0400
@@ -235,6 +235,89 @@ static int do_bzImage64_load(struct kexe
 	return 0;
 }
 
+/* This assumes the file is being loaded using the file-based kexec2 syscall */
+int bzImage64_load_file(int argc, char **argv, struct kexec_info *info)
+{
+	int ret = 0;
+	char *command_line = NULL, *tmp_cmdline = NULL;
+	const char *ramdisk = NULL, *append = NULL;
+	int entry_16bit = 0, entry_32bit = 0;
+	int opt;
+	int command_line_len;
+
+	/* See options.h -- add any more there, too. */
+	static const struct option options[] = {
+		KEXEC_ARCH_OPTIONS
+		{ "command-line",	1, 0, OPT_APPEND },
+		{ "append",		1, 0, OPT_APPEND },
+		{ "reuse-cmdline",	0, 0, OPT_REUSE_CMDLINE },
+		{ "initrd",		1, 0, OPT_RAMDISK },
+		{ "ramdisk",		1, 0, OPT_RAMDISK },
+		{ "real-mode",		0, 0, OPT_REAL_MODE },
+		{ "entry-32bit",	0, 0, OPT_ENTRY_32BIT },
+		{ 0,			0, 0, 0 },
+	};
+	static const char short_options[] = KEXEC_ARCH_OPT_STR "d";
+
+	while ((opt = getopt_long(argc, argv, short_options, options, 0)) != -1) {
+		switch (opt) {
+		default:
+			/* Ignore core options */
+			if (opt < OPT_ARCH_MAX)
+				break;
+		case OPT_APPEND:
+			append = optarg;
+			break;
+		case OPT_REUSE_CMDLINE:
+			tmp_cmdline = get_command_line();
+			break;
+		case OPT_RAMDISK:
+			ramdisk = optarg;
+			break;
+		case OPT_REAL_MODE:
+			entry_16bit = 1;
+			break;
+		case OPT_ENTRY_32BIT:
+			entry_32bit = 1;
+			break;
+		}
+	}
+	command_line = concat_cmdline(tmp_cmdline, append);
+	if (tmp_cmdline)
+		free(tmp_cmdline);
+	command_line_len = 0;
+	if (command_line) {
+		command_line_len = strlen(command_line) + 1;
+	} else {
+		command_line = strdup("\0");
+		command_line_len = 1;
+	}
+
+	if (entry_16bit || entry_32bit) {
+		fprintf(stderr, "Kexec2 syscall does not support 16bit"
+			" or 32bit entry yet\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (ramdisk) {
+		info->initrd_fd = open(ramdisk, O_RDONLY);
+		if (info->initrd_fd == -1) {
+			fprintf(stderr, "Could not open initrd file %s:%s\n",
+					ramdisk, strerror(errno));
+			ret = -1;
+			goto out;
+		}
+	}
+
+	info->command_line = command_line;
+	info->command_line_len = command_line_len;
+	return ret;
+out:
+	free(command_line);
+	return ret;
+}
+
 int bzImage64_load(int argc, char **argv, const char *buf, off_t len,
 	struct kexec_info *info)
 {
@@ -247,6 +330,9 @@ int bzImage64_load(int argc, char **argv
 	int opt;
 	int result;
 
+	if (info->file_mode)
+		return bzImage64_load_file(argc, argv, info);
+
 	/* See options.h -- add any more there, too. */
 	static const struct option options[] = {
 		KEXEC_ARCH_OPTIONS
Index: kexec-tools/kexec/kexec-syscall.h
===================================================================
--- kexec-tools.orig/kexec/kexec-syscall.h	2014-06-02 14:34:16.719774316 -0400
+++ kexec-tools/kexec/kexec-syscall.h	2014-06-02 14:34:42.011036336 -0400
@@ -53,6 +53,19 @@
 #endif
 #endif /*ifndef __NR_kexec_load*/
 
+#ifndef __NR_kexec_file_load
+
+#ifdef __x86_64__
+#define __NR_kexec_file_load	317
+#endif
+
+#ifndef __NR_kexec_file_load
+/* system call not available for the arch */
+#define __NR_kexec_file_load	0xffffffff	/* system call not available */
+#endif
+
+#endif /*ifndef __NR_kexec_file_load*/
+
 struct kexec_segment;
 
 static inline long kexec_load(void *entry, unsigned long nr_segments,
@@ -61,10 +74,28 @@ static inline long kexec_load(void *entr
 	return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
 }
 
+static inline int is_kexec_file_load_implemented(void) {
+	if (__NR_kexec_file_load != 0xffffffff)
+		return 1;
+	return 0;
+}
+
+static inline long kexec_file_load(int kernel_fd, int initrd_fd,
+			const char *cmdline_ptr, unsigned long cmdline_len,
+			unsigned long flags)
+{
+	return (long) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
+				cmdline_ptr, cmdline_len, flags);
+}
+
 #define KEXEC_ON_CRASH		0x00000001
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_ARCH_MASK		0xffff0000
 
+/* Flags for kexec file based system call */
+#define KEXEC_FILE_UNLOAD	0x00000001
+#define KEXEC_FILE_ON_CRASH	0x00000002
+
 /* These values match the ELF architecture values. 
  * Unless there is a good reason that should continue to be the case.
  */
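Unlike kexec_load(), the new wrapper above takes file descriptors rather than pre-built segments. The caller only has to supply:

- the flags (unload and/or crash kernel);
- a command line whose length includes the trailing NUL. bzImage64_load_file() passes an empty command line as a 1-byte string.

For unload, kexec2_unload() passes -1 for both descriptors and no command line.

The following is a minimal sketch of that argument handling, under stated assumptions:

- The flag values are the ones defined in this hunk.
- The fallback syscall number 317 is this RFC's tentative x86_64 value. Mainline later settled on a different number (320 on x86_64), so the libc header value is preferred when present.
- The helper names cmdline_len(), file_load_flags() and do_kexec_file_load() are hypothetical, not kexec-tools functions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback: tentative x86_64 number from this RFC; prefer the libc header value. */
#if !defined(__NR_kexec_file_load) && defined(__x86_64__)
#define __NR_kexec_file_load 317
#endif

/* Flag values as defined in the kexec-syscall.h hunk above. */
#define KEXEC_FILE_UNLOAD    0x00000001UL
#define KEXEC_FILE_ON_CRASH  0x00000002UL

/* Length handed to the kernel counts the NUL; NULL and "" both give 1. */
static unsigned long cmdline_len(const char *cmdline)
{
	return (cmdline ? strlen(cmdline) : 0) + 1;
}

/* Map --unload and --load-panic requests onto file-load flags. */
static unsigned long file_load_flags(int unload, int on_crash)
{
	unsigned long flags = 0;

	if (unload)
		flags |= KEXEC_FILE_UNLOAD;
	if (on_crash)
		flags |= KEXEC_FILE_ON_CRASH;
	return flags;
}

/* Issue the call; needs CAP_SYS_BOOT and a kernel that implements it. */
long do_kexec_file_load(int kernel_fd, int initrd_fd,
			const char *cmdline, unsigned long flags)
{
#ifdef __NR_kexec_file_load
	if (flags & KEXEC_FILE_UNLOAD)	/* like kexec2_unload(): no fds, no cmdline */
		return syscall(__NR_kexec_file_load, -1, -1,
			       (const char *)NULL, 0UL, flags);
	if (!cmdline)
		cmdline = "";		/* like bzImage64_load_file(): empty string */
	return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline, cmdline_len(cmdline), flags);
#else
	(void)kernel_fd; (void)initrd_fd; (void)cmdline; (void)flags;
	errno = ENOSYS;			/* no syscall number known for this arch */
	return -1;
#endif
}
```

Keeping the flag and length logic in small pure helpers makes them testable without root, while the actual syscall stays a thin wrapper.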

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 01/13] bin2c: Move bin2c in scripts/basic
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-03 16:01     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-03 16:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:50AM -0400, Vivek Goyal wrote:
> Kexec wants to use bin2c and it wants to use it really early in the build
> process. See arch/x86/purgatory/ code in later patches.
> 
> So move bin2c in scripts/basic so that it can be built very early and
> be usable by arch/x86/purgatory/
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  kernel/Makefile        |  2 +-
>  scripts/Makefile       |  1 -
>  scripts/basic/Makefile |  1 +
>  scripts/basic/bin2c.c  | 35 +++++++++++++++++++++++++++++++++++
>  scripts/bin2c.c        | 36 ------------------------------------
>  5 files changed, 37 insertions(+), 38 deletions(-)
>  create mode 100644 scripts/basic/bin2c.c
>  delete mode 100644 scripts/bin2c.c
> 
> diff --git a/kernel/Makefile b/kernel/Makefile
> index f2a8b62..9b07bb7 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -105,7 +105,7 @@ targets += config_data.gz
>  $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
>  	$(call if_changed,gzip)
>  
> -      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
> +      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
>  targets += config_data.h
>  $(obj)/config_data.h: $(obj)/config_data.gz FORCE
>  	$(call filechk,ikconfiggz)
> diff --git a/scripts/Makefile b/scripts/Makefile
> index 1d07860..e9d56fb 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -13,7 +13,6 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include
>  hostprogs-$(CONFIG_KALLSYMS)     += kallsyms
>  hostprogs-$(CONFIG_LOGO)         += pnmtologo
>  hostprogs-$(CONFIG_VT)           += conmakehash
> -hostprogs-$(CONFIG_IKCONFIG)     += bin2c

You'd also need to move the "bin2c" entry from scripts/.gitignore to
scripts/basic/.gitignore. I just noticed this by grepping around the
sources.
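The Makefile rule quoted above pipes binary data (the gzip'ed config) through bin2c so the bytes can be compiled into the kernel as a C string literal. The patch description says arch/x86/purgatory/ will also need the tool early in the build.

The sketch below is a hypothetical re-implementation of that idea, not the kernel's scripts/bin2c.c. The function name bin2c_format() and the 16-bytes-per-line layout are assumptions, and the real tool may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical bin2c-style formatter: emits each input byte as a "\xNN"
 * escape inside tab-indented C string-literal lines of up to 16 bytes.
 * Returns the number of characters written (excluding the final NUL),
 * or -1 if cap is too small.
 */
static long bin2c_format(const unsigned char *in, size_t n,
			 char *out, size_t cap)
{
	static const char hex[] = "0123456789abcdef";
	size_t lines = (n + 15) / 16;
	/* per line: tab, two quotes, newline; per byte: "\xNN"; plus NUL */
	size_t need = lines * 4 + n * 4 + 1;
	size_t pos = 0;

	assert(in != NULL || n == 0);
	if (out == NULL || cap < need)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (i % 16 == 0) {		/* open a new literal line */
			out[pos++] = '\t';
			out[pos++] = '"';
		}
		out[pos++] = '\\';
		out[pos++] = 'x';
		out[pos++] = hex[in[i] >> 4];
		out[pos++] = hex[in[i] & 0xf];
		if (i % 16 == 15 || i == n - 1) {	/* close the line */
			out[pos++] = '"';
			out[pos++] = '\n';
		}
	}
	out[pos] = '\0';
	return (long)pos;
}
```

For the two bytes "AB", this produces the single line `\t"\x41\x42"` followed by a newline. A wrapping rule like the config_data.h one then supplies the surrounding declaration.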

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 01/13] bin2c: Move bin2c in scripts/basic
  2014-06-03 16:01     ` Borislav Petkov
@ 2014-06-03 17:13       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-03 17:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 06:01:26PM +0200, Borislav Petkov wrote:
> On Tue, Jun 03, 2014 at 09:06:50AM -0400, Vivek Goyal wrote:
> > Kexec wants to use bin2c and it wants to use it really early in the build
> > process. See arch/x86/purgatory/ code in later patches.
> > 
> > So move bin2c in scripts/basic so that it can be built very early and
> > be usable by arch/x86/purgatory/
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  kernel/Makefile        |  2 +-
> >  scripts/Makefile       |  1 -
> >  scripts/basic/Makefile |  1 +
> >  scripts/basic/bin2c.c  | 35 +++++++++++++++++++++++++++++++++++
> >  scripts/bin2c.c        | 36 ------------------------------------
> >  5 files changed, 37 insertions(+), 38 deletions(-)
> >  create mode 100644 scripts/basic/bin2c.c
> >  delete mode 100644 scripts/bin2c.c
> > 
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index f2a8b62..9b07bb7 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -105,7 +105,7 @@ targets += config_data.gz
> >  $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
> >  	$(call if_changed,gzip)
> >  
> > -      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
> > +      filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
> >  targets += config_data.h
> >  $(obj)/config_data.h: $(obj)/config_data.gz FORCE
> >  	$(call filechk,ikconfiggz)
> > diff --git a/scripts/Makefile b/scripts/Makefile
> > index 1d07860..e9d56fb 100644
> > --- a/scripts/Makefile
> > +++ b/scripts/Makefile
> > @@ -13,7 +13,6 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include
> >  hostprogs-$(CONFIG_KALLSYMS)     += kallsyms
> >  hostprogs-$(CONFIG_LOGO)         += pnmtologo
> >  hostprogs-$(CONFIG_VT)           += conmakehash
> > -hostprogs-$(CONFIG_IKCONFIG)     += bin2c
> 
> You'd also need to move the "bin2c" entry from scripts/.gitignore to
> scripts/basic/.gitignore. I just noticed this by grepping around the
> sources.

Good catch. Will do in the next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 02/13] kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-04  9:13     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04  9:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:51AM -0400, Vivek Goyal wrote:
> currently bin2c builds only if CONFIG_IKCONFIG=y. But bin2c will now be
> used by kexec too.  So make it compilation dependent on CONFIG_BUILD_BIN2C
> and this config option can be selected by CONFIG_KEXEC and CONFIG_IKCONFIG.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-03 13:12   ` Vivek Goyal
@ 2014-06-04  9:22     ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-04  9:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On 06/03/14 at 09:12am, Vivek Goyal wrote:
> On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V3 of the patchset. Previous versions were posted here.
> > 
> > V1: https://lkml.org/lkml/2013/11/20/540
> > V2: https://lkml.org/lkml/2014/1/27/331
> > 
> > Changes since v2:
> > 
> > - Took care of most of the review comments from V2.
> > - Added support for kexec/kdump on EFI systems.
> > - Dropped support for loading ELF vmlinux.
> > 
> > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > two patch cleanup series which is sitting in -tip tree here.
> 
> I used following kexec-tools patches to test kernel changes.
> 
> Thanks
> Vivek
> 
> 
> kexec-tools: Provide an option to make use of new system call
> 
> This patch provides and option --use-kexec2-syscall, to force use of
> new system call for kexec. Default is to continue to use old syscall.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Hi, Vivek

In your kexec-tools patch, usage() lists '-s' as the short option for
--use-kexec2-syscall, but '-s' doesn't actually work.

[..]
> Index: kexec-tools/kexec/kexec.h
> ===================================================================
> --- kexec-tools.orig/kexec/kexec.h	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec.h	2014-06-02 14:34:42.010036325 -0400
> @@ -156,6 +156,13 @@ struct kexec_info {
>  	unsigned long kexec_flags;
>  	unsigned long backup_src_start;
>  	unsigned long backup_src_size;
> +	/* Set to 1 if we are using kexec2 syscall */
> +	unsigned long file_mode :1;
> +
> +	/* Filled by kernel image processing code */
> +	int initrd_fd;
> +	char *command_line;
> +	int command_line_len;
>  };
>  
>  struct arch_map_entry {
> @@ -207,6 +214,7 @@ extern int file_types;
>  #define OPT_UNLOAD		'u'
>  #define OPT_TYPE		't'
>  #define OPT_PANIC		'p'
> +#define OPT_USE_KEXEC2_SYSCALL	's'
>  #define OPT_MEM_MIN             256
>  #define OPT_MEM_MAX             257
>  #define OPT_REUSE_INITRD	258
> @@ -230,6 +238,7 @@ extern int file_types;
>  	{ "mem-min",		1, 0, OPT_MEM_MIN }, \
>  	{ "mem-max",		1, 0, OPT_MEM_MAX }, \
>  	{ "reuseinitrd",	0, 0, OPT_REUSE_INITRD }, \
> +	{ "use-kexec2-syscall",	0, 0, OPT_USE_KEXEC2_SYSCALL }, \
>  	{ "debug",		0, 0, OPT_DEBUG }, \
>  
>  #define KEXEC_OPT_STR "h?vdfxluet:p"

This line,
#define KEXEC_OPT_STR "h?vdfxluet:p"

should be something like,
#define KEXEC_OPT_STR "h?vdfxluet:ps"
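The fix follows from how getopt_long() matches options:

- A long option is looked up in the struct option table.
- A short option is recognised only if its letter appears in the optstring.

So defining OPT_USE_KEXEC2_SYSCALL as 's' and listing -s in usage() is not enough. The minimal sketch below shows the difference. The helper first_opt() and the table demo_longopts are hypothetical, not kexec-tools code, and the optind reset to 0 relies on the glibc/musl convention for rescanning.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Long option as added by the patch; 's' is its short-option value. */
static const struct option demo_longopts[] = {
	{ "use-kexec2-syscall", no_argument, NULL, 's' },
	{ NULL, 0, NULL, 0 },
};

/*
 * Return the first code getopt_long() reports for argv when scanning
 * with optstring; '?' means "unrecognised option".
 */
static int first_opt(const char *optstring, int argc, char **argv)
{
	optind = 0;	/* glibc/musl: 0 forces a full re-initialisation */
	opterr = 0;	/* suppress the "invalid option" message */
	return getopt_long(argc, argv, optstring, demo_longopts, NULL);
}
```

With the old string, `first_opt("h?vdfxluet:p", ...)` returns '?' for `-s` but 's' for `--use-kexec2-syscall`. Adding 's' to the string makes `-s` return 's' as well.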

Thanks
WANG Chao

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 03/13] kexec: Move segment verification code in a separate function
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-04  9:32     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04  9:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:52AM -0400, Vivek Goyal wrote:
> Previously, do_kimage_alloc() allocated a kimage structure, copied the
> segment list from user space and then did the segment list sanity verification.
> 
> Break this function down into 3 parts: do_kimage_alloc_init() does the actual
> allocation and basic initialization of the kimage structure,
> copy_user_segment_list() copies the segment list from user space, and
> sanity_check_segment_list() verifies the sanity of the segment list as passed
> by user space.
> 
> In later patches, I need to only allocate a kimage and not copy the segment
> list from user space, so breaking this down into smaller functions enables
> re-use of the code at other places.

I haven't seen what's going on further in the patchset but from looking at
kimage_normal_alloc() and kimage_crash_alloc()'s guts, they look very
similar and could probably share a common __kimage_alloc which does
do_kimage_alloc_init, copy_user_segment_list, sanity_check_segment_list
and kimage_alloc_control_pages...

One probably would have to actually write it down to see whether it
makes sense though and is not too ugly :-)

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---

In any case, it looks ok, just two nitpicks below:

Acked-by: Borislav Petkov <bp@suse.de>

>  kernel/kexec.c | 182 ++++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 101 insertions(+), 81 deletions(-)

...

> +static struct kimage *do_kimage_alloc_init(void)
> +{
> +	struct kimage *image;
> +
> +	/* Allocate a controlling structure */
> +	image = kzalloc(sizeof(*image), GFP_KERNEL);
> +	if (!image)
> +		return NULL;
> +
> +	image->head = 0;
> +	image->entry = &image->head;
> +	image->last_entry = &image->head;
> +	image->control_page = ~0; /* By default this does not apply */
> +	image->type = KEXEC_TYPE_DEFAULT;
> +
> +	/* Initialize the list of control pages */
> +	INIT_LIST_HEAD(&image->control_pages);
> +
> +	/* Initialize the list of destination pages */
> +	INIT_LIST_HEAD(&image->dest_pages);
> +
> +	/* Initialize the list of unusable pages */
> +	INIT_LIST_HEAD(&image->unuseable_pages);

If the "e" in "unuseable" bugs you too, like me, you could add this one
to your patchset :-)

http://lkml.kernel.org/r/1392819695-24116-1-git-send-email-bp@alien8.de
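
The INIT_LIST_HEAD() calls are needed even though kzalloc() has already zeroed the structure. An empty kernel list head points at itself rather than holding NULL, so a merely zeroed head would not test as empty and walking it would dereference NULL. A minimal userspace model of that idiom follows; the sketch_list_* names are hypothetical, not the kernel API.

```c
#include <assert.h>

/* Minimal userspace copy of the kernel's circular list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Counterpart of INIT_LIST_HEAD(): an empty list points at itself. */
static void sketch_list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Counterpart of list_empty(). */
static int sketch_list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Counterpart of list_add_tail(): insert n just before the head h. */
static void sketch_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}
```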

...

> @@ -258,22 +292,23 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
>  					   get_order(KEXEC_CONTROL_PAGE_SIZE));
>  	if (!image->control_code_page) {
>  		printk(KERN_ERR "Could not allocate control_code_buffer\n");
> -		goto out_free;
> +		goto out_free_image;
>  	}
>  
>  	image->swap_page = kimage_alloc_control_pages(image, 0);
>  	if (!image->swap_page) {
>  		printk(KERN_ERR "Could not allocate swap buffer\n");
> -		goto out_free;
> +		goto out_free_control_pages;
>  	}
>  
>  	*rimage = image;
>  	return 0;
>  
> -out_free:
> +

Superfluous newline.

> +out_free_control_pages:
>  	kimage_free_page_list(&image->control_pages);
> +out_free_image:
>  	kfree(image);
> -out:
>  	return result;
>  }
>
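
The relabelled error path follows the usual kernel unwind idiom. Cleanup labels appear in reverse order of acquisition, and each failure jumps to the label that releases only what has been acquired so far. The userspace sketch below assumes hypothetical stand-ins: struct fake_image, fake_alloc() and the live_allocs counter replace kimage and its page allocations, and -1 stands in for -ENOMEM.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for kimage with two page allocations. */
struct fake_image {
	void *control_code_page;
	void *swap_page;
};

static int fail_at;     /* 1: control page alloc fails, 2: swap fails */
static int live_allocs; /* outstanding allocations, to detect leaks */

static void *fake_alloc(int stage)
{
	void *p;

	if (stage == fail_at)
		return NULL;
	p = malloc(16);
	if (p)
		live_allocs++;
	return p;
}

static void fake_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

static int image_alloc(struct fake_image **rimage)
{
	struct fake_image *image;
	int result = -1;	/* stands in for -ENOMEM */

	image = calloc(1, sizeof(*image));
	if (!image)
		return result;
	live_allocs++;

	image->control_code_page = fake_alloc(1);
	if (!image->control_code_page)
		goto out_free_image;	/* nothing else to undo yet */

	image->swap_page = fake_alloc(2);
	if (!image->swap_page)
		goto out_free_control_pages;

	*rimage = image;
	return 0;

out_free_control_pages:
	fake_free(image->control_code_page);
out_free_image:
	fake_free(image);
	return result;
}
```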

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 04/13] resource: Provide new functions to walk through resources
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-04 10:24     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04 10:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, Yinghai Lu

On Tue, Jun 03, 2014 at 09:06:53AM -0400, Vivek Goyal wrote:
> @@ -322,7 +327,71 @@ int release_resource(struct resource *old)
>  
>  EXPORT_SYMBOL(release_resource);
>  
> -#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
> +/*
> + * Finds the lowest iomem resource that exists within [res->start, res->end).
> + * The caller must specify res->start, res->end, res->flags and "name".
> + * If found, returns 0 and res is overwritten; if not found, returns -1.
> + * This walks through the whole tree and not just first level children.
> + */
> +static int find_next_iomem_res(struct resource *res, char *name)
> +{
> +	resource_size_t start, end;
> +	struct resource *p;
> +
> +	BUG_ON(!res);
> +
> +	start = res->start;
> +	end = res->end;
> +	BUG_ON(start >= end);
> +
> +	read_lock(&resource_lock);
> +	p = &iomem_resource;
> +	while ((p = next_resource(p))) {

Just a thought - this function differs from find_next_system_ram() only
in the traversal mode through resources. I wonder if next_resource()
could be given a flag, say TRAVERSE_SIBLINGS_ONLY or so and be called
from both, once with the flag set and once without and thus save us the
code duplication.
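
A userspace sketch of that suggestion uses a single successor function with a flag. It is modeled on the depth-first step (child first, then sibling, climbing back through parents) that next_resource() appears to perform. struct res, next_res() and the sibling_only parameter are illustrative names, not the patch's code. The walk starts at the root's first child so that the sibling-only mode stays on the first level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace model of struct resource's tree links. */
struct res {
	struct res *parent, *sibling, *child;
};

/*
 * Depth-first successor; with sibling_only set, stay on the
 * current level instead of descending into children.
 */
static struct res *next_res(struct res *p, bool sibling_only)
{
	if (sibling_only)
		return p->sibling;
	if (p->child)
		return p->child;
	while (!p->sibling && p->parent)
		p = p->parent;
	return p->sibling;
}

/* Count the nodes visited below root in either traversal mode. */
static int count_nodes(struct res *root, bool sibling_only)
{
	int n = 0;

	for (struct res *p = root->child; p; p = next_res(p, sibling_only))
		n++;
	return n;
}
```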

> +		if (p->flags != res->flags)
> +			continue;
> +		if (name && strcmp(p->name, name))
> +			continue;
> +		if (p->start > end) {
> +			p = NULL;
> +			break;
> +		}
> +		if ((p->end >= start) && (p->start < end))
> +			break;
> +	}
> +
> +	read_unlock(&resource_lock);
> +	if (!p)
> +		return -1;
> +	/* copy data */
> +	if (res->start < p->start)
> +		res->start = p->start;
> +	if (res->end > p->end)
> +		res->end = p->end;
> +	return 0;
> +}
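
The match-and-clip step of find_next_iomem_res() can be modeled on a flat array sorted by start address instead of the resource tree, without locking or flag and name matching. In this sketch, struct range and find_next_overlap() are hypothetical names, and both ends are treated as inclusive, as struct resource does.

```c
#include <assert.h>
#include <stddef.h>

struct range {
	unsigned long long start, end;	/* both inclusive */
};

/*
 * Find the lowest entry of a sorted table that overlaps *q and clip
 * *q to it. Returns 0 on success, -1 if nothing overlaps.
 */
static int find_next_overlap(const struct range *tbl, size_t n,
			     struct range *q)
{
	assert(q->start < q->end);

	for (size_t i = 0; i < n; i++) {
		const struct range *p = &tbl[i];

		if (p->start > q->end)
			return -1;	/* sorted: no later entry overlaps */
		if (p->end >= q->start) {
			if (q->start < p->start)
				q->start = p->start;
			if (q->end > p->end)
				q->end = p->end;
			return 0;
		}
	}
	return -1;
}
```

A caller that wants every match would restart from the clipped end + 1 after each hit.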

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 05/13] kexec: Make kexec_segment user buffer pointer a union
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-04 10:34     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04 10:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:54AM -0400, Vivek Goyal wrote:
> So far kexec_segment->buf was always a user space pointer because user
> space passed the array of kexec_segment structures and the kernel copied it.
> 
> But with the new system call, the list of kexec segments will be prepared by
> the kernel and kexec_segment->buf will point to kernel memory.
> 
> So when I added code that assumes ->buf points to kernel memory, sparse
> started giving warnings.
> 
> Make ->buf a union: where a user space pointer is expected, access
> it using ->buf, and where a kernel space pointer is expected, access it
> using ->kbuf. That takes care of the sparse warnings.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Borislav Petkov <bp@suse.de>
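
The union described in the commit message can be sketched in userspace. Both members share storage, so no field is added; the two names only let sparse tell user from kernel pointers. The struct name kexec_segment_sketch is hypothetical. The field list assumes the in-kernel struct kexec_segment layout (buf, bufsz, mem, memsz), and __user is defined away here because outside a sparse run the kernel's annotation expands to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Outside sparse, the kernel's __user annotation expands to nothing. */
#ifndef __user
#define __user
#endif

struct kexec_segment_sketch {
	union {
		void __user *buf;	/* segment supplied by user space */
		void *kbuf;		/* segment prepared by the kernel */
	};
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};
```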

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-04 15:18     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04 15:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:55AM -0400, Vivek Goyal wrote:
> This is the new syscall kexec_file_load() declaration/interface. I have
> reserved the syscall number only for x86_64 so far. Other architectures
> (including i386) can reserve a syscall number when they enable support
> for this new syscall.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Acked-by: Borislav Petkov <bp@suse.de>
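
There was no libc wrapper for a syscall this new, so a user space caller would go through syscall(2). The sketch below assumes the argument order of the version that was eventually merged (kernel_fd, initrd_fd, cmdline_len, cmdline, flags) and that cmdline_len counts the terminating NUL; treat both as assumptions for this posting. The wrapper name is hypothetical. Because only x86_64 has a number reserved, the call is guarded on SYS_kexec_file_load and otherwise fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper around the proposed kexec_file_load(). */
static long my_kexec_file_load(int kernel_fd, int initrd_fd,
			       const char *cmdline, unsigned long flags)
{
#ifdef SYS_kexec_file_load
	/* Assumed: the length includes the terminating NUL. */
	unsigned long len = cmdline ? strlen(cmdline) + 1 : 0;

	return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd, len,
		       cmdline, flags);
#else
	(void)kernel_fd;
	(void)initrd_fd;
	(void)cmdline;
	(void)flags;
	errno = ENOSYS;	/* no syscall number on this architecture */
	return -1;
#endif
}
```

Invalid descriptors always make the call fail (EPERM without CAP_SYS_BOOT, otherwise an fd error), which makes it safe to probe whether the kernel knows the syscall at all.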

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-04  9:22     ` WANG Chao
@ 2014-06-04 17:50       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-04 17:50 UTC (permalink / raw)
  To: WANG Chao
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On Wed, Jun 04, 2014 at 05:22:14PM +0800, WANG Chao wrote:

[..]
> > Index: kexec-tools/kexec/kexec.h
> > ===================================================================
> > --- kexec-tools.orig/kexec/kexec.h	2014-06-02 14:34:16.719774316 -0400
> > +++ kexec-tools/kexec/kexec.h	2014-06-02 14:34:42.010036325 -0400
> > @@ -156,6 +156,13 @@ struct kexec_info {
> >  	unsigned long kexec_flags;
> >  	unsigned long backup_src_start;
> >  	unsigned long backup_src_size;
> > +	/* Set to 1 if we are using kexec2 syscall */
> > +	unsigned long file_mode :1;
> > +
> > +	/* Filled by kernel image processing code */
> > +	int initrd_fd;
> > +	char *command_line;
> > +	int command_line_len;
> >  };
> >  
> >  struct arch_map_entry {
> > @@ -207,6 +214,7 @@ extern int file_types;
> >  #define OPT_UNLOAD		'u'
> >  #define OPT_TYPE		't'
> >  #define OPT_PANIC		'p'
> > +#define OPT_USE_KEXEC2_SYSCALL	's'
> >  #define OPT_MEM_MIN             256
> >  #define OPT_MEM_MAX             257
> >  #define OPT_REUSE_INITRD	258
> > @@ -230,6 +238,7 @@ extern int file_types;
> >  	{ "mem-min",		1, 0, OPT_MEM_MIN }, \
> >  	{ "mem-max",		1, 0, OPT_MEM_MAX }, \
> >  	{ "reuseinitrd",	0, 0, OPT_REUSE_INITRD }, \
> > +	{ "use-kexec2-syscall",	0, 0, OPT_USE_KEXEC2_SYSCALL }, \
> >  	{ "debug",		0, 0, OPT_DEBUG }, \
> >  
> >  #define KEXEC_OPT_STR "h?vdfxluet:p"
> 
> This line,
> #define KEXEC_OPT_STR "h?vdfxluet:p"
> 
> should be something like,
> #define KEXEC_OPT_STR "h?vdfxluet:ps"

Thanks, Chao. I will fix it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 03/13] kexec: Move segment verification code in a separate function
  2014-06-04  9:32     ` Borislav Petkov
@ 2014-06-04 18:47       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-04 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Wed, Jun 04, 2014 at 11:32:55AM +0200, Borislav Petkov wrote:
> On Tue, Jun 03, 2014 at 09:06:52AM -0400, Vivek Goyal wrote:
> > Previously, do_kimage_alloc() allocated a kimage structure, copied the
> > segment list from user space and then did the segment list sanity verification.
> > 
> > Break this function down into 3 parts: do_kimage_alloc_init() does the actual
> > allocation and basic initialization of the kimage structure,
> > copy_user_segment_list() copies the segment list from user space, and
> > sanity_check_segment_list() verifies the sanity of the segment list as passed
> > by user space.
> > 
> > In later patches, I need to only allocate a kimage and not copy the segment
> > list from user space, so breaking this down into smaller functions enables
> > re-use of the code at other places.
> 
> I haven't seen what's going on further in the patchset but from looking at
> kimage_normal_alloc() and kimage_crash_alloc()'s guts, they look very
> similar and could probably share a common __kimage_alloc which does
> do_kimage_alloc_init, copy_user_segment_list, sanity_check_segment_list
> and kimage_alloc_control_pages...
> 
> One probably would have to actually write it down to see whether it
> makes sense though and is not too ugly :-)

Hi Boris,

Agreed. kimage_normal_alloc() and kimage_crash_alloc() share a lot of
code, and it makes sense to write a common function for the shared code
and have both call it. I will give it a try and, if it works out, make it
part of the next version of the posting.

> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> 
> In any case, it looks ok, just two nitpicks below:
> 
> Acked-by: Borislav Petkov <bp@suse.de>
> 
> >  kernel/kexec.c | 182 ++++++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 101 insertions(+), 81 deletions(-)
> 
> ...
> 
> > +static struct kimage *do_kimage_alloc_init(void)
> > +{
> > +	struct kimage *image;
> > +
> > +	/* Allocate a controlling structure */
> > +	image = kzalloc(sizeof(*image), GFP_KERNEL);
> > +	if (!image)
> > +		return NULL;
> > +
> > +	image->head = 0;
> > +	image->entry = &image->head;
> > +	image->last_entry = &image->head;
> > +	image->control_page = ~0; /* By default this does not apply */
> > +	image->type = KEXEC_TYPE_DEFAULT;
> > +
> > +	/* Initialize the list of control pages */
> > +	INIT_LIST_HEAD(&image->control_pages);
> > +
> > +	/* Initialize the list of destination pages */
> > +	INIT_LIST_HEAD(&image->dest_pages);
> > +
> > +	/* Initialize the list of unusable pages */
> > +	INIT_LIST_HEAD(&image->unuseable_pages);
> 
> If the "e" in "unuseable" bugs you too, like me, you could add this one
> to your patchset :-)
> 
> http://lkml.kernel.org/r/1392819695-24116-1-git-send-email-bp@alien8.de
> 

Hmm, interesting. I never noticed it. A Google search suggests that
"unuseable" is not wrong either.

I don't feel very strongly about it, so I will leave this cleanup
for another day.

> ...
> 
> > @@ -258,22 +292,23 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
> >  					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> >  	if (!image->control_code_page) {
> >  		printk(KERN_ERR "Could not allocate control_code_buffer\n");
> > -		goto out_free;
> > +		goto out_free_image;
> >  	}
> >  
> >  	image->swap_page = kimage_alloc_control_pages(image, 0);
> >  	if (!image->swap_page) {
> >  		printk(KERN_ERR "Could not allocate swap buffer\n");
> > -		goto out_free;
> > +		goto out_free_control_pages;
> >  	}
> >  
> >  	*rimage = image;
> >  	return 0;
> >  
> > -out_free:
> > +
> 
> Superfluous newline.

Will remove.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
@ 2014-06-04 19:39         ` Michael Kerrisk
  0 siblings, 0 replies; 214+ messages in thread
From: Michael Kerrisk @ 2014-06-04 19:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, Linux Kernel, kexec, Eric W. Biederman,
	H. Peter Anvin, mjg59, Greg Kroah-Hartman, Borislav Petkov,
	Jiri Kosina, dyoung, bhe, Andrew Morton, Linux API,
	Michael Kerrisk-manpages

Vivek,

As per Documentation/SubmitChecklist, please CC linux-api@ on patches
that change the ABI/API. See
https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Also, is there some draft man page for this new system call?

Thanks,

Michael

On Wed, Jun 4, 2014 at 7:50 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Jun 04, 2014 at 05:22:14PM +0800, WANG Chao wrote:
>
> [..]
>> > Index: kexec-tools/kexec/kexec.h
>> > ===================================================================
>> > --- kexec-tools.orig/kexec/kexec.h  2014-06-02 14:34:16.719774316 -0400
>> > +++ kexec-tools/kexec/kexec.h       2014-06-02 14:34:42.010036325 -0400
>> > @@ -156,6 +156,13 @@ struct kexec_info {
>> >     unsigned long kexec_flags;
>> >     unsigned long backup_src_start;
>> >     unsigned long backup_src_size;
>> > +   /* Set to 1 if we are using kexec2 syscall */
>> > +   unsigned long file_mode :1;
>> > +
>> > +   /* Filled by kernel image processing code */
>> > +   int initrd_fd;
>> > +   char *command_line;
>> > +   int command_line_len;
>> >  };
>> >
>> >  struct arch_map_entry {
>> > @@ -207,6 +214,7 @@ extern int file_types;
>> >  #define OPT_UNLOAD         'u'
>> >  #define OPT_TYPE           't'
>> >  #define OPT_PANIC          'p'
>> > +#define OPT_USE_KEXEC2_SYSCALL     's'
>> >  #define OPT_MEM_MIN             256
>> >  #define OPT_MEM_MAX             257
>> >  #define OPT_REUSE_INITRD   258
>> > @@ -230,6 +238,7 @@ extern int file_types;
>> >     { "mem-min",            1, 0, OPT_MEM_MIN }, \
>> >     { "mem-max",            1, 0, OPT_MEM_MAX }, \
>> >     { "reuseinitrd",        0, 0, OPT_REUSE_INITRD }, \
>> > +   { "use-kexec2-syscall", 0, 0, OPT_USE_KEXEC2_SYSCALL }, \
>> >     { "debug",              0, 0, OPT_DEBUG }, \
>> >
>> >  #define KEXEC_OPT_STR "h?vdfxluet:p"
>>
>> This line,
>> #define KEXEC_OPT_STR "h?vdfxluet:p"
>>
>> should be something like,
>> #define KEXEC_OPT_STR "h?vdfxluet:ps"
>
> Thanks chao. I will fix it.
>
> Thanks
> Vivek



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
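WANG Chao's review comment quoted above depends on a getopt_long() detail. A long option whose `val` is 's' is matched by name, independently of the short-option string. The short form `-s`, however, is only recognized if 's' also appears in that string (KEXEC_OPT_STR in kexec-tools). Below is a minimal, self-contained sketch of that behavior. The option table and the helper saw_kexec2_option() are hypothetical and only mirror the names used in the quoted patch.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option codes, modeled on the quoted kexec.h hunk. */
#define OPT_UNLOAD             'u'
#define OPT_USE_KEXEC2_SYSCALL 's'

static const struct option demo_long_opts[] = {
    { "unload",             no_argument, NULL, OPT_UNLOAD },
    { "use-kexec2-syscall", no_argument, NULL, OPT_USE_KEXEC2_SYSCALL },
    { NULL, 0, NULL, 0 },
};

/*
 * Scan argv with the given short-option string.
 * Returns 1 if -s/--use-kexec2-syscall was seen, 0 if not, and -1 if
 * getopt_long() reported an unrecognized option ('?').
 */
int saw_kexec2_option(int argc, char **argv, const char *optstring)
{
    int opt, seen = 0;

    optind = 1;   /* restart scanning at argv[1] */
    opterr = 0;   /* keep the demo quiet; '?' is reported via the return */
    while ((opt = getopt_long(argc, argv, optstring,
                              demo_long_opts, NULL)) != -1) {
        if (opt == OPT_USE_KEXEC2_SYSCALL)
            seen = 1;
        else if (opt == '?')
            return -1;
    }
    return seen;
}
```

With the optstring "u", `--use-kexec2-syscall` is still accepted while `-s` is rejected as unknown. Adding 's' to the string fixes the short form, which is the change Chao proposed.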

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 03/13] kexec: Move segment verification code in a separate function
  2014-06-04 18:47       ` Vivek Goyal
@ 2014-06-04 20:30         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-04 20:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Wed, Jun 04, 2014 at 02:47:43PM -0400, Vivek Goyal wrote:
> Hmm..., Interesting. I never noticed it. So google search seems to say
> that unuseable is also not wrong.

Nope, it seems more like "unuseable" is simply a very common misspelling
which has managed to spread out uncontrollably, even in the kernel :).
Both Oxford and Merriam-Webster know "unusable" as the only correct
spelling.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-03 13:12   ` Vivek Goyal
@ 2014-06-05  8:31     ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-05  8:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/03/14 at 09:12am, Vivek Goyal wrote:
> On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V3 of the patchset. Previous versions were posted here.
> > 
> > V1: https://lkml.org/lkml/2013/11/20/540
> > V2: https://lkml.org/lkml/2014/1/27/331
> > 
> > Changes since v2:
> > 
> > - Took care of most of the review comments from V2.
> > - Added support for kexec/kdump on EFI systems.
> > - Dropped support for loading ELF vmlinux.
> > 
> > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > two patch cleanup series which is sitting in -tip tree here.
> 
> I used following kexec-tools patches to test kernel changes.
> 
> Thanks
> Vivek
> 
> 
> kexec-tools: Provide an option to make use of new system call
> 
> This patch provides an option, --use-kexec2-syscall, to force use of
> new system call for kexec. Default is to continue to use old syscall.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  kexec/arch/x86_64/kexec-bzImage64.c |   86 +++++++++++++++++++++++++
>  kexec/kexec-syscall.h               |   31 +++++++++
>  kexec/kexec.c                       |  123 +++++++++++++++++++++++++++++++++++-
>  kexec/kexec.h                       |    9 ++
>  4 files changed, 246 insertions(+), 3 deletions(-)
> 
> Index: kexec-tools/kexec/kexec.c
> ===================================================================
> --- kexec-tools.orig/kexec/kexec.c	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec.c	2014-06-02 14:34:42.009036315 -0400
> @@ -51,6 +51,7 @@
>  unsigned long long mem_min = 0;
>  unsigned long long mem_max = ULONG_MAX;
>  static unsigned long kexec_flags = 0;
> +static unsigned long kexec2_flags = 0;
>  int kexec_debug = 0;
>  
>  void dbgprint_mem_range(const char *prefix, struct memory_range *mr, int nr_mr)
> @@ -787,6 +788,19 @@ static int my_load(const char *type, int
>  	return result;
>  }
>  
> +static int kexec2_unload(unsigned long kexec2_flags)
> +{
> +	int ret = 0;
> +
> +	ret = kexec_file_load(-1, -1, NULL, 0, kexec2_flags);
> +	if (ret != 0) {
> +		/* The unload failed, print some debugging information */
> +		fprintf(stderr, "kexec_file_load(unload) failed: %s\n",
> +			strerror(errno));
> +	}
> +	return ret;
> +}
> +
>  static int k_unload (unsigned long kexec_flags)
>  {
>  	int result;
> @@ -925,6 +939,7 @@ void usage(void)
>  	       "                      (0 means it's not jump back or\n"
>  	       "                      preserve context)\n"
>  	       "                      to original kernel.\n"
> +	       " -s, --use-kexec2-syscall Use new syscall for kexec operation\n"
>  	       " -d, --debug           Enable debugging to help spot a failure.\n"
>  	       "\n"
>  	       "Supported kernel file types and options: \n");
> @@ -1072,6 +1087,75 @@ char *concat_cmdline(const char *base, c
>  	return cmdline;
>  }
>  
> +/* New file based kexec system call related code */
> +static int kexec2_load(int fileind, int argc, char **argv,
> +			unsigned long flags) {
> +
> +	char *kernel;
> +	int kernel_fd, i;
> +	struct kexec_info info;
> +	int ret = 0;
> +	char *kernel_buf;
> +	off_t kernel_size;
> +
> +	memset(&info, 0, sizeof(info));
> +	info.segment = NULL;
> +	info.nr_segments = 0;
> +	info.entry = NULL;
> +	info.backup_start = 0;
> +	info.kexec_flags = flags;
> +
> +	info.file_mode = 1;
> +	info.initrd_fd = -1;
> +
> +	if (argc - fileind <= 0) {
> +		fprintf(stderr, "No kernel specified\n");
> +		usage();
> +		return -1;
> +	}
> +
> +	kernel = argv[fileind];
> +
> +	kernel_fd = open(kernel, O_RDONLY);
> +	if (kernel_fd == -1) {
> +		fprintf(stderr, "Failed to open file %s:%s\n", kernel,
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	/* slurp in the input kernel */
> +	kernel_buf = slurp_decompress_file(kernel, &kernel_size);
> +
> +	for (i = 0; i < file_types; i++) {
> +		if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
> +			break;
> +	}
> +
> +	if (i == file_types) {
> +		fprintf(stderr, "Cannot determine the file type " "of %s\n",
> +				kernel);
> +		return -1;
> +	}
> +
> +	ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
> +	if (ret < 0) {
> +		fprintf(stderr, "Cannot load %s\n", kernel);
> +		return ret;
> +	}
> +
> +	if (!is_kexec_file_load_implemented()) {
> +		fprintf(stderr, "syscall kexec_file_load not available.\n");
> +		return -1;
> +	}
> +
> +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
> +			info.command_line_len, info.kexec_flags);
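The file-type loop in kexec2_load() quoted above picks the first entry whose probe() returns a non-negative value. The order of the file_type[] table therefore decides which loader handles an image that more than one probe accepts. The following is a simplified, hypothetical sketch of that first-match dispatch. Its two probes only check magic bytes: the ELF header, and the x86 boot protocol "HdrS" signature at offset 0x202. The real kexec-tools probes do considerably more.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, reduced stand-in for kexec-tools' struct file_type. */
struct probe_entry {
    const char *name;
    int (*probe)(const char *buf, off_t len); /* >= 0 means "I handle it" */
};

static int probe_elf(const char *buf, off_t len)
{
    return (len >= 4 && memcmp(buf, "\177ELF", 4) == 0) ? 0 : -1;
}

/* x86 boot protocol: a bzImage carries "HdrS" at offset 0x202. */
static int probe_bzimage(const char *buf, off_t len)
{
    return (len >= 0x206 && memcmp(buf + 0x202, "HdrS", 4) == 0) ? 0 : -1;
}

static const struct probe_entry probe_table[] = {
    { "elf",     probe_elf },
    { "bzImage", probe_bzimage },
};

/* Return the index of the first entry whose probe accepts buf, or -1. */
int find_file_type(const char *buf, off_t len)
{
    size_t i;

    for (i = 0; i < sizeof(probe_table) / sizeof(probe_table[0]); i++) {
        if (probe_table[i].probe(buf, len) >= 0)
            return (int)i;
    }
    return -1;
}
```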

Vivek,

I tried your patch on my UEFI test machine, but kexec load fails as shown below:

[root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
Could not find a free area of memory of 0xa000 bytes ...

Another issue is that the syscall should allow loading just a kernel, without an
initrd or command line, since the kernel can mount root by itself and can have its
command line built in.

AFAIK Slackware installs huge kernels without creating an initrd.
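On the user-space side, the quoted bzImage64_load_file() already encodes both cases Dave mentions. info.initrd_fd stays -1 when no --initrd is given, and a missing command line becomes an empty string whose length, counting the terminating NUL, is 1. Whether the kernel side accepts a load in that form is the open question. The sketch below covers only the command-line half. build_cmdline() is a hypothetical helper that mirrors what concat_cmdline() plus the patch's length computation appear to do: join the parts with a single space and count the NUL in the length.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: join an optional reused command line (base) and an
 * optional --append string with a single space. *len receives the length
 * including the terminating NUL, the value passed as cmdline_len to
 * kexec_file_load() in the quoted patch. With neither part present the
 * result is "" with *len == 1, matching the patch's fallback.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *build_cmdline(const char *base, const char *append, unsigned long *len)
{
    size_t blen = base ? strlen(base) : 0;
    size_t alen = append ? strlen(append) : 0;
    char *cmdline = malloc(blen + alen + 2);   /* separator + NUL */

    if (!cmdline)
        return NULL;
    cmdline[0] = '\0';
    if (base)
        strcat(cmdline, base);
    if (base && append)
        strcat(cmdline, " ");
    if (append)
        strcat(cmdline, append);
    *len = strlen(cmdline) + 1;
    return cmdline;
}
```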
> +	if (ret != 0)
> +		fprintf(stderr, "kexec_file_load failed: %s\n",
> +					strerror(errno));
> +	return ret;
> +}
> +
>  
>  int main(int argc, char *argv[])
>  {
> @@ -1083,6 +1167,7 @@ int main(int argc, char *argv[])
>  	int do_ifdown = 0;
>  	int do_unload = 0;
>  	int do_reuse_initrd = 0;
> +	int do_use_kexec2_syscall = 0;
>  	void *entry = 0;
>  	char *type = 0;
>  	char *endptr;
> @@ -1095,6 +1180,23 @@ int main(int argc, char *argv[])
>  	};
>  	static const char short_options[] = KEXEC_ALL_OPT_STR;
>  
> +	/*
> +	 * First check if --use-kexec2-syscall is set. That changes lot of
> +	 * things
> +	 */
> +	while ((opt = getopt_long(argc, argv, short_options,
> +				  options, 0)) != -1) {
> +		switch(opt) {
> +		case OPT_USE_KEXEC2_SYSCALL:
> +			do_use_kexec2_syscall = 1;
> +			break;
> +		}
> +	}
> +
> +	/* Reset getopt for the next pass. */
> +	opterr = 1;
> +	optind = 1;
> +
>  	while ((opt = getopt_long(argc, argv, short_options,
>  				  options, 0)) != -1) {
>  		switch(opt) {
> @@ -1127,6 +1229,8 @@ int main(int argc, char *argv[])
>  			do_shutdown = 0;
>  			do_sync = 0;
>  			do_unload = 1;
> +			if (do_use_kexec2_syscall)
> +				kexec2_flags |= KEXEC_FILE_UNLOAD;
>  			break;
>  		case OPT_EXEC:
>  			do_load = 0;
> @@ -1169,7 +1273,10 @@ int main(int argc, char *argv[])
>  			do_exec = 0;
>  			do_shutdown = 0;
>  			do_sync = 0;
> -			kexec_flags = KEXEC_ON_CRASH;
> +			if (do_use_kexec2_syscall)
> +				kexec2_flags |= KEXEC_FILE_ON_CRASH;
> +			else
> +				kexec_flags = KEXEC_ON_CRASH;
>  			break;
>  		case OPT_MEM_MIN:
>  			mem_min = strtoul(optarg, &endptr, 0);
> @@ -1194,6 +1301,9 @@ int main(int argc, char *argv[])
>  		case OPT_REUSE_INITRD:
>  			do_reuse_initrd = 1;
>  			break;
> +		case OPT_USE_KEXEC2_SYSCALL:
> +			/* We already parsed it. Nothing to do. */
> +			break;
>  		default:
>  			break;
>  		}
> @@ -1238,10 +1348,17 @@ int main(int argc, char *argv[])
>  	}
>  
>  	if (do_unload) {
> -		result = k_unload(kexec_flags);
> +		if (do_use_kexec2_syscall)
> +			result = kexec2_unload(kexec2_flags);
> +		else
> +			result = k_unload(kexec_flags);
>  	}
>  	if (do_load && (result == 0)) {
> -		result = my_load(type, fileind, argc, argv, kexec_flags, entry);
> +		if (do_use_kexec2_syscall)
> +			result = kexec2_load(fileind, argc, argv, kexec2_flags);
> +		else
> +			result = my_load(type, fileind, argc, argv,
> +						kexec_flags, entry);
>  	}
>  	/* Don't shutdown unless there is something to reboot to! */
>  	if ((result == 0) && (do_shutdown || do_exec) && !kexec_loaded()) {
> Index: kexec-tools/kexec/kexec.h
> ===================================================================
> --- kexec-tools.orig/kexec/kexec.h	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec.h	2014-06-02 14:34:42.010036325 -0400
> @@ -156,6 +156,13 @@ struct kexec_info {
>  	unsigned long kexec_flags;
>  	unsigned long backup_src_start;
>  	unsigned long backup_src_size;
> +	/* Set to 1 if we are using kexec2 syscall */
> +	unsigned long file_mode :1;
> +
> +	/* Filled by kernel image processing code */
> +	int initrd_fd;
> +	char *command_line;
> +	int command_line_len;
>  };
>  
>  struct arch_map_entry {
> @@ -207,6 +214,7 @@ extern int file_types;
>  #define OPT_UNLOAD		'u'
>  #define OPT_TYPE		't'
>  #define OPT_PANIC		'p'
> +#define OPT_USE_KEXEC2_SYSCALL	's'
>  #define OPT_MEM_MIN             256
>  #define OPT_MEM_MAX             257
>  #define OPT_REUSE_INITRD	258
> @@ -230,6 +238,7 @@ extern int file_types;
>  	{ "mem-min",		1, 0, OPT_MEM_MIN }, \
>  	{ "mem-max",		1, 0, OPT_MEM_MAX }, \
>  	{ "reuseinitrd",	0, 0, OPT_REUSE_INITRD }, \
> +	{ "use-kexec2-syscall",	0, 0, OPT_USE_KEXEC2_SYSCALL }, \
>  	{ "debug",		0, 0, OPT_DEBUG }, \
>  
>  #define KEXEC_OPT_STR "h?vdfxluet:p"
> Index: kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c
> ===================================================================
> --- kexec-tools.orig/kexec/arch/x86_64/kexec-bzImage64.c	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c	2014-06-02 14:34:42.011036336 -0400
> @@ -235,6 +235,89 @@ static int do_bzImage64_load(struct kexe
>  	return 0;
>  }
>  
> +/* This assumes file is being loaded using file based kexec2 syscall */
> +int bzImage64_load_file(int argc, char **argv, struct kexec_info *info)
> +{
> +	int ret = 0;
> +	char *command_line = NULL, *tmp_cmdline = NULL;
> +	const char *ramdisk = NULL, *append = NULL;
> +	int entry_16bit = 0, entry_32bit = 0;
> +	int opt;
> +	int command_line_len;
> +
> +	/* See options.h -- add any more there, too. */
> +	static const struct option options[] = {
> +		KEXEC_ARCH_OPTIONS
> +		{ "command-line",	1, 0, OPT_APPEND },
> +		{ "append",		1, 0, OPT_APPEND },
> +		{ "reuse-cmdline",	0, 0, OPT_REUSE_CMDLINE },
> +		{ "initrd",		1, 0, OPT_RAMDISK },
> +		{ "ramdisk",		1, 0, OPT_RAMDISK },
> +		{ "real-mode",		0, 0, OPT_REAL_MODE },
> +		{ "entry-32bit",	0, 0, OPT_ENTRY_32BIT },
> +		{ 0,			0, 0, 0 },
> +	};
> +	static const char short_options[] = KEXEC_ARCH_OPT_STR "d";
> +
> +	while ((opt = getopt_long(argc, argv, short_options, options, 0)) != -1) {
> +		switch (opt) {
> +		default:
> +			/* Ignore core options */
> +			if (opt < OPT_ARCH_MAX)
> +				break;
> +		case OPT_APPEND:
> +			append = optarg;
> +			break;
> +		case OPT_REUSE_CMDLINE:
> +			tmp_cmdline = get_command_line();
> +			break;
> +		case OPT_RAMDISK:
> +			ramdisk = optarg;
> +			break;
> +		case OPT_REAL_MODE:
> +			entry_16bit = 1;
> +			break;
> +		case OPT_ENTRY_32BIT:
> +			entry_32bit = 1;
> +			break;
> +		}
> +	}
> +	command_line = concat_cmdline(tmp_cmdline, append);
> +	if (tmp_cmdline)
> +		free(tmp_cmdline);
> +	command_line_len = 0;
> +	if (command_line) {
> +		command_line_len = strlen(command_line) + 1;
> +	} else {
> +		command_line = strdup("\0");
> +		command_line_len = 1;
> +	}
> +
> +	if (entry_16bit || entry_32bit) {
> +		fprintf(stderr, "Kexec2 syscall does not support 16bit"
> +			" or 32bit entry yet\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (ramdisk) {
> +		info->initrd_fd = open(ramdisk, O_RDONLY);
> +		if (info->initrd_fd == -1) {
> +			fprintf(stderr, "Could not open initrd file %s:%s\n",
> +					ramdisk, strerror(errno));
> +			ret = -1;
> +			goto out;
> +		}
> +	}
> +
> +	info->command_line = command_line;
> +	info->command_line_len = command_line_len;
> +	return ret;
> +out:
> +	free(command_line);
> +	return ret;
> +}
> +
>  int bzImage64_load(int argc, char **argv, const char *buf, off_t len,
>  	struct kexec_info *info)
>  {
> @@ -247,6 +330,9 @@ int bzImage64_load(int argc, char **argv
>  	int opt;
>  	int result;
>  
> +	if (info->file_mode)
> +		return bzImage64_load_file(argc, argv, info);
> +
>  	/* See options.h -- add any more there, too. */
>  	static const struct option options[] = {
>  		KEXEC_ARCH_OPTIONS
> Index: kexec-tools/kexec/kexec-syscall.h
> ===================================================================
> --- kexec-tools.orig/kexec/kexec-syscall.h	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec-syscall.h	2014-06-02 14:34:42.011036336 -0400
> @@ -53,6 +53,19 @@
>  #endif
>  #endif /*ifndef __NR_kexec_load*/
>  
> +#ifndef __NR_kexec_file_load
> +
> +#ifdef __x86_64__
> +#define __NR_kexec_file_load	317
> +#endif
> +
> +#ifndef __NR_kexec_file_load
> +/* system call not available for the arch */
> +#define __NR_kexec_file_load	0xffffffff	/* system call not available */
> +#endif
> +
> +#endif /*ifndef __NR_kexec_file_load*/
> +
>  struct kexec_segment;
>  
>  static inline long kexec_load(void *entry, unsigned long nr_segments,
> @@ -61,10 +74,28 @@ static inline long kexec_load(void *entr
>  	return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
>  }
>  
> +static inline int is_kexec_file_load_implemented(void) {
> +	if (__NR_kexec_file_load != 0xffffffff)
> +		return 1;
> +	return 0;
> +}
> +
> +static inline long kexec_file_load(int kernel_fd, int initrd_fd,
> +			const char *cmdline_ptr, unsigned long cmdline_len,
> +			unsigned long flags)
> +{
> +	return (long) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
> +				cmdline_ptr, cmdline_len, flags);
> +}
> +
>  #define KEXEC_ON_CRASH		0x00000001
>  #define KEXEC_PRESERVE_CONTEXT	0x00000002
>  #define KEXEC_ARCH_MASK		0xffff0000
>  
> +/* Flags for kexec file based system call */
> +#define KEXEC_FILE_UNLOAD	0x00000001
> +#define KEXEC_FILE_ON_CRASH	0x00000002
> +
>  /* These values match the ELF architecture values. 
>   * Unless there is a good reason that should continue to be the case.
>   */
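The main() changes in the quoted kexec.c hunk parse argv twice. The first pass only detects --use-kexec2-syscall, and the second pass then ORs KEXEC_FILE_* flags instead of setting the old KEXEC_ON_CRASH value. The following is a reduced, self-contained sketch of that pattern. The option table and parse_kexec2_flags() are hypothetical; the two flag values are the ones defined in the quoted kexec-syscall.h hunk.

```c
#include <getopt.h>
#include <stddef.h>

/* Flag values for the file-based syscall, as defined in the quoted patch. */
#define KEXEC_FILE_UNLOAD   0x00000001
#define KEXEC_FILE_ON_CRASH 0x00000002

/* Reduced, hypothetical option set; 's' is present in the short string. */
static const char kexec2_short_opts[] = "lups";
static const struct option kexec2_long_opts[] = {
    { "load",               no_argument, NULL, 'l' },
    { "unload",             no_argument, NULL, 'u' },
    { "load-panic",         no_argument, NULL, 'p' },
    { "use-kexec2-syscall", no_argument, NULL, 's' },
    { NULL, 0, NULL, 0 },
};

/*
 * Pass 1 only looks for -s/--use-kexec2-syscall. Pass 2 interprets the
 * other options knowing which syscall will be used, so the result does not
 * depend on where -s appears on the command line.
 * Returns the accumulated KEXEC_FILE_* flags, or -1 if -s was not given.
 */
long parse_kexec2_flags(int argc, char **argv)
{
    int opt, use_kexec2 = 0;
    long flags = 0;

    optind = 1;
    opterr = 0;             /* pass 1 stays quiet; pass 2 reports errors */
    while ((opt = getopt_long(argc, argv, kexec2_short_opts,
                              kexec2_long_opts, NULL)) != -1) {
        if (opt == 's')
            use_kexec2 = 1;
    }

    optind = 1;             /* rescan argv from the beginning */
    opterr = 1;
    while ((opt = getopt_long(argc, argv, kexec2_short_opts,
                              kexec2_long_opts, NULL)) != -1) {
        switch (opt) {
        case 'u':
            if (use_kexec2)
                flags |= KEXEC_FILE_UNLOAD;
            break;
        case 'p':
            if (use_kexec2)
                flags |= KEXEC_FILE_ON_CRASH;
            break;
        default:
            break;          /* 'l', 's' and others: nothing to record */
        }
    }
    return use_kexec2 ? flags : -1;
}
```

For `kexec -p -s`, a single pass would reach -p before learning that -s is set. The first pass avoids that ordering problem. Resetting optind to 1 is the POSIX way to restart scanning; glibc also treats optind = 0 as a request for full reinitialization.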

^ permalink raw reply	[flat|nested] 214+ messages in thread

> +	}
> +
> +	if (entry_16bit || entry_32bit) {
> +		fprintf(stderr, "Kexec2 syscall does not support 16bit"
> +			" or 32bit entry yet\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (ramdisk) {
> +		info->initrd_fd = open(ramdisk, O_RDONLY);
> +		if (info->initrd_fd == -1) {
> +			fprintf(stderr, "Could not open initrd file %s:%s\n",
> +					ramdisk, strerror(errno));
> +			ret = -1;
> +			goto out;
> +		}
> +	}
> +
> +	info->command_line = command_line;
> +	info->command_line_len = command_line_len;
> +	return ret;
> +out:
> +	free(command_line);
> +	return ret;
> +}
> +
>  int bzImage64_load(int argc, char **argv, const char *buf, off_t len,
>  	struct kexec_info *info)
>  {
> @@ -247,6 +330,9 @@ int bzImage64_load(int argc, char **argv
>  	int opt;
>  	int result;
>  
> +	if (info->file_mode)
> +		return bzImage64_load_file(argc, argv, info);
> +
>  	/* See options.h -- add any more there, too. */
>  	static const struct option options[] = {
>  		KEXEC_ARCH_OPTIONS
> Index: kexec-tools/kexec/kexec-syscall.h
> ===================================================================
> --- kexec-tools.orig/kexec/kexec-syscall.h	2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec-syscall.h	2014-06-02 14:34:42.011036336 -0400
> @@ -53,6 +53,19 @@
>  #endif
>  #endif /*ifndef __NR_kexec_load*/
>  
> +#ifndef __NR_kexec_file_load
> +
> +#ifdef __x86_64__
> +#define __NR_kexec_file_load	317
> +#endif
> +
> +#ifndef __NR_kexec_file_load
> +/* system call not available for the arch */
> +#define __NR_kexec_file_load	0xffffffff	/* system call not available */
> +#endif
> +
> +#endif /*ifndef __NR_kexec_file_load*/
> +
>  struct kexec_segment;
>  
>  static inline long kexec_load(void *entry, unsigned long nr_segments,
> @@ -61,10 +74,28 @@ static inline long kexec_load(void *entr
>  	return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
>  }
>  
> +static inline int is_kexec_file_load_implemented(void) {
> +	if (__NR_kexec_file_load != 0xffffffff)
> +		return 1;
> +	return 0;
> +}
> +
> +static inline long kexec_file_load(int kernel_fd, int initrd_fd,
> +			const char *cmdline_ptr, unsigned long cmdline_len,
> +			unsigned long flags)
> +{
> +	return (long) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
> +				cmdline_ptr, cmdline_len, flags);
> +}
> +
>  #define KEXEC_ON_CRASH		0x00000001
>  #define KEXEC_PRESERVE_CONTEXT	0x00000002
>  #define KEXEC_ARCH_MASK		0xffff0000
>  
> +/* Flags for kexec file based system call */
> +#define KEXEC_FILE_UNLOAD	0x00000001
> +#define KEXEC_FILE_ON_CRASH	0x00000002
> +
>  /* These values match the ELF architecture values. 
>   * Unless there is a good reason that should continue to be the case.
>   */

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-05  9:56     ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-05  9:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On 06/03/14 at 09:06am, Vivek Goyal wrote:
> This is the new syscall kexec_file_load() declaration/interface. I have
> reserved the syscall number only for x86_64 so far. Other architectures
> (including i386) can reserve syscall number when they enable the support
> for this new syscall.

Hi, Vivek

I have a comment below about the kexec_file_load args.

[..]
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c435c5f..a3044e6 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
>  }
>  #endif
>  
> +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> +		const char __user *, cmdline_ptr, unsigned long,
> +		cmdline_len, unsigned long, flags)

initrd is optional for system boot.

How about using int *kernel_fd and int *initrd_fd as the arguments? Then
if I don't need an initrd, I can do this in userspace:

kexec_file_load(&kernel_fd, NULL, ...)

You could even drop the KEXEC_FILE_UNLOAD flag, because you could tell
that the caller wants to unload when the following is invoked:

kexec_file_load(NULL, NULL, ...)
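A minimal userspace sketch of the semantics proposed above, assuming a hypothetical pointer-based variant of the call (the V3 interface in this patch takes plain ints): a NULL initrd pointer means "no initrd", and NULL for both pointers means "unload". The enum and `classify_file_load()` are invented names for illustration only.

```c
#include <stddef.h>

/* Hypothetical: what a pointer-based kexec_file_load() would be asked to do. */
enum file_load_action {
	ACTION_INVALID,          /* initrd given without a kernel */
	ACTION_UNLOAD,           /* neither kernel nor initrd */
	ACTION_LOAD_NO_INITRD,   /* kernel only */
	ACTION_LOAD_WITH_INITRD, /* kernel and initrd */
};

/* Hypothetical: derive the action purely from which fd pointers are NULL. */
static enum file_load_action classify_file_load(const int *kernel_fd,
						const int *initrd_fd)
{
	if (!kernel_fd)
		return initrd_fd ? ACTION_INVALID : ACTION_UNLOAD;
	return initrd_fd ? ACTION_LOAD_WITH_INITRD : ACTION_LOAD_NO_INITRD;
}
```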

Thanks
WANG Chao

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-05 11:15     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-05 11:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:56AM -0400, Vivek Goyal wrote:
> Previous patch provided the interface definition and this patch provides
> implementation of new syscall.
> 
> Previously segment list was prepared in user space. Now user space just
> passes kernel fd, initrd fd and command line and kernel will create a
> segment list internally.
> 
> This patch contains generic part of the code. Actual segment preparation
> and loading is done by arch and image specific loader. Which comes in
> next patch.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/kernel/machine_kexec_64.c |  54 +++++
>  include/linux/kexec.h              |  53 ++++
>  include/uapi/linux/kexec.h         |   4 +
>  kernel/kexec.c                     | 483 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 589 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 679cef0..d9c5cf0 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -22,6 +22,13 @@
>  #include <asm/mmu_context.h>
>  #include <asm/debugreg.h>
>  
> +/* arch dependent functionality related to kexec file based syscall */

  ... arch-dependent ...			... file-based ...

> +static struct kexec_file_type kexec_file_type[] = {

You could call this kexec_file_types, use ARRAY_SIZE and drop this
nr_file_types; mangled diff on top:

Index: b/arch/x86/kernel/machine_kexec_64.c
===================================================================
--- a/arch/x86/kernel/machine_kexec_64.c        2014-06-04 17:32:31.520372283 +0200
+++ b/arch/x86/kernel/machine_kexec_64.c        2014-06-04 17:30:59.214376321 +0200
@@ -23,12 +23,10 @@
 #include <asm/debugreg.h>
 
 /* arch dependent functionality related to kexec file based syscall */
-static struct kexec_file_type kexec_file_type[] = {
+static struct kexec_file_type kexec_file_types[] = {
        {"", NULL, NULL, NULL},
 };
 
-static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
-
 static void free_transition_pgtable(struct kimage *image)
 {
        free_page((unsigned long)image->arch.pud);
@@ -297,7 +295,7 @@ int arch_kexec_kernel_image_probe(struct
 {
        int i, ret = -ENOEXEC;
 
-       for (i = 0; i < nr_file_types; i++) {
+       for (i = 0; i < ARRAY_SIZE(kexec_file_types); i++) {
                if (!kexec_file_type[i].probe)
                        continue;
 

> +	{"", NULL, NULL, NULL},
> +};
> +
> +static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
> +

Superfluous newline.
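A standalone userspace sketch of the ARRAY_SIZE suggestion: the handler table's length is derived at compile time instead of being kept in a separate nr_file_types variable. The kernel's own ARRAY_SIZE additionally rejects non-array arguments via __must_be_array(); the simplified macro, the cut-down struct and the demo handler below are assumptions for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified ARRAY_SIZE; the kernel version also adds __must_be_array(). */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Cut-down handler entry: only the fields this sketch needs. */
struct kexec_file_type {
	const char *name;
	int (*probe)(const char *buf, unsigned long len);
};

/* Hypothetical probe: accepts any buffer that starts with 'B'. */
static int demo_probe(const char *buf, unsigned long len)
{
	return (len > 0 && buf[0] == 'B') ? 0 : -ENOEXEC;
}

static struct kexec_file_type kexec_file_types[] = {
	{ "", NULL },           /* empty placeholder entry, as in the patch */
	{ "demo", demo_probe },
};

/* Return the index of the first handler that accepts buf, or -ENOEXEC. */
static int probe_image(const char *buf, unsigned long len)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(kexec_file_types); i++) {
		if (!kexec_file_types[i].probe)
			continue;
		if (kexec_file_types[i].probe(buf, len) == 0)
			return (int)i;
	}
	return -ENOEXEC;
}
```

Using size_t for the loop index also avoids a signed/unsigned comparison against ARRAY_SIZE's result.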

>  static void free_transition_pgtable(struct kimage *image)
>  {
>  	free_page((unsigned long)image->arch.pud);
> @@ -283,3 +290,50 @@ void arch_crash_save_vmcoreinfo(void)
>  			      (unsigned long)&_text - __START_KERNEL);
>  }
>  
> +/* arch dependent functionality related to kexec file based syscall */
> +
> +int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> +					unsigned long buf_len)

Arg alignment: it is customary to put continuation args on a new line,
aligned with the column right after the opening parenthesis. Ditto for the
rest of the locations where this is the case.

> +{
> +	int i, ret = -ENOEXEC;
> +
> +	for (i = 0; i < nr_file_types; i++) {
> +		if (!kexec_file_type[i].probe)
> +			continue;
> +
> +		ret = kexec_file_type[i].probe(buf, buf_len);
> +		if (!ret) {
> +			image->file_handler_idx = i;
> +			return ret;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> +			unsigned long kernel_len, char *initrd,
> +			unsigned long initrd_len, char *cmdline,
> +			unsigned long cmdline_len)

Those are a *lot* of arguments. Maybe a helper struct encompassing them
all to pass around?
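One possible reading of this suggestion, as a sketch: bundle the six buffer/length arguments into one struct and pass a single pointer. The struct name `kexec_file_bufs` and `demo_image_load()` are hypothetical, not part of the patch; the loader here only repeats the patch's check that the command line ends in a NUL byte.

```c
#include <stddef.h>

/* Hypothetical bundle of the buffers that file-based kexec passes around. */
struct kexec_file_bufs {
	char *kernel;
	unsigned long kernel_len;
	char *initrd;
	unsigned long initrd_len;
	char *cmdline;
	unsigned long cmdline_len;  /* includes the terminating NUL */
};

/*
 * Hypothetical loader entry point taking one pointer instead of six
 * arguments. Returns 0 if the kernel buffer is present and the command
 * line is NUL-terminated, -1 otherwise.
 */
static int demo_image_load(const struct kexec_file_bufs *b)
{
	if (!b->kernel || b->kernel_len == 0)
		return -1;
	if (b->cmdline_len == 0 || b->cmdline[b->cmdline_len - 1] != '\0')
		return -1;
	return 0;
}
```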

> +{
> +	int idx = image->file_handler_idx;
> +
> +	if (idx < 0)
> +		return ERR_PTR(-ENOEXEC);
> +
> +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> +					initrd_len, cmdline, cmdline_len);
> +}
> +
> +int arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	int idx = image->file_handler_idx;
> +
> +	/* This can be called up even before image handler has been set */
> +	if (idx < 0)
> +		return 0;

Btw, these games with the index seem not optimal to me. Why not simply
have image->fops or so which is a pointer to struct kexec_file_type
after having it renamed to kexec_file_ops and then assign the correct
one to image->fops in arch_kexec_kernel_image_probe() and then simply
call the proper handler:

	if (!image->fops)
		return;

	return image->fops->cleanup(image);

and above

	return image->fops->load(...)

and so on.

In any case, this looks cleaner to me.
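A self-contained sketch of the ops-pointer pattern described above, assuming the table entries become struct kexec_file_ops and struct kimage grows a `fops` pointer (names taken from the suggestion). The stub handlers, the `kexec_file_loaders` table and the cut-down struct kimage are hypothetical. Probe records the matching ops; later stages dispatch through it, and a NULL check replaces the `idx < 0` tests.

```c
#include <errno.h>
#include <stddef.h>

struct kimage;

struct kexec_file_ops {
	int (*probe)(const char *buf, unsigned long len);
	int (*cleanup)(struct kimage *image);
};

/* Cut-down stand-in for the kernel's struct kimage. */
struct kimage {
	const struct kexec_file_ops *fops;  /* NULL until probe succeeds */
	int cleaned_up;
};

/* Hypothetical stub handlers. */
static int stub_probe(const char *buf, unsigned long len)
{
	return (len > 0 && buf[0] == 'B') ? 0 : -ENOEXEC;
}

static int stub_cleanup(struct kimage *image)
{
	image->cleaned_up = 1;
	return 0;
}

static const struct kexec_file_ops stub_ops = { stub_probe, stub_cleanup };

/* Hypothetical table of loaders, searched in order by image_probe(). */
static const struct kexec_file_ops *kexec_file_loaders[] = { &stub_ops };
#define NR_LOADERS (sizeof(kexec_file_loaders) / sizeof(kexec_file_loaders[0]))

static int image_probe(struct kimage *image, const char *buf,
		       unsigned long len)
{
	size_t i;

	for (i = 0; i < NR_LOADERS; i++) {
		if (kexec_file_loaders[i]->probe(buf, len) == 0) {
			image->fops = kexec_file_loaders[i];
			return 0;
		}
	}
	return -ENOEXEC;
}

/* Safe to call before probe: no ops recorded means nothing to clean up. */
static int image_cleanup(struct kimage *image)
{
	if (!image->fops || !image->fops->cleanup)
		return 0;
	return image->fops->cleanup(image);
}
```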

> +
> +	if (kexec_file_type[idx].cleanup)
> +		return kexec_file_type[idx].cleanup(image);
> +	return 0;
> +}
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d0285cc..3790519 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -121,13 +121,58 @@ struct kimage {
>  #define KEXEC_TYPE_DEFAULT 0
>  #define KEXEC_TYPE_CRASH   1
>  	unsigned int preserve_context : 1;
> +	/* If set, we are using file mode kexec syscall */
> +	unsigned int file_mode:1;
>  
>  #ifdef ARCH_HAS_KIMAGE_ARCH
>  	struct kimage_arch arch;
>  #endif
> +
> +	/* Additional Fields for file based kexec syscall */

Why capitalized?

> +	void *kernel_buf;
> +	unsigned long kernel_buf_len;
> +
> +	void *initrd_buf;
> +	unsigned long initrd_buf_len;
> +
> +	char *cmdline_buf;
> +	unsigned long cmdline_buf_len;
> +
> +	/* index of file handler in array */
> +	int file_handler_idx;
> +
> +	/* Image loader handling the kernel can store a pointer here */
> +	void *image_loader_data;
>  };
>  
> +/*
> + * Keeps a track of buffer parameters as provided by caller for requesting

"Keeps track"

> + * memory placement of buffer.
> + */
> +struct kexec_buf {
> +	struct kimage *image;
> +	char *buffer;
> +	unsigned long bufsz;
> +	unsigned long memsz;
> +	unsigned long buf_align;
> +	unsigned long buf_min;
> +	unsigned long buf_max;
> +	bool top_down;		/* allocate from top of memory hole */
> +};
>  
> +typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
> +typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
> +				unsigned long kernel_len, char *initrd,
> +				unsigned long initrd_len, char *cmdline,
> +				unsigned long cmdline_len);
> +typedef int (kexec_cleanup_t)(struct kimage *image);
> +
> +struct kexec_file_type {
> +	const char *name;
> +	kexec_probe_t *probe;
> +	kexec_load_t *load;
> +	kexec_cleanup_t *cleanup;
> +};
>  
>  /* kexec interface functions */
>  extern void machine_kexec(struct kimage *image);
> @@ -138,6 +183,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
>  					struct kexec_segment __user *segments,
>  					unsigned long flags);
>  extern int kernel_kexec(void);
> +extern int kexec_add_buffer(struct kimage *image, char *buffer,
> +			unsigned long bufsz, unsigned long memsz,
> +			unsigned long buf_align, unsigned long buf_min,
> +			unsigned long buf_max, bool top_down,
> +			unsigned long *load_addr);
>  extern struct page *kimage_alloc_control_pages(struct kimage *image,
>  						unsigned int order);
>  extern void crash_kexec(struct pt_regs *);
> @@ -188,6 +238,9 @@ extern int kexec_load_disabled;
>  #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
>  #endif
>  
> +/* Listof defined/legal kexec file flags */

"List of ..."

> +#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
> +
>  #define VMCOREINFO_BYTES           (4096)
>  #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
>  #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index d6629d4..5fddb1b 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -13,6 +13,10 @@
>  #define KEXEC_PRESERVE_CONTEXT	0x00000002
>  #define KEXEC_ARCH_MASK		0xffff0000
>  
> +/* Kexec file load interface flags */
> +#define KEXEC_FILE_UNLOAD	0x00000001
> +#define KEXEC_FILE_ON_CRASH	0x00000002

Do we have those documented somewhere and what do they mean?

> +
>  /* These values match the ELF architecture values.
>   * Unless there is a good reason that should continue to be the case.
>   */
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a3044e6..1ad4d60 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -260,6 +260,221 @@ static struct kimage *do_kimage_alloc_init(void)
>  
>  static void kimage_free_page_list(struct list_head *list);
>  
> +static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
> +{
> +	struct fd f = fdget(fd);
> +	int ret = 0;
> +	struct kstat stat;
> +	loff_t pos;
> +	ssize_t bytes = 0;
> +
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = vfs_getattr(&f.file->f_path, &stat);
> +	if (ret)
> +		goto out;
> +
> +	if (stat.size > INT_MAX) {
> +		ret = -EFBIG;
> +		goto out;
> +	}
> +
> +	/* Don't hand 0 to vmalloc, it whines. */
> +	if (stat.size == 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	*buf = vmalloc(stat.size);
> +	if (!*buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pos = 0;
> +	while (pos < stat.size) {
> +		bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
> +					stat.size - pos);
> +		if (bytes < 0) {
> +			vfree(*buf);
> +			ret = bytes;
> +			goto out;
> +		}
> +
> +		if (bytes == 0)
> +			break;
> +		pos += bytes;
> +	}
> +
> +	*buf_len = pos;
> +out:
> +	fdput(f);
> +	return ret;
> +}
> +
> +/* Architectures can provide this probe function */
> +int __attribute__ ((weak))

We have __weak for that.
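For reference, the kernel's compiler headers define `__weak` as `__attribute__((weak))`. A userspace sketch of the weak-default pattern used here (the macro is defined locally so the example compiles on its own): generic code supplies a weak fallback returning -ENOEXEC, and an architecture that provides a strong definition of the same symbol overrides it at link time.

```c
#include <errno.h>

/* Same definition the kernel's compiler headers provide. */
#ifndef __weak
#define __weak __attribute__((weak))
#endif

struct kimage;  /* opaque here; only passed through */

/*
 * Weak default: used only if no other object linked into the program
 * defines a strong arch_kexec_kernel_image_probe().
 */
int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
					 unsigned long buf_len)
{
	(void)image;
	(void)buf;
	(void)buf_len;
	return -ENOEXEC;
}
```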

> +arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> +				unsigned long buf_len)
> +{
> +	return -ENOEXEC;
> +}
> +
> +void *__attribute__ ((weak))

ditto.

> +arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len, char *initrd,
> +		unsigned long initrd_len, char *cmdline,
> +		unsigned long cmdline_len)
> +{
> +	return ERR_PTR(-ENOEXEC);
> +}
> +
> +void __attribute__ ((weak))

ditto.

> +arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	return;
> +}
> +
> +/*
> + * Free up tempory buffers allocated which are not needed after image has
> + * been loaded.
> + *
> + * Free up memory used by kernel, initrd, and comand line. This is temporary
> + * memory allocation which is not needed any more after these buffers have
> + * been loaded into separate segments and have been copied elsewhere
> + */

Why do we need that comment? It is obvious what's going on.

> +static void kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	vfree(image->kernel_buf);
> +	image->kernel_buf = NULL;
> +
> +	vfree(image->initrd_buf);
> +	image->initrd_buf = NULL;
> +
> +	vfree(image->cmdline_buf);
> +	image->cmdline_buf = NULL;
> +
> +	/* See if architcture has anything to cleanup post load */

s/architcture/architecture/

> +	arch_kimage_file_post_load_cleanup(image);
> +}
> +
> +/*
> + * In file mode list of segments is prepared by kernel. Copy relevant
> + * data from user space, do error checking, prepare segment list
> + */
> +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)

arg alignment

> +{
> +	int ret = 0;
> +	void *ldata;
> +
> +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> +					&image->kernel_buf_len);
> +	if (ret)
> +		return ret;
> +
> +	/* Call arch image probe handlers */
> +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> +						image->kernel_buf_len);
> +
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> +					&image->initrd_buf_len);
> +	if (ret)
> +		goto out;
> +
> +	image->cmdline_buf = vzalloc(cmdline_len);
> +	if (!image->cmdline_buf)

		ret = -ENOMEM;

> +		goto out;
> +
> +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> +	if (ret) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	image->cmdline_buf_len = cmdline_len;
> +
> +	/* command line should be a string with last byte null */
> +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Call arch image load handlers */
> +	ldata = arch_kexec_kernel_image_load(image,
> +			image->kernel_buf, image->kernel_buf_len,
> +			image->initrd_buf, image->initrd_buf_len,
> +			image->cmdline_buf, image->cmdline_buf_len);
> +
> +	if (IS_ERR(ldata)) {
> +		ret = PTR_ERR(ldata);
> +		goto out;
> +	}
> +
> +	image->image_loader_data = ldata;
> +out:
> +	/* In case of error, free up all allocated memory in this function */
> +	if (ret)
> +		kimage_file_post_load_cleanup(image);
> +	return ret;
> +}
> +
> +static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)

arg alignment

> +{
> +	int result;
> +	struct kimage *image;
> +
> +	/* Allocate and initialize a controlling structure */

No need for that comment IMO.

> +	image = do_kimage_alloc_init();
> +	if (!image)
> +		return -ENOMEM;
> +
> +	image->file_mode = 1;
> +	image->file_handler_idx = -1;
> +
> +	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
> +			cmdline_ptr, cmdline_len);

arg alignment... yeah, you know what I mean, I'm not going to point
those out anymore.

> +	if (result)
> +		goto out_free_image;
> +
> +	result = sanity_check_segment_list(image);
> +	if (result)
> +		goto out_free_post_load_bufs;
> +
> +	result = -ENOMEM;
> +	image->control_code_page = kimage_alloc_control_pages(image,
> +					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> +	if (!image->control_code_page) {
> +		pr_err("Could not allocate control_code_buffer\n");

Might wanna define pr_fmt when using the pr_* things for the first time
in this file.
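In the kernel, pr_err(fmt, ...) expands to printk(KERN_ERR pr_fmt(fmt), ...), and <linux/printk.h> only supplies a default pr_fmt (which leaves fmt unchanged) if the file has not defined one already. So the usual pattern is to define pr_fmt at the top of the file, before any #include. The userspace sketch below imitates that mechanism with snprintf so it runs standalone; the "kexec: " prefix and the `demo_pr_err` macro are assumptions for illustration.

```c
#include <stdio.h>

/* In kernel code this definition goes before the first #include. */
#define pr_fmt(fmt) "kexec: " fmt

/*
 * Hypothetical userspace stand-in for pr_err(): formats into buf instead
 * of the kernel log, but applies pr_fmt() the same way.
 */
#define demo_pr_err(buf, size, fmt, ...) \
	snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)
```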

> +		goto out_free_post_load_bufs;
> +	}
> +
> +	image->swap_page = kimage_alloc_control_pages(image, 0);
> +	if (!image->swap_page) {
> +		pr_err(KERN_ERR "Could not allocate swap buffer\n");
> +		goto out_free_control_pages;
> +	}
> +
> +	*rimage = image;
> +	return 0;
> +
> +out_free_control_pages:
> +	kimage_free_page_list(&image->control_pages);
> +out_free_post_load_bufs:
> +	kimage_file_post_load_cleanup(image);
> +	kfree(image->image_loader_data);
> +out_free_image:
> +	kfree(image);
> +	return result;
> +}
> +
>  static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
>  				unsigned long nr_segments,
>  				struct kexec_segment __user *segments)
> @@ -683,6 +898,16 @@ static void kimage_free(struct kimage *image)
>  
>  	/* Free the kexec control pages... */
>  	kimage_free_page_list(&image->control_pages);
> +
> +	kfree(image->image_loader_data);
> +
> +	/*
> +	 * Free up any temporary buffers allocated. This might hit if
> +	 * error occurred much later after buffer allocation.
> +	 */
> +	if (image->file_mode)
> +		kimage_file_post_load_cleanup(image);
> +
>  	kfree(image);
>  }
>  
> @@ -812,10 +1037,14 @@ static int kimage_load_normal_segment(struct kimage *image,
>  	unsigned long maddr;
>  	size_t ubytes, mbytes;
>  	int result;
> -	unsigned char __user *buf;
> +	unsigned char __user *buf = NULL;
> +	unsigned char *kbuf = NULL;
>  
>  	result = 0;
> -	buf = segment->buf;
> +	if (image->file_mode)
> +		kbuf = segment->kbuf;
> +	else
> +		buf = segment->buf;
>  	ubytes = segment->bufsz;
>  	mbytes = segment->memsz;
>  	maddr = segment->mem;
> @@ -847,7 +1076,11 @@ static int kimage_load_normal_segment(struct kimage *image,
>  				PAGE_SIZE - (maddr & ~PAGE_MASK));
>  		uchunk = min(ubytes, mchunk);
>  
> -		result = copy_from_user(ptr, buf, uchunk);
> +		/* For file based kexec, source pages are in kernel memory */
> +		if (image->file_mode)
> +			memcpy(ptr, kbuf, uchunk);
> +		else
> +			result = copy_from_user(ptr, buf, uchunk);
>  		kunmap(page);
>  		if (result) {
>  			result = -EFAULT;
> @@ -855,7 +1088,10 @@ static int kimage_load_normal_segment(struct kimage *image,
>  		}
>  		ubytes -= uchunk;
>  		maddr  += mchunk;
> -		buf    += mchunk;
> +		if (image->file_mode)
> +			kbuf += mchunk;
> +		else
> +			buf += mchunk;
>  		mbytes -= mchunk;
>  	}
>  out:
> @@ -1102,7 +1338,64 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  		const char __user *, cmdline_ptr, unsigned long,
>  		cmdline_len, unsigned long, flags)
>  {
> -	return -ENOSYS;
> +	int ret = 0, i;
> +	struct kimage **dest_image, *image;
> +
> +	/* We only trust the superuser with rebooting the system. */
> +	if (!capable(CAP_SYS_BOOT))
> +		return -EPERM;
> +
> +	/* Make sure we have a legal set of flags */
> +	if (flags != (flags & KEXEC_FILE_FLAGS))
> +		return -EINVAL;

This test looks strange: according to it, kexec_file_load has to always
be called with both KEXEC_FILE_UNLOAD and KEXEC_FILE_ON_CRASH set. Don't
you want to check against an allowed mask or so like KEXEC_FLAGS is
handled in kexec_load?
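For comparison, a standalone sketch of an allowed-mask check written as `flags & ~KEXEC_FILE_FLAGS` (flag values copied from the uapi hunk above; the two helper names are hypothetical). Working through the cases, the patch's `flags != (flags & KEXEC_FILE_FLAGS)` form behaves the same way: both accept 0, either flag alone, or both together, and reject any bit outside the mask.

```c
/* Values from include/uapi/linux/kexec.h in this patch. */
#define KEXEC_FILE_UNLOAD	0x00000001
#define KEXEC_FILE_ON_CRASH	0x00000002
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/* Hypothetical helper: the patch's test, true if flags has no unknown bits. */
static int flags_ok_patch(unsigned long flags)
{
	return flags == (flags & KEXEC_FILE_FLAGS);
}

/* Hypothetical helper: reject if any bit outside the allowed mask is set. */
static int flags_ok_mask(unsigned long flags)
{
	return (flags & ~(unsigned long)KEXEC_FILE_FLAGS) == 0;
}
```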

> +
> +	image = NULL;
> +
> +	if (!mutex_trylock(&kexec_mutex))
> +		return -EBUSY;
> +
> +	dest_image = &kexec_image;
> +	if (flags & KEXEC_FILE_ON_CRASH)
> +		dest_image = &kexec_crash_image;
> +
> +	if (flags & KEXEC_FILE_UNLOAD)
> +		goto exchange;
> +
> +	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
> +				cmdline_ptr, cmdline_len);
> +	if (ret)
> +		goto out;
> +
> +	ret = machine_kexec_prepare(image);
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < image->nr_segments; i++) {
> +		struct kexec_segment *ksegment;
> +
> +		ksegment = &image->segment[i];
> +		pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
> +			 i, ksegment->buf, ksegment->bufsz, ksegment->mem,
> +			 ksegment->memsz);
> +
> +		ret = kimage_load_segment(image, &image->segment[i]);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	kimage_terminate(image);
> +
> +	/*
> +	 * Free up any temporary buffers allocated which are not needed
> +	 * after image has been loaded
> +	 */
> +	kimage_file_post_load_cleanup(image);
> +exchange:
> +	image = xchg(dest_image, image);
> +out:
> +	mutex_unlock(&kexec_mutex);
> +	kimage_free(image);
> +	return ret;
>  }
>  
>  void crash_kexec(struct pt_regs *regs)
> @@ -1659,6 +1952,186 @@ static int __init crash_save_vmcoreinfo_init(void)
>  
>  subsys_initcall(crash_save_vmcoreinfo_init);
>  
> +static int __kexec_add_segment(struct kimage *image, char *buf,
> +		unsigned long bufsz, unsigned long mem, unsigned long memsz)
> +{
> +	struct kexec_segment *ksegment;
> +
> +	ksegment = &image->segment[image->nr_segments];
> +	ksegment->kbuf = buf;
> +	ksegment->bufsz = bufsz;
> +	ksegment->mem = mem;
> +	ksegment->memsz = memsz;
> +	image->nr_segments++;
> +
> +	return 0;
> +}
> +
> +static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
> +					struct kexec_buf *kbuf)
> +{
> +	struct kimage *image = kbuf->image;
> +	unsigned long temp_start, temp_end;
> +
> +	temp_end = min(end, kbuf->buf_max);
> +	temp_start = temp_end - kbuf->memsz;
> +
> +	do {
> +		/* align down start */
> +		temp_start = temp_start & (~(kbuf->buf_align - 1));
> +
> +		if (temp_start < start || temp_start < kbuf->buf_min)
> +			return 0;
> +
> +		temp_end = temp_start + kbuf->memsz - 1;
> +
> +		/*
> +		 * Make sure this does not conflict with any of existing
> +		 * segments
> +		 */
> +		if (kimage_is_destination_range(image, temp_start, temp_end)) {
> +			temp_start = temp_start - PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* We found a suitable memory range */
> +		break;
> +	} while (1);
> +
> +	/* If we are here, we found a suitable memory range */
> +	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
> +				kbuf->memsz);
> +
> +	/* Stop navigating through remaining System RAM ranges */
> +	return 1;
> +}
> +
> +static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
> +					struct kexec_buf *kbuf)
> +{
> +	struct kimage *image = kbuf->image;
> +	unsigned long temp_start, temp_end;
> +
> +	temp_start = max(start, kbuf->buf_min);
> +
> +	do {
> +		temp_start = ALIGN(temp_start, kbuf->buf_align);
> +		temp_end = temp_start + kbuf->memsz - 1;
> +
> +		if (temp_end > end || temp_end > kbuf->buf_max)
> +			return 0;
> +		/*
> +		 * Make sure this does not conflict with any of existing
> +		 * segments
> +		 */
> +		if (kimage_is_destination_range(image, temp_start, temp_end)) {
> +			temp_start = temp_start + PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* We found a suitable memory range */
> +		break;
> +	} while (1);
> +
> +	/* If we are here, we found a suitable memory range */
> +	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
> +				kbuf->memsz);
> +
> +	/* Stop navigating through remaining System RAM ranges */
> +	return 1;
> +}
> +
> +static int walk_ram_range_callback(u64 start, u64 end, void *arg)
> +{
> +	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
> +	unsigned long sz = end - start + 1;
> +
> +	/* Returning 0 will take to next memory range */
> +	if (sz < kbuf->memsz)
> +		return 0;
> +
> +	if (end < kbuf->buf_min || start > kbuf->buf_max)
> +		return 0;
> +
> +	/*
> +	 * Allocate memory top down with-in ram range. Otherwise bottom up
> +	 * allocation.
> +	 */
> +	if (kbuf->top_down)
> +		return locate_mem_hole_top_down(start, end, kbuf);
> +	else
> +		return locate_mem_hole_bottom_up(start, end, kbuf);
> +}
> +
> +/*
> + * Helper functions for placing a buffer in a kexec segment. This assumes

s/functions/function/

> + * that kexec_mutex is held.
> + */
> +int kexec_add_buffer(struct kimage *image, char *buffer,
> +		unsigned long bufsz, unsigned long memsz,
> +		unsigned long buf_align, unsigned long buf_min,
> +		unsigned long buf_max, bool top_down, unsigned long *load_addr)
> +{
> +
> +	unsigned long nr_segments = image->nr_segments, new_nr_segments;
> +	struct kexec_segment *ksegment;
> +	struct kexec_buf buf, *kbuf;
> +
> +	/* Currently adding segment this way is allowed only in file mode */
> +	if (!image->file_mode)
> +		return -EINVAL;
> +
> +	if (nr_segments >= KEXEC_SEGMENT_MAX)
> +		return -EINVAL;
> +
> +	/*
> +	 * Make sure we are not trying to add buffer after allocating
> +	 * control pages. All segments need to be placed first before
> +	 * any control pages are allocated. As control page allocation
> +	 * logic goes through list of segments to make sure there are
> +	 * no destination overlaps.
> +	 */
> +	if (!list_empty(&image->control_pages)) {
> +		WARN_ON(1);
> +		return -EINVAL;
> +	}
> +
> +	memset(&buf, 0, sizeof(struct kexec_buf));
> +	kbuf = &buf;
> +	kbuf->image = image;
> +	kbuf->buffer = buffer;
> +	kbuf->bufsz = bufsz;
> +
> +	/* Align memsz to next page boundary */

No need for that comment...

> +	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> +
> +	/* Align to atleast page size boundary */

ditto.

> +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> +	kbuf->buf_min = buf_min;
> +	kbuf->buf_max = buf_max;
> +	kbuf->top_down = top_down;
> +
> +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> +
> +	/*
> +	 * If range could be found successfully, it would have incremented
> +	 * the nr_segments value.
> +	 */
> +	new_nr_segments = image->nr_segments;
> +
> +	/* A suitable memory range could not be found for buffer */
> +	if (new_nr_segments == nr_segments)
> +		return -EADDRNOTAVAIL;

Right, why don't you check walk_system_ram_res()'s return value? If it is
nonzero (walk_ram_range_callback() returns 1 on "success"), you can drop this
roundabout way of checking whether finding a new range succeeded.
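
A minimal user-space sketch of the suggested flow. walk_ranges(), fit_cb() and add_buffer() are hypothetical stand-ins for walk_system_ram_res(), walk_ram_range_callback() and kexec_add_buffer(). It assumes the walker propagates the first nonzero callback result and returns -1 when it visits no range at all (as walk_system_ram_range() does); under that assumption, testing for exactly 1 is safer than testing for nonzero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one "System RAM" resource (inclusive end). */
struct ram_range {
	unsigned long start;
	unsigned long end;
};

typedef int (*range_cb)(unsigned long start, unsigned long end, void *arg);

/*
 * Hypothetical walker: calls cb on each range, stops at the first nonzero
 * return and propagates it; returns -1 if there is no range to visit.
 */
static int walk_ranges(const struct ram_range *r, size_t n, void *arg,
		       range_cb cb)
{
	int ret = -1;

	assert(cb != NULL);
	for (size_t i = 0; i < n; i++) {
		ret = cb(r[i].start, r[i].end, arg);
		if (ret)
			break;
	}
	return ret;
}

struct want {
	unsigned long memsz;
	unsigned long found;
};

/* Returns 1 ("stop walking") once a large enough range is seen. */
static int fit_cb(unsigned long start, unsigned long end, void *arg)
{
	struct want *w = arg;

	if (end - start + 1 < w->memsz)
		return 0;
	w->found = start;
	return 1;
}

/* Success is decided from the walker's return value alone. */
static int add_buffer(const struct ram_range *r, size_t n,
		      unsigned long memsz, unsigned long *load_addr)
{
	struct want w = { .memsz = memsz, .found = 0 };
	int ret = walk_ranges(r, n, &w, fit_cb);

	/* 0 (nothing fit) and -1 (nothing walked) both mean failure. */
	if (ret != 1)
		return -EADDRNOTAVAIL;
	*load_addr = w.found;
	return 0;
}
```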

> +
> +	/* Found a suitable memory range */
> +

superfluous newline.

> +	ksegment = &image->segment[new_nr_segments - 1];
> +	*load_addr = ksegment->mem;
> +	return 0;
> +}
> +
> +
>  /*
>   * Move into place and start executing a preloaded standalone
>   * executable.  If nothing was preloaded return an error.
> -- 
> 1.9.0
> 
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
@ 2014-06-05 11:15     ` Borislav Petkov
  0 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-05 11:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mjg59, bhe, jkosina, greg, kexec, linux-kernel, ebiederm, hpa,
	akpm, dyoung, chaowang

On Tue, Jun 03, 2014 at 09:06:56AM -0400, Vivek Goyal wrote:
> The previous patch provided the interface definition; this patch provides
> the implementation of the new syscall.
> 
> Previously, the segment list was prepared in user space. Now user space just
> passes the kernel fd, initrd fd and command line, and the kernel creates the
> segment list internally.
> 
> This patch contains the generic part of the code. Actual segment preparation
> and loading are done by the arch- and image-specific loader, which comes in
> the next patch.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/kernel/machine_kexec_64.c |  54 +++++
>  include/linux/kexec.h              |  53 ++++
>  include/uapi/linux/kexec.h         |   4 +
>  kernel/kexec.c                     | 483 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 589 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 679cef0..d9c5cf0 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -22,6 +22,13 @@
>  #include <asm/mmu_context.h>
>  #include <asm/debugreg.h>
>  
> +/* arch dependent functionality related to kexec file based syscall */

  ... arch-dependent ...			... file-based ...

> +static struct kexec_file_type kexec_file_type[] = {

You could call this kexec_file_types, use ARRAY_SIZE and drop this
nr_file_types; mangled diff on top:

Index: b/arch/x86/kernel/machine_kexec_64.c
===================================================================
--- a/arch/x86/kernel/machine_kexec_64.c        2014-06-04 17:32:31.520372283 +0200
+++ b/arch/x86/kernel/machine_kexec_64.c        2014-06-04 17:30:59.214376321 +0200
@@ -23,12 +23,10 @@
 #include <asm/debugreg.h>
 
 /* arch dependent functionality related to kexec file based syscall */
-static struct kexec_file_type kexec_file_type[] = {
+static struct kexec_file_type kexec_file_types[] = {
        {"", NULL, NULL, NULL},
 };
 
-static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
-
 static void free_transition_pgtable(struct kimage *image)
 {
        free_page((unsigned long)image->arch.pud);
@@ -297,7 +295,7 @@ int arch_kexec_kernel_image_probe(struct
 {
        int i, ret = -ENOEXEC;
 
-       for (i = 0; i < nr_file_types; i++) {
+       for (i = 0; i < ARRAY_SIZE(kexec_file_types); i++) {
                if (!kexec_file_type[i].probe)
                        continue;
 

> +	{"", NULL, NULL, NULL},
> +};
> +
> +static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
> +

Superfluous newline.
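
Regarding the ARRAY_SIZE suggestion, a user-space sketch of the idea. The macro here is a simplified stand-in for the kernel's, which additionally refuses plain pointers at compile time; the "mz-demo" entry and probe_mz() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified ARRAY_SIZE(); the kernel version also rejects pointers. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

typedef int (probe_fn)(const char *buf, unsigned long len);

struct file_type {
	const char *name;
	probe_fn *probe;
};

/* Hypothetical probe: accepts buffers that start with "MZ". */
static int probe_mz(const char *buf, unsigned long len)
{
	return (len >= 2 && buf[0] == 'M' && buf[1] == 'Z') ? 0 : -ENOEXEC;
}

static const struct file_type file_types[] = {
	{ "", NULL },
	{ "mz-demo", probe_mz },
};

/* Index of the first handler whose probe accepts buf, else -ENOEXEC. */
static int probe_image(const char *buf, unsigned long len)
{
	assert(buf != NULL);
	for (size_t i = 0; i < ARRAY_SIZE(file_types); i++) {
		if (!file_types[i].probe)
			continue;
		if (file_types[i].probe(buf, len) == 0)
			return (int)i;
	}
	return -ENOEXEC;
}
```

Because the count is derived from the array itself, adding an entry cannot leave a separate counter stale.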

>  static void free_transition_pgtable(struct kimage *image)
>  {
>  	free_page((unsigned long)image->arch.pud);
> @@ -283,3 +290,50 @@ void arch_crash_save_vmcoreinfo(void)
>  			      (unsigned long)&_text - __START_KERNEL);
>  }
>  
> +/* arch dependent functionality related to kexec file based syscall */
> +
> +int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> +					unsigned long buf_len)

Arg alignment: it is customary to align function args that continue on a
new line with the column right after the opening parenthesis. Ditto for the
rest of the locations where this is the case.

> +{
> +	int i, ret = -ENOEXEC;
> +
> +	for (i = 0; i < nr_file_types; i++) {
> +		if (!kexec_file_type[i].probe)
> +			continue;
> +
> +		ret = kexec_file_type[i].probe(buf, buf_len);
> +		if (!ret) {
> +			image->file_handler_idx = i;
> +			return ret;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> +			unsigned long kernel_len, char *initrd,
> +			unsigned long initrd_len, char *cmdline,
> +			unsigned long cmdline_len)

Those are a *lot* of arguments. Maybe a helper struct encompassing them
all to pass around?
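
One possible shape for such a helper struct, sketched in user space. struct kexec_load_args, load_fn and total_payload() are hypothetical names, not something the patch defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bundle of the six buffer/length arguments. */
struct kexec_load_args {
	char *kernel;
	unsigned long kernel_len;
	char *initrd;
	unsigned long initrd_len;
	char *cmdline;
	unsigned long cmdline_len;
};

/* A load handler then takes a single pointer instead of six values. */
typedef unsigned long (load_fn)(const struct kexec_load_args *args);

/* Hypothetical handler: reports how many bytes it was handed. */
static unsigned long total_payload(const struct kexec_load_args *args)
{
	assert(args != NULL);
	return args->kernel_len + args->initrd_len + args->cmdline_len;
}

static load_fn *const demo_loader = total_payload;
```

Adding a field later then only touches the struct and the handlers that use it, not every prototype along the call chain.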

> +{
> +	int idx = image->file_handler_idx;
> +
> +	if (idx < 0)
> +		return ERR_PTR(-ENOEXEC);
> +
> +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> +					initrd_len, cmdline, cmdline_len);
> +}
> +
> +int arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	int idx = image->file_handler_idx;
> +
> +	/* This can be called up even before image handler has been set */
> +	if (idx < 0)
> +		return 0;

Btw, these games with the index don't seem optimal to me. Why not rename
struct kexec_file_type to kexec_file_ops, add an image->fops pointer to it,
assign the matching one to image->fops in arch_kexec_kernel_image_probe()
and then simply call the proper handler:

	if (!image->fops || !image->fops->cleanup)
		return 0;

	return image->fops->cleanup(image);

and above

	return image->fops->load(...)

and so on.

In any case, this looks cleaner to me.
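
A user-space sketch of that suggestion. The struct kimage stand-in, struct kexec_file_ops and the demo handlers are hypothetical, and only the fields needed to show the pointer-based dispatch are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct kimage;

/* Hypothetical rename of struct kexec_file_type. */
struct kexec_file_ops {
	int (*probe)(const char *buf, unsigned long len);
	int (*cleanup)(struct kimage *image);
};

/* Stand-in for struct kimage; fops stays NULL until a probe succeeds. */
struct kimage {
	const struct kexec_file_ops *fops;
	int cleaned;
};

static int demo_probe(const char *buf, unsigned long len)
{
	(void)buf;
	return len ? 0 : -ENOEXEC;
}

static int demo_cleanup(struct kimage *image)
{
	image->cleaned = 1;
	return 0;
}

static const struct kexec_file_ops demo_ops = {
	.probe = demo_probe,
	.cleanup = demo_cleanup,
};

static const struct kexec_file_ops *const file_ops[] = { &demo_ops };

/* Remember the matching ops instead of an array index. */
static int image_probe(struct kimage *image, const char *buf, unsigned long len)
{
	assert(image != NULL);
	for (size_t i = 0; i < sizeof(file_ops) / sizeof(file_ops[0]); i++) {
		if (file_ops[i]->probe && file_ops[i]->probe(buf, len) == 0) {
			image->fops = file_ops[i];
			return 0;
		}
	}
	return -ENOEXEC;
}

/* Safe to call before any handler has been selected. */
static int post_load_cleanup(struct kimage *image)
{
	if (!image->fops || !image->fops->cleanup)
		return 0;
	return image->fops->cleanup(image);
}
```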

> +
> +	if (kexec_file_type[idx].cleanup)
> +		return kexec_file_type[idx].cleanup(image);
> +	return 0;
> +}
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d0285cc..3790519 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -121,13 +121,58 @@ struct kimage {
>  #define KEXEC_TYPE_DEFAULT 0
>  #define KEXEC_TYPE_CRASH   1
>  	unsigned int preserve_context : 1;
> +	/* If set, we are using file mode kexec syscall */
> +	unsigned int file_mode:1;
>  
>  #ifdef ARCH_HAS_KIMAGE_ARCH
>  	struct kimage_arch arch;
>  #endif
> +
> +	/* Additional Fields for file based kexec syscall */

Why capitalized?

> +	void *kernel_buf;
> +	unsigned long kernel_buf_len;
> +
> +	void *initrd_buf;
> +	unsigned long initrd_buf_len;
> +
> +	char *cmdline_buf;
> +	unsigned long cmdline_buf_len;
> +
> +	/* index of file handler in array */
> +	int file_handler_idx;
> +
> +	/* Image loader handling the kernel can store a pointer here */
> +	void *image_loader_data;
>  };
>  
> +/*
> + * Keeps a track of buffer parameters as provided by caller for requesting

"Keeps track"

> + * memory placement of buffer.
> + */
> +struct kexec_buf {
> +	struct kimage *image;
> +	char *buffer;
> +	unsigned long bufsz;
> +	unsigned long memsz;
> +	unsigned long buf_align;
> +	unsigned long buf_min;
> +	unsigned long buf_max;
> +	bool top_down;		/* allocate from top of memory hole */
> +};
>  
> +typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
> +typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
> +				unsigned long kernel_len, char *initrd,
> +				unsigned long initrd_len, char *cmdline,
> +				unsigned long cmdline_len);
> +typedef int (kexec_cleanup_t)(struct kimage *image);
> +
> +struct kexec_file_type {
> +	const char *name;
> +	kexec_probe_t *probe;
> +	kexec_load_t *load;
> +	kexec_cleanup_t *cleanup;
> +};
>  
>  /* kexec interface functions */
>  extern void machine_kexec(struct kimage *image);
> @@ -138,6 +183,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
>  					struct kexec_segment __user *segments,
>  					unsigned long flags);
>  extern int kernel_kexec(void);
> +extern int kexec_add_buffer(struct kimage *image, char *buffer,
> +			unsigned long bufsz, unsigned long memsz,
> +			unsigned long buf_align, unsigned long buf_min,
> +			unsigned long buf_max, bool top_down,
> +			unsigned long *load_addr);
>  extern struct page *kimage_alloc_control_pages(struct kimage *image,
>  						unsigned int order);
>  extern void crash_kexec(struct pt_regs *);
> @@ -188,6 +238,9 @@ extern int kexec_load_disabled;
>  #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
>  #endif
>  
> +/* Listof defined/legal kexec file flags */

"List of ..."

> +#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
> +
>  #define VMCOREINFO_BYTES           (4096)
>  #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
>  #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index d6629d4..5fddb1b 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -13,6 +13,10 @@
>  #define KEXEC_PRESERVE_CONTEXT	0x00000002
>  #define KEXEC_ARCH_MASK		0xffff0000
>  
> +/* Kexec file load interface flags */
> +#define KEXEC_FILE_UNLOAD	0x00000001
> +#define KEXEC_FILE_ON_CRASH	0x00000002

Do we have those documented somewhere and what do they mean?

> +
>  /* These values match the ELF architecture values.
>   * Unless there is a good reason that should continue to be the case.
>   */
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a3044e6..1ad4d60 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -260,6 +260,221 @@ static struct kimage *do_kimage_alloc_init(void)
>  
>  static void kimage_free_page_list(struct list_head *list);
>  
> +static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
> +{
> +	struct fd f = fdget(fd);
> +	int ret = 0;
> +	struct kstat stat;
> +	loff_t pos;
> +	ssize_t bytes = 0;
> +
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = vfs_getattr(&f.file->f_path, &stat);
> +	if (ret)
> +		goto out;
> +
> +	if (stat.size > INT_MAX) {
> +		ret = -EFBIG;
> +		goto out;
> +	}
> +
> +	/* Don't hand 0 to vmalloc, it whines. */
> +	if (stat.size == 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	*buf = vmalloc(stat.size);
> +	if (!*buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pos = 0;
> +	while (pos < stat.size) {
> +		bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
> +					stat.size - pos);
> +		if (bytes < 0) {
> +			vfree(*buf);
> +			ret = bytes;
> +			goto out;
> +		}
> +
> +		if (bytes == 0)
> +			break;
> +		pos += bytes;
> +	}
> +
> +	*buf_len = pos;
> +out:
> +	fdput(f);
> +	return ret;
> +}
> +
> +/* Architectures can provide this probe function */
> +int __attribute__ ((weak))

We have __weak for that.
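
For reference, the kernel's compiler headers define __weak as __attribute__((weak)). A single-file sketch of the pattern; the signature is simplified (no struct kimage argument) and the macro is redefined locally for a GCC/Clang user-space build.

```c
#include <assert.h>
#include <errno.h>

/* Same expansion the kernel uses; GCC/Clang extension. */
#define __weak __attribute__((weak))

/*
 * Weak default: used unless another object file linked into the same
 * program provides a strong definition with the same name.
 */
int __weak arch_kexec_kernel_image_probe(void *buf, unsigned long buf_len)
{
	(void)buf;
	(void)buf_len;
	return -ENOEXEC;
}
```

An architecture overrides the default simply by providing a non-weak function with the same name, as machine_kexec_64.c does in this patch.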

> +arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> +				unsigned long buf_len)
> +{
> +	return -ENOEXEC;
> +}
> +
> +void *__attribute__ ((weak))

ditto.

> +arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len, char *initrd,
> +		unsigned long initrd_len, char *cmdline,
> +		unsigned long cmdline_len)
> +{
> +	return ERR_PTR(-ENOEXEC);
> +}
> +
> +void __attribute__ ((weak))

ditto.

> +arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	return;
> +}
> +
> +/*
> + * Free up tempory buffers allocated which are not needed after image has
> + * been loaded.
> + *
> + * Free up memory used by kernel, initrd, and comand line. This is temporary
> + * memory allocation which is not needed any more after these buffers have
> + * been loaded into separate segments and have been copied elsewhere
> + */

Why do we need that comment? It is obvious what's going on.

> +static void kimage_file_post_load_cleanup(struct kimage *image)
> +{
> +	vfree(image->kernel_buf);
> +	image->kernel_buf = NULL;
> +
> +	vfree(image->initrd_buf);
> +	image->initrd_buf = NULL;
> +
> +	vfree(image->cmdline_buf);
> +	image->cmdline_buf = NULL;
> +
> +	/* See if architcture has anything to cleanup post load */

s/architcture/architecture/

> +	arch_kimage_file_post_load_cleanup(image);
> +}
> +
> +/*
> + * In file mode list of segments is prepared by kernel. Copy relevant
> + * data from user space, do error checking, prepare segment list
> + */
> +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)

arg alignment

> +{
> +	int ret = 0;
> +	void *ldata;
> +
> +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> +					&image->kernel_buf_len);
> +	if (ret)
> +		return ret;
> +
> +	/* Call arch image probe handlers */
> +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> +						image->kernel_buf_len);
> +
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> +					&image->initrd_buf_len);
> +	if (ret)
> +		goto out;
> +
> +	image->cmdline_buf = vzalloc(cmdline_len);
> +	if (!image->cmdline_buf)

		ret = -ENOMEM;

> +		goto out;
> +
> +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> +	if (ret) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	image->cmdline_buf_len = cmdline_len;
> +
> +	/* command line should be a string with last byte null */
> +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Call arch image load handlers */
> +	ldata = arch_kexec_kernel_image_load(image,
> +			image->kernel_buf, image->kernel_buf_len,
> +			image->initrd_buf, image->initrd_buf_len,
> +			image->cmdline_buf, image->cmdline_buf_len);
> +
> +	if (IS_ERR(ldata)) {
> +		ret = PTR_ERR(ldata);
> +		goto out;
> +	}
> +
> +	image->image_loader_data = ldata;
> +out:
> +	/* In case of error, free up all allocated memory in this function */
> +	if (ret)
> +		kimage_file_post_load_cleanup(image);
> +	return ret;
> +}
> +
> +static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)

arg alignment

> +{
> +	int result;
> +	struct kimage *image;
> +
> +	/* Allocate and initialize a controlling structure */

No need for that comment IMO.

> +	image = do_kimage_alloc_init();
> +	if (!image)
> +		return -ENOMEM;
> +
> +	image->file_mode = 1;
> +	image->file_handler_idx = -1;
> +
> +	result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
> +			cmdline_ptr, cmdline_len);

arg alignment... yeah, you know what I mean, I won't point those out
anymore.

> +	if (result)
> +		goto out_free_image;
> +
> +	result = sanity_check_segment_list(image);
> +	if (result)
> +		goto out_free_post_load_bufs;
> +
> +	result = -ENOMEM;
> +	image->control_code_page = kimage_alloc_control_pages(image,
> +					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> +	if (!image->control_code_page) {
> +		pr_err("Could not allocate control_code_buffer\n");

Might wanna define pr_fmt when using the pr_* things for the first time
in this file.
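
pr_fmt() has to be defined before the printk headers are included (typically at the very top of the file) so that every pr_*() call picks up the prefix. Below is a user-space imitation: KBUILD_MODNAME normally comes from the build system and is hard-coded here, and format_pr() is a hypothetical snprintf-based stand-in for pr_err() so the result can be checked. Since pr_err() already supplies KERN_ERR, it should not be passed again as in the swap-buffer message below.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hard-coded here; the kernel build passes KBUILD_MODNAME automatically. */
#define KBUILD_MODNAME "kexec"
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Hypothetical stand-in for pr_err(): formats into buf instead of printk. */
#define format_pr(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```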

> +		goto out_free_post_load_bufs;
> +	}
> +
> +	image->swap_page = kimage_alloc_control_pages(image, 0);
> +	if (!image->swap_page) {
> +		pr_err(KERN_ERR "Could not allocate swap buffer\n");
> +		goto out_free_control_pages;
> +	}
> +
> +	*rimage = image;
> +	return 0;
> +
> +out_free_control_pages:
> +	kimage_free_page_list(&image->control_pages);
> +out_free_post_load_bufs:
> +	kimage_file_post_load_cleanup(image);
> +	kfree(image->image_loader_data);
> +out_free_image:
> +	kfree(image);
> +	return result;
> +}
> +
>  static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
>  				unsigned long nr_segments,
>  				struct kexec_segment __user *segments)
> @@ -683,6 +898,16 @@ static void kimage_free(struct kimage *image)
>  
>  	/* Free the kexec control pages... */
>  	kimage_free_page_list(&image->control_pages);
> +
> +	kfree(image->image_loader_data);
> +
> +	/*
> +	 * Free up any temporary buffers allocated. This might hit if
> +	 * error occurred much later after buffer allocation.
> +	 */
> +	if (image->file_mode)
> +		kimage_file_post_load_cleanup(image);
> +
>  	kfree(image);
>  }
>  
> @@ -812,10 +1037,14 @@ static int kimage_load_normal_segment(struct kimage *image,
>  	unsigned long maddr;
>  	size_t ubytes, mbytes;
>  	int result;
> -	unsigned char __user *buf;
> +	unsigned char __user *buf = NULL;
> +	unsigned char *kbuf = NULL;
>  
>  	result = 0;
> -	buf = segment->buf;
> +	if (image->file_mode)
> +		kbuf = segment->kbuf;
> +	else
> +		buf = segment->buf;
>  	ubytes = segment->bufsz;
>  	mbytes = segment->memsz;
>  	maddr = segment->mem;
> @@ -847,7 +1076,11 @@ static int kimage_load_normal_segment(struct kimage *image,
>  				PAGE_SIZE - (maddr & ~PAGE_MASK));
>  		uchunk = min(ubytes, mchunk);
>  
> -		result = copy_from_user(ptr, buf, uchunk);
> +		/* For file based kexec, source pages are in kernel memory */
> +		if (image->file_mode)
> +			memcpy(ptr, kbuf, uchunk);
> +		else
> +			result = copy_from_user(ptr, buf, uchunk);
>  		kunmap(page);
>  		if (result) {
>  			result = -EFAULT;
> @@ -855,7 +1088,10 @@ static int kimage_load_normal_segment(struct kimage *image,
>  		}
>  		ubytes -= uchunk;
>  		maddr  += mchunk;
> -		buf    += mchunk;
> +		if (image->file_mode)
> +			kbuf += mchunk;
> +		else
> +			buf += mchunk;
>  		mbytes -= mchunk;
>  	}
>  out:
> @@ -1102,7 +1338,64 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  		const char __user *, cmdline_ptr, unsigned long,
>  		cmdline_len, unsigned long, flags)
>  {
> -	return -ENOSYS;
> +	int ret = 0, i;
> +	struct kimage **dest_image, *image;
> +
> +	/* We only trust the superuser with rebooting the system. */
> +	if (!capable(CAP_SYS_BOOT))
> +		return -EPERM;
> +
> +	/* Make sure we have a legal set of flags */
> +	if (flags != (flags & KEXEC_FILE_FLAGS))
> +		return -EINVAL;

This test looks strange: according to it, kexec_file_load has to always
be called with both KEXEC_FILE_UNLOAD and KEXEC_FILE_ON_CRASH set. Don't
you want to check against an allowed mask or so like KEXEC_FLAGS is
handled in kexec_load?
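
Evaluating the quoted condition in isolation suggests it only rejects bits outside KEXEC_FILE_FLAGS: 0, either flag alone, or both together all pass. A user-space check, with the flag values copied from the uapi hunk of this patch and hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>

#define KEXEC_FILE_UNLOAD	0x00000001
#define KEXEC_FILE_ON_CRASH	0x00000002
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/* The patch's test, inverted: true when no unknown bits are set. */
static bool kexec_file_flags_ok(unsigned long flags)
{
	return flags == (flags & KEXEC_FILE_FLAGS);
}

/* Equivalent, arguably more common, spelling of the same mask check. */
static bool kexec_file_flags_ok_alt(unsigned long flags)
{
	return (flags & ~(unsigned long)KEXEC_FILE_FLAGS) == 0;
}
```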

> +
> +	image = NULL;
> +
> +	if (!mutex_trylock(&kexec_mutex))
> +		return -EBUSY;
> +
> +	dest_image = &kexec_image;
> +	if (flags & KEXEC_FILE_ON_CRASH)
> +		dest_image = &kexec_crash_image;
> +
> +	if (flags & KEXEC_FILE_UNLOAD)
> +		goto exchange;
> +
> +	ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
> +				cmdline_ptr, cmdline_len);
> +	if (ret)
> +		goto out;
> +
> +	ret = machine_kexec_prepare(image);
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < image->nr_segments; i++) {
> +		struct kexec_segment *ksegment;
> +
> +		ksegment = &image->segment[i];
> +		pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
> +			 i, ksegment->buf, ksegment->bufsz, ksegment->mem,
> +			 ksegment->memsz);
> +
> +		ret = kimage_load_segment(image, &image->segment[i]);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	kimage_terminate(image);
> +
> +	/*
> +	 * Free up any temporary buffers allocated which are not needed
> +	 * after image has been loaded
> +	 */
> +	kimage_file_post_load_cleanup(image);
> +exchange:
> +	image = xchg(dest_image, image);
> +out:
> +	mutex_unlock(&kexec_mutex);
> +	kimage_free(image);
> +	return ret;
>  }
>  
>  void crash_kexec(struct pt_regs *regs)
> @@ -1659,6 +1952,186 @@ static int __init crash_save_vmcoreinfo_init(void)
>  
>  subsys_initcall(crash_save_vmcoreinfo_init);
>  
> +static int __kexec_add_segment(struct kimage *image, char *buf,
> +		unsigned long bufsz, unsigned long mem, unsigned long memsz)
> +{
> +	struct kexec_segment *ksegment;
> +
> +	ksegment = &image->segment[image->nr_segments];
> +	ksegment->kbuf = buf;
> +	ksegment->bufsz = bufsz;
> +	ksegment->mem = mem;
> +	ksegment->memsz = memsz;
> +	image->nr_segments++;
> +
> +	return 0;
> +}
> +
> +static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
> +					struct kexec_buf *kbuf)
> +{
> +	struct kimage *image = kbuf->image;
> +	unsigned long temp_start, temp_end;
> +
> +	temp_end = min(end, kbuf->buf_max);
> +	temp_start = temp_end - kbuf->memsz;
> +
> +	do {
> +		/* align down start */
> +		temp_start = temp_start & (~(kbuf->buf_align - 1));
> +
> +		if (temp_start < start || temp_start < kbuf->buf_min)
> +			return 0;
> +
> +		temp_end = temp_start + kbuf->memsz - 1;
> +
> +		/*
> +		 * Make sure this does not conflict with any of existing
> +		 * segments
> +		 */
> +		if (kimage_is_destination_range(image, temp_start, temp_end)) {
> +			temp_start = temp_start - PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* We found a suitable memory range */
> +		break;
> +	} while (1);
> +
> +	/* If we are here, we found a suitable memory range */
> +	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
> +				kbuf->memsz);
> +
> +	/* Stop navigating through remaining System RAM ranges */
> +	return 1;
> +}
> +
> +static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
> +					struct kexec_buf *kbuf)
> +{
> +	struct kimage *image = kbuf->image;
> +	unsigned long temp_start, temp_end;
> +
> +	temp_start = max(start, kbuf->buf_min);
> +
> +	do {
> +		temp_start = ALIGN(temp_start, kbuf->buf_align);
> +		temp_end = temp_start + kbuf->memsz - 1;
> +
> +		if (temp_end > end || temp_end > kbuf->buf_max)
> +			return 0;
> +		/*
> +		 * Make sure this does not conflict with any of existing
> +		 * segments
> +		 */
> +		if (kimage_is_destination_range(image, temp_start, temp_end)) {
> +			temp_start = temp_start + PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* We found a suitable memory range */
> +		break;
> +	} while (1);
> +
> +	/* If we are here, we found a suitable memory range */
> +	__kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
> +				kbuf->memsz);
> +
> +	/* Stop navigating through remaining System RAM ranges */
> +	return 1;
> +}
> +
> +static int walk_ram_range_callback(u64 start, u64 end, void *arg)
> +{
> +	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
> +	unsigned long sz = end - start + 1;
> +
> +	/* Returning 0 will take to next memory range */
> +	if (sz < kbuf->memsz)
> +		return 0;
> +
> +	if (end < kbuf->buf_min || start > kbuf->buf_max)
> +		return 0;
> +
> +	/*
> +	 * Allocate memory top down with-in ram range. Otherwise bottom up
> +	 * allocation.
> +	 */
> +	if (kbuf->top_down)
> +		return locate_mem_hole_top_down(start, end, kbuf);
> +	else
> +		return locate_mem_hole_bottom_up(start, end, kbuf);
> +}
> +
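
The top-down search in locate_mem_hole_top_down() can be tried out in isolation with a user-space sketch like the one below. It is a variant rather than a copy: the hypothetical find_hole_top_down() takes the window already clamped to [buf_min, buf_max], starts from hi - memsz + 1 so the hole may end exactly at hi, steps down by the alignment, and stops before the start address could wrap below zero. struct seg is a hypothetical stand-in for struct kexec_segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an already placed kexec segment. */
struct seg {
	unsigned long mem;
	unsigned long memsz;
};

/* True if [start, end] (inclusive) intersects any placed segment. */
static bool overlaps(const struct seg *segs, size_t n,
		     unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long s = segs[i].mem;
		unsigned long e = segs[i].mem + segs[i].memsz - 1;

		if (start <= e && s <= end)
			return true;
	}
	return false;
}

/*
 * Find the highest align-aligned start such that [start, start + memsz - 1]
 * lies inside [lo, hi] and overlaps no placed segment.
 */
static bool find_hole_top_down(unsigned long lo, unsigned long hi,
			       unsigned long memsz, unsigned long align,
			       const struct seg *segs, size_t n,
			       unsigned long *out)
{
	unsigned long start;

	/* memsz must be nonzero, align a nonzero power of two */
	assert(memsz && align && !(align & (align - 1)));

	if (hi < lo || hi - lo < memsz - 1)
		return false;

	start = (hi - memsz + 1) & ~(align - 1);
	while (start >= lo) {
		if (!overlaps(segs, n, start, start + memsz - 1)) {
			*out = start;
			return true;
		}
		if (start < align)	/* next step would wrap below zero */
			break;
		start -= align;
	}
	return false;
}
```
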
> +/*
> + * Helper functions for placing a buffer in a kexec segment. This assumes

s/functions/function/

> + * that kexec_mutex is held.
> + */
> +int kexec_add_buffer(struct kimage *image, char *buffer,
> +		unsigned long bufsz, unsigned long memsz,
> +		unsigned long buf_align, unsigned long buf_min,
> +		unsigned long buf_max, bool top_down, unsigned long *load_addr)
> +{
> +
> +	unsigned long nr_segments = image->nr_segments, new_nr_segments;
> +	struct kexec_segment *ksegment;
> +	struct kexec_buf buf, *kbuf;
> +
> +	/* Currently adding segment this way is allowed only in file mode */
> +	if (!image->file_mode)
> +		return -EINVAL;
> +
> +	if (nr_segments >= KEXEC_SEGMENT_MAX)
> +		return -EINVAL;
> +
> +	/*
> +	 * Make sure we are not trying to add buffer after allocating
> +	 * control pages. All segments need to be placed first before
> +	 * any control pages are allocated. As control page allocation
> +	 * logic goes through list of segments to make sure there are
> +	 * no destination overlaps.
> +	 */
> +	if (!list_empty(&image->control_pages)) {
> +		WARN_ON(1);
> +		return -EINVAL;
> +	}
> +
> +	memset(&buf, 0, sizeof(struct kexec_buf));
> +	kbuf = &buf;
> +	kbuf->image = image;
> +	kbuf->buffer = buffer;
> +	kbuf->bufsz = bufsz;
> +
> +	/* Align memsz to next page boundary */

No need for that comment...

> +	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> +
> +	/* Align to atleast page size boundary */

ditto.

> +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> +	kbuf->buf_min = buf_min;
> +	kbuf->buf_max = buf_max;
> +	kbuf->top_down = top_down;
> +
> +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> +
> +	/*
> +	 * If range could be found successfully, it would have incremented
> +	 * the nr_segments value.
> +	 */
> +	new_nr_segments = image->nr_segments;
> +
> +	/* A suitable memory range could not be found for buffer */
> +	if (new_nr_segments == nr_segments)
> +		return -EADDRNOTAVAIL;

Right, why don't you check walk_system_ram_res()'s return value? If it is
nonzero (walk_ram_range_callback() returns 1 on "success"), you can drop this
roundabout way of checking whether finding a new range succeeded.

> +
> +	/* Found a suitable memory range */
> +

superfluous newline.

> +	ksegment = &image->segment[new_nr_segments - 1];
> +	*load_addr = ksegment->mem;
> +	return 0;
> +}
> +
> +
>  /*
>   * Move into place and start executing a preloaded standalone
>   * executable.  If nothing was preloaded return an error.
> -- 
> 1.9.0
> 
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 04/13] resource: Provide new functions to walk through resources
  2014-06-04 10:24     ` Borislav Petkov
@ 2014-06-05 13:58       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 13:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, Yinghai Lu

On Wed, Jun 04, 2014 at 12:24:20PM +0200, Borislav Petkov wrote:
> On Tue, Jun 03, 2014 at 09:06:53AM -0400, Vivek Goyal wrote:
> > @@ -322,7 +327,71 @@ int release_resource(struct resource *old)
> >  
> >  EXPORT_SYMBOL(release_resource);
> >  
> > -#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
> > +/*
> > + * Finds the lowest iomem reosurce exists with-in [res->start.res->end)
> > + * the caller must specify res->start, res->end, res->flags and "name".
> > + * If found, returns 0, res is overwritten, if not found, returns -1.
> > + * This walks through whole tree and not just first level children.
> > + */
> > +static int find_next_iomem_res(struct resource *res, char *name)
> > +{
> > +	resource_size_t start, end;
> > +	struct resource *p;
> > +
> > +	BUG_ON(!res);
> > +
> > +	start = res->start;
> > +	end = res->end;
> > +	BUG_ON(start >= end);
> > +
> > +	read_lock(&resource_lock);
> > +	p = &iomem_resource;
> > +	while ((p = next_resource(p))) {
> 
> Just a thought - this function differs from find_next_system_ram() only
> in the traversal mode through resources. I wonder if next_resource()
> could be given a flag, say TRAVERSE_SIBLINGS_ONLY or so and be called
> from both, once with the flag set and once without and thus save us the
> code duplication.

Yep, that makes sense. I will change it and start passing a flag.
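
For reference, a user-space sketch of what a next_resource() with such a flag might look like. struct res and next_res() are hypothetical stand-ins for struct resource and next_resource(), and a plain bool is used in place of a named flag. Without the flag it is the usual depth-first walk: child first, then sibling, then back up through the parents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for struct resource: only the tree links are modelled. */
struct res {
	const char *name;
	struct res *parent, *sibling, *child;
};

/*
 * Successor of p. With sibling_only set, stay on p's level (first-level
 * children walk); otherwise do a full depth-first walk of the tree.
 */
static struct res *next_res(struct res *p, bool sibling_only)
{
	assert(p != NULL);
	if (sibling_only)
		return p->sibling;
	if (p->child)
		return p->child;
	while (!p->sibling && p->parent)
		p = p->parent;
	return p->sibling;
}
```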

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
@ 2014-06-05 14:04           ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 14:04 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: WANG Chao, Linux Kernel, kexec, Eric W. Biederman,
	H. Peter Anvin, mjg59, Greg Kroah-Hartman, Borislav Petkov,
	Jiri Kosina, dyoung, bhe, Andrew Morton, Linux API

On Wed, Jun 04, 2014 at 09:39:10PM +0200, Michael Kerrisk wrote:
> Vivek,
> 
> As per Documentation/SubmitChecklist, please CC linux-api@ on patches
> that change the ABI/API. See
> https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Hi Michael,

Sorry, I did not notice that. I will CC linux-api@ on the patches which
introduce the new system call in the next version.

> 
> Also, is there some draft man page for this new system call?

No, there is none yet. In fact, I don't see a man page for the old kexec
system call, kexec_load(), either.

Do you want me to write a man page for this new syscall?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 03/13] kexec: Move segment verification code in a separate function
  2014-06-04 20:30         ` Borislav Petkov
@ 2014-06-05 14:05           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 14:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Wed, Jun 04, 2014 at 10:30:41PM +0200, Borislav Petkov wrote:
> On Wed, Jun 04, 2014 at 02:47:43PM -0400, Vivek Goyal wrote:
> > Hmm..., Interesting. I never noticed it. So google search seems to say
> > that unuseable is also not wrong.
> 
> Nope, it seems more like "unuseable" is simply a very common misspelling
> which has managed to spread out uncontrollably, even in the kernel :).
> Both Oxford and Merriam-Webster know "unusable" as the only correct
> spelling.

OK, given that you feel so strongly about it, I will change it in the next
version of the patches. :-)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 03/13] kexec: Move segment verification code in a separate function
  2014-06-05 14:05           ` Vivek Goyal
@ 2014-06-05 14:07             ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-05 14:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Thu, Jun 05, 2014 at 10:05:06AM -0400, Vivek Goyal wrote:
> Ok, given that you feel so strongly about it, I will change it next
> version of patches. :-).

Hehe, thanks! :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-05  8:31     ` Dave Young
@ 2014-06-05 15:01       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 15:01 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On Thu, Jun 05, 2014 at 04:31:34PM +0800, Dave Young wrote:

[..]
> > +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
> > +			info.command_line_len, info.kexec_flags);
> 
> Vivek,
> 
> I tried your patch on my uefi test machine, but kexec load fails like below:
> 
> [root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
> Could not find a free area of memory of 0xa000 bytes ...

Hi Dave,

I think this message is coming from the old loading path in kexec-tools;
somehow the new path did not even kick in. When I tried the above, I got
-EBADF because I did not pass an initrd. Can you run kexec under gdb or
strace and check whether you are reaching the syscall?

> 
> Another issue is that the syscall should allow load kernel only without initrd

Agreed. Currently my code does not handle that. I am thinking about how
to make passing the initrd fd optional.

> 
> and
> cmdline since kernel can mount root and embed cmdline in itself.

Passing a command line is already optional. I tried loading without one
and kexec loaded successfully.
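
A minimal sketch of what an optional command line can look like on the
kernel side, assuming (this is not stated in the thread) that
cmdline_len == 0 means "no command line" and that a non-empty buffer,
once copied in, must end with its NUL terminator. check_cmdline() is a
hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical check of the (cmdline, cmdline_len) pair after it has been
 * copied in from userspace. Zero length means "no command line"; otherwise
 * the last byte must be the terminating NUL.
 */
static int check_cmdline(const char *cmdline, unsigned long cmdline_len)
{
	if (cmdline_len == 0)
		return 0;		/* omitted: e.g. rely on a built-in cmdline */
	if (cmdline == NULL)
		return -EFAULT;		/* a length was given but no buffer */
	if (cmdline[cmdline_len - 1] != '\0')
		return -EINVAL;		/* not NUL-terminated within cmdline_len */
	return 0;
}
```

Under this convention, userspace can pass NULL and 0 to boot a kernel that
embeds its own command line, which is the case Dave describes.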

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-05  9:56     ` WANG Chao
@ 2014-06-05 15:16       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 15:16 UTC (permalink / raw)
  To: WANG Chao
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On Thu, Jun 05, 2014 at 05:56:03PM +0800, WANG Chao wrote:

[..]
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index c435c5f..a3044e6 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
> >  }
> >  #endif
> >  
> > +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > +		const char __user *, cmdline_ptr, unsigned long,
> > +		cmdline_len, unsigned long, flags)
> 
> initrd is optional for system boot.
> 
> How about using int *kernel_fd and int *initrd_fd as the argument? Then
> if I don't need initrd, in userspace I can do this:

Hi Chao,

I am not too keen on converting plain int fd arguments into pointers.

Since fd is an int and all valid fd values are non-negative, how about
using -1 to denote that no initrd is being loaded?

This does create one small anomaly: we will return -EBADF for all
negative values except -1, which is special-cased.
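
A hedged userspace sketch of the -1 convention, using a hypothetical
check_initrd_fd() in place of the real kernel-side argument handling. It
shows both the intended behaviour and the anomaly mentioned above: -1
means "no initrd", every other negative value still fails with -EBADF,
and 0 is accepted as a valid descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel: -1 means "no initrd is being loaded". */
#define NO_INITRD_FD (-1)

/*
 * Returns 0 and sets *have_initrd on success, or -EBADF for any other
 * negative fd. A kernel implementation would then look up a
 * non-negative fd; that step is omitted here.
 */
static int check_initrd_fd(int initrd_fd, bool *have_initrd)
{
	assert(have_initrd != NULL);

	if (initrd_fd == NO_INITRD_FD) {
		*have_initrd = false;
		return 0;
	}
	if (initrd_fd < 0)
		return -EBADF;	/* the anomaly: -2, -3, ... are still errors */

	*have_initrd = true;	/* 0 is a valid fd too */
	return 0;
}
```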

> 
> kexec_file_load(&kernel_fd, NULL, ...)
> 
> And even you can remove KEXEC_FILE_UNLOAD flag, because you could tell
> that one wants to unload if the following is invoked:
> 
> kexec_file_load(NULL, NULL, ...)

I would prefer not to assign special meanings to NULL parameters, and
instead use an explicit flag to unload the kernel.
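
A minimal sketch of the explicit-flag approach, assuming a hypothetical
decode_kexec_file_request() helper and an illustrative value for
KEXEC_FILE_UNLOAD (the patch's actual value may differ). Because the
unload decision comes only from flags, the int fd arguments never need
pointer or NULL semantics.

```c
#include <assert.h>
#include <errno.h>

#define KEXEC_FILE_UNLOAD 0x00000001UL	/* assumed value, for illustration */

/*
 * Returns 1 for "unload", 0 for "load with a plausible kernel fd", or a
 * negative errno. On unload the fd arguments are never examined, so
 * userspace may pass anything (for example -1) for them.
 */
static int decode_kexec_file_request(int kernel_fd, unsigned long flags)
{
	if (flags & KEXEC_FILE_UNLOAD)
		return 1;
	if (kernel_fd < 0)
		return -EBADF;
	return 0;
}
```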

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-05 15:16       ` Vivek Goyal
@ 2014-06-05 15:22         ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 15:22 UTC (permalink / raw)
  To: WANG Chao
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On Thu, Jun 05, 2014 at 11:16:39AM -0400, Vivek Goyal wrote:
> On Thu, Jun 05, 2014 at 05:56:03PM +0800, WANG Chao wrote:
> 
> [..]
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index c435c5f..a3044e6 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > > @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
> > >  }
> > >  #endif
> > >  
> > > +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > +		const char __user *, cmdline_ptr, unsigned long,
> > > +		cmdline_len, unsigned long, flags)
> > 
> > initrd is optional for system boot.
> > 
> > How about using int *kernel_fd and int *initrd_fd as the argument? Then
> > if I don't need initrd, in userspace I can do this:
> 
> Hi Chao,
> 
> I really am not too keen converting plain int fd arguments into pointers.
> 
> Given the fact that fd is int, that means all valid values are greater
> than 0. How about using -1 to denote that initrd is not being loaded?
> 
> This does create one little anomaly and that is for all -ve values we
> will return -EBADF except -1 which we special cased.

Or we could do the following:

- Define an extra flag, say KEXEC_FILE_NO_INITRAMFS, which the user sets
  when a valid initrd fd is not being passed. If the kernel sees that flag,
  it will not look at the value passed in the initrd_fd argument at all.

I think I like this better.
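
For comparison with the -1 sentinel, a sketch of the flag-based variant.
The flag name comes from the proposal above; its bit value and the
want_initrd() helper are assumptions for illustration. With the flag set,
initrd_fd is never examined, so no fd value needs special-casing and all
negative fds can be rejected uniformly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define KEXEC_FILE_NO_INITRAMFS 0x00000004UL	/* assumed bit value */

/*
 * Decide whether initrd_fd should be used. When the flag is set the
 * value of initrd_fd is ignored entirely; otherwise any negative fd
 * is an error.
 */
static int want_initrd(int initrd_fd, unsigned long flags, bool *use_initrd)
{
	assert(use_initrd);

	if (flags & KEXEC_FILE_NO_INITRAMFS) {
		*use_initrd = false;
		return 0;
	}
	if (initrd_fd < 0)
		return -EBADF;

	*use_initrd = true;
	return 0;
}
```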

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-05 20:05     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-05 20:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:58AM -0400, Vivek Goyal wrote:
> Create a stand alone relocatable object purgatory which runs between two
> kernels. This name, concept and some code has been taken from kexec-tools.
> Idea is that this code runs after a crash and it runs in minimal environment.
> So keep it separate from rest of the kernel and in long term we will have
> to practically do no maintenance of this code.
> 
> This code also has the logic to do verify sha256 hashes of various
> segments which have been loaded into memory. So first we verify that
> the kernel we are jumping to is fine and has not been corrupted and
> make progress only if checksums are verified.
> 
> This code also takes care of copying some memory contents to backup region.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/Kbuild                   |   1 +
>  arch/x86/Makefile                 |   6 +++
>  arch/x86/purgatory/Makefile       |  35 +++++++++++++
>  arch/x86/purgatory/entry64.S      | 101 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/purgatory/purgatory.c    |  71 +++++++++++++++++++++++++++
>  arch/x86/purgatory/setup-x86_32.S |  17 +++++++
>  arch/x86/purgatory/setup-x86_64.S |  58 ++++++++++++++++++++++
>  arch/x86/purgatory/stack.S        |  19 +++++++
>  arch/x86/purgatory/string.c       |  13 +++++
>  9 files changed, 321 insertions(+)
>  create mode 100644 arch/x86/purgatory/Makefile
>  create mode 100644 arch/x86/purgatory/entry64.S
>  create mode 100644 arch/x86/purgatory/purgatory.c
>  create mode 100644 arch/x86/purgatory/setup-x86_32.S
>  create mode 100644 arch/x86/purgatory/setup-x86_64.S
>  create mode 100644 arch/x86/purgatory/stack.S
>  create mode 100644 arch/x86/purgatory/string.c
> 
> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> index e5287d8..faaeee7 100644
> --- a/arch/x86/Kbuild
> +++ b/arch/x86/Kbuild
> @@ -16,3 +16,4 @@ obj-$(CONFIG_IA32_EMULATION) += ia32/
>  
>  obj-y += platform/
>  obj-y += net/
> +obj-$(CONFIG_KEXEC) += purgatory/
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 33f71b0..0b25c6c 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -186,6 +186,11 @@ archscripts: scripts_basic
>  archheaders:
>  	$(Q)$(MAKE) $(build)=arch/x86/syscalls all
>  
> +archprepare:
> +ifeq ($(CONFIG_KEXEC),y)
> +	$(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
> +endif
> +
>  ###
>  # Kernel objects
>  
> @@ -249,6 +254,7 @@ archclean:
>  	$(Q)rm -rf $(objtree)/arch/x86_64
>  	$(Q)$(MAKE) $(clean)=$(boot)
>  	$(Q)$(MAKE) $(clean)=arch/x86/tools

ifeq ($(CONFIG_KEXEC),y)
	$(Q)$(MAKE) $(clean)=arch/x86/purgatory
endif

>  
>  PHONY += kvmconfig
>  kvmconfig:
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> new file mode 100644
> index 0000000..8dbf8f5
> --- /dev/null
> +++ b/arch/x86/purgatory/Makefile
> @@ -0,0 +1,35 @@
> +ifeq ($(CONFIG_X86_64),y)
> +	purgatory-y := purgatory.o entry64.o stack.o setup-x86_64.o sha256.o string.o
> +else
> +	purgatory-y := purgatory.o stack.o sha256.o setup-x86_32.o
> +endif
> +
> +targets += $(purgatory-y)
> +PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
> +
> +LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z nodefaultlib
> +targets += purgatory.ro
> +
> +# Default KBUILD_CFLAGS can have -pg option set when FTRACE is enabled. That
> +# in turn leaves some undefined symbols like __fentry__ in purgatory and not
> +# sure how to relocate those. Like kexec-tools, custom flags.
> +
> +ifeq ($(CONFIG_X86_64),y)
> +KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -mcmodel=large -Os -fno-builtin -ffreestanding -c -MD
> +else
> +KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -Os -fno-builtin -ffreestanding -c -MD -m32
> +endif

Those variable assignments have a lot of duplication; let's simplify
(diff on top):


Index: b/arch/x86/purgatory/Makefile
===================================================================
--- a/arch/x86/purgatory/Makefile	2014-06-05 21:43:31.957252700 +0200
+++ b/arch/x86/purgatory/Makefile	2014-06-05 21:42:12.743256165 +0200
@@ -1,7 +1,7 @@
+purgatory-y := purgatory.o stack.o setup-x86_$(BITS).o sha256.o
+
 ifeq ($(CONFIG_X86_64),y)
-	purgatory-y := purgatory.o entry64.o stack.o setup-x86_64.o sha256.o string.o
-else
-	purgatory-y := purgatory.o stack.o sha256.o setup-x86_32.o
+	purgatory-y += entry64.o string.o
 endif
 
 targets += $(purgatory-y)
@@ -14,10 +14,12 @@ targets += purgatory.ro
 # in turn leaves some undefined symbols like __fentry__ in purgatory and not
 # sure how to relocate those. Like kexec-tools, custom flags.
 
+KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -fno-builtin -ffreestanding -c -MD -Os
+
 ifeq ($(CONFIG_X86_64),y)
-KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -mcmodel=large -Os -fno-builtin -ffreestanding -c -MD
+KBUILD_CFLAGS	+= -mcmodel=large
 else
-KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -Os -fno-builtin -ffreestanding -c -MD -m32
+KBUILD_CFLAGS	+= -m32
 endif
 
 $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE

> +
> +$(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
> +		$(call if_changed,ld)
> +
> +targets += kexec-purgatory.c
> +
> +quiet_cmd_bin2c = BIN2C   $@
> +      cmd_bin2c = cat $(obj)/purgatory.ro | $(srctree)/scripts/basic/bin2c kexec_purgatory > $(obj)/kexec-purgatory.c
> +
> +$(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
> +	$(call if_changed,bin2c)
> +
> +
> +obj-$(CONFIG_KEXEC)	+= kexec-purgatory.o
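
For readers unfamiliar with the bin2c step in this Makefile: the idea is
to turn purgatory.ro into a C array named kexec_purgatory so the blob gets
linked into the kernel. The following is a hedged re-implementation of
that idea, not scripts/basic/bin2c itself; bin_to_c_escapes() and
print_c_array() are hypothetical names, and the real tool's exact output
layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Encode len bytes as the body of a C string literal using \xNN escapes.
 * Every byte is escaped, so no following hex digit can extend an escape.
 * out must have room for 4 * len + 1 characters.
 */
static void bin_to_c_escapes(const unsigned char *in, size_t len, char *out)
{
	size_t i;

	assert(in != NULL || len == 0);
	for (i = 0; i < len; i++)
		sprintf(out + 4 * i, "\\x%02x", (unsigned int)in[i]);
	out[4 * len] = '\0';
}

/* Print "const char name[] = ..." with 16 escaped bytes per line. */
static void print_c_array(FILE *fp, const char *name,
			  const unsigned char *in, size_t len)
{
	char line[4 * 16 + 1];
	size_t off;

	fprintf(fp, "const char %s[] =\n", name);
	if (len == 0)
		fprintf(fp, "\t\"\"\n");	/* keep the initializer valid */
	for (off = 0; off < len; off += 16) {
		size_t n = (len - off < 16) ? len - off : 16;

		bin_to_c_escapes(in + off, n, line);
		fprintf(fp, "\t\"%s\"\n", line);	/* adjacent literals concatenate */
	}
	fprintf(fp, "\t;\n");
}
```

Note that a string-literal initializer also appends a trailing NUL, so the
array is one byte longer than the blob; a loader that needs the exact size
has to get it some other way.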
> diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
> new file mode 100644
> index 0000000..219b50b
> --- /dev/null
> +++ b/arch/x86/purgatory/entry64.S
> @@ -0,0 +1,101 @@
> +/*
> + * Copyright (C) 2003,2004  Eric Biederman (ebiederm@xmission.com)
> + * Copyright (C) 2014  Red Hat Inc.
> +
> + * Author(s): Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This code has been taken from kexec-tools.
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +	.text
> +	.balign 16
> +	.code64
> +	.globl entry64, entry64_regs
> +
> +
> +entry64:
> +	/* Setup a gdt that should be preserved */
> +	lgdt gdt(%rip)
> +
> +	/* load the data segments */
> +	movl    $0x18, %eax     /* data segment */
> +	movl    %eax, %ds
> +	movl    %eax, %es
> +	movl    %eax, %ss
> +	movl    %eax, %fs
> +	movl    %eax, %gs
> +
> +	/* Setup new stack */
> +	leaq    stack_init(%rip), %rsp
> +	pushq   $0x10 /* CS */
> +	leaq    new_cs_exit(%rip), %rax
> +	pushq   %rax
> +	lretq
> +new_cs_exit:
> +
> +	/* Load the registers */
> +	movq	rax(%rip), %rax
> +	movq	rbx(%rip), %rbx
> +	movq	rcx(%rip), %rcx
> +	movq	rdx(%rip), %rdx
> +	movq	rsi(%rip), %rsi
> +	movq	rdi(%rip), %rdi
> +	movq    rsp(%rip), %rsp
> +	movq	rbp(%rip), %rbp
> +	movq	r8(%rip), %r8
> +	movq	r9(%rip), %r9
> +	movq	r10(%rip), %r10
> +	movq	r11(%rip), %r11
> +	movq	r12(%rip), %r12
> +	movq	r13(%rip), %r13
> +	movq	r14(%rip), %r14
> +	movq	r15(%rip), %r15
> +
> +	/* Jump to the new code... */
> +	jmpq	*rip(%rip)
> +
> +	.section ".rodata"
> +	.balign 4
> +entry64_regs:
> +rax:	.quad 0x00000000

Simply 0x0? Or am I missing something?
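
On the 0x0 question: the width of each slot is fixed by .quad (8 bytes),
so 0x0 and 0x00000000 assemble to identical bytes. A hypothetical C mirror
of the slot layout, in the order this entry64.S declares them, makes that
concrete; struct entry64_regs_sketch and set_entry_regs() are illustrative
names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical C view of entry64_regs, in the order entry64.S lays out
 * the slots. Each .quad is 8 bytes however its initializer is spelled.
 */
struct entry64_regs_sketch {
	uint64_t rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
	uint64_t rip;
};

static_assert(sizeof(struct entry64_regs_sketch) == 17 * 8,
	      "one 8-byte slot per .quad");
static_assert(offsetof(struct entry64_regs_sketch, rip) == 16 * 8,
	      "rip is the last slot");

/*
 * Fill the slots the way a loader might before control reaches entry64:
 * start from all zeroes (the assembled defaults), then set the jump
 * target, the boot_params pointer and a stack.
 */
static void set_entry_regs(unsigned char *slots, uint64_t entry,
			   uint64_t boot_params, uint64_t stack)
{
	struct entry64_regs_sketch regs;

	memset(&regs, 0, sizeof(regs));
	regs.rip = entry;
	regs.rsi = boot_params;	/* 64-bit boot protocol: boot_params in %rsi */
	regs.rsp = stack;
	memcpy(slots, &regs, sizeof(regs));
}
```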

> +rbx:	.quad 0x00000000
> +rcx:	.quad 0x00000000
> +rdx:	.quad 0x00000000
> +rsi:	.quad 0x00000000
> +rdi:	.quad 0x00000000
> +rsp:	.quad 0x00000000
> +rbp:	.quad 0x00000000
> +r8:	.quad 0x00000000
> +r9:	.quad 0x00000000
> +r10:	.quad 0x00000000
> +r11:	.quad 0x00000000
> +r12:	.quad 0x00000000
> +r13:	.quad 0x00000000
> +r14:	.quad 0x00000000
> +r15:	.quad 0x00000000
> +rip:	.quad 0x00000000
> +	.size entry64_regs, . - entry64_regs
> +
> +	/* GDT */
> +	.section ".rodata"
> +	.balign 16
> +gdt:
> +	/* 0x00 unusable segment
> +	 * 0x08 unused
> +	 * so use them as gdt ptr
> +	 */
> +	.word gdt_end - gdt - 1
> +	.quad gdt
> +	.word 0, 0, 0
> +
> +	/* 0x10 4GB flat code segment */
> +	.word 0xFFFF, 0x0000, 0x9A00, 0x00AF
> +
> +	/* 0x18 4GB flat data segment */
> +	.word 0xFFFF, 0x0000, 0x9200, 0x00CF
> +gdt_end:
> +stack:	.quad   0, 0
> +stack_init:
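
A small decoder for the two GDT descriptors above, written only to
illustrate what the .word values encode; decode_desc() and seg_size() are
hypothetical helpers. Limit 0xFFFFF with the granularity bit set gives the
4 GiB size named in the comments, the code descriptor has the L bit set for
64-bit code, and the data descriptor has D/B set instead. In 64-bit mode
the processor treats the CS, DS, ES and SS bases as zero and does not check
their limits, so "4GB flat" describes the descriptor fields rather than an
enforced limit.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a segment descriptor given as four 16-bit words (.word w0..w3). */
struct seg_desc {
	uint32_t base;
	uint32_t limit;		/* raw 20-bit limit */
	uint8_t access;		/* P, DPL, S, type */
	uint8_t flags;		/* G, D/B, L, AVL (bits 3..0) */
};

static struct seg_desc decode_desc(const uint16_t w[4])
{
	struct seg_desc d;

	d.limit = w[0] | ((uint32_t)(w[3] & 0x000F) << 16);
	d.base = w[1] | ((uint32_t)(w[2] & 0x00FF) << 16) |
		 ((uint32_t)(w[3] >> 8) << 24);
	d.access = w[2] >> 8;
	d.flags = (w[3] >> 4) & 0x0F;
	return d;
}

/* Size in bytes: with G set, the limit counts 4 KiB pages. */
static uint64_t seg_size(struct seg_desc d)
{
	uint64_t units = (uint64_t)d.limit + 1;

	return (d.flags & 0x8) ? units << 12 : units;
}
```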
> diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
> new file mode 100644
> index 0000000..3a808db
> --- /dev/null
> +++ b/arch/x86/purgatory/purgatory.c
> @@ -0,0 +1,71 @@
> +/*
> + * purgatory: Runs between two kernels
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + *
> + * Author:
> + *       Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include "sha256.h"
> +#include "../boot/string.h"
> +
> +struct sha_region {
> +	unsigned long start;
> +	unsigned long len;
> +};
> +
> +unsigned long backup_dest = 0;
> +unsigned long backup_src = 0;
> +unsigned long backup_sz = 0;
> +
> +u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
> +
> +struct sha_region sha_regions[16] = {};
> +
> +/*
> + * On x86, second kernel requires first 640K of memory to boot. Copy
> + * first 640K to a backup region in reserved memory range so that second
> + * kernel can use first 640K.
> + */
> +static int copy_backup_region(void)
> +{
> +	if (backup_dest)
> +		memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
> +
> +	return 0;
> +}
> +
> +int verify_sha256_digest(void)
> +{
> +	struct sha_region *ptr, *end;
> +	u8 digest[SHA256_DIGEST_SIZE];
> +	struct sha256_state sctx;
> +
> +	sha256_init(&sctx);
> +	end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
> +	for (ptr = sha_regions; ptr < end; ptr++)
> +		sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
> +
> +	sha256_final(&sctx, digest);
> +
> +	if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)

	if (memcmp(...))
		return 1;

should be a bit cleaner.
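
A sketch of the comparison step in the style suggested above, with the
SHA-256 computation left out; digest_mismatch() is a hypothetical name,
and it keeps the patch's convention of returning 1 on mismatch. memcmp()
already returns nonzero whenever the buffers differ, so the explicit
"!= 0" adds nothing.

```c
#include <assert.h>
#include <string.h>

#define SHA256_DIGEST_SIZE 32

/* Returns 1 if the computed digest differs from the expected one, else 0. */
static int digest_mismatch(const unsigned char *digest,
			   const unsigned char *expected)
{
	if (memcmp(digest, expected, SHA256_DIGEST_SIZE))
		return 1;

	return 0;
}
```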

> +		return 1;
> +
> +	return 0;
> +}
> +
> +void purgatory(void)
> +{
> +	int ret;
> +
> +	ret = verify_sha256_digest();
> +	if (ret) {
> +		/* loop forever */
> +		for (;;);

checkpatch bitches about this:

ERROR: trailing statements should be on next line
#303: FILE: arch/x86/purgatory/purgatory.c:68:
+               for (;;);
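
One way to satisfy that checkpatch error is to put the empty loop body on
its own line. The sketch below keeps the control flow of purgatory() but
swaps in hypothetical stubs (verify_digest_stub(), copy_backup_region_stub()
and two small static buffers) so it can run outside purgatory.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the purgatory globals and helpers. */
static unsigned char backup_src_buf[16], backup_dest_buf[16];
static int digest_ok = 1;

static int verify_digest_stub(void)
{
	return digest_ok ? 0 : 1;	/* 0 on match, like verify_sha256_digest() */
}

static void copy_backup_region_stub(void)
{
	memcpy(backup_dest_buf, backup_src_buf, sizeof(backup_dest_buf));
}

/* Same flow as purgatory(): hang on a bad digest, else copy the backup. */
static void purgatory_sketch(void)
{
	if (verify_digest_stub()) {
		/* loop forever; empty body on its own line for checkpatch */
		for (;;)
			;
	}
	copy_backup_region_stub();
}
```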

> +	}
> +	copy_backup_region();
> +}
-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

> diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
> new file mode 100644
> index 0000000..3a808db
> --- /dev/null
> +++ b/arch/x86/purgatory/purgatory.c
> @@ -0,0 +1,71 @@
> +/*
> + * purgatory: Runs between two kernels
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + *
> + * Author:
> + *       Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include "sha256.h"
> +#include "../boot/string.h"
> +
> +struct sha_region {
> +	unsigned long start;
> +	unsigned long len;
> +};
> +
> +unsigned long backup_dest = 0;
> +unsigned long backup_src = 0;
> +unsigned long backup_sz = 0;
> +
> +u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
> +
> +struct sha_region sha_regions[16] = {};
> +
> +/*
> + * On x86, second kernel requries first 640K of memory to boot. Copy
> + * first 640K to a backup region in reserved memory range so that second
> + * kernel can use first 640K.
> + */
> +static int copy_backup_region(void)
> +{
> +	if (backup_dest)
> +		memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
> +
> +	return 0;
> +}
> +
> +int verify_sha256_digest(void)
> +{
> +	struct sha_region *ptr, *end;
> +	u8 digest[SHA256_DIGEST_SIZE];
> +	struct sha256_state sctx;
> +
> +	sha256_init(&sctx);
> +	end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
> +	for (ptr = sha_regions; ptr < end; ptr++)
> +		sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
> +
> +	sha256_final(&sctx, digest);
> +
> +	if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)

	if (memcmp(...))
		return 1;

should be a bit cleaner.

> +		return 1;
> +
> +	return 0;
> +}
> +
> +void purgatory(void)
> +{
> +	int ret;
> +
> +	ret = verify_sha256_digest();
> +	if (ret) {
> +		/* loop forever */
> +		for (;;);

checkpatch bitches about this:

ERROR: trailing statements should be on next line
#303: FILE: arch/x86/purgatory/purgatory.c:68:
+               for (;;);

> +	}
> +	copy_backup_region();
> +}
-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-05 11:15     ` Borislav Petkov
@ 2014-06-05 20:17       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-05 20:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Thu, Jun 05, 2014 at 01:15:35PM +0200, Borislav Petkov wrote:

[..]
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -22,6 +22,13 @@
> >  #include <asm/mmu_context.h>
> >  #include <asm/debugreg.h>
> >  
> > +/* arch dependent functionality related to kexec file based syscall */
> 
>   ... arch-dependent ...			... file-based ...

Will change.

> 
> > +static struct kexec_file_type kexec_file_type[] = {
> 
> You could call this kexec_file_types and use ARRAY_SIZE and drop this
> nr_file_types; mangled diff ontop:

Sounds good. Will change.

[..]
> > +static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
> > +
> 
> Superfluous newline.

Will remove.

> 
> >  static void free_transition_pgtable(struct kimage *image)
> >  {
> >  	free_page((unsigned long)image->arch.pud);
> > @@ -283,3 +290,50 @@ void arch_crash_save_vmcoreinfo(void)
> >  			      (unsigned long)&_text - __START_KERNEL);
> >  }
> >  
> > +/* arch dependent functionality related to kexec file based syscall */
> > +
> > +int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> > +					unsigned long buf_len)
> 
> Arg alignment: it is customary to put function args on new line at the
> next right position after the opening brace. Ditto for the rest of the
> locations where this is the case.

Will do.

[..]
> > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > +			unsigned long kernel_len, char *initrd,
> > +			unsigned long initrd_len, char *cmdline,
> > +			unsigned long cmdline_len)
> 
> Those are a *lot* of arguments. Maybe a helper struct encompassing them
> all to pass around?

I think everything is already available in "struct kimage *image", so I
don't have to pass all these separately. I will remove these extra
parameters and have the arch function retrieve everything from
"struct kimage *image".

I guess I was trying to keep "struct kimage" mostly opaque to arch
functions.

> 
> > +{
> > +	int idx = image->file_handler_idx;
> > +
> > +	if (idx < 0)
> > +		return ERR_PTR(-ENOEXEC);
> > +
> > +	return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > +					initrd_len, cmdline, cmdline_len);
> > +}
> > +
> > +int arch_kimage_file_post_load_cleanup(struct kimage *image)
> > +{
> > +	int idx = image->file_handler_idx;
> > +
> > +	/* This can be called up even before image handler has been set */
> > +	if (idx < 0)
> > +		return 0;
> 
> Btw, these games with the index seem not optimal to me. Why not simply
> have image->fops or so which is a pointer to struct kexec_file_type
> after having it renamed to kexec_file_ops and then assign the correct
> one to image->fops in arch_kexec_kernel_image_probe() and then simply
> call the proper handler:
> 
> 	if (!image->fops)
> 		return;
> 
> 	return image->fops->cleanup(image);
> 
> and above
> 
> 	return image->fops->load(...)
> 
> and so on.
> 
> In any case, this looks cleaner to me.

Ok, I will clean it up.
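
A minimal userspace sketch of the pointer-based dispatch described above, combined with the `kexec_file_types[]`/`ARRAY_SIZE` suggestion. All names here (`struct kexec_file_ops_sketch`, `struct kimage_sketch`, `image_probe()`, the `fops` member, the two-byte "BZ" tag) are hypothetical, and `load()` takes only the image on the assumption that it can pull its buffers from there, as discussed earlier in this reply. It illustrates the shape of the change, not the patch itself.

```c
#include <errno.h>
#include <stddef.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct kimage_sketch;

/* Hypothetical per-format operations, modelled on kexec_file_type. */
struct kexec_file_ops_sketch {
	int (*probe)(const char *buf, unsigned long len);
	void *(*load)(struct kimage_sketch *image);
	int (*cleanup)(struct kimage_sketch *image);
};

/* Hypothetical image: buffers live here, so load() needs only the image. */
struct kimage_sketch {
	char *kernel_buf;
	unsigned long kernel_buf_len;
	const struct kexec_file_ops_sketch *fops; /* NULL until probe succeeds */
};

static int bzimage_probe(const char *buf, unsigned long len)
{
	/* Stand-in check; a real probe would parse the boot header. */
	return (len >= 2 && buf[0] == 'B' && buf[1] == 'Z') ? 0 : -ENOEXEC;
}

static void *bzimage_load(struct kimage_sketch *image)
{
	/* Everything the loader needs is reachable through the image. */
	return image->kernel_buf;
}

static int bzimage_cleanup(struct kimage_sketch *image)
{
	(void)image;
	return 0;
}

static const struct kexec_file_ops_sketch kexec_file_types[] = {
	{ bzimage_probe, bzimage_load, bzimage_cleanup },
};

/* Remember the first handler whose probe accepts the buffer. */
static int image_probe(struct kimage_sketch *image)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(kexec_file_types); i++) {
		if (!kexec_file_types[i].probe(image->kernel_buf,
					       image->kernel_buf_len)) {
			image->fops = &kexec_file_types[i];
			return 0;
		}
	}
	return -ENOEXEC;
}

static void *image_load(struct kimage_sketch *image)
{
	/* The kernel version would return ERR_PTR(-ENOEXEC) here. */
	if (!image->fops)
		return NULL;
	return image->fops->load(image);
}

static int image_cleanup(struct kimage_sketch *image)
{
	/* May run before probe has chosen a handler. */
	if (!image->fops)
		return 0;
	return image->fops->cleanup(image);
}
```

With a pointer instead of an index, the `idx < 0` checks go away: a NULL `fops` plays the "no handler chosen yet" role, which is what lets cleanup run safely before probing.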


[..]
> > +
> > +	/* Additional Fields for file based kexec syscall */
> 
> Why capitalized?

Just typo. Will fix it.

[..]
> > +/*
> > + * Keeps a track of buffer parameters as provided by caller for requesting
> 
> "Keeps track"

will fix.

[..]
> > +/* Listof defined/legal kexec file flags */
> 
> "List of ..."
> 

Will fix.

[..]
> > --- a/include/uapi/linux/kexec.h
> > +++ b/include/uapi/linux/kexec.h
> > @@ -13,6 +13,10 @@
> >  #define KEXEC_PRESERVE_CONTEXT	0x00000002
> >  #define KEXEC_ARCH_MASK		0xffff0000
> >  
> > +/* Kexec file load interface flags */
> > +#define KEXEC_FILE_UNLOAD	0x00000001
> > +#define KEXEC_FILE_ON_CRASH	0x00000002
> 
> Do we have those documented somewhere and what do they mean?

Nope. I will put a couple of comments here to explain what these flags do.

[..]
> > +/* Architectures can provide this probe function */
> > +int __attribute__ ((weak))
> 
> We have __weak for that.

Will change everywhere.
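
For readers unfamiliar with the idiom: the kernel's `__weak` is shorthand for GCC's `__attribute__((weak))`. A minimal userspace sketch, assuming a hypothetical generic default (`arch_kexec_kernel_image_probe_sketch()`) that rejects every image with `-ENOEXEC`; an architecture that links in a strong definition of the same symbol replaces it at link time.

```c
#include <errno.h>

/* Same spelling the kernel's compiler headers provide. */
#ifndef __weak
#define __weak __attribute__((weak))
#endif

/*
 * Generic fallback (hypothetical body): used only if no other object file
 * defines a non-weak arch_kexec_kernel_image_probe_sketch().
 */
int __weak arch_kexec_kernel_image_probe_sketch(void *buf, unsigned long len)
{
	(void)buf;
	(void)len;
	return -ENOEXEC;
}
```

If an architecture supplied its own definition without `__weak`, the linker would pick that one and this body would be discarded.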

[..]
> > +/*
> > + * Free up tempory buffers allocated which are not needed after image has
> > + * been loaded.
> > + *
> > + * Free up memory used by kernel, initrd, and comand line. This is temporary
> > + * memory allocation which is not needed any more after these buffers have
> > + * been loaded into separate segments and have been copied elsewhere
> > + */
> 
> Why do we need that comment? It is obvious what's going on.

It is obvious that we are freeing memory, but what was not obvious to
me was that I had already copied the contents of these buffers into a
separate memory region (segment), hence I am able to free them. So I would
like to keep the comment there.
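
The point the comment preserves can be sketched in userspace: once a temporary buffer has been copied into its segment, the temporary copy can go, and resetting the pointer to NULL after freeing makes the cleanup safe to call twice (once on the load path and again from `kimage_free()` after a later error, as in the quoted patch). The struct and function names are hypothetical; `free()` stands in for `vfree()`, which likewise ignores NULL.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct image_sketch {
	char *cmdline_buf;	/* temporary copy, like image->cmdline_buf */
	char segment[64];	/* stands in for the segment's final memory */
};

/* Copy the temporary buffer into its "segment"; the copy is now redundant. */
static int load_segment(struct image_sketch *image, size_t len)
{
	if (!image->cmdline_buf || len > sizeof(image->segment))
		return -EINVAL;
	memcpy(image->segment, image->cmdline_buf, len);
	return 0;
}

/* Idempotent: the second call sees NULL, and free(NULL) does nothing. */
static void post_load_cleanup(struct image_sketch *image)
{
	free(image->cmdline_buf);
	image->cmdline_buf = NULL;
}
```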

> 
> > +static void kimage_file_post_load_cleanup(struct kimage *image)
> > +{
> > +	vfree(image->kernel_buf);
> > +	image->kernel_buf = NULL;
> > +
> > +	vfree(image->initrd_buf);
> > +	image->initrd_buf = NULL;
> > +
> > +	vfree(image->cmdline_buf);
> > +	image->cmdline_buf = NULL;
> > +
> > +	/* See if architcture has anything to cleanup post load */
> 
> s/architcture/architecture/

Will fix it.

> 
> > +	arch_kimage_file_post_load_cleanup(image);
> > +}
> > +
> > +/*
> > + * In file mode list of segments is prepared by kernel. Copy relevant
> > + * data from user space, do error checking, prepare segment list
> > + */
> > +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> > +		int initrd_fd, const char __user *cmdline_ptr,
> > +		unsigned long cmdline_len)
> 
> arg alignment

Sure, will change everywhere.


[..]
> > +	image->cmdline_buf = vzalloc(cmdline_len);
> > +	if (!image->cmdline_buf)
> 
> 		ret = -ENOMEM;

Good catch. This would have led to various sorts of issues. Will fix it.
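
Why the missing assignment matters: with a single `ret` returned at the end, an allocation failure that jumps to the exit label without setting `ret` reports success, and the caller carries on with a NULL buffer. A minimal sketch of the corrected pattern, with hypothetical names and `calloc()` standing in for `vzalloc()` (both return zeroed memory):

```c
#include <errno.h>
#include <stdlib.h>

struct bufs_sketch {
	char *kernel_buf;
	char *cmdline_buf;
};

/*
 * On failure, buffers already allocated are left for the caller's cleanup
 * path to free, as the patch does with its post-load cleanup routine.
 */
static int prepare_bufs(struct bufs_sketch *b, size_t kernel_len,
			size_t cmdline_len)
{
	int ret = 0;

	b->kernel_buf = calloc(1, kernel_len);
	if (!b->kernel_buf) {
		ret = -ENOMEM;
		goto out;
	}

	b->cmdline_buf = calloc(1, cmdline_len);
	if (!b->cmdline_buf) {
		ret = -ENOMEM;	/* without this, the caller would see 0 */
		goto out;
	}
out:
	return ret;
}
```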

> > +	int result;
> > +	struct kimage *image;
> > +
> > +	/* Allocate and initialize a controlling structure */
> 
> No need for that comment IMO.

Will drop.

> > +
> > +	result = -ENOMEM;
> > +	image->control_code_page = kimage_alloc_control_pages(image,
> > +					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> > +	if (!image->control_code_page) {
> > +		pr_err("Could not allocate control_code_buffer\n");
> 
> Might wanna define pr_fmt when using the pr_* things for the first time
> in this file.

Hmm....

I see that printk.h already provides a definition if pr_fmt is not
defined. Does that mean I don't have to define pr_fmt() before I
use pr_*?

#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif
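
The default above makes `pr_fmt()` the identity when a file does not define it, so no definition is strictly required. The usual reason to define it anyway is to give every `pr_*()` message in a file a common prefix; the definition has to come before the first `#include` that pulls in printk.h. A userspace sketch of the mechanism, with a hypothetical `pr_err_sketch()` that formats into a buffer instead of logging:

```c
#include <stdio.h>

/* Defined first, as a .c file would do before its #includes. */
#define pr_fmt(fmt) "kexec: " fmt

/* What printk.h does: only supply the identity default if none exists. */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

/* Simplified stand-in for pr_err(): format into buf instead of the log. */
#define pr_err_sketch(buf, size, fmt, ...) \
	snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)
```

Relatedly, since `pr_err()` already prepends `KERN_ERR`, the `pr_err(KERN_ERR "Could not allocate swap buffer\n")` call in the quoted patch passes the log level twice.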


> 
> > +		goto out_free_post_load_bufs;
> > +	}
> > +
> > +	image->swap_page = kimage_alloc_control_pages(image, 0);
> > +	if (!image->swap_page) {
> > +		pr_err(KERN_ERR "Could not allocate swap buffer\n");
> > +		goto out_free_control_pages;
> > +	}
> > +
> > +	*rimage = image;
> > +	return 0;
> > +
> > +out_free_control_pages:
> > +	kimage_free_page_list(&image->control_pages);
> > +out_free_post_load_bufs:
> > +	kimage_file_post_load_cleanup(image);
> > +	kfree(image->image_loader_data);
> > +out_free_image:
> > +	kfree(image);
> > +	return result;
> > +}
> > +
> >  static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
> >  				unsigned long nr_segments,
> >  				struct kexec_segment __user *segments)
> > @@ -683,6 +898,16 @@ static void kimage_free(struct kimage *image)
> >  
> >  	/* Free the kexec control pages... */
> >  	kimage_free_page_list(&image->control_pages);
> > +
> > +	kfree(image->image_loader_data);
> > +
> > +	/*
> > +	 * Free up any temporary buffers allocated. This might hit if
> > +	 * error occurred much later after buffer allocation.
> > +	 */
> > +	if (image->file_mode)
> > +		kimage_file_post_load_cleanup(image);
> > +
> >  	kfree(image);
> >  }
> >  
> > @@ -812,10 +1037,14 @@ static int kimage_load_normal_segment(struct kimage *image,
> >  	unsigned long maddr;
> >  	size_t ubytes, mbytes;
> >  	int result;
> > -	unsigned char __user *buf;
> > +	unsigned char __user *buf = NULL;
> > +	unsigned char *kbuf = NULL;
> >  
> >  	result = 0;
> > -	buf = segment->buf;
> > +	if (image->file_mode)
> > +		kbuf = segment->kbuf;
> > +	else
> > +		buf = segment->buf;
> >  	ubytes = segment->bufsz;
> >  	mbytes = segment->memsz;
> >  	maddr = segment->mem;
> > @@ -847,7 +1076,11 @@ static int kimage_load_normal_segment(struct kimage *image,
> >  				PAGE_SIZE - (maddr & ~PAGE_MASK));
> >  		uchunk = min(ubytes, mchunk);
> >  
> > -		result = copy_from_user(ptr, buf, uchunk);
> > +		/* For file based kexec, source pages are in kernel memory */
> > +		if (image->file_mode)
> > +			memcpy(ptr, kbuf, uchunk);
> > +		else
> > +			result = copy_from_user(ptr, buf, uchunk);
> >  		kunmap(page);
> >  		if (result) {
> >  			result = -EFAULT;
> > @@ -855,7 +1088,10 @@ static int kimage_load_normal_segment(struct kimage *image,
> >  		}
> >  		ubytes -= uchunk;
> >  		maddr  += mchunk;
> > -		buf    += mchunk;
> > +		if (image->file_mode)
> > +			kbuf += mchunk;
> > +		else
> > +			buf += mchunk;
> >  		mbytes -= mchunk;
> >  	}
> >  out:
> > @@ -1102,7 +1338,64 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> >  		const char __user *, cmdline_ptr, unsigned long,
> >  		cmdline_len, unsigned long, flags)
> >  {
> > -	return -ENOSYS;
> > +	int ret = 0, i;
> > +	struct kimage **dest_image, *image;
> > +
> > +	/* We only trust the superuser with rebooting the system. */
> > +	if (!capable(CAP_SYS_BOOT))
> > +		return -EPERM;
> > +
> > +	/* Make sure we have a legal set of flags */
> > +	if (flags != (flags & KEXEC_FILE_FLAGS))
> > +		return -EINVAL;
> 
> This test looks strange: according to it, kexec_file_load has to always
> be called with both KEXEC_FILE_UNLOAD and KEXEC_FILE_ON_CRASH set.

I think this test says that "flags" has to be some combination of valid
flags, and the superset of valid flags is KEXEC_FILE_FLAGS.

So the user can pass only KEXEC_FILE_ON_CRASH:
flags = 0x00000002
KEXEC_FILE_FLAGS = 0x00000003
flags & KEXEC_FILE_FLAGS = 0x00000002 = flags.
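
The same check can be exercised outside the kernel. A sketch reusing the flag values from the quoted uapi header; the per-flag comments describe their presumed meaning, and `KEXEC_FILE_FLAGS` is assumed to be the OR of the two:

```c
#include <errno.h>

/* Values from the quoted include/uapi/linux/kexec.h hunk. */
#define KEXEC_FILE_UNLOAD	0x00000001 /* unload the currently loaded image */
#define KEXEC_FILE_ON_CRASH	0x00000002 /* operate on the crash (kdump) image */

/* Assumed to be the set of all legal flags. */
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/*
 * flags & KEXEC_FILE_FLAGS clears every bit outside the legal set, so the
 * two sides differ only if flags carried an illegal bit. Any subset of the
 * legal flags, including 0, passes.
 */
static int check_file_flags(unsigned long flags)
{
	if (flags != (flags & KEXEC_FILE_FLAGS))
		return -EINVAL;
	return 0;
}
```

An equivalent spelling of the test is `if (flags & ~KEXEC_FILE_FLAGS)`.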

> Don't you want to check against an allowed mask or so like KEXEC_FLAGS is
> handled in kexec_load?

KEXEC_FILE_FLAGS is the set of allowed flags, and I am checking flags
against it.

Are you referring to the fact that kexec_load() also passes it
additionally through KEXEC_ARCH_MASK?

Actually, I am not sure we need KEXEC_ARCH_MASK in this new system
call. I see that the old call reserved the upper 16 bits of flags for
passing the arch info, and the passed-in arch needs to be either the
native arch or the default arch.
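
For reference, the check in the existing `kexec_load()` works roughly as sketched below: the low bits must be a subset of `KEXEC_FLAGS`, and the arch field in the upper 16 bits must be either the running architecture or `KEXEC_ARCH_DEFAULT` (0). This is a userspace approximation; the concrete values assume an x86_64 kernel built without `CONFIG_KEXEC_JUMP`, so `KEXEC_FLAGS` is just `KEXEC_ON_CRASH`.

```c
#include <errno.h>

#define KEXEC_ON_CRASH		0x00000001
#define KEXEC_ARCH_MASK		0xffff0000
#define KEXEC_ARCH_DEFAULT	(0 << 16)
#define KEXEC_ARCH_386		(3 << 16)
#define KEXEC_ARCH_X86_64	(62 << 16)

/* Assumptions: x86_64 host, CONFIG_KEXEC_JUMP disabled. */
#define KEXEC_ARCH		KEXEC_ARCH_X86_64
#define KEXEC_FLAGS		KEXEC_ON_CRASH

static int check_kexec_load_flags(unsigned long flags)
{
	/* Low bits: only known flags may be set. */
	if ((flags & KEXEC_FLAGS) != (flags & ~KEXEC_ARCH_MASK))
		return -EINVAL;

	/* High bits: the native arch, or "default" (0). */
	if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
	    ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
		return -EINVAL;

	return 0;
}
```

Nothing in this check looks at the image itself; it only compares what user space claims against the running kernel.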

I guess it might have been done to ensure that the user is not expecting
to boot an image which can't boot on the running arch. But this test
could have been done entirely in user space.

I am not sure what the purpose of this test is, or why we would need it
in the new syscall. We are passing the new kernel file to the system, and
the image loader can look at the arch info in the header and decide
whether this image can be booted on the currently running arch.

Eric, can you please shed some light on the purpose of passing arch
info in flags in kexec_load()?

[..]
> > +/*
> > + * Helper functions for placing a buffer in a kexec segment. This assumes
> 
> s/functions/function/

Will fix.

[..]
> > +	/* Align memsz to next page boundary */
> 
> No need for that comment...

Ok, will drop.

> 
> > +	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> > +
> > +	/* Align to atleast page size boundary */
> 
> ditto.

Will drop.

> 
> > +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> > +	kbuf->buf_min = buf_min;
> > +	kbuf->buf_max = buf_max;
> > +	kbuf->top_down = top_down;
> > +
> > +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> > +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> > +
> > +	/*
> > +	 * If range could be found successfully, it would have incremented
> > +	 * the nr_segments value.
> > +	 */
> > +	new_nr_segments = image->nr_segments;
> > +
> > +	/* A suitable memory range could not be found for buffer */
> > +	if (new_nr_segments == nr_segments)
> > +		return -EADDRNOTAVAIL;
> 
> Right, why don't you check walk_system_ram_res's retval? If it is != 0,
> i.e. walk_ram_range_callback gives a 1 on "success", you can drop this
> way of checking whether finding a new range succeeded.

In the last version, when I had ELF header support, I was checking for
return code 1 in one place and you did not like that.

Anyway, I think the problem here is that the walk_* variants use the
return code of the called function to decide whether to continue looping
or not. I think these are two independent activities. Pass a boolean
to the called function, which should be set to true if the callee wants
to stop the loop.

That way, the callee can report both errors and success without having
to worry about the loop: it can return 0 to represent success and a
negative error code to represent an error.

How about the following patch? It only compiles; I have not tested it
yet, but I think it should work.

---
 include/linux/ioport.h |    4 +--
 kernel/kexec.c         |   50 +++++++++++++++++++++----------------------------
 kernel/resource.c      |   14 +++++++------
 3 files changed, 32 insertions(+), 36 deletions(-)

Index: linux-2.6/kernel/resource.c
===================================================================
--- linux-2.6.orig/kernel/resource.c	2014-06-02 14:47:58.292304960 -0400
+++ linux-2.6/kernel/resource.c	2014-06-05 15:23:40.305446872 -0400
@@ -371,11 +371,12 @@ static int find_next_iomem_res(struct re
 }
 
 int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
-		void *arg, int (*func)(u64, u64, void *))
+		void *arg, int (*func)(u64, u64, void *, bool *))
 {
 	struct resource res;
 	u64 orig_end;
 	int ret = -1;
+	bool stop = false;
 
 	res.start = start;
 	res.end = end;
@@ -383,8 +384,8 @@ int walk_ram_res(char *name, unsigned lo
 	orig_end = res.end;
 	while ((res.start < res.end) &&
 		(find_next_iomem_res(&res, name) >= 0)) {
-		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		ret = (*func)(res.start, res.end, arg, &stop);
+		if (stop)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
@@ -441,11 +442,12 @@ static int find_next_system_ram(struct r
  * with pfn can truncate ranges.
  */
 int walk_system_ram_res(u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *))
+				int (*func)(u64, u64, void *, bool *))
 {
 	struct resource res;
 	u64 orig_end;
 	int ret = -1;
+	bool stop = false;
 
 	res.start = start;
 	res.end = end;
@@ -453,8 +455,8 @@ int walk_system_ram_res(u64 start, u64 e
 	orig_end = res.end;
 	while ((res.start < res.end) &&
 		(find_next_system_ram(&res, "System RAM") >= 0)) {
-		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		ret = (*func)(res.start, res.end, arg, &stop);
+		if (stop)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c	2014-06-05 14:58:26.976833241 -0400
+++ linux-2.6/kernel/kexec.c	2014-06-05 15:33:47.986727393 -0400
@@ -2063,7 +2063,7 @@ static int __kexec_add_segment(struct ki
 }
 
 static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+					struct kexec_buf *kbuf, bool *stop)
 {
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
@@ -2076,7 +2076,7 @@ static int locate_mem_hole_top_down(unsi
 		temp_start = temp_start & (~(kbuf->buf_align - 1));
 
 		if (temp_start < start || temp_start < kbuf->buf_min)
-			return 0;
+			return -EADDRNOTAVAIL;
 
 		temp_end = temp_start + kbuf->memsz - 1;
 
@@ -2098,11 +2098,12 @@ static int locate_mem_hole_top_down(unsi
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	*stop = true;
+	return 0;
 }
 
 static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+					struct kexec_buf *kbuf, bool *stop)
 {
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
@@ -2114,7 +2115,7 @@ static int locate_mem_hole_bottom_up(uns
 		temp_end = temp_start + kbuf->memsz - 1;
 
 		if (temp_end > end || temp_end > kbuf->buf_max)
-			return 0;
+			return -EADDRNOTAVAIL;
 		/*
 		 * Make sure this does not conflict with any of existing
 		 * segments
@@ -2133,29 +2134,29 @@ static int locate_mem_hole_bottom_up(uns
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	*stop = true;
+	return 0;
 }
 
-static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+static int walk_ram_range_callback(u64 start, u64 end, void *arg, bool *stop)
 {
 	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
 	unsigned long sz = end - start + 1;
 
-	/* Returning 0 will take to next memory range */
 	if (sz < kbuf->memsz)
-		return 0;
+		return -EADDRNOTAVAIL;
 
 	if (end < kbuf->buf_min || start > kbuf->buf_max)
-		return 0;
+		return -EADDRNOTAVAIL;
 
 	/*
 	 * Allocate memory top down with-in ram range. Otherwise bottom up
 	 * allocation.
 	 */
 	if (kbuf->top_down)
-		return locate_mem_hole_top_down(start, end, kbuf);
+		return locate_mem_hole_top_down(start, end, kbuf, stop);
 	else
-		return locate_mem_hole_bottom_up(start, end, kbuf);
+		return locate_mem_hole_bottom_up(start, end, kbuf, stop);
 }
 
 /*
@@ -2168,15 +2169,15 @@ int kexec_add_buffer(struct kimage *imag
 		unsigned long buf_max, bool top_down, unsigned long *load_addr)
 {
 
-	unsigned long nr_segments = image->nr_segments, new_nr_segments;
 	struct kexec_segment *ksegment;
 	struct kexec_buf buf, *kbuf;
+	int ret;
 
 	/* Currently adding segment this way is allowed only in file mode */
 	if (!image->file_mode)
 		return -EINVAL;
 
-	if (nr_segments >= KEXEC_SEGMENT_MAX)
+	if (image->nr_segments >= KEXEC_SEGMENT_MAX)
 		return -EINVAL;
 
 	/*
@@ -2208,25 +2209,18 @@ int kexec_add_buffer(struct kimage *imag
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
 	if (image->type == KEXEC_TYPE_CRASH)
-		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
-				crashk_res.start, crashk_res.end, kbuf,
-				walk_ram_range_callback);
+		ret = walk_ram_res("Crash kernel",
+				   IORESOURCE_MEM | IORESOURCE_BUSY,
+				   crashk_res.start, crashk_res.end, kbuf,
+				   walk_ram_range_callback);
 	else
-		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+		ret = walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
-	/*
-	 * If range could be found successfully, it would have incremented
-	 * the nr_segments value.
-	 */
-	new_nr_segments = image->nr_segments;
-
-	/* A suitable memory range could not be found for buffer */
-	if (new_nr_segments == nr_segments)
+	if (ret)
 		return -EADDRNOTAVAIL;
 
 	/* Found a suitable memory range */
-
-	ksegment = &image->segment[new_nr_segments - 1];
+	ksegment = &image->segment[image->nr_segments - 1];
 	*load_addr = ksegment->mem;
 	return 0;
 }
Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h	2014-06-05 15:21:08.064872797 -0400
+++ linux-2.6/include/linux/ioport.h	2014-06-05 15:23:56.713616633 -0400
@@ -239,10 +239,10 @@ walk_system_ram_range(unsigned long star
 		void *arg, int (*func)(unsigned long, unsigned long, void *));
 extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *));
+				int (*func)(u64, u64, void *, bool *));
 extern int
 walk_ram_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *));
+				int (*func)(u64, u64, void *, bool *));
 
 /* True if any part of r1 overlaps r2 */
 static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
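
The callback protocol in the patch above can be tried out in userspace. In this sketch (hypothetical names, with a fixed array of ranges standing in for the System RAM resources), the return value carries success or an error while `*stop` alone ends the walk; as in the patch, the walker returns whatever the last callback returned, or -1 if there was nothing to visit.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

struct range_sketch {
	u64 start, end;		/* inclusive, like struct resource */
};

/* Mirrors walk_system_ram_res() with the proposed bool *stop argument. */
static int walk_ranges(const struct range_sketch *r, size_t n, void *arg,
		       int (*func)(u64, u64, void *, bool *))
{
	int ret = -1;
	bool stop = false;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = func(r[i].start, r[i].end, arg, &stop);
		if (stop)
			break;
	}
	return ret;
}

struct want_sketch {
	u64 memsz;	/* bytes needed */
	u64 found;	/* start of the chosen range */
};

/* Accept the first range large enough; report -EADDRNOTAVAIL otherwise. */
static int fit_callback(u64 start, u64 end, void *arg, bool *stop)
{
	struct want_sketch *w = arg;

	if (end - start + 1 < w->memsz)
		return -EADDRNOTAVAIL;

	w->found = start;
	*stop = true;
	return 0;
}
```

Since `kexec_add_buffer()` in the patch maps any nonzero `ret` to `-EADDRNOTAVAIL`, the -1 for "no ranges at all" ends up reported as the same error.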



> > +	if (flags != (flags & KEXEC_FILE_FLAGS))
> > +		return -EINVAL;
> 
> This test looks strange: according to it, kexec_file_load has to always
> be called with both KEXEC_FILE_UNLOAD and KEXEC_FILE_ON_CRASH set.

I think this test says that "flags" has to be some combination of valid
flags, and the superset of valid flags is KEXEC_FILE_FLAGS.

So the user can pass only KEXEC_FILE_ON_CRASH:
flags = 0x00000002
KEXEC_FILE_FLAGS = 0x00000003
flags & KEXEC_FILE_FLAGS = 0x00000002 = flags.
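
The subset test can be checked in isolation. In this sketch, KEXEC_FILE_ON_CRASH (0x2) and KEXEC_FILE_FLAGS (0x3) come from the discussion above; KEXEC_FILE_UNLOAD = 0x1 is an assumption that makes the mask add up to 0x3.

```c
#include <assert.h>

#define KEXEC_FILE_UNLOAD	0x00000001	/* assumed value */
#define KEXEC_FILE_ON_CRASH	0x00000002
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/*
 * Returns 1 if every bit set in flags is also in KEXEC_FILE_FLAGS.
 * Masking clears unknown bits; if that changes the value, an unknown
 * bit was present.
 */
static int flags_valid(unsigned long flags)
{
	return flags == (flags & KEXEC_FILE_FLAGS);
}
```

With these values, 0, 0x1, 0x2 and 0x3 pass, while any value with a bit outside 0x3 set (for example 0x4) is rejected.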

> Don't you want to check against an allowed mask or so like KEXEC_FLAGS is
> handled in kexec_load?

KEXEC_FILE_FLAGS is the set of allowed flags and I am passing flags
through it. 

Are you referring to the fact that kexec_load() also passes it
additionally through KEXEC_ARCH_MASK?

Actually, I am not sure we need KEXEC_ARCH_MASK in this new system
call. I see that the old call reserved the upper 16 bits of flags for
passing the arch info, and the passed-in arch needs to be either the
native arch or the default arch.

I guess it might have been done to ensure that the user is not expecting
to boot an image which can't boot on the running arch. But this test
could have been done entirely in user space.

I have no idea what the purpose of this test is, or why we would need
it in the new syscall. We are passing the new kernel file to the kernel,
and the image loader can look at the arch info in the header and decide
whether this image can be booted on the currently running arch or not.

Eric, can you please shed some light on the purpose of passing arch
info in the flags of kexec_load()?
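
For reference, the arch check that kexec_load() applies to the upper 16 bits of flags looks roughly like this sketch. The constant values are assumed to match include/uapi/linux/kexec.h, and `native_arch` stands in for the per-architecture KEXEC_ARCH constant; treat it as an approximation, not the kernel's code.

```c
#include <assert.h>

/* Assumed to mirror include/uapi/linux/kexec.h */
#define KEXEC_ARCH_MASK		0xffff0000UL
#define KEXEC_ARCH_DEFAULT	(0UL << 16)
#define KEXEC_ARCH_386		(3UL << 16)
#define KEXEC_ARCH_X86_64	(62UL << 16)

/* Accept the native arch or "default"; anything else is rejected. */
static int arch_flags_ok(unsigned long flags, unsigned long native_arch)
{
	unsigned long arch = flags & KEXEC_ARCH_MASK;

	return arch == native_arch || arch == KEXEC_ARCH_DEFAULT;
}
```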
	

[..]
> > +/*
> > + * Helper functions for placing a buffer in a kexec segment. This assumes
> 
> s/functions/function/

Will fix.

[..]
> > +	/* Align memsz to next page boundary */
> 
> No need for that comment...

Ok, will drop.

> 
> > +	kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> > +
> > +	/* Align to atleast page size boundary */
> 
> ditto.

Will drop.
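
What the two assignments in this hunk compute, as a standalone sketch: memsz is rounded up to a whole number of pages, and the alignment is never allowed to drop below one page. align_up() uses the same mask arithmetic as the kernel's ALIGN() for power-of-two alignments; the 4 KiB PAGE_SIZE is an assumption.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Round x up to the next multiple of a; a must be a power of two. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}
```

For example, a 5000-byte memsz becomes 8192, and a requested alignment of 16 is raised to 4096.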

> 
> > +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> > +	kbuf->buf_min = buf_min;
> > +	kbuf->buf_max = buf_max;
> > +	kbuf->top_down = top_down;
> > +
> > +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> > +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> > +
> > +	/*
> > +	 * If range could be found successfully, it would have incremented
> > +	 * the nr_segments value.
> > +	 */
> > +	new_nr_segments = image->nr_segments;
> > +
> > +	/* A suitable memory range could not be found for buffer */
> > +	if (new_nr_segments == nr_segments)
> > +		return -EADDRNOTAVAIL;
> 
> Right, why don't you check walk_system_ram_res's retval? If it is != 0,
> i.e. walk_ram_range_callback gives a 1 on "success", you can drop this
> way of checking whether finding a new range succeeded.

In the last version, when I had ELF header support, I was checking for
return code 1 in one place and you did not like that.

Anyway, I think the problem here is that the walk_* variants use the
return code of the called function to decide whether to continue looping
or not. I think these are two independent activities. Pass a boolean
to the called function, which the callee sets to true if it wants to
stop the loop.

That way, the callee can pass back both errors and success without having
to worry about the loop. The callee can return 0 to represent success and
a negative error code to represent an error.
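
A minimal sketch of that calling convention, with hypothetical names (walk_ranges, fits_cb) and a plain array standing in for the resource tree: the callback's return value carries 0 or a negative errno, and only *stop ends the walk. As in the patch that follows, the walker returns the last callback's value, or -1 if no range was visited.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct range {
	unsigned long start, end;	/* inclusive bounds */
};

static int walk_ranges(const struct range *r, int n, void *arg,
		       int (*func)(unsigned long, unsigned long, void *, bool *))
{
	int ret = -1;		/* nothing visited yet */
	bool stop = false;

	for (int i = 0; i < n && !stop; i++)
		ret = func(r[i].start, r[i].end, arg, &stop);
	return ret;
}

/* Example callback: accept the first range of at least *arg bytes. */
static int fits_cb(unsigned long start, unsigned long end, void *arg,
		   bool *stop)
{
	unsigned long need = *(unsigned long *)arg;

	if (end - start + 1 < need)
		return -EADDRNOTAVAIL;	/* too small, keep walking */
	*stop = true;			/* found one, end the walk */
	return 0;
}
```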

How about the following patch? It compiles, but I have not tested it
yet. I think it should work.

---
 include/linux/ioport.h |    4 +--
 kernel/kexec.c         |   50 +++++++++++++++++++++----------------------------
 kernel/resource.c      |   14 +++++++------
 3 files changed, 32 insertions(+), 36 deletions(-)

Index: linux-2.6/kernel/resource.c
===================================================================
--- linux-2.6.orig/kernel/resource.c	2014-06-02 14:47:58.292304960 -0400
+++ linux-2.6/kernel/resource.c	2014-06-05 15:23:40.305446872 -0400
@@ -371,11 +371,12 @@ static int find_next_iomem_res(struct re
 }
 
 int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
-		void *arg, int (*func)(u64, u64, void *))
+		void *arg, int (*func)(u64, u64, void *, bool *))
 {
 	struct resource res;
 	u64 orig_end;
 	int ret = -1;
+	bool stop = false;
 
 	res.start = start;
 	res.end = end;
@@ -383,8 +384,8 @@ int walk_ram_res(char *name, unsigned lo
 	orig_end = res.end;
 	while ((res.start < res.end) &&
 		(find_next_iomem_res(&res, name) >= 0)) {
-		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		ret = (*func)(res.start, res.end, arg, &stop);
+		if (stop == true)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
@@ -441,11 +442,12 @@ static int find_next_system_ram(struct r
  * with pfn can truncate ranges.
  */
 int walk_system_ram_res(u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *))
+				int (*func)(u64, u64, void *, bool *))
 {
 	struct resource res;
 	u64 orig_end;
 	int ret = -1;
+	bool stop = false;
 
 	res.start = start;
 	res.end = end;
@@ -453,8 +455,8 @@ int walk_system_ram_res(u64 start, u64 e
 	orig_end = res.end;
 	while ((res.start < res.end) &&
 		(find_next_system_ram(&res, "System RAM") >= 0)) {
-		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		ret = (*func)(res.start, res.end, arg, &stop);
+		if (stop == true)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c	2014-06-05 14:58:26.976833241 -0400
+++ linux-2.6/kernel/kexec.c	2014-06-05 15:33:47.986727393 -0400
@@ -2063,7 +2063,7 @@ static int __kexec_add_segment(struct ki
 }
 
 static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+					struct kexec_buf *kbuf, bool *stop)
 {
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
@@ -2076,7 +2076,7 @@ static int locate_mem_hole_top_down(unsi
 		temp_start = temp_start & (~(kbuf->buf_align - 1));
 
 		if (temp_start < start || temp_start < kbuf->buf_min)
-			return 0;
+			return -EADDRNOTAVAIL;
 
 		temp_end = temp_start + kbuf->memsz - 1;
 
@@ -2098,11 +2098,12 @@ static int locate_mem_hole_top_down(unsi
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	*stop = true;
+	return 0;
 }
 
 static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+					struct kexec_buf *kbuf, bool *stop)
 {
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
@@ -2114,7 +2115,7 @@ static int locate_mem_hole_bottom_up(uns
 		temp_end = temp_start + kbuf->memsz - 1;
 
 		if (temp_end > end || temp_end > kbuf->buf_max)
-			return 0;
+			return -EADDRNOTAVAIL;
 		/*
 		 * Make sure this does not conflict with any of existing
 		 * segments
@@ -2133,29 +2134,29 @@ static int locate_mem_hole_bottom_up(uns
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	*stop = true;
+	return 0;
 }
 
-static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+static int walk_ram_range_callback(u64 start, u64 end, void *arg, bool *stop)
 {
 	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
 	unsigned long sz = end - start + 1;
 
-	/* Returning 0 will take to next memory range */
 	if (sz < kbuf->memsz)
-		return 0;
+		return -EADDRNOTAVAIL;
 
 	if (end < kbuf->buf_min || start > kbuf->buf_max)
-		return 0;
+		return -EADDRNOTAVAIL;
 
 	/*
 	 * Allocate memory top down with-in ram range. Otherwise bottom up
 	 * allocation.
 	 */
 	if (kbuf->top_down)
-		return locate_mem_hole_top_down(start, end, kbuf);
+		return locate_mem_hole_top_down(start, end, kbuf, stop);
 	else
-		return locate_mem_hole_bottom_up(start, end, kbuf);
+		return locate_mem_hole_bottom_up(start, end, kbuf, stop);
 }
 
 /*
@@ -2168,15 +2169,15 @@ int kexec_add_buffer(struct kimage *imag
 		unsigned long buf_max, bool top_down, unsigned long *load_addr)
 {
 
-	unsigned long nr_segments = image->nr_segments, new_nr_segments;
 	struct kexec_segment *ksegment;
 	struct kexec_buf buf, *kbuf;
+	int ret;
 
 	/* Currently adding segment this way is allowed only in file mode */
 	if (!image->file_mode)
 		return -EINVAL;
 
-	if (nr_segments >= KEXEC_SEGMENT_MAX)
+	if (image->nr_segments >= KEXEC_SEGMENT_MAX)
 		return -EINVAL;
 
 	/*
@@ -2208,25 +2209,18 @@ int kexec_add_buffer(struct kimage *imag
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
 	if (image->type == KEXEC_TYPE_CRASH)
-		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
-				crashk_res.start, crashk_res.end, kbuf,
-				walk_ram_range_callback);
+		ret = walk_ram_res("Crash kernel",
+				   IORESOURCE_MEM | IORESOURCE_BUSY,
+				   crashk_res.start, crashk_res.end, kbuf,
+				   walk_ram_range_callback);
 	else
-		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+		ret = walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
-	/*
-	 * If range could be found successfully, it would have incremented
-	 * the nr_segments value.
-	 */
-	new_nr_segments = image->nr_segments;
-
-	/* A suitable memory range could not be found for buffer */
-	if (new_nr_segments == nr_segments)
+	if (ret)
 		return -EADDRNOTAVAIL;
 
 	/* Found a suitable memory range */
-
-	ksegment = &image->segment[new_nr_segments - 1];
+	ksegment = &image->segment[image->nr_segments - 1];
 	*load_addr = ksegment->mem;
 	return 0;
 }
Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h	2014-06-05 15:21:08.064872797 -0400
+++ linux-2.6/include/linux/ioport.h	2014-06-05 15:23:56.713616633 -0400
@@ -239,10 +239,10 @@ walk_system_ram_range(unsigned long star
 		void *arg, int (*func)(unsigned long, unsigned long, void *));
 extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *));
+				int (*func)(u64, u64, void *, bool *));
 extern int
 walk_ram_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *));
+				int (*func)(u64, u64, void *, bool *));
 
 /* True if any part of r1 overlaps r2 */
 static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
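
The top-down search in locate_mem_hole_top_down() boils down to: take the highest end address allowed by both the RAM range and buf_max, step back by memsz, align the start down, and give up if that falls below the range start or buf_min. A simplified sketch of that arithmetic, where place_top_down() is hypothetical and the overlap check against already-placed segments is omitted:

```c
#include <assert.h>
#include <errno.h>

/* All bounds are inclusive; align must be a power of two. */
static int place_top_down(unsigned long start, unsigned long end,
			  unsigned long memsz, unsigned long align,
			  unsigned long buf_min, unsigned long buf_max,
			  unsigned long *out)
{
	unsigned long top = end < buf_max ? end : buf_max;
	unsigned long cand;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* The buffer must fit between start and top (no underflow). */
	if (memsz == 0 || top < start || top - start < memsz - 1)
		return -EADDRNOTAVAIL;

	cand = (top - memsz + 1) & ~(align - 1);	/* align down */
	if (cand < start || cand < buf_min)
		return -EADDRNOTAVAIL;

	*out = cand;
	return 0;
}
```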


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-05 20:17       ` Vivek Goyal
@ 2014-06-06  2:11         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-06  2:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Thu, Jun 05, 2014 at 04:17:32PM -0400, Vivek Goyal wrote:
> I think everything is already available in "struct kimage *image". So
> I don't have to pass all these separately. I think I will remove all
> these extra parameters and expect arch function to retrieve all that
> from "struct kimage *image".

Sounds good.

> I guess I was trying to make "struct kimage" mostly opaque to arch
> functions.

Sure, of course. The question is, is it really worth it at the price
of having a lot of args. And you're passing struct kimage to arch_*
functions anyway.

...

> It is obvious that we are freeing memory, but what was not obvious to
> me is that I already copied the contents of these buffers into a separate
> memory region (segment), hence I am able to free them. So I would like to
> keep the comment there.

Ok.

> > > +
> > > +	result = -ENOMEM;
> > > +	image->control_code_page = kimage_alloc_control_pages(image,
> > > +					   get_order(KEXEC_CONTROL_PAGE_SIZE));
> > > +	if (!image->control_code_page) {
> > > +		pr_err("Could not allocate control_code_buffer\n");
> > 
> > Might wanna define pr_fmt when using the pr_* things for the first time
> > in this file.
> 
> Hmm....
> 
> I see that printk.h already provides a definition if pr_fmt is not
> defined. So that means I shouldn't have to define pr_fmt() before I
> use pr_*?
> 
> #ifndef pr_fmt
> #define pr_fmt(fmt) fmt
> #endif

Yep, so you could do

#undef pr_fmt
#define pr_fmt(fmt) "kexec: " fmt

or you can do the standard

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

Just look around the tree for examples, there are plenty.
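
pr_fmt() has to expand to the prefix followed by its fmt argument, so that string-literal concatenation yields a single format string; a definition that expands to "kexec: " alone would discard the message. A userspace sketch of the same pattern, where format_msg() is a hypothetical stand-in for pr_err() that writes into a buffer instead of the kernel log:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix every message from this file, as pr_fmt() does in-kernel. */
#define pr_fmt(fmt) "kexec: " fmt

/* fmt must be a string literal so the concatenation works. */
#define format_msg(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), __VA_ARGS__)
```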

> > This test looks strange: according to it, kexec_file_load has to always
> > be called with both KEXEC_FILE_UNLOAD and KEXEC_FILE_ON_CRASH set.
> 
> I think this test says that "flags" has to be some combination of valid
> flags and superset is in KEXEC_FILE_FLAGS.
> 
> So user can pass only KEXEC_FILE_ON_CRASH.
> flags = 0x00000002
> KEXEC_FILE_FLAGS = 0x00000003
> flags & KEXEC_FILE_FLAGS = 0x00000002 = flags.

Bah, ignore me - I got confused, sorry.

> > > +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> > > +	kbuf->buf_min = buf_min;
> > > +	kbuf->buf_max = buf_max;
> > > +	kbuf->top_down = top_down;
> > > +
> > > +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> > > +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> > > +
> > > +	/*
> > > +	 * If range could be found successfully, it would have incremented
> > > +	 * the nr_segments value.
> > > +	 */
> > > +	new_nr_segments = image->nr_segments;
> > > +
> > > +	/* A suitable memory range could not be found for buffer */
> > > +	if (new_nr_segments == nr_segments)
> > > +		return -EADDRNOTAVAIL;
> > 
> > Right, why don't you check walk_system_ram_res's retval? If it is != 0,
> > i.e. walk_ram_range_callback gives a 1 on "success", you can drop this
> > way of checking whether finding a new range succeeded.
> 
> In the last version, when I had ELF header support, I was checking for
> return code 1 in one place and you did not like that.
> 
> Anyway, I think the problem here is that the walk_* variants use the
> return code of the called function to decide whether to continue looping
> or not. I think these are two independent activities. Pass a boolean
> to the called function, which the callee sets to true if it wants to
> stop the loop.
> 
> That way, the callee can pass back both errors and success without having
> to worry about the loop. The callee can return 0 to represent success and
> a negative error code to represent an error.

But why? It should be the caller's responsibility to deal with the errors.
If it encounters one, it decides whether to stop looping or not.

In any case, you don't need a second bool arg to pass around.

If you want to make it more explicit, you could do

#define RES_OK		0
#define RES_ERR		1
#define RES_STOP	2

("RES" for resource :-)) and signal what to do by returning one of those
return values. Or?
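
A sketch of how such tri-state return codes could drive a walker, again with a hypothetical array-based walk_ranges_tri() rather than the real resource functions. RES_STOP means "found it, stop", RES_OK means "keep going", and RES_ERR aborts the walk; the caller reads RES_STOP as success and a final RES_OK as "nothing suitable found".

```c
#include <assert.h>

#define RES_OK		0	/* keep walking */
#define RES_ERR		1	/* abort the walk with an error */
#define RES_STOP	2	/* done, stop walking */

struct range {
	unsigned long start, end;	/* inclusive bounds */
};

static int walk_ranges_tri(const struct range *r, int n, void *arg,
			   int (*func)(unsigned long, unsigned long, void *))
{
	for (int i = 0; i < n; i++) {
		int ret = func(r[i].start, r[i].end, arg);

		if (ret != RES_OK)
			return ret;	/* RES_ERR or RES_STOP */
	}
	return RES_OK;	/* every range visited, nobody asked to stop */
}

/* Example callback: stop at the first range of at least *arg bytes. */
static int fits_tri(unsigned long start, unsigned long end, void *arg)
{
	unsigned long need = *(unsigned long *)arg;

	return end - start + 1 >= need ? RES_STOP : RES_OK;
}
```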

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
@ 2014-06-06  5:45             ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 214+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-06-06  5:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mtk.manpages, WANG Chao, Linux Kernel, kexec, Eric W. Biederman,
	H. Peter Anvin, mjg59, Greg Kroah-Hartman, Borislav Petkov,
	Jiri Kosina, dyoung, bhe, Andrew Morton, Linux API

On 06/05/2014 04:04 PM, Vivek Goyal wrote:
> On Wed, Jun 04, 2014 at 09:39:10PM +0200, Michael Kerrisk wrote:
>> Vivek,
>>
>> As per Documentation/SubmitChecklist, please CC linux-api@ on patches
>> that change the ABI/API. See
>> https://www.kernel.org/doc/man-pages/linux-api-ml.html.
> 
> Hi Michael,
> 
> Sorry, I did not notice that. I will CC linux-api@ on the patches that
> introduce the new system call in the next version of the series.
> 
>>
>> Also, is there some draft man page for this new system call?
> 
> No, there is none yet. In fact, I don't see a man page for the old
> kexec_load() system call either.

Is this not what you mean:
http://man7.org/linux/man-pages/man2/kexec_load.2.html
?
(It probably could be improved...)

> Do you want me to write a man page for this new syscall?

These days, that's considered a desirable accompaniment to new 
syscall proposals.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration
  2014-06-05 15:22         ` Vivek Goyal
@ 2014-06-06  6:34           ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-06  6:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On 06/05/14 at 11:22am, Vivek Goyal wrote:
> On Thu, Jun 05, 2014 at 11:16:39AM -0400, Vivek Goyal wrote:
> > On Thu, Jun 05, 2014 at 05:56:03PM +0800, WANG Chao wrote:
> > 
> > [..]
> > > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > > index c435c5f..a3044e6 100644
> > > > --- a/kernel/kexec.c
> > > > +++ b/kernel/kexec.c
> > > > @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
> > > >  }
> > > >  #endif
> > > >  
> > > > +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > +		const char __user *, cmdline_ptr, unsigned long,
> > > > +		cmdline_len, unsigned long, flags)
> > > 
> > > initrd is optional for system boot.
> > > 
> > > How about using int *kernel_fd and int *initrd_fd as the argument? Then
> > > if I don't need initrd, in userspace I can do this:
> > 
> > Hi Chao,
> > 
> > I really am not too keen on converting plain int fd arguments into pointers.
> > 
> > Since fd is an int and all valid values are greater than or equal to 0,
> > how about using -1 to denote that initrd is not being loaded?
> > 
> > This does create one little anomaly: for all negative values we
> > will return -EBADF except -1, which we special-case.
> 
> Or we could do this:
> 
> - Define an extra flag, say KEXEC_FILE_NO_INITRAMFS, which the user sets
>   when a valid initrd fd is not being passed. If the kernel sees that flag,
>   it will not try to interpret the value passed in initrd_fd at all.
> 
> I think I like this better.

Yep, it makes more sense to me as well. Please add this new flag.
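
A sketch of how the proposed flag could gate the initrd argument. want_initrd() is hypothetical, KEXEC_FILE_UNLOAD = 0x1 is assumed from the 0x3 mask discussed earlier in the thread, and the value 0x4 for KEXEC_FILE_NO_INITRAMFS is an assumption (the next free bit after the two existing flags).

```c
#include <assert.h>
#include <errno.h>

#define KEXEC_FILE_UNLOAD	0x00000001	/* assumed value */
#define KEXEC_FILE_ON_CRASH	0x00000002
#define KEXEC_FILE_NO_INITRAMFS	0x00000004	/* assumed value */
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
				 KEXEC_FILE_NO_INITRAMFS)

/* Returns 1 if an initrd should be loaded, 0 if not, or -errno. */
static int want_initrd(int initrd_fd, unsigned long flags)
{
	if (flags != (flags & KEXEC_FILE_FLAGS))
		return -EINVAL;
	if (flags & KEXEC_FILE_NO_INITRAMFS)
		return 0;	/* initrd_fd is ignored entirely */
	if (initrd_fd < 0)
		return -EBADF;
	return 1;
}
```

With the flag, no fd value needs special-casing: -1 and every other negative fd are treated alike when an initrd is expected.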

Thanks
WANG Chao


* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-06  6:56     ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-06  6:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On 06/03/14 at 09:06am, Vivek Goyal wrote:
> The previous patch provided the interface definition and this patch provides
> the implementation of the new syscall.
> 
> Previously the segment list was prepared in user space. Now user space just
> passes the kernel fd, initrd fd and command line, and the kernel creates the
> segment list internally.
> 
> This patch contains the generic part of the code. Actual segment preparation
> and loading is done by the arch- and image-specific loader, which comes in the
> next patch.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

[..]
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a3044e6..1ad4d60 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c

> +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> +		int initrd_fd, const char __user *cmdline_ptr,
> +		unsigned long cmdline_len)
> +{
> +	int ret = 0;
> +	void *ldata;
> +
> +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> +					&image->kernel_buf_len);
> +	if (ret)
> +		return ret;
> +
> +	/* Call arch image probe handlers */
> +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> +						image->kernel_buf_len);
> +
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> +					&image->initrd_buf_len);
> +	if (ret)
> +		goto out;
> +
> +	image->cmdline_buf = vzalloc(cmdline_len);

You should validate the upper and lower bounds of cmdline_len before
calling vzalloc(). When cmdline_len is 0 or too large, a vmalloc failure
message will be printed.

> +	if (!image->cmdline_buf)
> +		goto out;
> +
> +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> +	if (ret) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	image->cmdline_buf_len = cmdline_len;
> +
> +	/* command line should be a string with last byte null */
> +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> +		ret = -EINVAL;
> +		goto out;
> +	}

Given that the command line is optional, as is the initrd, I think the
above chunk of code needs some updating for the case where cmdline_len is 0
or cmdline_buf points to NULL.

Thanks
WANG Chao

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-05 15:01       ` Vivek Goyal
@ 2014-06-06  7:37         ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-06  7:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/05/14 at 11:01am, Vivek Goyal wrote:
> On Thu, Jun 05, 2014 at 04:31:34PM +0800, Dave Young wrote:
> 
> [..]
> > > +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
> > > +			info.command_line_len, info.kexec_flags);
> > 
> > Vivek,
> > 
> > I tried your patch on my uefi test machine, but kexec load fails like below:
> > 
> > [root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
> > Could not find a free area of memory of 0xa000 bytes ...
> 
> Hi Dave,
> 
> I think this message is coming from kexec-tools' old loading path, and
> somehow the new path did not even kick in. I tried the above and got
> -EBADF, as I did not pass an initrd. Can you run gdb on kexec to see if
> you are getting to the syscall, or run strace?

It seems I cannot reproduce the locate_hole failure anymore, but I'm sure it
happens before the new syscall is called.

This time I got -ENOEXEC. It's caused by CONFIG_EFI_MIXED=y: with EFI_MIXED,
a 64-bit kernel can run on 32-bit EFI firmware, so XLF_CAN_BE_LOADED_ABOVE_4G
is not set and the bzImage probe fails with -ENOEXEC.

> 
> > 
> > Another issue is that the syscall should allow loading only a kernel, without an initrd
> 
> Agreed. Currently my code is not handling it. I am thinking about how to
> make passing the initrd fd optional.
> 
> > 
> > and
> > cmdline, since the kernel can mount root and embed a cmdline in itself.
> 
> Passing command line is already optional. I tried it and kexec loaded
> successfully.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-06  2:11         ` Borislav Petkov
@ 2014-06-06 18:02           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-06 18:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Fri, Jun 06, 2014 at 04:11:33AM +0200, Borislav Petkov wrote:

[..]
> > > Might wanna define pr_fmt when using the pr_* things for the first time
> > > in this file.
> > 
> > Hmm....
> > 
> > I see that printk.h already provides a definition if pr_fmt is not
> > defined. So that means I shouldn't have to define pr_fmt() before I
> > use pr_*?
> > 
> > #ifndef pr_fmt
> > #define pr_fmt(fmt) fmt
> > #endif
> 
> Yep, so you could do
> 
> #undef pr_fmt
> #define pr_fmt(fmt) "kexec: " fmt
> 
> or you can do the standard
> 
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> 
> Just look around the tree for examples, there's plenty.

Ok, got it. So this will allow me to prefix the subsystem name to the
messages. It is a good idea. Will do.
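
The difference between the two pr_fmt variants above can be checked in userspace: the prefix is joined to the caller's format string by string-literal concatenation, so `fmt` has to appear in the expansion or the caller's format is dropped. In a kernel .c file the define normally goes before the first #include, so printk.h's `#ifndef pr_fmt` default never applies; otherwise the `#undef` form is needed. `pr_info_buf()` is a hypothetical stand-in for `pr_info()` that formats into a buffer so the result can be inspected.

```c
#include <stdio.h>
#include <string.h>

/*
 * Prefix every message from this file. The trailing "fmt" matters:
 * without it the caller's format string would be thrown away.
 */
#define pr_fmt(fmt) "kexec: " fmt

/*
 * Hypothetical userspace stand-in for pr_info(): it writes into a
 * caller-supplied buffer instead of the kernel log.
 */
#define pr_info_buf(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```

For example, `pr_info_buf(buf, sizeof(buf), "loaded %d segments\n", 3)` leaves `"kexec: loaded 3 segments\n"` in `buf`.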

[..]
> > > > +	kbuf->buf_align = max(buf_align, PAGE_SIZE);
> > > > +	kbuf->buf_min = buf_min;
> > > > +	kbuf->buf_max = buf_max;
> > > > +	kbuf->top_down = top_down;
> > > > +
> > > > +	/* Walk the RAM ranges and allocate a suitable range for the buffer */
> > > > +	walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> > > > +
> > > > +	/*
> > > > +	 * If range could be found successfully, it would have incremented
> > > > +	 * the nr_segments value.
> > > > +	 */
> > > > +	new_nr_segments = image->nr_segments;
> > > > +
> > > > +	/* A suitable memory range could not be found for buffer */
> > > > +	if (new_nr_segments == nr_segments)
> > > > +		return -EADDRNOTAVAIL;
> > > 
> > > Right, why don't you check walk_system_ram_res's retval? If it is != 0,
> > > i.e. walk_ram_range_callback gives a 1 on "success", you can drop this
> > > way of checking whether finding a new range succeeded.
> > 
> > In the last version, when I had ELF header support, I was checking for return
> > code 1 in one place and you did not like that.
> > 
> > Anyway, I think the problem here is that the walk_* variants use the
> > return code of the called function to decide whether to continue looping
> > or not. I think these are two independent activities. Pass a boolean
> > to the called function, which the callee sets to false if it wants to
> > stop the loop.
> > 
> > That way, the callee can pass back both errors and success without having
> > to worry about the loop. The callee can return 0 to represent success and
> > a negative error code to represent an error.
> 
> But why? It should be caller's responsibility to deal with the errors.
> If it encounters one, it either decides to stop looping or not.

There are cases where there is no error but looping still needs to stop.

For example, suppose you are looking for a memory range of size x
between addresses A and B. Assume there are 20 System RAM entries
between A and B, and you find a suitable range in the very first call.
The called function has succeeded and does not want to be called again.

But when it returns success ("0"), the walk_* functions will keep calling
it with the rest of the overlapping ranges. That seems pretty wasteful,
and the called function will have to keep some state telling it to
ignore further calls.

Now, to stop looping we can't return an error, because that return code
will be passed back to the function that called walk_*, and that function
will think an error happened when it actually did not.

So to me it makes sense to decouple two things: the error code and when
to stop looping.
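
A userspace sketch of the decoupling described above: the callback's return value carries only an error code, and a separate `bool *stop` ends the walk early once a hole is found. `walk_ranges()`, `find_hole_cb()`, `struct ram_range` and `struct hole_req` are hypothetical stand-ins (the request loosely mirrors the kexec_buf fields in the quoted hunk), not the kernel's walk_system_ram_res() API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one System RAM resource (end inclusive). */
struct ram_range {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical search request, loosely modelled on struct kexec_buf. */
struct hole_req {
	unsigned long size;
	unsigned long align;	/* power of two */
	unsigned long min, max;	/* search window, inclusive */
	unsigned long found;	/* out: start of the hole */
	bool have_hole;
	int calls;		/* number of callback invocations */
};

/* Return value is an error code only; *stop asks the walker to quit. */
typedef int (*range_cb)(const struct ram_range *r, void *arg, bool *stop);

static int walk_ranges(const struct ram_range *r, size_t n, void *arg,
		       range_cb fn)
{
	size_t i;

	for (i = 0; i < n; i++) {
		bool stop = false;
		int ret = fn(&r[i], arg, &stop);

		if (ret)
			return ret;	/* genuine error, passed up unchanged */
		if (stop)
			break;		/* success, skip the remaining ranges */
	}
	return 0;
}

static int find_hole_cb(const struct ram_range *r, void *arg, bool *stop)
{
	struct hole_req *req = arg;
	unsigned long start, end;

	req->calls++;
	if (req->size == 0 || req->align == 0)
		return -EINVAL;

	/* Clamp the range to the search window, then align the start up. */
	start = r->start > req->min ? r->start : req->min;
	end = r->end < req->max ? r->end : req->max;
	start = (start + req->align - 1) & ~(req->align - 1);
	if (start > end || end - start + 1 < req->size)
		return 0;	/* too small: keep walking */

	req->found = start;
	req->have_hole = true;
	*stop = true;		/* done, no error */
	return 0;
}
```

The caller still has to look at `have_hole` to tell "nothing fit" apart from "found", and would map the former to -EADDRNOTAVAIL itself.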

> 
> In any case, you don't need a second bool arg to pass around.

I would say give it some more thought. It makes dealing with errors
easy.

> 
> If you want to make it more explicit, you could do
> 
> #define RES_OK		0
> #define RES_ERR		1
> #define RES_STOP	2

You are saying that the callback function should return one of these to
the walk_* functions? But then we lose the actual error code, which should
be passed back to the function that called walk_*.
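
For comparison, a sketch of the return-value convention behind the earlier suggestion to check walk_system_ram_res()'s return value: the walker stops on any nonzero callback value and hands it back unchanged. Because "found" is a positive value (1) and errors are negative errno values, the caller can still tell them apart without losing the error code. All names here are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical range type, end inclusive. */
struct span {
	unsigned long start;
	unsigned long end;
};

/*
 * Callback convention: 0 = keep walking, 1 = found (stop),
 * negative errno = error (stop).
 */
typedef int (*span_cb)(const struct span *s, void *arg);

/* Returns whatever nonzero value stopped the walk, or 0. */
static int walk_spans(const struct span *s, size_t n, void *arg, span_cb fn)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = fn(&s[i], arg);
		if (ret)
			break;
	}
	return ret;
}

/* Map the three outcomes to 0 or a negative errno for the caller. */
static int find_span(const struct span *s, size_t n, void *arg, span_cb fn)
{
	int ret = walk_spans(s, n, arg, fn);

	if (ret < 0)
		return ret;			/* real error from the callback */
	return ret == 1 ? 0 : -EADDRNOTAVAIL;	/* 0 means nothing fit */
}

/* Example callback: stop at the first span of at least *arg bytes. */
static int big_enough_cb(const struct span *s, void *arg)
{
	unsigned long want = *(unsigned long *)arg;

	if (want == 0)
		return -EINVAL;
	return s->end - s->start + 1 >= want ? 1 : 0;
}
```

The trade-off is the magic value 1 that walker users and callbacks must agree on; the `bool *stop` variant avoids it at the cost of an extra argument.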

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
@ 2014-06-06 18:04               ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-06 18:04 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: WANG Chao, Linux Kernel, kexec, Eric W. Biederman,
	H. Peter Anvin, mjg59, Greg Kroah-Hartman, Borislav Petkov,
	Jiri Kosina, dyoung, bhe, Andrew Morton, Linux API

On Fri, Jun 06, 2014 at 07:45:17AM +0200, Michael Kerrisk (man-pages) wrote:
> On 06/05/2014 04:04 PM, Vivek Goyal wrote:
> > On Wed, Jun 04, 2014 at 09:39:10PM +0200, Michael Kerrisk wrote:
> >> Vivek,
> >>
> >> As per Documentation/SubmitChecklist, please CC linux-api@ on patches
> >> that change the ABI/API. See
> >> https://www.kernel.org/doc/man-pages/linux-api-ml.html.
> > 
> > Hi Michael,
> > 
> > Sorry, I did not notice that. I will CC linux-api@ in the next version,
> > on the patches which introduce the new system call.
> > 
> >>
> >> Also, is there some draft man page for this new system call?
> > 
> > No, there is none yet. In fact I don't see a man page for the old kexec
> > system call, kexec_load(), either.
> 
> Is this not what you are meaning:
> http://man7.org/linux/man-pages/man2/kexec_load.2.html
> ?
> (It probably could be improved...)

Yep. I missed it.

> 
> > Do you want me to write man page for this new syscall?
> 
> These days, that's considered a desirable accompaniment to new 
> syscall proposals.

Ok, I will write one for kexec_file_load() too. Or extend the existing
one to include both syscalls.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-06  6:56     ` WANG Chao
@ 2014-06-06 18:19       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-06 18:19 UTC (permalink / raw)
  To: WANG Chao
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	dyoung, bhe, akpm

On Fri, Jun 06, 2014 at 02:56:05PM +0800, WANG Chao wrote:
> On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > The previous patch provided the interface definition and this patch provides
> > the implementation of the new syscall.
> > 
> > Previously the segment list was prepared in user space. Now user space just
> > passes the kernel fd, initrd fd and command line, and the kernel creates
> > the segment list internally.
> > 
> > This patch contains the generic part of the code. Actual segment preparation
> > and loading are done by the arch- and image-specific loader, which comes in
> > the next patch.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> [..]
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index a3044e6..1ad4d60 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> 
> > +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> > +		int initrd_fd, const char __user *cmdline_ptr,
> > +		unsigned long cmdline_len)
> > +{
> > +	int ret = 0;
> > +	void *ldata;
> > +
> > +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> > +					&image->kernel_buf_len);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Call arch image probe handlers */
> > +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> > +						image->kernel_buf_len);
> > +
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> > +					&image->initrd_buf_len);
> > +	if (ret)
> > +		goto out;
> > +
> > +	image->cmdline_buf = vzalloc(cmdline_len);
> 
> You should validate the upper and lower bounds of cmdline_len before
> calling vzalloc(). When cmdline_len is 0 or too large, a vmalloc failure
> message will be printed.

What's the upper limit for vzalloc()? I think if the size is too big to
allocate, vzalloc() should just return an error?

> 
> > +	if (!image->cmdline_buf)
> > +		goto out;
> > +
> > +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> > +	if (ret) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	image->cmdline_buf_len = cmdline_len;
> > +
> > +	/* command line should be a string with last byte null */
> > +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> 
> Given that the command line is optional, as is the initrd, I think the
> above chunk of code needs some updating for the case where cmdline_len is 0
> or cmdline_buf points to NULL.

I agree. I think all of this, vzalloc(), copy_from_user() etc., should be
called only if cmdline_len is non-zero. Will fix it.
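
A userspace sketch of the fix agreed on here: skip the allocation entirely when cmdline_len is 0, bound the length before allocating, and report an allocation failure as -ENOMEM (in the quoted hunk, `ret` appears to still be 0 when vzalloc() fails, so the function would report success). `calloc()`/`memcpy()` stand in for vzalloc()/copy_from_user(), a NULL source stands in for a faulting user pointer, and `copy_cmdline()` and `SKETCH_CMDLINE_MAX` are hypothetical names with an assumed cap.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_CMDLINE_MAX 2048	/* assumed cap, e.g. an arch COMMAND_LINE_SIZE */

/*
 * On success *out is NULL (no command line) or a NUL-terminated copy
 * of len bytes. On failure *out stays NULL and a negative errno is returned.
 */
static int copy_cmdline(const char *src, unsigned long len,
			char **out, unsigned long *out_len)
{
	char *buf;

	*out = NULL;
	*out_len = 0;
	if (len == 0)
		return 0;		/* command line is optional */
	if (len > SKETCH_CMDLINE_MAX)
		return -EINVAL;		/* refuse before allocating */
	if (!src)
		return -EFAULT;		/* stands in for a faulting user pointer */

	buf = calloc(1, len);
	if (!buf)
		return -ENOMEM;		/* do not fall through with ret == 0 */
	memcpy(buf, src, len);

	/* Command line must be a string with the last byte NUL. */
	if (buf[len - 1] != '\0') {
		free(buf);
		return -EINVAL;
	}
	*out = buf;
	*out_len = len;
	return 0;
}
```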

Thanks
Vivek


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-05 20:05     ` Borislav Petkov
@ 2014-06-06 19:51       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-06 19:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Thu, Jun 05, 2014 at 10:05:23PM +0200, Borislav Petkov wrote:

[..]
> > @@ -249,6 +254,7 @@ archclean:
> >  	$(Q)rm -rf $(objtree)/arch/x86_64
> >  	$(Q)$(MAKE) $(clean)=$(boot)
> >  	$(Q)$(MAKE) $(clean)=arch/x86/tools
> 
> ifeq ($(CONFIG_KEXEC),y)
> 	$(Q)$(MAKE) $(clean)=arch/x86/purgatory
> endif

Hmm, is it strictly required? I am wondering what happens if I build
a kernel with CONFIG_KEXEC=y, then set CONFIG_KEXEC=n and run "make clean".
I think I would still like any files in arch/x86/purgatory to be cleaned
even though CONFIG_KEXEC=n. Wouldn't you?

[..]
> > +ifeq ($(CONFIG_X86_64),y)
> > +KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -mcmodel=large -Os -fno-builtin -ffreestanding -c -MD
> > +else
> > +KBUILD_CFLAGS	:= -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -Os -fno-builtin -ffreestanding -c -MD -m32
> > +endif
> 
> Those variable assignments have a lot of duplication, let's simplify
> (diff ontop):

Thanks. This looks cleaner and also highlights the difference between
x86_64 and x86. Will change.

[..]
> > +	.section ".rodata"
> > +	.balign 4
> > +entry64_regs:
> > +rax:	.quad 0x00000000
> 
> Simply 0x0? Or am I missing something?

I think .quad 0x0 should work. Will use it.

[..]
> > +	sha256_final(&sctx, digest);
> > +
> > +	if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)
> 
> 	if (memcmp(...))
> 		return 1;
> 
> should be a bit cleaner.
> 

Ok. Will do.

[..]
> > +void purgatory(void)
> > +{
> > +	int ret;
> > +
> > +	ret = verify_sha256_digest();
> > +	if (ret) {
> > +		/* loop forever */
> > +		for (;;);
> 
> checkpatch bitches about this:
> 
> ERROR: trailing statements should be on next line
> #303: FILE: arch/x86/purgatory/purgatory.c:68:
> +               for (;;);

Ok, will move the semicolon to the next line.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06  7:37         ` Dave Young
@ 2014-06-06 20:04           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-06 20:04 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On Fri, Jun 06, 2014 at 03:37:48PM +0800, Dave Young wrote:
> On 06/05/14 at 11:01am, Vivek Goyal wrote:
> > On Thu, Jun 05, 2014 at 04:31:34PM +0800, Dave Young wrote:
> > 
> > [..]
> > > > +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
> > > > +			info.command_line_len, info.kexec_flags);
> > > 
> > > Vivek,
> > > 
> > > I tried your patch on my uefi test machine, but kexec load fails like below:
> > > 
> > > [root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
> > > Could not find a free area of memory of 0xa000 bytes ...
> > 
> > Hi Dave,
> > 
> > I think this message is coming from kexec-tools from old loading path. I
> > think somehow new path did not even kick in. I tried above and I got
> > -EBADF as I did not pass initrd. Can you run gdb on kexec and see if
> > you are getting to syscall or run strace.
> 
> Seems I can not reproduce the local hole fail issue but I'm sure it happens
> before the new syscall callback.
> 
> This time I got -ENOEXEC. It's caused by CONFIG_EFI_MIXED=y. In case EFI_MIXED
> 64bit kernel runs on 32bit efi firmware thus XLF_CAN_BE_LOADED_ABOVE_4G is not
> set thus bzImage probe will fail with NOEXEC.

Yep, the current bzImage loader only supports loading a 64-bit image that
can be loaded above 4G.

I am wondering how the user space implementation takes care of it. I guess
it falls back to the 32-bit implementation, where we use the 32-bit entry
and assume that the bzImage has to be below 4G.

We will have to do a similar thing in the kernel when the 32-bit loader
comes in: compile it in for the 64-bit kernel and let it handle the case
of a bzImage that cannot be loaded above 4G.
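For illustration, a hypothetical user-space C sketch of the kind of header check that makes a 64-bit-only loader reject an image with -ENOEXEC when XLF_CAN_BE_LOADED_ABOVE_4G is clear. The offsets and flag bits follow the x86 boot protocol setup header; probe_bzimage64() and get_le16() are illustrative names, not the kernel's code, and a real probe checks more fields.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* xloadflags bits from the x86 boot protocol. */
#define XLF_KERNEL_64              (1 << 0)
#define XLF_CAN_BE_LOADED_ABOVE_4G (1 << 1)

/* Setup-header offsets within the bzImage file. */
#define HDR_MAGIC_OFF   0x202	/* "HdrS" */
#define HDR_VERSION_OFF 0x206	/* boot protocol version, u16 */
#define HDR_XLF_OFF     0x236	/* xloadflags, u16, protocol 2.12+ */

/* Read a little-endian 16-bit field. */
static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Return 0 if the image is a 64-bit kernel loadable above 4G. */
static int probe_bzimage64(const unsigned char *buf, size_t len)
{
	uint16_t xlf;

	if (len < HDR_XLF_OFF + 2)
		return -ENOEXEC;
	if (memcmp(buf + HDR_MAGIC_OFF, "HdrS", 4) != 0)
		return -ENOEXEC;
	/* xloadflags only exists from protocol 2.12 on */
	if (get_le16(buf + HDR_VERSION_OFF) < 0x020C)
		return -ENOEXEC;

	xlf = get_le16(buf + HDR_XLF_OFF);
	if (!(xlf & XLF_KERNEL_64) || !(xlf & XLF_CAN_BE_LOADED_ABOVE_4G))
		return -ENOEXEC;
	return 0;
}
```

An image built with the flag cleared (as CONFIG_EFI_MIXED=y did at the time) fails the last test, which matches the -ENOEXEC Dave observed.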

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06  7:37         ` Dave Young
@ 2014-06-06 20:37           ` H. Peter Anvin
  -1 siblings, 0 replies; 214+ messages in thread
From: H. Peter Anvin @ 2014-06-06 20:37 UTC (permalink / raw)
  To: Dave Young, Vivek Goyal, Fleming, Matt
  Cc: linux-kernel, kexec, ebiederm, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/06/2014 12:37 AM, Dave Young wrote:
> On 06/05/14 at 11:01am, Vivek Goyal wrote:
>> On Thu, Jun 05, 2014 at 04:31:34PM +0800, Dave Young wrote:
>>
>> [..]
>>>> +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
>>>> +			info.command_line_len, info.kexec_flags);
>>>
>>> Vivek,
>>>
>>> I tried your patch on my uefi test machine, but kexec load fails like below:
>>>
>>> [root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
>>> Could not find a free area of memory of 0xa000 bytes ...
>>
>> Hi Dave,
>>
>> I think this message is coming from kexec-tools from old loading path. I
>> think somehow new path did not even kick in. I tried above and I got
>> -EBADF as I did not pass initrd. Can you run gdb on kexec and see if
>> you are getting to syscall or run strace.
> 
> Seems I can not reproduce the local hole fail issue but I'm sure it happens
> before the new syscall callback.
> 
> This time I got -ENOEXEC. It's caused by CONFIG_EFI_MIXED=y. In case EFI_MIXED
> 64bit kernel runs on 32bit efi firmware thus XLF_CAN_BE_LOADED_ABOVE_4G is not
> set thus bzImage probe will fail with NOEXEC.
> 

OK... this is seriously problematic.

#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64) && \
	!defined(CONFIG_EFI_MIXED)
   /* kernel/boot_param/ramdisk could be loaded above 4g */
# define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
#else
# define XLF1 0
#endif

The fact that even compiling with CONFIG_EFI_MIXED disables
XLF_CAN_BE_LOADED_ABOVE_4G is really not going to fly.  We should expect
CONFIG_EFI_MIXED to be the norm, but *also* should expect that there is
a legitimate need to load above 4G.

Matt, could you explain why this is necessary?  We need to figure out a
way around this.

My thinking is that disabling this flag is unnecessary, since a 32-bit
EFI loader should not load above the 4G mark anyway, but if I'm confused
and there is a more fundamental requirement, then we need to consider
that more carefully.

	-hpa
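A minimal sketch of what undoing the change could look like, assuming the fix simply drops the !defined(CONFIG_EFI_MIXED) term. The CONFIG_* defines at the top simulate Kconfig output so the fragment compiles standalone, and the handover logic is simplified (it assumes the EFI stub is built in); this is not a copy of header.S.

```c
/* Illustrative stand-ins for Kconfig output (assumption). */
#define CONFIG_RELOCATABLE 1
#define CONFIG_X86_64 1
#define CONFIG_EFI_STUB 1
#define CONFIG_EFI_MIXED 1

/* xloadflags bits from the x86 boot protocol. */
#define XLF_KERNEL_64              (1 << 0)
#define XLF_CAN_BE_LOADED_ABOVE_4G (1 << 1)
#define XLF_EFI_HANDOVER_32        (1 << 2)
#define XLF_EFI_HANDOVER_64        (1 << 3)

#ifdef CONFIG_X86_64
# define XLF0 XLF_KERNEL_64
#else
# define XLF0 0
#endif

/* No !defined(CONFIG_EFI_MIXED) term: EFI_MIXED no longer hides
 * the above-4G capability from 64-bit loaders. */
#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64)
   /* kernel/boot_param/ramdisk could be loaded above 4g */
# define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
#else
# define XLF1 0
#endif

/* Simplified: advertise both handover entry points under EFI_MIXED. */
#if defined(CONFIG_EFI_STUB) && defined(CONFIG_EFI_MIXED)
# define XLF23 (XLF_EFI_HANDOVER_32 | XLF_EFI_HANDOVER_64)
#elif defined(CONFIG_EFI_STUB) && defined(CONFIG_X86_64)
# define XLF23 XLF_EFI_HANDOVER_64
#elif defined(CONFIG_EFI_STUB)
# define XLF23 XLF_EFI_HANDOVER_32
#else
# define XLF23 0
#endif

static const unsigned int xloadflags = XLF0 | XLF1 | XLF23;
```

With CONFIG_EFI_MIXED=y the image then advertises both EFI handover entry points and still permits loading above 4G. A 32-bit EFI loader would not place it there anyway.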


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06 20:37           ` H. Peter Anvin
@ 2014-06-06 20:58             ` Matt Fleming
  -1 siblings, 0 replies; 214+ messages in thread
From: Matt Fleming @ 2014-06-06 20:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Young, Vivek Goyal, linux-kernel, kexec, ebiederm, mjg59,
	greg, bp, jkosina, chaowang, bhe, akpm

On 6 June 2014 21:37, H. Peter Anvin <hpa@zytor.com> wrote:
>
> OK... this is seriously problematic.
>
> #if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64) && \
>         !defined(CONFIG_EFI_MIXED)
>    /* kernel/boot_param/ramdisk could be loaded above 4g */
> # define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
> #else
> # define XLF1 0
> #endif
>
> The fact that even compiling with CONFIG_EFI_MIXED disables
> XLF_CAN_BE_LOADED_ABOVE_4G is really not going to fly.  We should expect
> CONFIG_EFI_MIXED to be the norm, but *also* should expect that there is
> a legitimate need to load above 4G.
>
> Matt, could you explain why this is necessary?  We need to figure out a
> way around this.
>
> My thinking is that disabling this flag is unnecessary, since a 32-bit
> EFI loader should not load above the 4G mark anyway, but if I'm confused
> and there is a more fundamental requirement, then we need to consider
> that more carefully.

No, your comments are absolutely correct. I was the one who was
confused. I found this in the git history,

commit 7d453eee36ae
Author: Matt Fleming <matt.fleming@intel.com>
Date:   Fri Jan 10 18:52:06 2014 +0000

    x86/efi: Wire up CONFIG_EFI_MIXED

    Add the Kconfig option and bump the kernel header version so that boot
    loaders can check whether the handover code is available if they want.

    The xloadflags field in the bzImage header is also updated to reflect
    that the kernel supports both entry points by setting both of
    XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 when CONFIG_EFI_MIXED=y.
    XLF_CAN_BE_LOADED_ABOVE_4G is disabled so that the kernel text is
    guaranteed to be addressable with 32-bits.

As you've pointed out above, a 32-bit loader is never going to load
the kernel above 4G, so we don't need to disable it.

What's the best way to fix this up? Just undo the change from the above commit?

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06 20:58             ` Matt Fleming
@ 2014-06-06 21:00               ` H. Peter Anvin
  -1 siblings, 0 replies; 214+ messages in thread
From: H. Peter Anvin @ 2014-06-06 21:00 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Dave Young, Vivek Goyal, linux-kernel, kexec, ebiederm, mjg59,
	greg, bp, jkosina, chaowang, bhe, akpm

On 06/06/2014 01:58 PM, Matt Fleming wrote:
> On 6 June 2014 21:37, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> OK... this is seriously problematic.
>>
>> #if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64) && \
>>         !defined(CONFIG_EFI_MIXED)
>>    /* kernel/boot_param/ramdisk could be loaded above 4g */
>> # define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
>> #else
>> # define XLF1 0
>> #endif
>>
>> The fact that even compiling with CONFIG_EFI_MIXED disables
>> XLF_CAN_BE_LOADED_ABOVE_4G is really not going to fly.  We should expect
>> CONFIG_EFI_MIXED to be the norm, but *also* should expect that there is
>> a legitimate need to load above 4G.
>>
>> Matt, could you explain why this is necessary?  We need to figure out a
>> way around this.
>>
>> My thinking is that disabling this flag is unnecessary, since a 32-bit
>> EFI loader should not load above the 4G mark anyway, but if I'm confused
>> and there is a more fundamental requirement, then we need to consider
>> that more carefully.
> 
> No, your comments are absolutely correct. I was the one who was
> confused. I found this in the git history,
> 
> commit 7d453eee36ae
> Author: Matt Fleming <matt.fleming@intel.com>
> Date:   Fri Jan 10 18:52:06 2014 +0000
> 
>     x86/efi: Wire up CONFIG_EFI_MIXED
> 
>     Add the Kconfig option and bump the kernel header version so that boot
>     loaders can check whether the handover code is available if they want.
> 
>     The xloadflags field in the bzImage header is also updated to reflect
>     that the kernel supports both entry points by setting both of
>     XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 when CONFIG_EFI_MIXED=y.
>     XLF_CAN_BE_LOADED_ABOVE_4G is disabled so that the kernel text is
>     guaranteed to be addressable with 32-bits.
> 
> As you've pointed out above, a 32-bit loader is never going to load
> the kernel above 4G, so we don't need to disable it.
> 
> What's the best way to fix this up? Just undo the change from the above commit?
> 

Yes, presumably (as a separate patch since the actual commit is quite
large.)  The patch needs to have a good description why the original
patch was wrong.

	-hpa


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06 21:00               ` H. Peter Anvin
@ 2014-06-06 21:02                 ` Matt Fleming
  -1 siblings, 0 replies; 214+ messages in thread
From: Matt Fleming @ 2014-06-06 21:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Young, Vivek Goyal, linux-kernel, kexec, ebiederm, mjg59,
	greg, bp, jkosina, chaowang, bhe, akpm

On 6 June 2014 22:00, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Yes, presumably (as a separate patch since the actual commit is quite
> large.)  The patch needs to have a good description why the original
> patch was wrong.

Right. I'll take a look at this in the morning.

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-06 20:04           ` Vivek Goyal
@ 2014-06-09  1:57             ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-09  1:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/06/14 at 04:04pm, Vivek Goyal wrote:
> On Fri, Jun 06, 2014 at 03:37:48PM +0800, Dave Young wrote:
> > On 06/05/14 at 11:01am, Vivek Goyal wrote:
> > > On Thu, Jun 05, 2014 at 04:31:34PM +0800, Dave Young wrote:
> > > 
> > > [..]
> > > > > +	ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line,
> > > > > +			info.command_line_len, info.kexec_flags);
> > > > 
> > > > Vivek,
> > > > 
> > > > I tried your patch on my uefi test machine, but kexec load fails like below:
> > > > 
> > > > [root@localhost ~]# kexec -l /boot/vmlinuz-3.15.0-rc8+ --use-kexec2-syscall
> > > > Could not find a free area of memory of 0xa000 bytes ...
> > > 
> > > Hi Dave,
> > > 
> > > I think this message is coming from kexec-tools from old loading path. I
> > > think somehow new path did not even kick in. I tried above and I got
> > > -EBADF as I did not pass initrd. Can you run gdb on kexec and see if
> > > you are getting to syscall or run strace.
> > 
> > Seems I can not reproduce the local hole fail issue but I'm sure it happens
> > before the new syscall callback.
> > 
> > This time I got -ENOEXEC. It's caused by CONFIG_EFI_MIXED=y. In case EFI_MIXED
> > 64bit kernel runs on 32bit efi firmware thus XLF_CAN_BE_LOADED_ABOVE_4G is not
> > set thus bzImage probe will fail with NOEXEC.
> 
> Yep, current bzImage loader only supports loading 64bit image which can
> be loaded above 4G.
> 
> I am wondering how user space implementation is taking care of it. I guess
> we are falling back to 32bit implementation where we use 32bit entry and
> assume that bzImage has to be below 4G.
> 
> We will have to do similar thing in kernel when 32bit loader comes in.
> Compile that in for 64bit kernel and let it handle the case of bzImage
> not being loadable above 4G.

Vivek, I think it is OK for the current implementation to handle only
bzImages that have XLF_CAN_BE_LOADED_ABOVE_4G set.

Matt has sent a patch to undo the XLF_CAN_BE_LOADED_ABOVE_4G change from
the EFI_MIXED commit, since a 32-bit loader never loads the kernel above
4G. So there is no need to worry about this issue any more.

Thanks
Dave 

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-06 18:19       ` Vivek Goyal
@ 2014-06-09  2:11         ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-09  2:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, mjg59, bhe, jkosina, greg, kexec, linux-kernel, bp,
	ebiederm, hpa, akpm

On 06/06/14 at 02:19pm, Vivek Goyal wrote:
> On Fri, Jun 06, 2014 at 02:56:05PM +0800, WANG Chao wrote:
> > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > Previous patch provided the interface definition and this patch provides
> > > implementation of new syscall.
> > > 
> > > Previously segment list was prepared in user space. Now user space just
> > > passes kernel fd, initrd fd and command line and kernel will create a
> > > segment list internally.
> > > 
> > > This patch contains generic part of the code. Actual segment preparation
> > > and loading is done by arch and image specific loader. Which comes in
> > > next patch.
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > 
> > [..]
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index a3044e6..1ad4d60 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > 
> > > +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> > > +		int initrd_fd, const char __user *cmdline_ptr,
> > > +		unsigned long cmdline_len)
> > > +{
> > > +	int ret = 0;
> > > +	void *ldata;
> > > +
> > > +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> > > +					&image->kernel_buf_len);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	/* Call arch image probe handlers */
> > > +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> > > +						image->kernel_buf_len);
> > > +
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> > > +					&image->initrd_buf_len);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	image->cmdline_buf = vzalloc(cmdline_len);
> > 
> > You should validate the upper/lower boundary of cmdline_len before
> > calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
> > message would be fired.
> 
> What's the upper length of vzalloc(). I think if it is too big to alloc,
> then vzalloc() should return me an error?

In __vmalloc_node_range():
        if (!size || (size >> PAGE_SHIFT) > totalram_pages)
                goto fail;

So I think checking only for cmdline_len == 0 is enough.

For the upper bound, shouldn't it be limited to COMMAND_LINE_SIZE?
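A hypothetical user-space model of the cmdline handling being discussed: skip everything when cmdline_len is 0, reject lengths above COMMAND_LINE_SIZE, then copy and check the trailing NUL. calloc()/memcpy() stand in for vzalloc()/copy_from_user(), and 2048 for COMMAND_LINE_SIZE is an assumption (it is the x86 value). The sketch also returns -ENOMEM on allocation failure; in the quoted hunk, ret is still 0 when vzalloc() fails, so "goto out" would report success.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: x86 value; other architectures differ. */
#define COMMAND_LINE_SIZE 2048

struct cmdline_copy {
	char *buf;
	size_t len;
};

/* Copy an optional, NUL-terminated command line of bounded length. */
static int copy_cmdline(struct cmdline_copy *out, const char *src, size_t len)
{
	out->buf = NULL;
	out->len = 0;

	if (len == 0)		/* command line is optional */
		return 0;
	if (len > COMMAND_LINE_SIZE)
		return -EINVAL;

	out->buf = calloc(1, len);
	if (!out->buf)
		return -ENOMEM;
	memcpy(out->buf, src, len);

	/* must be a string whose last byte is NUL */
	if (out->buf[len - 1] != '\0') {
		free(out->buf);
		out->buf = NULL;
		return -EINVAL;
	}
	out->len = len;
	return 0;
}
```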


> > > +	if (!image->cmdline_buf)
> > > +		goto out;
> > > +
> > > +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> > > +	if (ret) {
> > > +		ret = -EFAULT;
> > > +		goto out;
> > > +	}
> > > +
> > > +	image->cmdline_buf_len = cmdline_len;
> > > +
> > > +	/* command line should be a string with last byte null */
> > > +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > 
> > Given the fact that command line is optional as well as initrd, I think
> > above chunk of code needs to update a bit for the case cmdline_len is 0
> > or cmdline_buf is pointing NULL?
> 
> I agree. I think all this vzalloc(), copy_from_user() etc should be called
> only if cmdline_len is non-zero. Will fix it.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
@ 2014-06-09  2:11         ` Dave Young
  0 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-09  2:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mjg59, bhe, greg, hpa, kexec, linux-kernel, bp, ebiederm,
	jkosina, akpm, WANG Chao

On 06/06/14 at 02:19pm, Vivek Goyal wrote:
> On Fri, Jun 06, 2014 at 02:56:05PM +0800, WANG Chao wrote:
> > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > Previous patch provided the interface definition and this patch prvides
> > > implementation of new syscall.
> > > 
> > > Previously segment list was prepared in user space. Now user space just
> > > passes kernel fd, initrd fd and command line and kernel will create a
> > > segment list internally.
> > > 
> > > This patch contains generic part of the code. Actual segment preparation
> > > and loading is done by arch and image specific loader. Which comes in
> > > next patch.
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > 
> > [..]
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index a3044e6..1ad4d60 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > 
> > > +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> > > +		int initrd_fd, const char __user *cmdline_ptr,
> > > +		unsigned long cmdline_len)
> > > +{
> > > +	int ret = 0;
> > > +	void *ldata;
> > > +
> > > +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> > > +					&image->kernel_buf_len);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	/* Call arch image probe handlers */
> > > +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> > > +						image->kernel_buf_len);
> > > +
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> > > +					&image->initrd_buf_len);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	image->cmdline_buf = vzalloc(cmdline_len);
> > 
> > You should validate the upper/lower boundary of cmdline_len before
> > calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
> > message would be fired.
> 
> What's the upper length of vzalloc(). I think if it is too big to alloc,
> then vzalloc() should return me an error?

function __vmalloc_node_range:
        if (!size || (size >> PAGE_SHIFT) > totalram_pages)
                goto fail;

So I think only checking cmdline_len == 0 is enough.

For the upper length shouldn't it be stripped to COMMAND_LINE_SIZE?


> > > +	if (!image->cmdline_buf)
> > > +		goto out;
> > > +
> > > +	ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> > > +	if (ret) {
> > > +		ret = -EFAULT;
> > > +		goto out;
> > > +	}
> > > +
> > > +	image->cmdline_buf_len = cmdline_len;
> > > +
> > > +	/* command line should be a string with last byte null */
> > > +	if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > 
> > Given the fact that command line is optional as well as initrd, I think
> > above chunk of code needs to update a bit for the case cmdline_len is 0
> > or cmdline_buf is pointing NULL?
> 
> I agree. I think all this vzalloc(), copy_from_user() etc should be called
> only if cmdline_len is non-zero. Will fix it.
> 
> Thanks
> Vivek
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

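A minimal user-space sketch of the fix agreed above, assuming calloc() as a stand-in for vzalloc() and memcpy() for copy_from_user(); struct cmdline_copy and copy_cmdline() are hypothetical names. When cmdline_len is 0 nothing is allocated or copied; otherwise the last byte must be NUL. Unlike the quoted hunk, where a failed vzalloc() jumps to "out" with ret still 0, the sketch returns -ENOMEM in that case. The upper bound on cmdline_len is a separate check and is not shown here.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the copied command line. */
struct cmdline_copy {
	char *buf;
	unsigned long len;
};

static int copy_cmdline(struct cmdline_copy *out, const char *src,
			unsigned long cmdline_len)
{
	out->buf = NULL;
	out->len = 0;

	/* The command line is optional: nothing to allocate or copy. */
	if (!cmdline_len)
		return 0;

	out->buf = calloc(1, cmdline_len);	/* stand-in for vzalloc() */
	if (!out->buf)
		return -ENOMEM;

	memcpy(out->buf, src, cmdline_len);	/* stand-in for copy_from_user() */

	/* Command line must be a string whose last byte is NUL. */
	if (out->buf[cmdline_len - 1] != '\0') {
		free(out->buf);
		out->buf = NULL;
		return -EINVAL;
	}

	out->len = cmdline_len;
	return 0;
}
```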
^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-09  2:11         ` Dave Young
@ 2014-06-09  5:35           ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-09  5:35 UTC (permalink / raw)
  To: Dave Young
  Cc: Vivek Goyal, mjg59, bhe, jkosina, greg, kexec, linux-kernel, bp,
	ebiederm, hpa, akpm

On 06/09/14 at 10:11am, Dave Young wrote:
> On 06/06/14 at 02:19pm, Vivek Goyal wrote:
> > On Fri, Jun 06, 2014 at 02:56:05PM +0800, WANG Chao wrote:
> > > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > > Previous patch provided the interface definition and this patch provides
> > > > implementation of new syscall.
> > > > 
> > > > Previously segment list was prepared in user space. Now user space just
> > > > passes kernel fd, initrd fd and command line and kernel will create a
> > > > segment list internally.
> > > > 
> > > > This patch contains generic part of the code. Actual segment preparation
> > > > and loading is done by arch and image specific loader. Which comes in
> > > > next patch.
> > > > 
> > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > 
> > > [..]
> > > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > > index a3044e6..1ad4d60 100644
> > > > --- a/kernel/kexec.c
> > > > +++ b/kernel/kexec.c
> > > 
> > > > +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> > > > +		int initrd_fd, const char __user *cmdline_ptr,
> > > > +		unsigned long cmdline_len)
> > > > +{
> > > > +	int ret = 0;
> > > > +	void *ldata;
> > > > +
> > > > +	ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> > > > +					&image->kernel_buf_len);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	/* Call arch image probe handlers */
> > > > +	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> > > > +						image->kernel_buf_len);
> > > > +
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> > > > +					&image->initrd_buf_len);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	image->cmdline_buf = vzalloc(cmdline_len);
> > > 
> > > You should validate the upper/lower boundary of cmdline_len before
> > > calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
> > > message would be fired.
> > 
> > What's the upper length of vzalloc(). I think if it is too big to alloc,
> > then vzalloc() should return me an error?

When allocating too large, eg. vzalloc(-1), kernel spits:

[  457.407579] vmalloc: allocation failure: 18446744073709551606 bytes
[  457.413854] kexec: page allocation failure: order:0, mode:0x80d2
[  457.419853] CPU: 3 PID: 2058 Comm: kexec Not tainted 3.15.0-rc8-00096-g3dc85e8 #10
[  457.427408] Hardware name: Dell Inc. OptiPlex 760 /0M860N, BIOS A12 05/23/2011
[  457.435999]  ffffffff81a2f678 ffff8800bfb03db0 ffffffff816944fd 00000000000080d2
[  457.443422]  ffff8800bfb03e38 ffffffff8118a31a ffffffff81a2f678 ffff8800bfb03dd0
[  457.450851]  ffff880100000018 ffff8800bfb03e48 ffff8800bfb03de8 ffff8800bfb03e10
[  457.458278] Call Trace:
[  457.460731]  [<ffffffff816944fd>] dump_stack+0x45/0x56
[  457.465865]  [<ffffffff8118a31a>] warn_alloc_failed+0xda/0x140
[  457.471693]  [<ffffffff811f56d1>] ? kernel_read+0x41/0x60
[  457.477085]  [<ffffffff811bf466>] __vmalloc_node_range+0x1b6/0x270
[  457.483256]  [<ffffffff811bf5bb>] vzalloc+0x4b/0x50
[  457.488132]  [<ffffffff81121815>] ? kimage_file_prepare_segments.part.10+0x85/0x140
[  457.495774]  [<ffffffff81121815>] kimage_file_prepare_segments.part.10+0x85/0x140
[  457.503244]  [<ffffffff8112301a>] SyS_kexec_file_load+0x38a/0x690
[  457.509330]  [<ffffffff816a2f29>] system_call_fastpath+0x16/0x1b
[..]

I think it's better to do some sanity checks to prevent such warnings when
taking arbitrary arguments from user space.

> 
> function __vmalloc_node_range:
>         if (!size || (size >> PAGE_SHIFT) > totalram_pages)
>                 goto fail;
> 
> So I think only checking cmdline_len == 0 is enough.
> 
> For the upper length shouldn't it be stripped to COMMAND_LINE_SIZE?

Yes, COMMAND_LINE_SIZE

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-09  2:11         ` Dave Young
@ 2014-06-09 15:30           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-09 15:30 UTC (permalink / raw)
  To: Dave Young
  Cc: WANG Chao, mjg59, bhe, jkosina, greg, kexec, linux-kernel, bp,
	ebiederm, hpa, akpm

On Mon, Jun 09, 2014 at 10:11:22AM +0800, Dave Young wrote:

[..]
> > > > +	image->cmdline_buf = vzalloc(cmdline_len);
> > > 
> > > You should validate the upper/lower boundary of cmdline_len before
> > > calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
> > > message would be fired.
> > 
> > What's the upper length of vzalloc(). I think if it is too big to alloc,
> > then vzalloc() should return me an error?
> 
> function __vmalloc_node_range:
>         if (!size || (size >> PAGE_SHIFT) > totalram_pages)
>                 goto fail;
> 
> So I think only checking cmdline_len == 0 is enough.
> 
> For the upper length shouldn't it be stripped to COMMAND_LINE_SIZE?

We might be booting a newer kernel that supports a bigger command line
than the running kernel does. So we query the bzImage header to figure out
the maximum command line length supported.

It is just that this check currently happens later, at image load time.
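To illustrate where that limit lives, here is a sketch that reads it from a bzImage buffer, based on the x86 boot protocol (Documentation/x86/boot.txt): the setup header carries the "HdrS" magic at offset 0x202, the protocol version at 0x206 and, for protocol 2.06 and later, cmdline_size at 0x238; with protocol 2.05 and earlier the maximum was 255. cmdline_size excludes the terminating NUL, so a caller comparing it against cmdline_len (which includes the NUL) would allow cmdline_size + 1. The function name bzimage_cmdline_size() is hypothetical, and header fields are read as little-endian, as the protocol specifies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Setup header offsets from the x86 boot protocol. */
#define HDR_MAGIC_OFF		0x202	/* "HdrS" */
#define HDR_VERSION_OFF		0x206	/* u16 protocol version */
#define HDR_CMDLINE_SIZE_OFF	0x238	/* u32, protocol 2.06+ */

static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Return the maximum command line length (without the trailing NUL)
 * advertised by a bzImage, or 0 if the buffer does not look like one.
 */
static uint32_t bzimage_cmdline_size(const unsigned char *buf, size_t len)
{
	uint16_t version;

	if (len < HDR_CMDLINE_SIZE_OFF + 4 ||
	    memcmp(buf + HDR_MAGIC_OFF, "HdrS", 4) != 0)
		return 0;

	version = (uint16_t)(buf[HDR_VERSION_OFF] |
			     buf[HDR_VERSION_OFF + 1] << 8);
	if (version < 0x0206)
		return 255;	/* fixed limit for protocol 2.05 and earlier */

	return get_le32(buf + HDR_CMDLINE_SIZE_OFF);
}
```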

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-09  5:35           ` WANG Chao
@ 2014-06-09 15:41             ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-09 15:41 UTC (permalink / raw)
  To: WANG Chao
  Cc: Dave Young, mjg59, bhe, jkosina, greg, kexec, linux-kernel, bp,
	ebiederm, hpa, akpm

On Mon, Jun 09, 2014 at 01:35:38PM +0800, WANG Chao wrote:

[..]
> > > What's the upper length of vzalloc(). I think if it is too big to alloc,
> > > then vzalloc() should return me an error?
> 
> When allocating too large, eg. vzalloc(-1), kernel spits:
> 
> [  457.407579] vmalloc: allocation failure: 18446744073709551606 bytes
> [  457.413854] kexec: page allocation failure: order:0, mode:0x80d2
> [  457.419853] CPU: 3 PID: 2058 Comm: kexec Not tainted 3.15.0-rc8-00096-g3dc85e8 #10
> [  457.427408] Hardware name: Dell Inc. OptiPlex 760 /0M860N, BIOS A12 05/23/2011
> [  457.435999]  ffffffff81a2f678 ffff8800bfb03db0 ffffffff816944fd 00000000000080d2
> [  457.443422]  ffff8800bfb03e38 ffffffff8118a31a ffffffff81a2f678 ffff8800bfb03dd0
> [  457.450851]  ffff880100000018 ffff8800bfb03e48 ffff8800bfb03de8 ffff8800bfb03e10
> [  457.458278] Call Trace:
> [  457.460731]  [<ffffffff816944fd>] dump_stack+0x45/0x56
> [  457.465865]  [<ffffffff8118a31a>] warn_alloc_failed+0xda/0x140
> [  457.471693]  [<ffffffff811f56d1>] ? kernel_read+0x41/0x60
> [  457.477085]  [<ffffffff811bf466>] __vmalloc_node_range+0x1b6/0x270
> [  457.483256]  [<ffffffff811bf5bb>] vzalloc+0x4b/0x50
> [  457.488132]  [<ffffffff81121815>] ? kimage_file_prepare_segments.part.10+0x85/0x140
> [  457.495774]  [<ffffffff81121815>] kimage_file_prepare_segments.part.10+0x85/0x140
> [  457.503244]  [<ffffffff8112301a>] SyS_kexec_file_load+0x38a/0x690
> [  457.509330]  [<ffffffff816a2f29>] system_call_fastpath+0x16/0x1b
> [..]
> 
> I think it's better to do some sane check to prevent such warning when
> taking arbitrary argument from user space.

Hmm..., I did not know that memory allocation failures dump a stack
trace.

Anyway, I think I can implement another function which calls into the image
loader to query the maximum command line size it supports, and use that
number as the upper limit. It is a little more code and one extra call.
Hopefully it is worth it.
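One possible shape for that extra call, as a hypothetical sketch: struct image_loader_ops, its cmdline_max callback, check_cmdline_len() and the fixed 2048-byte example loader are invented for illustration and are not the API this series ends up with. The generic code asks the loader for its limit and rejects cmdline_len before anything is allocated, so an arbitrary length from user space never reaches vzalloc().

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical loader ops: each image loader reports the largest command
 * line (including the trailing NUL) accepted by the kernel it recognised.
 */
struct image_loader_ops {
	const char *name;
	unsigned long (*cmdline_max)(const void *kernel_buf,
				     unsigned long kernel_len);
};

/* Validate cmdline_len against the loader's limit before allocating. */
static int check_cmdline_len(const struct image_loader_ops *ops,
			     const void *kernel_buf, unsigned long kernel_len,
			     unsigned long cmdline_len)
{
	if (!cmdline_len)
		return 0;		/* command line is optional */

	if (!ops->cmdline_max)
		return -EINVAL;		/* loader cannot report a limit */

	if (cmdline_len > ops->cmdline_max(kernel_buf, kernel_len))
		return -EINVAL;

	return 0;
}

/* Example loader with a fixed limit, purely for illustration. */
static unsigned long fixed_cmdline_max(const void *buf, unsigned long len)
{
	(void)buf;
	(void)len;
	return 2048;
}

static const struct image_loader_ops example_ops = {
	.name = "example",
	.cmdline_max = fixed_cmdline_max,
};
```

Rejecting a loader that cannot report a limit is a design choice here; falling back to a conservative default would also work.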

> 
>
> > function __vmalloc_node_range:
> >         if (!size || (size >> PAGE_SHIFT) > totalram_pages)
> >                 goto fail;
> > 
> > So I think only checking cmdline_len == 0 is enough.
> > 
> > For the upper length shouldn't it be stripped to COMMAND_LINE_SIZE?
> 
> Yes, COMMAND_LINE_SIZE

IIUC, COMMAND_LINE_SIZE gives the maximum limit of the running kernel; it
does not tell us anything about the command line size supported by the
kernel being loaded.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
  2014-06-03 13:06   ` Vivek Goyal
@ 2014-06-10 16:31     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-10 16:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:59AM -0400, Vivek Goyal wrote:
> Load purgatory code in RAM and relocate it based on the location. Relocation
> code has been inspired by module relocation code and purgatory relocation
> code in kexec-tools.
> 
> Also compute the checksums of loaded kexec segments and store them in
> purgatory.
> 
> Arch independent code provides this functionality so that arch dependent
> bootloaders can make use of it.
> 
> Helper functions are provided to get/set symbol values in purgatory which
> are used by bootloaders later to set things like stack and entry point
> of second kernel etc.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/Kconfig                   |   2 +
>  arch/x86/kernel/machine_kexec_64.c |  82 +++++++
>  include/linux/kexec.h              |  31 +++
>  kernel/kexec.c                     | 484 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 599 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 213308a..0f24b61 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1556,6 +1556,8 @@ source kernel/Kconfig.hz
>  config KEXEC
>  	bool "kexec system call"
>  	select BUILD_BIN2C
> +	select CRYPTO
> +	select CRYPTO_SHA256

Ok, but why automatically enable crypto? There's still the old kexec
method where we don't check any signatures.

Which begs the more important question - shouldn't this new in-kernel
loading method support also kexec'ing of kernels without any signature
verifications at all?

I mean, the main use case is secure boot and all but why not leave it
configurable for people to decide?

>  	---help---
>  	  kexec is a system call that implements the ability to shutdown your
>  	  current kernel, and to start another kernel.  It is like a reboot
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index d9c5cf0..711c1fb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -337,3 +337,85 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
>  		return kexec_file_type[idx].cleanup(image);
>  	return 0;
>  }
> +
> +/* Apply purgatory relocations */
> +int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,

apply_..._add? "arch_kexec_apply_relocations" seems fine to me.

> +				unsigned int nr_sections, unsigned int relsec)
> +{
> +	unsigned int i;
> +	Elf64_Rela *rel = (void *)sechdrs[relsec].sh_offset;
> +	Elf64_Sym *sym;
> +	void *location;
> +	Elf64_Shdr *section, *symtab;
> +	unsigned long address, sec_base, value;
> +
> +	/* Section to which relocations apply */
> +	section = &sechdrs[sechdrs[relsec].sh_info];
> +
> +	/* Associated symbol table */
> +	symtab = &sechdrs[sechdrs[relsec].sh_link];
> +
> +	for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
> +
> +		/*
> +		 * This is location (->sh_offset) to update. This is temporary
> +		 * buffer where section is currently loaded. This will finally
> +		 * be loaded to a different address later (pointed to
> +		 * by ->sh_addr. kexec takes care of moving it
> +		 * (kexec_load_segment()).
> +		 */
> +		location = (void *)(section->sh_offset + rel[i].r_offset);
> +
> +		/* Final address of the location */
> +		address = section->sh_addr + rel[i].r_offset;
> +
> +		sym = (Elf64_Sym *)symtab->sh_offset +
> +				ELF64_R_SYM(rel[i].r_info);
> +
> +		if (sym->st_shndx == SHN_UNDEF || sym->st_shndx == SHN_COMMON)
> +			return -ENOEXEC;
> +
> +		if (sym->st_shndx == SHN_ABS)
> +			sec_base = 0;
> +		else if (sym->st_shndx >= nr_sections)
> +			return -ENOEXEC;
> +		else
> +			sec_base = sechdrs[sym->st_shndx].sh_addr;
> +
> +		value = sym->st_value;
> +		value += sec_base;
> +		value += rel[i].r_addend;
> +
> +		switch (ELF64_R_TYPE(rel[i].r_info)) {
> +		case R_X86_64_NONE:
> +			break;
> +		case R_X86_64_64:
> +			*(u64 *)location = value;
> +			break;
> +		case R_X86_64_32:
> +			*(u32 *)location = value;
> +			if (value != *(u32 *)location)
> +				goto overflow;
> +			break;
> +		case R_X86_64_32S:
> +			*(s32 *)location = value;
> +			if ((s64)value != *(s32 *)location)
> +				goto overflow;
> +			break;
> +		case R_X86_64_PC32:
> +			value -= (u64)address;
> +			*(u32 *)location = value;
> +			break;
> +		default:
> +			pr_err("kexec: Unknown rela relocation: %llu\n",

Yep, the "kexec: " string should come from pr_fmt as in the other mail.

> +					ELF64_R_TYPE(rel[i].r_info));
> +			return -ENOEXEC;
> +		}
> +	}
> +	return 0;
> +
> +overflow:
> +	pr_err("kexec: overflow in relocation type %d value 0x%lx\n",

Ditto.

> +		(int)ELF64_R_TYPE(rel[i].r_info), value);
> +	return -ENOEXEC;
> +}
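For reference, the arithmetic this function performs can be sketched in user space. With S the symbol value plus its section base, A the addend and P the final run-time address of the patched location, R_X86_64_64, _32 and _32S store S + A, while R_X86_64_PC32 stores S + A - P. The type numbers below match <elf.h>; apply_rela() and the RELOC_* names are hypothetical. The quoted patch checks overflow for _32 and _32S only; the sketch additionally rejects a PC32 displacement that does not fit in a signed 32-bit field, which is an addition and not part of the patch. Stores go through memcpy() and assume a little-endian host, as on x86.

```c
#include <stdint.h>
#include <string.h>

/* x86-64 relocation type numbers (same values as R_X86_64_* in <elf.h>). */
enum {
	RELOC_NONE = 0,		/* R_X86_64_NONE */
	RELOC_64   = 1,		/* R_X86_64_64   */
	RELOC_PC32 = 2,		/* R_X86_64_PC32 */
	RELOC_32   = 10,	/* R_X86_64_32   */
	RELOC_32S  = 11,	/* R_X86_64_32S  */
};

/*
 * Apply one RELA relocation to the temporary copy at 'location'.
 * 'address' is P (where the location will live at run time) and
 * 'value' is S + A. Returns 0, -1 on overflow, -2 on unknown type.
 */
static int apply_rela(unsigned int type, void *location, uint64_t address,
		      uint64_t value)
{
	uint32_t u32;
	int32_t s32;
	int64_t pcrel;

	switch (type) {
	case RELOC_NONE:
		return 0;
	case RELOC_64:
		memcpy(location, &value, sizeof(value));
		return 0;
	case RELOC_32:		/* zero-extended: upper 32 bits must be clear */
		if (value > UINT32_MAX)
			return -1;
		u32 = (uint32_t)value;
		memcpy(location, &u32, sizeof(u32));
		return 0;
	case RELOC_32S:		/* sign-extended: must fit in an s32 */
		if ((int64_t)value < INT32_MIN || (int64_t)value > INT32_MAX)
			return -1;
		s32 = (int32_t)(int64_t)value;
		memcpy(location, &s32, sizeof(s32));
		return 0;
	case RELOC_PC32:	/* S + A - P as a signed 32-bit displacement */
		pcrel = (int64_t)(value - address);
		if (pcrel < INT32_MIN || pcrel > INT32_MAX)
			return -1;
		s32 = (int32_t)pcrel;
		memcpy(location, &s32, sizeof(s32));
		return 0;
	default:
		return -2;
	}
}
```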
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3790519..7228873 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioport.h>
>  #include <linux/elfcore.h>
>  #include <linux/elf.h>
> +#include <linux/module.h>
>  #include <asm/kexec.h>
>  
>  /* Verify architecture specific macros are defined */
> @@ -95,6 +96,27 @@ struct compat_kexec_segment {
>  };
>  #endif
>  
> +struct kexec_sha_region {
> +	unsigned long start;
> +	unsigned long len;
> +};
> +
> +struct purgatory_info {
> +	/* Pointer to elf header of read only purgatory */
> +	Elf_Ehdr *ehdr;
> +
> +	/* Pointer to purgatory sechdrs which are modifiable */
> +	Elf_Shdr *sechdrs;
> +	/*
> +	 * Temporary buffer location where purgatory is loaded and relocated
> +	 * This memory can be freed post image load
> +	 */
> +	void *purgatory_buf;
> +
> +	/* Address where purgatory is finally loaded and is executed from */
> +	unsigned long purgatory_load_addr;
> +};
> +
>  struct kimage {
>  	kimage_entry_t head;
>  	kimage_entry_t *entry;
> @@ -143,6 +165,9 @@ struct kimage {
>  
>  	/* Image loader handling the kernel can store a pointer here */
>  	void *image_loader_data;
> +
> +	/* Information for loading purgatory */
> +	struct purgatory_info purgatory_info;

Having the member name with the same name as the struct is kinda
confusing. Also, you could shorten it, which, in turn, would give
shorter code lines at the sites it is accessed. I.e.,

	struct purgatory_info pinfo;

should be just fine IMHO.

...

> @@ -336,6 +348,15 @@ arch_kimage_file_post_load_cleanup(struct kimage *image)
>  	return;
>  }
>  
> +/* Apply relocations for rela section */
> +int __attribute__ ((weak))
> +arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
> +					unsigned int relsec)
> +{
> +	pr_err(KERN_ERR "kexec: REL relocation unsupported\n");

pr_err *and* KERN_ERR. Double error level? :-)

> +	return -ENOEXEC;
> +}
> +
>  /*
>   * Free up temporary buffers allocated which are not needed after image has
>   * been loaded.
> @@ -355,6 +376,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
>  	vfree(image->cmdline_buf);
>  	image->cmdline_buf = NULL;
>  
> +	vfree(image->purgatory_info.purgatory_buf);
> +	image->purgatory_info.purgatory_buf = NULL;

Here's what I mean - that's definitely too long. Maybe

	vfree(image->pinfo.pbuf);
	image->pinfo.pbuf = NULL;

(yep, we shortened purgatory_buf too). Now this looks like proper kernel
code to me :-)

> +
> +	vfree(image->purgatory_info.sechdrs);
> +	image->purgatory_info.sechdrs = NULL;
> +
>  	/* See if architecture has anything to cleanup post load */
>  	arch_kimage_file_post_load_cleanup(image);
>  }
> @@ -1370,6 +1397,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if (ret)
>  		goto out;
>  
> +	ret = kexec_calculate_store_digests(image);

This function name could be shortened too: kexec_calc_digests() or so.
The actual storing could be a separate kexec_store_digests. I.e., a
function does one thing only.

> +	if (ret)
> +		goto out;
> +
>  	for (i = 0; i < image->nr_segments; i++) {
>  		struct kexec_segment *ksegment;
>  
> @@ -2131,6 +2162,459 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
>  	return 0;
>  }
>  
> +/* Calculate and store the digest of segments */
> +static int kexec_calculate_store_digests(struct kimage *image)
> +{
> +	struct crypto_shash *tfm;
> +	struct shash_desc *desc;
> +	int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;

256 - a magic constant.

> +	size_t desc_size, nullsz;
> +	char *digest = NULL;
> +	void *zero_buf;
> +	struct kexec_sha_region *sha_regions;
> +
> +	tfm = crypto_alloc_shash("sha256", 0, 0);
> +	if (IS_ERR(tfm)) {
> +		ret = PTR_ERR(tfm);
> +		goto out;

The "out" label kfrees digest but we haven't allocated it yet...

> +	}
> +
> +	desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
> +	desc = kzalloc(desc_size, GFP_KERNEL);
> +	if (!desc) {
> +		ret = -ENOMEM;
> +		goto out_free_tfm;
> +	}
> +
> +	zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
> +	if (!zero_buf) {
> +		ret = -ENOMEM;
> +		goto out_free_desc;
> +	}
> +
> +	sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
> +	sha_regions = vzalloc(sha_region_sz);
> +	if (!sha_regions)
> +		goto out_free_zero_buf;
> +
> +	desc->tfm   = tfm;
> +	desc->flags = 0;
> +
> +	ret = crypto_shash_init(desc);
> +	if (ret < 0)
> +		goto out_free_sha_regions;
> +
> +	digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
> +	if (!digest) {
> +		ret = -ENOMEM;
> +		goto out_free_sha_regions;
> +	}

... but this digest is a simple allocation. Could it be you moved it
down but forgot to readjust the labels?

> +
> +	/* Traverse through all segments */

Yep, we can see that. Why does it need an explicit comment?

> +	for (j = i = 0; i < image->nr_segments; i++) {
> +		struct kexec_segment *ksegment;
> +		ksegment = &image->segment[i];
> +
> +		/*
> +		 * Skip purgatory as it will be modified once we put digest
> +		 * info in purgatory
> +		 */

Now this comment is perfect right there! :-) It needs a full-stop though.

> +		if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
> +			continue;
> +
> +		ret = crypto_shash_update(desc, ksegment->kbuf,
> +						ksegment->bufsz);

Arg alignment.

> +		if (ret)
> +			break;
> +
> +		nullsz = ksegment->memsz - ksegment->bufsz;
> +		while (nullsz) {
> +			unsigned long bytes = nullsz;
> +			if (bytes > zero_buf_sz)
> +				bytes = zero_buf_sz;
> +			ret = crypto_shash_update(desc, zero_buf, bytes);
> +			if (ret)
> +				break;
> +			nullsz -= bytes;
> +		}

Now this trailing buffer "drainage" could very well use a comment on
what's going on.
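As an illustration of what that loop does: a segment occupies memsz bytes in memory, but only the first bufsz come from the buffer and the rest are zero-filled when the segment is loaded. To hash the segment exactly as it will look in memory, the trailing memsz - bufsz zero bytes are fed to the hash from a small fixed zero buffer in chunks, instead of allocating memsz bytes. The sketch uses FNV-1a purely as a stand-in for the SHA-256 shash updates (both consume data incrementally, so chunking does not change the result); hash_segment(), fnv1a_update() and ZERO_CHUNK are hypothetical names, and memsz >= bufsz is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define ZERO_CHUNK		256	/* mirrors zero_buf_sz in the patch */
#define FNV_OFFSET_BASIS	0xcbf29ce484222325ULL

/* 64-bit FNV-1a, a simple stand-in for crypto_shash_update(). */
static uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Hash bufsz bytes of data followed by (memsz - bufsz) zero bytes,
 * feeding the zeros from a fixed buffer at most ZERO_CHUNK at a time.
 */
static uint64_t hash_segment(uint64_t h, const void *buf, size_t bufsz,
			     size_t memsz)
{
	static const unsigned char zero_buf[ZERO_CHUNK];
	size_t nullsz = memsz - bufsz;

	h = fnv1a_update(h, buf, bufsz);
	while (nullsz) {
		size_t bytes = nullsz < ZERO_CHUNK ? nullsz : ZERO_CHUNK;

		h = fnv1a_update(h, zero_buf, bytes);
		nullsz -= bytes;
	}
	return h;
}
```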

> +
> +		if (ret)
> +			break;
> +
> +		sha_regions[j].start = ksegment->mem;
> +		sha_regions[j].len = ksegment->memsz;
> +		j++;
> +	}
> +
> +	if (!ret) {
> +		ret = crypto_shash_final(desc, digest);
> +		if (ret)
> +			goto out_free_sha_regions;
> +		ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
> +				sha_regions, sha_region_sz, 0);
> +		if (ret)
> +			goto out_free_sha_regions;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
> +				digest, SHA256_DIGEST_SIZE, 0);
> +		if (ret)
> +			goto out_free_sha_regions;

Yeah, this block could be a separate kexec_store_digests() function.

> +	}
> +
> +out_free_sha_regions:
> +	vfree(sha_regions);
> +out_free_zero_buf:
> +	kfree(zero_buf);
> +out_free_desc:
> +	kfree(desc);
> +out_free_tfm:
> +	kfree(tfm);
> +out:
> +	kfree(digest);
> +	return ret;
> +}
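A sketch of the label ordering the review asks for, in user space with counted allocations: each resource gets its own label, the labels appear in reverse allocation order, and every failing step has ret set before jumping to the label that releases only what already exists. xalloc(), xfree(), fail_at and calc_digests() are hypothetical test scaffolding. In the quoted hunk, a failed vzalloc() of sha_regions jumps out with ret still 0, and the tfm is released with kfree(); in kernel code a shash tfm would normally be released with crypto_free_shash().

```c
#include <errno.h>
#include <stdlib.h>

/* Test hook: make the n-th allocation fail; 0 means never fail. */
static int fail_at;
static int nr_allocs, nr_live;

static void *xalloc(size_t sz)
{
	void *p;

	if (++nr_allocs == fail_at)
		return NULL;		/* simulated allocation failure */
	p = calloc(1, sz);
	if (p)
		nr_live++;
	return p;
}

static void xfree(void *p)
{
	if (p) {
		nr_live--;
		free(p);
	}
}

static int calc_digests(void)
{
	void *tfm, *desc, *zero_buf, *sha_regions, *digest;
	int ret = -ENOMEM;	/* every failure below is an allocation failure */

	tfm = xalloc(16);
	if (!tfm)
		goto out;
	desc = xalloc(64);
	if (!desc)
		goto out_free_tfm;
	zero_buf = xalloc(256);
	if (!zero_buf)
		goto out_free_desc;
	sha_regions = xalloc(1024);
	if (!sha_regions)
		goto out_free_zero_buf;
	digest = xalloc(32);
	if (!digest)
		goto out_free_sha_regions;

	ret = 0;		/* hashing and storing would happen here */

	xfree(digest);
out_free_sha_regions:
	xfree(sha_regions);
out_free_zero_buf:
	xfree(zero_buf);
out_free_desc:
	xfree(desc);
out_free_tfm:
	xfree(tfm);
out:
	return ret;
}
```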
> +
> +/* Actually load and relcoate purgatory. Lot of code taken from kexec-tools */

s/relcoate/relocate/

> +static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
> +				unsigned long max, int top_down)

Another function which is too big and does at least two things and could
probably be nicely split into two.

> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
> +	unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
> +	unsigned char *buf_addr, *src;
> +	int i, ret = 0, entry_sidx = -1;
> +	Elf_Shdr *sechdrs = NULL, *sechdrs_c;
> +	void *purgatory_buf = NULL;
> +
> +	/*
> +	 * sechdrs_c points to section headers in purgatory and are read
> +	 * only. No modifications allowed.
> +	 */

Then do

	const Elf_Shdr * const sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;

to enforce it?

> +	sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
> +
> +	/*
> +	 * We can not modify sechdrs_c[] and its fields. It is read only.
> +	 * Copy it over to a local copy where one can store some temporary
> +	 * data and free it at the end. We need to modify ->sh_addr and

What is freeing it when we store it into pi->sechdrs and return? Or
doesn't it need to be freed?

> +	 * ->sh_offset fields to keep track permanent and temporary locations
> +	 * of sections.
> +	 */
> +	sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> +	if (!sechdrs)
> +		return -ENOMEM;
> +
> +	memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> +
> +	/*
> +	 * We seem to have multiple copies of sections. First copy is which
> +	 * is embedded in kernel in read only section. Some of these sections
> +	 * will be copied to a temporary buffer and relocated. And these
> +	 * sections will finally be copied to their final detination at

"destination"

> +	 * segment load time.
> +	 *
> +	 * Use ->sh_offset to reflect section address in memory. It will
> +	 * point to original read only copy if section is not allocatable.
> +	 * Otherwise it will point to temporary copy which will be relocated.
> +	 *
> +	 * Use ->sh_addr to contain final address of the section where it
> +	 * will go during execution time.
> +	 */
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (sechdrs[i].sh_type == SHT_NOBITS)
> +			continue;
> +
> +		sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
> +						sechdrs[i].sh_offset;
> +	}
> +
> +	entry = pi->ehdr->e_entry;
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
> +			continue;
> +
> +		/* Make entry section relative */
> +		if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
> +		    ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
> +		     pi->ehdr->e_entry)) {
> +			entry_sidx = i;
> +			entry -= sechdrs[i].sh_addr;
> +			break;
> +		}
> +	}
> +
> +	/* Find the RAM size requirements of relocatable object */
> +	buf_align = 1;
> +	bss_align = 1;
> +	buf_sz = 0;
> +	bss_sz = 0;
> +
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		align = sechdrs[i].sh_addralign;
> +		if (sechdrs[i].sh_type != SHT_NOBITS) {
> +			if (buf_align < align)
> +				buf_align = align;
> +			buf_sz = ALIGN(buf_sz, align);
> +			buf_sz += sechdrs[i].sh_size;
> +		} else {
> +			if (bss_align < align)
> +				bss_align = align;
> +			bss_sz = ALIGN(bss_sz, align);
> +			bss_sz += sechdrs[i].sh_size;
> +		}
> +	}
> +
> +	if (buf_align < bss_align)
> +		buf_align = bss_align;
> +	bss_pad = 0;
> +	if (buf_sz & (bss_align - 1))
> +		bss_pad = bss_align - (buf_sz & (bss_align - 1));
> +
> +	memsz = buf_sz + bss_pad + bss_sz;
> +
> +	/* Allocate buffer for purgatory */
> +	purgatory_buf = vzalloc(buf_sz);
> +	if (!purgatory_buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Add buffer to segment list */
> +	ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
> +				buf_align, min, max, top_down,
> +				&pi->purgatory_load_addr);
> +	if (ret)
> +		goto out;
> +
> +	/* Load SHF_ALLOC sections */

Here could start a new function.

> +	buf_addr = purgatory_buf;
> +	load_addr = pi->purgatory_load_addr;
> +	data_addr = load_addr;
> +	bss_addr = load_addr + buf_sz + bss_pad;
> +
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		align = sechdrs[i].sh_addralign;
> +		if (sechdrs[i].sh_type != SHT_NOBITS) {
> +			data_addr = ALIGN(data_addr, align);
> +			off = data_addr - load_addr;
> +			/* We have already modifed ->sh_offset to keep addr */
> +			src = (char *) sechdrs[i].sh_offset;
> +			memcpy(buf_addr + off, src, sechdrs[i].sh_size);
> +
> +			/* Store load address and source address of section */
> +			sechdrs[i].sh_addr = data_addr;
> +
> +			/*
> +			 * This section got copied to temporary buffer. Update
> +			 * ->sh_offset accordingly.
> +			 */
> +			sechdrs[i].sh_offset = (unsigned long)(buf_addr + off);
> +
> +			/* Advance to the next address */
> +			data_addr += sechdrs[i].sh_size;
> +		} else {
> +			bss_addr = ALIGN(bss_addr, align);
> +			sechdrs[i].sh_addr = bss_addr;
> +			bss_addr += sechdrs[i].sh_size;
> +		}
> +	}
> +
> +	/* update entry based on entry section position */
> +	if (entry_sidx >= 0)
> +		entry += sechdrs[entry_sidx].sh_addr;
> +
> +	/* Set the entry point of purgatory */
> +	image->start = entry;
> +
> +	/* Apply relocations */

From here-on could start a new function.

> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		Elf_Shdr *section, *symtab;
> +
> +		if (sechdrs[i].sh_type != SHT_RELA &&
> +		    sechdrs[i].sh_type != SHT_REL)
> +			continue;
> +
> +		if (sechdrs[i].sh_info > pi->ehdr->e_shnum ||
> +		    sechdrs[i].sh_link > pi->ehdr->e_shnum) {
> +			ret = -ENOEXEC;
> +			goto out;
> +		}
> +
> +		section = &sechdrs[sechdrs[i].sh_info];
> +		symtab = &sechdrs[sechdrs[i].sh_link];
> +
> +		if (!(section->sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		if (symtab->sh_link > pi->ehdr->e_shnum)
> +			/* Invalid section number? */
> +			continue;
> +
> +		ret = -EOPNOTSUPP;
> +		if (sechdrs[i].sh_type == SHT_RELA)
> +			ret = arch_kexec_apply_relocations_add(sechdrs,
> +							pi->ehdr->e_shnum, i);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	pi->sechdrs = sechdrs;
> +	pi->purgatory_buf = purgatory_buf;
> +	return ret;
> +out:
> +	vfree(sechdrs);
> +	vfree(purgatory_buf);
> +	return ret;
> +}
> +
> +/* Load relocatable purgatory object and relocate it appropriately */
> +int kexec_load_purgatory(struct kimage *image, unsigned long min,
> +		unsigned long max, int top_down, unsigned long *load_addr)
> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	int ret;
> +
> +	if (kexec_purgatory_size <= 0)
> +		return -EINVAL;
> +
> +	if (kexec_purgatory_size < sizeof(Elf_Ehdr))
> +		return -ENOEXEC;
> +
> +	pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
> +
> +	if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
> +	    || pi->ehdr->e_type != ET_REL
> +	    || !elf_check_arch(pi->ehdr)
> +	    || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
> +		return -ENOEXEC;
> +
> +	if (pi->ehdr->e_shoff >= kexec_purgatory_size
> +	    || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
> +	    kexec_purgatory_size - pi->ehdr->e_shoff))
> +		return -ENOEXEC;
> +
> +	ret = elf_rel_load_relocate(image, min, max, top_down);
> +	if (ret)
> +		return ret;
> +
> +	*load_addr = image->purgatory_info.purgatory_load_addr;
> +	return 0;
> +}
> +
> +static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
> +						const char *name)
> +{
> +	Elf_Sym *syms;
> +	Elf_Shdr *sechdrs;
> +	Elf_Ehdr *ehdr;
> +	int i, k;
> +	const char *strtab;
> +
> +	if (!pi->sechdrs || !pi->ehdr)
> +		return NULL;
> +
> +	sechdrs = pi->sechdrs;
> +	ehdr = pi->ehdr;
> +
> +	for (i = 0; i < ehdr->e_shnum; i++) {
> +		if (sechdrs[i].sh_type != SHT_SYMTAB)
> +			continue;
> +
> +		if (sechdrs[i].sh_link > ehdr->e_shnum)
> +			/* Invalid stratab section number */

"strtab"

> +			continue;
> +		strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
> +		syms = (Elf_Sym *)sechdrs[i].sh_offset;
> +
> +		/* Go through symbols for a match */
> +		for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
> +			if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
> +				continue;
> +
> +			if (strcmp(strtab + syms[k].st_name, name) != 0)
> +				continue;
> +
> +			if (syms[k].st_shndx == SHN_UNDEF ||
> +			    syms[k].st_shndx > ehdr->e_shnum) {
> +				pr_debug("Symbol: %s has bad section index %d.\n",
> +						name, syms[k].st_shndx);
> +				return NULL;
> +			}
> +
> +			/* Found the symbol we are looking for */
> +			return &syms[k];
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	Elf_Sym *sym;
> +	Elf_Shdr *sechdr;
> +
> +	sym = kexec_purgatory_find_symbol(pi, name);
> +	if (!sym)
> +		return ERR_PTR(-EINVAL);
> +
> +	sechdr = &pi->sechdrs[sym->st_shndx];
> +
> +	/*
> +	 * Returns the address where symbol will finally be loaded after
> +	 * kexec_load_segment()
> +	 */
> +	return (void *)(sechdr->sh_addr + sym->st_value);
> +}
> +
> +/*
> + * Get or set value of a symbol. If "get_value" is true, symbol value is
> + * returned in buf otherwise symbol value is set based on value in buf.
> + */
> +int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
> +				void *buf, unsigned int size, bool get_value)
> +{
> +	Elf_Sym *sym;
> +	Elf_Shdr *sechdrs;
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	char *sym_buf;
> +
> +	sym = kexec_purgatory_find_symbol(pi, name);
> +	if (!sym)
> +		return -EINVAL;
> +
> +	if (sym->st_size != size) {
> +		pr_debug("Symbol: %s size is not right\n", name);

Should probably be pr_err because it is an error, right? And then

	pr_err("Symbol %s size mismatch: %llu vs %u\n", name,
	       (unsigned long long)sym->st_size, size);

> +		return -EINVAL;
> +	}
> +
> +	sechdrs = pi->sechdrs;
> +
> +	if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
> +		pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",

			... Cannot %s\n", name, get_value ? "get" : "set");

> +					name);
> +		return -EINVAL;
> +	}
> +
> +	sym_buf = (unsigned char *)sechdrs[sym->st_shndx].sh_offset +
> +					sym->st_value;
> +
> +	if (get_value)
> +		memcpy((void *)buf, sym_buf, size);
> +	else
> +		memcpy((void *)sym_buf, buf, size);
> +
> +	return 0;
> +}
>  
>  /*
>   * Move into place and start executing a preloaded standalone
> -- 
> 1.9.0
> 
> 

* Re: [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
@ 2014-06-10 16:31     ` Borislav Petkov
  0 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-10 16:31 UTC
  To: Vivek Goyal
  Cc: mjg59, bhe, jkosina, greg, kexec, linux-kernel, ebiederm, hpa,
	akpm, dyoung, chaowang

On Tue, Jun 03, 2014 at 09:06:59AM -0400, Vivek Goyal wrote:
> Load purgatory code in RAM and relocate it based on the location. Relocation
> code has been inspired by module relocation code and purgatory relocation
> code in kexec-tools.
> 
> Also compute the checksums of loaded kexec segments and store them in
> purgatory.
> 
> Arch independent code provides this functionality so that arch dependent
> bootloaders can make use of it.
> 
> Helper functions are provided to get/set symbol values in purgatory which
> are used by bootloaders later to set things like stack and entry point
> of second kernel etc.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/Kconfig                   |   2 +
>  arch/x86/kernel/machine_kexec_64.c |  82 +++++++
>  include/linux/kexec.h              |  31 +++
>  kernel/kexec.c                     | 484 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 599 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 213308a..0f24b61 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1556,6 +1556,8 @@ source kernel/Kconfig.hz
>  config KEXEC
>  	bool "kexec system call"
>  	select BUILD_BIN2C
> +	select CRYPTO
> +	select CRYPTO_SHA256

Ok, but why automatically enable crypto? There's still the old kexec
method where we don't check any signatures.

Which raises the more important question: shouldn't this new in-kernel
loading method also support kexec'ing kernels without any signature
verification at all?

I mean, the main use case is secure boot and all but why not leave it
configurable for people to decide?

>  	---help---
>  	  kexec is a system call that implements the ability to shutdown your
>  	  current kernel, and to start another kernel.  It is like a reboot
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index d9c5cf0..711c1fb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -337,3 +337,85 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
>  		return kexec_file_type[idx].cleanup(image);
>  	return 0;
>  }
> +
> +/* Apply purgatory relocations */
> +int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,

apply_..._add? "arch_kexec_apply_relocations" seems fine to me.

> +				unsigned int nr_sections, unsigned int relsec)
> +{
> +	unsigned int i;
> +	Elf64_Rela *rel = (void *)sechdrs[relsec].sh_offset;
> +	Elf64_Sym *sym;
> +	void *location;
> +	Elf64_Shdr *section, *symtab;
> +	unsigned long address, sec_base, value;
> +
> +	/* Section to which relocations apply */
> +	section = &sechdrs[sechdrs[relsec].sh_info];
> +
> +	/* Associated symbol table */
> +	symtab = &sechdrs[sechdrs[relsec].sh_link];
> +
> +	for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
> +
> +		/*
> +		 * This is location (->sh_offset) to update. This is temporary
> +		 * buffer where section is currently loaded. This will finally
> +		 * be loaded to a different address later (pointed to
> +		 * by ->sh_addr. kexec takes care of moving it
> +		 * (kexec_load_segment()).
> +		 */
> +		location = (void *)(section->sh_offset + rel[i].r_offset);
> +
> +		/* Final address of the location */
> +		address = section->sh_addr + rel[i].r_offset;
> +
> +		sym = (Elf64_Sym *)symtab->sh_offset +
> +				ELF64_R_SYM(rel[i].r_info);
> +
> +		if (sym->st_shndx == SHN_UNDEF || sym->st_shndx == SHN_COMMON)
> +			return -ENOEXEC;
> +
> +		if (sym->st_shndx == SHN_ABS)
> +			sec_base = 0;
> +		else if (sym->st_shndx >= nr_sections)
> +			return -ENOEXEC;
> +		else
> +			sec_base = sechdrs[sym->st_shndx].sh_addr;
> +
> +		value = sym->st_value;
> +		value += sec_base;
> +		value += rel[i].r_addend;
> +
> +		switch (ELF64_R_TYPE(rel[i].r_info)) {
> +		case R_X86_64_NONE:
> +			break;
> +		case R_X86_64_64:
> +			*(u64 *)location = value;
> +			break;
> +		case R_X86_64_32:
> +			*(u32 *)location = value;
> +			if (value != *(u32 *)location)
> +				goto overflow;
> +			break;
> +		case R_X86_64_32S:
> +			*(s32 *)location = value;
> +			if ((s64)value != *(s32 *)location)
> +				goto overflow;
> +			break;
> +		case R_X86_64_PC32:
> +			value -= (u64)address;
> +			*(u32 *)location = value;
> +			break;
> +		default:
> +			pr_err("kexec: Unknown rela relocation: %llu\n",

Yep, the "kexec: " string should come from pr_fmt as in the other mail.

> +					ELF64_R_TYPE(rel[i].r_info));
> +			return -ENOEXEC;
> +		}
> +	}
> +	return 0;
> +
> +overflow:
> +	pr_err("kexec: overflow in relocation type %d value 0x%lx\n",

Ditto.

> +		(int)ELF64_R_TYPE(rel[i].r_info), value);
> +	return -ENOEXEC;
> +}
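For readers following the switch above, here is a hedged user-space sketch of the same arithmetic. The value is S + A (symbol address plus addend), and PC32 subtracts the final address P of the patched location. The two 32-bit forms are range-checked as zero-extended and sign-extended values respectively. `toy_reloc()` and the `TOY_R_X86_64_*` names are hypothetical; their numeric values are the x86-64 psABI relocation numbers. Unlike the quoted code, the sketch also range-checks PC32, whose field is a signed 32-bit displacement. It writes through memcpy() to avoid unaligned stores.

```c
#include <stdint.h>
#include <string.h>

#define TOY_R_X86_64_NONE  0
#define TOY_R_X86_64_64    1
#define TOY_R_X86_64_PC32  2
#define TOY_R_X86_64_32   10
#define TOY_R_X86_64_32S  11

/*
 * Patch one relocation.
 *   loc      - temporary buffer location being patched (->sh_offset side)
 *   address  - final run-time address of that location (->sh_addr side)
 *   sym_addr - symbol value plus its section base (S)
 *   addend   - r_addend (A)
 * Returns 0 on success, -1 on overflow or an unknown type.
 */
static int toy_reloc(void *loc, uint64_t address, unsigned int type,
		     uint64_t sym_addr, int64_t addend)
{
	uint64_t value = sym_addr + (uint64_t)addend;   /* S + A */
	uint32_t v32;
	int64_t disp;

	switch (type) {
	case TOY_R_X86_64_NONE:
		return 0;
	case TOY_R_X86_64_64:
		memcpy(loc, &value, sizeof(value));
		return 0;
	case TOY_R_X86_64_32:                  /* must zero-extend back */
		if (value > UINT32_MAX)
			return -1;
		v32 = (uint32_t)value;
		break;
	case TOY_R_X86_64_32S:                 /* must sign-extend back */
		if ((int64_t)value < INT32_MIN || (int64_t)value > INT32_MAX)
			return -1;
		v32 = (uint32_t)value;
		break;
	case TOY_R_X86_64_PC32:                /* S + A - P */
		disp = (int64_t)(value - address);
		if (disp < INT32_MIN || disp > INT32_MAX)
			return -1;
		v32 = (uint32_t)disp;
		break;
	default:
		return -1;
	}
	memcpy(loc, &v32, sizeof(v32));
	return 0;
}
```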
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3790519..7228873 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioport.h>
>  #include <linux/elfcore.h>
>  #include <linux/elf.h>
> +#include <linux/module.h>
>  #include <asm/kexec.h>
>  
>  /* Verify architecture specific macros are defined */
> @@ -95,6 +96,27 @@ struct compat_kexec_segment {
>  };
>  #endif
>  
> +struct kexec_sha_region {
> +	unsigned long start;
> +	unsigned long len;
> +};
> +
> +struct purgatory_info {
> +	/* Pointer to elf header of read only purgatory */
> +	Elf_Ehdr *ehdr;
> +
> +	/* Pointer to purgatory sechdrs which are modifiable */
> +	Elf_Shdr *sechdrs;
> +	/*
> +	 * Temporary buffer location where purgatory is loaded and relocated
> +	 * This memory can be freed post image load
> +	 */
> +	void *purgatory_buf;
> +
> +	/* Address where purgatory is finally loaded and is executed from */
> +	unsigned long purgatory_load_addr;
> +};
> +
>  struct kimage {
>  	kimage_entry_t head;
>  	kimage_entry_t *entry;
> @@ -143,6 +165,9 @@ struct kimage {
>  
>  	/* Image loader handling the kernel can store a pointer here */
>  	void *image_loader_data;
> +
> +	/* Information for loading purgatory */
> +	struct purgatory_info purgatory_info;

Giving the member the same name as the struct is kinda confusing.
Also, you could shorten it, which, in turn, would give shorter code
lines at the sites where it is accessed. I.e.,

	struct purgatory_info pinfo;

should be just fine IMHO.

...

> @@ -336,6 +348,15 @@ arch_kimage_file_post_load_cleanup(struct kimage *image)
>  	return;
>  }
>  
> +/* Apply relocations for rela section */
> +int __attribute__ ((weak))
> +arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
> +					unsigned int relsec)
> +{
> +	pr_err(KERN_ERR "kexec: REL relocation unsupported\n");

pr_err *and* KERN_ERR. Double error level? :-)

> +	return -ENOEXEC;
> +}
> +
>  /*
>   * Free up tempory buffers allocated which are not needed after image has
>   * been loaded.
> @@ -355,6 +376,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
>  	vfree(image->cmdline_buf);
>  	image->cmdline_buf = NULL;
>  
> +	vfree(image->purgatory_info.purgatory_buf);
> +	image->purgatory_info.purgatory_buf = NULL;

Here's what I mean - that's definitely too long. Maybe

	vfree(image->pinfo.pbuf);
	image->pinfo.pbuf = NULL;

(yep, we shortened purgatory_buf too). Now this looks like proper kernel
code to me :-)

> +
> +	vfree(image->purgatory_info.sechdrs);
> +	image->purgatory_info.sechdrs = NULL;
> +
>  	/* See if architcture has anything to cleanup post load */
>  	arch_kimage_file_post_load_cleanup(image);
>  }
> @@ -1370,6 +1397,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>  	if (ret)
>  		goto out;
>  
> +	ret = kexec_calculate_store_digests(image);

This function name could be shortened too: kexec_calc_digests() or so.
The actual storing could be a separate kexec_store_digests(), i.e., each
function does one thing only.

> +	if (ret)
> +		goto out;
> +
>  	for (i = 0; i < image->nr_segments; i++) {
>  		struct kexec_segment *ksegment;
>  
> @@ -2131,6 +2162,459 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
>  	return 0;
>  }
>  
> +/* Calculate and store the digest of segments */
> +static int kexec_calculate_store_digests(struct kimage *image)
> +{
> +	struct crypto_shash *tfm;
> +	struct shash_desc *desc;
> +	int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;

256 - a magic constant.

> +	size_t desc_size, nullsz;
> +	char *digest = NULL;
> +	void *zero_buf;
> +	struct kexec_sha_region *sha_regions;
> +
> +	tfm = crypto_alloc_shash("sha256", 0, 0);
> +	if (IS_ERR(tfm)) {
> +		ret = PTR_ERR(tfm);
> +		goto out;

The "out" label kfrees digest but we haven't allocated it yet...

> +	}
> +
> +	desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
> +	desc = kzalloc(desc_size, GFP_KERNEL);
> +	if (!desc) {
> +		ret = -ENOMEM;
> +		goto out_free_tfm;
> +	}
> +
> +	zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
> +	if (!zero_buf) {
> +		ret = -ENOMEM;
> +		goto out_free_desc;
> +	}
> +
> +	sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
> +	sha_regions = vzalloc(sha_region_sz);
> +	if (!sha_regions)
> +		goto out_free_zero_buf;
> +
> +	desc->tfm   = tfm;
> +	desc->flags = 0;
> +
> +	ret = crypto_shash_init(desc);
> +	if (ret < 0)
> +		goto out_free_sha_regions;
> +
> +	digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
> +	if (!digest) {
> +		ret = -ENOMEM;
> +		goto out_free_sha_regions;
> +	}

... but this digest is a simple allocation. Could it be you moved it
down but forgot to readjust the labels?
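A hedged user-space sketch of the usual shape: allocate in one order, unwind in exactly the reverse order, and have each label free only what was allocated before the jump to it. `tracked_alloc()`, `tracked_free()`, `live_allocs` and the `fail_at` fault-injection argument are hypothetical test scaffolding, not kernel APIs. The sketch also sets the error code on every failure path. In the quoted code, a failed `sha_regions` vzalloc() jumps to `out_free_zero_buf` with `ret` still 0.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical scaffolding: count live allocations, optionally fail one. */
static int live_allocs;

static void *tracked_alloc(size_t sz, int step, int fail_at)
{
	void *p;

	if (step == fail_at)
		return NULL;              /* simulated allocation failure */
	p = calloc(1, sz);
	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

/* Allocate five resources in order; unwind in reverse on any failure. */
static int calc_digests(int fail_at)
{
	void *tfm, *desc, *zero_buf, *regions, *digest;
	int ret = 0;

	tfm = tracked_alloc(64, 0, fail_at);
	if (!tfm)
		return -ENOMEM;           /* nothing to undo yet */

	desc = tracked_alloc(64, 1, fail_at);
	if (!desc) {
		ret = -ENOMEM;
		goto out_free_tfm;
	}

	zero_buf = tracked_alloc(256, 2, fail_at);
	if (!zero_buf) {
		ret = -ENOMEM;
		goto out_free_desc;
	}

	regions = tracked_alloc(256, 3, fail_at);
	if (!regions) {
		ret = -ENOMEM;            /* don't leave with ret == 0 */
		goto out_free_zero_buf;
	}

	digest = tracked_alloc(32, 4, fail_at);
	if (!digest) {
		ret = -ENOMEM;
		goto out_free_regions;
	}

	/* ... hash the segments into digest here ... */

	tracked_free(digest);
out_free_regions:
	tracked_free(regions);
out_free_zero_buf:
	tracked_free(zero_buf);
out_free_desc:
	tracked_free(desc);
out_free_tfm:
	/* kernel code would release a shash tfm with crypto_free_shash() */
	tracked_free(tfm);
	return ret;
}
```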

> +
> +	/* Traverse through all segments */

Yep, we can see that. Why does it need an explicit comment?

> +	for (j = i = 0; i < image->nr_segments; i++) {
> +		struct kexec_segment *ksegment;
> +		ksegment = &image->segment[i];
> +
> +		/*
> +		 * Skip purgatory as it will be modified once we put digest
> +		 * info in purgatory
> +		 */

Now this comment is perfect right there! :-) It needs a full-stop though.

> +		if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
> +			continue;
> +
> +		ret = crypto_shash_update(desc, ksegment->kbuf,
> +						ksegment->bufsz);

Arg alignment.

> +		if (ret)
> +			break;
> +
> +		nullsz = ksegment->memsz - ksegment->bufsz;
> +		while (nullsz) {
> +			unsigned long bytes = nullsz;
> +			if (bytes > zero_buf_sz)
> +				bytes = zero_buf_sz;
> +			ret = crypto_shash_update(desc, zero_buf, bytes);
> +			if (ret)
> +				break;
> +			nullsz -= bytes;
> +		}

Now this trailing buffer "drainage" could very well use a comment on
what's going on.
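Here is what the loop is for, as far as the quoted code shows. A segment occupies `memsz` bytes, but only the first `bufsz` come from the buffer. The remaining `memsz - bufsz` zero bytes are fed to the hash in chunks of the small zero buffer instead of allocating `memsz` bytes, so the digest covers the whole in-memory segment. Below is a hedged user-space sketch, with 64-bit FNV-1a standing in for SHA-256 only to keep it self-contained. `fnv1a_update()`, `hash_segment()`, `FNV1A_INIT` and `ZERO_CHUNK` (a name for the magic 256) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ZERO_CHUNK 256                   /* size of the reusable zero buffer */
#define FNV1A_INIT 0xcbf29ce484222325ULL /* FNV-1a 64-bit offset basis */

/* 64-bit FNV-1a; a stand-in for crypto_shash_update() with SHA-256. */
static uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Hash bufsz bytes of real data followed by (memsz - bufsz) zero bytes
 * without ever allocating memsz bytes: the zeros come from one small
 * static buffer that is reused as many times as needed.
 */
static uint64_t hash_segment(uint64_t h, const void *buf, size_t bufsz,
			     size_t memsz)
{
	static const unsigned char zero_buf[ZERO_CHUNK];
	size_t nullsz = memsz - bufsz;

	h = fnv1a_update(h, buf, bufsz);
	while (nullsz) {
		size_t bytes = nullsz < ZERO_CHUNK ? nullsz : ZERO_CHUNK;

		h = fnv1a_update(h, zero_buf, bytes);
		nullsz -= bytes;
	}
	return h;
}
```

Hashing `"abc"` with `memsz = 1000` this way gives the same value as hashing a 1000-byte buffer that starts with `"abc"` and is zero-filled.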

> +
> +		if (ret)
> +			break;
> +
> +		sha_regions[j].start = ksegment->mem;
> +		sha_regions[j].len = ksegment->memsz;
> +		j++;
> +	}
> +
> +	if (!ret) {
> +		ret = crypto_shash_final(desc, digest);
> +		if (ret)
> +			goto out_free_sha_regions;
> +		ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
> +				sha_regions, sha_region_sz, 0);
> +		if (ret)
> +			goto out_free_sha_regions;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
> +				digest, SHA256_DIGEST_SIZE, 0);
> +		if (ret)
> +			goto out_free_sha_regions;

Yeah, this block could be a separate kexec_store_digests() function.

> +	}
> +
> +out_free_sha_regions:
> +	vfree(sha_regions);
> +out_free_zero_buf:
> +	kfree(zero_buf);
> +out_free_desc:
> +	kfree(desc);
> +out_free_tfm:
> +	kfree(tfm);
> +out:
> +	kfree(digest);
> +	return ret;
> +}
> +
> +/* Actually load and relcoate purgatory. Lot of code taken from kexec-tools */

s/relcoate/relocate/

> +static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
> +				unsigned long max, int top_down)

Another function which is too big and does at least two things and could
probably be nicely split into two.

> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
> +	unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
> +	unsigned char *buf_addr, *src;
> +	int i, ret = 0, entry_sidx = -1;
> +	Elf_Shdr *sechdrs = NULL, *sechdrs_c;
> +	void *purgatory_buf = NULL;
> +
> +	/*
> +	 * sechdrs_c points to section headers in purgatory and are read
> +	 * only. No modifications allowed.
> +	 */

Then do

	const Elf_Shdr * const sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;

to enforce it?
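A hedged user-space sketch of what the double `const` enforces, using a hypothetical `struct toy_shdr` and `dup_shdrs()`. The embedded headers can only be read through `ro`, so rewriting `offset` into an absolute address has to happen on a heap copy that the caller owns: calloc() + memcpy() here, vzalloc() + memcpy() in the patch. In the quoted patch, that copy ends up in `pi->sechdrs` and is vfree()d in `kimage_file_post_load_cleanup()`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SHT_NOBITS 8u

/* Hypothetical, trimmed-down stand-in for Elf_Shdr. */
struct toy_shdr {
	unsigned int type;
	uintptr_t offset;   /* file offset, later rewritten to an address */
	uintptr_t addr;
	size_t size;
};

/*
 * Copy the read-only headers and turn each file offset into an absolute
 * address inside the image at 'base'. SHT_NOBITS sections have no file
 * contents, so their offset is left alone. Returns NULL on allocation
 * failure; the caller must free the copy.
 */
static struct toy_shdr *dup_shdrs(const unsigned char *base,
				  const struct toy_shdr * const ro, int n)
{
	struct toy_shdr *rw;
	int i;

	rw = calloc(n, sizeof(*rw));
	if (!rw)
		return NULL;
	memcpy(rw, ro, n * sizeof(*rw));

	for (i = 0; i < n; i++) {
		if (rw[i].type == TOY_SHT_NOBITS)
			continue;
		rw[i].offset = (uintptr_t)base + rw[i].offset;
	}
	/* Neither "ro[0].offset = 0;" nor "ro = NULL;" would compile here. */
	return rw;
}
```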

> +	sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
> +
> +	/*
> +	 * We can not modify sechdrs_c[] and its fields. It is read only.
> +	 * Copy it over to a local copy where one can store some temporary
> +	 * data and free it at the end. We need to modify ->sh_addr and

What is freeing it when we store it into pi->sechdrs and return? Or
doesn't it need to be freed?

> +	 * ->sh_offset fields to keep track permanent and temporary locations
> +	 * of sections.
> +	 */
> +	sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> +	if (!sechdrs)
> +		return -ENOMEM;
> +
> +	memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> +
> +	/*
> +	 * We seem to have multiple copies of sections. First copy is which
> +	 * is embedded in kernel in read only section. Some of these sections
> +	 * will be copied to a temporary buffer and relocated. And these
> +	 * sections will finally be copied to their final detination at

"destination"

> +	 * segment load time.
> +	 *
> +	 * Use ->sh_offset to reflect section address in memory. It will
> +	 * point to original read only copy if section is not allocatable.
> +	 * Otherwise it will point to temporary copy which will be relocated.
> +	 *
> +	 * Use ->sh_addr to contain final address of the section where it
> +	 * will go during execution time.
> +	 */
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (sechdrs[i].sh_type == SHT_NOBITS)
> +			continue;
> +
> +		sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
> +						sechdrs[i].sh_offset;
> +	}
> +
> +	entry = pi->ehdr->e_entry;
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
> +			continue;
> +
> +		/* Make entry section relative */
> +		if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
> +		    ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
> +		     pi->ehdr->e_entry)) {
> +			entry_sidx = i;
> +			entry -= sechdrs[i].sh_addr;
> +			break;
> +		}
> +	}
> +
> +	/* Find the RAM size requirements of relocatable object */
> +	buf_align = 1;
> +	bss_align = 1;
> +	buf_sz = 0;
> +	bss_sz = 0;
> +
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		align = sechdrs[i].sh_addralign;
> +		if (sechdrs[i].sh_type != SHT_NOBITS) {
> +			if (buf_align < align)
> +				buf_align = align;
> +			buf_sz = ALIGN(buf_sz, align);
> +			buf_sz += sechdrs[i].sh_size;
> +		} else {
> +			if (bss_align < align)
> +				bss_align = align;
> +			bss_sz = ALIGN(bss_sz, align);
> +			bss_sz += sechdrs[i].sh_size;
> +		}
> +	}
> +
> +	if (buf_align < bss_align)
> +		buf_align = bss_align;
> +	bss_pad = 0;
> +	if (buf_sz & (bss_align - 1))
> +		bss_pad = bss_align - (buf_sz & (bss_align - 1));
> +
> +	memsz = buf_sz + bss_pad + bss_sz;
> +
> +	/* Allocate buffer for purgatory */
> +	purgatory_buf = vzalloc(buf_sz);
> +	if (!purgatory_buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Add buffer to segment list */
> +	ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
> +				buf_align, min, max, top_down,
> +				&pi->purgatory_load_addr);
> +	if (ret)
> +		goto out;
> +
> +	/* Load SHF_ALLOC sections */

Here could start a new function.
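A hedged sketch of that second function, assuming the sizes from the first pass are passed in. It only assigns final run-time addresses. File-backed sections are packed upward from `load_addr`, and SHT_NOBITS sections are packed upward from `load_addr + buf_sz + bss_pad`, with the same rounding as the quoted loop. `struct toy_shdr` and `place_sections()` are hypothetical, and the copy into the temporary buffer is omitted for brevity.

```c
#include <stddef.h>

#define TOY_SHT_NOBITS 8u
#define TOY_SHF_ALLOC  0x2ul
#define TOY_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical, trimmed-down stand-in for Elf_Shdr. */
struct toy_shdr {
	unsigned int type;
	unsigned long flags;
	unsigned long addr;      /* out: final run-time address */
	unsigned long size;
	unsigned long addralign; /* assumed 0 or a power of two */
};

/* Assign ->addr for every SHF_ALLOC section: data first, then bss. */
static void place_sections(struct toy_shdr *sh, int n,
			   unsigned long load_addr, unsigned long buf_sz,
			   unsigned long bss_pad)
{
	unsigned long data_addr = load_addr;
	unsigned long bss_addr = load_addr + buf_sz + bss_pad;
	int i;

	for (i = 0; i < n; i++) {
		unsigned long align = sh[i].addralign ? sh[i].addralign : 1;

		if (!(sh[i].flags & TOY_SHF_ALLOC))
			continue;

		if (sh[i].type != TOY_SHT_NOBITS) {
			data_addr = TOY_ALIGN(data_addr, align);
			sh[i].addr = data_addr;
			data_addr += sh[i].size;
		} else {
			bss_addr = TOY_ALIGN(bss_addr, align);
			sh[i].addr = bss_addr;
			bss_addr += sh[i].size;
		}
	}
}
```

Take `load_addr = 0x100000`, a 100-byte section aligned to 16, a 10-byte section aligned to 8 and 40 bytes of bss aligned to 32 (so `buf_sz = 114` and `bss_pad = 14`). The sections land at 0x100000, 0x100068 and 0x100080.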

> +	buf_addr = purgatory_buf;
> +	load_addr = pi->purgatory_load_addr;
> +	data_addr = load_addr;
> +	bss_addr = load_addr + buf_sz + bss_pad;
> +
> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		align = sechdrs[i].sh_addralign;
> +		if (sechdrs[i].sh_type != SHT_NOBITS) {
> +			data_addr = ALIGN(data_addr, align);
> +			off = data_addr - load_addr;
> +			/* We have already modifed ->sh_offset to keep addr */
> +			src = (char *) sechdrs[i].sh_offset;
> +			memcpy(buf_addr + off, src, sechdrs[i].sh_size);
> +
> +			/* Store load address and source address of section */
> +			sechdrs[i].sh_addr = data_addr;
> +
> +			/*
> +			 * This section got copied to temporary buffer. Update
> +			 * ->sh_offset accordingly.
> +			 */
> +			sechdrs[i].sh_offset = (unsigned long)(buf_addr + off);
> +
> +			/* Advance to the next address */
> +			data_addr += sechdrs[i].sh_size;
> +		} else {
> +			bss_addr = ALIGN(bss_addr, align);
> +			sechdrs[i].sh_addr = bss_addr;
> +			bss_addr += sechdrs[i].sh_size;
> +		}
> +	}
> +
> +	/* update entry based on entry section position */
> +	if (entry_sidx >= 0)
> +		entry += sechdrs[entry_sidx].sh_addr;
> +
> +	/* Set the entry point of purgatory */
> +	image->start = entry;
> +
> +	/* Apply relocations */

From here-on could start a new function.

> +	for (i = 0; i < pi->ehdr->e_shnum; i++) {
> +		Elf_Shdr *section, *symtab;
> +
> +		if (sechdrs[i].sh_type != SHT_RELA &&
> +		    sechdrs[i].sh_type != SHT_REL)
> +			continue;
> +
> +		if (sechdrs[i].sh_info > pi->ehdr->e_shnum ||
> +		    sechdrs[i].sh_link > pi->ehdr->e_shnum) {
> +			ret = -ENOEXEC;
> +			goto out;
> +		}
> +
> +		section = &sechdrs[sechdrs[i].sh_info];
> +		symtab = &sechdrs[sechdrs[i].sh_link];
> +
> +		if (!(section->sh_flags & SHF_ALLOC))
> +			continue;
> +
> +		if (symtab->sh_link > pi->ehdr->e_shnum)
> +			/* Invalid section number? */
> +			continue;
> +
> +		ret = -EOPNOTSUPP;
> +		if (sechdrs[i].sh_type == SHT_RELA)
> +			ret = arch_kexec_apply_relocations_add(sechdrs,
> +							pi->ehdr->e_shnum, i);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	pi->sechdrs = sechdrs;
> +	pi->purgatory_buf = purgatory_buf;
> +	return ret;
> +out:
> +	vfree(sechdrs);
> +	vfree(purgatory_buf);
> +	return ret;
> +}
> +
> +/* Load relocatable purgatory object and relocate it appropriately */
> +int kexec_load_purgatory(struct kimage *image, unsigned long min,
> +		unsigned long max, int top_down, unsigned long *load_addr)
> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	int ret;
> +
> +	if (kexec_purgatory_size <= 0)
> +		return -EINVAL;
> +
> +	if (kexec_purgatory_size < sizeof(Elf_Ehdr))
> +		return -ENOEXEC;
> +
> +	pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
> +
> +	if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
> +	    || pi->ehdr->e_type != ET_REL
> +	    || !elf_check_arch(pi->ehdr)
> +	    || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
> +		return -ENOEXEC;
> +
> +	if (pi->ehdr->e_shoff >= kexec_purgatory_size
> +	    || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
> +	    kexec_purgatory_size - pi->ehdr->e_shoff))
> +		return -ENOEXEC;
> +
> +	ret = elf_rel_load_relocate(image, min, max, top_down);
> +	if (ret)
> +		return ret;
> +
> +	*load_addr = image->purgatory_info.purgatory_load_addr;
> +	return 0;
> +}
> +
> +static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
> +						const char *name)
> +{
> +	Elf_Sym *syms;
> +	Elf_Shdr *sechdrs;
> +	Elf_Ehdr *ehdr;
> +	int i, k;
> +	const char *strtab;
> +
> +	if (!pi->sechdrs || !pi->ehdr)
> +		return NULL;
> +
> +	sechdrs = pi->sechdrs;
> +	ehdr = pi->ehdr;
> +
> +	for (i = 0; i < ehdr->e_shnum; i++) {
> +		if (sechdrs[i].sh_type != SHT_SYMTAB)
> +			continue;
> +
> +		if (sechdrs[i].sh_link > ehdr->e_shnum)
> +			/* Invalid stratab section number */

"strtab"

> +			continue;
> +		strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
> +		syms = (Elf_Sym *)sechdrs[i].sh_offset;
> +
> +		/* Go through symbols for a match */
> +		for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
> +			if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
> +				continue;
> +
> +			if (strcmp(strtab + syms[k].st_name, name) != 0)
> +				continue;
> +
> +			if (syms[k].st_shndx == SHN_UNDEF ||
> +			    syms[k].st_shndx > ehdr->e_shnum) {
> +				pr_debug("Symbol: %s has bad section index %d.\n",
> +						name, syms[k].st_shndx);
> +				return NULL;
> +			}
> +
> +			/* Found the symbol we are looking for */
> +			return &syms[k];
> +		}
> +	}
> +
> +	return NULL;
> +}
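A hedged user-space sketch of the lookup: find an SHT_SYMTAB section, follow its `sh_link` to the string table, and compare each global symbol's name at `strtab + st_name`. The toy types and `find_symbol()` are hypothetical; the constants match the ELF values for SHT_SYMTAB (2), SHN_UNDEF (0) and STB_GLOBAL (1). Valid section indices run from 0 to `e_shnum - 1`, so the sketch rejects an index equal to the section count (`>=`). The quoted checks use `>` instead.

```c
#include <stddef.h>
#include <string.h>

#define TOY_SHT_SYMTAB    2u
#define TOY_SHN_UNDEF     0u
#define TOY_STB_GLOBAL    1u
#define TOY_ST_BIND(info) ((info) >> 4)

/* Hypothetical stand-ins for Elf_Shdr and Elf_Sym. */
struct toy_shdr {
	unsigned int type;
	unsigned int link;       /* for SHT_SYMTAB: index of its strtab */
	const void *data;        /* section contents (like rewritten sh_offset) */
	size_t size;
};

struct toy_sym {
	unsigned int name;       /* offset into the linked string table */
	unsigned char info;      /* binding in the high nibble */
	unsigned int shndx;
};

static const struct toy_sym *find_symbol(const struct toy_shdr *sh,
					 unsigned int shnum, const char *name)
{
	unsigned int i;
	size_t k;

	for (i = 0; i < shnum; i++) {
		const struct toy_sym *syms;
		const char *strtab;

		if (sh[i].type != TOY_SHT_SYMTAB)
			continue;
		if (sh[i].link >= shnum)         /* invalid strtab index */
			continue;

		strtab = sh[sh[i].link].data;
		syms = sh[i].data;

		for (k = 0; k < sh[i].size / sizeof(*syms); k++) {
			if (TOY_ST_BIND(syms[k].info) != TOY_STB_GLOBAL)
				continue;
			if (strcmp(strtab + syms[k].name, name) != 0)
				continue;
			/* Found it; an undefined or out-of-range index is unusable. */
			if (syms[k].shndx == TOY_SHN_UNDEF || syms[k].shndx >= shnum)
				return NULL;
			return &syms[k];
		}
	}
	return NULL;
}
```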
> +
> +void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
> +{
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	Elf_Sym *sym;
> +	Elf_Shdr *sechdr;
> +
> +	sym = kexec_purgatory_find_symbol(pi, name);
> +	if (!sym)
> +		return ERR_PTR(-EINVAL);
> +
> +	sechdr = &pi->sechdrs[sym->st_shndx];
> +
> +	/*
> +	 * Returns the address where symbol will finally be loaded after
> +	 * kexec_load_segment()
> +	 */
> +	return (void *)(sechdr->sh_addr + sym->st_value);
> +}
> +
> +/*
> + * Get or set value of a symbol. If "get_value" is true, symbol value is
> + * returned in buf otherwise symbol value is set based on value in buf.
> + */
> +int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
> +				void *buf, unsigned int size, bool get_value)
> +{
> +	Elf_Sym *sym;
> +	Elf_Shdr *sechdrs;
> +	struct purgatory_info *pi = &image->purgatory_info;
> +	char *sym_buf;
> +
> +	sym = kexec_purgatory_find_symbol(pi, name);
> +	if (!sym)
> +		return -EINVAL;
> +
> +	if (sym->st_size != size) {
> +		pr_debug("Symbol: %s size is not right\n", name);

Should probably be pr_err because it is an error, right? And then

	pr_err("Symbol %s size mismatch: %llu vs %u\n", name,
	       (unsigned long long)sym->st_size, size);

> +		return -EINVAL;
> +	}
> +
> +	sechdrs = pi->sechdrs;
> +
> +	if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
> +		pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",

			... Cannot %s\n", name, get_value ? "get" : "set");

> +					name);
> +		return -EINVAL;
> +	}
> +
> +	sym_buf = (unsigned char *)sechdrs[sym->st_shndx].sh_offset +
> +					sym->st_value;
> +
> +	if (get_value)
> +		memcpy((void *)buf, sym_buf, size);
> +	else
> +		memcpy((void *)sym_buf, buf, size);
> +
> +	return 0;
> +}
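A hedged user-space sketch of the helper with the review comments above folded in. A size mismatch prints both sizes, and the bss message names the refused operation, with the arguments in format-string order. `struct toy_sym`, `sym_get_set()` and printing to stderr instead of calling pr_err() are assumptions for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in: a symbol already resolved to its section. */
struct toy_sym {
	const char *name;
	unsigned long value;      /* offset of the symbol within its section */
	unsigned long size;
	bool in_bss;              /* section is SHT_NOBITS: no bytes to access */
	unsigned char *sec_data;  /* temporary copy of the section */
};

/* Copy the symbol's bytes into buf (get) or from buf into the section (set). */
static int sym_get_set(const struct toy_sym *sym, void *buf,
		       unsigned int size, bool get_value)
{
	unsigned char *sym_buf;

	if (sym->size != size) {
		fprintf(stderr, "Symbol %s size mismatch: %lu vs %u\n",
			sym->name, sym->size, size);
		return -EINVAL;
	}
	if (sym->in_bss) {
		fprintf(stderr, "Symbol %s is in a bss section. Cannot %s\n",
			sym->name, get_value ? "get" : "set");
		return -EINVAL;
	}

	sym_buf = sym->sec_data + sym->value;
	if (get_value)
		memcpy(buf, sym_buf, size);
	else
		memcpy(sym_buf, buf, size);
	return 0;
}
```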
>  
>  /*
>   * Move into place and start executing a preloaded standalone
> -- 
> 1.9.0
> 
> 

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-06 18:02           ` Vivek Goyal
@ 2014-06-11 14:13             ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-11 14:13 UTC
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Fri, Jun 06, 2014 at 02:02:14PM -0400, Vivek Goyal wrote:
> > If you want to make it more explicit, you could do
> > 
> > #define RES_OK		0
> > #define RES_ERR		1
> > #define RES_STOP	2
> 
> You are saying that called back function should return this to walk_*
> functions? But then we lose the actual error code which should be
> passed to parent function which actually called walk_* function.

Well, RES_STOP could implicitly mean stop and no error. Also, if
you really want to return back the retval, you could slice it into
bitfields:

retval = [ ... 8 | 7 ... 0]

where [7:0] is the return value and bits from 8 onwards contain
different flags like RES_STOP. I did it just for the fun of it and it
looks like below. I honestly can't say that I'm crazy about it though.
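Without the kexec specifics, the proposed encoding can be tried out in user space. This sketch reuses RET_BITS, RET_MASK, RETVAL() and RES_STOP from the ioport.h hunk below. Two small changes are made: RES_STOP is spelled without the kernel's BIT() helper, and RETVAL() gets an explicit int cast. struct range, walk_ranges() and fits_cb() are hypothetical. Callbacks return a positive errno in bits [7:0] and OR in RES_STOP to end the walk.

```c
#include <errno.h>

#define RET_BITS	8
#define RET_MASK	((1U << RET_BITS) - 1)
/* Negated errno taken from bits [7:0]; 0 when only flag bits are set */
#define RETVAL(r)	(-(int)((r) & RET_MASK))
/* Control flag above the errno bits (BIT(0 + RET_BITS) in the kernel) */
#define RES_STOP	(1U << RET_BITS)

struct range {
	unsigned long long start, end;	/* inclusive */
};

/* Hypothetical walker modelled on walk_system_ram_res() */
static int walk_ranges(const struct range *r, int n, void *arg,
		       int (*func)(unsigned long long, unsigned long long, void *))
{
	int ret = EINVAL;	/* assumption: report "no ranges" as -EINVAL */

	for (int i = 0; i < n; i++) {
		ret = func(r[i].start, r[i].end, arg);
		if (ret & RES_STOP)
			break;
	}
	return RETVAL(ret);
}

/* Hypothetical callback: stop at the first range of at least *arg bytes */
static int fits_cb(unsigned long long start, unsigned long long end, void *arg)
{
	unsigned long long need = *(unsigned long long *)arg;

	if (end - start + 1 < need)
		return EADDRNOTAVAIL;	/* no stop flag: try the next range */
	return RES_STOP;		/* errno bits zero: success, stop walking */
}
```

In this scheme callbacks must return positive errno values. A negative return such as -EINVAL has all upper bits set, so it would also trip RES_STOP, and RETVAL() would yield a meaningless code. Only errno values up to 255 fit.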

--
Index: b/kernel/resource.c
===================================================================
--- a/kernel/resource.c	2014-06-11 14:49:35.865426300 +0200
+++ b/kernel/resource.c	2014-06-11 15:37:50.050299684 +0200
@@ -371,7 +371,7 @@ static int find_next_iomem_res(struct re
 }
 
 int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
-		void *arg, int (*func)(u64, u64, void *))
+		 void *arg, int (*func)(u64, u64, void *))
 {
 	struct resource res;
 	u64 orig_end;
@@ -384,12 +384,12 @@ int walk_ram_res(char *name, unsigned lo
 	while ((res.start < res.end) &&
 		(find_next_iomem_res(&res, name) >= 0)) {
 		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		if (ret & RES_STOP)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
 	}
-	return ret;
+	return RETVAL(ret);
 }
 
 /*
@@ -441,7 +441,7 @@ static int find_next_system_ram(struct r
  * with pfn can truncate ranges.
  */
 int walk_system_ram_res(u64 start, u64 end, void *arg,
-				int (*func)(u64, u64, void *))
+			int (*func)(u64, u64, void *))
 {
 	struct resource res;
 	u64 orig_end;
@@ -454,12 +454,13 @@ int walk_system_ram_res(u64 start, u64 e
 	while ((res.start < res.end) &&
 		(find_next_system_ram(&res, "System RAM") >= 0)) {
 		ret = (*func)(res.start, res.end, arg);
-		if (ret)
+		if (ret & RES_STOP)
 			break;
 		res.start = res.end + 1;
 		res.end = orig_end;
 	}
-	return ret;
+
+	return RETVAL(ret);
 }
 
 #if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
Index: b/kernel/kexec.c
===================================================================
--- a/kernel/kexec.c	2014-06-11 14:49:35.865426300 +0200
+++ b/kernel/kexec.c	2014-06-11 16:03:26.264232477 +0200
@@ -2063,8 +2063,9 @@ static int __kexec_add_segment(struct ki
 }
 
 static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+				    struct kexec_buf *kbuf)
 {
+	int ret = 0;
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
 
@@ -2076,7 +2077,7 @@ static int locate_mem_hole_top_down(unsi
 		temp_start = temp_start & (~(kbuf->buf_align - 1));
 
 		if (temp_start < start || temp_start < kbuf->buf_min)
-			return 0;
+			return EADDRNOTAVAIL;
 
 		temp_end = temp_start + kbuf->memsz - 1;
 
@@ -2098,12 +2099,15 @@ static int locate_mem_hole_top_down(unsi
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	ret |= RES_STOP;
+
+	return ret;
 }
 
 static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
-					struct kexec_buf *kbuf)
+				     struct kexec_buf *kbuf)
 {
+	int ret = 0;
 	struct kimage *image = kbuf->image;
 	unsigned long temp_start, temp_end;
 
@@ -2114,7 +2118,7 @@ static int locate_mem_hole_bottom_up(uns
 		temp_end = temp_start + kbuf->memsz - 1;
 
 		if (temp_end > end || temp_end > kbuf->buf_max)
-			return 0;
+			return EADDRNOTAVAIL;
 		/*
 		 * Make sure this does not conflict with any of existing
 		 * segments
@@ -2133,7 +2137,9 @@ static int locate_mem_hole_bottom_up(uns
 				kbuf->memsz);
 
 	/* Stop navigating through remaining System RAM ranges */
-	return 1;
+	ret |= RES_STOP;
+
+	return ret;
 }
 
 static int walk_ram_range_callback(u64 start, u64 end, void *arg)
@@ -2141,12 +2147,11 @@ static int walk_ram_range_callback(u64 s
 	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
 	unsigned long sz = end - start + 1;
 
-	/* Returning 0 will take to next memory range */
 	if (sz < kbuf->memsz)
-		return 0;
+		return EADDRNOTAVAIL;
 
 	if (end < kbuf->buf_min || start > kbuf->buf_max)
-		return 0;
+		return EADDRNOTAVAIL;
 
 	/*
 	 * Allocate memory top down with-in ram range. Otherwise bottom up
@@ -2168,15 +2173,15 @@ int kexec_add_buffer(struct kimage *imag
 		unsigned long buf_max, bool top_down, unsigned long *load_addr)
 {
 
-	unsigned long nr_segments = image->nr_segments, new_nr_segments;
 	struct kexec_segment *ksegment;
 	struct kexec_buf buf, *kbuf;
+	int ret;
 
 	/* Currently adding segment this way is allowed only in file mode */
 	if (!image->file_mode)
 		return -EINVAL;
 
-	if (nr_segments >= KEXEC_SEGMENT_MAX)
+	if (image->nr_segments >= KEXEC_SEGMENT_MAX)
 		return -EINVAL;
 
 	/*
@@ -2208,25 +2213,18 @@ int kexec_add_buffer(struct kimage *imag
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
 	if (image->type == KEXEC_TYPE_CRASH)
-		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
-				crashk_res.start, crashk_res.end, kbuf,
-				walk_ram_range_callback);
+		ret = walk_ram_res("Crash kernel",
+				   IORESOURCE_MEM | IORESOURCE_BUSY,
+				   crashk_res.start, crashk_res.end, kbuf,
+				   walk_ram_range_callback);
 	else
-		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
-
-	/*
-	 * If range could be found successfully, it would have incremented
-	 * the nr_segments value.
-	 */
-	new_nr_segments = image->nr_segments;
+		ret = walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
 
-	/* A suitable memory range could not be found for buffer */
-	if (new_nr_segments == nr_segments)
+	if (ret)
 		return -EADDRNOTAVAIL;
 
 	/* Found a suitable memory range */
-
-	ksegment = &image->segment[new_nr_segments - 1];
+	ksegment = &image->segment[image->nr_segments - 1];
 	*load_addr = ksegment->mem;
 	return 0;
 }
Index: b/include/linux/ioport.h
===================================================================
--- a/include/linux/ioport.h	2014-06-11 14:49:35.865426300 +0200
+++ b/include/linux/ioport.h	2014-06-11 16:02:12.775235692 +0200
@@ -237,6 +237,16 @@ extern int iomem_is_exclusive(u64 addr);
 extern int
 walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 		void *arg, int (*func)(unsigned long, unsigned long, void *));
+
+#define RET_BITS	8
+#define RET_MASK	((1U << RET_BITS) - 1)
+#define RETVAL(r)	(-((r) & RET_MASK))
+
+#define RET_OK		0
+#define RET_ERR		1
+
+#define RES_STOP	BIT(0 + RET_BITS)
+
 extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
 				int (*func)(u64, u64, void *));


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-11 14:13             ` Borislav Petkov
@ 2014-06-11 17:04               ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-11 17:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Wed, Jun 11, 2014 at 04:13:20PM +0200, Borislav Petkov wrote:
> On Fri, Jun 06, 2014 at 02:02:14PM -0400, Vivek Goyal wrote:
> > > If you want to make it more explicit, you could do
> > > 
> > > #define RES_OK		0
> > > #define RES_ERR		1
> > > #define RES_STOP	2
> > 
> > You are saying that called back function should return this to walk_*
> > functions? But then we lose the actual error code which should be
> > passed to parent function which actually called walk_* function.
> 
> Well, RES_STOP could implicitly mean stop and no error. Also, if
> you really want to return back the retval, you could slice it into
> bitfields:
> 
> retval = [ ... 8 | 7 ... 0]
> 
> where [7:0] is the return value and bits from 8 onwards contain
> different flags like RES_STOP. I did it just for the fun of it and it
> looks like below. I honestly can't say that I'm crazy about it though.

You are doing the same thing as I am doing. The only difference is that
I am using a separate bool variable and you are using the upper bits
of the return code to carry that extra information.

I personally think that using a separate bool variable is simpler than
using the upper bits of the return code.
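For comparison, a separate stop flag might look like the sketch below. It is only a guess at the shape of such an interface and not the code in this series. The callback type with a bool *stop out-parameter, walk_ranges() and fits_cb() are all hypothetical. The return value stays a plain negative errno, and the stop decision is passed separately.

```c
#include <errno.h>
#include <stdbool.h>

struct range {
	unsigned long long start, end;	/* inclusive */
};

/* Hypothetical callback type: errno-style return plus explicit stop flag */
typedef int (*range_cb)(unsigned long long start, unsigned long long end,
			void *arg, bool *stop);

static int walk_ranges(const struct range *r, int n, void *arg, range_cb func)
{
	int ret = -EINVAL;	/* assumption: report "no ranges" as -EINVAL */
	bool stop = false;

	for (int i = 0; i < n && !stop; i++)
		ret = func(r[i].start, r[i].end, arg, &stop);
	return ret;
}

/* Hypothetical callback: stop at the first range of at least *arg bytes */
static int fits_cb(unsigned long long start, unsigned long long end,
		   void *arg, bool *stop)
{
	unsigned long long need = *(unsigned long long *)arg;

	if (end - start + 1 < need)
		return -EADDRNOTAVAIL;	/* keep walking */
	*stop = true;
	return 0;
}
```

With this shape, a callback can also abort on a hard error by setting *stop and returning a negative errno, without any bit packing.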

Thanks
Vivek

> 
> --
> Index: b/kernel/resource.c
> ===================================================================
> --- a/kernel/resource.c	2014-06-11 14:49:35.865426300 +0200
> +++ b/kernel/resource.c	2014-06-11 15:37:50.050299684 +0200
> @@ -371,7 +371,7 @@ static int find_next_iomem_res(struct re
>  }
>  
>  int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
> -		void *arg, int (*func)(u64, u64, void *))
> +		 void *arg, int (*func)(u64, u64, void *))
>  {
>  	struct resource res;
>  	u64 orig_end;
> @@ -384,12 +384,12 @@ int walk_ram_res(char *name, unsigned lo
>  	while ((res.start < res.end) &&
>  		(find_next_iomem_res(&res, name) >= 0)) {
>  		ret = (*func)(res.start, res.end, arg);
> -		if (ret)
> +		if (ret & RES_STOP)
>  			break;
>  		res.start = res.end + 1;
>  		res.end = orig_end;
>  	}
> -	return ret;
> +	return RETVAL(ret);
>  }
>  
>  /*
> @@ -441,7 +441,7 @@ static int find_next_system_ram(struct r
>   * with pfn can truncate ranges.
>   */
>  int walk_system_ram_res(u64 start, u64 end, void *arg,
> -				int (*func)(u64, u64, void *))
> +			int (*func)(u64, u64, void *))
>  {
>  	struct resource res;
>  	u64 orig_end;
> @@ -454,12 +454,13 @@ int walk_system_ram_res(u64 start, u64 e
>  	while ((res.start < res.end) &&
>  		(find_next_system_ram(&res, "System RAM") >= 0)) {
>  		ret = (*func)(res.start, res.end, arg);
> -		if (ret)
> +		if (ret & RES_STOP)
>  			break;
>  		res.start = res.end + 1;
>  		res.end = orig_end;
>  	}
> -	return ret;
> +
> +	return RETVAL(ret);
>  }
>  
>  #if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
> Index: b/kernel/kexec.c
> ===================================================================
> --- a/kernel/kexec.c	2014-06-11 14:49:35.865426300 +0200
> +++ b/kernel/kexec.c	2014-06-11 16:03:26.264232477 +0200
> @@ -2063,8 +2063,9 @@ static int __kexec_add_segment(struct ki
>  }
>  
>  static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
> -					struct kexec_buf *kbuf)
> +				    struct kexec_buf *kbuf)
>  {
> +	int ret = 0;
>  	struct kimage *image = kbuf->image;
>  	unsigned long temp_start, temp_end;
>  
> @@ -2076,7 +2077,7 @@ static int locate_mem_hole_top_down(unsi
>  		temp_start = temp_start & (~(kbuf->buf_align - 1));
>  
>  		if (temp_start < start || temp_start < kbuf->buf_min)
> -			return 0;
> +			return EADDRNOTAVAIL;
>  
>  		temp_end = temp_start + kbuf->memsz - 1;
>  
> @@ -2098,12 +2099,15 @@ static int locate_mem_hole_top_down(unsi
>  				kbuf->memsz);
>  
>  	/* Stop navigating through remaining System RAM ranges */
> -	return 1;
> +	ret |= RES_STOP;
> +
> +	return ret;
>  }
>  
>  static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
> -					struct kexec_buf *kbuf)
> +				     struct kexec_buf *kbuf)
>  {
> +	int ret = 0;
>  	struct kimage *image = kbuf->image;
>  	unsigned long temp_start, temp_end;
>  
> @@ -2114,7 +2118,7 @@ static int locate_mem_hole_bottom_up(uns
>  		temp_end = temp_start + kbuf->memsz - 1;
>  
>  		if (temp_end > end || temp_end > kbuf->buf_max)
> -			return 0;
> +			return EADDRNOTAVAIL;
>  		/*
>  		 * Make sure this does not conflict with any of existing
>  		 * segments
> @@ -2133,7 +2137,9 @@ static int locate_mem_hole_bottom_up(uns
>  				kbuf->memsz);
>  
>  	/* Stop navigating through remaining System RAM ranges */
> -	return 1;
> +	ret |= RES_STOP;
> +
> +	return ret;
>  }
>  
>  static int walk_ram_range_callback(u64 start, u64 end, void *arg)
> @@ -2141,12 +2147,11 @@ static int walk_ram_range_callback(u64 s
>  	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
>  	unsigned long sz = end - start + 1;
>  
> -	/* Returning 0 will take to next memory range */
>  	if (sz < kbuf->memsz)
> -		return 0;
> +		return EADDRNOTAVAIL;
>  
>  	if (end < kbuf->buf_min || start > kbuf->buf_max)
> -		return 0;
> +		return EADDRNOTAVAIL;
>  
>  	/*
>  	 * Allocate memory top down with-in ram range. Otherwise bottom up
> @@ -2168,15 +2173,15 @@ int kexec_add_buffer(struct kimage *imag
>  		unsigned long buf_max, bool top_down, unsigned long *load_addr)
>  {
>  
> -	unsigned long nr_segments = image->nr_segments, new_nr_segments;
>  	struct kexec_segment *ksegment;
>  	struct kexec_buf buf, *kbuf;
> +	int ret;
>  
>  	/* Currently adding segment this way is allowed only in file mode */
>  	if (!image->file_mode)
>  		return -EINVAL;
>  
> -	if (nr_segments >= KEXEC_SEGMENT_MAX)
> +	if (image->nr_segments >= KEXEC_SEGMENT_MAX)
>  		return -EINVAL;
>  
>  	/*
> @@ -2208,25 +2213,18 @@ int kexec_add_buffer(struct kimage *imag
>  
>  	/* Walk the RAM ranges and allocate a suitable range for the buffer */
>  	if (image->type == KEXEC_TYPE_CRASH)
> -		walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
> -				crashk_res.start, crashk_res.end, kbuf,
> -				walk_ram_range_callback);
> +		ret = walk_ram_res("Crash kernel",
> +				   IORESOURCE_MEM | IORESOURCE_BUSY,
> +				   crashk_res.start, crashk_res.end, kbuf,
> +				   walk_ram_range_callback);
>  	else
> -		walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> -
> -	/*
> -	 * If range could be found successfully, it would have incremented
> -	 * the nr_segments value.
> -	 */
> -	new_nr_segments = image->nr_segments;
> +		ret = walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
>  
> -	/* A suitable memory range could not be found for buffer */
> -	if (new_nr_segments == nr_segments)
> +	if (ret)
>  		return -EADDRNOTAVAIL;
>  
>  	/* Found a suitable memory range */
> -
> -	ksegment = &image->segment[new_nr_segments - 1];
> +	ksegment = &image->segment[image->nr_segments - 1];
>  	*load_addr = ksegment->mem;
>  	return 0;
>  }
> Index: b/include/linux/ioport.h
> ===================================================================
> --- a/include/linux/ioport.h	2014-06-11 14:49:35.865426300 +0200
> +++ b/include/linux/ioport.h	2014-06-11 16:02:12.775235692 +0200
> @@ -237,6 +237,16 @@ extern int iomem_is_exclusive(u64 addr);
>  extern int
>  walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
>  		void *arg, int (*func)(unsigned long, unsigned long, void *));
> +
> +#define RET_BITS	8
> +#define RET_MASK	((1U << RET_BITS) - 1)
> +#define RETVAL(r)	(-((r) & RET_MASK))
> +
> +#define RET_OK		0
> +#define RET_ERR		1
> +
> +#define RES_STOP	BIT(0 + RET_BITS)
> +
>  extern int
>  walk_system_ram_res(u64 start, u64 end, void *arg,
>  				int (*func)(u64, u64, void *));

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
  2014-06-10 16:31     ` Borislav Petkov
@ 2014-06-11 19:24       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-11 19:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 10, 2014 at 06:31:28PM +0200, Borislav Petkov wrote:

[..]
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 213308a..0f24b61 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1556,6 +1556,8 @@ source kernel/Kconfig.hz
> >  config KEXEC
> >  	bool "kexec system call"
> >  	select BUILD_BIN2C
> > +	select CRYPTO
> > +	select CRYPTO_SHA256
> 
> Ok, but why automatically enable crypto? There's still the old kexec
> method where we don't check any signatures.

Hi Boris,

Thanks for reviewing the patches.

This new syscall requires sha256 even if signature checking does not
happen, because purgatory verifies the checksums of the segments.

I also had to select CRYPTO; otherwise CONFIG_CRYPTO=m broke the build.
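Roughly, the segment check works like this. A digest over all segments is computed and stored at load time, and purgatory recomputes it before jumping to the new kernel. The sketch below shows only that flow. To keep it short and self-contained, it uses 64-bit FNV-1a as a stand-in for SHA-256. FNV-1a is not a cryptographic hash. struct seg and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a loaded kexec segment */
struct seg {
	const void *buf;
	size_t len;
};

/* 64-bit FNV-1a step: a stand-in for SHA-256, NOT cryptographically secure */
static uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;	/* FNV 64-bit prime */
	}
	return h;
}

/* One running digest over all segments, in order */
static uint64_t digest_segments(const struct seg *segs, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */

	for (size_t i = 0; i < n; i++)
		h = fnv1a_update(h, segs[i].buf, segs[i].len);
	return h;
}

/* Purgatory side: recompute and compare with the digest stored at load time */
static bool verify_segments(const struct seg *segs, size_t n, uint64_t expected)
{
	return digest_segments(segs, n) == expected;
}
```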

> 
> Which begs the more important question - shouldn't this new in-kernel
> loading method support also kexec'ing of kernels without any signature
> verifications at all?

I think yes, it should also allow kexecing kernels without any signature.
In fact, in the long term we should deprecate the old syscall and maintain
this new one.

Now, when does signature checking kick in? I think we can define a new
config option, say KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION. This option
will make sure kernel signatures are verified.

Even if KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION=n, signature verification
should still be enforced if secureboot is enabled on the platform.
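The proposed policy reduces to a single predicate. In this sketch, enforce_config stands for the proposed KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION option, which is not an existing Kconfig symbol at this point. secure_boot stands for the platform's secure boot state. Both function names and the -EPERM error code are assumptions.

```c
#include <errno.h>
#include <stdbool.h>

/* Verification is needed if the (proposed) config option forces it, or if
 * the platform booted with secure boot enabled. */
static bool kexec_sig_required(bool enforce_config, bool secure_boot)
{
	return enforce_config || secure_boot;
}

/* Hypothetical load-time gate: an image without a valid signature is
 * refused only when verification is required. */
static int kexec_check_sig(bool enforce_config, bool secure_boot, bool sig_ok)
{
	if (kexec_sig_required(enforce_config, secure_boot) && !sig_ok)
		return -EPERM;
	return 0;
}
```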

As signature verification is not part of this series, I was planning to
post these changes as part of the next series.

> 
> I mean, the main use case is secure boot and all but why not leave it
> configurable for people to decide?

I will make it configurable in the next series. This series does not do
any signature verification yet. I had to select CRYPTO and CRYPTO_SHA256
above only so that the checksum verification logic in purgatory works.


[..]
> > +/* Apply purgatory relocations */
> > +int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,
> 
> apply_..._add? "arch_kexec_apply_relocations" seems fine to me.

Hmm, I'm not sure why I did this. I will change it (unless I find a
good reason for keeping it).

[..]
> > +		case R_X86_64_PC32:
> > +			value -= (u64)address;
> > +			*(u32 *)location = value;
> > +			break;
> > +		default:
> > +			pr_err("kexec: Unknown rela relocation: %llu\n",
> 
> Yep, the "kexec: " string should come from pr_fmt as in the other mail.

Sure. Will use pr_fmt to prefix the "kexec: " string.

> 
> > +					ELF64_R_TYPE(rel[i].r_info));
> > +			return -ENOEXEC;
> > +		}
> > +	}
> > +	return 0;
> > +
> > +overflow:
> > +	pr_err("kexec: overflow in relocation type %d value 0x%lx\n",
> 
> Ditto.

Will do.

[..]
> >  struct kimage {
> >  	kimage_entry_t head;
> >  	kimage_entry_t *entry;
> > @@ -143,6 +165,9 @@ struct kimage {
> >  
> >  	/* Image loader handling the kernel can store a pointer here */
> >  	void *image_loader_data;
> > +
> > +	/* Information for loading purgatory */
> > +	struct purgatory_info purgatory_info;
> 
> Having the member name with the same name as the struct is kinda
> confusing. Also, you could shorten it, which, in turn, would give
> shorter code lines at the sites it is accessed. I.e.,
> 
> 	struct purgatory_info pinfo;
> 
> should be just fine IMHO.

Hmm... I have seen other places use the same name as the structure. But
I am not particular about it, so I will change it. Anyway, in most places
I use a pointer to access it:

struct purgatory_info *pi = &image->purgatory_info;

[..]
> > +/* Apply relocations for rela section */
> > +int __attribute__ ((weak))
> > +arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
> > +					unsigned int relsec)
> > +{
> > +	pr_err(KERN_ERR "kexec: REL relocation unsupported\n");
> 
> pr_err *and* KERN_ERR. Double error level? :-)

Yep. Will fix it. :-)
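This sketch illustrates the two printk points above. The kernel's pr_err() already expands to printk(KERN_ERR pr_fmt(fmt), ...), so two things follow:

- Defining pr_fmt once per file supplies the "kexec: " prefix everywhere.
- Adding KERN_ERR by hand puts a second level marker into the string.

The code is a userspace stand-in. printk() is replaced by snprintf() into a buffer so the result can be inspected, and log_buf is a made-up name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* In the kernel this goes before the first #include, so that printk.h
 * does not fall back to its default pr_fmt(fmt) fmt. */
#define pr_fmt(fmt) "kexec: " fmt

/* Stand-ins mirroring include/linux/kern_levels.h. */
#define KERN_SOH "\001"
#define KERN_ERR KERN_SOH "3"

/* Stand-in for printk(): format into a buffer that can be inspected. */
static char log_buf[256];
#define printk(fmt, ...) \
	snprintf(log_buf, sizeof(log_buf), fmt, ##__VA_ARGS__)

/* Same shape as the kernel's pr_err(): the level is already included. */
#define pr_err(fmt, ...) printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
```

With this definition, `pr_err("Unknown rela relocation: %llu\n", ...)` yields the "kexec: " prefix automatically. `pr_err(KERN_ERR "...")` produces a format string that carries two level markers.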

> 
> > +	return -ENOEXEC;
> > +}
> > +
> >  /*
> >   * Free up tempory buffers allocated which are not needed after image has
> >   * been loaded.
> > @@ -355,6 +376,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
> >  	vfree(image->cmdline_buf);
> >  	image->cmdline_buf = NULL;
> >  
> > +	vfree(image->purgatory_info.purgatory_buf);
> > +	image->purgatory_info.purgatory_buf = NULL;
> 
> Here's what I mean - that's definitely too long. Maybe

Sure, will change it.

> 
> 	vree(image->pinfo.pbuf);
> 	image->pinfo.pbuf = NULL;
> 
> (yep, we shortened purgatory_buf too). Now this looks like proper kernel
> code to me :-)

I would like to retain purgatory_buf. To shorten the lines, I could do this:

	struct purgatory_info *pi = &image->purgatory_info;
	vfree(pi->purgatory_buf);
	pi->purgatory_buf = NULL;

I like the clarity in variable names.
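A sketch of the cleanup written with the local pi pointer. It uses trimmed-down, hypothetical versions of the structures holding only the two buffers from the hunk, and free() in place of vfree(). Resetting each pointer to NULL makes a second call harmless, since free(NULL), like vfree(NULL), does nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Trimmed-down, hypothetical versions of the structures in the patch. */
struct purgatory_info {
	void *purgatory_buf;
	void *sechdrs;
};

struct kimage {
	struct purgatory_info purgatory_info;
};

static void purgatory_cleanup_sketch(struct kimage *image)
{
	/* Short local alias keeps lines short without renaming fields. */
	struct purgatory_info *pi = &image->purgatory_info;

	free(pi->purgatory_buf);	/* vfree() in the kernel */
	pi->purgatory_buf = NULL;

	free(pi->sechdrs);
	pi->sechdrs = NULL;
}
```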


> 
> > +
> > +	vfree(image->purgatory_info.sechdrs);
> > +	image->purgatory_info.sechdrs = NULL;
> > +
> >  	/* See if architcture has anything to cleanup post load */
> >  	arch_kimage_file_post_load_cleanup(image);
> >  }
> > @@ -1370,6 +1397,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> >  	if (ret)
> >  		goto out;
> >  
> > +	ret = kexec_calculate_store_digests(image);
> 
> This function name could be shortened too: kexec_calc_digests() or so.
> The actual storing could be a separate kexec_store_digests. I.e., a
> function does one thing only.

I would like to keep it as one function. Apart from the digest, we also
store the list of regions that have been checksummed, and you will notice
that we skip the purgatory region during checksum calculation.

So the calc() function would have to return quite a lot of information:
the size of the digest, the digest buffer itself (which the caller would
need to free), and the list of SHA regions (which the caller would also
need to free). Keeping it all in one function actually makes it a little
simpler.


[..]
> > +/* Calculate and store the digest of segments */
> > +static int kexec_calculate_store_digests(struct kimage *image)
> > +{
> > +	struct crypto_shash *tfm;
> > +	struct shash_desc *desc;
> > +	int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;
> 
> 256 - a magic constant.

I just wanted a small zero buffer. Is there a global zero buffer
available in the kernel? If not, I could use a PAGE_SIZE zero buffer
instead.

> 
> > +	size_t desc_size, nullsz;
> > +	char *digest = NULL;
> > +	void *zero_buf;
> > +	struct kexec_sha_region *sha_regions;
> > +
> > +	tfm = crypto_alloc_shash("sha256", 0, 0);
> > +	if (IS_ERR(tfm)) {
> > +		ret = PTR_ERR(tfm);
> > +		goto out;
> 
> The "out" label kfrees digest but we haven't allocated it yet...

kfree() can handle NULL, and digest is initialized to NULL.

> 
> > +	}
> > +
> > +	desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
> > +	desc = kzalloc(desc_size, GFP_KERNEL);
> > +	if (!desc) {
> > +		ret = -ENOMEM;
> > +		goto out_free_tfm;
> > +	}
> > +
> > +	zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
> > +	if (!zero_buf) {
> > +		ret = -ENOMEM;
> > +		goto out_free_desc;
> > +	}
> > +
> > +	sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
> > +	sha_regions = vzalloc(sha_region_sz);
> > +	if (!sha_regions)
> > +		goto out_free_zero_buf;
> > +
> > +	desc->tfm   = tfm;
> > +	desc->flags = 0;
> > +
> > +	ret = crypto_shash_init(desc);
> > +	if (ret < 0)
> > +		goto out_free_sha_regions;
> > +
> > +	digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
> > +	if (!digest) {
> > +		ret = -ENOMEM;
> > +		goto out_free_sha_regions;
> > +	}
> 
> ... but this digest is a simple allocation. Could it be you moved it
> down but forgot to readjust the labels?

I see what you are saying. I will fix the allocation ordering and the
label ordering.
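A sketch of the reordering agreed above. Plain malloc()/free() stand in for the kernel allocators, and a made-up fault-injection counter simulates failures. Resources are acquired in one order and released in exactly the reverse order, so each label frees only what already exists.

The sketch also sets ret explicitly on every failure path. Two things in the quoted hunk may be worth fixing in the same pass:

- The vzalloc() failure for sha_regions jumps to out_free_zero_buf with ret still 0.
- The tfm from crypto_alloc_shash() is released with kfree() rather than crypto_free_shash().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Sketch-only fault injection: allocation number fail_at fails (0 = never). */
static int fail_at, nr_allocs, live_allocs;

static void *sketch_alloc(size_t sz)
{
	void *p;

	if (++nr_allocs == fail_at)
		return NULL;
	p = calloc(1, sz);
	if (p)
		live_allocs++;
	return p;
}

static void sketch_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

/* Acquire in order, release in reverse order through the labels. */
static int calc_store_digests_sketch(void)
{
	void *tfm, *desc, *zero_buf, *sha_regions, *digest;
	int ret = 0;

	tfm = sketch_alloc(16);		/* stands in for crypto_alloc_shash() */
	if (!tfm)
		return -ENOMEM;
	desc = sketch_alloc(64);	/* kzalloc(desc_size) */
	if (!desc) {
		ret = -ENOMEM;
		goto out_free_tfm;
	}
	zero_buf = sketch_alloc(256);
	if (!zero_buf) {
		ret = -ENOMEM;
		goto out_free_desc;
	}
	sha_regions = sketch_alloc(512);
	if (!sha_regions) {
		ret = -ENOMEM;		/* must not fall through with ret == 0 */
		goto out_free_zero_buf;
	}
	digest = sketch_alloc(32);	/* SHA256_DIGEST_SIZE */
	if (!digest) {
		ret = -ENOMEM;
		goto out_free_sha_regions;
	}

	/* ... hash segments, store digest and sha_regions in purgatory ... */

	sketch_free(digest);
out_free_sha_regions:
	sketch_free(sha_regions);
out_free_zero_buf:
	sketch_free(zero_buf);
out_free_desc:
	sketch_free(desc);
out_free_tfm:
	sketch_free(tfm);		/* crypto_free_shash() in the real code */
	return ret;
}
```

Injecting a failure at each allocation in turn is a quick way to check that every path leaves nothing allocated.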

> 
> > +
> > +	/* Traverse through all segments */
> 
> Yep, we can see that. Why does it need an explicit comment?

Will remove.

> 
> > +	for (j = i = 0; i < image->nr_segments; i++) {
> > +		struct kexec_segment *ksegment;
> > +		ksegment = &image->segment[i];
> > +
> > +		/*
> > +		 * Skip purgatory as it will be modified once we put digest
> > +		 * info in purgatory
> > +		 */
> 
> Now this comment is perfect right there! :-) It needs a full-stop though.

Will do.

> 
> > +		if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
> > +			continue;
> > +
> > +		ret = crypto_shash_update(desc, ksegment->kbuf,
> > +						ksegment->bufsz);
> 
> Arg alignment.

Will do.

> 
> > +		if (ret)
> > +			break;
> > +
> > +		nullsz = ksegment->memsz - ksegment->bufsz;
> > +		while (nullsz) {
> > +			unsigned long bytes = nullsz;
> > +			if (bytes > zero_buf_sz)
> > +				bytes = zero_buf_sz;
> > +			ret = crypto_shash_update(desc, zero_buf, bytes);
> > +			if (ret)
> > +				break;
> > +			nullsz -= bytes;
> > +		}
> 
> Now this trailing buffer "drainage" could very well use a comment on
> what's going on.

Sure, will put a comment.
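What the "drainage" loop does, as a sketch. A segment occupies memsz bytes in memory, but only its first bufsz bytes come from the buffer; the rest is zero-filled. The digest therefore has to cover memsz - bufsz zero bytes too. Rather than allocating memsz bytes, the loop feeds the same small zero buffer repeatedly.

Assumptions in the code below:

- FNV-1a stands in for SHA-256; any streaming hash behaves the same way.
- ZERO_CHUNK is a hypothetical name for the 256-byte magic constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ZERO_CHUNK 256			/* hypothetical name for the constant */
#define FNV_OFFSET 0xcbf29ce484222325ULL

/* Streaming FNV-1a, a stand-in for crypto_shash_update(). */
static uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Digest a segment as it will look in memory: bufsz bytes of data
 * followed by (memsz - bufsz) bytes of zero padding. */
static uint64_t hash_segment(uint64_t h, const void *kbuf, size_t bufsz,
			     size_t memsz)
{
	/* All-zero static array; the patch kzalloc()s one instead. */
	static const unsigned char zero_buf[ZERO_CHUNK];
	size_t nullsz = memsz - bufsz;

	h = fnv1a_update(h, kbuf, bufsz);
	/* Feed the zero tail in chunks instead of allocating memsz bytes. */
	while (nullsz) {
		size_t bytes = nullsz < ZERO_CHUNK ? nullsz : ZERO_CHUNK;

		h = fnv1a_update(h, zero_buf, bytes);
		nullsz -= bytes;
	}
	return h;
}
```

The chunked result matches hashing a fully materialized, zero-padded copy of the segment, which is what purgatory later sees in memory.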

> 
> > +
> > +		if (ret)
> > +			break;
> > +
> > +		sha_regions[j].start = ksegment->mem;
> > +		sha_regions[j].len = ksegment->memsz;
> > +		j++;
> > +	}
> > +
> > +	if (!ret) {
> > +		ret = crypto_shash_final(desc, digest);
> > +		if (ret)
> > +			goto out_free_sha_regions;
> > +		ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
> > +				sha_regions, sha_region_sz, 0);
> > +		if (ret)
> > +			goto out_free_sha_regions;
> > +
> > +		ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
> > +				digest, SHA256_DIGEST_SIZE, 0);
> > +		if (ret)
> > +			goto out_free_sha_regions;
> 
> Yeah, this block could be a separate kexec_store_digests() function.

As explained above, splitting it out into a separate function requires
passing at least two pointers and their sizes, and those buffers would
then need to be freed by the store() function. I think keeping it right
here is simpler.

> 
> > +	}
> > +
> > +out_free_sha_regions:
> > +	vfree(sha_regions);
> > +out_free_zero_buf:
> > +	kfree(zero_buf);
> > +out_free_desc:
> > +	kfree(desc);
> > +out_free_tfm:
> > +	kfree(tfm);
> > +out:
> > +	kfree(digest);
> > +	return ret;
> > +}
> > +
> > +/* Actually load and relcoate purgatory. Lot of code taken from kexec-tools */
> 
> s/relcoate/relocate/

Will fix.

> 
> > +static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
> > +				unsigned long max, int top_down)
> 
> Another function which is too big and does at least two things and could
> probably be nicely split into two.
> 
> > +{
> > +	struct purgatory_info *pi = &image->purgatory_info;
> > +	unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
> > +	unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
> > +	unsigned char *buf_addr, *src;
> > +	int i, ret = 0, entry_sidx = -1;
> > +	Elf_Shdr *sechdrs = NULL, *sechdrs_c;
> > +	void *purgatory_buf = NULL;
> > +
> > +	/*
> > +	 * sechdrs_c points to section headers in purgatory and are read
> > +	 * only. No modifications allowed.
> > +	 */
> 
> Then do
> 
> 	const Elf_Shdr * const sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
> 
> to enforce it?

Ok, will try to use it.

> 
> > +	sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
> > +
> > +	/*
> > +	 * We can not modify sechdrs_c[] and its fields. It is read only.
> > +	 * Copy it over to a local copy where one can store some temporary
> > +	 * data and free it at the end. We need to modify ->sh_addr and
> 
> What is freeing it when we store it into pi->sechdrs and return? Or
> doesn't it need to be freed?

kimage_file_post_load_cleanup() frees it. Until then, we need to keep
this information around.
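A userspace sketch of the two suggestions around sechdrs_c. It uses the <elf.h> types from glibc (so it is Linux-specific) and malloc()/memcpy() in place of vzalloc(). The function name is hypothetical. The embedded headers are reached only through a const view, so an accidental write fails to compile. The writable copy is handed to the caller, and in the patch the cleanup path frees it later.

```c
#include <assert.h>
#include <elf.h>
#include <stdlib.h>
#include <string.h>

/* Return a writable copy of the section headers of an in-memory ELF
 * image, or NULL on allocation failure. The caller owns the copy (in
 * the patch it is stored in pi->sechdrs and freed at post-load cleanup). */
static Elf64_Shdr *copy_sechdrs(const Elf64_Ehdr *ehdr)
{
	/* Read-only view: writes through sechdrs_c are compile errors. */
	const Elf64_Shdr * const sechdrs_c =
		(const void *)((const char *)ehdr + ehdr->e_shoff);
	size_t sz = ehdr->e_shnum * sizeof(Elf64_Shdr);
	Elf64_Shdr *sechdrs = malloc(sz);

	if (!sechdrs)
		return NULL;
	memcpy(sechdrs, sechdrs_c, sz);
	return sechdrs;
}
```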

> 
> > +	 * ->sh_offset fields to keep track permanent and temporary locations
> > +	 * of sections.
> > +	 */
> > +	sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> > +	if (!sechdrs)
> > +		return -ENOMEM;
> > +
> > +	memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> > +
> > +	/*
> > +	 * We seem to have multiple copies of sections. First copy is which
> > +	 * is embedded in kernel in read only section. Some of these sections
> > +	 * will be copied to a temporary buffer and relocated. And these
> > +	 * sections will finally be copied to their final detination at
> 
> "destination"

Will fix.


[..]
> > +	/* Add buffer to segment list */
> > +	ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
> > +				buf_align, min, max, top_down,
> > +				&pi->purgatory_load_addr);
> > +	if (ret)
> > +		goto out;
> > +
> > +	/* Load SHF_ALLOC sections */
> 
> Here could start a new function.

I can try, but I think there is too much context information that would
need to be passed through function parameters.

[..]
> > +	/* update entry based on entry section position */
> > +	if (entry_sidx >= 0)
> > +		entry += sechdrs[entry_sidx].sh_addr;
> > +
> > +	/* Set the entry point of purgatory */
> > +	image->start = entry;
> > +
> > +	/* Apply relocations */
> 
> >From here-on could start a new function.

You seem to be asking for three functions: parse(), load() and
relocate(). Again, I will take a closer look and see if that is easily
doable. My feeling is that they are very tightly coupled and would need
many function parameters.

[..]
> > +	for (i = 0; i < ehdr->e_shnum; i++) {
> > +		if (sechdrs[i].sh_type != SHT_SYMTAB)
> > +			continue;
> > +
> > +		if (sechdrs[i].sh_link > ehdr->e_shnum)
> > +			/* Invalid stratab section number */
> 
> "strtab"

Will fix.
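A sketch of the symbol-table walk with the sh_link check moved into a helper. The helper names are hypothetical, and the types come from the <elf.h> in glibc.

Valid section indices run from 0 to e_shnum - 1. A check that rejects only sh_link > e_shnum therefore still lets sh_link == e_shnum through, so this sketch uses >= semantics instead. Skipping the entry on a bad index is an assumption, since the quoted hunk is cut off at that point.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>

/* A section index is usable only if it lies in [0, e_shnum). */
static bool shndx_valid(const Elf64_Ehdr *ehdr, Elf64_Word idx)
{
	return idx < ehdr->e_shnum;
}

/* Return the index of the first SHT_SYMTAB section whose sh_link is a
 * valid section index, or -1 if there is none. */
static int find_symtab(const Elf64_Ehdr *ehdr, const Elf64_Shdr *sechdrs)
{
	for (int i = 0; i < ehdr->e_shnum; i++) {
		if (sechdrs[i].sh_type != SHT_SYMTAB)
			continue;
		if (!shndx_valid(ehdr, sechdrs[i].sh_link))
			continue;	/* invalid strtab section number */
		return i;
	}
	return -1;
}
```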

[..]
> > +int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
> > +				void *buf, unsigned int size, bool get_value)
> > +{
> > +	Elf_Sym *sym;
> > +	Elf_Shdr *sechdrs;
> > +	struct purgatory_info *pi = &image->purgatory_info;
> > +	char *sym_buf;
> > +
> > +	sym = kexec_purgatory_find_symbol(pi, name);
> > +	if (!sym)
> > +		return -EINVAL;
> > +
> > +	if (sym->st_size != size) {
> > +		pr_debug("Symbol: %s size is not right\n", name);
> 
> Should probably be pr_err because it is an error, right? And then
> 
> 	pr_err("Symbol %s size mismatch: %d vs %d\n", name, sym->st_size, size);

It is an error; that's why I return -EINVAL. We are not verbose about
every error, and I felt it did not have to be KERN_ERR. But I don't feel
strongly about it, so I will change it to pr_err().

I will also include the expected and actual sizes in the message.

> 
> > +		return -EINVAL;
> > +	}
> > +
> > +	sechdrs = pi->sechdrs;
> > +
> > +	if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
> > +		pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",
> 
> 			... Cannot %s\n", (get_value ? "get" : "set"), name);

Will do.
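A self-contained sketch of the get/set logic with both logging changes applied. The symbol record and function name are simplified, hypothetical stand-ins for the patch's Elf_Sym/Elf_Shdr handling, and fprintf() stands in for pr_err()/pr_debug(). Two details differ from the suggestions above:

- st_size is a 64-bit Elf64_Xword, so it is cast and printed with %llu rather than passed to %d.
- name stays first in the argument list so that it matches the first %s in the format.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical stand-in for a purgatory symbol. */
struct sketch_sym {
	const char *name;
	uint64_t st_size;	/* Elf64_Xword in the real ELF types */
	bool in_bss;		/* section is SHT_NOBITS: no backing bytes */
	unsigned char *data;	/* where the symbol's bytes live */
};

static int get_set_symbol(struct sketch_sym *syms, size_t nsyms,
			  const char *name, void *buf, unsigned int size,
			  bool get_value)
{
	struct sketch_sym *sym = NULL;

	for (size_t i = 0; i < nsyms; i++) {
		if (strcmp(syms[i].name, name) == 0) {
			sym = &syms[i];
			break;
		}
	}
	if (!sym)
		return -EINVAL;

	if (sym->st_size != size) {
		fprintf(stderr, "Symbol %s size mismatch: expected %llu, got %u\n",
			name, (unsigned long long)sym->st_size, size);
		return -EINVAL;
	}

	if (sym->in_bss) {
		fprintf(stderr, "Symbol %s is in a bss section. Cannot %s\n",
			name, get_value ? "get" : "set");
		return -EINVAL;
	}

	/* Copy out of, or into, the symbol's storage. */
	if (get_value)
		memcpy(buf, sym->data, size);
	else
		memcpy(sym->data, buf, size);
	return 0;
}
```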

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-12  5:42   ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-12  5:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/03/14 at 09:06am, Vivek Goyal wrote:
> Hi,
> 
> This is V3 of the patchset. Previous versions were posted here.
> 
> V1: https://lkml.org/lkml/2013/11/20/540
> V2: https://lkml.org/lkml/2014/1/27/331
> 
> Changes since v2:
> 
> - Took care of most of the review comments from V2.
> - Added support for kexec/kdump on EFI systems.
> - Dropped support for loading ELF vmlinux.
> 
> This patch series is generated on top of 3.15.0-rc8. It also requires a
> two patch cleanup series which is sitting in -tip tree here.
> 
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot
> 
> This patch series does not do kernel signature verification yet. I plan
> to post another patch series for that. Now bzImage is already signed
> with PKCS7 signature I plan to parse and verify those signatures.
> 
> Primary goal of this patchset is to prepare groundwork so that kernel
> image can be signed and signatures be verified during kexec load. This
> should help with two things.
> 
> - It should allow kexec/kdump on secureboot enabled machines.
> 
> - In general it can help even without secureboot. By being able to verify
>   kernel image signature in kexec, it should help with avoiding module
>   signing restrictions. Matthew Garret showed how to boot into a custom
>   kernel, modify first kernel's memory and then jump back to old kernel and
>   bypass any policy one wants to.
> 
> Any feedback is welcome.

Hi, Vivek

In the EFI ioremapping case, the 3.15 kernel does not save the EFI
runtime maps if efi=old_map is used. So you need to detect this and fail
the kexec file load.

Otherwise the patchset works for me.

Thanks
Dave

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-12  5:42   ` Dave Young
@ 2014-06-12 12:36     ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-12 12:36 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On Thu, Jun 12, 2014 at 01:42:03PM +0800, Dave Young wrote:
> On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V3 of the patchset. Previous versions were posted here.
> > 
> > V1: https://lkml.org/lkml/2013/11/20/540
> > V2: https://lkml.org/lkml/2014/1/27/331
> > 
> > Changes since v2:
> > 
> > - Took care of most of the review comments from V2.
> > - Added support for kexec/kdump on EFI systems.
> > - Dropped support for loading ELF vmlinux.
> > 
> > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > two patch cleanup series which is sitting in -tip tree here.
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot
> > 
> > This patch series does not do kernel signature verification yet. I plan
> > to post another patch series for that. Now that bzImage is already signed
> > with a PKCS7 signature, I plan to parse and verify those signatures.
> > 
> > The primary goal of this patchset is to prepare the groundwork so that the
> > kernel image can be signed and its signature verified during kexec load.
> > This should help with two things.
> > 
> > - It should allow kexec/kdump on secureboot-enabled machines.
> > 
> > - In general it can help even without secureboot. By being able to verify
> >   the kernel image signature in kexec, it should help prevent module
> >   signing restrictions from being bypassed. Matthew Garrett showed how to
> >   boot into a custom kernel, modify the first kernel's memory, then jump
> >   back to the old kernel and bypass any policy one wants to.
> > 
> > Any feedback is welcome.
> 
> Hi, Vivek
> 
> For the EFI ioremapping case: in the 3.15 kernel, EFI runtime maps will
> not be saved if efi=old_map is used. So you need to detect this and fail
> the kexec file load.
> 
> Otherwise the patchset works for me.

Thanks Dave. I will make sure that kexec loading fails in the case of the
old mapping. I don't want to support that old "noefi" mode in this new
system call. Even SGI is planning to fix their firmware to support 1:1
mapping.
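In code, the check Vivek agrees to add is an early bail-out in the load path. A userspace model under stated assumptions: `efi_old_memmap` is a hypothetical stand-in for however the kernel records that `efi=old_map` was given, `kexec_file_check_efi()` is a hypothetical name, and `-EOPNOTSUPP` is just one plausible error code:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's record that "efi=old_map" was used. */
static bool efi_old_memmap;

/*
 * Hypothetical check run early in a kexec_file_load() path. With the old
 * mapping, the 3.15 kernel does not save the EFI runtime maps, so the
 * second kernel could not reuse them: refuse the load up front instead
 * of failing later.
 */
static int kexec_file_check_efi(bool efi_booted)
{
	if (efi_booted && efi_old_memmap)
		return -EOPNOTSUPP;
	return 0;
}
```

Non-EFI boots are unaffected; only the EFI plus old_map combination is rejected.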

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-09 15:41             ` Vivek Goyal
@ 2014-06-13  7:50               ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13  7:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Mon, Jun 09, 2014 at 11:41:37AM -0400, Vivek Goyal wrote:
> IIUC, COMMAND_LINE_SIZE gives max limits of running kernel and it does
> not tell us anything about command line size supported by kernel being
> loaded.

Whatever you do, you do need a sane default because even querying the
boot protocol is not reliable as the to-be-loaded kernel's boot protocol
might be manipulated too, before signing (who knows what people do
in the wild).

So having a sane, unconditional fallback COMMAND_LINE_SIZE from the
first kernel is a must, methinks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  7:50               ` Borislav Petkov
@ 2014-06-13  8:00                 ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-13  8:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vivek Goyal, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On 06/13/14 at 09:50am, Borislav Petkov wrote:
> On Mon, Jun 09, 2014 at 11:41:37AM -0400, Vivek Goyal wrote:
> > IIUC, COMMAND_LINE_SIZE gives max limits of running kernel and it does
> > not tell us anything about command line size supported by kernel being
> > loaded.
> 
> Whatever you do, you do need a sane default because even querying the
> boot protocol is not reliable as the to-be-loaded kernel's boot protocol
> might be manipulated too, before signing (who knows what people do
> in the wild).

Makes sense.

> 
> So having a sane, unconditional fallback COMMAND_LINE_SIZE from the
> first kernel is a must, methinks.

Grepping for COMMAND_LINE_SIZE across arches, I think 8K as the
fallback is, in general, good for now and for the future:

alpha/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     256

arm/include/uapi/asm/setup.h
19:#define COMMAND_LINE_SIZE 1024

avr32/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE 256

cris/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     256

frv/include/uapi/asm/setup.h
15:#define COMMAND_LINE_SIZE       512

ia64/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     2048

m32r/include/uapi/asm/setup.h
8:#define COMMAND_LINE_SIZE       512

m68k/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE 256

mips/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     4096

parisc/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     1024

powerpc/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     2048

s390/include/uapi/asm/setup.h
9:#define COMMAND_LINE_SIZE     4096

um/include/asm/setup.h
8:#define COMMAND_LINE_SIZE 4096

x86/include/asm/setup.h
6:#define COMMAND_LINE_SIZE 2048

xtensa/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE    256

c6x/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE   1024

microblaze/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE    256

mn10300/include/uapi/asm/param.h
16:#define COMMAND_LINE_SIZE 256

score/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE     256

tile/include/uapi/asm/setup.h
18:#define COMMAND_LINE_SIZE    2048

arc/include/asm/setup.h
15:#define COMMAND_LINE_SIZE 256

arm64/include/uapi/asm/setup.h
24:#define COMMAND_LINE_SIZE    2048
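To make the headroom concrete: the largest value in the list above is 4096 (mips, s390, um), so an 8K fallback is twice the current maximum. A small sketch that tabulates the quoted values and computes that maximum (the table layout and helper name are illustrative only):

```c
#include <stddef.h>

/* COMMAND_LINE_SIZE per arch, copied from the grep output above. */
static const struct {
	const char *arch;
	unsigned int size;
} cmdline_sizes[] = {
	{ "alpha", 256 }, { "arm", 1024 }, { "avr32", 256 }, { "cris", 256 },
	{ "frv", 512 }, { "ia64", 2048 }, { "m32r", 512 }, { "m68k", 256 },
	{ "mips", 4096 }, { "parisc", 1024 }, { "powerpc", 2048 },
	{ "s390", 4096 }, { "um", 4096 }, { "x86", 2048 }, { "xtensa", 256 },
	{ "c6x", 1024 }, { "microblaze", 256 }, { "mn10300", 256 },
	{ "score", 256 }, { "tile", 2048 }, { "arc", 256 }, { "arm64", 2048 },
};

/* Largest per-arch limit in the table. */
static unsigned int max_cmdline_size(void)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < sizeof(cmdline_sizes) / sizeof(cmdline_sizes[0]); i++)
		if (cmdline_sizes[i].size > max)
			max = cmdline_sizes[i].size;
	return max;
}
```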

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  8:00                 ` WANG Chao
@ 2014-06-13  8:10                   ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13  8:10 UTC (permalink / raw)
  To: WANG Chao
  Cc: Vivek Goyal, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 04:00:28PM +0800, WANG Chao wrote:
> Grepping for COMMAND_LINE_SIZE across arches, I think 8K as the
> fallback is, in general, good for now and for the future:

Why? We could simply use the arch default.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  8:10                   ` Borislav Petkov
@ 2014-06-13  8:24                     ` WANG Chao
  -1 siblings, 0 replies; 214+ messages in thread
From: WANG Chao @ 2014-06-13  8:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vivek Goyal, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On 06/13/14 at 10:10am, Borislav Petkov wrote:
> On Fri, Jun 13, 2014 at 04:00:28PM +0800, WANG Chao wrote:
> > Grepping for COMMAND_LINE_SIZE across arches, I think 8K as the
> > fallback is, in general, good for now and for the future:
> 
> Why? We could simply use the arch default.

Hmm, I'm not sure, but I think there's a chance that COMMAND_LINE_SIZE
will be extended in the future. In general, 8K is safe to use, because
the current largest value is 4K.

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  8:24                     ` WANG Chao
@ 2014-06-13  8:30                       ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13  8:30 UTC (permalink / raw)
  To: WANG Chao
  Cc: Vivek Goyal, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 04:24:59PM +0800, WANG Chao wrote:
> Hmm, I'm not sure, but I think there's a chance that COMMAND_LINE_SIZE
> will be extended in the future. In general, 8K is safe to use, because
> the current largest value is 4K.

Sure, but kexec cannot load a kernel of a different arch, can it?
Which means we're safe using the arch-specific definition of
COMMAND_LINE_SIZE.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-06 19:51       ` Vivek Goyal
@ 2014-06-13 10:17         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13 10:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Fri, Jun 06, 2014 at 03:51:04PM -0400, Vivek Goyal wrote:
> On Thu, Jun 05, 2014 at 10:05:23PM +0200, Borislav Petkov wrote:
> 
> [..]
> > > @@ -249,6 +254,7 @@ archclean:
> > >  	$(Q)rm -rf $(objtree)/arch/x86_64
> > >  	$(Q)$(MAKE) $(clean)=$(boot)
> > >  	$(Q)$(MAKE) $(clean)=arch/x86/tools
> > 
> > ifeq ($(CONFIG_KEXEC),y)
> > 	$(Q)$(MAKE) $(clean)=arch/x86/purgatory
> > endif
> 
> Hmm.., is it strictly required? I am wondering what happens if I build
> a kernel with CONFIG_KEXEC=y, then set CONFIG_KEXEC=n and do "make clean".

Try it - it works here.

> I think I would still like any files in arch/x86/purgatory to be cleaned
> even when CONFIG_KEXEC=n. Right?

Yep, that works. Conversely, we don't want people who have never enabled
KEXEC to pay for an unrelated cleanup delay.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  7:50               ` Borislav Petkov
@ 2014-06-13 12:46                 ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-13 12:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 09:50:11AM +0200, Borislav Petkov wrote:
> On Mon, Jun 09, 2014 at 11:41:37AM -0400, Vivek Goyal wrote:
> > IIUC, COMMAND_LINE_SIZE gives max limits of running kernel and it does
> > not tell us anything about command line size supported by kernel being
> > loaded.
> 
> Whatever you do, you do need a sane default because even querying the
> boot protocol is not reliable as the to-be-loaded kernel's boot protocol
> might be manipulated too, before signing (who knows what people do
> in the wild).

If signature verification is on, that should catch any manipulation of
the protocol headers.

If not, then we really can't do anything about it. A large memory
allocation will fail and the user will get an error.

This is no different from the length of the kernel or the initrd. Somebody
might prepare a huge file and pass that fd to the kernel, and the kernel
will try to read the whole thing in. If the file is too large, memory
allocation will fail and user space will get an error. We don't try to put
an upper limit on the size of the kernel image or initrd.
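Vivek's comparison point is how the kernel and initrd are already read: the whole file behind the fd is pulled in, with no cap beyond what memory allocation allows. A userspace model of that pattern, assuming a plain regular file; `read_whole_fd()` is a hypothetical name, not a kernel API:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Read the entire regular file behind @fd. The size comes from the file
 * itself and no separate upper limit is applied: if the allocation
 * fails, the caller simply gets -ENOMEM.
 */
static int read_whole_fd(int fd, void **buf, size_t *len)
{
	struct stat st;
	size_t done = 0;
	char *p;

	if (fstat(fd, &st) < 0)
		return -errno;
	if (!S_ISREG(st.st_mode) || st.st_size <= 0)
		return -EINVAL;

	p = malloc((size_t)st.st_size);
	if (!p)
		return -ENOMEM;

	while (done < (size_t)st.st_size) {
		ssize_t n = pread(fd, p + done, (size_t)st.st_size - done,
				  (off_t)done);

		if (n < 0) {
			int err = errno;

			free(p);
			return -err;
		}
		if (n == 0) {		/* file shrank underneath us */
			free(p);
			return -EIO;
		}
		done += (size_t)n;
	}
	*buf = p;
	*len = done;
	return 0;
}
```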

> 
> So having a sane, unconditional fallback COMMAND_LINE_SIZE from the
> first kernel is a must, methinks.

I disagree here. What if the new kernel supports a command line of
(2 * COMMAND_LINE_SIZE) length? We don't want to truncate the command line
to a smaller size just because the running kernel does not support such a
long command line.
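Vivek's alternative is to take the limit from the kernel being loaded. On x86 that limit is advertised in the bzImage setup header (`cmdline_size` at offset 0x238, boot protocol 2.06+; earlier protocols allow 255 characters). A minimal parsing sketch, assuming a raw bzImage in memory; the helper names are hypothetical, and as Boris points out, without signature verification this header is itself untrusted input:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Setup header offsets from the x86 boot protocol documentation. */
#define HDR_MAGIC_OFF	0x202	/* "HdrS" */
#define HDR_VERSION_OFF	0x206	/* boot protocol version, 2.00+ */
#define HDR_CMDLINE_OFF	0x238	/* cmdline_size, 2.06+ */

/* Read a little-endian value of @bytes bytes. */
static uint32_t get_le(const unsigned char *p, int bytes)
{
	uint32_t v = 0;
	int i;

	for (i = bytes - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Maximum command line length (excluding the trailing NUL) accepted by
 * the kernel in @img, or -1 if @img does not look like a bzImage.
 */
static long bzimage_cmdline_max(const unsigned char *img, size_t len)
{
	uint32_t version;

	if (len < HDR_CMDLINE_OFF + 4 || memcmp(img + HDR_MAGIC_OFF, "HdrS", 4))
		return -1;
	version = get_le(img + HDR_VERSION_OFF, 2);
	if (version < 0x0206)
		return 255;	/* protocol 2.05 and earlier */
	return (long)get_le(img + HDR_CMDLINE_OFF, 4);
}
```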

Thanks
Vivek


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13  8:00                 ` WANG Chao
@ 2014-06-13 12:49                   ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-13 12:49 UTC (permalink / raw)
  To: WANG Chao
  Cc: Borislav Petkov, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 04:00:28PM +0800, WANG Chao wrote:
> On 06/13/14 at 09:50am, Borislav Petkov wrote:
> > On Mon, Jun 09, 2014 at 11:41:37AM -0400, Vivek Goyal wrote:
> > > IIUC, COMMAND_LINE_SIZE gives max limits of running kernel and it does
> > > not tell us anything about command line size supported by kernel being
> > > loaded.
> > 
> > Whatever you do, you do need a sane default because even querying the
> > boot protocol is not reliable as the to-be-loaded kernel's boot protocol
> > might be manipulated too, before signing (who knows what people do
> > in the wild).
> 
> Makes sense.
> 
> > 
> > So having a sane, unconditional fallback COMMAND_LINE_SIZE from the
> > first kernel is a must, methinks.
> 
> Grepping for COMMAND_LINE_SIZE across arches, I think 8K as the
> fallback is, in general, good for now and for the future:

How do you know we will never cross 8K? Also, what kind of protection do
you have against kernel file size and initrd file size? If we don't have
any protection there, why is command line size so special (it is much
smaller than the kernel and initrd)?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13 12:46                 ` Vivek Goyal
@ 2014-06-13 15:36                   ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13 15:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 08:46:09AM -0400, Vivek Goyal wrote:
> If not, then we really can't do anything about it. A large memory
> allocation will fail and the user will get an error.

Of course we can! You can't trust userspace and you need to check the
values it gives you through the syscall.

And you need a sane default. The fact that I even have to state this
explicitly...!

So what if userspace gives a maximum value for which allocation
succeeds? Does that mean that the kernel should blindly comply and
allocate? Of course not! That would be dumb.
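Boris's position amounts to validating the syscall input against a fixed bound rather than trusting either userspace or the image header. A minimal sketch, assuming the length includes the terminating NUL and using the x86 COMMAND_LINE_SIZE of 2048 from the grep list as the bound; `check_user_cmdline()` and `FALLBACK_CMDLINE_SIZE` are hypothetical:

```c
#include <errno.h>
#include <stddef.h>

/* Assumed bound: the running kernel's COMMAND_LINE_SIZE (x86 value). */
#define FALLBACK_CMDLINE_SIZE 2048

/*
 * Validate a command line passed in through the syscall. @len counts the
 * terminating NUL. Hypothetical helper, not the patchset's code.
 */
static int check_user_cmdline(const char *buf, size_t len)
{
	if (len == 0)
		return 0;			/* no command line at all */
	if (len > FALLBACK_CMDLINE_SIZE)
		return -EINVAL;			/* reject, never truncate */
	if (buf[len - 1] != '\0')
		return -EINVAL;			/* must be a C string */
	return 0;
}
```

Rejecting with -EINVAL instead of truncating sidesteps Vivek's concern about silently shortening the second kernel's command line: an oversized request fails visibly.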

> I disagree here. What if the new kernel supports a command line of
> (2 * COMMAND_LINE_SIZE) length? We don't want to truncate the command line
> to a smaller size just because the running kernel does not support such a
> long command line.

Do you have a sensible use case where 2048 cmdline size (on x86) won't
be enough and you really need it larger?

And even if this is a problem (which I seriously doubt), it would be a
problem with the 1st kernel too, not only with the kexec-ed one.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time
  2014-06-11 19:24       ` Vivek Goyal
@ 2014-06-13 16:14         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-13 16:14 UTC (permalink / raw)
  To: Vivek Goyal, mjg59
  Cc: linux-kernel, kexec, ebiederm, hpa, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Wed, Jun 11, 2014 at 03:24:48PM -0400, Vivek Goyal wrote:
> This new syscall requires sha256 even if signature checking does not
> happen. Purgatory verifies checksum of segments.
> 
> I had to select CRYPTO also otherwise CONFIG_CRYPTO=m broke the build.
> 
> > 
> > Which begs the more important question - shouldn't this new in-kernel
> > loading method support also kexec'ing of kernels without any signature
> > verifications at all?
> 
> I think yes, it should allow kexecing kernels without any signature too.
> In fact, in the long term we should deprecate the old syscall and
> maintain this new one.
> 
> Now, when does signature checking kick in? I think we can define a new
> config option, say KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION. This option
> will make sure kernel signatures are verified.
> 
> If KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION=n, even then signature
> verification should be enforced if secureboot is enabled on the platform.

Right, this makes sense to me. Probably Matthew might want to chime in
here too...
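The policy described above reduces to a small predicate. It verifies when the proposed option is enabled, always verifies under secure boot, and otherwise allows unsigned kernels. This is a hypothetical sketch: the helper names are made up, and at this point in the thread the option is only a proposal.

```c
#include <stdbool.h>

/*
 * Hypothetical policy helper: verification is mandatory when the proposed
 * KEXEC_ENFORCE_KERNEL_SIG_VERIFICATION option is enabled, or when the
 * platform booted with secure boot enabled, regardless of the option.
 */
static bool kexec_sig_required(bool enforce_config, bool secure_boot)
{
	return enforce_config || secure_boot;
}

/*
 * Decide whether a load may proceed. sig_ok is the outcome of a signature
 * check on the image (not implemented in this series).
 */
static bool kexec_load_allowed(bool enforce_config, bool secure_boot,
			       bool sig_ok)
{
	if (!kexec_sig_required(enforce_config, secure_boot))
		return true;	/* unsigned kernels may still be kexec'ed */
	return sig_ok;
}
```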

> I will make it configurable in the next series. This series does not do
> any signature verification yet. I had to select the above CRYPTO and
> CRYPTO_SHA256 to make sure the checksum verification logic in purgatory
> works fine.

Ok.

> Hmm... I have seen other places use the same name as the structure. But I
> am not particular about it; will change. Anyway, in most places I use a
> pointer to access it.
> 
> struct purgatory_info *pi = &image->purgatory_info;

Yep, saw that in the later patches :)

> I would like to retain purgatory_buf. To shorten it, I could do this:
> 
> 	struct purgatory_info *pi = &image->purgatory_info;
> 	vfree(pi->purgatory_buf);
> 	pi->purgatory_buf = NULL;
> 
> I like the clarity in variable names.

Ok.

> I would like to keep it one function. The reason being that apart from
> the digest, we also store the list of regions which have been checksummed.
> And you will notice that we skip the purgatory region during checksum
> calculation.
> 
> So I would have to return quite a lot of information from a calc()
> function: the size of the digest, the actual digest buffer (which would
> need to be freed by the caller), and the list of sha regions (which would
> also need to be freed by the caller). Keeping it all in one function
> actually makes it a little simpler.

Hmm, ok.
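The following userspace sketch shows why a single function is convenient here. One pass computes the digest, records the checksummed regions, and skips the purgatory segment. The struct and function names are hypothetical. FNV-1a stands in for sha256 only to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_segment {
	const unsigned char *buf;
	size_t bufsz;
	unsigned long mem;	/* destination physical address */
};

struct sketch_region {
	unsigned long start;
	unsigned long len;
};

/* FNV-1a, used here only as a stand-in for sha256. */
static uint32_t fnv1a_update(uint32_t h, const unsigned char *p, size_t n)
{
	while (n--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Compute one digest over all segments except the purgatory one and, in
 * the same pass, record which regions were covered. Returns the number of
 * regions written to regions[] (at most max_regions).
 */
static size_t calc_digest_and_regions(const struct sketch_segment *segs,
				      size_t nr_segs, size_t purgatory_idx,
				      struct sketch_region *regions,
				      size_t max_regions, uint32_t *digest)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	size_t i, nr = 0;

	for (i = 0; i < nr_segs && nr < max_regions; i++) {
		if (i == purgatory_idx)
			continue;	/* purgatory is not part of the digest */
		h = fnv1a_update(h, segs[i].buf, segs[i].bufsz);
		regions[nr].start = segs[i].mem;
		regions[nr].len = segs[i].bufsz;
		nr++;
	}
	*digest = h;
	return nr;
}
```

Splitting this into a separate calc() would force the caller to receive and free both the digest and the region list. That is the trade-off described above.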

> I just wanted a small zero buffer. Is there any global zero buffer
> available in the kernel? If not, I could use a PAGE_SIZE zero buffer
> instead.

empty_zero_page?
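This sketch assumes the zero buffer feeds the zero-filled tail of a segment (memsz beyond bufsz) into the digest. Under that assumption, one shared zero buffer of any size is enough, provided the tail is fed in chunks. It is a userspace sketch: zero_buf stands in for empty_zero_page, and the callback type and names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_ZERO_BUF_SZ 4096	/* stand-in for PAGE_SIZE */

/* Stand-in for empty_zero_page: one shared, read-only zero buffer. */
static const unsigned char zero_buf[SKETCH_ZERO_BUF_SZ] = { 0 };

typedef void (*digest_update_fn)(void *ctx, const unsigned char *p, size_t n);

/* Feed nullsz zero bytes into the digest, one buffer-sized chunk at a time. */
static void digest_zero_tail(digest_update_fn update, void *ctx, size_t nullsz)
{
	while (nullsz) {
		size_t bytes = nullsz < SKETCH_ZERO_BUF_SZ ? nullsz : SKETCH_ZERO_BUF_SZ;

		update(ctx, zero_buf, bytes);
		nullsz -= bytes;
	}
}

/* Example update callback for illustration: counts bytes and calls. */
struct count_ctx {
	size_t bytes;
	size_t calls;
};

static void count_update(void *ctx, const unsigned char *p, size_t n)
{
	struct count_ctx *c = ctx;

	(void)p;
	c->bytes += n;
	c->calls++;
}
```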

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-03 13:07   ` Vivek Goyal
@ 2014-06-15 16:35     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-15 16:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:07:00AM -0400, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry.
> 
> 32bit mode entry can be implemented if somebody needs it.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---

...

> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> new file mode 100644
> index 0000000..0750784
> --- /dev/null
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -0,0 +1,269 @@
> +/*
> + * Kexec bzImage loader
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/kexec.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +/*
> + * Defines lowest physical address for various segments. Not sure where
> + * exactly these limits came from. Current bzimage64 loader in kexec-tools
> + * uses these so I am retaining it. It can be changed over time as we gain
> + * more insight.
> + */
> +#define MIN_PURGATORY_ADDR	0x3000
> +#define MIN_BOOTPARAM_ADDR	0x3000
> +#define MIN_KERNEL_LOAD_ADDR	0x100000
> +#define MIN_INITRD_LOAD_ADDR	0x1000000
> +
> +#ifdef CONFIG_X86_64
> +
> +/*
> + * This is a place holder for all boot loader specific data structure which
> + * gets allocated in one call but gets freed much later during cleanup
> + * time. Right now there is only one field but it can grow as need be.
> + */
> +struct bzimage64_data {
> +	/*
> +	 * Temporary buffer to hold bootparams buffer. This should be
> +	 * freed once the bootparam segment has been loaded.
> +	 */
> +	void *bootparams_buf;
> +};
> +
> +int bzImage64_probe(const char *buf, unsigned long len)
> +{
> +	int ret = -ENOEXEC;
> +	struct setup_header *header;
> +
> +	/* kernel should be atleast two sector long */

				    two sectors

> +	if (len < 2 * 512) {
> +		pr_debug("File is too short to be a bzImage\n");

Those error messages are all pr_debug. Now, wouldn't we want to tell
userspace what the problem is, *when* there is one?

I.e., pr_err or pr_info is much more helpful than pr_debug IMO.

> +		return ret;
> +	}
> +
> +	header = (struct setup_header *)(buf + offsetof(struct boot_params,
> +								hdr));

Just let that stick out. The 80 cols limit is not a hard one anyway,
especially if it impairs readability.

> +	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {

Not strncmp? "HdrS" is a string...

> +		pr_debug("Not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if (header->boot_flag != 0xAA55) {
> +		pr_debug("No x86 boot sector present\n");
> +		return ret;
> +	}
> +
> +	if (header->version < 0x020C) {
> +		pr_debug("Must be at least protocol version 2.12\n");
> +		return ret;
> +	}
> +
> +	if ((header->loadflags & LOADED_HIGH) == 0) {

	if (!(header->loadflags.. ))

> +		pr_debug("zImage not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if (!(header->xloadflags & XLF_KERNEL_64)) {
> +		pr_debug("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
> +		return ret;
> +	}
> +
> +	if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
> +		pr_debug("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
> +		return ret;
> +	}

Just merge the two checks:

	if ((header->xloadflags & (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) !=
	    (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) {
		pr_err("Not a bzImage, xloadflags: 0x%x\n", header->xloadflags);
		return ret;
	}
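Here is a userspace sketch of the probe checks with the two xloadflags tests merged as suggested. Field offsets and flag values follow Documentation/x86/boot.txt. The function and macro names are hypothetical, and fields are read byte-wise as little-endian rather than through struct setup_header. The sketch uses memcmp() because the magic is a 4-byte field with no NUL terminator; strncmp() with a length of 4 gives the same result here.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets from Documentation/x86/boot.txt. */
#define BOOT_FLAG_OFF	0x1fe	/* u16, must be 0xAA55 */
#define HEADER_OFF	0x202	/* "HdrS" magic */
#define VERSION_OFF	0x206	/* u16 boot protocol version */
#define LOADFLAGS_OFF	0x211	/* u8 */
#define XLOADFLAGS_OFF	0x236	/* u16, protocol 2.12+ */

#define SK_LOADED_HIGH			(1 << 0)
#define SK_XLF_KERNEL_64		(1 << 0)
#define SK_XLF_CAN_BE_LOADED_ABOVE_4G	(1 << 1)

static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static int bzimage64_probe_sketch(const unsigned char *buf, size_t len)
{
	const uint16_t need = SK_XLF_KERNEL_64 | SK_XLF_CAN_BE_LOADED_ABOVE_4G;

	if (len < 2 * 512)			/* at least two sectors */
		return -ENOEXEC;
	if (memcmp(buf + HEADER_OFF, "HdrS", 4) != 0)
		return -ENOEXEC;
	if (get_le16(buf + BOOT_FLAG_OFF) != 0xAA55)
		return -ENOEXEC;
	if (get_le16(buf + VERSION_OFF) < 0x020C)
		return -ENOEXEC;
	if (!(buf[LOADFLAGS_OFF] & SK_LOADED_HIGH))
		return -ENOEXEC;
	/* Both xloadflags bits must be set: one merged check. */
	if ((get_le16(buf + XLOADFLAGS_OFF) & need) != need)
		return -ENOEXEC;
	return 0;
}
```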

> +
> +	/* I've got a bzImage */
> +	pr_debug("It's a relocatable bzImage64\n");
> +	ret = 0;
> +
> +	return ret;
> +}
> +
> +void *bzImage64_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len,
> +		char *initrd, unsigned long initrd_len,
> +		char *cmdline, unsigned long cmdline_len)

Arg alignment.

> +{
> +
> +	struct setup_header *header;
> +	int setup_sects, kern16_size, ret = 0;
> +	unsigned long setup_header_size, params_cmdline_sz;
> +	struct boot_params *params;
> +	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> +	unsigned long purgatory_load_addr;
> +	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> +	char *kernel_buf;
> +	struct bzimage64_data *ldata;
> +	struct kexec_entry64_regs regs64;
> +	void *stack;
> +	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> +
> +	header = (struct setup_header *)(kernel + setup_hdr_offset);
> +	setup_sects = header->setup_sects;
> +	if (setup_sects == 0)
> +		setup_sects = 4;
> +
> +	kern16_size = (setup_sects + 1) * 512;
> +	if (kernel_len < kern16_size) {
> +		pr_debug("bzImage truncated\n");

Ditto for all those pr_debug's in here - I think we want to know why the
bzImage load fails and pr_debug is not suitable for that.

> +		return ERR_PTR(-ENOEXEC);
> +	}
> +
> +	if (cmdline_len > header->cmdline_size) {

As we discussed, I think COMMAND_LINE_SIZE is perfectly fine and safe for
all intents and purposes.

> +		pr_debug("Kernel command line too long\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* Allocate loader specific data */
> +	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> +	if (!ldata)
> +		return ERR_PTR(-ENOMEM);

Why don't you move that allocation to right before its first use, i.e.,
to the "ldata->bootparams_buf = params" line?

This way you'll save yourself all the goto games on the error path and
do the allocation only after everything else has succeeded.
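The reordering suggested above can be sketched as follows. The small loader-data allocation moves to the end, so the error path only ever has to free params. This is a userspace sketch; the names and the failure-injecting setup_steps() stand-in are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical loader-private data, modelled on struct bzimage64_data. */
struct sketch_loader_data {
	void *bootparams_buf;
};

/* Stand-in for the purgatory/kernel/initrd setup steps; fail injects an error. */
static int setup_steps(void *params, int fail)
{
	(void)params;
	return fail ? -EINVAL : 0;
}

static struct sketch_loader_data *load_sketch(size_t params_sz, int fail,
					      int *err)
{
	struct sketch_loader_data *ldata;
	void *params;
	int ret;

	params = calloc(1, params_sz);
	if (!params) {
		*err = -ENOMEM;
		return NULL;
	}

	ret = setup_steps(params, fail);
	if (ret)
		goto out_free_params;

	/* Allocate loader data last, right before it is first used. */
	ldata = calloc(1, sizeof(*ldata));
	if (!ldata) {
		ret = -ENOMEM;
		goto out_free_params;
	}

	ldata->bootparams_buf = params;
	*err = 0;
	return ldata;

out_free_params:
	free(params);
	*err = ret;
	return NULL;
}
```

A single error label remains, and no path ever has to free ldata.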

> +
> +	/*
> +	 * Load purgatory. For 64bit entry point, purgatory  code can be
> +	 * anywhere.
> +	 */
> +	ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
> +					&purgatory_load_addr);
> +	if (ret) {
> +		pr_debug("Loading purgatory failed\n");
> +		goto out_free_loader_data;
> +	}
> +
> +	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
> +
> +	/* Load Bootparams and cmdline */
> +	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> +	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
> +	if (!params) {
> +		ret = -ENOMEM;
> +		goto out_free_loader_data;
> +	}
> +
> +	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
> +	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
> +
> +	/* Is there a limit on setup header size? */
> +	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
> +
> +	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> +			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
> +			       ULONG_MAX, 1, &bootparam_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
> +
> +	/* Load kernel */
> +	kernel_buf = kernel + kern16_size;
> +	kernel_bufsz =  kernel_len - kern16_size;
> +	kernel_memsz = ALIGN(header->init_size, 4096);

PAGE_ALIGN
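In the kernel, PAGE_ALIGN(x) is ALIGN(x, PAGE_SIZE), and PAGE_SIZE is 4096 on x86. The change therefore documents intent rather than altering the result. Below is a userspace equivalent; the SKETCH_ names are hypothetical.

```c
/* Userspace stand-ins; on x86 the kernel's PAGE_SIZE is 4096. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round x up to a multiple of a, where a is a power of two (like ALIGN()). */
#define SKETCH_ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))
#define SKETCH_PAGE_ALIGN(x)	SKETCH_ALIGN((unsigned long)(x), SKETCH_PAGE_SIZE)
```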

> +	kernel_align = header->kernel_alignment;
> +
> +	ret = kexec_add_buffer(image, kernel_buf,
> +			       kernel_bufsz, kernel_memsz, kernel_align,
> +			       MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
> +			       &kernel_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +
> +	pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 kernel_load_addr, kernel_memsz, kernel_memsz);
> +
> +	/* Load initrd high */
> +	if (initrd) {
> +		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
> +				       PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
> +				       ULONG_MAX, 1, &initrd_load_addr);
> +		if (ret)
> +			goto out_free_params;
> +
> +		pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +				initrd_load_addr, initrd_len, initrd_len);

empty line to split, pls.

> +		ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
> +		if (ret)

This ret is unconditionally 0 - no need to check it.

> +			goto out_free_params;
> +	}
> +
> +	ret = kexec_setup_cmdline(params, bootparam_load_addr,
> +				  sizeof(struct boot_params), cmdline,
> +				  cmdline_len);
> +	if (ret)

Ditto.

> +		goto out_free_params;
> +
> +	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
> +	params->hdr.type_of_loader = 0x0D << 4;
> +	params->hdr.loadflags = 0;
> +
> +	/* Setup purgatory regs for entry */
> +	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> +					     sizeof(regs64), 1);
> +	if (ret)
> +		goto out_free_params;
> +
> +	regs64.rbx = 0; /* Bootstrap Processor */
> +	regs64.rsi = bootparam_load_addr;
> +	regs64.rip = kernel_load_addr + 0x200;
> +	stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
> +	if (IS_ERR(stack)) {
> +		pr_debug("Could not find address of symbol stack_end\n");
> +		ret = -EINVAL;
> +		goto out_free_params;
> +	}
> +
> +	regs64.rsp = (unsigned long)stack;
> +	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> +					     sizeof(regs64), 0);
> +	if (ret)
> +		goto out_free_params;
> +
> +	ret = kexec_setup_boot_parameters(params);

Ditto.

> +	if (ret)
> +		goto out_free_params;
> +
> +	/*
> +	 * Store pointer to params so that it could be freed after loading
> +	 * params segment has been loaded and contents have been copied
> +	 * somewhere else.
> +	 */
> +	ldata->bootparams_buf = params;
> +	return ldata;
> +
> +out_free_params:
> +	kfree(params);
> +out_free_loader_data:
> +	kfree(ldata);
> +	return ERR_PTR(ret);
> +}
> +
> +/* This cleanup function is called after various segments have been loaded */
> +int bzImage64_cleanup(struct kimage *image)
> +{
> +	struct bzimage64_data *ldata = image->image_loader_data;
> +
> +	if (!ldata)
> +		return 0;
> +
> +	kfree(ldata->bootparams_buf);
> +	ldata->bootparams_buf = NULL;
> +
> +	return 0;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> new file mode 100644
> index 0000000..7de3239
> --- /dev/null
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -0,0 +1,136 @@
> +/*
> + * handle transition of Linux booting another kernel
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +/*
> + * Common code for x86 and x86_64 used for kexec.
> + *
> + * For the time being it compiles only for x86_64 as there are no image
> + * loaders implemented for x86. This #ifdef can be removed once somebody
> + * decides to write an image loader on CONFIG_X86_32.
> + */
> +
> +#ifdef CONFIG_X86_64

Ok, this doesn't make any sense: this new machine_kexec.c is supposed to
be common code and yet it has this 64-bit ifdef in there.

It should be the other way around, IMO: put it now in machine_kexec_64.c
and if someone wants the 32-bit version, that someone should carve it
out. This'll save you the needless ifdeffery now.

> +
> +int kexec_setup_initrd(struct boot_params *params,
> +		unsigned long initrd_load_addr, unsigned long initrd_len)
> +{
> +	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> +	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;

We have more readable GENMASK* macros for contiguous masks. This one
will then look like:

	params->hdr.ramdisk_image = initrd_load_addr & GENMASK(31, 0);
	params->hdr.ramdisk_size = initrd_len & GENMASK(31, 0);

and this way we know exactly which bits we are talking about. :)
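This userspace sketch combines the GENMASK suggestion with the earlier remark that kexec_setup_initrd() cannot fail, so it may as well return void. SK_GENMASK_ULL() mirrors the semantics of the kernel's GENMASK_ULL(). The struct is a hypothetical stand-in that holds only the four fields touched here.

```c
#include <stdint.h>

/* Userspace equivalent of GENMASK_ULL(h, l): bits l..h (inclusive) set. */
#define SK_GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Hypothetical stand-in for the boot_params fields used below. */
struct sk_boot_params {
	uint32_t ramdisk_image;		/* hdr.ramdisk_image */
	uint32_t ramdisk_size;		/* hdr.ramdisk_size */
	uint32_t ext_ramdisk_image;
	uint32_t ext_ramdisk_size;
};

/* Returns void: nothing here can fail, so callers need no error check. */
static void sk_setup_initrd(struct sk_boot_params *p,
			    uint64_t initrd_load_addr, uint64_t initrd_len)
{
	p->ramdisk_image = initrd_load_addr & SK_GENMASK_ULL(31, 0);
	p->ramdisk_size = initrd_len & SK_GENMASK_ULL(31, 0);
	p->ext_ramdisk_image = initrd_load_addr >> 32;
	p->ext_ramdisk_size = initrd_len >> 32;
}
```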

> +
> +	params->ext_ramdisk_image = initrd_load_addr >> 32;
> +	params->ext_ramdisk_size = initrd_len >> 32;
> +
> +	return 0;
> +}
> +
> +int kexec_setup_cmdline(struct boot_params *params,
> +		unsigned long bootparams_load_addr,
> +		unsigned long cmdline_offset, char *cmdline,
> +		unsigned long cmdline_len)
> +{
> +	char *cmdline_ptr = ((char *)params) + cmdline_offset;
> +	unsigned long cmdline_ptr_phys;
> +	uint32_t cmdline_low_32, cmdline_ext_32;
> +
> +	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	cmdline_ptr[cmdline_len - 1] = '\0';
> +
> +	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
> +	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;

GENMASK

> +	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> +
> +	params->hdr.cmd_line_ptr = cmdline_low_32;
> +	if (cmdline_ext_32)
> +		params->ext_cmd_line_ptr = cmdline_ext_32;
> +
> +	return 0;
> +}
> +
> +static int setup_memory_map_entries(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	/* TODO: What about EFI */

You're removing this line in 13/13 so don't add it at all... ?

> +	nr_e820_entries = e820_saved.nr_map;
> +	if (nr_e820_entries > E820MAX)
> +		nr_e820_entries = E820MAX;
> +
> +	params->e820_entries = nr_e820_entries;
> +	memcpy(&params->e820_map, &e820_saved.map,
> +			nr_e820_entries * sizeof(struct e820entry));
> +
> +	return 0;
> +}
> +
> +int kexec_setup_boot_parameters(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +	unsigned long long mem_k, start, end;
> +	int i;
> +
> +	/* Get subarch from existing bootparams */
> +	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> +
> +	/* Copying screen_info will do? */
> +	memcpy(&params->screen_info, &boot_params.screen_info,
> +				sizeof(struct screen_info));
> +
> +	/* Fill in memsize later */
> +	params->screen_info.ext_mem_k = 0;
> +	params->alt_mem_k = 0;
> +
> +	/* Default APM info */
> +	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
> +
> +	/* Default drive info */
> +	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
> +	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
> +
> +	/* Default sysdesc table */
> +	params->sys_desc_table.length = 0;
> +
> +	setup_memory_map_entries(params);
> +	nr_e820_entries = params->e820_entries;
> +
> +	for (i = 0; i < nr_e820_entries; i++) {
> +		if (params->e820_map[i].type != E820_RAM)
> +			continue;
> +		start = params->e820_map[i].addr;
> +		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
> +
> +		if ((start <= 0x100000) && end > 0x100000) {
> +			mem_k = (end >> 10) - (0x100000 >> 10);
> +			params->screen_info.ext_mem_k = mem_k;
> +			params->alt_mem_k = mem_k;
> +			if (mem_k > 0xfc00)
> +				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
> +			if (mem_k > 0xffffffff)
> +				params->alt_mem_k = 0xffffffff;
> +		}
> +	}
> +
> +	/* Setup EDD info */
> +	memcpy(params->eddbuf, boot_params.eddbuf,
> +				EDDMAXNR * sizeof(struct edd_info));
				^^^^^^^^^^^^^^

Shouldn't you copy only eddbuf_entries entries instead of EDDMAXNR?
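Here is a sketch of the bounded copy. It copies only as many entries as the source reports, clamped to the array size. The kernel defines EDDMAXNR as 6. sk_edd_info is a hypothetical stand-in for struct edd_info with a simplified size. Because params comes from kzalloc(), entries beyond the copied ones would already be zero.

```c
#include <string.h>

#define SK_EDDMAXNR 6	/* same value as the kernel's EDDMAXNR */

/* Hypothetical stand-in for struct edd_info; real size differs. */
struct sk_edd_info {
	unsigned char bytes[8];
};

/* Copy at most SK_EDDMAXNR entries; returns how many were copied. */
static unsigned int sk_copy_edd(struct sk_edd_info *dst,
				const struct sk_edd_info *src,
				unsigned int nr_entries)
{
	if (nr_entries > SK_EDDMAXNR)
		nr_entries = SK_EDDMAXNR;	/* never trust the count blindly */
	memcpy(dst, src, nr_entries * sizeof(*src));
	return nr_entries;
}
```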



> +	params->eddbuf_entries = boot_params.eddbuf_entries;
> +

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
@ 2014-06-15 16:35     ` Borislav Petkov
  0 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-15 16:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mjg59, bhe, jkosina, greg, kexec, linux-kernel, ebiederm, hpa,
	akpm, dyoung, chaowang

On Tue, Jun 03, 2014 at 09:07:00AM -0400, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry.
> 
> 32bit mode entry can be implemented if somebody needs it.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---

...

> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> new file mode 100644
> index 0000000..0750784
> --- /dev/null
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -0,0 +1,269 @@
> +/*
> + * Kexec bzImage loader
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/kexec.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +/*
> + * Defines lowest physical address for various segments. Not sure where
> + * exactly these limits came from. Current bzimage64 loader in kexec-tools
> + * uses these so I am retaining it. It can be changed over time as we gain
> + * more insight.
> + */
> +#define MIN_PURGATORY_ADDR	0x3000
> +#define MIN_BOOTPARAM_ADDR	0x3000
> +#define MIN_KERNEL_LOAD_ADDR	0x100000
> +#define MIN_INITRD_LOAD_ADDR	0x1000000
> +
> +#ifdef CONFIG_X86_64
> +
> +/*
> + * This is a place holder for all boot loader specific data structure which
> + * gets allocated in one call but gets freed much later during cleanup
> + * time. Right now there is only one field but it can grow as need be.
> + */
> +struct bzimage64_data {
> +	/*
> +	 * Temporary buffer to hold bootparams buffer. This should be
> +	 * freed once the bootparam segment has been loaded.
> +	 */
> +	void *bootparams_buf;
> +};
> +
> +int bzImage64_probe(const char *buf, unsigned long len)
> +{
> +	int ret = -ENOEXEC;
> +	struct setup_header *header;
> +
> +	/* kernel should be atleast two sector long */

				    two sectors

> +	if (len < 2 * 512) {
> +		pr_debug("File is too short to be a bzImage\n");

Those error messages are all pr_debug. Now, wouldn't we want to tell
userspace what the problem is, *when* there is one?

I.e., pr_err or pr_info is much more helpful than pr_debug IMO.

> +		return ret;
> +	}
> +
> +	header = (struct setup_header *)(buf + offsetof(struct boot_params,
> +								hdr));

Just let that stick out. The 80 cols limit is not a hard one anyway,
especially if it impairs readability.

> +	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {

Not strncmp? "HdrS" is a string...

> +		pr_debug("Not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if (header->boot_flag != 0xAA55) {
> +		pr_debug("No x86 boot sector present\n");
> +		return ret;
> +	}
> +
> +	if (header->version < 0x020C) {
> +		pr_debug("Must be at least protocol version 2.12\n");
> +		return ret;
> +	}
> +
> +	if ((header->loadflags & LOADED_HIGH) == 0) {

	if (!(header->loadflags.. ))

> +		pr_debug("zImage not a bzImage\n");
> +		return ret;
> +	}
> +
> +	if (!(header->xloadflags & XLF_KERNEL_64)) {
> +		pr_debug("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
> +		return ret;
> +	}
> +
> +	if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
> +		pr_debug("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
> +		return ret;
> +	}

Just merge the two checks:

	if ((header->xloadflags & (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) !=
                                  (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) {
                pr_err("Not a bzImage, xloadflags: 0x%x\n", header->xloadflags);
                return ret;
        }

> +
> +	/* I've got a bzImage */
> +	pr_debug("It's a relocatable bzImage64\n");
> +	ret = 0;
> +
> +	return ret;
> +}
> +
> +void *bzImage64_load(struct kimage *image, char *kernel,
> +		unsigned long kernel_len,
> +		char *initrd, unsigned long initrd_len,
> +		char *cmdline, unsigned long cmdline_len)

Arg alignment.

> +{
> +
> +	struct setup_header *header;
> +	int setup_sects, kern16_size, ret = 0;
> +	unsigned long setup_header_size, params_cmdline_sz;
> +	struct boot_params *params;
> +	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> +	unsigned long purgatory_load_addr;
> +	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> +	char *kernel_buf;
> +	struct bzimage64_data *ldata;
> +	struct kexec_entry64_regs regs64;
> +	void *stack;
> +	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> +
> +	header = (struct setup_header *)(kernel + setup_hdr_offset);
> +	setup_sects = header->setup_sects;
> +	if (setup_sects == 0)
> +		setup_sects = 4;
> +
> +	kern16_size = (setup_sects + 1) * 512;
> +	if (kernel_len < kern16_size) {
> +		pr_debug("bzImage truncated\n");

Ditto for all those pr_debug's in here - I think we want to know why the
bzImage load fails and pr_debug is not suitable for that.

> +		return ERR_PTR(-ENOEXEC);
> +	}
> +
> +	if (cmdline_len > header->cmdline_size) {

As we talked, I think COMMAND_LINE_SIZE is perfectly fine and safe for
all intents and purposes.

> +		pr_debug("Kernel command line too long\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* Allocate loader specific data */
> +	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> +	if (!ldata)
> +		return ERR_PTR(-ENOMEM);

Why don't you move that allocation to the place right before it is being
assigned to it? I.e., to the "ldata->bootparams_buf = params" line.

This way you'll save yourself all the goto games on the error path and
do it only after everything else has succeeded.

> +
> +	/*
> +	 * Load purgatory. For 64bit entry point, purgatory  code can be
> +	 * anywhere.
> +	 */
> +	ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
> +					&purgatory_load_addr);
> +	if (ret) {
> +		pr_debug("Loading purgatory failed\n");
> +		goto out_free_loader_data;
> +	}
> +
> +	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
> +
> +	/* Load Bootparams and cmdline */
> +	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> +	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
> +	if (!params) {
> +		ret = -ENOMEM;
> +		goto out_free_loader_data;
> +	}
> +
> +	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
> +	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
> +
> +	/* Is there a limit on setup header size? */
> +	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
> +
> +	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> +			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
> +			       ULONG_MAX, 1, &bootparam_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
> +
> +	/* Load kernel */
> +	kernel_buf = kernel + kern16_size;
> +	kernel_bufsz =  kernel_len - kern16_size;
> +	kernel_memsz = ALIGN(header->init_size, 4096);

PAGE_ALIGN

> +	kernel_align = header->kernel_alignment;
> +
> +	ret = kexec_add_buffer(image, kernel_buf,
> +			       kernel_bufsz, kernel_memsz, kernel_align,
> +			       MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
> +			       &kernel_load_addr);
> +	if (ret)
> +		goto out_free_params;
> +
> +	pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 kernel_load_addr, kernel_memsz, kernel_memsz);
> +
> +	/* Load initrd high */
> +	if (initrd) {
> +		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
> +				       PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
> +				       ULONG_MAX, 1, &initrd_load_addr);
> +		if (ret)
> +			goto out_free_params;
> +
> +		pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +				initrd_load_addr, initrd_len, initrd_len);

emtpy line to split, pls.

> +		ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
> +		if (ret)

This ret is unconditionally 0 - no need to check it.

> +			goto out_free_params;
> +	}
> +
> +	ret = kexec_setup_cmdline(params, bootparam_load_addr,
> +				  sizeof(struct boot_params), cmdline,
> +				  cmdline_len);
> +	if (ret)

Ditto.

> +		goto out_free_params;
> +
> +	/* bootloader info. Do we need a separate ID for kexec kernel loader? */
> +	params->hdr.type_of_loader = 0x0D << 4;
> +	params->hdr.loadflags = 0;
> +
> +	/* Setup purgatory regs for entry */
> +	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> +					     sizeof(regs64), 1);
> +	if (ret)
> +		goto out_free_params;
> +
> +	regs64.rbx = 0; /* Bootstrap Processor */
> +	regs64.rsi = bootparam_load_addr;
> +	regs64.rip = kernel_load_addr + 0x200;
> +	stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
> +	if (IS_ERR(stack)) {
> +		pr_debug("Could not find address of symbol stack_end\n");
> +		ret = -EINVAL;
> +		goto out_free_params;
> +	}
> +
> +	regs64.rsp = (unsigned long)stack;
> +	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> +					     sizeof(regs64), 0);
> +	if (ret)
> +		goto out_free_params;
> +
> +	ret = kexec_setup_boot_parameters(params);

Ditto.

> +	if (ret)
> +		goto out_free_params;
> +
> +	/*
> +	 * Store pointer to params so that it could be freed after loading
> +	 * params segment has been loaded and contents have been copied
> +	 * somewhere else.
> +	 */
> +	ldata->bootparams_buf = params;
> +	return ldata;
> +
> +out_free_params:
> +	kfree(params);
> +out_free_loader_data:
> +	kfree(ldata);
> +	return ERR_PTR(ret);
> +}
> +
> +/* This cleanup function is called after various segments have been loaded */
> +int bzImage64_cleanup(struct kimage *image)
> +{
> +	struct bzimage64_data *ldata = image->image_loader_data;
> +
> +	if (!ldata)
> +		return 0;
> +
> +	kfree(ldata->bootparams_buf);
> +	ldata->bootparams_buf = NULL;
> +
> +	return 0;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> new file mode 100644
> index 0000000..7de3239
> --- /dev/null
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -0,0 +1,136 @@
> +/*
> + * handle transition of Linux booting another kernel
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2.  See the file COPYING for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +/*
> + * Common code for x86 and x86_64 used for kexec.
> + *
> + * For the time being it compiles only for x86_64 as there are no image
> + * loaders implemented for x86. This #ifdef can be removed once somebody
> + * decides to write an image loader on CONFIG_X86_32.
> + */
> +
> +#ifdef CONFIG_X86_64

Ok, this doesn't make any sense: this new machine_kexec.c is supposed to
be common code and yet it has this 64-bit ifdef in there.

It should be the other way around, IMO: put it now in machine_kexec_64.c
and if someone wants the 32-bit version, that someone should carve it
out. This'll save you the needless ifdeffery now.

> +
> +int kexec_setup_initrd(struct boot_params *params,
> +		unsigned long initrd_load_addr, unsigned long initrd_len)
> +{
> +	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> +	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;

We have more readable GENMASK* macros for contiguous masks. This one
will then look like:

	params->hdr.ramdisk_image = initrd_load_addr & GENMASK(31, 0);
	params->hdr.ramdisk_size = initrd_len & GENMASK(31, 0);

and this way we know exactly about which bits are we talking about. :)
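To make the suggested split concrete, below is a minimal user-space sketch of how a 64-bit initrd address is divided between the low 32-bit field and its "ext_" high-half field. `SKETCH_GENMASK_ULL` is a hypothetical local stand-in for the kernel's GENMASK() family. `struct initrd_fields` is also hypothetical: it is a simplified stand-in for the real `hdr.ramdisk_image` / `ext_ramdisk_image` members of `boot_params`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's GENMASK_ULL(h, l):
 * sets bits l..h inclusive (0 <= l <= h <= 63). */
#define SKETCH_GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Simplified, hypothetical view of the two fields carrying a 64-bit
 * initrd address: low half in the setup header, high half in "ext_". */
struct initrd_fields {
	uint32_t ramdisk_image;     /* bits 31..0  */
	uint32_t ext_ramdisk_image; /* bits 63..32 */
};

static struct initrd_fields split_addr(uint64_t addr)
{
	struct initrd_fields f;

	f.ramdisk_image = (uint32_t)(addr & SKETCH_GENMASK_ULL(31, 0));
	f.ext_ramdisk_image = (uint32_t)(addr >> 32);
	return f;
}

/* Reassemble the two halves to confirm that no bits were lost. */
static uint64_t join_addr(struct initrd_fields f)
{
	return ((uint64_t)f.ext_ramdisk_image << 32) | f.ramdisk_image;
}
```

The mask names the bit range explicitly, which is the readability gain the review asks for. The `>> 32` shift supplies the other half.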

> +
> +	params->ext_ramdisk_image = initrd_load_addr >> 32;
> +	params->ext_ramdisk_size = initrd_len >> 32;
> +
> +	return 0;
> +}
> +
> +int kexec_setup_cmdline(struct boot_params *params,
> +		unsigned long bootparams_load_addr,
> +		unsigned long cmdline_offset, char *cmdline,
> +		unsigned long cmdline_len)
> +{
> +	char *cmdline_ptr = ((char *)params) + cmdline_offset;
> +	unsigned long cmdline_ptr_phys;
> +	uint32_t cmdline_low_32, cmdline_ext_32;
> +
> +	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	cmdline_ptr[cmdline_len - 1] = '\0';
> +
> +	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
> +	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;

GENMASK

> +	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> +
> +	params->hdr.cmd_line_ptr = cmdline_low_32;
> +	if (cmdline_ext_32)
> +		params->ext_cmd_line_ptr = cmdline_ext_32;
> +
> +	return 0;
> +}
> +
> +static int setup_memory_map_entries(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	/* TODO: What about EFI */

You're removing this line in 13/13 so don't add it at all... ?

> +	nr_e820_entries = e820_saved.nr_map;
> +	if (nr_e820_entries > E820MAX)
> +		nr_e820_entries = E820MAX;
> +
> +	params->e820_entries = nr_e820_entries;
> +	memcpy(&params->e820_map, &e820_saved.map,
> +			nr_e820_entries * sizeof(struct e820entry));
> +
> +	return 0;
> +}
> +
> +int kexec_setup_boot_parameters(struct boot_params *params)
> +{
> +	unsigned int nr_e820_entries;
> +	unsigned long long mem_k, start, end;
> +	int i;
> +
> +	/* Get subarch from existing bootparams */
> +	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> +
> +	/* Copying screen_info will do? */
> +	memcpy(&params->screen_info, &boot_params.screen_info,
> +				sizeof(struct screen_info));
> +
> +	/* Fill in memsize later */
> +	params->screen_info.ext_mem_k = 0;
> +	params->alt_mem_k = 0;
> +
> +	/* Default APM info */
> +	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
> +
> +	/* Default drive info */
> +	memset(&params->hd0_info, 0, sizeof(params->hd0_info));
> +	memset(&params->hd1_info, 0, sizeof(params->hd1_info));
> +
> +	/* Default sysdesc table */
> +	params->sys_desc_table.length = 0;
> +
> +	setup_memory_map_entries(params);
> +	nr_e820_entries = params->e820_entries;
> +
> +	for (i = 0; i < nr_e820_entries; i++) {
> +		if (params->e820_map[i].type != E820_RAM)
> +			continue;
> +		start = params->e820_map[i].addr;
> +		end = params->e820_map[i].addr + params->e820_map[i].size - 1;
> +
> +		if ((start <= 0x100000) && end > 0x100000) {
> +			mem_k = (end >> 10) - (0x100000 >> 10);
> +			params->screen_info.ext_mem_k = mem_k;
> +			params->alt_mem_k = mem_k;
> +			if (mem_k > 0xfc00)
> +				params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
> +			if (mem_k > 0xffffffff)
> +				params->alt_mem_k = 0xffffffff;
> +		}
> +	}
> +
> +	/* Setup EDD info */
> +	memcpy(params->eddbuf, boot_params.eddbuf,
> +				EDDMAXNR * sizeof(struct edd_info));
				^^^^^^^^^^^^^^

Shouldn't you just copy eddbuf_entries many instead of EDDMAXNR?



> +	params->eddbuf_entries = boot_params.eddbuf_entries;
> +

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-15 16:35     ` Borislav Petkov
@ 2014-06-15 16:56       ` H. Peter Anvin
  -1 siblings, 0 replies; 214+ messages in thread
From: H. Peter Anvin @ 2014-06-15 16:56 UTC (permalink / raw)
  To: Borislav Petkov, Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On 06/15/2014 09:35 AM, Borislav Petkov wrote:
> 
>> +	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
> 
> Not strncmp? "HdrS" is a string...
> 

No, memcmp() is more appropriate.  It is really more of a byte sequence
than a string.

It could just as easily be done as:

	header->header == 0x53726448

	-hpa
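The equivalence hpa points out only holds because `header->header` is a 32-bit little-endian field. The bytes 'H' (0x48), 'd' (0x64), 'r' (0x72), 'S' (0x53) therefore read back as 0x53726448 on x86. Here is a minimal user-space sketch of both checks, assuming a little-endian host. The function names are hypothetical. Unlike strncmp(), memcmp() compares exactly four bytes and never stops at a NUL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise check, as in the patch: exactly 4 bytes, no NUL semantics,
 * independent of host endianness. */
static int magic_ok_memcmp(const void *field)
{
	return memcmp(field, "HdrS", 4) == 0;
}

/* Integer check, as hpa suggests: equivalent only on a little-endian
 * host such as x86, where 'H' (0x48) is the least significant byte. */
static int magic_ok_u32(uint32_t field)
{
	return field == 0x53726448;
}
```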


^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-13 10:17         ` Borislav Petkov
@ 2014-06-16 17:25           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 17:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Fri, Jun 13, 2014 at 12:17:13PM +0200, Borislav Petkov wrote:
> On Fri, Jun 06, 2014 at 03:51:04PM -0400, Vivek Goyal wrote:
> > On Thu, Jun 05, 2014 at 10:05:23PM +0200, Borislav Petkov wrote:
> > 
> > [..]
> > > > @@ -249,6 +254,7 @@ archclean:
> > > >  	$(Q)rm -rf $(objtree)/arch/x86_64
> > > >  	$(Q)$(MAKE) $(clean)=$(boot)
> > > >  	$(Q)$(MAKE) $(clean)=arch/x86/tools
> > > 
> > > ifeq ($(CONFIG_KEXEC),y)
> > > 	$(Q)$(MAKE) $(clean)=arch/x86/purgatory
> > > endif
> > 
> > Hmm.., is it strictly required? I am wondering what happens if I build
> > a kernel with CONFIG_KEXEC=y, then set CONFIG_KEXEC=n and do "make clean".
> 
> Try it - it works here.
> 
> > I think I will still like any files in arch/x86/purgatory to be cleaned
> > despite the fact that CONFIG_KEXEC=n. Isn't it?
> 
> Yep, that works. Conversely, we don't want people who haven't enabled
> KEXEC ever to have unrelated cleanup delay.
> 

I tried the following with CONFIG_KEXEC=n:

ifeq ($(CONFIG_KEXEC),y)
        $(Q)$(MAKE) $(clean)=arch/x86/purgatory
endif

And still "make V=1 clean" shows me that it is going into the purgatory dir
to clean things up:

make -f scripts/Makefile.clean obj=arch/x86/purgatory

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-13 15:36                   ` Borislav Petkov
@ 2014-06-16 17:38                     ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 17:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Fri, Jun 13, 2014 at 05:36:20PM +0200, Borislav Petkov wrote:
> On Fri, Jun 13, 2014 at 08:46:09AM -0400, Vivek Goyal wrote:
> > If not, then we really can't do anything about it. A large memory
> > allocation will fail and user will get error.
> 
> Of course we can! You can't trust userspace and you need to check the
> values it gives you through the syscall.
> 
> And you need a sane default. The fact that I even have to state this
> explicitly...!

And what's the sane default in this case? Using current kernel's
command line size will not work if future kernel decide to support
even longer command line size.

> 
> So what if userspace gives a maximum value for which allocation
> succeeds? Does that mean that the kernel should blindly comply and
> allocate? Of course not! That would be dumb.

I agree that some kind of upper value is good. But I am disagreeing
that using current kernel's COMMAND_LINE_SIZE is better thing to do.

Also what's the upper limit on initramfs size? There is none. The issues
you are trying to prevent can be easily created simply by passing in
a large initrd file.

If we are not putting any sane defaults on size of kernel and initramfs, I
am not really sure what do we gain here by putting an incorrect limit
on kernel command line size.

> 
> > I disagree here. What if new kernel supports (2 * COMMAND_LINE_SIZE)
> > length command line. We don't want to truncate command line to smaller
> > size because running kernel does not support that long a command line.
> 
> Do you have a sensible use case where 2048 cmdline size (on x86) won't
> be enough and you really need it larger?

Who knows that in future we might have to extend it beyond 2048. You
can't say that 2048 will never be changed. Nobody knows.

> 
> And even if this is a problem - which I seriously doubt - it would be
> problem with the 1st kernel too, not only with the kexec-ed one.

Why it will be a problem with first kernel?

So assuming that you will agree that we might have to extend kernel
command line some day, my question is how would you support kexec from
old kernel to newer kernel with larger command line size (if I limit the
supported command line to the one supported by the running kernel).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 17:38                     ` Vivek Goyal
@ 2014-06-16 20:05                       ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 20:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Mon, Jun 16, 2014 at 01:38:23PM -0400, Vivek Goyal wrote:
> And what's the sane default in this case?

COMMAND_LINE_SIZE

> Using current kernel's command line size will not work if future
> kernel decide to support even longer command line size.

When do you ever get to kexec a kernel with command line size differing
from the first kernel? This use case is pretty much non-existent to
say the least (mind you, I'm open to examples but am still waiting for
them). And even then you go and simply upgrade the first kernel.

Why are we even talking about this?

> I agree that some kind of upper value is good. But I am disagreeing
> that using current kernel's COMMAND_LINE_SIZE is better thing to do.

Again, stop arguing about some nonsensical cases and give me a real use
case where this is a problem.

> Also what's the upper limit on initramfs size? There is none. The issues
> you are trying to prevent can be easily created simply by passing in
> a large initrd file.
> 
> If we are not putting any sane defaults on size of kernel and initramfs, I
> am not really sure what do we gain here by putting an incorrect limit
> on kernel command line size.

You need to have a *sane* default length for a command line size - not
what's possible or what's not - something sane.

> Who knows that in future we might have to extend it beyond 2048. You
> can't say that 2048 will never be changed. Nobody knows.

Dude, stop arguing this dumb case - if the command line size is changed,
you simply update the first kernel. What is the use case of having to
kexec a newer kernel on an older kernel? Spit it out already.

> > And even if this is a problem - which I seriously doubt - it would be
> > problem with the 1st kernel too, not only with the kexec-ed one.
> 
> Why it will be a problem with first kernel?

Because if a kernel overflows COMMAND_LINE_SIZE, then something's wrong
with that use case and needs to get information passed in a different
manner - not 2K of cmdline string. Again, where is the sane use case?

> So assuming that you will agree that we might have to extend kernel
> command line some day, my question is how would you support kexec from
> old kernel to newer kernel with larger command line size.

Why do I need to support that case?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-15 16:35     ` Borislav Petkov
@ 2014-06-16 20:06       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 20:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Sun, Jun 15, 2014 at 06:35:15PM +0200, Borislav Petkov wrote:

[..]
> > +int bzImage64_probe(const char *buf, unsigned long len)
> > +{
> > +	int ret = -ENOEXEC;
> > +	struct setup_header *header;
> > +
> > +	/* kernel should be atleast two sector long */
> 
> 				    two sectors
> 
> > +	if (len < 2 * 512) {
> > +		pr_debug("File is too short to be a bzImage\n");
> 
> Those error messages are all pr_debug. Now, wouldn't we want to tell
> userspace what the problem is, *when* there is one?
> 
> I.e., pr_err or pr_info is much more helpful than pr_debug IMO.

There can be more than one loader, and the one which first claims
to recognize the image will get to load it. So once 32 bit
loader support comes in, it might happen that we ask the 64bit loader
first, it rejects the image, and then we ask the 32bit loader.

So these messages are really debug messages which tell why a loader
is not accepting an image. It might not be an image destined for that
loader at all.

pr_debug() allows being verbose if the user wants it for debugging purposes.
You just have to make sure that CONFIG_DYNAMIC_DEBUG=y and enable verbosity
for the individual file:

echo 'file kexec-bzimage.c +p' > /sys/kernel/debug/dynamic_debug/control
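Below is a minimal user-space sketch of the "first loader to accept wins" dispatch described above. It shows why a single loader's rejection is only debug-level noise. Several names here are hypothetical and are not the patch's API: `struct image_loader`, the loader table, and the toy probe criteria (a three-byte tag instead of the real bzImage checks). `fprintf(stderr, ...)` stands in for pr_debug().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical loader descriptor: probe() returns 0 if it accepts buf. */
struct image_loader {
	const char *name;
	int (*probe)(const char *buf, unsigned long len);
};

/* Toy criteria standing in for the real bzImage64/bzImage32 checks. */
static int probe_64(const char *buf, unsigned long len)
{
	return (len >= 3 && memcmp(buf, "K64", 3) == 0) ? 0 : -ENOEXEC;
}

static int probe_32(const char *buf, unsigned long len)
{
	return (len >= 3 && memcmp(buf, "K32", 3) == 0) ? 0 : -ENOEXEC;
}

static const struct image_loader loaders[] = {
	{ "bzImage64", probe_64 },
	{ "bzImage32", probe_32 },
};

/* Try each loader in order. A rejection is expected and only logged at
 * debug level, because the next loader may well accept the image. */
static const struct image_loader *find_loader(const char *buf,
					      unsigned long len)
{
	for (size_t i = 0; i < sizeof(loaders) / sizeof(loaders[0]); i++) {
		if (loaders[i].probe(buf, len) == 0)
			return &loaders[i];
		fprintf(stderr, "debug: %s rejected image\n", loaders[i].name);
	}
	return NULL; /* nobody claimed it: caller reports -ENOEXEC once */
}
```

Under this design only the final "no loader accepted the image" outcome is a real error worth reporting at a visible level.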

> 
> > +		return ret;
> > +	}
> > +
> > +	header = (struct setup_header *)(buf + offsetof(struct boot_params,
> > +								hdr));
> 
> Just let that stick out. The 80 cols limit is not a hard one anyway,
> especially if it impairs readability.

Will do.

> 
> > +	if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
> 
> Not strncmp? "HdrS" is a string...

As Peter said, this is not a string. So I will retain it.

> 
> > +		pr_debug("Not a bzImage\n");
> > +		return ret;
> > +	}
> > +
> > +	if (header->boot_flag != 0xAA55) {
> > +		pr_debug("No x86 boot sector present\n");
> > +		return ret;
> > +	}
> > +
> > +	if (header->version < 0x020C) {
> > +		pr_debug("Must be at least protocol version 2.12\n");
> > +		return ret;
> > +	}
> > +
> > +	if ((header->loadflags & LOADED_HIGH) == 0) {
> 
> 	if (!(header->loadflags.. ))

Will do.

> 
> > +		pr_debug("zImage not a bzImage\n");
> > +		return ret;
> > +	}
> > +
> > +	if (!(header->xloadflags & XLF_KERNEL_64)) {
> > +		pr_debug("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
> > +		return ret;
> > +	}
> > +
> > +	if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
> > +		pr_debug("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
> > +		return ret;
> > +	}
> 
> Just merge the two checks:
> 
> 	if ((header->xloadflags & (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) !=
>                                   (XLF_KERNEL_64 | XLF_CAN_BE_LOADED_ABOVE_4G)) {
>                 pr_err("Not a bzImage, xloadflags: 0x%x\n", header->xloadflags);
>                 return ret;
>         }

I think I like separate checks better. That way I can output much better
debug message. Just saying xloadflags=0x%x does not tell anything about
what flags the loader is looking for (without looking at the code).
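One way to keep the per-flag diagnostics Vivek wants while staying compact is to walk a small table of required flags. Below is a user-space sketch of that variant. The bit values match XLF_KERNEL_64 (bit 0) and XLF_CAN_BE_LOADED_ABOVE_4G (bit 1) from the x86 boot protocol header as I understand it. Treat them as assumptions and take the real definitions from asm/bootparam.h. `check_xloadflags` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to match the x86 boot protocol definitions. */
#define XLF_KERNEL_64			(1 << 0)
#define XLF_CAN_BE_LOADED_ABOVE_4G	(1 << 1)

/* Each required flag carries its own name, so a rejection says exactly
 * which capability is missing instead of dumping a raw hex value. */
static const struct {
	uint16_t flag;
	const char *name;
} required_xlf[] = {
	{ XLF_KERNEL_64, "XLF_KERNEL_64" },
	{ XLF_CAN_BE_LOADED_ABOVE_4G, "XLF_CAN_BE_LOADED_ABOVE_4G" },
};

/* Returns 0 if all required flags are set, -1 at the first missing one. */
static int check_xloadflags(uint16_t xloadflags)
{
	for (size_t i = 0; i < sizeof(required_xlf) / sizeof(required_xlf[0]); i++) {
		if (!(xloadflags & required_xlf[i].flag)) {
			fprintf(stderr, "debug: %s is not set\n", required_xlf[i].name);
			return -1;
		}
	}
	return 0;
}
```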

> 
> > +
> > +	/* I've got a bzImage */
> > +	pr_debug("It's a relocatable bzImage64\n");
> > +	ret = 0;
> > +
> > +	return ret;
> > +}
> > +
> > +void *bzImage64_load(struct kimage *image, char *kernel,
> > +		unsigned long kernel_len,
> > +		char *initrd, unsigned long initrd_len,
> > +		char *cmdline, unsigned long cmdline_len)
> 
> Arg alignment.

Will do.

[..]
> > +	header = (struct setup_header *)(kernel + setup_hdr_offset);
> > +	setup_sects = header->setup_sects;
> > +	if (setup_sects == 0)
> > +		setup_sects = 4;
> > +
> > +	kern16_size = (setup_sects + 1) * 512;
> > +	if (kernel_len < kern16_size) {
> > +		pr_debug("bzImage truncated\n");
> 
> Ditto for all those pr_debug's in here - I think we want to know why the
> bzImage load fails and pr_debug is not suitable for that.

Same here. We will potentially be trying multiple loaders and if every
loader prints messages for rejection by default, it is too much info,
IMO.

> 
> > +		return ERR_PTR(-ENOEXEC);
> > +	}
> > +
> > +	if (cmdline_len > header->cmdline_size) {
> 
> As we talked, I think COMMAND_LINE_SIZE is perfectly fine and safe for
> all intents and purposes.

I still have concerns about using COMMAND_LINE_SIZE. If the header information
is useful for a bootloader, then the kernel is just a bootloader in this case,
and if we really want to limit the size, it should be based on the information
present in the header and not on the currently running kernel's limit.
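The disagreement is about which bound a user-supplied command line length should be checked against. Below is a user-space sketch of the header-based check discussed here, plus a fixed ceiling as one possible way to also satisfy the "don't trust userspace" concern. `HYPOTHETICAL_CMDLINE_CAP` and `validate_cmdline_len` are illustrative assumptions, and so is the cap value. The sketch also rejects a zero length, because a terminator written at index `len - 1` would otherwise fall out of bounds.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative fixed ceiling; the value is an assumption of this sketch. */
#define HYPOTHETICAL_CMDLINE_CAP (1UL << 20)

/* Validate a user-supplied command line length (including the trailing
 * NUL) against the size advertised in the new kernel's setup header,
 * plus a fixed upper bound so that a bogus header value cannot justify
 * an arbitrarily large allocation. */
static int validate_cmdline_len(unsigned long cmdline_len,
				uint32_t hdr_cmdline_size)
{
	if (cmdline_len == 0)
		return -EINVAL; /* no room for the terminating NUL */
	if (cmdline_len > hdr_cmdline_size)
		return -EINVAL; /* new kernel cannot accept it */
	if (cmdline_len > HYPOTHETICAL_CMDLINE_CAP)
		return -EINVAL; /* unreasonable regardless of the header */
	return 0;
}
```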

> 
> > +		pr_debug("Kernel command line too long\n");
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	/* Allocate loader specific data */
> > +	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> > +	if (!ldata)
> > +		return ERR_PTR(-ENOMEM);
> 
> Why don't you move that allocation to the place right before it is being
> assigned to it? I.e., to the "ldata->bootparams_buf = params" line.

I like doing memory allocations early in the functions (as far as
possible) and erroring out if need be. If memory is not available to begin
with for all the data structures needed by this function, it is kind
of pointless to do the rest of the processing.

[..]
> > +	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> > +			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
> > +			       ULONG_MAX, 1, &bootparam_load_addr);
> > +	if (ret)
> > +		goto out_free_params;
> > +	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> > +		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
> > +
> > +	/* Load kernel */
> > +	kernel_buf = kernel + kern16_size;
> > +	kernel_bufsz =  kernel_len - kern16_size;
> > +	kernel_memsz = ALIGN(header->init_size, 4096);
> 
> PAGE_ALIGN

Will change.
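For reference, ALIGN(x, 4096) and PAGE_ALIGN(x) agree only when PAGE_SIZE is 4096, which holds on x86. PAGE_ALIGN simply states the intent. Below is a user-space sketch of the round-up arithmetic. `SKETCH_ALIGN`, `SKETCH_PAGE_SIZE` and `SKETCH_PAGE_ALIGN` are hypothetical local stand-ins for the kernel macros, assuming 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define SKETCH_ALIGN(x, a)	(((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/* Assumed 4 KiB pages, as on x86. */
#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_PAGE_ALIGN(x)	SKETCH_ALIGN((uint64_t)(x), SKETCH_PAGE_SIZE)
```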

> 
> > +	kernel_align = header->kernel_alignment;
> > +
> > +	ret = kexec_add_buffer(image, kernel_buf,
> > +			       kernel_bufsz, kernel_memsz, kernel_align,
> > +			       MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
> > +			       &kernel_load_addr);
> > +	if (ret)
> > +		goto out_free_params;
> > +
> > +	pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> > +		 kernel_load_addr, kernel_memsz, kernel_memsz);
> > +
> > +	/* Load initrd high */
> > +	if (initrd) {
> > +		ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
> > +				       PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
> > +				       ULONG_MAX, 1, &initrd_load_addr);
> > +		if (ret)
> > +			goto out_free_params;
> > +
> > +		pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> > +				initrd_load_addr, initrd_len, initrd_len);
> 
> emtpy line to split, pls.

Will do.

> 
> > +		ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
> > +		if (ret)
> 
> This ret is unconditionally 0 - no need to check it.

Ok will change it.

> 
> > +			goto out_free_params;
> > +	}
> > +
> > +	ret = kexec_setup_cmdline(params, bootparam_load_addr,
> > +				  sizeof(struct boot_params), cmdline,
> > +				  cmdline_len);
> > +	if (ret)
> 
> Ditto.

Will change.

[..]
> > +	ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> > +					     sizeof(regs64), 0);
> > +	if (ret)
> > +		goto out_free_params;
> > +
> > +	ret = kexec_setup_boot_parameters(params);
> 
> Ditto.

Will change.

[..]
> > +/*
> > + * Common code for x86 and x86_64 used for kexec.
> > + *
> > + * For the time being it compiles only for x86_64 as there are no image
> > + * loaders implemented for x86. This #ifdef can be removed once somebody
> > + * decides to write an image loader on CONFIG_X86_32.
> > + */
> > +
> > +#ifdef CONFIG_X86_64
> 
> Ok, this doesn't make any sense: this new machine_kexec.c is supposed to
> be common code and yet it has this 64-bit ifdef in there.
> 
> It should be the other way around, IMO: put it now in machine_kexec_64.c
> and if someone wants the 32-bit version, that someone should carve it
> out. This'll save you the needless ifdeffery now.

Hmm..., if you feel strongly about it, I can make this change. I thought
this just made it easier to share the code between 32bit and 64bit.

> 
> > +
> > +int kexec_setup_initrd(struct boot_params *params,
> > +		unsigned long initrd_load_addr, unsigned long initrd_len)
> > +{
> > +	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> > +	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
> 
> We have more readable GENMASK* macros for contiguous masks. This one
> will then look like:
> 
> 	params->hdr.ramdisk_image = initrd_load_addr & GENMASK(31, 0);
> 	params->hdr.ramdisk_size = initrd_len & GENMASK(31, 0);
> 
> and this way we know exactly about which bits are we talking about. :)

Ok, will use it.

> 
> > +
> > +	params->ext_ramdisk_image = initrd_load_addr >> 32;
> > +	params->ext_ramdisk_size = initrd_len >> 32;
> > +
> > +	return 0;
> > +}
> > +
> > +int kexec_setup_cmdline(struct boot_params *params,
> > +		unsigned long bootparams_load_addr,
> > +		unsigned long cmdline_offset, char *cmdline,
> > +		unsigned long cmdline_len)
> > +{
> > +	char *cmdline_ptr = ((char *)params) + cmdline_offset;
> > +	unsigned long cmdline_ptr_phys;
> > +	uint32_t cmdline_low_32, cmdline_ext_32;
> > +
> > +	memcpy(cmdline_ptr, cmdline, cmdline_len);
> > +	cmdline_ptr[cmdline_len - 1] = '\0';
> > +
> > +	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
> > +	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
> 
> GENMASK

Will change.

> 
> > +	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> > +
> > +	params->hdr.cmd_line_ptr = cmdline_low_32;
> > +	if (cmdline_ext_32)
> > +		params->ext_cmd_line_ptr = cmdline_ext_32;
> > +
> > +	return 0;
> > +}
> > +
> > +static int setup_memory_map_entries(struct boot_params *params)
> > +{
> > +	unsigned int nr_e820_entries;
> > +
> > +	/* TODO: What about EFI */
> 
> You're removing this line in 13/13 so don't add it at all... ?

Yep. Will remove.

[..]
> > +	/* Setup EDD info */
> > +	memcpy(params->eddbuf, boot_params.eddbuf,
> > +				EDDMAXNR * sizeof(struct edd_info));
> 				^^^^^^^^^^^^^^
> 
> Shouldn't you just copy eddbuf_entries many instead of EDDMAXNR?
> 

I think it just makes it safer: we never try to copy more than the
size of the destination, even if ->eddbuf_entries is wrong or corrupted.

I see copy_edd() does similar thing.

memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
edd.edd_info_nr = boot_params.eddbuf_entries;

So maybe it is not a bad idea to copy based on the max size of the data
structures.
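The two positions can be combined. One option is to copy the full fixed-size array, so the copy never exceeds the destination, and also clamp the entry count, so a corrupted `eddbuf_entries` cannot make later readers index past the array. Below is a user-space sketch with simplified, hypothetical types. `SKETCH_EDDMAXNR` stands in for EDDMAXNR, assumed here to be 6. The real `struct edd_info` is considerably larger.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_EDDMAXNR 6 /* assumed to match EDDMAXNR */

/* Hypothetical, simplified EDD record. */
struct sketch_edd_info {
	unsigned char device;
	unsigned char version;
};

/* Hypothetical slice of boot_params holding the EDD data. */
struct sketch_params {
	unsigned char eddbuf_entries;
	struct sketch_edd_info eddbuf[SKETCH_EDDMAXNR];
};

/* Copy the whole fixed-size array (never more than the destination holds)
 * and clamp the entry count, so neither the copy nor later consumers can
 * run past the end even if the source count is bogus. */
static void copy_edd(struct sketch_params *dst, const struct sketch_params *src)
{
	unsigned char n = src->eddbuf_entries;

	memcpy(dst->eddbuf, src->eddbuf, sizeof(dst->eddbuf));
	dst->eddbuf_entries = n > SKETCH_EDDMAXNR ? SKETCH_EDDMAXNR : n;
}
```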

> > +	params->eddbuf_entries = boot_params.eddbuf_entries;
> > +

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 09/13] purgatory: Core purgatory functionality
  2014-06-16 17:25           ` Vivek Goyal
@ 2014-06-16 20:10             ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 20:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 01:25:38PM -0400, Vivek Goyal wrote:
> I tried following with CONFIG_KEXEC=n
> 
> ifeq ($(CONFIG_KEXEC),y)
>         $(Q)$(MAKE) $(clean)=arch/x86/purgatory
> endif
> 
> And still "make V=1 clean" shows me that it is going in purgatory dir
> to clean things up.

So add the ifdef for the CONFIG_KEXEC=n .configs.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 20:05                       ` Borislav Petkov
@ 2014-06-16 20:53                         ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 20:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Mon, Jun 16, 2014 at 10:05:26PM +0200, Borislav Petkov wrote:
> On Mon, Jun 16, 2014 at 01:38:23PM -0400, Vivek Goyal wrote:
> > And what's the sane default in this case?
> 
> COMMAND_LINE_SIZE
> 
> > Using current kernel's command line size will not work if future
> > kernel decide to support even longer command line size.
> 
> When do you ever get to kexec a kernel with command line size differing
> from the first kernel? This use case is pretty much non-existent to
> say the least (mind you, I'm open to examples but am still waiting for
> them). And even then you go and simply upgrade the first kernel.

The kdump kernel uses a different command line: it adds extra options to
the currently running kernel's command line.

Until recently we passed the new kernel's memory map on the command line
using "memmap=", and when the command line size was 256, we could easily
exhaust it on large machines.

Now we support 2048 and have not seen that issue; we have also moved to
passing memory ranges in bootparams, so the issue no longer exists. But
the kernel still allows passing memmap= on the command line.

One can do the same thing using kexec too.

Agreed that it is very much a corner case. We can say that we will not
support it. I am fine with that, but I at least wanted a discussion and a
common understanding of what the new syscall will support and what it
will not.

Some arches still seem to have a COMMAND_LINE_SIZE of 256. They are more
likely to hit this scenario at some point.

Given that you feel so strongly about putting in this upper limit, I will
introduce it, along with a comment saying that if the kernel we are
kexecing into supports a longer command line, we will not support that
size and one needs to upgrade the first kernel.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-16 20:06       ` Vivek Goyal
@ 2014-06-16 20:57         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 20:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 04:06:08PM -0400, Vivek Goyal wrote:
> There can be more than one loader and the one which claims first
> to recognize the image will get to load the image. So once 32 bit
> loader support comes in, it might happen that we ask 64bit loader
> first and it rejects the image and then we ask 32bit loader.

What does that have to do with anything??

> So these message are really debug message which tells why loader
> is not accepting an image. It might not be image destined for that
> loader at all.
> 
> pr_debug() allows being verbose if user wants to for debugging purposes.
> You just have to make sure that CONFIG_DYNAMIC_DEBUG=y and enable verbosity
> in individual file.
> 
> echo 'file kexec-bzimage.c +p' > /sys/kernel/debug/dynamic_debug/control

So people are supposed to enable dynamic_debug just so that they see
*why* their image doesn't load.

Doesn't sound optimal to me.

> Same here. We will potentially be trying multiple loaders and if every
> loader prints messages for rejection by default, it is too much of
> info, IMO.

For max two loaders on one architecture? I don't think so. Now you're
just arguing for the sake of it.

> I like doing memory allocations early in the functions (as far as
> possible) and error out if need be. If memory is available to begin
> with for all the data structures needed by this function, it is kind
> of pointless to do rest of the processing.

We're talking about memory for a single void * which is ridiculous. And
I think simplifying the error paths is a much higher win than doing some
minor allocation.

> Hmm..., If you feel strongly about it, I can make this change. I
> thought I just made it easier to share the code between 32bit and
> 64bit by this.

Someone later can do that - right now this code is 64-bit only as far as
we're concerned and if it can be made to work on 32-bit, then people are
free to do so.

> I think it is safer this way: we never copy more than the size of the
> destination, even if ->eddbuf_entries is wrong or corrupted.
>
> I see copy_edd() does a similar thing:
>
> memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
> edd.edd_info_nr = boot_params.eddbuf_entries;
>
> So maybe it is not a bad idea to copy based on the maximum size of the
> data structures.

Ok, makes sense.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 20:53                         ` Vivek Goyal
@ 2014-06-16 21:09                           ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 21:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, hpa, akpm

On Mon, Jun 16, 2014 at 04:53:31PM -0400, Vivek Goyal wrote:
> The kdump kernel uses a different command line: it adds extra options to
> the currently running kernel's command line.
>
> Until recently we passed the new kernel's memory map on the command line
> using "memmap=", and when the command line size was 256, we could easily
> exhaust it on large machines.
>
> Now we support 2048 and have not seen that issue; we have also moved to
> passing memory ranges in bootparams, so the issue no longer exists. But
> the kernel still allows passing memmap= on the command line.
>
> One can do the same thing using kexec too.
>
> Agreed that it is very much a corner case. We can say that we will not
> support it. I am fine with that, but I at least wanted a discussion and a
> common understanding of what the new syscall will support and what it
> will not.
>
> Some arches still seem to have a COMMAND_LINE_SIZE of 256. They are more
> likely to hit this scenario at some point.
>
> Given that you feel so strongly about putting in this upper limit, I will
> introduce it, along with a comment saying that if the kernel we are
> kexecing into supports a longer command line, we will not support that
> size and one needs to upgrade the first kernel.

Nah, I don't feel strongly about it - I just don't trust userspace and
think that every value we get from it should be "sanitized".

But if you say that you want to be able to pass bigger command line to
2nd kernel because this is how kexec passes info, then I'm fine with it.
This is actually a very valid use case which I was asking for, thanks!

I guess if a malicious user goes to great lengths to manipulate
header->cmdline_size just so that kmalloc still succeeds and we're fine
with it then I certainly don't have anything against it. I.e., if user
really wants to shoot himself in the foot, user can.

So it is a good thing we talked about it then. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-03 13:06 ` Vivek Goyal
@ 2014-06-16 21:13   ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 21:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> This patch series does not do kernel signature verification yet. I
> plan to post another patch series for that. Now bzImage is already
> signed with PKCS7 signature I plan to parse and verify those
> signatures.

Btw, do you have a brief outline on how you are going to do the
extension to signature verification? Nothing formal, just enough of an
outline that I can see where in the flow it will be plugged in.

I was wondering how the whole signature signing and verification will
be done, i.e., where do I get the signature, how and who will verify it
(I'm guessing the purgatory code), etc, etc.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-16 20:57         ` Borislav Petkov
@ 2014-06-16 21:15           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 21:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 10:57:43PM +0200, Borislav Petkov wrote:
> On Mon, Jun 16, 2014 at 04:06:08PM -0400, Vivek Goyal wrote:
> > There can be more than one loader and the one which claims first
> > to recognize the image will get to load the image. So once 32 bit
> > loader support comes in, it might happen that we ask 64bit loader
> > first and it rejects the image and then we ask 32bit loader.
> 
> What does that have to do with anything??

Say that down the line you support 3 types of kernel images: 64bit bzImage,
32bit bzImage and ELF vmlinux, with 3 different loader implementations in
the kernel. Now assume the user is trying to load an ELF vmlinux image.

Generic code will call:

arch_kexec_kernel_image_probe {
        for (i = 0; i < nr_file_types; i++) {
                if (!kexec_file_type[i].probe)
                        continue;

                ret = kexec_file_type[i].probe(buf, buf_len);
                if (!ret) {
                        image->file_handler_idx = i;
                        return ret;
                }
        }
	return ret;
}

This code calls into every registered loader in turn and, if none of them
accepts the image, returns an error. Say the bzImage64 and bzImage32
loaders are called first. They will both reject the image and the
vmlinux loader will accept it.

Do we want to show all the rejection messages from the bzImage64 and
bzImage32 loaders? It might be too verbose to tell users that, before the
vmlinux loader accepted the image, the other loaders on this arch
rejected it.

This is very similar to binary file loading. Look at load_elf_binary():
if it does not find an ELF header, it knows the file is not an ELF binary
and returns -ENOEXEC without printing any messages.

> 
> > So these message are really debug message which tells why loader
> > is not accepting an image. It might not be image destined for that
> > loader at all.
> > 
> > pr_debug() allows being verbose if user wants to for debugging purposes.
> > You just have to make sure that CONFIG_DYNAMIC_DEBUG=y and enable verbosity
> > in individual file.
> > 
> > echo 'file kexec-bzimage.c +p' > /sys/kernel/debug/dynamic_debug/control
> 
> So people are supposed to enable dynamic_debug just so that they see
> *why* their image doesn't load.
> 
> Doesn't sound optimal to me.

This is one way of doing it. I can change it if you think that displaying
messages from all the loaders is fine.

> 
> > Same here. We will potentially be trying multiple loaders and if every
> > loader prints messages for rejection by default, it is too much of
> > info, IMO.
> 
> For max two loaders on one architecture? I don't think so. Now you're
> just arguing for the sake of it.

Well, we currently have 3 loaders in user space for x86_64: bzImage64,
bzImage32 and ELF vmlinux. So it seems likely that somebody will go
about implementing these in kernel space too.

> 
> > I like doing memory allocations early in the functions (as far as
> > possible) and error out if need be. If memory is available to begin
> > with for all the data structures needed by this function, it is kind
> > of pointless to do rest of the processing.
> 
> We're talking about memory for a single void * which is ridiculous. And
> I think simplifying the error paths is a much higher win than doing some
> minor allocation.

Ok, I will change it.

> 
> > Hmm..., If you feel strongly about it, I can make this change. I
> > thought I just made it easier to share the code between 32bit and
> > 64bit by this.
> 
> Someone later can do that - right now this code is 64-bit only as far as
> we're concerned and if it can be made to work on 32-bit, then people are
> free to do so.

Ok.

Thanks
Vivek

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 21:09                           ` Borislav Petkov
@ 2014-06-16 21:25                             ` H. Peter Anvin
  -1 siblings, 0 replies; 214+ messages in thread
From: H. Peter Anvin @ 2014-06-16 21:25 UTC (permalink / raw)
  To: Borislav Petkov, Vivek Goyal
  Cc: WANG Chao, Dave Young, mjg59, bhe, jkosina, greg, kexec,
	linux-kernel, ebiederm, akpm

On 06/16/2014 02:09 PM, Borislav Petkov wrote:
> 
> Nah, I don't feel strongly about it - I just don't trust userspace and
> think that every value we get from it should be "sanitized".
> 

Borislav and I talked about this briefly over IRC.  A key part of that
is that if userspace could manipulate this system call to consume an
unreasonable amount of memory, we would have a problem, for example if
this code used vzalloc() instead of kzalloc().  However, since
kmalloc/kzalloc implies a relatively restrictive limit on the memory
allocation size anyway, well short of anything that could cause OOM
problems, that pretty much solves the problem.

	-hpa


* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-16 21:15           ` Vivek Goyal
@ 2014-06-16 21:27             ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 21:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 05:15:00PM -0400, Vivek Goyal wrote:
> Do we want to show all the rejection messages from bzImage64 and
> bzImage32 loaders. It might be too verbose to show users that before
> vmlinux loader accepted the image other loaders on this arches rejected
> the image.

I get all that. But, if people want to get feedback from the system
about *why* their image didn't load, they absolutely have to enable
dynamic debug. And this is not optimal IMO because they will have to
look at the code first to see what they need to do.

Or is kexec-tools going to be taught to interpret return values from the
syscall?

In any case, we want information about why an image fails loading to
reach the user in the easiest way possible. And why should the user need
to enable dynamic debug if he can get the info without doing so?

Oh, and not everyone knows about dynamic debug so...

And I don't think it'll be too much info - only the line which fails
the check will be printed before the image loader fails so that's
practically one error reason per failed image.

Ok?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 21:25                             ` H. Peter Anvin
@ 2014-06-16 21:43                               ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 21:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, WANG Chao, Dave Young, mjg59, bhe, jkosina,
	greg, kexec, linux-kernel, ebiederm, akpm

On Mon, Jun 16, 2014 at 02:25:07PM -0700, H. Peter Anvin wrote:
> On 06/16/2014 02:09 PM, Borislav Petkov wrote:
> > 
> > Nah, I don't feel strongly about it - I just don't trust userspace and
> > think that every value we get from it should be "sanitized".
> > 
> 
> Borislav and I talked about this briefly over IRC.  A key part of that
> is that if userspace could manipulate this system call to consume an
> unreasonable amount of memory, we would have a problem, for example if
> this code used vzalloc() instead of kzalloc().  However, since
> kmalloc/kzalloc implies a relatively restrictive limit on the memory
> allocation size anyway, well short of anything that could cause OOM
> problems, that pretty much solves the problem.

Actually currently I am using vzalloc() for command line buffer
allocation.

	image->cmdline_buf = vzalloc(cmdline_len);
	if (!image->cmdline_buf)
		goto out;

Should I switch to using kzalloc() instead?

Thanks
Vivek

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-16 21:27             ` Borislav Petkov
@ 2014-06-16 21:45               ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-16 21:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 11:27:20PM +0200, Borislav Petkov wrote:
> On Mon, Jun 16, 2014 at 05:15:00PM -0400, Vivek Goyal wrote:
> > Do we want to show all the rejection messages from bzImage64 and
> > bzImage32 loaders. It might be too verbose to show users that before
> > vmlinux loader accepted the image other loaders on this arches rejected
> > the image.
> 
> I get all that. But, if people want to get feedback from the system
> about *why* their image didn't load, they absolutely have to enable
> dynamic debug. And this is not optimal IMO because they will have to
> look at the code first to see what they need to do.
> 
> Or is kexec-tools going to be taught to interpret return values from the
> syscall?

In most cases the return code is -ENOEXEC, so kexec-tools can't figure
out what's wrong.

> 
> In any case, we want information about why an image fails loading to
> reach the user in the easiest way possible. And why should the user need
> to enable dynamic debug if he can get the info without doing so?
> 
> Oh, and not everyone knows about dynamic debug so...
> 
> And I don't think it'll be too much info - only the line which fails
> the check will be printed before the image loader fails so that's
> practically one error reason per failed image.
> 

Ok, there will be one line of error and that's not too bad. I will
convert these pr_debug() statements in bzImage_probe() to pr_err().

Thanks
Vivek

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 21:43                               ` Vivek Goyal
@ 2014-06-16 22:10                                 ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-16 22:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: H. Peter Anvin, WANG Chao, Dave Young, mjg59, bhe, jkosina, greg,
	kexec, linux-kernel, ebiederm, akpm

On Mon, Jun 16, 2014 at 05:43:13PM -0400, Vivek Goyal wrote:
> Actually currently I am using vzalloc() for command line buffer
> allocation.
> 
> 	image->cmdline_buf = vzalloc(cmdline_len);
> 	if (!image->cmdline_buf)
> 		goto out;
> 
> Should I switch to using kzalloc() instead?

Well, vzalloc could trigger the OOM killer, can't it, for the right
size? If so, then we will have a serious problem.

And this snippet above is down the syscall path.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load
  2014-06-16 21:43                               ` Vivek Goyal
@ 2014-06-16 22:49                                 ` H. Peter Anvin
  -1 siblings, 0 replies; 214+ messages in thread
From: H. Peter Anvin @ 2014-06-16 22:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Borislav Petkov, WANG Chao, Dave Young, mjg59, bhe, jkosina,
	greg, kexec, linux-kernel, ebiederm, akpm

On 06/16/2014 02:43 PM, Vivek Goyal wrote:
>>
>> Borislav and I talked about this briefly over IRC.  A key part of that
>> is that if userspace could manipulate this system call to consume an
>> unreasonable amount of memory, we would have a problem, for example if
>> this code used vzalloc() instead of kzalloc().  However, since
>> kmalloc/kzalloc implies a relatively restrictive limit on the memory
>> allocation size anyway, well short of anything that could cause OOM
>> problems, that pretty much solves the problem.
> 
> Actually currently I am using vzalloc() for command line buffer
> allocation.
> 
> 	image->cmdline_buf = vzalloc(cmdline_len);
> 	if (!image->cmdline_buf)
> 		goto out;
> 
> Should I switch to using kzalloc() instead?
> 

Yes.  There is absolutely no valid reason to use vzalloc() for an object
that small, and if someone manipulates the header to allow for a crazily
large command line then you can trick the kernel into allocating
arbitrary amounts of memory.

	-hpa



* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-16 21:13   ` Borislav Petkov
@ 2014-06-17 13:24     ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-17 13:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Mon, Jun 16, 2014 at 11:13:28PM +0200, Borislav Petkov wrote:
> On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> > This patch series does not do kernel signature verification yet. I
> > plan to post another patch series for that. Now bzImage is already
> > signed with PKCS7 signature I plan to parse and verify those
> > signatures.
> 
> Btw, do you have a brief outline on how you are going to do the
> extension to signature verification? Nothing formal, just enough of an
> outline that I can see where in the flow it will be plugged in.
> 
> I was wondering how the whole signature signing and verification will
> be done, i.e., where do I get the signature, how and who will verify it
> (I'm guessing the purgatory code), etc, etc.

Hi Boris,

I am still working on signing patches. I have uploaded them here for you
to have a look.

http://people.redhat.com/vgoyal/kdump-secureboot/in-kernel-kexec/bzimage-sig-verification/

The last patch should give you a good idea of how signature verification
hooks into kexec.

The signature verification patches are based on David Howells' work.

There is no purgatory involved here. Once we have copied the kernel image
into kernel space, the first thing we do is call into the arch image
loader and ask it to verify the image's signature. The bzImage loader in
turn calls into helper functions which can parse a signed PE file and
verify its signature.

I have yet to clean up the code and implement a config option which
enforces signature verification.

Automatic enforcement of signature verification when secureboot is
enabled will happen once Matthew's secureboot patches are merged into
mainline.

Thanks
Vivek

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-12  5:42   ` Dave Young
@ 2014-06-17 14:24     ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-17 14:24 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On Thu, Jun 12, 2014 at 01:42:03PM +0800, Dave Young wrote:
> On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V3 of the patchset. Previous versions were posted here.
> > 
> > V1: https://lkml.org/lkml/2013/11/20/540
> > V2: https://lkml.org/lkml/2014/1/27/331
> > 
> > Changes since v2:
> > 
> > - Took care of most of the review comments from V2.
> > - Added support for kexec/kdump on EFI systems.
> > - Dropped support for loading ELF vmlinux.
> > 
> > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > two patch cleanup series which is sitting in -tip tree here.
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot
> > 
> > This patch series does not do kernel signature verification yet. I plan
> > to post another patch series for that. Now bzImage is already signed
> > with PKCS7 signature I plan to parse and verify those signatures.
> > 
> > Primary goal of this patchset is to prepare groundwork so that kernel
> > image can be signed and signatures be verified during kexec load. This
> > should help with two things.
> > 
> > - It should allow kexec/kdump on secureboot enabled machines.
> > 
> > - In general it can help even without secureboot. By being able to verify
> >   kernel image signature in kexec, it should help with avoiding module
> >   signing restrictions. Matthew Garret showed how to boot into a custom
> >   kernel, modify first kernel's memory and then jump back to old kernel and
> >   bypass any policy one wants to.
> > 
> > Any feedback is welcome.
> 
> Hi, Vivek
> 
> For efi ioremapping case, in 3.15 kernel efi runtime maps will not be saved
> if efi=old_map is used. So you need detect this and fail the kexec file load.

Dave,

Instead of failing the kexec load when efi=old_map is used, I think it
would be better to just not pass the runtime map in boot_params. That way
the user can pass "noefi" on the command line and kdump should still work
(as it does with the user space implementation).

What do you think?

Thanks
Vivek

* Re: [PATCH 12/13] kexec: Support for Kexec on panic using new system call
  2014-06-03 13:07   ` Vivek Goyal
@ 2014-06-17 21:43     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-17 21:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:07:01AM -0400, Vivek Goyal wrote:
> This patch adds support for loading a kexec on panic (kdump) kernel usning
> new system call. Right now this primarily works with bzImage loader only.
> But changes to ELF loader should be minimal as all the core infrastrcture
> is there.
> 
> Only thing preventing making ELF load in crash reseved memory is
> that kernel vmlinux is of type ET_EXEC and it expects to be loaded at
> address it has been compiled for. At that location current kernel is
> already running. One first needs to make vmlinux fully relocatable
> and export it is type ET_DYN and then modify this ELF loader to support
> images of type ET_DYN.
> 
> I am leaving it as a future TODO item.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/crash.h       |   9 +
>  arch/x86/include/asm/kexec.h       |  25 +-
>  arch/x86/kernel/crash.c            | 581 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/kexec-bzimage.c    |  27 +-
>  arch/x86/kernel/machine_kexec.c    |  21 +-
>  arch/x86/kernel/machine_kexec_64.c |  40 +++
>  kernel/kexec.c                     |  83 +++++-
>  7 files changed, 770 insertions(+), 16 deletions(-)
>  create mode 100644 arch/x86/include/asm/crash.h
> 
> diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
> new file mode 100644
> index 0000000..2dd2eb8
> --- /dev/null
> +++ b/arch/x86/include/asm/crash.h
> @@ -0,0 +1,9 @@
> +#ifndef _ASM_X86_CRASH_H
> +#define _ASM_X86_CRASH_H
> +
> +int load_crashdump_segments(struct kimage *image);

I'd name it crash_load_segments(..), as you're prefixing the other exported
functions with "crash_".

> +int crash_copy_backup_region(struct kimage *image);
> +int crash_setup_memmap_entries(struct kimage *image,
> +		struct boot_params *params);
> +
> +#endif /* _ASM_X86_CRASH_H */

...

> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 507de80..b6a0974 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -4,6 +4,9 @@
>   * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
>   *
>   * Copyright (C) IBM Corporation, 2004. All rights reserved.
> + * Copyright (C) Red Hat Inc., 2014. All rights reserved.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
>   *
>   */
>  
> @@ -16,6 +19,7 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/module.h>
> +#include <linux/slab.h>
>  
>  #include <asm/processor.h>
>  #include <asm/hardirq.h>
> @@ -28,6 +32,45 @@
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
>  
> +/* Alignment required for elf header segment */
> +#define ELF_CORE_HEADER_ALIGN   4096
> +
> +/* This primarily reprsents number of split ranges due to exclusion */

"represents"

> +#define CRASH_MAX_RANGES	16
> +
> +struct crash_mem_range {
> +	u64 start, end;
> +};
> +
> +struct crash_mem {
> +	unsigned int nr_ranges;
> +	struct crash_mem_range ranges[CRASH_MAX_RANGES];
> +};
> +
> +/* Misc data about ram ranges needed to prepare elf headers */
> +struct crash_elf_data {
> +	struct kimage *image;
> +	/*
> +	 * Total number of ram ranges we have after various ajustments for

"adjustments"

> +	 * GART, crash reserved region etc.
> +	 */
> +	unsigned int max_nr_ranges;
> +	unsigned long gart_start, gart_end;
> +
> +	/* Pointer to elf header */
> +	void *ehdr;
> +	/* Pointer to next phdr */
> +	void *bufp;
> +	struct crash_mem mem;
> +};
> +
> +/* Used while preparing memory map entries for second kernel */
> +struct crash_memmap_data {
> +	struct boot_params *params;
> +	/* Type of memory */
> +	unsigned int type;
> +};
> +
>  int in_crash_kexec;
>  
>  /*
> @@ -39,6 +82,7 @@ int in_crash_kexec;
>   */
>  crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
>  EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
> +unsigned long crash_zero_bytes;

Ah, that's the empty_zero_page...

>  
>  static inline void cpu_crash_vmclear_loaded_vmcss(void)
>  {
> @@ -135,3 +179,540 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  #endif
>  	crash_save_cpu(regs, safe_smp_processor_id());
>  }
> +
> +#ifdef CONFIG_X86_64
> +
> +static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> +				unsigned long nr_pfn, void *arg)
> +{
> +	int *nr_ranges = arg;
> +
> +	(*nr_ranges)++;
> +	return 0;
> +}
> +
> +static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +
> +	ced->gart_start = start;
> +	ced->gart_end = end;
> +
> +	/* Not expecting more than 1 gart aperture */
> +	return 1;
> +}
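The GART callback above (and determine_backup_region() further down) returns 1 to end the walk after the first match. That is why load_crashdump_segments() treats only negative returns as errors. Here is a minimal userspace sketch of that convention. It assumes a walker that stops at the first non-zero callback return and passes that value back. `walk_ranges()`, `first_range_cb()` and `struct range` are hypothetical stand-ins, not the kernel's walk_ram_res() or struct resource.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource entry: inclusive [start, end]. */
struct range {
	uint64_t start, end;
};

typedef int (*range_cb)(uint64_t start, uint64_t end, void *arg);

/*
 * Visit each range in order. A non-zero callback return stops the walk and
 * is handed back to the caller: > 0 means "done early", < 0 means error.
 */
static int walk_ranges(const struct range *r, size_t n, void *arg, range_cb cb)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		ret = cb(r[i].start, r[i].end, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Record only the first range, like get_gart_ranges_callback(). */
static int first_range_cb(uint64_t start, uint64_t end, void *arg)
{
	struct range *out = arg;

	out->start = start;
	out->end = end;
	return 1;	/* stop after the first match */
}
```

Under this convention a caller checks `ret < 0` for failure. A return of 0 means the walk completed, and a positive value means a callback chose to stop early.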
> +
> +
> +/* Gather all the required information to prepare elf headers for ram regions */
> +static int fill_up_crash_elf_data(struct crash_elf_data *ced,
> +					struct kimage *image)
> +{
> +	unsigned int nr_ranges = 0;
> +
> +	ced->image = image;
> +
> +	walk_system_ram_range(0, -1, &nr_ranges,
> +				get_nr_ram_ranges_callback);
> +
> +	ced->max_nr_ranges = nr_ranges;
> +
> +	/*
> +	 * We don't create ELF headers for GART aperture as an attempt
> +	 * to dump this memory in second kernel leads to hang/crash.
> +	 * If gart aperture is present, one needs to exclude that region
> +	 * and that could lead to need of extra phdr.
> +	 */
> +	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
> +				ced, get_gart_ranges_callback);
> +
> +	/*
> +	 * If we have gart region, excluding that could potentially split
> +	 * a memory range, resulting in extra header. Account for  that.
> +	 */
> +	if (ced->gart_end)
> +		ced->max_nr_ranges++;
> +
> +	/* Exclusion of crash region could split memory ranges */
> +	ced->max_nr_ranges++;
> +
> +	/* If crashk_low_res is there, another range split possible */

You mean "is not 0"?

> +	if (crashk_low_res.end != 0)
> +		ced->max_nr_ranges++;
> +
> +	return 0;

Returns 0 unconditionally - make the function void then.
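Nothing in fill_up_crash_elf_data() can fail, so the range accounting reduces to a void helper. The sketch below covers only the counting. The hypothetical `struct elf_range_budget` and the boolean inputs stand in for `ced->gart_end != 0` and `crashk_low_res.end != 0`.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down counterpart of struct crash_elf_data. */
struct elf_range_budget {
	unsigned int max_nr_ranges;
};

/*
 * Worst-case number of PT_LOAD ranges: every RAM range, plus one extra
 * range for each exclusion that can split a range in two. Nothing here
 * can fail, so the function returns void.
 */
static void fill_range_budget(struct elf_range_budget *b,
			      unsigned int nr_ram_ranges,
			      bool have_gart, bool have_crashk_low)
{
	b->max_nr_ranges = nr_ram_ranges;
	if (have_gart)
		b->max_nr_ranges++;	/* GART aperture may split a range */
	b->max_nr_ranges++;		/* crashk_res may split a range */
	if (have_crashk_low)
		b->max_nr_ranges++;	/* crashk_low_res may split a range */
}
```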

> +}
> +
> +static int exclude_mem_range(struct crash_mem *mem,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	int i, j;
> +	unsigned long long start, end;
> +	struct crash_mem_range temp_range = {0, 0};
> +
> +	for (i = 0; i < mem->nr_ranges; i++) {
> +		start = mem->ranges[i].start;
> +		end = mem->ranges[i].end;
> +
> +		if (mstart > end || mend < start)
> +			continue;
> +
> +		/* Truncate any area outside of range */
> +		if (mstart < start)
> +			mstart = start;
> +		if (mend > end)
> +			mend = end;
> +
> +		/* Found completely overlapping range */
> +		if (mstart == start && mend == end) {
> +			mem->ranges[i].start = 0;
> +			mem->ranges[i].end = 0;
> +			if (i < mem->nr_ranges - 1) {
> +				/* Shift rest of the ranges to left */
> +				for (j = i; j < mem->nr_ranges - 1; j++) {
> +					mem->ranges[j].start =
> +						mem->ranges[j+1].start;
> +					mem->ranges[j].end =
> +							mem->ranges[j+1].end;
> +				}
> +			}
> +			mem->nr_ranges--;
> +			return 0;
> +		}
> +
> +		if (mstart > start && mend < end) {
> +			/* Split original range */
> +			mem->ranges[i].end = mstart - 1;
> +			temp_range.start = mend + 1;
> +			temp_range.end = end;
> +		} else if (mstart != start)
> +			mem->ranges[i].end = mstart - 1;
> +		else
> +			mem->ranges[i].start = mend + 1;
> +		break;
> +	}
> +
> +	/* If a split happend, add the split in array */

"happened" ... "split to array"

> +	if (!temp_range.end)
> +		return 0;
> +
> +	/* Split happened */
> +	if (i == CRASH_MAX_RANGES - 1) {
> +		pr_err("Too many crash ranges after split\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* Location where new range should go */
> +	j = i + 1;
> +	if (j < mem->nr_ranges) {
> +		/* Move over all ranges one place */

			...  all ranges one slot towards the end */

> +		for (i = mem->nr_ranges - 1; i >= j; i--)
> +			mem->ranges[i + 1] = mem->ranges[i];
> +	}
> +
> +	mem->ranges[j].start = temp_range.start;
> +	mem->ranges[j].end = temp_range.end;
> +	mem->nr_ranges++;
> +	return 0;
> +}
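For illustration, here is a standalone userspace sketch of the exclusion logic in exclude_mem_range(). The first overlapping range either disappears, shrinks from one end, or is split around the hole. The type and function names are hypothetical. One deliberate difference: before a split, the sketch checks how many slots are in use (`nr`), because the shift loop writes one slot past the last used entry. The quoted code instead compares the split index `i` with `CRASH_MAX_RANGES - 1`, which appears to catch only the case where the split range sits in the last slot.

```c
#include <stdint.h>

#define SKETCH_MAX_RANGES 16	/* mirrors CRASH_MAX_RANGES in the patch */

struct mem_range {
	uint64_t start, end;	/* inclusive */
};

struct mem_ranges {
	unsigned int nr;
	struct mem_range r[SKETCH_MAX_RANGES];
};

/*
 * Remove [mstart, mend] from the first range it overlaps. That range may
 * vanish, shrink from one side, or be split in two. Returns 0 on success,
 * -1 if a split would need more slots than the array has.
 */
static int exclude_range(struct mem_ranges *m, uint64_t mstart, uint64_t mend)
{
	for (unsigned int i = 0; i < m->nr; i++) {
		uint64_t start = m->r[i].start, end = m->r[i].end;

		if (mstart > end || mend < start)
			continue;		/* no overlap */

		/* Clip the excluded area to this range */
		if (mstart < start)
			mstart = start;
		if (mend > end)
			mend = end;

		if (mstart == start && mend == end) {
			/* Whole range excluded: shift the tail left by one */
			for (unsigned int j = i; j + 1 < m->nr; j++)
				m->r[j] = m->r[j + 1];
			m->nr--;
			return 0;
		}

		if (mstart > start && mend < end) {
			/* Hole in the middle: split into two ranges */
			if (m->nr == SKETCH_MAX_RANGES)
				return -1;
			for (unsigned int j = m->nr; j > i + 1; j--)
				m->r[j] = m->r[j - 1];
			m->r[i].end = mstart - 1;
			m->r[i + 1].start = mend + 1;
			m->r[i + 1].end = end;
			m->nr++;
		} else if (mstart != start) {
			m->r[i].end = mstart - 1;	/* trim the tail */
		} else {
			m->r[i].start = mend + 1;	/* trim the head */
		}
		return 0;
	}
	return 0;	/* nothing overlapped */
}
```

Like the quoted function, this handles only the first overlapping range. That is enough when each excluded region (crashk_res, crashk_low_res, GART) lies within a single RAM range.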
> +
> +/*
> + * Look for any unwanted ranges between mstart, mend and remove them. This
> + * might lead to split and split ranges are put in ced->mem.ranges[] array
> + */
> +static int elf_header_exclude_ranges(struct crash_elf_data *ced,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	struct crash_mem *cmem = &ced->mem;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude crashkernel region */
> +	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
> +	if (ret)
> +		return ret;
> +
> +	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude GART region */
> +	if (ced->gart_end) {
> +		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long mstart, mend;
> +	struct kimage *image = ced->image;
> +	struct crash_mem *cmem;
> +	int ret, i;
> +
> +	ehdr = ced->ehdr;
> +
> +	/* Exclude unwanted mem ranges */
> +	ret = elf_header_exclude_ranges(ced, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
> +	cmem = &ced->mem;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		mstart = cmem->ranges[i].start;
> +		mend = cmem->ranges[i].end;
> +
> +		phdr = ced->bufp;
> +		ced->bufp += sizeof(Elf64_Phdr);
> +
> +		phdr->p_type = PT_LOAD;
> +		phdr->p_flags = PF_R|PF_W|PF_X;
> +		phdr->p_offset  = mstart;
> +
> +		/*
> +		 * If a range matches backup region, adjust offset to backup
> +		 * segment.
> +		 */
> +		if (mstart == image->arch.backup_src_start &&
> +		    (mend - mstart + 1) == image->arch.backup_src_sz)
> +			phdr->p_offset = image->arch.backup_load_addr;
> +
> +		phdr->p_paddr = mstart;
> +		phdr->p_vaddr = (unsigned long long) __va(mstart);
> +		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
> +		phdr->p_align = 0;
> +		ehdr->e_phnum++;
> +		pr_debug("Crash PT_LOAD elf header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
> +			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
> +			ehdr->e_phnum, phdr->p_offset);
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_headers(struct crash_elf_data *ced,
> +		void **addr, unsigned long *sz)
> +{
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
> +	unsigned char *buf, *bufp;
> +	unsigned int cpu;
> +	unsigned long long notes_addr;
> +	int ret;
> +
> +	/* extra phdr for vmcoreinfo elf note */
> +	nr_phdr = nr_cpus + 1;
> +	nr_phdr += ced->max_nr_ranges;
> +
> +	/*
> +	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
> +	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
> +	 * I think this is required by tools like gdb. So same physical
> +	 * memory will be mapped in two elf headers. One will contain kernel
> +	 * text virtual addresses and other will have __va(physical) addresses.
> +	 */
> +
> +	nr_phdr++;
> +	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
> +	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
> +
> +	buf = vzalloc(elf_sz);

Since you get zeroed memory, you can save yourself all the assignments to 0
below and thus slim down this already terse function.
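Here is a userspace sketch of this suggestion, using glibc's <elf.h> types. It sizes the buffer the way the patch does, allocates it zeroed with calloc() in place of vzalloc(), and writes only the non-zero ELF header fields. It assumes an x86_64 target (`EM_X86_64` for ELF_ARCH) and that ELF_OSABI is 0 (ELFOSABI_NONE). The helper names are hypothetical.

```c
#include <elf.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ELF_ALIGN 4096UL	/* same value as ELF_CORE_HEADER_ALIGN */

/* Round x up to a multiple of the power-of-two a, like the kernel's ALIGN(). */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Size the header buffer as the patch does: one PT_NOTE per CPU, one
 * PT_NOTE for vmcoreinfo, one PT_LOAD for kernel text, plus the RAM ranges.
 */
static unsigned long elf_hdr_size(unsigned long nr_cpus,
				  unsigned long max_nr_ranges)
{
	unsigned long nr_phdr = nr_cpus + 1 + 1 + max_nr_ranges;

	return align_up(sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr),
			SKETCH_ELF_ALIGN);
}

/* With a zeroed buffer, only the non-zero fields need to be written. */
static Elf64_Ehdr *alloc_core_ehdr(unsigned long sz)
{
	Elf64_Ehdr *ehdr = calloc(1, sz);

	if (!ehdr)
		return NULL;
	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
	ehdr->e_type = ET_CORE;
	ehdr->e_machine = EM_X86_64;
	ehdr->e_version = EV_CURRENT;
	ehdr->e_phoff = sizeof(Elf64_Ehdr);
	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
	ehdr->e_phentsize = sizeof(Elf64_Phdr);
	/* e_entry, e_shoff, e_flags, e_phnum, e_sh* and EI_OSABI stay 0 */
	return ehdr;
}
```

For example, 4 CPUs and 5 RAM ranges give 11 program headers. That is 64 + 11 * 56 = 680 bytes, which rounds up to one 4096-byte page.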

> +	if (!buf)
> +		return -ENOMEM;
> +
> +	bufp = buf;
> +	ehdr = (Elf64_Ehdr *)bufp;
> +	bufp += sizeof(Elf64_Ehdr);
> +	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
> +	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
> +	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
> +	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
> +	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
> +	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
> +	ehdr->e_type = ET_CORE;
> +	ehdr->e_machine = ELF_ARCH;
> +	ehdr->e_version = EV_CURRENT;
> +	ehdr->e_entry = 0;
> +	ehdr->e_phoff = sizeof(Elf64_Ehdr);
> +	ehdr->e_shoff = 0;
> +	ehdr->e_flags = 0;
> +	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
> +	ehdr->e_phentsize = sizeof(Elf64_Phdr);
> +	ehdr->e_phnum = 0;
> +	ehdr->e_shentsize = 0;
> +	ehdr->e_shnum = 0;
> +	ehdr->e_shstrndx = 0;
> +
> +	/* Prepare one phdr of type PT_NOTE for each present cpu */
> +	for_each_present_cpu(cpu) {
> +		phdr = (Elf64_Phdr *)bufp;
> +		bufp += sizeof(Elf64_Phdr);
> +		phdr->p_type = PT_NOTE;
> +		phdr->p_flags = 0;
> +		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
> +		phdr->p_offset = phdr->p_paddr = notes_addr;
> +		phdr->p_vaddr = 0;
> +		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
> +		phdr->p_align = 0;
> +		(ehdr->e_phnum)++;
> +	}
> +
> +	/* Prepare one PT_NOTE header for vmcoreinfo */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_NOTE;
> +	phdr->p_flags = 0;
> +	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> +	phdr->p_vaddr = 0;
> +	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +
> +#ifdef CONFIG_X86_64
> +	/* Prepare PT_LOAD type program header for kernel text region */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_LOAD;
> +	phdr->p_flags = PF_R|PF_W|PF_X;
> +	phdr->p_vaddr = (Elf64_Addr)_text;
> +	phdr->p_filesz = phdr->p_memsz = _end - _text;
> +	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +#endif
> +
> +	/* Prepare PT_LOAD headers for system ram chunks. */
> +	ced->ehdr = ehdr;
> +	ced->bufp = bufp;
> +	ret = walk_system_ram_res(0, -1, ced,
> +			prepare_elf64_ram_headers_callback);
> +	if (ret < 0)
> +		return ret;
> +
> +	*addr = buf;
> +	*sz = elf_sz;
> +	return 0;
> +}
> +
> +/* Prepare elf headers. Return addr and size */
> +static int prepare_elf_headers(struct kimage *image, void **addr,
> +					unsigned long *sz)
> +{
> +	struct crash_elf_data *ced;
> +	int ret;
> +
> +	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
> +	if (!ced)
> +		return -ENOMEM;
> +
> +	ret = fill_up_crash_elf_data(ced, image);
> +	if (ret)
> +		goto out;
> +
> +	/* By default prepare 64bit headers */
> +	ret =  prepare_elf64_headers(ced, addr, sz);
> +out:
> +	kfree(ced);
> +	return ret;
> +}
> +
> +static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	nr_e820_entries = params->e820_entries;
> +	if (nr_e820_entries >= E820MAX)
> +		return 1;

You're not testing for the error condition at any call site. Are we sure
we will never hit E820MAX?
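One way to address this is sketched below in userspace, with hypothetical names and a deliberately tiny table (the kernel's E820MAX is 128). The add helper returns a negative errno such as -ENOSPC instead of 1. The walk callback forwards that value, so a walker that stops on non-zero returns aborts and the caller sees that entries were dropped.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_E820_MAX 4	/* hypothetical small limit for illustration */

struct e820_entry_sketch {
	uint64_t addr, size;
	uint32_t type;
};

struct e820_table_sketch {
	unsigned int nr;
	struct e820_entry_sketch map[SKETCH_E820_MAX];
};

/* Append one entry; report a full table as -ENOSPC instead of a bare 1. */
static int add_entry(struct e820_table_sketch *t,
		     const struct e820_entry_sketch *e)
{
	if (t->nr >= SKETCH_E820_MAX)
		return -ENOSPC;
	t->map[t->nr++] = *e;
	return 0;
}

/* Walk-style callback that forwards the error instead of dropping it. */
static int entry_callback(uint64_t start, uint64_t end, void *arg)
{
	struct e820_table_sketch *t = arg;
	struct e820_entry_sketch e = {
		.addr = start,
		.size = end - start + 1,
		.type = 1,	/* E820_RAM's value in the kernel headers */
	};

	return add_entry(t, &e);
}
```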

> +
> +	memcpy(&params->e820_map[nr_e820_entries], entry,
> +			sizeof(struct e820entry));
> +	params->e820_entries++;
> +	return 0;
> +}
> +
> +static int memmap_entry_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_memmap_data *cmd = arg;
> +	struct boot_params *params = cmd->params;
> +	struct e820entry ei;
> +
> +	ei.addr = start;
> +	ei.size = end - start + 1;
> +	ei.type = cmd->type;
> +	add_e820_entry(params, &ei);
> +
> +	return 0;
> +}
> +
> +static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
> +		unsigned long long mstart, unsigned long long mend)

Arg alignment... multiple occurrences in this patch.

> +{
> +	unsigned long start, end;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude Backup region */
> +	start = image->arch.backup_load_addr;
> +	end = start + image->arch.backup_src_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude elf header region */
> +	start = image->arch.elf_load_addr;
> +	end = start + image->arch.elf_headers_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	return ret;

	return exclude_mem_range(cmem, start, end);

> +}
> +
> +/* Prepare memory map for crash dump kernel */
> +int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> +{
> +	int i, ret = 0;
> +	unsigned long flags;
> +	struct e820entry ei;
> +	struct crash_memmap_data cmd;
> +	struct crash_mem *cmem;
> +
> +	cmem = vzalloc(sizeof(struct crash_mem));
> +	if (!cmem)
> +		return -ENOMEM;

You're already getting zeroed memory, but you zero it out again in
memmap_exclude_ranges() above.

> +
> +	memset(&cmd, 0, sizeof(struct crash_memmap_data));
> +	cmd.params = params;
> +
> +	/* Add first 640K segment */
> +	ei.addr = image->arch.backup_src_start;
> +	ei.size = image->arch.backup_src_sz;
> +	ei.type = E820_RAM;
> +	add_e820_entry(params, &ei);
> +
> +	/* Add ACPI tables */
> +	cmd.type = E820_ACPI;
> +	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> +	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
> +
> +	/* Add ACPI Non-volatile Storage */
> +	cmd.type = E820_NVS;
> +	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
> +			memmap_entry_callback);
> +
> +	/* Add crashk_low_res region */
> +	if (crashk_low_res.end) {
> +		ei.addr = crashk_low_res.start;
> +		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
> +		ei.type = E820_RAM;
> +		add_e820_entry(params, &ei);
> +	}
> +
> +	/* Exclude some ranges from crashk_res and add rest to memmap */
> +	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
> +						crashk_res.end);
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		ei.addr = cmem->ranges[i].start;
> +		ei.size = cmem->ranges[i].end - ei.addr + 1;
> +		ei.type = E820_RAM;
> +
> +		/* If entry is less than a page, skip it */
> +		if (ei.size < PAGE_SIZE)
> +			continue;

You can do the size assignment and check first so that you don't have to
do the rest if it is less than a page.
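Here is a sketch of the reordering in userspace, with hypothetical types and an assumed 4 KiB page size. It computes the size first and skips sub-page ranges before filling in the rest of the entry.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumes 4 KiB pages as on x86 */

struct mm_span { uint64_t start, end; };	/* inclusive */
struct mm_entry { uint64_t addr, size; };

/*
 * Convert inclusive spans to (addr, size) entries. The size is computed
 * and checked first, so sub-page spans are skipped before anything else
 * is filled in. Returns the number of entries written to out[].
 */
static unsigned int spans_to_entries(const struct mm_span *s, unsigned int n,
				     struct mm_entry *out)
{
	unsigned int nr = 0;

	for (unsigned int i = 0; i < n; i++) {
		uint64_t size = s[i].end - s[i].start + 1;

		if (size < SKETCH_PAGE_SIZE)
			continue;
		out[nr].addr = s[i].start;
		out[nr].size = size;
		nr++;
	}
	return nr;
}
```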

> +		add_e820_entry(params, &ei);
> +	}
> +
> +out:
> +	vfree(cmem);
> +	return ret;

This retval is not checked at the callsite in
kexec_setup_boot_parameters().

> +}
> +
> +static int determine_backup_region(u64 start, u64 end, void *arg)
> +{
> +	struct kimage *image = arg;
> +
> +	image->arch.backup_src_start = start;
> +	image->arch.backup_src_sz = end - start + 1;
> +
> +	/* Expecting only one range for backup region */
> +	return 1;
> +}
> +
> +int load_crashdump_segments(struct kimage *image)
> +{
> +	unsigned long src_start, src_sz, elf_sz;
> +	void *elf_addr;
> +	int ret;
> +
> +	/*
> +	 * Determine and load a segment for backup area. First 640K RAM
> +	 * region is backup source
> +	 */
> +
> +	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> +				image, determine_backup_region);
> +
> +	/* Zero or postive return values are ok */
> +	if (ret < 0)
> +		return ret;
> +
> +	src_start = image->arch.backup_src_start;
> +	src_sz = image->arch.backup_src_sz;
> +
> +	/* Add backup segment. */
> +	if (src_sz) {
> +		/*
> +		 * Ideally there is no source for backup segment. This is
> +		 * copied in purgatory after crash. Just add a zero filled
> +		 * segment for now to make sure checksum logic works fine.
> +		 */
> +		ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
> +				       sizeof(crash_zero_bytes), src_sz,
> +				       PAGE_SIZE, 0, -1, 0,
> +				       &image->arch.backup_load_addr);
> +		if (ret)
> +			return ret;
> +		pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
> +			 image->arch.backup_load_addr, src_start, src_sz);
> +	}
> +
> +	/* Prepare elf headers and add a segment */
> +	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
> +	if (ret)
> +		return ret;
> +
> +	image->arch.elf_headers = elf_addr;
> +	image->arch.elf_headers_sz = elf_sz;
> +
> +	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
> +			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
> +			&image->arch.elf_load_addr);
> +	if (ret) {
> +		vfree((void *)image->arch.elf_headers);
> +		return ret;
> +	}
> +	pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 image->arch.elf_load_addr, elf_sz, elf_sz);
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> index 0750784..8e762d3 100644
> --- a/arch/x86/kernel/kexec-bzimage.c
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -18,6 +18,9 @@
>  
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
> +#include <asm/crash.h>
> +
> +#define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
>  
>  /*
>   * Defines lowest physical address for various segments. Not sure where
> @@ -130,11 +133,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	/*
> +	 * In case of crash dump, we will append elfcorehdr=<addr> to
> +	 * command line. Make sure it does not overflow
> +	 */
> +	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> +		ret = -EINVAL;

No need to assign anything to ret if you return ERR_PTR below.

> +		pr_debug("Kernel command line too long\n");

This error message needs to differ from the one above - say something
about "error appending elfcorehdr=...", for example.
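Here is a userspace sketch of a bounds-checked append whose error message is specific to the elfcorehdr= case. It assumes the command line is NUL-terminated and uses snprintf() with `%llx`. The 30-byte budget mirrors MAX_ELFCOREHDR_STR_LEN: 14 characters of " elfcorehdr=0x" plus at most 16 hex digits, with one more byte for the NUL. The function name is hypothetical, and fprintf() stands in for the kernel's pr_* helpers.

```c
#include <stdio.h>
#include <string.h>

/* " elfcorehdr=0x" (14 chars) + up to 16 hex digits for a 64-bit address */
#define SKETCH_ELFCOREHDR_STR_LEN 30

/*
 * Append " elfcorehdr=0x<addr>" to a NUL-terminated command line stored in
 * a buffer of buf_size bytes. Returns the new string length, or -1 with the
 * buffer untouched if the result would not fit.
 */
static int append_elfcorehdr(char *cmdline, size_t buf_size,
			     unsigned long long addr)
{
	size_t len = strlen(cmdline);
	int n;

	if (len + SKETCH_ELFCOREHDR_STR_LEN + 1 > buf_size) {
		fprintf(stderr, "No room to append elfcorehdr= to command line\n");
		return -1;
	}
	n = snprintf(cmdline + len, buf_size - len, " elfcorehdr=0x%llx", addr);
	return (int)len + n;
}
```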

> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	/* Allocate loader specific data */
>  	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
>  	if (!ldata)
>  		return ERR_PTR(-ENOMEM);
>  
> +	/* Allocate and load backup region */
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		ret = load_crashdump_segments(image);
> +		if (ret)
> +			goto out_free_loader_data;
> +	}
> +
>  	/*
>  	 * Load purgatory. For 64bit entry point, purgatory  code can be
>  	 * anywhere.
> @@ -149,7 +169,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
>  
>  	/* Load Bootparams and cmdline */
> -	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> +	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
> +				MAX_ELFCOREHDR_STR_LEN;
>  	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
>  	if (!params) {
>  		ret = -ENOMEM;
> @@ -201,7 +222,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  			goto out_free_params;
>  	}
>  
> -	ret = kexec_setup_cmdline(params, bootparam_load_addr,
> +	ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
>  				  sizeof(struct boot_params), cmdline,
>  				  cmdline_len);
>  	if (ret)
> @@ -233,7 +254,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	if (ret)
>  		goto out_free_params;
>  
> -	ret = kexec_setup_boot_parameters(params);
> +	ret = kexec_setup_boot_parameters(image, params);
>  	if (ret)
>  		goto out_free_params;
>  
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> index 7de3239..6a3821b 100644
> --- a/arch/x86/kernel/machine_kexec.c
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -10,9 +10,11 @@
>   */
>  
>  #include <linux/kernel.h>
> +#include <linux/kexec.h>
>  #include <linux/string.h>
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
> +#include <asm/crash.h>
>  
>  /*
>   * Common code for x86 and x86_64 used for kexec.
> @@ -36,18 +38,24 @@ int kexec_setup_initrd(struct boot_params *params,
>  	return 0;
>  }
>  
> -int kexec_setup_cmdline(struct boot_params *params,
> +int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
>  		unsigned long bootparams_load_addr,
>  		unsigned long cmdline_offset, char *cmdline,
>  		unsigned long cmdline_len)
>  {
>  	char *cmdline_ptr = ((char *)params) + cmdline_offset;
> -	unsigned long cmdline_ptr_phys;
> +	unsigned long cmdline_ptr_phys, len;
>  	uint32_t cmdline_low_32, cmdline_ext_32;
>  
>  	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		len = sprintf(cmdline_ptr + cmdline_len - 1,
> +			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
> +		cmdline_len += len;
> +	}
>  	cmdline_ptr[cmdline_len - 1] = '\0';
>  
> +	pr_debug("Final command line is: %s\n", cmdline_ptr);
>  	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
>  	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
>  	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> @@ -75,7 +83,8 @@ static int setup_memory_map_entries(struct boot_params *params)
>  	return 0;
>  }
>  
> -int kexec_setup_boot_parameters(struct boot_params *params)
> +int kexec_setup_boot_parameters(struct kimage *image,
> +				struct boot_params *params)
>  {
>  	unsigned int nr_e820_entries;
>  	unsigned long long mem_k, start, end;
> @@ -102,7 +111,11 @@ int kexec_setup_boot_parameters(struct boot_params *params)
>  	/* Default sysdesc table */
>  	params->sys_desc_table.length = 0;
>  
> -	setup_memory_map_entries(params);
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		crash_setup_memmap_entries(image, params);
> +	else
> +		setup_memory_map_entries(params);
> +
>  	nr_e820_entries = params->e820_entries;
>  
>  	for (i = 0; i < nr_e820_entries; i++) {
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index a66fae3..07e4b60 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -179,6 +179,38 @@ static void load_segments(void)
>  		);
>  }
>  
> +/* Update purgatory as needed after various image segments have been prepared */
> +static int arch_update_purgatory(struct kimage *image)
> +{
> +	int ret = 0;
> +
> +	if (!image->file_mode)
> +		return 0;
> +
> +	/* Setup copying of backup region */
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
> +				&image->arch.backup_load_addr,
> +				sizeof(image->arch.backup_load_addr), 0);
> +		if (ret)
> +			return ret;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_src",
> +				&image->arch.backup_src_start,
> +				sizeof(image->arch.backup_src_start), 0);
> +		if (ret)
> +			return ret;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
> +				&image->arch.backup_src_sz,
> +				sizeof(image->arch.backup_src_sz), 0);

Arg alignment is funny.

> +		if (ret)
> +			return ret;
> +	}
> +
> +	return ret;
> +}
> +
>  int machine_kexec_prepare(struct kimage *image)
>  {
>  	unsigned long start_pgtable;
-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [PATCH 12/13] kexec: Support for Kexec on panic using new system call
@ 2014-06-17 21:43     ` Borislav Petkov
  0 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-17 21:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: mjg59, bhe, jkosina, greg, kexec, linux-kernel, ebiederm, hpa,
	akpm, dyoung, chaowang

On Tue, Jun 03, 2014 at 09:07:01AM -0400, Vivek Goyal wrote:
> This patch adds support for loading a kexec on panic (kdump) kernel usning
> new system call. Right now this primarily works with bzImage loader only.
> But changes to ELF loader should be minimal as all the core infrastrcture
> is there.
> 
> Only thing preventing making ELF load in crash reseved memory is
> that kernel vmlinux is of type ET_EXEC and it expects to be loaded at
> address it has been compiled for. At that location current kernel is
> already running. One first needs to make vmlinux fully relocatable
> and export it is type ET_DYN and then modify this ELF loader to support
> images of type ET_DYN.
> 
> I am leaving it as a future TODO item.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/crash.h       |   9 +
>  arch/x86/include/asm/kexec.h       |  25 +-
>  arch/x86/kernel/crash.c            | 581 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/kexec-bzimage.c    |  27 +-
>  arch/x86/kernel/machine_kexec.c    |  21 +-
>  arch/x86/kernel/machine_kexec_64.c |  40 +++
>  kernel/kexec.c                     |  83 +++++-
>  7 files changed, 770 insertions(+), 16 deletions(-)
>  create mode 100644 arch/x86/include/asm/crash.h
> 
> diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
> new file mode 100644
> index 0000000..2dd2eb8
> --- /dev/null
> +++ b/arch/x86/include/asm/crash.h
> @@ -0,0 +1,9 @@
> +#ifndef _ASM_X86_CRASH_H
> +#define _ASM_X86_CRASH_H
> +
> +int load_crashdump_segments(struct kimage *image);

I'd name it crash_load_segments(..), as you're prefixing the other exported
functions with "crash_".

> +int crash_copy_backup_region(struct kimage *image);
> +int crash_setup_memmap_entries(struct kimage *image,
> +		struct boot_params *params);
> +
> +#endif /* _ASM_X86_CRASH_H */

...

> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 507de80..b6a0974 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -4,6 +4,9 @@
>   * Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
>   *
>   * Copyright (C) IBM Corporation, 2004. All rights reserved.
> + * Copyright (C) Red Hat Inc., 2014. All rights reserved.
> + * Authors:
> + *      Vivek Goyal <vgoyal@redhat.com>
>   *
>   */
>  
> @@ -16,6 +19,7 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/module.h>
> +#include <linux/slab.h>
>  
>  #include <asm/processor.h>
>  #include <asm/hardirq.h>
> @@ -28,6 +32,45 @@
>  #include <asm/reboot.h>
>  #include <asm/virtext.h>
>  
> +/* Alignment required for elf header segment */
> +#define ELF_CORE_HEADER_ALIGN   4096
> +
> +/* This primarily reprsents number of split ranges due to exclusion */

"represents"

> +#define CRASH_MAX_RANGES	16
> +
> +struct crash_mem_range {
> +	u64 start, end;
> +};
> +
> +struct crash_mem {
> +	unsigned int nr_ranges;
> +	struct crash_mem_range ranges[CRASH_MAX_RANGES];
> +};
> +
> +/* Misc data about ram ranges needed to prepare elf headers */
> +struct crash_elf_data {
> +	struct kimage *image;
> +	/*
> +	 * Total number of ram ranges we have after various ajustments for

"adjustments"

> +	 * GART, crash reserved region etc.
> +	 */
> +	unsigned int max_nr_ranges;
> +	unsigned long gart_start, gart_end;
> +
> +	/* Pointer to elf header */
> +	void *ehdr;
> +	/* Pointer to next phdr */
> +	void *bufp;
> +	struct crash_mem mem;
> +};
> +
> +/* Used while preparing memory map entries for second kernel */
> +struct crash_memmap_data {
> +	struct boot_params *params;
> +	/* Type of memory */
> +	unsigned int type;
> +};
> +
>  int in_crash_kexec;
>  
>  /*
> @@ -39,6 +82,7 @@ int in_crash_kexec;
>   */
>  crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
>  EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
> +unsigned long crash_zero_bytes;

Ah, that's the empty_zero_page...

>  
>  static inline void cpu_crash_vmclear_loaded_vmcss(void)
>  {
> @@ -135,3 +179,540 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  #endif
>  	crash_save_cpu(regs, safe_smp_processor_id());
>  }
> +
> +#ifdef CONFIG_X86_64
> +
> +static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> +				unsigned long nr_pfn, void *arg)
> +{
> +	int *nr_ranges = arg;
> +
> +	(*nr_ranges)++;
> +	return 0;
> +}
> +
> +static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +
> +	ced->gart_start = start;
> +	ced->gart_end = end;
> +
> +	/* Not expecting more than 1 gart aperture */
> +	return 1;
> +}
> +
> +
> +/* Gather all the required information to prepare elf headers for ram regions */
> +static int fill_up_crash_elf_data(struct crash_elf_data *ced,
> +					struct kimage *image)
> +{
> +	unsigned int nr_ranges = 0;
> +
> +	ced->image = image;
> +
> +	walk_system_ram_range(0, -1, &nr_ranges,
> +				get_nr_ram_ranges_callback);
> +
> +	ced->max_nr_ranges = nr_ranges;
> +
> +	/*
> +	 * We don't create ELF headers for GART aperture as an attempt
> +	 * to dump this memory in second kernel leads to hang/crash.
> +	 * If gart aperture is present, one needs to exclude that region
> +	 * and that could lead to need of extra phdr.
> +	 */
> +	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
> +				ced, get_gart_ranges_callback);
> +
> +	/*
> +	 * If we have gart region, excluding that could potentially split
> +	 * a memory range, resulting in extra header. Account for  that.
> +	 */
> +	if (ced->gart_end)
> +		ced->max_nr_ranges++;
> +
> +	/* Exclusion of crash region could split memory ranges */
> +	ced->max_nr_ranges++;
> +
> +	/* If crashk_low_res is there, another range split possible */

You mean "is not 0"?

> +	if (crashk_low_res.end != 0)
> +		ced->max_nr_ranges++;
> +
> +	return 0;

Returns unconditional 0 - make function void then.
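
For illustration only (not part of the posted patch): the accounting in fill_up_crash_elf_data() cannot fail, so it can be written as a helper that simply returns the bound. This is a userspace approximation. The boolean flags stand in for the ced->gart_end and crashk_low_res.end checks, and max_ram_phdrs() is a hypothetical name. Each excluded region can split at most one RAM range in two, so each exclusion adds at most one header.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: upper bound on PT_LOAD headers needed for RAM.
 * Each excluded region (GART, crashk_res, crashk_low_res) can split at
 * most one RAM range in two, adding at most one header per exclusion.
 */
static unsigned int max_ram_phdrs(unsigned int nr_ram_ranges, bool have_gart,
				  bool have_crashk_low)
{
	unsigned int n = nr_ram_ranges;

	if (have_gart)		/* stands in for ced->gart_end != 0 */
		n++;
	n++;			/* crashk_res is always excluded */
	if (have_crashk_low)	/* stands in for crashk_low_res.end != 0 */
		n++;
	return n;
}
```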

> +}
> +
> +static int exclude_mem_range(struct crash_mem *mem,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	int i, j;
> +	unsigned long long start, end;
> +	struct crash_mem_range temp_range = {0, 0};
> +
> +	for (i = 0; i < mem->nr_ranges; i++) {
> +		start = mem->ranges[i].start;
> +		end = mem->ranges[i].end;
> +
> +		if (mstart > end || mend < start)
> +			continue;
> +
> +		/* Truncate any area outside of range */
> +		if (mstart < start)
> +			mstart = start;
> +		if (mend > end)
> +			mend = end;
> +
> +		/* Found completely overlapping range */
> +		if (mstart == start && mend == end) {
> +			mem->ranges[i].start = 0;
> +			mem->ranges[i].end = 0;
> +			if (i < mem->nr_ranges - 1) {
> +				/* Shift rest of the ranges to left */
> +				for (j = i; j < mem->nr_ranges - 1; j++) {
> +					mem->ranges[j].start =
> +						mem->ranges[j+1].start;
> +					mem->ranges[j].end =
> +							mem->ranges[j+1].end;
> +				}
> +			}
> +			mem->nr_ranges--;
> +			return 0;
> +		}
> +
> +		if (mstart > start && mend < end) {
> +			/* Split original range */
> +			mem->ranges[i].end = mstart - 1;
> +			temp_range.start = mend + 1;
> +			temp_range.end = end;
> +		} else if (mstart != start)
> +			mem->ranges[i].end = mstart - 1;
> +		else
> +			mem->ranges[i].start = mend + 1;
> +		break;
> +	}
> +
> +	/* If a split happend, add the split in array */

"happened" ... "split to array"

> +	if (!temp_range.end)
> +		return 0;
> +
> +	/* Split happened */
> +	if (i == CRASH_MAX_RANGES - 1) {
> +		pr_err("Too many crash ranges after split\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* Location where new range should go */
> +	j = i + 1;
> +	if (j < mem->nr_ranges) {
> +		/* Move over all ranges one place */

			...  all ranges one slot towards the end */
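
A standalone userspace sketch of the range-exclusion logic, folding in the comments above. It keeps the patch's inclusive [start, end] bounds and uses uint64_t in place of u64. One deliberate difference from the posted code: the capacity check is made against nr_ranges rather than the split index i, because the shift loop writes to ranges[nr_ranges]. That is this sketch's reading of the intended bound, not something stated in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define CRASH_MAX_RANGES 16

struct crash_mem_range {
	uint64_t start, end;	/* inclusive bounds */
};

struct crash_mem {
	unsigned int nr_ranges;
	struct crash_mem_range ranges[CRASH_MAX_RANGES];
};

/* Remove [mstart, mend] from the first range it overlaps */
static int exclude_mem_range(struct crash_mem *mem, uint64_t mstart, uint64_t mend)
{
	unsigned int i, j;

	assert(mem->nr_ranges <= CRASH_MAX_RANGES);
	for (i = 0; i < mem->nr_ranges; i++) {
		uint64_t start = mem->ranges[i].start;
		uint64_t end = mem->ranges[i].end;

		if (mstart > end || mend < start)
			continue;		/* no overlap with this range */

		/* Clip the excluded region to this range */
		if (mstart < start)
			mstart = start;
		if (mend > end)
			mend = end;

		if (mstart == start && mend == end) {
			/* Whole range excluded: shift the tail one slot left */
			for (j = i; j + 1 < mem->nr_ranges; j++)
				mem->ranges[j] = mem->ranges[j + 1];
			mem->nr_ranges--;
			return 0;
		}

		if (mstart > start && mend < end) {
			/* Hole in the middle: split into two ranges */
			if (mem->nr_ranges == CRASH_MAX_RANGES)
				return -ENOMEM;
			/* Move all later ranges one slot towards the end */
			for (j = mem->nr_ranges; j > i + 1; j--)
				mem->ranges[j] = mem->ranges[j - 1];
			mem->ranges[i].end = mstart - 1;
			mem->ranges[i + 1].start = mend + 1;
			mem->ranges[i + 1].end = end;
			mem->nr_ranges++;
		} else if (mstart != start) {
			mem->ranges[i].end = mstart - 1;	/* trim the tail */
		} else {
			mem->ranges[i].start = mend + 1;	/* trim the head */
		}
		return 0;
	}
	return 0;
}
```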

> +		for (i = mem->nr_ranges - 1; i >= j; i--)
> +			mem->ranges[i + 1] = mem->ranges[i];
> +	}
> +
> +	mem->ranges[j].start = temp_range.start;
> +	mem->ranges[j].end = temp_range.end;
> +	mem->nr_ranges++;
> +	return 0;
> +}
> +
> +/*
> + * Look for any unwanted ranges between mstart, mend and remove them. This
> + * might lead to split and split ranges are put in ced->mem.ranges[] array
> + */
> +static int elf_header_exclude_ranges(struct crash_elf_data *ced,
> +		unsigned long long mstart, unsigned long long mend)
> +{
> +	struct crash_mem *cmem = &ced->mem;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude crashkernel region */
> +	ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
> +	if (ret)
> +		return ret;
> +
> +	ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude GART region */
> +	if (ced->gart_end) {
> +		ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_elf_data *ced = arg;
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long mstart, mend;
> +	struct kimage *image = ced->image;
> +	struct crash_mem *cmem;
> +	int ret, i;
> +
> +	ehdr = ced->ehdr;
> +
> +	/* Exclude unwanted mem ranges */
> +	ret = elf_header_exclude_ranges(ced, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
> +	cmem = &ced->mem;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		mstart = cmem->ranges[i].start;
> +		mend = cmem->ranges[i].end;
> +
> +		phdr = ced->bufp;
> +		ced->bufp += sizeof(Elf64_Phdr);
> +
> +		phdr->p_type = PT_LOAD;
> +		phdr->p_flags = PF_R|PF_W|PF_X;
> +		phdr->p_offset  = mstart;
> +
> +		/*
> +		 * If a range matches backup region, adjust offset to backup
> +		 * segment.
> +		 */
> +		if (mstart == image->arch.backup_src_start &&
> +		    (mend - mstart + 1) == image->arch.backup_src_sz)
> +			phdr->p_offset = image->arch.backup_load_addr;
> +
> +		phdr->p_paddr = mstart;
> +		phdr->p_vaddr = (unsigned long long) __va(mstart);
> +		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
> +		phdr->p_align = 0;
> +		ehdr->e_phnum++;
> +		pr_debug("Crash PT_LOAD elf header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
> +			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
> +			ehdr->e_phnum, phdr->p_offset);
> +	}
> +
> +	return ret;
> +}
> +
> +static int prepare_elf64_headers(struct crash_elf_data *ced,
> +		void **addr, unsigned long *sz)
> +{
> +	Elf64_Ehdr *ehdr;
> +	Elf64_Phdr *phdr;
> +	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
> +	unsigned char *buf, *bufp;
> +	unsigned int cpu;
> +	unsigned long long notes_addr;
> +	int ret;
> +
> +	/* extra phdr for vmcoreinfo elf note */
> +	nr_phdr = nr_cpus + 1;
> +	nr_phdr += ced->max_nr_ranges;
> +
> +	/*
> +	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
> +	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
> +	 * I think this is required by tools like gdb. So same physical
> +	 * memory will be mapped in two elf headers. One will contain kernel
> +	 * text virtual addresses and other will have __va(physical) addresses.
> +	 */
> +
> +	nr_phdr++;
> +	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
> +	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
> +
> +	buf = vzalloc(elf_sz);

Since you get zeroed memory, you can save yourself all assignments to 0
below and thus slim this already terse function.
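
A userspace sketch of the simplification suggested above, with calloc() standing in for vzalloc() and glibc's <elf.h> supplying the ELF types and constants. Only fields with non-zero values are assigned; e_entry, e_shoff, e_flags, e_phnum and the section-header fields stay 0 from the allocation, and the EI_PAD memset goes away. It assumes EM_X86_64 for ELF_ARCH and that ELF_OSABI is ELFOSABI_NONE (0) on x86, so that assignment is dropped as well. alloc_core_ehdr() is a hypothetical name.

```c
#include <assert.h>
#include <elf.h>
#include <stdlib.h>
#include <string.h>

#define ELF_CORE_HEADER_ALIGN 4096

/* calloc() stands in for vzalloc(): the buffer starts out all-zero */
static Elf64_Ehdr *alloc_core_ehdr(size_t nr_phdr, size_t *sz)
{
	size_t elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
	Elf64_Ehdr *ehdr;

	/* Round up to the header alignment, like ALIGN() in the patch */
	elf_sz = (elf_sz + ELF_CORE_HEADER_ALIGN - 1) &
		 ~(size_t)(ELF_CORE_HEADER_ALIGN - 1);
	ehdr = calloc(1, elf_sz);
	if (!ehdr)
		return NULL;

	/* Only non-zero fields are assigned; the rest are already 0 */
	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
	ehdr->e_type = ET_CORE;
	ehdr->e_machine = EM_X86_64;
	ehdr->e_version = EV_CURRENT;
	ehdr->e_phoff = sizeof(Elf64_Ehdr);
	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
	ehdr->e_phentsize = sizeof(Elf64_Phdr);

	*sz = elf_sz;
	return ehdr;
}
```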

> +	if (!buf)
> +		return -ENOMEM;
> +
> +	bufp = buf;
> +	ehdr = (Elf64_Ehdr *)bufp;
> +	bufp += sizeof(Elf64_Ehdr);
> +	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
> +	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
> +	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
> +	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
> +	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
> +	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
> +	ehdr->e_type = ET_CORE;
> +	ehdr->e_machine = ELF_ARCH;
> +	ehdr->e_version = EV_CURRENT;
> +	ehdr->e_entry = 0;
> +	ehdr->e_phoff = sizeof(Elf64_Ehdr);
> +	ehdr->e_shoff = 0;
> +	ehdr->e_flags = 0;
> +	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
> +	ehdr->e_phentsize = sizeof(Elf64_Phdr);
> +	ehdr->e_phnum = 0;
> +	ehdr->e_shentsize = 0;
> +	ehdr->e_shnum = 0;
> +	ehdr->e_shstrndx = 0;
> +
> +	/* Prepare one phdr of type PT_NOTE for each present cpu */
> +	for_each_present_cpu(cpu) {
> +		phdr = (Elf64_Phdr *)bufp;
> +		bufp += sizeof(Elf64_Phdr);
> +		phdr->p_type = PT_NOTE;
> +		phdr->p_flags = 0;
> +		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
> +		phdr->p_offset = phdr->p_paddr = notes_addr;
> +		phdr->p_vaddr = 0;
> +		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
> +		phdr->p_align = 0;
> +		(ehdr->e_phnum)++;
> +	}
> +
> +	/* Prepare one PT_NOTE header for vmcoreinfo */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_NOTE;
> +	phdr->p_flags = 0;
> +	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> +	phdr->p_vaddr = 0;
> +	phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +
> +#ifdef CONFIG_X86_64
> +	/* Prepare PT_LOAD type program header for kernel text region */
> +	phdr = (Elf64_Phdr *)bufp;
> +	bufp += sizeof(Elf64_Phdr);
> +	phdr->p_type = PT_LOAD;
> +	phdr->p_flags = PF_R|PF_W|PF_X;
> +	phdr->p_vaddr = (Elf64_Addr)_text;
> +	phdr->p_filesz = phdr->p_memsz = _end - _text;
> +	phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
> +	phdr->p_align = 0;
> +	(ehdr->e_phnum)++;
> +#endif
> +
> +	/* Prepare PT_LOAD headers for system ram chunks. */
> +	ced->ehdr = ehdr;
> +	ced->bufp = bufp;
> +	ret = walk_system_ram_res(0, -1, ced,
> +			prepare_elf64_ram_headers_callback);
> +	if (ret < 0)
> +		return ret;
> +
> +	*addr = buf;
> +	*sz = elf_sz;
> +	return 0;
> +}
> +
> +/* Prepare elf headers. Return addr and size */
> +static int prepare_elf_headers(struct kimage *image, void **addr,
> +					unsigned long *sz)
> +{
> +	struct crash_elf_data *ced;
> +	int ret;
> +
> +	ced = kzalloc(sizeof(*ced), GFP_KERNEL);
> +	if (!ced)
> +		return -ENOMEM;
> +
> +	ret = fill_up_crash_elf_data(ced, image);
> +	if (ret)
> +		goto out;
> +
> +	/* By default prepare 64bit headers */
> +	ret =  prepare_elf64_headers(ced, addr, sz);
> +out:
> +	kfree(ced);
> +	return ret;
> +}
> +
> +static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
> +{
> +	unsigned int nr_e820_entries;
> +
> +	nr_e820_entries = params->e820_entries;
> +	if (nr_e820_entries >= E820MAX)
> +		return 1;

You're not testing for the error condition in any call site. Are we sure
we will never hit E820MAX?
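
One way to address this, sketched in userspace: return a real errno from the append helper and let callers propagate it. The struct below is a hypothetical cut-down stand-in for struct boot_params holding only the e820 fields. E820MAX is taken as 128 (the zero-page table limit), and -ENOSPC is an assumed choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define E820MAX 128	/* assumed zero-page e820 table limit */

struct e820entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* Hypothetical stand-in for struct boot_params: only the e820 fields */
struct boot_params_e820 {
	uint8_t e820_entries;
	struct e820entry e820_map[E820MAX];
};

/* Append one entry; report a full table instead of silently dropping it */
static int add_e820_entry(struct boot_params_e820 *params,
			  const struct e820entry *entry)
{
	if (params->e820_entries >= E820MAX)
		return -ENOSPC;
	params->e820_map[params->e820_entries++] = *entry;
	return 0;
}
```

memmap_entry_callback() could then return this value, so that the resource walk stops early in the same way the patch's GART and backup-region callbacks stop it by returning non-zero.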

> +
> +	memcpy(&params->e820_map[nr_e820_entries], entry,
> +			sizeof(struct e820entry));
> +	params->e820_entries++;
> +	return 0;
> +}
> +
> +static int memmap_entry_callback(u64 start, u64 end, void *arg)
> +{
> +	struct crash_memmap_data *cmd = arg;
> +	struct boot_params *params = cmd->params;
> +	struct e820entry ei;
> +
> +	ei.addr = start;
> +	ei.size = end - start + 1;
> +	ei.type = cmd->type;
> +	add_e820_entry(params, &ei);
> +
> +	return 0;
> +}
> +
> +static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
> +		unsigned long long mstart, unsigned long long mend)

Arg alignment... multiple occurrences in this patch.

> +{
> +	unsigned long start, end;
> +	int ret = 0;
> +
> +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> +
> +	cmem->ranges[0].start = mstart;
> +	cmem->ranges[0].end = mend;
> +	cmem->nr_ranges = 1;
> +
> +	/* Exclude Backup region */
> +	start = image->arch.backup_load_addr;
> +	end = start + image->arch.backup_src_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/* Exclude elf header region */
> +	start = image->arch.elf_load_addr;
> +	end = start + image->arch.elf_headers_sz - 1;
> +	ret = exclude_mem_range(cmem, start, end);
> +	return ret;

	return exclude_mem_range(cmem, start, end);

> +}
> +
> +/* Prepare memory map for crash dump kernel */
> +int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> +{
> +	int i, ret = 0;
> +	unsigned long flags;
> +	struct e820entry ei;
> +	struct crash_memmap_data cmd;
> +	struct crash_mem *cmem;
> +
> +	cmem = vzalloc(sizeof(struct crash_mem));
> +	if (!cmem)
> +		return -ENOMEM;

You're getting zeroed memory already but you're zeroing it out again
above in memmap_exclude_ranges().

> +
> +	memset(&cmd, 0, sizeof(struct crash_memmap_data));
> +	cmd.params = params;
> +
> +	/* Add first 640K segment */
> +	ei.addr = image->arch.backup_src_start;
> +	ei.size = image->arch.backup_src_sz;
> +	ei.type = E820_RAM;
> +	add_e820_entry(params, &ei);
> +
> +	/* Add ACPI tables */
> +	cmd.type = E820_ACPI;
> +	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> +	walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
> +
> +	/* Add ACPI Non-volatile Storage */
> +	cmd.type = E820_NVS;
> +	walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
> +			memmap_entry_callback);
> +
> +	/* Add crashk_low_res region */
> +	if (crashk_low_res.end) {
> +		ei.addr = crashk_low_res.start;
> +		ei.size = crashk_low_res.end - crashk_low_res.start + 1;
> +		ei.type = E820_RAM;
> +		add_e820_entry(params, &ei);
> +	}
> +
> +	/* Exclude some ranges from crashk_res and add rest to memmap */
> +	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
> +						crashk_res.end);
> +	if (ret)
> +		goto out;
> +
> +	for (i = 0; i < cmem->nr_ranges; i++) {
> +		ei.addr = cmem->ranges[i].start;
> +		ei.size = cmem->ranges[i].end - ei.addr + 1;
> +		ei.type = E820_RAM;
> +
> +		/* If entry is less than a page, skip it */
> +		if (ei.size < PAGE_SIZE)
> +			continue;

You can do the size assignment and check first so that you don't have to
do the rest if it is less than a page.
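
A sketch of that reordering, reduced to a standalone filter: the size is computed first and sub-page ranges are skipped before anything else is filled in. PAGE_SIZE is assumed to be 4096, and collect_page_sized() is a hypothetical helper. In the patch, the kept ranges would go to add_e820_entry() instead of being copied into an array.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL	/* assumed page size */

struct range {
	uint64_t start, end;	/* inclusive bounds */
};

/* Copy ranges of at least one page into out[]; returns how many were kept */
static unsigned int collect_page_sized(const struct range *in, unsigned int n,
				       struct range *out)
{
	unsigned int i, nr = 0;

	for (i = 0; i < n; i++) {
		uint64_t size = in[i].end - in[i].start + 1;

		/* Size first: skip sub-page ranges before doing any other work */
		if (size < PAGE_SIZE)
			continue;
		out[nr++] = in[i];
	}
	return nr;
}
```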

> +		add_e820_entry(params, &ei);
> +	}
> +
> +out:
> +	vfree(cmem);
> +	return ret;

This retval is not checked at the callsite in
kexec_setup_boot_parameters().

> +}
> +
> +static int determine_backup_region(u64 start, u64 end, void *arg)
> +{
> +	struct kimage *image = arg;
> +
> +	image->arch.backup_src_start = start;
> +	image->arch.backup_src_sz = end - start + 1;
> +
> +	/* Expecting only one range for backup region */
> +	return 1;
> +}
> +
> +int load_crashdump_segments(struct kimage *image)
> +{
> +	unsigned long src_start, src_sz, elf_sz;
> +	void *elf_addr;
> +	int ret;
> +
> +	/*
> +	 * Determine and load a segment for backup area. First 640K RAM
> +	 * region is backup source
> +	 */
> +
> +	ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> +				image, determine_backup_region);
> +
> +	/* Zero or postive return values are ok */
> +	if (ret < 0)
> +		return ret;
> +
> +	src_start = image->arch.backup_src_start;
> +	src_sz = image->arch.backup_src_sz;
> +
> +	/* Add backup segment. */
> +	if (src_sz) {
> +		/*
> +		 * Ideally there is no source for backup segment. This is
> +		 * copied in purgatory after crash. Just add a zero filled
> +		 * segment for now to make sure checksum logic works fine.
> +		 */
> +		ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
> +				       sizeof(crash_zero_bytes), src_sz,
> +				       PAGE_SIZE, 0, -1, 0,
> +				       &image->arch.backup_load_addr);
> +		if (ret)
> +			return ret;
> +		pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
> +			 image->arch.backup_load_addr, src_start, src_sz);
> +	}
> +
> +	/* Prepare elf headers and add a segment */
> +	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
> +	if (ret)
> +		return ret;
> +
> +	image->arch.elf_headers = elf_addr;
> +	image->arch.elf_headers_sz = elf_sz;
> +
> +	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
> +			ELF_CORE_HEADER_ALIGN, 0, -1, 0,
> +			&image->arch.elf_load_addr);
> +	if (ret) {
> +		vfree((void *)image->arch.elf_headers);
> +		return ret;
> +	}
> +	pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 image->arch.elf_load_addr, elf_sz, elf_sz);
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> index 0750784..8e762d3 100644
> --- a/arch/x86/kernel/kexec-bzimage.c
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -18,6 +18,9 @@
>  
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
> +#include <asm/crash.h>
> +
> +#define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
>  
>  /*
>   * Defines lowest physical address for various segments. Not sure where
> @@ -130,11 +133,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	/*
> +	 * In case of crash dump, we will append elfcorehdr=<addr> to
> +	 * command line. Make sure it does not overflow
> +	 */
> +	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> +		ret = -EINVAL;

No need to assign anything to ret if you return ERR_PTR below.

> +		pr_debug("Kernel command line too long\n");

This error message needs to differ from the one above - say something
about "error appending elfcorehdr=...", for example.
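
For reference, the length budget behind MAX_ELFCOREHDR_STR_LEN can be checked in userspace: " elfcorehdr=0x" is 14 characters and a 64-bit value needs at most 16 hex digits, giving 30. The sketch below appends the parameter with snprintf() and a bounds check. append_elfcorehdr() is a hypothetical helper, not the patch's kexec_setup_cmdline(), and its return convention (length including the NUL, or -1 when the result does not fit) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* " elfcorehdr=0x" (14 chars) + up to 16 hex digits = 30 */
#define MAX_ELFCOREHDR_STR_LEN 30

/*
 * Append " elfcorehdr=0x<addr>" to the NUL-terminated command line in a
 * buffer of buf_size bytes. Returns the new length including the NUL,
 * or -1 if the result would not fit (the buffer is then truncated).
 */
static long append_elfcorehdr(char *cmdline, size_t buf_size,
			      unsigned long long addr)
{
	size_t cur = strlen(cmdline);
	int n;

	assert(cur < buf_size);
	n = snprintf(cmdline + cur, buf_size - cur, " elfcorehdr=0x%llx", addr);
	if (n < 0 || (size_t)n >= buf_size - cur)
		return -1;
	return (long)(cur + n + 1);
}
```

The -1 path is where a distinct message such as "error appending elfcorehdr=..." would naturally go.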

> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	/* Allocate loader specific data */
>  	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
>  	if (!ldata)
>  		return ERR_PTR(-ENOMEM);
>  
> +	/* Allocate and load backup region */
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		ret = load_crashdump_segments(image);
> +		if (ret)
> +			goto out_free_loader_data;
> +	}
> +
>  	/*
>  	 * Load purgatory. For 64bit entry point, purgatory  code can be
>  	 * anywhere.
> @@ -149,7 +169,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
>  
>  	/* Load Bootparams and cmdline */
> -	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> +	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
> +				MAX_ELFCOREHDR_STR_LEN;
>  	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
>  	if (!params) {
>  		ret = -ENOMEM;
> @@ -201,7 +222,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  			goto out_free_params;
>  	}
>  
> -	ret = kexec_setup_cmdline(params, bootparam_load_addr,
> +	ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
>  				  sizeof(struct boot_params), cmdline,
>  				  cmdline_len);
>  	if (ret)
> @@ -233,7 +254,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	if (ret)
>  		goto out_free_params;
>  
> -	ret = kexec_setup_boot_parameters(params);
> +	ret = kexec_setup_boot_parameters(image, params);
>  	if (ret)
>  		goto out_free_params;
>  
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> index 7de3239..6a3821b 100644
> --- a/arch/x86/kernel/machine_kexec.c
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -10,9 +10,11 @@
>   */
>  
>  #include <linux/kernel.h>
> +#include <linux/kexec.h>
>  #include <linux/string.h>
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
> +#include <asm/crash.h>
>  
>  /*
>   * Common code for x86 and x86_64 used for kexec.
> @@ -36,18 +38,24 @@ int kexec_setup_initrd(struct boot_params *params,
>  	return 0;
>  }
>  
> -int kexec_setup_cmdline(struct boot_params *params,
> +int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
>  		unsigned long bootparams_load_addr,
>  		unsigned long cmdline_offset, char *cmdline,
>  		unsigned long cmdline_len)
>  {
>  	char *cmdline_ptr = ((char *)params) + cmdline_offset;
> -	unsigned long cmdline_ptr_phys;
> +	unsigned long cmdline_ptr_phys, len;
>  	uint32_t cmdline_low_32, cmdline_ext_32;
>  
>  	memcpy(cmdline_ptr, cmdline, cmdline_len);
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		len = sprintf(cmdline_ptr + cmdline_len - 1,
> +			" elfcorehdr=0x%lx", image->arch.elf_load_addr);
> +		cmdline_len += len;
> +	}
>  	cmdline_ptr[cmdline_len - 1] = '\0';
>  
> +	pr_debug("Final command line is: %s\n", cmdline_ptr);
>  	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
>  	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
>  	cmdline_ext_32 = cmdline_ptr_phys >> 32;
> @@ -75,7 +83,8 @@ static int setup_memory_map_entries(struct boot_params *params)
>  	return 0;
>  }
>  
> -int kexec_setup_boot_parameters(struct boot_params *params)
> +int kexec_setup_boot_parameters(struct kimage *image,
> +				struct boot_params *params)
>  {
>  	unsigned int nr_e820_entries;
>  	unsigned long long mem_k, start, end;
> @@ -102,7 +111,11 @@ int kexec_setup_boot_parameters(struct boot_params *params)
>  	/* Default sysdesc table */
>  	params->sys_desc_table.length = 0;
>  
> -	setup_memory_map_entries(params);
> +	if (image->type == KEXEC_TYPE_CRASH)
> +		crash_setup_memmap_entries(image, params);
> +	else
> +		setup_memory_map_entries(params);
> +
>  	nr_e820_entries = params->e820_entries;
>  
>  	for (i = 0; i < nr_e820_entries; i++) {
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index a66fae3..07e4b60 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -179,6 +179,38 @@ static void load_segments(void)
>  		);
>  }
>  
> +/* Update purgatory as needed after various image segments have been prepared */
> +static int arch_update_purgatory(struct kimage *image)
> +{
> +	int ret = 0;
> +
> +	if (!image->file_mode)
> +		return 0;
> +
> +	/* Setup copying of backup region */
> +	if (image->type == KEXEC_TYPE_CRASH) {
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
> +				&image->arch.backup_load_addr,
> +				sizeof(image->arch.backup_load_addr), 0);
> +		if (ret)
> +			return ret;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_src",
> +				&image->arch.backup_src_start,
> +				sizeof(image->arch.backup_src_start), 0);
> +		if (ret)
> +			return ret;
> +
> +		ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
> +				&image->arch.backup_src_sz,
> +				sizeof(image->arch.backup_src_sz), 0);

Arg alignment is funny.
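
For context on what these three symbols drive: per the comment in load_crashdump_segments(), the backup region is copied in purgatory after a crash. The following is a minimal userspace sketch of that copy. It assumes purgatory-side globals named as in the symbol lookups above and an LP64 target where unsigned long can hold a pointer. copy_backup_region() is an assumed name, not quoted from the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Assumed purgatory-side globals; the kernel patches their values through
 * kexec_purgatory_get_set_symbol() when the crash image is loaded.
 */
unsigned long backup_dest, backup_src, backup_sz;

/* After a crash, save the first kernel's low memory before it is reused */
static void copy_backup_region(void)
{
	if (backup_dest)
		memcpy((void *)backup_dest, (const void *)backup_src, backup_sz);
}
```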

> +		if (ret)
> +			return ret;
> +	}
> +
> +	return ret;
> +}
> +
>  int machine_kexec_prepare(struct kimage *image)
>  {
>  	unsigned long start_pgtable;
-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-17 14:24     ` Vivek Goyal
@ 2014-06-18  1:45       ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-18  1:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/17/14 at 10:24am, Vivek Goyal wrote:
> On Thu, Jun 12, 2014 at 01:42:03PM +0800, Dave Young wrote:
> > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > Hi,
> > > 
> > > This is V3 of the patchset. Previous versions were posted here.
> > > 
> > > V1: https://lkml.org/lkml/2013/11/20/540
> > > V2: https://lkml.org/lkml/2014/1/27/331
> > > 
> > > Changes since v2:
> > > 
> > > - Took care of most of the review comments from V2.
> > > - Added support for kexec/kdump on EFI systems.
> > > - Dropped support for loading ELF vmlinux.
> > > 
> > > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > > two patch cleanup series which is sitting in -tip tree here.
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot
> > > 
> > > This patch series does not do kernel signature verification yet. I plan
> > > to post another patch series for that. Now bzImage is already signed
> > > with PKCS7 signature I plan to parse and verify those signatures.
> > > 
> > > Primary goal of this patchset is to prepare groundwork so that kernel
> > > image can be signed and signatures be verified during kexec load. This
> > > should help with two things.
> > > 
> > > - It should allow kexec/kdump on secureboot enabled machines.
> > > 
> > > - In general it can help even without secureboot. By being able to verify
> > >   kernel image signature in kexec, it should help with avoiding module
> > >   signing restrictions. Matthew Garret showed how to boot into a custom
> > >   kernel, modify first kernel's memory and then jump back to old kernel and
> > >   bypass any policy one wants to.
> > > 
> > > Any feedback is welcome.
> > 
> > Hi, Vivek
> > 
> > For efi ioremapping case, in 3.15 kernel efi runtime maps will not be saved
> > if efi=old_map is used. So you need detect this and fail the kexec file load.
> 
> Dave,
> 
> Instead of failing kexec load in case of efi=old_map, I think it will be
> better to just not pass runtime map in bootparams. That way user can
> pass "noefi" on commandline and kdump should still work.  (Like it works
> with user space implementation).
> 
> What do you think?

Yes, I agree. That is better and aligns with the kexec-tools logic.

Thanks
Dave

^ permalink raw reply	[flat|nested] 214+ messages in thread

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-18  1:45       ` Dave Young
@ 2014-06-18  1:52         ` Dave Young
  -1 siblings, 0 replies; 214+ messages in thread
From: Dave Young @ 2014-06-18  1:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On 06/18/14 at 09:45am, Dave Young wrote:
> On 06/17/14 at 10:24am, Vivek Goyal wrote:
> > On Thu, Jun 12, 2014 at 01:42:03PM +0800, Dave Young wrote:
> > > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > > Hi,
> > > > 
> > > > This is V3 of the patchset. Previous versions were posted here.
> > > > 
> > > > V1: https://lkml.org/lkml/2013/11/20/540
> > > > V2: https://lkml.org/lkml/2014/1/27/331
> > > > 
> > > > Changes since v2:
> > > > 
> > > > - Took care of most of the review comments from V2.
> > > > - Added support for kexec/kdump on EFI systems.
> > > > - Dropped support for loading ELF vmlinux.
> > > > 
> > > > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > > > two patch cleanup series which is sitting in -tip tree here.
> > > > 
> > > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/boot
> > > > 
> > > > This patch series does not do kernel signature verification yet. I plan
> > > > to post another patch series for that. Now bzImage is already signed
> > > > with PKCS7 signature I plan to parse and verify those signatures.
> > > > 
> > > > Primary goal of this patchset is to prepare groundwork so that kernel
> > > > image can be signed and signatures be verified during kexec load. This
> > > > should help with two things.
> > > > 
> > > > - It should allow kexec/kdump on secureboot enabled machines.
> > > > 
> > > > - In general it can help even without secureboot. By being able to verify
> > > >   kernel image signature in kexec, it should help with avoiding module
> > > >   signing restrictions. Matthew Garret showed how to boot into a custom
> > > >   kernel, modify first kernel's memory and then jump back to old kernel and
> > > >   bypass any policy one wants to.
> > > > 
> > > > Any feedback is welcome.
> > > 
> > > Hi, Vivek
> > > 
> > > For efi ioremapping case, in 3.15 kernel efi runtime maps will not be saved
> > > if efi=old_map is used. So you need detect this and fail the kexec file load.
> > 
> > Dave,
> > 
> > Instead of failing kexec load in case of efi=old_map, I think it will be
> > better to just not pass runtime map in bootparams. That way user can
> > pass "noefi" on commandline and kdump should still work.  (Like it works
> > with user space implementation).

BTW, in the old_map case kexec-tools simply does not fill efi_info, so the 2nd kernel
automatically switches to a noefi boot. Doing the same in the kernel makes sense as well.

In the userspace tools we pass acpi_rsdp for the kdump noefi case. Do you want to add
that to the kernel loader?

> > 
> > What do you think?
> 
> Yes, I agree. That is better and aligns with the kexec-tools logic.
> 
> Thanks
> Dave

* Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading
  2014-06-18  1:52         ` Dave Young
@ 2014-06-18 12:40           ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-18 12:40 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, bp, jkosina,
	chaowang, bhe, akpm

On Wed, Jun 18, 2014 at 09:52:46AM +0800, Dave Young wrote:
[..]
> > > > Hi, Vivek
> > > > 
> > > > For the EFI ioremapping case: in the 3.15 kernel, EFI runtime maps are not
> > > > saved if efi=old_map is used, so you need to detect this and fail the kexec file load.
> > > 
> > > Dave,
> > > 
> > > Instead of failing the kexec load in the efi=old_map case, I think it would be
> > > better to just not pass the runtime map in bootparams. That way the user can
> > > pass "noefi" on the command line and kdump should still work (as it does
> > > with the user space implementation).
> 
> BTW, in the old_map case kexec-tools simply does not fill efi_info, so the 2nd
> kernel automatically switches to noefi boot. Doing the same in the kernel makes
> sense as well.

Yep. If old_map is being used, I will not fill efi_info, and I will also
not pass the EFI setup data structure. I think user space does the same.
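A minimal user-space sketch of the loader behavior described above: when the first kernel was booted with efi=old_map, efi_info is left zeroed and no EFI setup data is passed, so the second kernel falls back to a noefi boot. The structures and the function are hypothetical simplifications for illustration; they are not the real struct boot_params / struct efi_info layout or the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the efi_info part of boot_params. */
struct efi_info_sketch {
	uint32_t efi_loader_signature;
	uint32_t efi_systab;
	uint32_t efi_memmap;
	uint32_t efi_memmap_size;
};

/* Hypothetical description of the first kernel's EFI state. */
struct efi_state_sketch {
	bool efi_enabled;
	bool old_map;			/* booted with efi=old_map */
	struct efi_info_sketch info;
};

/*
 * Fill efi_info for the second kernel only when the runtime mappings
 * can be reproduced. With efi=old_map they are not saved, so efi_info
 * stays zeroed (and the caller would also skip EFI setup data); the
 * second kernel then boots as if "noefi" had been given.
 * Returns true if EFI info was passed on.
 */
static bool setup_efi_info_sketch(const struct efi_state_sketch *st,
				  struct efi_info_sketch *out)
{
	memset(out, 0, sizeof(*out));
	if (!st->efi_enabled || st->old_map)
		return false;
	*out = st->info;
	return true;
}
```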

> 
> In the userspace tools we pass acpi_rsdp for the kdump noefi case. Do you want
> to add that in the kernel loader?

I will not do that in the kernel. I will expect user space to pass
acpi_rsdp=0x on the command line.

Thanks
Vivek

* Re: [PATCH 12/13] kexec: Support for Kexec on panic using new system call
  2014-06-17 21:43     ` Borislav Petkov
@ 2014-06-18 14:20       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-18 14:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 17, 2014 at 11:43:10PM +0200, Borislav Petkov wrote:

[..]
> > diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
> > new file mode 100644
> > index 0000000..2dd2eb8
> > --- /dev/null
> > +++ b/arch/x86/include/asm/crash.h
> > @@ -0,0 +1,9 @@
> > +#ifndef _ASM_X86_CRASH_H
> > +#define _ASM_X86_CRASH_H
> > +
> > +int load_crashdump_segments(struct kimage *image);
> 
> I guess crash_load_segments(..) as you're prefixing the other exported
> functions with "crash_".

Ok, I can make that change.

[..]
> > +/* Alignment required for elf header segment */
> > +#define ELF_CORE_HEADER_ALIGN   4096
> > +
> > +/* This primarily reprsents number of split ranges due to exclusion */
> 
> "represents"

Will do.

> 
> > +#define CRASH_MAX_RANGES	16
> > +
> > +struct crash_mem_range {
> > +	u64 start, end;
> > +};
> > +
> > +struct crash_mem {
> > +	unsigned int nr_ranges;
> > +	struct crash_mem_range ranges[CRASH_MAX_RANGES];
> > +};
> > +
> > +/* Misc data about ram ranges needed to prepare elf headers */
> > +struct crash_elf_data {
> > +	struct kimage *image;
> > +	/*
> > +	 * Total number of ram ranges we have after various ajustments for
> 
> "adjustments"

Will do.

[..]
> > @@ -39,6 +82,7 @@ int in_crash_kexec;
> >   */
> >  crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
> >  EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
> > +unsigned long crash_zero_bytes;
> 
> Ah, that's the empty_zero_page...

Ok, will look into moving to empty_zero_page.

[..]
> > +static int fill_up_crash_elf_data(struct crash_elf_data *ced,
> > +					struct kimage *image)
> > +{
> > +	unsigned int nr_ranges = 0;
> > +
> > +	ced->image = image;
> > +
> > +	walk_system_ram_range(0, -1, &nr_ranges,
> > +				get_nr_ram_ranges_callback);
> > +
> > +	ced->max_nr_ranges = nr_ranges;
> > +
> > +	/*
> > +	 * We don't create ELF headers for GART aperture as an attempt
> > +	 * to dump this memory in second kernel leads to hang/crash.
> > +	 * If gart aperture is present, one needs to exclude that region
> > +	 * and that could lead to need of extra phdr.
> > +	 */
> > +	walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
> > +				ced, get_gart_ranges_callback);
> > +
> > +	/*
> > +	 * If we have gart region, excluding that could potentially split
> > +	 * a memory range, resulting in extra header. Account for  that.
> > +	 */
> > +	if (ced->gart_end)
> > +		ced->max_nr_ranges++;
> > +
> > +	/* Exclusion of crash region could split memory ranges */
> > +	ced->max_nr_ranges++;
> > +
> > +	/* If crashk_low_res is there, another range split possible */
> 
> You mean "is not 0"?

Yes. Will make comment more clear.

> 
> > +	if (crashk_low_res.end != 0)
> > +		ced->max_nr_ranges++;
> > +
> > +	return 0;
> 
> Returns unconditional 0 - make function void then.

Will do.
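As a sketch of the review suggestion (make fill_up_crash_elf_data() return void, since it cannot fail), the range-count bound can be written as below. Resource walking is replaced by plain parameters here (nr_ram_ranges, gart_end and crashk_low_end are hypothetical inputs). The reasoning follows the patch comments: excluding a region from the middle of a RAM range splits it in two, so each region that may be excluded adds at most one range.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down version of struct crash_elf_data. */
struct crash_elf_data_sketch {
	unsigned int max_nr_ranges;
};

/*
 * Upper bound on the number of RAM ranges after exclusions.
 * The computation cannot fail, so the function returns nothing.
 */
static void fill_up_crash_elf_data_sketch(struct crash_elf_data_sketch *ced,
					  unsigned int nr_ram_ranges,
					  uint64_t gart_end,
					  uint64_t crashk_low_end)
{
	ced->max_nr_ranges = nr_ram_ranges;

	/* Excluding the GART aperture may split one range. */
	if (gart_end)
		ced->max_nr_ranges++;

	/* Excluding the crashkernel region may split one range. */
	ced->max_nr_ranges++;

	/* crashk_low_res exists only if its end is non-zero; it may split another. */
	if (crashk_low_end != 0)
		ced->max_nr_ranges++;
}
```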

[..]
> > +		if (mstart > start && mend < end) {
> > +			/* Split original range */
> > +			mem->ranges[i].end = mstart - 1;
> > +			temp_range.start = mend + 1;
> > +			temp_range.end = end;
> > +		} else if (mstart != start)
> > +			mem->ranges[i].end = mstart - 1;
> > +		else
> > +			mem->ranges[i].start = mend + 1;
> > +		break;
> > +	}
> > +
> > +	/* If a split happend, add the split in array */
> 
> "happened" ... "split to array"

Ok. Will fix.


> 
> > +	if (!temp_range.end)
> > +		return 0;
> > +
> > +	/* Split happened */
> > +	if (i == CRASH_MAX_RANGES - 1) {
> > +		pr_err("Too many crash ranges after split\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/* Location where new range should go */
> > +	j = i + 1;
> > +	if (j < mem->nr_ranges) {
> > +		/* Move over all ranges one place */
> 
> 			...  all ranges one slot towards the end */
> 

Will change.
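A self-contained sketch of the exclusion step discussed above: removing a region from the middle of a range splits it in two, and the later ranges move one slot towards the end. The names are hypothetical, and unlike the quoted fragment this version checks the array's fill level (nr_ranges) before splitting. It assumes the ranges are sorted, non-overlapping and inclusive, and that the excluded region falls within a single range.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CRASH_MAX_RANGES_SKETCH	16

struct mem_range_sketch {
	uint64_t start, end;		/* inclusive */
};

struct mem_sketch {
	unsigned int nr_ranges;
	struct mem_range_sketch ranges[CRASH_MAX_RANGES_SKETCH];
};

/* Remove [mstart, mend]; returns 0, or -ENOMEM if a split has no free slot. */
static int exclude_mem_range_sketch(struct mem_sketch *mem,
				    uint64_t mstart, uint64_t mend)
{
	unsigned int i;

	for (i = 0; i < mem->nr_ranges; i++) {
		uint64_t start = mem->ranges[i].start;
		uint64_t end = mem->ranges[i].end;

		/* No overlap with this range. */
		if (mstart > end || mend < start)
			continue;

		/* Region covers the whole range: drop it, shift the rest down. */
		if (mstart <= start && mend >= end) {
			memmove(&mem->ranges[i], &mem->ranges[i + 1],
				(mem->nr_ranges - i - 1) * sizeof(mem->ranges[0]));
			mem->nr_ranges--;
			return 0;
		}

		/* Region strictly inside: split the range in two. */
		if (mstart > start && mend < end) {
			if (mem->nr_ranges == CRASH_MAX_RANGES_SKETCH) {
				fprintf(stderr, "Too many crash ranges after split\n");
				return -ENOMEM;
			}
			/* Move all later ranges one slot towards the end. */
			memmove(&mem->ranges[i + 2], &mem->ranges[i + 1],
				(mem->nr_ranges - i - 1) * sizeof(mem->ranges[0]));
			mem->ranges[i].end = mstart - 1;
			mem->ranges[i + 1].start = mend + 1;
			mem->ranges[i + 1].end = end;
			mem->nr_ranges++;
			return 0;
		}

		/* Region overlaps one edge: trim that edge. */
		if (mstart > start)
			mem->ranges[i].end = mstart - 1;
		else
			mem->ranges[i].start = mend + 1;
		return 0;
	}
	return 0;
}
```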

[..]
> > +static int prepare_elf64_headers(struct crash_elf_data *ced,
> > +		void **addr, unsigned long *sz)
> > +{
> > +	Elf64_Ehdr *ehdr;
> > +	Elf64_Phdr *phdr;
> > +	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
> > +	unsigned char *buf, *bufp;
> > +	unsigned int cpu;
> > +	unsigned long long notes_addr;
> > +	int ret;
> > +
> > +	/* extra phdr for vmcoreinfo elf note */
> > +	nr_phdr = nr_cpus + 1;
> > +	nr_phdr += ced->max_nr_ranges;
> > +
> > +	/*
> > +	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
> > +	 * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
> > +	 * I think this is required by tools like gdb. So same physical
> > +	 * memory will be mapped in two elf headers. One will contain kernel
> > +	 * text virtual addresses and other will have __va(physical) addresses.
> > +	 */
> > +
> > +	nr_phdr++;
> > +	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
> > +	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
> > +
> > +	buf = vzalloc(elf_sz);
> 
> Since you get zeroed memory, you can save yourself all assignments to 0
> below and thus slim this already terse function.

Will do.
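The header sizing in prepare_elf64_headers() comes down to simple arithmetic. It allocates one note phdr per possible CPU, one for the vmcoreinfo note, one PT_LOAD per RAM range, and one extra PT_LOAD for the x86_64 kernel text mapping, all rounded up to ELF_CORE_HEADER_ALIGN. The sketch below reproduces that computation using the glibc <elf.h> types; the function names are hypothetical.

```c
#include <elf.h>

#define ELF_CORE_HEADER_ALIGN_SKETCH	4096UL
/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Per-CPU notes + vmcoreinfo note + RAM ranges + kernel text mapping. */
static unsigned long crash_nr_phdr(unsigned long nr_cpus,
				   unsigned long max_nr_ranges)
{
	return nr_cpus + 1 + max_nr_ranges + 1;
}

/* Size of the ELF core header buffer, rounded up to the segment alignment. */
static unsigned long crash_elf_hdr_size(unsigned long nr_cpus,
					unsigned long max_nr_ranges)
{
	unsigned long sz = sizeof(Elf64_Ehdr) +
			   crash_nr_phdr(nr_cpus, max_nr_ranges) * sizeof(Elf64_Phdr);

	return ALIGN_UP(sz, ELF_CORE_HEADER_ALIGN_SKETCH);
}
```

With 4 CPUs and 10 ranges this gives 16 program headers and 64 + 16 * 56 = 960 bytes, which rounds up to a single 4096-byte page. Because vzalloc() returns zeroed memory, any header field that should be 0 can simply be left untouched.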

[..]
> > +static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
> > +{
> > +	unsigned int nr_e820_entries;
> > +
> > +	nr_e820_entries = params->e820_entries;
> > +	if (nr_e820_entries >= E820MAX)
> > +		return 1;
> 
> You're not testing for the error condition in any call site. Are we sure
> we will never hit E820MAX?

Actually, we can hit it. Right now I just pass as many e820 entries as
fit in bootparams and ignore the rest. Ideally, memory ranges beyond
E820MAX should be passed through setup data, and I have not handled that
case yet.

Very few systems should run into that scenario. I was thinking that once
these patches are in, I can look into passing more than E820MAX entries
using setup data.

I will put a TODO comment.
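One way to act on the review point without yet implementing the setup data fallback is to return an error once the boot_params table is full and have the caller count what was dropped. This is a hypothetical sketch: the _sketch types mirror only the fields needed here, E820MAX_SKETCH stands in for E820MAX (128 entries), and -ENOSPC is an assumed error code, not the patch's.

```c
#include <errno.h>
#include <stdint.h>

#define E820MAX_SKETCH	128	/* stand-in for E820MAX */

struct e820entry_sketch {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

struct boot_params_sketch {
	uint8_t e820_entries;
	struct e820entry_sketch e820_map[E820MAX_SKETCH];
};

/* Returns 0 on success, -ENOSPC once the table in boot_params is full. */
static int add_e820_entry_sketch(struct boot_params_sketch *params,
				 const struct e820entry_sketch *entry)
{
	if (params->e820_entries >= E820MAX_SKETCH)
		return -ENOSPC;
	params->e820_map[params->e820_entries++] = *entry;
	return 0;
}

/*
 * Caller side: add what fits and report how many entries were dropped.
 * TODO (per the thread): pass entries beyond E820MAX via setup data.
 */
static unsigned int add_e820_entries_sketch(struct boot_params_sketch *params,
					    const struct e820entry_sketch *entries,
					    unsigned int n)
{
	unsigned int i, dropped = 0;

	for (i = 0; i < n; i++)
		if (add_e820_entry_sketch(params, &entries[i]))
			dropped++;
	return dropped;
}
```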

> 
> > +
> > +	memcpy(&params->e820_map[nr_e820_entries], entry,
> > +			sizeof(struct e820entry));
> > +	params->e820_entries++;
> > +	return 0;
> > +}
> > +
> > +static int memmap_entry_callback(u64 start, u64 end, void *arg)
> > +{
> > +	struct crash_memmap_data *cmd = arg;
> > +	struct boot_params *params = cmd->params;
> > +	struct e820entry ei;
> > +
> > +	ei.addr = start;
> > +	ei.size = end - start + 1;
> > +	ei.type = cmd->type;
> > +	add_e820_entry(params, &ei);
> > +
> > +	return 0;
> > +}
> > +
> > +static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
> > +		unsigned long long mstart, unsigned long long mend)
> 
> Arg alignment... multiple occurrences in this patch.

Will fix.

> 
> > +{
> > +	unsigned long start, end;
> > +	int ret = 0;
> > +
> > +	memset(cmem->ranges, 0, sizeof(cmem->ranges));
> > +
> > +	cmem->ranges[0].start = mstart;
> > +	cmem->ranges[0].end = mend;
> > +	cmem->nr_ranges = 1;
> > +
> > +	/* Exclude Backup region */
> > +	start = image->arch.backup_load_addr;
> > +	end = start + image->arch.backup_src_sz - 1;
> > +	ret = exclude_mem_range(cmem, start, end);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Exclude elf header region */
> > +	start = image->arch.elf_load_addr;
> > +	end = start + image->arch.elf_headers_sz - 1;
> > +	ret = exclude_mem_range(cmem, start, end);
> > +	return ret;
> 
> 	return exclude_mem_range(cmem, start, end);

Will change.

> 
> > +}
> > +
> > +/* Prepare memory map for crash dump kernel */
> > +int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> > +{
> > +	int i, ret = 0;
> > +	unsigned long flags;
> > +	struct e820entry ei;
> > +	struct crash_memmap_data cmd;
> > +	struct crash_mem *cmem;
> > +
> > +	cmem = vzalloc(sizeof(struct crash_mem));
> > +	if (!cmem)
> > +		return -ENOMEM;
> 
> You're getting zeroed memory already but you're zeroing it out again
> above in memmap_exclude_ranges().

Will remove the extra zeroing.

[..]
> > +	/* Exclude some ranges from crashk_res and add rest to memmap */
> > +	ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
> > +						crashk_res.end);
> > +	if (ret)
> > +		goto out;
> > +
> > +	for (i = 0; i < cmem->nr_ranges; i++) {
> > +		ei.addr = cmem->ranges[i].start;
> > +		ei.size = cmem->ranges[i].end - ei.addr + 1;
> > +		ei.type = E820_RAM;
> > +
> > +		/* If entry is less than a page, skip it */
> > +		if (ei.size < PAGE_SIZE)
> > +			continue;
> 
> You can do the size assignment and check first so that you don't have to
> do the rest if it is less than a page.
> 

Ok, will do.
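The reordering suggested above, computing the size first and skipping sub-page ranges before filling in anything else, could look like this standalone sketch. The types are local stand-ins, PAGE_SIZE_SKETCH assumes 4 KiB pages, E820_RAM_SKETCH stands in for E820_RAM (1), and the output bound max_out is an assumption of the sketch.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH	4096ULL
#define E820_RAM_SKETCH		1

struct range_sketch {
	uint64_t start, end;		/* inclusive */
};

struct e820entry_sketch {
	uint64_t addr, size;
	uint32_t type;
};

/*
 * Convert ranges into RAM e820 entries. The size is computed first so
 * that sub-page ranges are skipped before any other field is filled in.
 * Returns the number of entries written to out (at most max_out).
 */
static unsigned int ranges_to_e820_sketch(const struct range_sketch *r,
					  unsigned int nr,
					  struct e820entry_sketch *out,
					  unsigned int max_out)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr && n < max_out; i++) {
		uint64_t size = r[i].end - r[i].start + 1;

		/* If the entry is less than a page, skip it. */
		if (size < PAGE_SIZE_SKETCH)
			continue;

		out[n].addr = r[i].start;
		out[n].size = size;
		out[n].type = E820_RAM_SKETCH;
		n++;
	}
	return n;
}
```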

> > +		add_e820_entry(params, &ei);
> > +	}
> > +
> > +out:
> > +	vfree(cmem);
> > +	return ret;
> 
> This retval is not checked at the callsite in
> kexec_setup_boot_parameters().

Will check return code at call site.

[..]
> >  /*
> >   * Defines lowest physical address for various segments. Not sure where
> > @@ -130,11 +133,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
> >  		return ERR_PTR(-EINVAL);
> >  	}
> >  
> > +	/*
> > +	 * In case of crash dump, we will append elfcorehdr=<addr> to
> > +	 * command line. Make sure it does not overflow
> > +	 */
> > +	if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> > +		ret = -EINVAL;
> 
> No need to assign anything to ret if you return ERR_PTR below.

Yep. Will remove it.

> 
> > +		pr_debug("Kernel command line too long\n");
> 
> This error message needs to differ from the one above - say something
> about "error appending elfcorehdr=...", for example.

Ok, Will fix it.
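A user-space sketch of the elfcorehdr= append and its up-front length check. The worst case is " elfcorehdr=0x" (14 characters including the leading space) plus 16 hex digits, i.e. 30 characters, which fits within the patch's MAX_ELFCOREHDR_STR_LEN of 30. The function name, the assumption that cmdline_size counts the terminating NUL, and the error text (kept distinct from a generic "command line too long" message, as requested in review) are illustrative only.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* " elfcorehdr=0x" (14 chars) plus up to 16 hex digits for a 64-bit address. */
#define MAX_ELFCOREHDR_STR_LEN_SKETCH	30

/*
 * Append " elfcorehdr=0x<addr>" to cmdline, whose buffer holds cmdline_size
 * bytes including the terminating NUL (an assumption of this sketch).
 * The check uses the worst-case length, so it can run before the address
 * is known.
 */
static int append_elfcorehdr_sketch(char *cmdline, size_t cmdline_size,
				    unsigned long long elf_load_addr)
{
	size_t len = strlen(cmdline);

	if (len + MAX_ELFCOREHDR_STR_LEN_SKETCH + 1 > cmdline_size) {
		fprintf(stderr, "Appending elfcorehdr=<addr> exceeds cmdline size\n");
		return -EINVAL;
	}
	snprintf(cmdline + len, cmdline_size - len, " elfcorehdr=0x%llx",
		 elf_load_addr);
	return 0;
}
```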

[..]
> > +	/* Setup copying of backup region */
> > +	if (image->type == KEXEC_TYPE_CRASH) {
> > +		ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
> > +				&image->arch.backup_load_addr,
> > +				sizeof(image->arch.backup_load_addr), 0);
> > +		if (ret)
> > +			return ret;
> > +
> > +		ret = kexec_purgatory_get_set_symbol(image, "backup_src",
> > +				&image->arch.backup_src_start,
> > +				sizeof(image->arch.backup_src_start), 0);
> > +		if (ret)
> > +			return ret;
> > +
> > +		ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
> > +				&image->arch.backup_src_sz,
> > +				sizeof(image->arch.backup_src_sz), 0);
> 
> Arg alignment is funny.

Will change.

Thanks
Vivek

* Re: [PATCH 13/13] kexec: Support kexec/kdump on EFI systems
  2014-06-03 13:07   ` Vivek Goyal
@ 2014-06-18 15:43     ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-18 15:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm

On Tue, Jun 03, 2014 at 09:07:02AM -0400, Vivek Goyal wrote:
> This patch does two things. It passes EFI runtime mappings to the second
> kernel in bootparams efi_info. The second kernel parses this info and creates
> new mappings, so the mappings in the first and second kernels will be the
> same. This paves the way to enable EFI in the kexec kernel.
> 
> This patch also prepares and passes EFI setup data through bootparams.
> This contains a bunch of information about various tables and their
> addresses.
> 
> This information gathering and passing has been written along the lines
> of what the current kexec-tools does to make kexec work with UEFI.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kexec.h       |  4 +-
>  arch/x86/kernel/kexec-bzimage.c    | 40 ++++++++++++----
>  arch/x86/kernel/machine_kexec.c    | 93 ++++++++++++++++++++++++++++++++++++--
>  drivers/firmware/efi/runtime-map.c | 21 +++++++++
>  include/linux/efi.h                | 19 ++++++++
>  5 files changed, 163 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 4cbe5f7..d8461cf 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -214,7 +214,9 @@ extern int kexec_setup_cmdline(struct kimage *image,
>  		unsigned long cmdline_offset, char *cmdline,
>  		unsigned long cmdline_len);
>  extern int kexec_setup_boot_parameters(struct kimage *image,
> -					struct boot_params *params);
> +		struct boot_params *params, unsigned long params_load_addr,
> +		unsigned int efi_map_offset, unsigned int efi_map_sz,
> +		unsigned int efi_setup_data_offset);
>  
>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> index 8e762d3..55716e1 100644
> --- a/arch/x86/kernel/kexec-bzimage.c
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -15,10 +15,12 @@
>  #include <linux/kexec.h>
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
> +#include <linux/efi.h>
>  
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
>  #include <asm/crash.h>
> +#include <asm/efi.h>
>  
>  #define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
>  
> @@ -106,7 +108,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  
>  	struct setup_header *header;
>  	int setup_sects, kern16_size, ret = 0;
> -	unsigned long setup_header_size, params_cmdline_sz;
> +	unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
>  	struct boot_params *params;
>  	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
>  	unsigned long purgatory_load_addr;
> @@ -116,6 +118,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	struct kexec_entry64_regs regs64;
>  	void *stack;
>  	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> +	unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
>  
>  	header = (struct setup_header *)(kernel + setup_hdr_offset);
>  	setup_sects = header->setup_sects;
> @@ -168,28 +171,47 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  
>  	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
>  
> -	/* Load Bootparams and cmdline */
> +
> +	/*
> +	 * Load Bootparams and cmdline and space for efi stuff.
> +	 *
> +	 * Allocate memory together for multiple data structures so
> +	 * that they all can go in single area/segment and we don't
> +	 * have to create separate segment for each. Keeps things
> +	 * little bit simple
> +	 */
> +	efi_map_sz = get_efi_runtime_map_size();
> +	efi_map_sz = ALIGN(efi_map_sz, 16);

Yeah, those and the ones below should be inside #ifdef CONFIG_EFI,
strictly speaking. I see you've added dummy functions for the case when
EFI is not enabled to save yourself the ifdeffery. Hm, ok, I guess the
ifdeffery is uglier and the couple of bytes more is simply not worth the
trouble.

:-)
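The trade-off Boris mentions, stub functions instead of #ifdef CONFIG_EFI at the call site, can be sketched as follows. The sketch also shows one plausible way of packing boot_params plus cmdline, the EFI runtime map and the EFI setup data into a single segment with 16-byte aligned offsets. CONFIG_EFI_SKETCH, the helper names and the exact layout are assumptions for illustration; the patch computes its own efi_map_offset and efi_setup_data_offset.

```c
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((size_t)(a) - 1))

/*
 * Stub pattern: with EFI support compiled out, the helper collapses to an
 * inline function returning 0, so the caller needs no #ifdef.
 */
#ifdef CONFIG_EFI_SKETCH
size_t efi_runtime_map_size_sketch(void);	/* real version defined elsewhere */
#else
static inline size_t efi_runtime_map_size_sketch(void) { return 0; }
#endif

struct bootparams_layout_sketch {
	size_t efi_map_offset;
	size_t efi_setup_data_offset;
	size_t total_size;
};

/* Params + cmdline, then the EFI map, then EFI setup data, each 16-byte aligned. */
static struct bootparams_layout_sketch
layout_bootparams_sketch(size_t params_cmdline_sz, size_t efi_setup_data_sz)
{
	struct bootparams_layout_sketch l;
	size_t efi_map_sz = ALIGN_UP(efi_runtime_map_size_sketch(), 16);

	l.efi_map_offset = ALIGN_UP(params_cmdline_sz, 16);
	l.efi_setup_data_offset = l.efi_map_offset + efi_map_sz;
	l.total_size = l.efi_setup_data_offset + ALIGN_UP(efi_setup_data_sz, 16);
	return l;
}
```

With EFI compiled out, the runtime map size is 0, so both EFI offsets collapse onto the end of the aligned params/cmdline area and the caller stays free of #ifdefs, at the cost of a few bytes of dead code in the stubbed configuration.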

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: [PATCH 13/13] kexec: Support kexec/kdump on EFI systems
@ 2014-06-18 16:06     ` Borislav Petkov
From: Borislav Petkov @ 2014-06-18 16:06 UTC
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, linux-efi

Adding linux-efi to CC for the EFI bits. Please CC it on your next
submission.

Thanks.

On Tue, Jun 03, 2014 at 09:07:02AM -0400, Vivek Goyal wrote:
> This patch does two things. It passes the EFI runtime mappings to the
> second kernel in bootparams efi_info. The second kernel parses this info
> and creates new mappings, so the mappings in the first and second kernel
> will be the same. This paves the way to enable EFI in the kexec kernel.
> 
> This patch also prepares and passes EFI setup data through bootparams.
> This contains a bunch of information about various tables and their
> addresses.
> 
> This information gathering and passing has been written along the lines
> of what the current kexec-tools does to make kexec work with UEFI.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kexec.h       |  4 +-
>  arch/x86/kernel/kexec-bzimage.c    | 40 ++++++++++++----
>  arch/x86/kernel/machine_kexec.c    | 93 ++++++++++++++++++++++++++++++++++++--
>  drivers/firmware/efi/runtime-map.c | 21 +++++++++
>  include/linux/efi.h                | 19 ++++++++
>  5 files changed, 163 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 4cbe5f7..d8461cf 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -214,7 +214,9 @@ extern int kexec_setup_cmdline(struct kimage *image,
>  		unsigned long cmdline_offset, char *cmdline,
>  		unsigned long cmdline_len);
>  extern int kexec_setup_boot_parameters(struct kimage *image,
> -					struct boot_params *params);
> +		struct boot_params *params, unsigned long params_load_addr,
> +		unsigned int efi_map_offset, unsigned int efi_map_sz,
> +		unsigned int efi_setup_data_offset);
>  
>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> index 8e762d3..55716e1 100644
> --- a/arch/x86/kernel/kexec-bzimage.c
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -15,10 +15,12 @@
>  #include <linux/kexec.h>
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
> +#include <linux/efi.h>
>  
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
>  #include <asm/crash.h>
> +#include <asm/efi.h>
>  
>  #define MAX_ELFCOREHDR_STR_LEN	30	/* elfcorehdr=0x<64bit-value> */
>  
> @@ -106,7 +108,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  
>  	struct setup_header *header;
>  	int setup_sects, kern16_size, ret = 0;
> -	unsigned long setup_header_size, params_cmdline_sz;
> +	unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
>  	struct boot_params *params;
>  	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
>  	unsigned long purgatory_load_addr;
> @@ -116,6 +118,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	struct kexec_entry64_regs regs64;
>  	void *stack;
>  	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> +	unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
>  
>  	header = (struct setup_header *)(kernel + setup_hdr_offset);
>  	setup_sects = header->setup_sects;
> @@ -168,28 +171,47 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  
>  	pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
>  
> -	/* Load Bootparams and cmdline */
> +
> +	/*
> +	 * Load Bootparams and cmdline and space for efi stuff.
> +	 *
> +	 * Allocate memory together for multiple data structures so
> +	 * that they all can go in single area/segment and we don't
> +	 * have to create separate segment for each. Keeps things
> +	 * little bit simple
> +	 */
> +	efi_map_sz = get_efi_runtime_map_size();
> +	efi_map_sz = ALIGN(efi_map_sz, 16);
> +
>  	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
>  				MAX_ELFCOREHDR_STR_LEN;
> -	params = kzalloc(params_cmdline_sz, GFP_KERNEL);
> +	params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
> +	params_misc_sz = params_cmdline_sz + efi_map_sz +
> +				sizeof(struct setup_data) +
> +				sizeof(struct efi_setup_data);
> +
> +	params = kzalloc(params_misc_sz, GFP_KERNEL);
>  	if (!params) {
>  		ret = -ENOMEM;
>  		goto out_free_loader_data;
>  	}
>  
> +	efi_map_offset = params_cmdline_sz;
> +	efi_setup_data_offset = efi_map_offset + efi_map_sz;
> +
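
The single-segment layout can be checked in isolation. What follows is a minimal sketch in which all sizes are passed in as plain numbers. The example values below are illustrative and were not measured on a real boot. The boot_params+cmdline part and the EFI map are each rounded up to 16 bytes, and the setup_data header plus efi_setup_data go last. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two), like the
 * kernel's ALIGN() does for these sizes. */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

struct seg_layout {
	size_t efi_map_offset;
	size_t efi_setup_data_offset;
	size_t total;
};

/* In the patch these inputs come from sizeof(struct boot_params),
 * cmdline_len + MAX_ELFCOREHDR_STR_LEN, get_efi_runtime_map_size(),
 * sizeof(struct setup_data) and sizeof(struct efi_setup_data). */
static struct seg_layout compute_layout(size_t bootparams_sz,
					size_t cmdline_sz,
					size_t map_sz,
					size_t setup_data_hdr_sz,
					size_t efi_setup_data_sz)
{
	struct seg_layout l;
	size_t params_cmdline_sz = align_up(bootparams_sz + cmdline_sz, 16);
	size_t efi_map_sz = align_up(map_sz, 16);

	l.efi_map_offset = params_cmdline_sz;
	l.efi_setup_data_offset = l.efi_map_offset + efi_map_sz;
	l.total = l.efi_setup_data_offset + setup_data_hdr_sz +
		  efi_setup_data_sz;
	return l;
}
```

For example, compute_layout(4096, 43, 144, 16, 96) places the map at offset 4144 and the setup_data at 4288, and needs 4400 bytes in total. With an empty map, as on a non-EFI boot, the two offsets coincide.
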
>  	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
>  	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
>  
>  	/* Is there a limit on setup header size? */
>  	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
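
On the question in the comment above: the length is encoded in the single byte at 0x0201, so the computed size is bounded. Here is a sketch of the computation. It assumes that the header starts at 0x1f1, as it does in struct boot_params. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of struct setup_header inside struct boot_params (0x1f1 in the
 * x86 boot protocol), hard-coded here instead of using kernel headers. */
#define SETUP_HDR_OFFSET 0x1f1u

/* Per Documentation/x86/boot.txt, the setup header ends at
 * 0x0202 + (byte at 0x0201). Returns the number of header bytes to copy.
 * Because the length lives in one byte, the result never exceeds
 * 0x0202 + 0xff - 0x1f1 = 0x110 bytes. */
static size_t setup_header_size(const uint8_t *kernel, size_t kernel_len)
{
	assert(kernel_len > 0x201);
	return 0x0202u + kernel[0x0201] - SETUP_HDR_OFFSET;
}
```

A byte value of 0x6a at offset 0x201 gives 0x7b (123) bytes, and the maximum, for 0xff, is 0x110 (272) bytes. This sketch does not check whether that bound fits the destination struct.
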
>  
> -	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> -			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
> +	ret = kexec_add_buffer(image, (char *)params, params_misc_sz,
> +			       params_misc_sz, 16, MIN_BOOTPARAM_ADDR,
>  			       ULONG_MAX, 1, &bootparam_load_addr);
>  	if (ret)
>  		goto out_free_params;
> -	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> -		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
> +	pr_debug("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 bootparam_load_addr, params_misc_sz, params_misc_sz);
>  
>  	/* Load kernel */
>  	kernel_buf = kernel + kern16_size;
> @@ -254,7 +276,9 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	if (ret)
>  		goto out_free_params;
>  
> -	ret = kexec_setup_boot_parameters(image, params);
> +	ret = kexec_setup_boot_parameters(image, params, bootparam_load_addr,
> +					  efi_map_offset, efi_map_sz,
> +					  efi_setup_data_offset);
>  	if (ret)
>  		goto out_free_params;
>  
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> index 6a3821b..f31a4b5 100644
> --- a/arch/x86/kernel/machine_kexec.c
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -12,9 +12,11 @@
>  #include <linux/kernel.h>
>  #include <linux/kexec.h>
>  #include <linux/string.h>
> +#include <linux/efi.h>
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
>  #include <asm/crash.h>
> +#include <asm/efi.h>
>  
>  /*
>   * Common code for x86 and x86_64 used for kexec.
> @@ -67,11 +69,10 @@ int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
>  	return 0;
>  }
>  
> -static int setup_memory_map_entries(struct boot_params *params)
> +static int setup_e820_entries(struct boot_params *params)
>  {
>  	unsigned int nr_e820_entries;
>  
> -	/* TODO: What about EFI */
>  	nr_e820_entries = e820_saved.nr_map;
>  	if (nr_e820_entries > E820MAX)
>  		nr_e820_entries = E820MAX;
> @@ -83,8 +84,85 @@ static int setup_memory_map_entries(struct boot_params *params)
>  	return 0;
>  }
>  
> -int kexec_setup_boot_parameters(struct kimage *image,
> -				struct boot_params *params)
> +#ifdef CONFIG_EFI
> +static int setup_efi_info_memmap(struct boot_params *params,
> +				  unsigned long params_load_addr,
> +				  unsigned int efi_map_offset,
> +				  unsigned int efi_map_sz)
> +{
> +	void *efi_map = (void *)params + efi_map_offset;
> +	unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
> +	struct efi_info *ei = &params->efi_info;
> +
> +	if (!efi_map_sz)
> +		return 0;
> +
> +	efi_runtime_map_copy(efi_map, efi_map_sz);
> +
> +	ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
> +	ei->efi_memmap_hi = efi_map_phys_addr >> 32;
> +	ei->efi_memmap_size = efi_map_sz;
> +
> +	return 0;
> +}
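
The efi_memmap/efi_memmap_hi pair carries a 64-bit physical address in two 32-bit fields. Below is a sketch of the split and the matching reassembly, using hypothetical names. It uses uint64_t on purpose. In C, shifting a 32-bit unsigned long right by 32 is undefined behaviour, so the patch's `>> 32` on an unsigned long would only be safe where unsigned long is 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two 32-bit efi_info fields that carry one
 * 64-bit physical address (efi_memmap / efi_memmap_hi). */
struct addr_split {
	uint32_t lo;
	uint32_t hi;
};

static struct addr_split split_phys(uint64_t phys)
{
	struct addr_split s;

	s.lo = (uint32_t)(phys & 0xffffffffu);
	s.hi = (uint32_t)(phys >> 32);
	return s;
}

/* How a consumer would put the address back together. */
static uint64_t join_phys(struct addr_split s)
{
	return ((uint64_t)s.hi << 32) | s.lo;
}

static void check_roundtrip(uint64_t phys)
{
	assert(join_phys(split_phys(phys)) == phys);
}
```

Splitting 0x123456789abc yields hi = 0x1234 and lo = 0x56789abc.
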
> +
> +static int
> +prepare_add_efi_setup_data(struct boot_params *params,
> +		       unsigned long params_load_addr,
> +		       unsigned int efi_setup_data_offset)
> +{
> +	unsigned long setup_data_phys;
> +	struct setup_data *sd = (void *)params + efi_setup_data_offset;
> +	struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
> +
> +	esd->fw_vendor = efi.fw_vendor;
> +	esd->runtime = efi.runtime;
> +	esd->tables = efi.config_table;
> +	esd->smbios = efi.smbios;
> +
> +	sd->type = SETUP_EFI;
> +	sd->len = sizeof(struct efi_setup_data);
> +
> +	/* Add setup data */
> +	setup_data_phys = params_load_addr + efi_setup_data_offset;
> +	sd->next = params->hdr.setup_data;
> +	params->hdr.setup_data = setup_data_phys;
> +
> +	return 0;
> +}
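
prepare_add_efi_setup_data() pushes its node onto the front of the setup_data chain, and the chain is linked by physical address rather than by pointer. Below is a simplified sketch of that prepend and of a walk over the chain. The node layout, the fake physical-memory array and all names are hypothetical. They were chosen so the example runs in userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified setup_data node: 'next' holds a physical address, not a
 * pointer. The layout here is illustrative only. */
struct sd_node {
	uint64_t next;
	uint32_t type;
	uint32_t len;
};

/* Stand-in for physical memory: node i "lives" at PHYS_BASE + i * 0x100. */
#define PHYS_BASE 0x100000u
#define MAX_NODES 8
static struct sd_node phys_mem[MAX_NODES];

static uint64_t phys_of(size_t idx)
{
	return PHYS_BASE + (uint64_t)idx * 0x100;
}

static struct sd_node *node_at(uint64_t phys)
{
	size_t idx = (size_t)((phys - PHYS_BASE) / 0x100);

	assert(phys >= PHYS_BASE && idx < MAX_NODES);
	return &phys_mem[idx];
}

/* Prepend node idx to the chain whose head is *head (0 = empty), in the
 * same two-step order as the patch: link to old head, then update head. */
static void sd_prepend(uint64_t *head, size_t idx, uint32_t type,
		       uint32_t len)
{
	struct sd_node *sd = &phys_mem[idx];

	sd->type = type;
	sd->len = len;
	sd->next = *head;	/* keep whatever was already chained */
	*head = phys_of(idx);	/* the new node becomes the head */
}

/* Walk the chain until next == 0 and count the nodes. */
static size_t sd_count(uint64_t head)
{
	size_t n = 0;

	while (head) {
		head = node_at(head)->next;
		n++;
	}
	return n;
}
```

If you prepend two nodes, the second one ends up at the head, and its next field points at the first. The walk then counts two entries.
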
> +
> +static int setup_efi_state(struct boot_params *params,
> +			unsigned long params_load_addr,
> +			unsigned int efi_map_offset, unsigned int efi_map_sz,
> +			unsigned int efi_setup_data_offset)
> +{
> +	struct efi_info *current_ei = &boot_params.efi_info;
> +	struct efi_info *ei = &params->efi_info;
> +
> +	if (!current_ei->efi_memmap_size)
> +		return 0;
> +
> +	ei->efi_loader_signature = current_ei->efi_loader_signature;
> +	ei->efi_systab = current_ei->efi_systab;
> +	ei->efi_systab_hi = current_ei->efi_systab_hi;
> +
> +	ei->efi_memdesc_version = current_ei->efi_memdesc_version;
> +	ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
> +
> +	setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
> +			      efi_map_sz);
> +	prepare_add_efi_setup_data(params, params_load_addr,
> +				   efi_setup_data_offset);
> +	return 0;
> +}
> +#endif /* CONFIG_EFI */
> +
> +int
> +kexec_setup_boot_parameters(struct kimage *image, struct boot_params *params,
> +			    unsigned long params_load_addr,
> +			    unsigned int efi_map_offset,
> +			    unsigned int efi_map_sz,
> +			    unsigned int efi_setup_data_offset)
>  {
>  	unsigned int nr_e820_entries;
>  	unsigned long long mem_k, start, end;
> @@ -114,7 +192,7 @@ int kexec_setup_boot_parameters(struct kimage *image,
>  	if (image->type == KEXEC_TYPE_CRASH)
>  		crash_setup_memmap_entries(image, params);
>  	else
> -		setup_memory_map_entries(params);
> +		setup_e820_entries(params);
>  
>  	nr_e820_entries = params->e820_entries;
>  
> @@ -135,6 +213,11 @@ int kexec_setup_boot_parameters(struct kimage *image,
>  		}
>  	}
>  
> +#ifdef CONFIG_EFI
> +	/* Setup EFI state */
> +	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> +			efi_setup_data_offset);
> +#endif
>  	/* Setup EDD info */
>  	memcpy(params->eddbuf, boot_params.eddbuf,
>  				EDDMAXNR * sizeof(struct edd_info));
> diff --git a/drivers/firmware/efi/runtime-map.c b/drivers/firmware/efi/runtime-map.c
> index 97cdd16..40f2213 100644
> --- a/drivers/firmware/efi/runtime-map.c
> +++ b/drivers/firmware/efi/runtime-map.c
> @@ -138,6 +138,27 @@ add_sysfs_runtime_map_entry(struct kobject *kobj, int nr)
>  	return entry;
>  }
>  
> +int get_efi_runtime_map_size(void)
> +{
> +	return nr_efi_runtime_map * efi_memdesc_size;
> +}
> +
> +int get_efi_runtime_map_desc_size(void)
> +{
> +	return efi_memdesc_size;
> +}
> +
> +int efi_runtime_map_copy(void *buf, size_t bufsz)
> +{
> +	size_t sz = get_efi_runtime_map_size();
> +
> +	if (sz > bufsz)
> +		sz = bufsz;
> +
> +	memcpy(buf, efi_runtime_map, sz);
> +	return 0;
> +}
> +
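
efi_runtime_map_copy() clamps the copy to the destination size. Below is a sketch of the same rule with hypothetical names. It returns the byte count instead of 0, so the caller can tell whether truncation happened. In bzImage64_load() the destination size is the 16-byte-aligned efi_map_sz, which is never smaller than the map itself, so the clamp should not trigger on that path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most bufsz bytes of an nr * desc_size descriptor array, using
 * the same clamping rule as efi_runtime_map_copy(). Returns the number
 * of bytes actually copied. */
static size_t clamped_map_copy(void *buf, size_t bufsz,
			       const void *map, size_t nr, size_t desc_size)
{
	size_t sz = nr * desc_size;

	if (sz > bufsz)
		sz = bufsz;
	assert(sz == 0 || (buf != NULL && map != NULL));
	if (sz)		/* avoid handing possibly-NULL pointers to memcpy */
		memcpy(buf, map, sz);
	return sz;
}
```
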
>  void efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size)
>  {
>  	efi_runtime_map = map;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 6c100ff..c2e5c4a 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1131,6 +1131,9 @@ int efivars_sysfs_init(void);
>  #ifdef CONFIG_EFI_RUNTIME_MAP
>  int efi_runtime_map_init(struct kobject *);
>  void efi_runtime_map_setup(void *, int, u32);
> +int get_efi_runtime_map_size(void);
> +int get_efi_runtime_map_desc_size(void);
> +int efi_runtime_map_copy(void *buf, size_t bufsz);
>  #else
>  static inline int efi_runtime_map_init(struct kobject *kobj)
>  {
> @@ -1139,6 +1142,22 @@ static inline int efi_runtime_map_init(struct kobject *kobj)
>  
>  static inline void
>  efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}
> +
> +static inline int get_efi_runtime_map_size(void)
> +{
> +	return 0;
> +}
> +
> +static inline int get_efi_runtime_map_desc_size(void)
> +{
> +	return 0;
> +}
> +
> +static inline int efi_runtime_map_copy(void *buf, size_t bufsz)
> +{
> +	return 0;
> +}
> +
>  #endif
>  
>  #endif /* _LINUX_EFI_H */
> -- 
> 1.9.0
> 
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


>  
>  	/* Is there a limit on setup header size? */
>  	memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
>  
> -	ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> -			       params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
> +	ret = kexec_add_buffer(image, (char *)params, params_misc_sz,
> +			       params_misc_sz, 16, MIN_BOOTPARAM_ADDR,
>  			       ULONG_MAX, 1, &bootparam_load_addr);
>  	if (ret)
>  		goto out_free_params;
> -	pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> -		 bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
> +	pr_debug("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
> +		 bootparam_load_addr, params_misc_sz, params_misc_sz);
>  
>  	/* Load kernel */
>  	kernel_buf = kernel + kern16_size;
> @@ -254,7 +276,9 @@ void *bzImage64_load(struct kimage *image, char *kernel,
>  	if (ret)
>  		goto out_free_params;
>  
> -	ret = kexec_setup_boot_parameters(image, params);
> +	ret = kexec_setup_boot_parameters(image, params, bootparam_load_addr,
> +					  efi_map_offset, efi_map_sz,
> +					  efi_setup_data_offset);
>  	if (ret)
>  		goto out_free_params;
>  
> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> index 6a3821b..f31a4b5 100644
> --- a/arch/x86/kernel/machine_kexec.c
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -12,9 +12,11 @@
>  #include <linux/kernel.h>
>  #include <linux/kexec.h>
>  #include <linux/string.h>
> +#include <linux/efi.h>
>  #include <asm/bootparam.h>
>  #include <asm/setup.h>
>  #include <asm/crash.h>
> +#include <asm/efi.h>
>  
>  /*
>   * Common code for x86 and x86_64 used for kexec.
> @@ -67,11 +69,10 @@ int kexec_setup_cmdline(struct kimage *image, struct boot_params *params,
>  	return 0;
>  }
>  
> -static int setup_memory_map_entries(struct boot_params *params)
> +static int setup_e820_entries(struct boot_params *params)
>  {
>  	unsigned int nr_e820_entries;
>  
> -	/* TODO: What about EFI */
>  	nr_e820_entries = e820_saved.nr_map;
>  	if (nr_e820_entries > E820MAX)
>  		nr_e820_entries = E820MAX;
> @@ -83,8 +84,85 @@ static int setup_memory_map_entries(struct boot_params *params)
>  	return 0;
>  }
>  
> -int kexec_setup_boot_parameters(struct kimage *image,
> -				struct boot_params *params)
> +#ifdef CONFIG_EFI
> +static int setup_efi_info_memmap(struct boot_params *params,
> +				  unsigned long params_load_addr,
> +				  unsigned int efi_map_offset,
> +				  unsigned int efi_map_sz)
> +{
> +	void *efi_map = (void *)params + efi_map_offset;
> +	unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
> +	struct efi_info *ei = &params->efi_info;
> +
> +	if (!efi_map_sz)
> +		return 0;
> +
> +	efi_runtime_map_copy(efi_map, efi_map_sz);
> +
> +	ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
> +	ei->efi_memmap_hi = efi_map_phys_addr >> 32;
> +	ei->efi_memmap_size = efi_map_sz;
> +
> +	return 0;
> +}
> +
> +static int
> +prepare_add_efi_setup_data(struct boot_params *params,
> +		       unsigned long params_load_addr,
> +		       unsigned int efi_setup_data_offset)
> +{
> +	unsigned long setup_data_phys;
> +	struct setup_data *sd = (void *)params + efi_setup_data_offset;
> +	struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
> +
> +	esd->fw_vendor = efi.fw_vendor;
> +	esd->runtime = efi.runtime;
> +	esd->tables = efi.config_table;
> +	esd->smbios = efi.smbios;
> +
> +	sd->type = SETUP_EFI;
> +	sd->len = sizeof(struct efi_setup_data);
> +
> +	/* Add setup data */
> +	setup_data_phys = params_load_addr + efi_setup_data_offset;
> +	sd->next = params->hdr.setup_data;
> +	params->hdr.setup_data = setup_data_phys;
> +
> +	return 0;
> +}
> +
> +static int setup_efi_state(struct boot_params *params,
> +			unsigned long params_load_addr,
> +			unsigned int efi_map_offset, unsigned int efi_map_sz,
> +			unsigned int efi_setup_data_offset)
> +{
> +	struct efi_info *current_ei = &boot_params.efi_info;
> +	struct efi_info *ei = &params->efi_info;
> +
> +	if (!current_ei->efi_memmap_size)
> +		return 0;
> +
> +	ei->efi_loader_signature = current_ei->efi_loader_signature;
> +	ei->efi_systab = current_ei->efi_systab;
> +	ei->efi_systab_hi = current_ei->efi_systab_hi;
> +
> +	ei->efi_memdesc_version = current_ei->efi_memdesc_version;
> +	ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
> +
> +	setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
> +			      efi_map_sz);
> +	prepare_add_efi_setup_data(params, params_load_addr,
> +				   efi_setup_data_offset);
> +	return 0;
> +}
> +#endif /* CONFIG_EFI */
> +
> +int
> +kexec_setup_boot_parameters(struct kimage *image, struct boot_params *params,
> +			    unsigned long params_load_addr,
> +			    unsigned int efi_map_offset,
> +			    unsigned int efi_map_sz,
> +			    unsigned int efi_setup_data_offset)
>  {
>  	unsigned int nr_e820_entries;
>  	unsigned long long mem_k, start, end;
> @@ -114,7 +192,7 @@ int kexec_setup_boot_parameters(struct kimage *image,
>  	if (image->type == KEXEC_TYPE_CRASH)
>  		crash_setup_memmap_entries(image, params);
>  	else
> -		setup_memory_map_entries(params);
> +		setup_e820_entries(params);
>  
>  	nr_e820_entries = params->e820_entries;
>  
> @@ -135,6 +213,11 @@ int kexec_setup_boot_parameters(struct kimage *image,
>  		}
>  	}
>  
> +#ifdef CONFIG_EFI
> +	/* Setup EFI state */
> +	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> +			efi_setup_data_offset);
> +#endif
>  	/* Setup EDD info */
>  	memcpy(params->eddbuf, boot_params.eddbuf,
>  				EDDMAXNR * sizeof(struct edd_info));
> diff --git a/drivers/firmware/efi/runtime-map.c b/drivers/firmware/efi/runtime-map.c
> index 97cdd16..40f2213 100644
> --- a/drivers/firmware/efi/runtime-map.c
> +++ b/drivers/firmware/efi/runtime-map.c
> @@ -138,6 +138,27 @@ add_sysfs_runtime_map_entry(struct kobject *kobj, int nr)
>  	return entry;
>  }
>  
> +int get_efi_runtime_map_size(void)
> +{
> +	return nr_efi_runtime_map * efi_memdesc_size;
> +}
> +
> +int get_efi_runtime_map_desc_size(void)
> +{
> +	return efi_memdesc_size;
> +}
> +
> +int efi_runtime_map_copy(void *buf, size_t bufsz)
> +{
> +	size_t sz = get_efi_runtime_map_size();
> +
> +	if (sz > bufsz)
> +		sz = bufsz;
> +
> +	memcpy(buf, efi_runtime_map, sz);
> +	return 0;
> +}
> +
>  void efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size)
>  {
>  	efi_runtime_map = map;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 6c100ff..c2e5c4a 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1131,6 +1131,9 @@ int efivars_sysfs_init(void);
>  #ifdef CONFIG_EFI_RUNTIME_MAP
>  int efi_runtime_map_init(struct kobject *);
>  void efi_runtime_map_setup(void *, int, u32);
> +int get_efi_runtime_map_size(void);
> +int get_efi_runtime_map_desc_size(void);
> +int efi_runtime_map_copy(void *buf, size_t bufsz);
>  #else
>  static inline int efi_runtime_map_init(struct kobject *kobj)
>  {
> @@ -1139,6 +1142,22 @@ static inline int efi_runtime_map_init(struct kobject *kobj)
>  
>  static inline void
>  efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}
> +
> +static inline int get_efi_runtime_map_size(void)
> +{
> +	return 0;
> +}
> +
> +static inline int get_efi_runtime_map_desc_size(void)
> +{
> +	return 0;
> +}
> +
> +static inline int efi_runtime_map_copy(void *buf, size_t bufsz)
> +{
> +	return 0;
> +}
> +
>  #endif
>  
>  #endif /* _LINUX_EFI_H */
> -- 
> 1.9.0
> 
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 214+ messages in thread
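The single-allocation scheme in bzImage64_load() (patch 13/13 above) comes down to a little offset arithmetic. It rounds the boot_params + command line block up to 16 bytes, places the 16-byte-aligned EFI runtime map right after it, and puts the setup_data header plus struct efi_setup_data at the end. The userspace sketch below reproduces that arithmetic so the offsets can be checked by hand. align_up() and compute_bp_layout() are hypothetical helpers. The sizes used in the example (a 4096-byte boot_params, a 100-byte command line, 480 bytes of runtime map, a 16-byte setup_data header, a 96-byte efi_setup_data) are illustrative assumptions, not values read from a running kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round x up to the next multiple of a.  Like the kernel's ALIGN(),
 * this only works when a is a power of two.
 */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);	/* power-of-two alignment */
	return (x + a - 1) & ~(a - 1);
}

/* Offsets inside the single boot_params + cmdline + EFI buffer. */
struct bp_layout {
	size_t efi_map_offset;		/* start of the copied EFI runtime map */
	size_t efi_setup_data_offset;	/* start of setup_data + efi_setup_data */
	size_t total;			/* bytes to allocate and load */
};

/*
 * Mirrors the size arithmetic in bzImage64_load(); every size is passed
 * in by the caller (illustrative values), not taken from kernel headers.
 */
static struct bp_layout compute_bp_layout(size_t boot_params_sz,
					  size_t cmdline_len,
					  size_t elfcorehdr_str_len,
					  size_t efi_map_sz,
					  size_t setup_data_hdr_sz,
					  size_t efi_setup_data_sz)
{
	struct bp_layout l;
	size_t params_cmdline_sz;

	/* Both variable-size regions are padded to 16 bytes. */
	efi_map_sz = align_up(efi_map_sz, 16);
	params_cmdline_sz = align_up(boot_params_sz + cmdline_len +
				     elfcorehdr_str_len, 16);

	/* EFI map follows the params block; setup_data follows the map. */
	l.efi_map_offset = params_cmdline_sz;
	l.efi_setup_data_offset = l.efi_map_offset + efi_map_sz;
	l.total = params_cmdline_sz + efi_map_sz + setup_data_hdr_sz +
		  efi_setup_data_sz;
	return l;
}
```

With those inputs, compute_bp_layout(4096, 100, 30, 480, 16, 96) places the EFI map at offset 4240 (4096 + 100 + 30 = 4226, rounded up to 16), the setup_data header at 4720, and sizes the whole segment at 4832 bytes. Because everything shares one kexec segment, a single kexec_add_buffer() call suffices, and every embedded physical address is simply bootparam_load_addr plus an offset.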

* Re: [PATCH 13/13] kexec: Support kexec/kdump on EFI systems
@ 2014-06-18 17:39       ` Vivek Goyal
  0 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-18 17:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, linux-efi

On Wed, Jun 18, 2014 at 06:06:07PM +0200, Borislav Petkov wrote:
> Adding linux-efi to CC for the EFI bits. Please CC it on your next
> submission.

Sure, I will. Thanks for reviewing the patches. I will get these ready
for the next round of submission.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread
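In prepare_add_efi_setup_data() from patch 13/13, the new SETUP_EFI entry is pushed onto the front of the boot protocol's singly linked setup_data list. The old params->hdr.setup_data becomes sd->next, and the entry's own physical address becomes the new head. The sketch below models just that list handling in userspace. struct setup_data_model is a simplified stand-in for struct setup_data. The slot-based "physical addresses" (slot_phys(), phys_to_slot()) are a hypothetical mapping so the chain can be walked without real physical memory.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified stand-in for struct setup_data.  Only the header is
 * modeled; in the real protocol it is followed by len bytes of payload
 * (for SETUP_EFI, a struct efi_setup_data).
 */
struct setup_data_model {
	uint64_t next;	/* physical address of the next entry, 0 = end */
	uint32_t type;	/* e.g. SETUP_EFI in the real protocol */
	uint32_t len;	/* payload length in bytes */
};

/*
 * Hypothetical "physical memory": slot i pretends to live at physical
 * address 0x1000 * (i + 1), so that 0 can keep meaning "end of list".
 */
#define NR_SLOTS 4
static struct setup_data_model slots[NR_SLOTS];

static uint64_t slot_phys(int i)
{
	assert(i >= 0 && i < NR_SLOTS);
	return 0x1000ull * (uint64_t)(i + 1);
}

static struct setup_data_model *phys_to_slot(uint64_t phys)
{
	uint64_t i = phys / 0x1000 - 1;

	assert(phys != 0 && phys % 0x1000 == 0 && i < NR_SLOTS);
	return &slots[i];
}

/*
 * Same two steps as the patch: link the old list head behind the new
 * entry, then make the new entry's physical address the head.
 */
static void prepend_setup_data(uint64_t *head, struct setup_data_model *sd,
			       uint64_t sd_phys)
{
	sd->next = *head;
	*head = sd_phys;
}

/* Walk the chain the way a consumer of the list would, counting entries. */
static int count_setup_data(uint64_t head)
{
	int n = 0;

	while (head) {
		head = phys_to_slot(head)->next;
		n++;
	}
	return n;
}
```

Prepending is O(1) and preserves whatever head was already in params->hdr.setup_data (that field is copied from the image's setup header), which is why the patch saves the old head in sd->next before overwriting it.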

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-15 16:35     ` Borislav Petkov
@ 2014-06-24 17:31       ` Vivek Goyal
  -1 siblings, 0 replies; 214+ messages in thread
From: Vivek Goyal @ 2014-06-24 17:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, gong.chen

On Sun, Jun 15, 2014 at 06:35:15PM +0200, Borislav Petkov wrote:

[..]
> > +int kexec_setup_initrd(struct boot_params *params,
> > +		unsigned long initrd_load_addr, unsigned long initrd_len)
> > +{
> > +	params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> > +	params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
> 
> We have more readable GENMASK* macros for contiguous masks. This one
> will then look like:
> 
> 	params->hdr.ramdisk_image = initrd_load_addr & GENMASK(31, 0);
> 	params->hdr.ramdisk_size = initrd_len & GENMASK(31, 0);
> 
> and this way we know exactly about which bits are we talking about. :)

[ CC gong.chen@linux.intel.com ]

GENMASK(31, 0) produces a compiler warning:

arch/x86/kernel/machine_kexec.c: In function ‘kexec_setup_initrd’:
arch/x86/kernel/machine_kexec.c:30:2: warning: left shift count >= width of type [enabled by default]
  params->hdr.ramdisk_image = initrd_load_addr & GENMASK(31, 0);
  ^
arch/x86/kernel/machine_kexec.c:31:2: warning: left shift count >= width of type [enabled by default]
  params->hdr.ramdisk_size = initrd_len & GENMASK(31, 0);
  ^
arch/x86/kernel/machine_kexec.c: In function ‘kexec_setup_cmdline’:
arch/x86/kernel/machine_kexec.c:52:2: warning: left shift count >= width of type [enabled by default]
  cmdline_low_32 = cmdline_ptr_phys & GENMASK(31, 0);

I think the problem is that in this case we shift 1 left by 32 bits
(31 - 0 + 1), and a shift count of 32 is not less than the width of
unsigned. So there is this corner case where it does not seem to work
(or at least it produces a warning).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 214+ messages in thread
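Because the masks in kexec_setup_initrd() and kexec_setup_cmdline() cover exactly the low 32 bits, one way to sidestep the shift-width corner case entirely is to let casts do the truncation, in the spirit of the kernel's lower_32_bits()/upper_32_bits() helpers. The userspace sketch below illustrates that idea; it is not what the patch does, and LOW32(), HIGH32() and split_phys_addr() are hypothetical names. The same low/high split appears in setup_efi_info_memmap() in patch 13/13, for efi_memmap and efi_memmap_hi.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers in the spirit of lower_32_bits()/upper_32_bits():
 * the cast does the truncation, so no mask is built from a shift.  The
 * high half uses two 16-bit shifts, so no single shift count reaches the
 * operand's width even if n happens to be a 32-bit type.
 */
#define LOW32(n)  ((uint32_t)(n))
#define HIGH32(n) ((uint32_t)(((n) >> 16) >> 16))

/* Compile-time check: bit 32 of the input lands in bit 0 of the high half. */
static_assert(HIGH32(UINT64_C(0x100000000)) == 1u,
	      "high half of 2^32 must be 1");

/* Split a physical address into low/high 32-bit fields, boot_params style. */
static void split_phys_addr(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
	*lo = LOW32(addr);	/* e.g. efi_memmap in struct efi_info */
	*hi = HIGH32(addr);	/* e.g. efi_memmap_hi */
}
```

The double 16-bit shift in HIGH32() keeps the expression well-defined even when it is applied to a 32-bit value, where a single shift by 32 would hit the same "shift count >= width of type" problem.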

* Re: [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry
  2014-06-24 17:31       ` Vivek Goyal
@ 2014-06-24 18:23         ` Borislav Petkov
  -1 siblings, 0 replies; 214+ messages in thread
From: Borislav Petkov @ 2014-06-24 18:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, kexec, ebiederm, hpa, mjg59, greg, jkosina, dyoung,
	chaowang, bhe, akpm, gong.chen

On Tue, Jun 24, 2014 at 01:31:25PM -0400, Vivek Goyal wrote:
> I think problem is that we shift 1 by 32 bits in this case (31 - 0 +
> 1) and that overflows the size of unsigned. So there is this corner
> case where it does not seem to work (or atleast outputs warning).

Right, that is a corner case which overflows the shift. The only thing I
can think of right now is:

#define GENMASK(h, l)           ((u32)GENMASK_ULL(h, l))

The cast is to u32 because we're implicitly assuming we're dealing with
32-bit unsigned quantities.

There might be a better solution though...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 214+ messages in thread
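Boris's suggestion builds the 32-bit mask from the 64-bit one, so the intermediate 1ULL << 32 is legal. The userspace sketch below tries that next to a different, shift-safe formulation that starts from all-ones and clears the bits outside [l, h]; the second form is included only for comparison and was not proposed in this thread. GENMASK_ULL_NAIVE() is an assumed "make (h - l + 1) ones, then shift left by l" shape for the 64-bit macro, and all three macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed shape of the 64-bit mask macro: build (h - l + 1) one-bits,
 * then shift them up to bit l.  For (31, 0) the intermediate value is
 * 1ULL << 32, which is fine in a 64-bit type.  For (63, 0) it would be
 * 1ULL << 64, which is undefined, so the corner case just moves.
 */
#define GENMASK_ULL_NAIVE(h, l) \
	(((1ULL << ((h) - (l) + 1)) - 1) << (l))

/* The cast-based wrapper suggested above, under a hypothetical name. */
#define GENMASK32_CAST(h, l) ((uint32_t)GENMASK_ULL_NAIVE(h, l))

/*
 * Shift-safe alternative: clear the bits below l, then the bits above h.
 * For 0 <= l <= h <= 31 both shift counts stay within 0..31.
 */
#define GENMASK32_SAFE(h, l) \
	((UINT32_MAX << (l)) & (UINT32_MAX >> (31 - (h))))

/* Compile-time check that both forms agree on the problem case. */
static_assert(GENMASK32_CAST(31, 0) == GENMASK32_SAFE(31, 0),
	      "full 32-bit masks must match");
```

The cast wrapper fixes GENMASK(31, 0) but leaves the same corner case in the 64-bit macro for (63, 0), where the intermediate shift count reaches 64. The all-ones formulation never shifts by 32 or more for any 0 <= l <= h <= 31, so it has no such corner case.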

Thread overview: 214+ messages
2014-06-03 13:06 [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading Vivek Goyal
2014-06-03 13:06 ` [PATCH 01/13] bin2c: Move bin2c in scripts/basic Vivek Goyal
2014-06-03 16:01   ` Borislav Petkov
2014-06-03 17:13     ` Vivek Goyal
2014-06-03 13:06 ` [PATCH 02/13] kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C Vivek Goyal
2014-06-04  9:13   ` Borislav Petkov
2014-06-03 13:06 ` [PATCH 03/13] kexec: Move segment verification code in a separate function Vivek Goyal
2014-06-04  9:32   ` Borislav Petkov
2014-06-04 18:47     ` Vivek Goyal
2014-06-04 20:30       ` Borislav Petkov
2014-06-05 14:05         ` Vivek Goyal
2014-06-05 14:07           ` Borislav Petkov
2014-06-03 13:06 ` [PATCH 04/13] resource: Provide new functions to walk through resources Vivek Goyal
2014-06-04 10:24   ` Borislav Petkov
2014-06-05 13:58     ` Vivek Goyal
2014-06-03 13:06 ` [PATCH 05/13] kexec: Make kexec_segment user buffer pointer a union Vivek Goyal
2014-06-04 10:34   ` Borislav Petkov
2014-06-03 13:06 ` [PATCH 06/13] kexec: New syscall kexec_file_load() declaration Vivek Goyal
2014-06-04 15:18   ` Borislav Petkov
2014-06-05  9:56   ` WANG Chao
2014-06-05 15:16     ` Vivek Goyal
2014-06-05 15:22       ` Vivek Goyal
2014-06-06  6:34         ` WANG Chao
2014-06-03 13:06 ` [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load Vivek Goyal
2014-06-05 11:15   ` Borislav Petkov
2014-06-05 20:17     ` Vivek Goyal
2014-06-06  2:11       ` Borislav Petkov
2014-06-06 18:02         ` Vivek Goyal
2014-06-11 14:13           ` Borislav Petkov
2014-06-11 17:04             ` Vivek Goyal
2014-06-06  6:56   ` WANG Chao
2014-06-06 18:19     ` Vivek Goyal
2014-06-09  2:11       ` Dave Young
2014-06-09  5:35         ` WANG Chao
2014-06-09 15:41           ` Vivek Goyal
2014-06-13  7:50             ` Borislav Petkov
2014-06-13  8:00               ` WANG Chao
2014-06-13  8:10                 ` Borislav Petkov
2014-06-13  8:24                   ` WANG Chao
2014-06-13  8:30                     ` Borislav Petkov
2014-06-13 12:49                 ` Vivek Goyal
2014-06-13 12:46               ` Vivek Goyal
2014-06-13 15:36                 ` Borislav Petkov
2014-06-16 17:38                   ` Vivek Goyal
2014-06-16 20:05                     ` Borislav Petkov
2014-06-16 20:53                       ` Vivek Goyal
2014-06-16 21:09                         ` Borislav Petkov
2014-06-16 21:25                           ` H. Peter Anvin
2014-06-16 21:43                             ` Vivek Goyal
2014-06-16 22:10                               ` Borislav Petkov
2014-06-16 22:49                               ` H. Peter Anvin
2014-06-09 15:30         ` Vivek Goyal
2014-06-03 13:06 ` [PATCH 08/13] purgatory/sha256: Provide implementation of sha256 in purgaotory context Vivek Goyal
2014-06-03 13:06 ` [PATCH 09/13] purgatory: Core purgatory functionality Vivek Goyal
2014-06-05 20:05   ` Borislav Petkov
2014-06-06 19:51     ` Vivek Goyal
2014-06-13 10:17       ` Borislav Petkov
2014-06-16 17:25         ` Vivek Goyal
2014-06-16 20:10           ` Borislav Petkov
2014-06-03 13:06 ` [PATCH 10/13] kexec: Load and Relocate purgatory at kernel load time Vivek Goyal
2014-06-10 16:31   ` Borislav Petkov
2014-06-11 19:24     ` Vivek Goyal
2014-06-13 16:14       ` Borislav Petkov
2014-06-03 13:07 ` [PATCH 11/13] kexec-bzImage: Support for loading bzImage using 64bit entry Vivek Goyal
2014-06-15 16:35   ` Borislav Petkov
2014-06-15 16:56     ` H. Peter Anvin
2014-06-16 20:06     ` Vivek Goyal
2014-06-16 20:57       ` Borislav Petkov
2014-06-16 21:15         ` Vivek Goyal
2014-06-16 21:27           ` Borislav Petkov
2014-06-16 21:45             ` Vivek Goyal
2014-06-24 17:31     ` Vivek Goyal
2014-06-24 18:23       ` Borislav Petkov
2014-06-03 13:07 ` [PATCH 12/13] kexec: Support for Kexec on panic using new system call Vivek Goyal
2014-06-17 21:43   ` Borislav Petkov
2014-06-18 14:20     ` Vivek Goyal
2014-06-03 13:07 ` [PATCH 13/13] kexec: Support kexec/kdump on EFI systems Vivek Goyal
2014-06-18 15:43   ` Borislav Petkov
2014-06-18 16:06   ` Borislav Petkov
2014-06-18 17:39     ` Vivek Goyal
2014-06-03 13:12 ` [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading Vivek Goyal
2014-06-04  9:22   ` WANG Chao
2014-06-04 17:50     ` Vivek Goyal
2014-06-04 19:39       ` Michael Kerrisk
2014-06-05 14:04         ` Vivek Goyal
2014-06-06  5:45           ` Michael Kerrisk (man-pages)
2014-06-06 18:04             ` Vivek Goyal
2014-06-05  8:31   ` Dave Young
2014-06-05 15:01     ` Vivek Goyal
2014-06-06  7:37       ` Dave Young
2014-06-06 20:04         ` Vivek Goyal
2014-06-09  1:57           ` Dave Young
2014-06-06 20:37         ` H. Peter Anvin
2014-06-06 20:58           ` Matt Fleming
2014-06-06 21:00             ` H. Peter Anvin
2014-06-06 21:02               ` Matt Fleming
2014-06-12  5:42 ` Dave Young
2014-06-12 12:36   ` Vivek Goyal
2014-06-17 14:24   ` Vivek Goyal
2014-06-18  1:45     ` Dave Young
2014-06-18  1:52       ` Dave Young
2014-06-18 12:40         ` Vivek Goyal
2014-06-16 21:13 ` Borislav Petkov
2014-06-17 13:24   ` Vivek Goyal
