linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/11] xen: Initial kexec/kdump implementation
@ 2012-11-20 15:04 Daniel Kiper
  2012-11-20 15:04 ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel


Hi,

This set of patches contains the initial kexec/kdump implementation for Xen, v2
(the previous version was posted to only a few people by mistake; sorry for that).
Currently only dom0 is supported; however, almost all of the infrastructure
required for domU support is ready.

Jan Beulich suggested merging the Xen x86 assembler code with the bare-metal x86
code. This could simplify the code and slightly reduce the kernel size. However,
this solution requires some changes in the bare-metal x86 code. The most important
change would be the format of the page_list array: the Xen kexec hypercall requires
physical addresses to alternate with virtual ones. This and the other required
changes have not been made in this version because I am not sure that such a
solution would be accepted by the kexec/kdump maintainers. I hope this email sparks
a discussion on that topic.

Daniel

 arch/x86/include/asm/kexec.h         |   10 +-
 arch/x86/include/asm/xen/hypercall.h |    6 +
 arch/x86/include/asm/xen/kexec.h     |   83 +++++++++
 arch/x86/kernel/machine_kexec_64.c   |   12 +-
 arch/x86/kernel/vmlinux.lds.S        |    7 +-
 arch/x86/xen/Makefile                |    3 +
 arch/x86/xen/enlighten.c             |   12 ++
 arch/x86/xen/kexec.c                 |  150 ++++++++++++++++
 arch/x86/xen/machine_kexec_32.c      |  245 ++++++++++++++++++++++++++
 arch/x86/xen/machine_kexec_64.c      |  301 +++++++++++++++++++++++++++++++
 arch/x86/xen/relocate_kernel_32.S    |  323 ++++++++++++++++++++++++++++++++++
 arch/x86/xen/relocate_kernel_64.S    |  309 ++++++++++++++++++++++++++++++++
 drivers/xen/sys-hypervisor.c         |   42 +++++-
 include/linux/kexec.h                |   18 ++
 include/xen/interface/xen.h          |   33 ++++
 kernel/kexec.c                       |  125 ++++++++++----
 16 files changed, 1636 insertions(+), 43 deletions(-)

Daniel Kiper (11):
      kexec: introduce kexec_ops struct
      x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
      xen: Introduce architecture independent data for kexec/kdump
      x86/xen: Introduce architecture dependent data for kexec/kdump
      x86/xen: Register resources required by kexec-tools
      x86/xen: Add i386 kexec/kdump implementation
      x86/xen: Add x86_64 kexec/kdump implementation
      x86/xen: Add kexec/kdump makefile rules
      x86/xen/enlighten: Add init and crash kexec/kdump hooks
      drivers/xen: Export vmcoreinfo through sysfs
      x86: Add Xen kexec control code size check to linker script


* [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-20 15:04 [PATCH v2 00/11] xen: Initial kexec/kdump implementation Daniel Kiper
@ 2012-11-20 15:04 ` Daniel Kiper
  2012-11-20 15:04   ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Daniel Kiper
  2012-11-20 16:40   ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Eric W. Biederman
  0 siblings, 2 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Some kexec/kdump implementations (e.g. Xen PVOPS) cannot use the default
functions or require some changes in the behavior of the kexec/kdump generic
code. To cope with that problem, the kexec_ops struct is introduced. It allows
a developer to replace all or some of the functions and to control some
functionality of the kexec/kdump generic code.

The default behavior of the kexec/kdump generic code is not changed.

v2 - suggestions/fixes:
   - add comment for kexec_ops.crash_alloc_temp_store member
     (suggested by Konrad Rzeszutek Wilk),
   - simplify kexec_ops usage
     (suggested by Konrad Rzeszutek Wilk).

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 include/linux/kexec.h |   26 ++++++++++
 kernel/kexec.c        |  131 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 125 insertions(+), 32 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0b8458..c8d0b35 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -116,7 +116,33 @@ struct kimage {
 #endif
 };
 
+struct kexec_ops {
+	/*
+	 * Some kdump implementations (e.g. Xen PVOPS dom0) cannot directly
+	 * access the crash kernel memory area. In this situation they must
+	 * allocate memory outside of it and later move the contents from
+	 * temporary storage to their final resting places (usually done by
+	 * relocate_kernel()). Such behavior can be enforced by setting the
+	 * crash_alloc_temp_store member to true.
+	 */
+	bool crash_alloc_temp_store;
+	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
+						unsigned int order,
+						unsigned long limit);
+	void (*kimage_free_pages)(struct page *page);
+	unsigned long (*page_to_pfn)(struct page *page);
+	struct page *(*pfn_to_page)(unsigned long pfn);
+	unsigned long (*virt_to_phys)(volatile void *address);
+	void *(*phys_to_virt)(unsigned long address);
+	int (*machine_kexec_prepare)(struct kimage *image);
+	int (*machine_kexec_load)(struct kimage *image);
+	void (*machine_kexec_cleanup)(struct kimage *image);
+	void (*machine_kexec_unload)(struct kimage *image);
+	void (*machine_kexec_shutdown)(void);
+	void (*machine_kexec)(struct kimage *image);
+};
 
+extern struct kexec_ops kexec_ops;
 
 /* kexec interface functions */
 extern void machine_kexec(struct kimage *image);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 5e4bd78..a5f7324 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -55,6 +55,56 @@ struct resource crashk_res = {
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
 
+static struct page *kimage_alloc_pages(gfp_t gfp_mask,
+					unsigned int order,
+					unsigned long limit);
+static void kimage_free_pages(struct page *page);
+
+static unsigned long generic_page_to_pfn(struct page *page)
+{
+	return page_to_pfn(page);
+}
+
+static struct page *generic_pfn_to_page(unsigned long pfn)
+{
+	return pfn_to_page(pfn);
+}
+
+static unsigned long generic_virt_to_phys(volatile void *address)
+{
+	return virt_to_phys(address);
+}
+
+static void *generic_phys_to_virt(unsigned long address)
+{
+	return phys_to_virt(address);
+}
+
+static int generic_kexec_load(struct kimage *image)
+{
+	return 0;
+}
+
+static void generic_kexec_unload(struct kimage *image)
+{
+}
+
+struct kexec_ops kexec_ops = {
+	.crash_alloc_temp_store = false,
+	.kimage_alloc_pages = kimage_alloc_pages,
+	.kimage_free_pages = kimage_free_pages,
+	.page_to_pfn = generic_page_to_pfn,
+	.pfn_to_page = generic_pfn_to_page,
+	.virt_to_phys = generic_virt_to_phys,
+	.phys_to_virt = generic_phys_to_virt,
+	.machine_kexec_prepare = machine_kexec_prepare,
+	.machine_kexec_load = generic_kexec_load,
+	.machine_kexec_cleanup = machine_kexec_cleanup,
+	.machine_kexec_unload = generic_kexec_unload,
+	.machine_kexec_shutdown = machine_shutdown,
+	.machine_kexec = machine_kexec
+};
+
 int kexec_should_crash(struct task_struct *p)
 {
 	if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
@@ -354,7 +404,9 @@ static int kimage_is_destination_range(struct kimage *image,
 	return 0;
 }
 
-static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
+static struct page *kimage_alloc_pages(gfp_t gfp_mask,
+					unsigned int order,
+					unsigned long limit)
 {
 	struct page *pages;
 
@@ -391,7 +443,7 @@ static void kimage_free_page_list(struct list_head *list)
 
 		page = list_entry(pos, struct page, lru);
 		list_del(&page->lru);
-		kimage_free_pages(page);
+		(*kexec_ops.kimage_free_pages)(page);
 	}
 }
 
@@ -424,10 +476,11 @@ static struct page *kimage_alloc_normal_control_pages(struct kimage *image,
 	do {
 		unsigned long pfn, epfn, addr, eaddr;
 
-		pages = kimage_alloc_pages(GFP_KERNEL, order);
+		pages = (*kexec_ops.kimage_alloc_pages)(GFP_KERNEL, order,
+							KEXEC_CONTROL_MEMORY_LIMIT);
 		if (!pages)
 			break;
-		pfn   = page_to_pfn(pages);
+		pfn   = (*kexec_ops.page_to_pfn)(pages);
 		epfn  = pfn + count;
 		addr  = pfn << PAGE_SHIFT;
 		eaddr = epfn << PAGE_SHIFT;
@@ -514,7 +567,7 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
 		}
 		/* If I don't overlap any segments I have found my hole! */
 		if (i == image->nr_segments) {
-			pages = pfn_to_page(hole_start >> PAGE_SHIFT);
+			pages = (*kexec_ops.pfn_to_page)(hole_start >> PAGE_SHIFT);
 			break;
 		}
 	}
@@ -531,12 +584,13 @@ struct page *kimage_alloc_control_pages(struct kimage *image,
 	struct page *pages = NULL;
 
 	switch (image->type) {
+	case KEXEC_TYPE_CRASH:
+		if (!kexec_ops.crash_alloc_temp_store) {
+			pages = kimage_alloc_crash_control_pages(image, order);
+			break;
+		}
 	case KEXEC_TYPE_DEFAULT:
 		pages = kimage_alloc_normal_control_pages(image, order);
-		break;
-	case KEXEC_TYPE_CRASH:
-		pages = kimage_alloc_crash_control_pages(image, order);
-		break;
 	}
 
 	return pages;
@@ -556,7 +610,7 @@ static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
 			return -ENOMEM;
 
 		ind_page = page_address(page);
-		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		*image->entry = (*kexec_ops.virt_to_phys)(ind_page) | IND_INDIRECTION;
 		image->entry = ind_page;
 		image->last_entry = ind_page +
 				      ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
@@ -615,14 +669,14 @@ static void kimage_terminate(struct kimage *image)
 #define for_each_kimage_entry(image, ptr, entry) \
 	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
 		ptr = (entry & IND_INDIRECTION)? \
-			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+			(*kexec_ops.phys_to_virt)((entry & PAGE_MASK)): ptr +1)
 
 static void kimage_free_entry(kimage_entry_t entry)
 {
 	struct page *page;
 
-	page = pfn_to_page(entry >> PAGE_SHIFT);
-	kimage_free_pages(page);
+	page = (*kexec_ops.pfn_to_page)(entry >> PAGE_SHIFT);
+	(*kexec_ops.kimage_free_pages)(page);
 }
 
 static void kimage_free(struct kimage *image)
@@ -652,7 +706,7 @@ static void kimage_free(struct kimage *image)
 		kimage_free_entry(ind);
 
 	/* Handle any machine specific cleanup */
-	machine_kexec_cleanup(image);
+	(*kexec_ops.machine_kexec_cleanup)(image);
 
 	/* Free the kexec control pages... */
 	kimage_free_page_list(&image->control_pages);
@@ -708,7 +762,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
 	 * have a match.
 	 */
 	list_for_each_entry(page, &image->dest_pages, lru) {
-		addr = page_to_pfn(page) << PAGE_SHIFT;
+		addr = (*kexec_ops.page_to_pfn)(page) << PAGE_SHIFT;
 		if (addr == destination) {
 			list_del(&page->lru);
 			return page;
@@ -719,16 +773,17 @@ static struct page *kimage_alloc_page(struct kimage *image,
 		kimage_entry_t *old;
 
 		/* Allocate a page, if we run out of memory give up */
-		page = kimage_alloc_pages(gfp_mask, 0);
+		page = (*kexec_ops.kimage_alloc_pages)(gfp_mask, 0,
+							KEXEC_SOURCE_MEMORY_LIMIT);
 		if (!page)
 			return NULL;
 		/* If the page cannot be used file it away */
-		if (page_to_pfn(page) >
+		if ((*kexec_ops.page_to_pfn)(page) >
 				(KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
 			list_add(&page->lru, &image->unuseable_pages);
 			continue;
 		}
-		addr = page_to_pfn(page) << PAGE_SHIFT;
+		addr = (*kexec_ops.page_to_pfn)(page) << PAGE_SHIFT;
 
 		/* If it is the destination page we want use it */
 		if (addr == destination)
@@ -751,7 +806,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
 			struct page *old_page;
 
 			old_addr = *old & PAGE_MASK;
-			old_page = pfn_to_page(old_addr >> PAGE_SHIFT);
+			old_page = (*kexec_ops.pfn_to_page)(old_addr >> PAGE_SHIFT);
 			copy_highpage(page, old_page);
 			*old = addr | (*old & ~PAGE_MASK);
 
@@ -761,7 +816,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
 			 */
 			if (!(gfp_mask & __GFP_HIGHMEM) &&
 			    PageHighMem(old_page)) {
-				kimage_free_pages(old_page);
+				(*kexec_ops.kimage_free_pages)(old_page);
 				continue;
 			}
 			addr = old_addr;
@@ -807,7 +862,7 @@ static int kimage_load_normal_segment(struct kimage *image,
 			result  = -ENOMEM;
 			goto out;
 		}
-		result = kimage_add_page(image, page_to_pfn(page)
+		result = kimage_add_page(image, (*kexec_ops.page_to_pfn)(page)
 								<< PAGE_SHIFT);
 		if (result < 0)
 			goto out;
@@ -861,7 +916,7 @@ static int kimage_load_crash_segment(struct kimage *image,
 		char *ptr;
 		size_t uchunk, mchunk;
 
-		page = pfn_to_page(maddr >> PAGE_SHIFT);
+		page = (*kexec_ops.pfn_to_page)(maddr >> PAGE_SHIFT);
 		if (!page) {
 			result  = -ENOMEM;
 			goto out;
@@ -900,12 +955,13 @@ static int kimage_load_segment(struct kimage *image,
 	int result = -ENOMEM;
 
 	switch (image->type) {
+	case KEXEC_TYPE_CRASH:
+		if (!kexec_ops.crash_alloc_temp_store) {
+			result = kimage_load_crash_segment(image, segment);
+			break;
+		}
 	case KEXEC_TYPE_DEFAULT:
 		result = kimage_load_normal_segment(image, segment);
-		break;
-	case KEXEC_TYPE_CRASH:
-		result = kimage_load_crash_segment(image, segment);
-		break;
 	}
 
 	return result;
@@ -993,6 +1049,7 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
 			/* Free any current crash dump kernel before
 			 * we corrupt it.
 			 */
+			(*kexec_ops.machine_kexec_unload)(image);
 			kimage_free(xchg(&kexec_crash_image, NULL));
 			result = kimage_crash_alloc(&image, entry,
 						     nr_segments, segments);
@@ -1003,7 +1060,7 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
 
 		if (flags & KEXEC_PRESERVE_CONTEXT)
 			image->preserve_context = 1;
-		result = machine_kexec_prepare(image);
+		result = (*kexec_ops.machine_kexec_prepare)(image);
 		if (result)
 			goto out;
 
@@ -1016,11 +1073,21 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
 		if (flags & KEXEC_ON_CRASH)
 			crash_unmap_reserved_pages();
 	}
+
+	result = (*kexec_ops.machine_kexec_load)(image);
+
+	if (result)
+		goto out;
+
 	/* Install the new kernel, and  Uninstall the old */
 	image = xchg(dest_image, image);
 
 out:
 	mutex_unlock(&kexec_mutex);
+
+	if (kexec_ops.machine_kexec_unload)
+		(*kexec_ops.machine_kexec_unload)(image);
+
 	kimage_free(image);
 
 	return result;
@@ -1094,7 +1161,7 @@ void crash_kexec(struct pt_regs *regs)
 			crash_setup_regs(&fixed_regs, regs);
 			crash_save_vmcoreinfo();
 			machine_crash_shutdown(&fixed_regs);
-			machine_kexec(kexec_crash_image);
+			(*kexec_ops.machine_kexec)(kexec_crash_image);
 		}
 		mutex_unlock(&kexec_mutex);
 	}
@@ -1116,8 +1183,8 @@ void __weak crash_free_reserved_phys_range(unsigned long begin,
 	unsigned long addr;
 
 	for (addr = begin; addr < end; addr += PAGE_SIZE) {
-		ClearPageReserved(pfn_to_page(addr >> PAGE_SHIFT));
-		init_page_count(pfn_to_page(addr >> PAGE_SHIFT));
+		ClearPageReserved((*kexec_ops.pfn_to_page)(addr >> PAGE_SHIFT));
+		init_page_count((*kexec_ops.pfn_to_page)(addr >> PAGE_SHIFT));
 		free_page((unsigned long)__va(addr));
 		totalram_pages++;
 	}
@@ -1571,10 +1638,10 @@ int kernel_kexec(void)
 	{
 		kernel_restart_prepare(NULL);
 		printk(KERN_EMERG "Starting new kernel\n");
-		machine_shutdown();
+		(*kexec_ops.machine_kexec_shutdown)();
 	}
 
-	machine_kexec(kexec_image);
+	(*kexec_ops.machine_kexec)(kexec_image);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (kexec_image->preserve_context) {
-- 
1.5.6.5
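The dispatch pattern introduced by the patch above (a global table of function pointers, initialised with generic defaults, which a platform may partially override before the generic code calls through it) can be sketched in user space as follows. This is a simplified model, not kernel code: `struct demo_kexec_ops`, `DEMO_PAGE_OFFSET`, `platform_kexec_load()` and `demo_platform_init()` are hypothetical names, only two hooks are modeled, and `-1` stands in for a negative errno.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_OFFSET 0xc0000000UL	/* hypothetical direct-map base */

/* Cut-down model of struct kexec_ops: one flag and two hooks. */
struct demo_kexec_ops {
	bool crash_alloc_temp_store;
	unsigned long (*virt_to_phys)(unsigned long address);
	int (*machine_kexec_load)(void *image);
};

/* Generic defaults, in the spirit of generic_virt_to_phys()/generic_kexec_load(). */
static unsigned long generic_virt_to_phys(unsigned long address)
{
	return address - DEMO_PAGE_OFFSET;
}

static int generic_kexec_load(void *image)
{
	(void)image;
	return 0;	/* nothing extra to do on bare metal */
}

/* The global table starts out pointing at the generic defaults. */
static struct demo_kexec_ops demo_ops = {
	.crash_alloc_temp_store = false,
	.virt_to_phys = generic_virt_to_phys,
	.machine_kexec_load = generic_kexec_load,
};

/* Hypothetical platform hook, e.g. one that hands the image to a hypervisor. */
static int platform_loads;

static int platform_kexec_load(void *image)
{
	if (image == NULL)
		return -1;	/* stands in for a negative errno */
	platform_loads++;
	return 0;
}

/* Platform init overrides only what it needs; other hooks keep their defaults. */
static void demo_platform_init(void)
{
	demo_ops.crash_alloc_temp_store = true;
	demo_ops.machine_kexec_load = platform_kexec_load;
}

/* Generic code never calls a hook directly, only through the table. */
static int demo_load(void *image)
{
	return (*demo_ops.machine_kexec_load)(image);
}
```

Because the override touches only the members it cares about, every other hook keeps pointing at its generic default, which is why the default behavior stays unchanged when no platform installs overrides.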

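The reordered `switch` statements in `kimage_alloc_control_pages()` and `kimage_load_segment()` rely on an intentional fall-through from `KEXEC_TYPE_CRASH` into `KEXEC_TYPE_DEFAULT` when `crash_alloc_temp_store` is set. A minimal model of that decision might look like this; the enum names are hypothetical, and unlike the patch the fall-through is annotated explicitly.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_kexec_type { DEMO_TYPE_DEFAULT, DEMO_TYPE_CRASH };
enum demo_path { DEMO_PATH_NONE, DEMO_PATH_NORMAL, DEMO_PATH_CRASH };

/*
 * Crash images normally use the reserved crash area; with
 * crash_alloc_temp_store set they take the normal path instead and
 * rely on relocation code to move the data into place later.
 */
static enum demo_path choose_path(enum demo_kexec_type type,
				  bool crash_alloc_temp_store)
{
	enum demo_path path = DEMO_PATH_NONE;

	switch (type) {
	case DEMO_TYPE_CRASH:
		if (!crash_alloc_temp_store) {
			path = DEMO_PATH_CRASH;
			break;
		}
		/* fall through */
	case DEMO_TYPE_DEFAULT:
		path = DEMO_PATH_NORMAL;
		break;
	}
	return path;
}
```

Annotating the fall-through (or, in later kernels, using the `fallthrough;` pseudo-keyword) makes the intent visible both to readers and to `-Wimplicit-fallthrough`.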

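The patch also adds a `limit` argument to the `kimage_alloc_pages()` hook, while `kimage_alloc_page()` already files pages whose PFN exceeds `KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT` on `image->unuseable_pages`. The sketch below models that retry-and-file-away loop against a fake allocator, so the comparison can be checked with a 4 GiB limit like the one mentioned in the v2 notes of patch 06. `struct demo_allocator`, `alloc_pfn_below()` and the PFN values are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT	12

/* Hypothetical stand-in for the page allocator: hands out fixed PFNs in order. */
struct demo_allocator {
	const unsigned long *pfns;
	size_t nr;
	size_t next;
};

static long demo_alloc_pfn(struct demo_allocator *a)
{
	if (a->next >= a->nr)
		return -1;	/* out of memory */
	return (long)a->pfns[a->next++];
}

/*
 * Allocate until a page at or below `limit` (a byte address) turns up.
 * Rejected pages are recorded on an "unusable" list (in the kernel this
 * keeps them allocated so they are not handed out again). The test
 * mirrors page_to_pfn(page) > (limit >> PAGE_SHIFT).
 */
static long alloc_pfn_below(struct demo_allocator *a, unsigned long limit,
			    unsigned long *unusable, size_t cap,
			    size_t *nr_unusable)
{
	long pfn;

	while ((pfn = demo_alloc_pfn(a)) >= 0) {
		if ((unsigned long)pfn > (limit >> DEMO_PAGE_SHIFT)) {
			if (*nr_unusable < cap)
				unusable[(*nr_unusable)++] = (unsigned long)pfn;
			continue;
		}
		return pfn;
	}
	return -1;
}
```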

* [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
  2012-11-20 15:04 ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Daniel Kiper
@ 2012-11-20 15:04   ` Daniel Kiper
  2012-11-20 15:04     ` [PATCH v2 03/11] xen: Introduce architecture independent data for kexec/kdump Daniel Kiper
  2012-11-20 15:52     ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Jan Beulich
  2012-11-20 16:40   ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Eric W. Biederman
  1 sibling, 2 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Some implementations (e.g. Xen PVOPS) cannot use part of the identity page table
to construct the transition page table. This means that they require separate
PUDs, PMDs and PTEs for the virtual and physical (identity) mappings. To satisfy
that requirement, add a PGD pointer and a second pointer for each of the PUD, PMD
and PTE, and adjust the existing code accordingly.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/include/asm/kexec.h       |   10 +++++++---
 arch/x86/kernel/machine_kexec_64.c |   12 ++++++------
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 317ff17..3cf5600 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -157,9 +157,13 @@ struct kimage_arch {
 };
 #else
 struct kimage_arch {
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	pgd_t *pgd;
+	pud_t *pud0;
+	pud_t *pud1;
+	pmd_t *pmd0;
+	pmd_t *pmd1;
+	pte_t *pte0;
+	pte_t *pte1;
 };
 #endif
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b3ea9db..976e54b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -137,9 +137,9 @@ out:
 
 static void free_transition_pgtable(struct kimage *image)
 {
-	free_page((unsigned long)image->arch.pud);
-	free_page((unsigned long)image->arch.pmd);
-	free_page((unsigned long)image->arch.pte);
+	free_page((unsigned long)image->arch.pud0);
+	free_page((unsigned long)image->arch.pmd0);
+	free_page((unsigned long)image->arch.pte0);
 }
 
 static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pud)
 			goto err;
-		image->arch.pud = pud;
+		image->arch.pud0 = pud;
 		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
 	}
 	pud = pud_offset(pgd, vaddr);
@@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pmd)
 			goto err;
-		image->arch.pmd = pmd;
+		image->arch.pmd0 = pmd;
 		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
 	}
 	pmd = pmd_offset(pud, vaddr);
@@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pte)
 			goto err;
-		image->arch.pte = pte;
+		image->arch.pte0 = pte;
 		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-- 
1.5.6.5
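One way to see why the virtual and identity mappings may need separate PUD, PMD and PTE pages is to compute the per-level indices of both addresses. The sketch below assumes x86-64 4-level paging with 4 KiB pages and 9 index bits per level; the example addresses (a kernel-text virtual address and a 16 MiB physical address) are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 4-level paging with 4 KiB pages: 9 index bits per level. */
#define PT_INDEX_MASK	0x1ffULL
#define PTE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define PGD_SHIFT	39

struct pt_indices {
	unsigned int pgd, pud, pmd, pte;
};

/* Split an address into the table index used at each paging level. */
static struct pt_indices pt_indices_of(uint64_t addr)
{
	struct pt_indices idx;

	idx.pgd = (unsigned int)((addr >> PGD_SHIFT) & PT_INDEX_MASK);
	idx.pud = (unsigned int)((addr >> PUD_SHIFT) & PT_INDEX_MASK);
	idx.pmd = (unsigned int)((addr >> PMD_SHIFT) & PT_INDEX_MASK);
	idx.pte = (unsigned int)((addr >> PTE_SHIFT) & PT_INDEX_MASK);
	return idx;
}

/*
 * Number of PUD, PMD and PTE pages needed to map two addresses from one
 * PGD when no pre-existing table can be reused. A lower-level page can be
 * shared only while every higher-level index is the same.
 */
static unsigned int lower_tables_needed(uint64_t a, uint64_t b)
{
	struct pt_indices x = pt_indices_of(a);
	struct pt_indices y = pt_indices_of(b);

	if (x.pgd != y.pgd)
		return 6;	/* separate PUD, PMD and PTE pages for each */
	if (x.pud != y.pud)
		return 5;	/* one shared PUD, two PMDs, two PTE pages */
	if (x.pmd != y.pmd)
		return 4;	/* shared PUD and PMD, two PTE pages */
	return 3;		/* one PUD, one PMD, one PTE page */
}
```

For `0xffffffff81000000` and `0x1000000` the PGD indices already differ (511 vs. 0), so no lower-level page can be shared; this is consistent with the `pud0`/`pud1`, `pmd0`/`pmd1` and `pte0`/`pte1` pairs added to `struct kimage_arch`.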



* [PATCH v2 03/11] xen: Introduce architecture independent data for kexec/kdump
  2012-11-20 15:04   ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Daniel Kiper
@ 2012-11-20 15:04     ` Daniel Kiper
  2012-11-20 15:04       ` [PATCH v2 04/11] x86/xen: Introduce architecture dependent " Daniel Kiper
  2012-11-20 15:52     ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Introduce architecture independent constants and structures
required by Xen kexec/kdump implementation.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 include/xen/interface/xen.h |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 886a5d8..09c16ab 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -57,6 +57,7 @@
 #define __HYPERVISOR_event_channel_op     32
 #define __HYPERVISOR_physdev_op           33
 #define __HYPERVISOR_hvm_op               34
+#define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
 
 /* Architecture-specific hypercall definitions. */
@@ -231,7 +232,39 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
 #define VMASST_TYPE_pae_extended_cr3     3
 #define MAX_VMASST_TYPE 3
 
+/*
+ * Commands to HYPERVISOR_kexec_op().
+ */
+#define KEXEC_CMD_kexec			0
+#define KEXEC_CMD_kexec_load		1
+#define KEXEC_CMD_kexec_unload		2
+#define KEXEC_CMD_kexec_get_range	3
+
+/*
+ * Memory ranges for kdump (utilized by HYPERVISOR_kexec_op()).
+ */
+#define KEXEC_RANGE_MA_CRASH		0
+#define KEXEC_RANGE_MA_XEN		1
+#define KEXEC_RANGE_MA_CPU		2
+#define KEXEC_RANGE_MA_XENHEAP		3
+#define KEXEC_RANGE_MA_BOOT_PARAM	4
+#define KEXEC_RANGE_MA_EFI_MEMMAP	5
+#define KEXEC_RANGE_MA_VMCOREINFO	6
+
 #ifndef __ASSEMBLY__
+struct xen_kexec_exec {
+	int type;
+};
+
+struct xen_kexec_range {
+	int range;
+	int nr;
+	unsigned long size;
+	unsigned long start;
+};
+
+extern unsigned long xen_vmcoreinfo_maddr;
+extern unsigned long xen_vmcoreinfo_max_size;
 
 typedef uint16_t domid_t;
 
-- 
1.5.6.5



* [PATCH v2 04/11] x86/xen: Introduce architecture dependent data for kexec/kdump
  2012-11-20 15:04     ` [PATCH v2 03/11] xen: Introduce architecture independent data for kexec/kdump Daniel Kiper
@ 2012-11-20 15:04       ` Daniel Kiper
  2012-11-20 15:04         ` [PATCH v2 05/11] x86/xen: Register resources required by kexec-tools Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Introduce architecture dependent constants, structures and
functions required by Xen kexec/kdump implementation.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/include/asm/xen/hypercall.h |    6 +++
 arch/x86/include/asm/xen/kexec.h     |   83 ++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/xen/kexec.h

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index c20d1ce..e76a1b8 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -459,6 +459,12 @@ HYPERVISOR_hvm_op(int op, void *arg)
 }
 
 static inline int
+HYPERVISOR_kexec_op(unsigned long op, void *args)
+{
+	return _hypercall2(int, kexec_op, op, args);
+}
+
+static inline int
 HYPERVISOR_tmem_op(
 	struct tmem_op *op)
 {
diff --git a/arch/x86/include/asm/xen/kexec.h b/arch/x86/include/asm/xen/kexec.h
new file mode 100644
index 0000000..3349031
--- /dev/null
+++ b/arch/x86/include/asm/xen/kexec.h
@@ -0,0 +1,83 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ASM_X86_XEN_KEXEC_H
+#define _ASM_X86_XEN_KEXEC_H
+
+#include <linux/init.h>
+
+#define KEXEC_XEN_NO_PAGES	17
+
+#define XK_MA_CONTROL_PAGE	0
+#define XK_VA_CONTROL_PAGE	1
+#define XK_MA_PGD_PAGE		2
+#define XK_VA_PGD_PAGE		3
+#define XK_MA_PUD0_PAGE		4
+#define XK_VA_PUD0_PAGE		5
+#define XK_MA_PUD1_PAGE		6
+#define XK_VA_PUD1_PAGE		7
+#define XK_MA_PMD0_PAGE		8
+#define XK_VA_PMD0_PAGE		9
+#define XK_MA_PMD1_PAGE		10
+#define XK_VA_PMD1_PAGE		11
+#define XK_MA_PTE0_PAGE		12
+#define XK_VA_PTE0_PAGE		13
+#define XK_MA_PTE1_PAGE		14
+#define XK_VA_PTE1_PAGE		15
+#define XK_MA_TABLE_PAGE	16
+
+#ifndef __ASSEMBLY__
+struct xen_kexec_image {
+	unsigned long page_list[KEXEC_XEN_NO_PAGES];
+	unsigned long indirection_page;
+	unsigned long start_address;
+};
+
+struct xen_kexec_load {
+	int type;
+	struct xen_kexec_image image;
+};
+
+extern unsigned int xen_kexec_control_code_size;
+
+extern void __init xen_init_kexec_ops(void);
+
+#ifdef CONFIG_X86_32
+extern void xen_relocate_kernel(unsigned long indirection_page,
+				unsigned long *page_list,
+				unsigned long start_address,
+				unsigned int has_pae,
+				unsigned int preserve_context);
+#else
+extern void xen_relocate_kernel(unsigned long indirection_page,
+				unsigned long *page_list,
+				unsigned long start_address,
+				unsigned int preserve_context);
+#endif
+#endif
+#endif /* _ASM_X86_XEN_KEXEC_H */
-- 
1.5.6.5
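The `XK_*` indices in the new header encode the alternating layout discussed in the cover letter: each page's machine address (`XK_MA_*`) is immediately followed by its virtual address (`XK_VA_*`), and the single `XK_MA_TABLE_PAGE` slot closes the 17-entry `page_list`. The checks below restate that invariant at compile time. The constants are copied from the header; `XK_PAIR_OK` and `set_page_pair()` are hypothetical helpers, not part of the patch.

```c
#include <assert.h>

/* Slot layout copied from arch/x86/include/asm/xen/kexec.h. */
#define KEXEC_XEN_NO_PAGES	17
#define XK_MA_CONTROL_PAGE	0
#define XK_VA_CONTROL_PAGE	1
#define XK_MA_PGD_PAGE		2
#define XK_VA_PGD_PAGE		3
#define XK_MA_PUD0_PAGE		4
#define XK_VA_PUD0_PAGE		5
#define XK_MA_PUD1_PAGE		6
#define XK_VA_PUD1_PAGE		7
#define XK_MA_PMD0_PAGE		8
#define XK_VA_PMD0_PAGE		9
#define XK_MA_PMD1_PAGE		10
#define XK_VA_PMD1_PAGE		11
#define XK_MA_PTE0_PAGE		12
#define XK_VA_PTE0_PAGE		13
#define XK_MA_PTE1_PAGE		14
#define XK_VA_PTE1_PAGE		15
#define XK_MA_TABLE_PAGE	16

/* Each virtual-address slot must directly follow its machine-address slot. */
#define XK_PAIR_OK(name) (XK_VA_##name == XK_MA_##name + 1)
_Static_assert(XK_PAIR_OK(CONTROL_PAGE) && XK_PAIR_OK(PGD_PAGE), "MA/VA pairs");
_Static_assert(XK_PAIR_OK(PUD0_PAGE) && XK_PAIR_OK(PUD1_PAGE), "MA/VA pairs");
_Static_assert(XK_PAIR_OK(PMD0_PAGE) && XK_PAIR_OK(PMD1_PAGE), "MA/VA pairs");
_Static_assert(XK_PAIR_OK(PTE0_PAGE) && XK_PAIR_OK(PTE1_PAGE), "MA/VA pairs");
/* The table page has only a machine address and is the last slot. */
_Static_assert(XK_MA_TABLE_PAGE == KEXEC_XEN_NO_PAGES - 1, "table page last");

/* Hypothetical helper: store one page's machine and virtual address. */
static void set_page_pair(unsigned long *page_list, int ma_index,
			  unsigned long maddr, unsigned long vaddr)
{
	/* ma_index must be an even XK_MA_* slot that has a VA partner. */
	assert(ma_index % 2 == 0 && ma_index + 1 < KEXEC_XEN_NO_PAGES);
	page_list[ma_index] = maddr;
	page_list[ma_index + 1] = vaddr;
}
```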



* [PATCH v2 05/11] x86/xen: Register resources required by kexec-tools
  2012-11-20 15:04       ` [PATCH v2 04/11] x86/xen: Introduce architecture dependent " Daniel Kiper
@ 2012-11-20 15:04         ` Daniel Kiper
  2012-11-20 15:04           ` [PATCH v2 06/11] x86/xen: Add i386 kexec/kdump implementation Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Register resources required by kexec-tools.

v2 - suggestions/fixes:
   - change logging level
     (suggested by Konrad Rzeszutek Wilk).

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/xen/kexec.c |  150 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 150 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/kexec.c

diff --git a/arch/x86/xen/kexec.c b/arch/x86/xen/kexec.c
new file mode 100644
index 0000000..7ec4c45
--- /dev/null
+++ b/arch/x86/xen/kexec.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include <xen/interface/platform.h>
+#include <xen/interface/xen.h>
+#include <xen/xen.h>
+
+#include <asm/xen/hypercall.h>
+
+unsigned long xen_vmcoreinfo_maddr = 0;
+unsigned long xen_vmcoreinfo_max_size = 0;
+
+static int __init xen_init_kexec_resources(void)
+{
+	int rc;
+	static struct resource xen_hypervisor_res = {
+		.name = "Hypervisor code and data",
+		.flags = IORESOURCE_BUSY | IORESOURCE_MEM
+	};
+	struct resource *cpu_res;
+	struct xen_kexec_range xkr;
+	struct xen_platform_op cpuinfo_op;
+	uint32_t cpus, i;
+
+	if (!xen_initial_domain())
+		return 0;
+
+	if (strstr(boot_command_line, "crashkernel="))
+		pr_warn("kexec: Ignoring crashkernel option. "
+			"It should be passed to Xen hypervisor.\n");
+
+	/* Register Crash kernel resource. */
+	xkr.range = KEXEC_RANGE_MA_CRASH;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_CRASH)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	if (!xkr.size)
+		return 0;
+
+	crashk_res.start = xkr.start;
+	crashk_res.end = xkr.start + xkr.size - 1;
+	insert_resource(&iomem_resource, &crashk_res);
+
+	/* Register Hypervisor code and data resource. */
+	xkr.range = KEXEC_RANGE_MA_XEN;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_XEN)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	xen_hypervisor_res.start = xkr.start;
+	xen_hypervisor_res.end = xkr.start + xkr.size - 1;
+	insert_resource(&iomem_resource, &xen_hypervisor_res);
+
+	/* Determine maximum number of physical CPUs. */
+	cpuinfo_op.cmd = XENPF_get_cpuinfo;
+	cpuinfo_op.u.pcpu_info.xen_cpuid = 0;
+	rc = HYPERVISOR_dom0_op(&cpuinfo_op);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_dom0_op(): %i\n", __func__, rc);
+		return rc;
+	}
+
+	cpus = cpuinfo_op.u.pcpu_info.max_present + 1;
+
+	/* Register CPUs Crash note resources. */
+	cpu_res = kcalloc(cpus, sizeof(struct resource), GFP_KERNEL);
+
+	if (!cpu_res) {
+		pr_warn("kexec: %s: kcalloc(): %i\n", __func__, -ENOMEM);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < cpus; ++i) {
+		xkr.range = KEXEC_RANGE_MA_CPU;
+		xkr.nr = i;
+		rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+		if (rc) {
+			pr_warn("kexec: %s: cpu: %u: HYPERVISOR_kexec_op"
+				"(KEXEC_RANGE_MA_CPU): %i\n", __func__, i, rc);
+			continue;
+		}
+
+		cpu_res->name = "Crash note";
+		cpu_res->start = xkr.start;
+		cpu_res->end = xkr.start + xkr.size - 1;
+		cpu_res->flags = IORESOURCE_BUSY | IORESOURCE_MEM;
+		insert_resource(&iomem_resource, cpu_res++);
+	}
+
+	/* Get vmcoreinfo address and maximum allowed size. */
+	xkr.range = KEXEC_RANGE_MA_VMCOREINFO;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_get_range, &xkr);
+
+	if (rc) {
+		pr_warn("kexec: %s: HYPERVISOR_kexec_op(KEXEC_RANGE_MA_VMCOREINFO)"
+			": %i\n", __func__, rc);
+		return rc;
+	}
+
+	xen_vmcoreinfo_maddr = xkr.start;
+	xen_vmcoreinfo_max_size = xkr.size;
+
+	return 0;
+}
+
+core_initcall(xen_init_kexec_resources);
-- 
1.5.6.5
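`xen_init_kexec_resources()` turns each hypervisor-reported `(start, size)` pair into an inclusive `[start, end]` resource. In the per-CPU loop it skips CPUs whose lookup fails and advances the `cpu_res` cursor only on success. A user-space model of those two steps follows; the `demo_*` types are local stand-ins for `struct xen_kexec_range` and `struct resource`, and the return codes are made up for the example. The patch checks `xkr.size` for the crash range before converting it, because a zero size would yield `end < start`.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for struct xen_kexec_range and struct resource. */
struct demo_kexec_range {
	int range;
	int nr;
	unsigned long size;
	unsigned long start;
};

struct demo_resource {
	const char *name;
	unsigned long start;
	unsigned long end;	/* inclusive, like struct resource */
};

/* Convert a [start, start + size) range into an inclusive resource. */
static void range_to_resource(const struct demo_kexec_range *xkr,
			      const char *name, struct demo_resource *res)
{
	res->name = name;
	res->start = xkr->start;
	res->end = xkr->start + xkr->size - 1;
}

/*
 * Model of the per-CPU "Crash note" loop: rc[i] is the (made-up) result of
 * the lookup for CPU i. Failed CPUs are skipped and the output cursor moves
 * only on success, so registered entries stay contiguous in out[].
 */
static size_t collect_cpu_notes(const struct demo_kexec_range *ranges,
				const int *rc, size_t cpus,
				struct demo_resource *out)
{
	size_t i, n = 0;

	for (i = 0; i < cpus; i++) {
		if (rc[i])
			continue;
		range_to_resource(&ranges[i], "Crash note", &out[n++]);
	}
	return n;
}
```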



* [PATCH v2 06/11] x86/xen: Add i386 kexec/kdump implementation
  2012-11-20 15:04         ` [PATCH v2 05/11] x86/xen: Register resources required by kexec-tools Daniel Kiper
@ 2012-11-20 15:04           ` Daniel Kiper
  2012-11-20 15:04             ` [PATCH v2 07/11] x86/xen: Add x86_64 " Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Add i386 kexec/kdump implementation.

v2 - suggestions/fixes:
   - allocate transition page table pages below 4 GiB
     (suggested by Jan Beulich).

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
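
kimage_alloc_pages() in this patch converts its byte limit into an address width for xen_create_contiguous_region() using ilog2(limit). Since ilog2() rounds down, the resulting constraint (machine addresses that fit in address_bits bits) can be stricter than the requested limit but never looser. For example, a limit of 0xC0000000 yields 31 bits, so the pages end up below 2 GiB. The sketch below is portable C. floor_log2_ul() is a hypothetical stand-in for the kernel's ilog2(), and sizeof(unsigned long) * CHAR_BIT stands in for BITS_PER_LONG.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(x)), x > 0. */
static unsigned int floor_log2_ul(unsigned long x)
{
	unsigned int bits = 0;

	assert(x != 0);
	while (x >>= 1)
		++bits;
	return bits;
}

/* Mirrors the address_bits expression in kimage_alloc_pages(). */
static unsigned int limit_to_address_bits(unsigned long limit)
{
	if (limit == ULONG_MAX)
		return sizeof(unsigned long) * CHAR_BIT;	/* BITS_PER_LONG */
	return floor_log2_ul(limit);
}
```

Rounding down is the safe choice here, because 1 << address_bits never exceeds the limit.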
 arch/x86/xen/machine_kexec_32.c   |  247 ++++++++++++++++++++++++++++
 arch/x86/xen/relocate_kernel_32.S |  323 +++++++++++++++++++++++++++++++++++++
 2 files changed, 570 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/machine_kexec_32.c
 create mode 100644 arch/x86/xen/relocate_kernel_32.S
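
relocate_kernel_32.S in this patch installs each mapping the same way at every level. It computes an index as (address >> SHIFT) & (PTRS_PER_x - 1) and stores an 8-byte entry at table + index * 8. This matches the i386 PAE layout: 4 PGD entries of 1 GiB each, 512 PMD entries of 2 MiB and 512 PTEs of 4 KiB. The sketch below reproduces that arithmetic in C. As an assumption for illustration, the PAE constants are written out by hand; the kernel takes them from its headers.

```c
#include <assert.h>
#include <stdint.h>

/* i386 PAE paging constants, hard-coded here for illustration only. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PMD_SHIFT	21
#define SKETCH_PGDIR_SHIFT	30
#define SKETCH_PTRS_PER_PTE	512u
#define SKETCH_PTRS_PER_PMD	512u
#define SKETCH_PTRS_PER_PGD	4u

struct pae_indices {
	unsigned int pgd;
	unsigned int pmd;
	unsigned int pte;
};

/* Same shift-and-mask sequence as the shrl/andl pairs in the assembly. */
static struct pae_indices pae_split(uint32_t addr)
{
	struct pae_indices idx;

	idx.pgd = (addr >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS_PER_PGD - 1);
	idx.pmd = (addr >> SKETCH_PMD_SHIFT) & (SKETCH_PTRS_PER_PMD - 1);
	idx.pte = (addr >> SKETCH_PAGE_SHIFT) & (SKETCH_PTRS_PER_PTE - 1);
	return idx;
}

/* Each PAE entry is 8 bytes, hence the (%ebx, %ecx, 8) addressing. */
static uint32_t pae_entry_offset(unsigned int index)
{
	assert(index < SKETCH_PTRS_PER_PTE);
	return index * 8u;
}
```
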

diff --git a/arch/x86/xen/machine_kexec_32.c b/arch/x86/xen/machine_kexec_32.c
new file mode 100644
index 0000000..116c302
--- /dev/null
+++ b/arch/x86/xen/machine_kexec_32.c
@@ -0,0 +1,247 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+
+#include <xen/xen.h>
+#include <xen/xen-ops.h>
+
+#include <asm/xen/hypercall.h>
+#include <asm/xen/kexec.h>
+#include <asm/xen/page.h>
+
+#define __ma(vaddr)	(virt_to_machine(vaddr).maddr)
+
+static struct page *kimage_alloc_pages(gfp_t gfp_mask,
+					unsigned int order,
+					unsigned long limit)
+{
+	struct page *pages;
+	unsigned int address_bits, i;
+
+	pages = alloc_pages(gfp_mask, order);
+
+	if (!pages)
+		return NULL;
+
+	address_bits = (limit == ULONG_MAX) ? BITS_PER_LONG : ilog2(limit);
+
+	/* Relocate set of pages below given limit. */
+	if (xen_create_contiguous_region((unsigned long)page_address(pages),
+							order, address_bits)) {
+		__free_pages(pages, order);
+		return NULL;
+	}
+
+	BUG_ON(PagePrivate(pages));
+
+	pages->mapping = NULL;
+	set_page_private(pages, order);
+
+	for (i = 0; i < (1 << order); ++i)
+		SetPageReserved(pages + i);
+
+	return pages;
+}
+
+static void kimage_free_pages(struct page *page)
+{
+	unsigned int i, order;
+
+	order = page_private(page);
+
+	for (i = 0; i < (1 << order); ++i)
+		ClearPageReserved(page + i);
+
+	xen_destroy_contiguous_region((unsigned long)page_address(page), order);
+	__free_pages(page, order);
+}
+
+static unsigned long xen_page_to_mfn(struct page *page)
+{
+	return pfn_to_mfn(page_to_pfn(page));
+}
+
+static struct page *xen_mfn_to_page(unsigned long mfn)
+{
+	return pfn_to_page(mfn_to_pfn(mfn));
+}
+
+static unsigned long xen_virt_to_machine(volatile void *address)
+{
+	return virt_to_machine(address).maddr;
+}
+
+static void *xen_machine_to_virt(unsigned long address)
+{
+	return phys_to_virt(machine_to_phys(XMADDR(address)).paddr);
+}
+
+static void *alloc_pgtable_page(struct kimage *image)
+{
+	struct page *page;
+
+	page = kimage_alloc_control_pages(image, 0);
+
+	if (!page || !page_address(page))
+		return NULL;
+
+	memset(page_address(page), 0, PAGE_SIZE);
+
+	return page_address(page);
+}
+
+static int alloc_transition_pgtable(struct kimage *image)
+{
+	image->arch.pgd = alloc_pgtable_page(image);
+
+	if (!image->arch.pgd)
+		return -ENOMEM;
+
+	image->arch.pmd0 = alloc_pgtable_page(image);
+
+	if (!image->arch.pmd0)
+		return -ENOMEM;
+
+	image->arch.pmd1 = alloc_pgtable_page(image);
+
+	if (!image->arch.pmd1)
+		return -ENOMEM;
+
+	image->arch.pte0 = alloc_pgtable_page(image);
+
+	if (!image->arch.pte0)
+		return -ENOMEM;
+
+	image->arch.pte1 = alloc_pgtable_page(image);
+
+	if (!image->arch.pte1)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int machine_xen_kexec_prepare(struct kimage *image)
+{
+#ifdef CONFIG_KEXEC_JUMP
+	if (image->preserve_context) {
+		pr_info_once("kexec: Context preservation is not "
+				"supported in Xen domains.\n");
+		return -ENOSYS;
+	}
+#endif
+
+	return alloc_transition_pgtable(image);
+}
+
+static int machine_xen_kexec_load(struct kimage *image)
+{
+	void *control_page;
+	struct xen_kexec_load xkl = {};
+
+	/* Image is unloaded, nothing to do. */
+	if (!image)
+		return 0;
+
+	control_page = page_address(image->control_code_page);
+	memcpy(control_page, xen_relocate_kernel, xen_kexec_control_code_size);
+
+	xkl.type = image->type;
+	xkl.image.page_list[XK_MA_CONTROL_PAGE] = __ma(control_page);
+	xkl.image.page_list[XK_MA_TABLE_PAGE] = 0; /* Unused. */
+	xkl.image.page_list[XK_MA_PGD_PAGE] = __ma(image->arch.pgd);
+	xkl.image.page_list[XK_MA_PUD0_PAGE] = 0; /* Unused. */
+	xkl.image.page_list[XK_MA_PUD1_PAGE] = 0; /* Unused. */
+	xkl.image.page_list[XK_MA_PMD0_PAGE] = __ma(image->arch.pmd0);
+	xkl.image.page_list[XK_MA_PMD1_PAGE] = __ma(image->arch.pmd1);
+	xkl.image.page_list[XK_MA_PTE0_PAGE] = __ma(image->arch.pte0);
+	xkl.image.page_list[XK_MA_PTE1_PAGE] = __ma(image->arch.pte1);
+	xkl.image.indirection_page = image->head;
+	xkl.image.start_address = image->start;
+
+	return HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load, &xkl);
+}
+
+static void machine_xen_kexec_cleanup(struct kimage *image)
+{
+}
+
+static void machine_xen_kexec_unload(struct kimage *image)
+{
+	int rc;
+	struct xen_kexec_load xkl = {};
+
+	if (!image)
+		return;
+
+	xkl.type = image->type;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_unload, &xkl);
+
+	WARN(rc, "kexec: %s: HYPERVISOR_kexec_op(): %i\n", __func__, rc);
+}
+
+static void machine_xen_kexec_shutdown(void)
+{
+}
+
+static void machine_xen_kexec(struct kimage *image)
+{
+	int rc;
+	struct xen_kexec_exec xke = {};
+
+	xke.type = image->type;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec, &xke);
+
+	pr_emerg("kexec: %s: HYPERVISOR_kexec_op(): %i\n", __func__, rc);
+	BUG();
+}
+
+void __init xen_init_kexec_ops(void)
+{
+	if (!xen_initial_domain())
+		return;
+
+	kexec_ops.crash_alloc_temp_store = true;
+	kexec_ops.kimage_alloc_pages = kimage_alloc_pages;
+	kexec_ops.kimage_free_pages = kimage_free_pages;
+	kexec_ops.page_to_pfn = xen_page_to_mfn;
+	kexec_ops.pfn_to_page = xen_mfn_to_page;
+	kexec_ops.virt_to_phys = xen_virt_to_machine;
+	kexec_ops.phys_to_virt = xen_machine_to_virt;
+	kexec_ops.machine_kexec_prepare = machine_xen_kexec_prepare;
+	kexec_ops.machine_kexec_load = machine_xen_kexec_load;
+	kexec_ops.machine_kexec_cleanup = machine_xen_kexec_cleanup;
+	kexec_ops.machine_kexec_unload = machine_xen_kexec_unload;
+	kexec_ops.machine_kexec_shutdown = machine_xen_kexec_shutdown;
+	kexec_ops.machine_kexec = machine_xen_kexec;
+}
diff --git a/arch/x86/xen/relocate_kernel_32.S b/arch/x86/xen/relocate_kernel_32.S
new file mode 100644
index 0000000..0e81830
--- /dev/null
+++ b/arch/x86/xen/relocate_kernel_32.S
@@ -0,0 +1,323 @@
+/*
+ * Copyright (c) 2002-2005 Eric Biederman <ebiederm@xmission.com>
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <asm/cache.h>
+#include <asm/page_types.h>
+#include <asm/pgtable_types.h>
+#include <asm/processor-flags.h>
+
+#include <asm/xen/kexec.h>
+
+#define ARG_INDIRECTION_PAGE	0x4
+#define ARG_PAGE_LIST		0x8
+#define ARG_START_ADDRESS	0xc
+
+#define PTR(x)	(x << 2)
+
+	.text
+	.align	PAGE_SIZE
+	.globl	xen_kexec_control_code_size, xen_relocate_kernel
+
+xen_relocate_kernel:
+	/*
+	 * Must be relocatable PIC code callable as a C function.
+	 *
+	 * This function is called by Xen, but by now the hypervisor is
+	 * dead and we are running on bare metal.
+	 *
+	 * Every machine address passed to this function through
+	 * page_list (e.g. XK_MA_CONTROL_PAGE) is established
+	 * by dom0 during the kexec load phase.
+	 *
+	 * Every virtual address passed to this function through page_list
+	 * (e.g. XK_VA_CONTROL_PAGE) is established by the hypervisor during
+	 * the HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load) hypercall.
+	 *
+	 * 0x4(%esp) - indirection_page,
+	 * 0x8(%esp) - page_list,
+	 * 0xc(%esp) - start_address,
+	 * 0x10(%esp) - cpu_has_pae (ignored),
+	 * 0x14(%esp) - preserve_context (ignored).
+	 */
+
+	/* Zero out flags, and disable interrupts. */
+	pushl	$0
+	popfl
+
+	/* Get page_list address. */
+	movl	ARG_PAGE_LIST(%esp), %esi
+
+	/*
+	 * Map the control page at its virtual address
+	 * in the transition page table.
+	 */
+	movl	PTR(XK_VA_CONTROL_PAGE)(%esi), %eax
+
+	/* Get PGD address and PGD entry index. */
+	movl	PTR(XK_VA_PGD_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PGDIR_SHIFT, %ecx
+	andl	$(PTRS_PER_PGD - 1), %ecx
+
+	/* Fill PGD entry with PMD0 reference. */
+	movl	PTR(XK_MA_PMD0_PAGE)(%esi), %edx
+	orl	$_PAGE_PRESENT, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/* Get PMD0 address and PMD0 entry index. */
+	movl	PTR(XK_VA_PMD0_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PMD_SHIFT, %ecx
+	andl	$(PTRS_PER_PMD - 1), %ecx
+
+	/* Fill PMD0 entry with PTE0 reference. */
+	movl	PTR(XK_MA_PTE0_PAGE)(%esi), %edx
+	orl	$_KERNPG_TABLE, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/* Get PTE0 address and PTE0 entry index. */
+	movl	PTR(XK_VA_PTE0_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PAGE_SHIFT, %ecx
+	andl	$(PTRS_PER_PTE - 1), %ecx
+
+	/* Fill PTE0 entry with control page reference. */
+	movl	PTR(XK_MA_CONTROL_PAGE)(%esi), %edx
+	orl	$__PAGE_KERNEL_EXEC, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/*
+	 * Identity map the control page at its machine address
+	 * in the transition page table.
+	 */
+	movl	PTR(XK_MA_CONTROL_PAGE)(%esi), %eax
+
+	/* Get PGD address and PGD entry index. */
+	movl	PTR(XK_VA_PGD_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PGDIR_SHIFT, %ecx
+	andl	$(PTRS_PER_PGD - 1), %ecx
+
+	/* Fill PGD entry with PMD1 reference. */
+	movl	PTR(XK_MA_PMD1_PAGE)(%esi), %edx
+	orl	$_PAGE_PRESENT, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/* Get PMD1 address and PMD1 entry index. */
+	movl	PTR(XK_VA_PMD1_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PMD_SHIFT, %ecx
+	andl	$(PTRS_PER_PMD - 1), %ecx
+
+	/* Fill PMD1 entry with PTE1 reference. */
+	movl	PTR(XK_MA_PTE1_PAGE)(%esi), %edx
+	orl	$_KERNPG_TABLE, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/* Get PTE1 address and PTE1 entry index. */
+	movl	PTR(XK_VA_PTE1_PAGE)(%esi), %ebx
+	movl	%eax, %ecx
+	shrl	$PAGE_SHIFT, %ecx
+	andl	$(PTRS_PER_PTE - 1), %ecx
+
+	/* Fill PTE1 entry with control page reference. */
+	movl	PTR(XK_MA_CONTROL_PAGE)(%esi), %edx
+	orl	$__PAGE_KERNEL_EXEC, %edx
+	movl	%edx, (%ebx, %ecx, 8)
+
+	/*
+	 * Get the machine address of the control page now.
+	 * This is impossible after the page table switch.
+	 */
+	movl	PTR(XK_MA_CONTROL_PAGE)(%esi), %ebx
+
+	/* Get the machine address of the transition page table now too. */
+	movl	PTR(XK_MA_PGD_PAGE)(%esi), %ecx
+
+	/* Get start_address too. */
+	movl	ARG_START_ADDRESS(%esp), %edx
+
+	/* Get indirection_page address too. */
+	movl	ARG_INDIRECTION_PAGE(%esp), %edi
+
+	/* Switch to transition page table. */
+	movl	%ecx, %cr3
+
+	/* Load IDT. */
+	lidtl	(idt_48 - xen_relocate_kernel)(%ebx)
+
+	/* Load GDT. */
+	leal	(gdt - xen_relocate_kernel)(%ebx), %eax
+	movl	%eax, (gdt_48 - xen_relocate_kernel + 2)(%ebx)
+	lgdtl	(gdt_48 - xen_relocate_kernel)(%ebx)
+
+	/* Load data segment registers. */
+	movl	$(gdt_ds - gdt), %eax
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	%eax, %fs
+	movl	%eax, %gs
+	movl	%eax, %ss
+
+	/* Set up a new stack at the end of the control page (machine address). */
+	leal	PAGE_SIZE(%ebx), %esp
+
+	/* Store start_address on the stack. */
+	pushl   %edx
+
+	/* Jump to identity mapped page. */
+	pushl	$0
+	pushl	$(gdt_cs - gdt)
+	addl	$(identity_mapped - xen_relocate_kernel), %ebx
+	pushl	%ebx
+	iretl
+
+identity_mapped:
+	/*
+	 * Set %cr0 to a known state:
+	 *   - disable alignment check,
+	 *   - disable floating point emulation,
+	 *   - disable paging,
+	 *   - no task switch,
+	 *   - disable write protect,
+	 *   - enable protected mode.
+	 */
+	movl	%cr0, %eax
+	andl	$~(X86_CR0_AM | X86_CR0_EM | X86_CR0_PG | X86_CR0_TS | X86_CR0_WP), %eax
+	orl	$(X86_CR0_PE), %eax
+	movl	%eax, %cr0
+
+	/* Set %cr4 to a known state. */
+	xorl	%eax, %eax
+	movl	%eax, %cr4
+
+	jmp	1f
+
+1:
+	/* Flush the TLB (needed?). */
+	movl	%eax, %cr3
+
+	/* Do the copies. */
+	movl	%edi, %ecx	/* Put the indirection_page in %ecx. */
+	xorl	%edi, %edi
+	xorl	%esi, %esi
+	jmp	1f
+
+0:
+	/*
+	 * Top of the loop: read another doubleword from the indirection
+	 * page. The indirection page is an array of tagged entries
+	 * (destination, source, indirection and done). If the entries
+	 * do not fit in one page, the last entry of a given
+	 * indirection page points to the next one.
+	 * Copying stops when the done indicator
+	 * is found in the indirection page.
+	 */
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+
+1:
+	testl	$0x1, %ecx	/* Is it a destination page? */
+	jz	2f
+
+	movl	%ecx, %edi
+	andl	$PAGE_MASK, %edi
+	jmp	0b
+
+2:
+	testl	$0x2, %ecx	/* Is it an indirection page? */
+	jz	2f
+
+	movl	%ecx, %ebx
+	andl	$PAGE_MASK, %ebx
+	jmp	0b
+
+2:
+	testl	$0x4, %ecx	/* Is it the done indicator? */
+	jz	2f
+	jmp	3f
+
+2:
+	testl	$0x8, %ecx	/* Is it the source indicator? */
+	jz	0b		/* Ignore it otherwise. */
+
+	movl	%ecx, %esi
+	andl	$PAGE_MASK, %esi
+	movl	$1024, %ecx
+
+	/* Copy page. */
+	rep	movsl
+	jmp	0b
+
+3:
+	/*
+	 * To be certain of avoiding problems with self-modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB by reloading %cr3 here, it's handy,
+	 * and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/*
+	 * Set all of the registers to known values.
+	 * Leave %esp alone.
+	 */
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+
+	/* Jump to start_address. */
+	retl
+
+	.align	L1_CACHE_BYTES
+
+gdt:
+	.quad	0x0000000000000000	/* NULL descriptor. */
+
+gdt_cs:
+	.quad	0x00cf9a000000ffff	/* 4 GiB code segment at 0x00000000. */
+
+gdt_ds:
+	.quad	0x00cf92000000ffff	/* 4 GiB data segment at 0x00000000. */
+gdt_end:
+
+gdt_48:
+	.word	gdt_end - gdt - 1	/* GDT limit. */
+	.long	0			/* GDT base - filled in by code above. */
+
+idt_48:
+	.word	0			/* IDT limit. */
+	.long	0			/* IDT base. */
+
+xen_kexec_control_code_size:
+	.long	. - xen_relocate_kernel
-- 
1.5.6.5
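
The GDT at the end of relocate_kernel_32.S contains two flat descriptors: 0x00cf9a000000ffff for code and 0x00cf92000000ffff for data. Decoding them shows why the comments describe them as 4 GiB segments at 0x00000000. The base is 0, the 20-bit limit is 0xfffff, and the granularity bit scales the limit by 4 KiB. Below is a small decoder sketch based on the standard x86 segment descriptor layout. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct seg_desc {
	uint32_t base;
	uint64_t size;		/* Segment size in bytes: (limit + 1) * granularity. */
	uint8_t access;		/* P, DPL, S and type bits. */
	int gran_4k;		/* G flag: limit counted in 4 KiB units. */
	int big;		/* D/B flag: 32-bit default operand size. */
};

static struct seg_desc decode_desc(uint64_t d)
{
	struct seg_desc s;
	uint32_t limit;

	/* Limit bits 0-15 live in descriptor bits 0-15, bits 16-19 in 48-51. */
	limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 32) & 0xf0000);

	/* Base bits 0-23 live in descriptor bits 16-39, bits 24-31 in 56-63. */
	s.base = (uint32_t)((d >> 16) & 0xffffff) |
		 (uint32_t)((d >> 32) & 0xff000000);

	s.access = (uint8_t)(d >> 40);
	s.gran_4k = (int)((d >> 55) & 1);
	s.big = (int)((d >> 54) & 1);
	s.size = ((uint64_t)limit + 1) << (s.gran_4k ? 12 : 0);
	return s;
}
```

With this decoding, 0x9a is a present, ring 0, execute/read code segment, and 0x92 is a present, ring 0, read/write data segment.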



* [PATCH v2 07/11] x86/xen: Add x86_64 kexec/kdump implementation
  2012-11-20 15:04           ` [PATCH v2 06/11] x86/xen: Add i386 kexec/kdump implementation Daniel Kiper
@ 2012-11-20 15:04             ` Daniel Kiper
  2012-11-20 15:04               ` [PATCH v2 08/11] x86/xen: Add kexec/kdump makefile rules Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC (permalink / raw)
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Add x86_64 kexec/kdump implementation.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
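
relocate_kernel_64.S in this patch applies the same shift-and-mask pattern across four levels. With 4-level paging on x86_64, PGDIR_SHIFT is 39, PUD_SHIFT is 30, PMD_SHIFT is 21 and PAGE_SHIFT is 12, and every table holds 512 eight-byte entries. That is why the assembly scales by 8, and why PTR(x) is x << 3 (page_list slots are 8 bytes wide). The sketch below hard-codes those constants as an assumption instead of taking them from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* x86_64 4-level paging constants, hard-coded for illustration only. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PMD_SHIFT	21
#define SKETCH_PUD_SHIFT	30
#define SKETCH_PGDIR_SHIFT	39
#define SKETCH_PTRS_PER_TABLE	512u	/* Same for PGD, PUD, PMD and PTE. */

/* Index into the table at the level selected by shift. */
static unsigned int level_index(uint64_t addr, unsigned int shift)
{
	return (unsigned int)((addr >> shift) & (SKETCH_PTRS_PER_TABLE - 1));
}

/* PTR(x) from the assembly: byte offset of page_list[x] with 8-byte slots. */
static unsigned int page_list_offset(unsigned int x)
{
	assert(x < SKETCH_PTRS_PER_TABLE);
	return x << 3;
}
```
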
 arch/x86/xen/machine_kexec_64.c   |  302 ++++++++++++++++++++++++++++++++++++
 arch/x86/xen/relocate_kernel_64.S |  309 +++++++++++++++++++++++++++++++++++++
 2 files changed, 611 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/xen/machine_kexec_64.c
 create mode 100644 arch/x86/xen/relocate_kernel_64.S
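
init_pgtable() in this patch identity maps machine memory from 0 up to PFN_PHYS(max_mfn) using 2 MiB pages. init_level3_page() allocates one PMD page per 1 GiB (PUD_SIZE) step below that limit, and init_level4_page() allocates one PUD page per 512 GiB (PGDIR_SIZE) step. The PGD itself lives in the first control page. The sketch below gives a rough count of the extra control pages consumed for a given limit, assuming the x86_64 sizes above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PUD_SIZE		(1ULL << 30)	/* 1 GiB covered by one PMD page. */
#define SKETCH_PGDIR_SIZE	(1ULL << 39)	/* 512 GiB covered by one PUD page. */

/* Round-up division: the loops run while addr < last_addr. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* PUD and PMD pages allocated by init_level4_page() for [0, last_addr). */
static uint64_t identity_map_pages(uint64_t last_addr)
{
	uint64_t pud_pages = div_round_up(last_addr, SKETCH_PGDIR_SIZE);
	uint64_t pmd_pages = div_round_up(last_addr, SKETCH_PUD_SIZE);

	return pud_pages + pmd_pages;
}
```

For example, a 16 GiB limit needs one PUD page and 16 PMD pages, and any limit just past a 1 GiB boundary costs one more PMD page.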

diff --git a/arch/x86/xen/machine_kexec_64.c b/arch/x86/xen/machine_kexec_64.c
new file mode 100644
index 0000000..a2cf0c8
--- /dev/null
+++ b/arch/x86/xen/machine_kexec_64.c
@@ -0,0 +1,302 @@
+/*
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+
+#include <xen/interface/memory.h>
+#include <xen/xen.h>
+
+#include <asm/xen/hypercall.h>
+#include <asm/xen/kexec.h>
+#include <asm/xen/page.h>
+
+#define __ma(vaddr)	(virt_to_machine(vaddr).maddr)
+
+static unsigned long xen_page_to_mfn(struct page *page)
+{
+	return pfn_to_mfn(page_to_pfn(page));
+}
+
+static struct page *xen_mfn_to_page(unsigned long mfn)
+{
+	return pfn_to_page(mfn_to_pfn(mfn));
+}
+
+static unsigned long xen_virt_to_machine(volatile void *address)
+{
+	return virt_to_machine(address).maddr;
+}
+
+static void *xen_machine_to_virt(unsigned long address)
+{
+	return phys_to_virt(machine_to_phys(XMADDR(address)).paddr);
+}
+
+static void init_level2_page(pmd_t *pmd, unsigned long addr)
+{
+	unsigned long end_addr = addr + PUD_SIZE;
+
+	while (addr < end_addr) {
+		native_set_pmd(pmd++, native_make_pmd(addr | __PAGE_KERNEL_LARGE_EXEC));
+		addr += PMD_SIZE;
+	}
+}
+
+static int init_level3_page(struct kimage *image, pud_t *pud,
+				unsigned long addr, unsigned long last_addr)
+{
+	pmd_t *pmd;
+	struct page *page;
+	unsigned long end_addr = addr + PGDIR_SIZE;
+
+	while ((addr < last_addr) && (addr < end_addr)) {
+		page = kimage_alloc_control_pages(image, 0);
+
+		if (!page)
+			return -ENOMEM;
+
+		pmd = page_address(page);
+		init_level2_page(pmd, addr);
+		native_set_pud(pud++, native_make_pud(__ma(pmd) | _KERNPG_TABLE));
+		addr += PUD_SIZE;
+	}
+
+	/* Clear the unused entries. */
+	while (addr < end_addr) {
+		native_pud_clear(pud++);
+		addr += PUD_SIZE;
+	}
+
+	return 0;
+}
+
+
+static int init_level4_page(struct kimage *image, pgd_t *pgd,
+				unsigned long addr, unsigned long last_addr)
+{
+	int rc;
+	pud_t *pud;
+	struct page *page;
+	unsigned long end_addr = addr + PTRS_PER_PGD * PGDIR_SIZE;
+
+	while ((addr < last_addr) && (addr < end_addr)) {
+		page = kimage_alloc_control_pages(image, 0);
+
+		if (!page)
+			return -ENOMEM;
+
+		pud = page_address(page);
+		rc = init_level3_page(image, pud, addr, last_addr);
+
+		if (rc)
+			return rc;
+
+		native_set_pgd(pgd++, native_make_pgd(__ma(pud) | _KERNPG_TABLE));
+		addr += PGDIR_SIZE;
+	}
+
+	/* Clear the unused entries. */
+	while (addr < end_addr) {
+		native_pgd_clear(pgd++);
+		addr += PGDIR_SIZE;
+	}
+
+	return 0;
+}
+
+static void free_transition_pgtable(struct kimage *image)
+{
+	free_page((unsigned long)image->arch.pgd);
+	free_page((unsigned long)image->arch.pud0);
+	free_page((unsigned long)image->arch.pud1);
+	free_page((unsigned long)image->arch.pmd0);
+	free_page((unsigned long)image->arch.pmd1);
+	free_page((unsigned long)image->arch.pte0);
+	free_page((unsigned long)image->arch.pte1);
+}
+
+static int alloc_transition_pgtable(struct kimage *image)
+{
+	image->arch.pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pgd)
+		goto err;
+
+	image->arch.pud0 = (pud_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pud0)
+		goto err;
+
+	image->arch.pud1 = (pud_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pud1)
+		goto err;
+
+	image->arch.pmd0 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pmd0)
+		goto err;
+
+	image->arch.pmd1 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pmd1)
+		goto err;
+
+	image->arch.pte0 = (pte_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pte0)
+		goto err;
+
+	image->arch.pte1 = (pte_t *)get_zeroed_page(GFP_KERNEL);
+
+	if (!image->arch.pte1)
+		goto err;
+
+	return 0;
+
+err:
+	free_transition_pgtable(image);
+
+	return -ENOMEM;
+}
+
+static int init_pgtable(struct kimage *image, pgd_t *pgd)
+{
+	int rc;
+	unsigned long max_mfn;
+
+	max_mfn = HYPERVISOR_memory_op(XENMEM_maximum_ram_page, NULL);
+
+	rc = init_level4_page(image, pgd, 0, PFN_PHYS(max_mfn));
+
+	if (rc)
+		return rc;
+
+	return alloc_transition_pgtable(image);
+}
+
+static int machine_xen_kexec_prepare(struct kimage *image)
+{
+#ifdef CONFIG_KEXEC_JUMP
+	if (image->preserve_context) {
+		pr_info_once("kexec: Context preservation is not "
+				"supported in Xen domains.\n");
+		return -ENOSYS;
+	}
+#endif
+
+	return init_pgtable(image, page_address(image->control_code_page));
+}
+
+static int machine_xen_kexec_load(struct kimage *image)
+{
+	void *control_page, *table_page;
+	struct xen_kexec_load xkl = {};
+
+	/* Image is unloaded, nothing to do. */
+	if (!image)
+		return 0;
+
+	table_page = page_address(image->control_code_page);
+	control_page = table_page + PAGE_SIZE;
+
+	memcpy(control_page, xen_relocate_kernel, xen_kexec_control_code_size);
+
+	xkl.type = image->type;
+	xkl.image.page_list[XK_MA_CONTROL_PAGE] = __ma(control_page);
+	xkl.image.page_list[XK_MA_TABLE_PAGE] = __ma(table_page);
+	xkl.image.page_list[XK_MA_PGD_PAGE] = __ma(image->arch.pgd);
+	xkl.image.page_list[XK_MA_PUD0_PAGE] = __ma(image->arch.pud0);
+	xkl.image.page_list[XK_MA_PUD1_PAGE] = __ma(image->arch.pud1);
+	xkl.image.page_list[XK_MA_PMD0_PAGE] = __ma(image->arch.pmd0);
+	xkl.image.page_list[XK_MA_PMD1_PAGE] = __ma(image->arch.pmd1);
+	xkl.image.page_list[XK_MA_PTE0_PAGE] = __ma(image->arch.pte0);
+	xkl.image.page_list[XK_MA_PTE1_PAGE] = __ma(image->arch.pte1);
+	xkl.image.indirection_page = image->head;
+	xkl.image.start_address = image->start;
+
+	return HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load, &xkl);
+}
+
+static void machine_xen_kexec_cleanup(struct kimage *image)
+{
+	free_transition_pgtable(image);
+}
+
+static void machine_xen_kexec_unload(struct kimage *image)
+{
+	int rc;
+	struct xen_kexec_load xkl = {};
+
+	if (!image)
+		return;
+
+	xkl.type = image->type;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec_unload, &xkl);
+
+	WARN(rc, "kexec: %s: HYPERVISOR_kexec_op(): %i\n", __func__, rc);
+}
+
+static void machine_xen_kexec_shutdown(void)
+{
+}
+
+static void machine_xen_kexec(struct kimage *image)
+{
+	int rc;
+	struct xen_kexec_exec xke = {};
+
+	xke.type = image->type;
+	rc = HYPERVISOR_kexec_op(KEXEC_CMD_kexec, &xke);
+
+	pr_emerg("kexec: %s: HYPERVISOR_kexec_op(): %i\n", __func__, rc);
+	BUG();
+}
+
+void __init xen_init_kexec_ops(void)
+{
+	if (!xen_initial_domain())
+		return;
+
+	kexec_ops.crash_alloc_temp_store = true;
+	kexec_ops.page_to_pfn = xen_page_to_mfn;
+	kexec_ops.pfn_to_page = xen_mfn_to_page;
+	kexec_ops.virt_to_phys = xen_virt_to_machine;
+	kexec_ops.phys_to_virt = xen_machine_to_virt;
+	kexec_ops.machine_kexec_prepare = machine_xen_kexec_prepare;
+	kexec_ops.machine_kexec_load = machine_xen_kexec_load;
+	kexec_ops.machine_kexec_cleanup = machine_xen_kexec_cleanup;
+	kexec_ops.machine_kexec_unload = machine_xen_kexec_unload;
+	kexec_ops.machine_kexec_shutdown = machine_xen_kexec_shutdown;
+	kexec_ops.machine_kexec = machine_xen_kexec;
+}
diff --git a/arch/x86/xen/relocate_kernel_64.S b/arch/x86/xen/relocate_kernel_64.S
new file mode 100644
index 0000000..8f641f1
--- /dev/null
+++ b/arch/x86/xen/relocate_kernel_64.S
@@ -0,0 +1,309 @@
+/*
+ * Copyright (c) 2002-2005 Eric Biederman <ebiederm@xmission.com>
+ * Copyright (c) 2011 Daniel Kiper
+ * Copyright (c) 2012 Daniel Kiper, Oracle Corporation
+ *
+ * kexec/kdump implementation for Xen was written by Daniel Kiper.
+ * Initial work on it was sponsored by Google under Google Summer
+ * of Code 2011 program and Citrix. Konrad Rzeszutek Wilk from Oracle
+ * was the mentor for this project.
+ *
+ * Some ideas are taken from:
+ *   - native kexec/kdump implementation,
+ *   - kexec/kdump implementation for Xen Linux Kernel Ver. 2.6.18,
+ *   - PV-GRUB.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <asm/page_types.h>
+#include <asm/pgtable_types.h>
+#include <asm/processor-flags.h>
+
+#include <asm/xen/kexec.h>
+
+#define PTR(x)	(x << 3)
+
+	.text
+	.code64
+	.globl	xen_kexec_control_code_size, xen_relocate_kernel
+
+xen_relocate_kernel:
+	/*
+	 * Must be relocatable PIC code callable as a C function.
+	 *
+	 * This function is called by Xen, but by now the hypervisor is
+	 * dead and we are running on bare metal.
+	 *
+	 * Every machine address passed to this function through
+	 * page_list (e.g. XK_MA_CONTROL_PAGE) is established
+	 * by dom0 during the kexec load phase.
+	 *
+	 * Every virtual address passed to this function through page_list
+	 * (e.g. XK_VA_CONTROL_PAGE) is established by the hypervisor during
+	 * the HYPERVISOR_kexec_op(KEXEC_CMD_kexec_load) hypercall.
+	 *
+	 * %rdi - indirection_page,
+	 * %rsi - page_list,
+	 * %rdx - start_address,
+	 * %ecx - preserve_context (ignored).
+	 */
+
+	/* Zero out flags, and disable interrupts. */
+	pushq	$0
+	popfq
+
+	/*
+	 * Map the control page at its virtual address
+	 * in the transition page table.
+	 */
+	movq	PTR(XK_VA_CONTROL_PAGE)(%rsi), %r8
+
+	/* Get PGD address and PGD entry index. */
+	movq	PTR(XK_VA_PGD_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PGDIR_SHIFT, %r10
+	andq	$(PTRS_PER_PGD - 1), %r10
+
+	/* Fill PGD entry with PUD0 reference. */
+	movq	PTR(XK_MA_PUD0_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PUD0 address and PUD0 entry index. */
+	movq	PTR(XK_VA_PUD0_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PUD_SHIFT, %r10
+	andq	$(PTRS_PER_PUD - 1), %r10
+
+	/* Fill PUD0 entry with PMD0 reference. */
+	movq	PTR(XK_MA_PMD0_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PMD0 address and PMD0 entry index. */
+	movq	PTR(XK_VA_PMD0_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PMD_SHIFT, %r10
+	andq	$(PTRS_PER_PMD - 1), %r10
+
+	/* Fill PMD0 entry with PTE0 reference. */
+	movq	PTR(XK_MA_PTE0_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PTE0 address and PTE0 entry index. */
+	movq	PTR(XK_VA_PTE0_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PAGE_SHIFT, %r10
+	andq	$(PTRS_PER_PTE - 1), %r10
+
+	/* Fill PTE0 entry with control page reference. */
+	movq	PTR(XK_MA_CONTROL_PAGE)(%rsi), %r11
+	orq	$__PAGE_KERNEL_EXEC, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/*
+	 * Identity map the control page at its machine address
+	 * in the transition page table.
+	 */
+	movq	PTR(XK_MA_CONTROL_PAGE)(%rsi), %r8
+
+	/* Get PGD address and PGD entry index. */
+	movq	PTR(XK_VA_PGD_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PGDIR_SHIFT, %r10
+	andq	$(PTRS_PER_PGD - 1), %r10
+
+	/* Fill PGD entry with PUD1 reference. */
+	movq	PTR(XK_MA_PUD1_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PUD1 address and PUD1 entry index. */
+	movq	PTR(XK_VA_PUD1_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PUD_SHIFT, %r10
+	andq	$(PTRS_PER_PUD - 1), %r10
+
+	/* Fill PUD1 entry with PMD1 reference. */
+	movq	PTR(XK_MA_PMD1_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PMD1 address and PMD1 entry index. */
+	movq	PTR(XK_VA_PMD1_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PMD_SHIFT, %r10
+	andq	$(PTRS_PER_PMD - 1), %r10
+
+	/* Fill PMD1 entry with PTE1 reference. */
+	movq	PTR(XK_MA_PTE1_PAGE)(%rsi), %r11
+	orq	$_KERNPG_TABLE, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/* Get PTE1 address and PTE1 entry index. */
+	movq	PTR(XK_VA_PTE1_PAGE)(%rsi), %r9
+	movq	%r8, %r10
+	shrq	$PAGE_SHIFT, %r10
+	andq	$(PTRS_PER_PTE - 1), %r10
+
+	/* Fill PTE1 entry with control page reference. */
+	movq	PTR(XK_MA_CONTROL_PAGE)(%rsi), %r11
+	orq	$__PAGE_KERNEL_EXEC, %r11
+	movq	%r11, (%r9, %r10, 8)
+
+	/*
+	 * Get the machine address of the control page now.
+	 * This is impossible after the page table switch.
+	 */
+	movq	PTR(XK_MA_CONTROL_PAGE)(%rsi), %r8
+
+	/* Get the machine address of the identity page table now too. */
+	movq	PTR(XK_MA_TABLE_PAGE)(%rsi), %r9
+
+	/* Get the machine address of the transition page table now too. */
+	movq	PTR(XK_MA_PGD_PAGE)(%rsi), %r10
+
+	/* Switch to transition page table. */
+	movq	%r10, %cr3
+
+	/* Set up a new stack at the end of the control page (machine address). */
+	leaq	PAGE_SIZE(%r8), %rsp
+
+	/* Store start_address on the stack. */
+	pushq   %rdx
+
+	/* Jump to identity mapped page. */
+	addq	$(identity_mapped - xen_relocate_kernel), %r8
+	jmpq	*%r8
+
+identity_mapped:
+	/* Switch to identity page table. */
+	movq	%r9, %cr3
+
+	/*
+	 * Set %cr0 to a known state:
+	 *   - disable alignment check,
+	 *   - disable floating point emulation,
+	 *   - no task switch,
+	 *   - disable write protect,
+	 *   - enable protected mode,
+	 *   - enable paging.
+	 */
+	movq	%cr0, %rax
+	andq	$~(X86_CR0_AM | X86_CR0_EM | X86_CR0_TS | X86_CR0_WP), %rax
+	orl	$(X86_CR0_PE | X86_CR0_PG), %eax
+	movq	%rax, %cr0
+
+	/*
+	 * Set %cr4 to a known state:
+	 *   - enable physical address extension.
+	 */
+	movq	$X86_CR4_PAE, %rax
+	movq	%rax, %cr4
+
+	jmp	1f
+
+1:
+	/* Flush the TLB (needed?). */
+	movq	%r9, %cr3
+
+	/* Do the copies. */
+	movq	%rdi, %rcx	/* Put the indirection_page in %rcx. */
+	xorq	%rdi, %rdi
+	xorq	%rsi, %rsi
+	jmp	1f
+
+0:
+	/*
+	 * Top of the loop: read another quadword from the indirection
+	 * page. The indirection page is an array of tagged entries
+	 * (destination, source, indirection and done). If the entries
+	 * do not fit in one page, the last entry of a given
+	 * indirection page points to the next one.
+	 * Copying stops when the done indicator
+	 * is found in the indirection page.
+	 */
+	movq	(%rbx), %rcx
+	addq	$8, %rbx
+
+1:
+	testq	$0x1, %rcx	/* Is it a destination page? */
+	jz	2f
+
+	movq	%rcx, %rdi
+	andq	$PAGE_MASK, %rdi
+	jmp	0b
+
+2:
+	testq	$0x2, %rcx	/* Is it an indirection page? */
+	jz	2f
+
+	movq	%rcx, %rbx
+	andq	$PAGE_MASK, %rbx
+	jmp	0b
+
+2:
+	testq	$0x4, %rcx	/* Is it the done indicator? */
+	jz	2f
+	jmp	3f
+
+2:
+	testq	$0x8, %rcx	/* Is it the source indicator? */
+	jz	0b		/* Ignore it otherwise. */
+
+	movq	%rcx, %rsi
+	andq	$PAGE_MASK, %rsi
+	movq	$512, %rcx
+
+	/* Copy page. */
+	rep	movsq
+	jmp	0b
+
+3:
+	/*
+	 * To be certain of avoiding problems with self-modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB by reloading %cr3 here, it's handy,
+	 * and not processor dependent.
+	 */
+	movq	%cr3, %rax
+	movq	%rax, %cr3
+
+	/*
+	 * Set all of the registers to known values.
+	 * Leave %rsp alone.
+	 */
+	xorq	%rax, %rax
+	xorq	%rbx, %rbx
+	xorq	%rcx, %rcx
+	xorq	%rdx, %rdx
+	xorq	%rsi, %rsi
+	xorq	%rdi, %rdi
+	xorq	%rbp, %rbp
+	xorq	%r8, %r8
+	xorq	%r9, %r9
+	xorq	%r10, %r10
+	xorq	%r11, %r11
+	xorq	%r12, %r12
+	xorq	%r13, %r13
+	xorq	%r14, %r14
+	xorq	%r15, %r15
+
+	/* Jump to start_address. */
+	retq
+
+xen_kexec_control_code_size:
+	.long	. - xen_relocate_kernel
-- 
1.5.6.5
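
The PTE slot selection in the hunk above (shift right by PAGE_SHIFT, mask with PTRS_PER_PTE - 1, then scale by 8 through the `(%r9, %r10, 8)` operand) can be modelled in C. This is a minimal sketch assuming 4 KiB pages and the usual x86-64 four-level shifts (39, 30, 21, 12); the helper names pt_index() and pt_entry_offset() are hypothetical and not part of the patch.

```c
#include <stdint.h>

/* x86-64 4-level paging with 4 KiB pages (values as on x86-64 Linux). */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define PGDIR_SHIFT	39
#define PTRS_PER_PTE	512	/* every level holds 512 eight-byte entries */

/* Hypothetical helper: index of vaddr's entry in the table at one level. */
static unsigned int pt_index(uint64_t vaddr, unsigned int shift)
{
	return (unsigned int)((vaddr >> shift) & (PTRS_PER_PTE - 1));
}

/*
 * Byte offset of that entry inside its table page; this is what the
 * scaled operand (%r9, %r10, 8) adds to the table base in %r9.
 */
static uint64_t pt_entry_offset(uint64_t vaddr, unsigned int shift)
{
	return (uint64_t)pt_index(vaddr, shift) * sizeof(uint64_t);
}
```

The same helper covers every level; only the shift changes, which is why the assembly repeats the shift/mask/store pattern for PGD, PUD, PMD and PTE.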

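The %cr0/%cr4 sequence in identity_mapped is plain bit arithmetic. The sketch below restates it in C, assuming the architectural bit positions from the Intel SDM control-register description; the macro names mirror the kernel's X86_CR0_* and X86_CR4_* definitions, but this is illustrative code, not kernel code, and the function names are hypothetical. The cast to uint32_t models `orl` on %eax, which zero-extends into %rax and therefore clears bits 63:32 (reserved in CR0).

```c
#include <stdint.h>

/* Architectural CR0/CR4 bits (Intel SDM, control registers). */
#define X86_CR0_PE	(1ULL << 0)	/* protection enable */
#define X86_CR0_EM	(1ULL << 2)	/* x87 emulation */
#define X86_CR0_TS	(1ULL << 3)	/* task switched */
#define X86_CR0_WP	(1ULL << 16)	/* write protect */
#define X86_CR0_AM	(1ULL << 18)	/* alignment mask */
#define X86_CR0_PG	(1ULL << 31)	/* paging */
#define X86_CR4_PAE	(1ULL << 5)	/* physical address extension */

/* New %cr0: clear AM, EM, TS and WP, then set PE and PG. */
static uint64_t kexec_known_cr0(uint64_t cr0)
{
	cr0 &= ~(X86_CR0_AM | X86_CR0_EM | X86_CR0_TS | X86_CR0_WP);
	/* "orl ..., %eax" zero-extends, so bits 63:32 end up cleared. */
	return (uint32_t)(cr0 | X86_CR0_PE | X86_CR0_PG);
}

/* New %cr4: every bit except PAE is cleared ("movq $X86_CR4_PAE, %rax"). */
static uint64_t kexec_known_cr4(void)
{
	return X86_CR4_PAE;
}
```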

^ permalink raw reply related	[flat|nested] 35+ messages in thread
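
The copy loop in the relocate_kernel_64.S hunk above is easier to follow in C. This is a minimal sketch, assuming a flat address space in which entry addresses can be dereferenced directly (as the identity-mapped assembly does). The IND_* values match the flags in include/linux/kexec.h; walk_indirection() is a hypothetical name. Like the assembly, it assumes the first entry is an indirection entry. Each source page is copied to the current destination, which then advances by one page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(uint64_t)(PAGE_SIZE - 1))

/* Entry flags, same values as in include/linux/kexec.h. */
#define IND_DESTINATION	0x1
#define IND_INDIRECTION	0x2
#define IND_DONE	0x4
#define IND_SOURCE	0x8

typedef uint64_t kimage_entry_t;

/* Hypothetical C rendering of the relocate_kernel copy loop. */
static void walk_indirection(kimage_entry_t head)
{
	kimage_entry_t entry = head;	/* %rcx */
	kimage_entry_t *ind = NULL;	/* %rbx: current indirection page */
	unsigned char *dest = NULL;	/* %rdi: next destination address */

	for (;;) {
		if (entry & IND_DESTINATION) {
			dest = (unsigned char *)(uintptr_t)(entry & PAGE_MASK);
		} else if (entry & IND_INDIRECTION) {
			ind = (kimage_entry_t *)(uintptr_t)(entry & PAGE_MASK);
		} else if (entry & IND_DONE) {
			return;
		} else if (entry & IND_SOURCE) {
			assert(dest != NULL);
			/* rep movsq: 512 quadwords; %rdi ends one page further. */
			memcpy(dest, (const void *)(uintptr_t)(entry & PAGE_MASK),
			       PAGE_SIZE);
			dest += PAGE_SIZE;
		}
		/* Read the next entry; untagged entries are simply skipped. */
		assert(ind != NULL);
		entry = *ind++;	/* movq (%rbx), %rcx; addq $8, %rbx */
	}
}
```

Because the destination pointer is only reset by a destination entry, two source entries in a row land in two consecutive destination pages.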

* [PATCH v2 08/11] x86/xen: Add kexec/kdump makefile rules
  2012-11-20 15:04             ` [PATCH v2 07/11] x86/xen: Add x86_64 " Daniel Kiper
@ 2012-11-20 15:04               ` Daniel Kiper
  2012-11-20 15:04                 ` [PATCH v2 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC (permalink / raw)
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Add kexec/kdump makefile rules.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/xen/Makefile |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 96ab2c0..7a5db44 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -22,3 +22,6 @@ obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
 obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_XEN_DOM0)		+= apic.o vga.o
 obj-$(CONFIG_SWIOTLB_XEN)	+= pci-swiotlb-xen.o
+obj-$(CONFIG_KEXEC)		+= kexec.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC)		+= relocate_kernel_$(BITS).o
-- 
1.5.6.5



* [PATCH v2 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks
  2012-11-20 15:04               ` [PATCH v2 08/11] x86/xen: Add kexec/kdump makefile rules Daniel Kiper
@ 2012-11-20 15:04                 ` Daniel Kiper
  2012-11-20 15:04                   ` [PATCH v2 10/11] drivers/xen: Export vmcoreinfo through sysfs Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC (permalink / raw)
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Add init and crash kexec/kdump hooks.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/xen/enlighten.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 586d838..e5b4d0d 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -31,6 +31,7 @@
 #include <linux/pci.h>
 #include <linux/gfp.h>
 #include <linux/memblock.h>
+#include <linux/kexec.h>
 
 #include <xen/xen.h>
 #include <xen/events.h>
@@ -67,6 +68,7 @@
 #include <asm/hypervisor.h>
 #include <asm/mwait.h>
 #include <asm/pci_x86.h>
+#include <asm/xen/kexec.h>
 
 #ifdef CONFIG_ACPI
 #include <linux/acpi.h>
@@ -1254,6 +1256,12 @@ static void xen_machine_power_off(void)
 
 static void xen_crash_shutdown(struct pt_regs *regs)
 {
+#ifdef CONFIG_KEXEC
+	if (kexec_crash_image) {
+		crash_save_cpu(regs, safe_smp_processor_id());
+		return;
+	}
+#endif
 	xen_reboot(SHUTDOWN_crash);
 }
 
@@ -1331,6 +1339,10 @@ asmlinkage void __init xen_start_kernel(void)
 
 	xen_init_mmu_ops();
 
+#ifdef CONFIG_KEXEC
+	xen_init_kexec_ops();
+#endif
+
 	/* Prevent unwanted bits from being set in PTEs. */
 	__supported_pte_mask &= ~_PAGE_GLOBAL;
 #if 0
-- 
1.5.6.5



* [PATCH v2 10/11] drivers/xen: Export vmcoreinfo through sysfs
  2012-11-20 15:04                 ` [PATCH v2 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks Daniel Kiper
@ 2012-11-20 15:04                   ` Daniel Kiper
  2012-11-20 15:04                     ` [PATCH v2 11/11] x86: Add Xen kexec control code size check to linker script Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC (permalink / raw)
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Export vmcoreinfo through sysfs.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 drivers/xen/sys-hypervisor.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/sys-hypervisor.c b/drivers/xen/sys-hypervisor.c
index 96453f8..6edc289 100644
--- a/drivers/xen/sys-hypervisor.c
+++ b/drivers/xen/sys-hypervisor.c
@@ -368,6 +368,41 @@ static void xen_properties_destroy(void)
 	sysfs_remove_group(hypervisor_kobj, &xen_properties_group);
 }
 
+#ifdef CONFIG_KEXEC
+static ssize_t vmcoreinfo_show(struct hyp_sysfs_attr *attr, char *buffer)
+{
+	return sprintf(buffer, "%lx %lx\n", xen_vmcoreinfo_maddr,
+						xen_vmcoreinfo_max_size);
+}
+
+HYPERVISOR_ATTR_RO(vmcoreinfo);
+
+static int __init xen_vmcoreinfo_init(void)
+{
+	if (!xen_vmcoreinfo_max_size)
+		return 0;
+
+	return sysfs_create_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+	if (!xen_vmcoreinfo_max_size)
+		return;
+
+	sysfs_remove_file(hypervisor_kobj, &vmcoreinfo_attr.attr);
+}
+#else
+static int __init xen_vmcoreinfo_init(void)
+{
+	return 0;
+}
+
+static void xen_vmcoreinfo_destroy(void)
+{
+}
+#endif
+
 static int __init hyper_sysfs_init(void)
 {
 	int ret;
@@ -390,9 +425,14 @@ static int __init hyper_sysfs_init(void)
 	ret = xen_properties_init();
 	if (ret)
 		goto prop_out;
+	ret = xen_vmcoreinfo_init();
+	if (ret)
+		goto vmcoreinfo_out;
 
 	goto out;
 
+vmcoreinfo_out:
+	xen_properties_destroy();
 prop_out:
 	xen_sysfs_uuid_destroy();
 uuid_out:
@@ -407,12 +447,12 @@ out:
 
 static void __exit hyper_sysfs_exit(void)
 {
+	xen_vmcoreinfo_destroy();
 	xen_properties_destroy();
 	xen_compilation_destroy();
 	xen_sysfs_uuid_destroy();
 	xen_sysfs_version_destroy();
 	xen_sysfs_type_destroy();
-
 }
 module_init(hyper_sysfs_init);
 module_exit(hyper_sysfs_exit);
-- 
1.5.6.5
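
Since HYPERVISOR_ATTR_RO(vmcoreinfo) registers the attribute on hypervisor_kobj, the file should appear as /sys/hypervisor/vmcoreinfo and hold the machine address and maximum size in hex, as formatted by vmcoreinfo_show(). A minimal user-space sketch of a consumer follows; the function names are hypothetical. Note that the file is created only when xen_vmcoreinfo_max_size is non-zero.

```c
#include <stdio.h>

/*
 * Parse "%lx %lx\n" as written by vmcoreinfo_show(): machine address
 * of the vmcoreinfo note, then its maximum size.
 * Returns 0 on success, -1 if the text does not match that format.
 */
static int parse_xen_vmcoreinfo(const char *text, unsigned long *maddr,
				unsigned long *max_size)
{
	if (sscanf(text, "%lx %lx", maddr, max_size) != 2)
		return -1;
	return 0;
}

/* Read and parse the sysfs file (absent if the attribute was not created). */
static int read_xen_vmcoreinfo(unsigned long *maddr, unsigned long *max_size)
{
	char buf[64];
	FILE *f = fopen("/sys/hypervisor/vmcoreinfo", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_xen_vmcoreinfo(buf, maddr, max_size);
	fclose(f);
	return ret;
}
```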

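The new vmcoreinfo_out label relies on the usual kernel goto-unwind idiom: each label undoes the step that succeeded just before the failing one and then falls through to the labels below it, so earlier steps are undone in reverse order. The self-contained sketch below uses hypothetical stand-ins for the init/destroy helpers (three steps instead of the five in hyper_sysfs_init()) and records the undo order.

```c
#include <string.h>

/* Hypothetical stand-ins; not the real sys-hypervisor.c helpers. */
static int fail_at;		/* step number that fails, 0 = none */
static char undo_log[8];	/* destroy calls, in the order made */

static int step_init(int step)
{
	return step == fail_at ? -1 : 0;
}

static void step_destroy(char tag)
{
	size_t len = strlen(undo_log);

	undo_log[len] = tag;
	undo_log[len + 1] = '\0';
}

static int init_sketch(void)
{
	int ret;

	undo_log[0] = '\0';
	ret = step_init(1);		/* like xen_sysfs_uuid_init() */
	if (ret)
		goto out;
	ret = step_init(2);		/* like xen_properties_init() */
	if (ret)
		goto prop_out;
	ret = step_init(3);		/* like xen_vmcoreinfo_init() */
	if (ret)
		goto vmcoreinfo_out;

	goto out;

vmcoreinfo_out:
	step_destroy('P');		/* xen_properties_destroy() */
prop_out:
	step_destroy('U');		/* xen_sysfs_uuid_destroy() */
out:
	return ret;
}
```

A failure in the vmcoreinfo step therefore removes the properties group and then the uuid file, exactly as the patched error path does.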


* [PATCH v2 11/11] x86: Add Xen kexec control code size check to linker script
  2012-11-20 15:04                   ` [PATCH v2 10/11] drivers/xen: Export vmcoreinfo through sysfs Daniel Kiper
@ 2012-11-20 15:04                     ` Daniel Kiper
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-20 15:04 UTC (permalink / raw)
  To: andrew.cooper3, ebiederm, hpa, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel
  Cc: Daniel Kiper

Add Xen kexec control code size check to linker script.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
---
 arch/x86/kernel/vmlinux.lds.S |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 22a1530..f18786a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -360,5 +360,10 @@ INIT_PER_CPU(irq_stack_union);
 
 . = ASSERT(kexec_control_code_size <= KEXEC_CONTROL_CODE_MAX_SIZE,
            "kexec control code size is too big");
-#endif
 
+#ifdef CONFIG_XEN
+. = ASSERT(xen_kexec_control_code_size - xen_relocate_kernel <=
+		KEXEC_CONTROL_CODE_MAX_SIZE,
+		"Xen kexec control code size is too big");
+#endif
+#endif
-- 
1.5.6.5



* Re: [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
  2012-11-20 15:04   ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Daniel Kiper
  2012-11-20 15:04     ` [PATCH v2 03/11] xen: Introduce architecture independent data for kexec/kdump Daniel Kiper
@ 2012-11-20 15:52     ` Jan Beulich
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2012-11-20 15:52 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: andrew.cooper3, x86, tglx, kexec, virtualization, xen-devel,
	konrad.wilk, mingo, linux-kernel, ebiederm, hpa

>>> On 20.11.12 at 16:04, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> Some implementations (e.g. Xen PVOPS) could not use part of identity page table
> to construct transition page table. It means that they require separate PUDs,
> PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
> requirement add extra pointer to PGD, PUD, PMD and PTE and align existing code.

As said for v1 already - this is not really a requirement of the
interface, or else none of our Xen kernels since 2.6.30 would
have worked. I don't think it is desirable to introduce overhead
for everyone if it's not even needed for Xen.

Jan

> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
> ---
>  arch/x86/include/asm/kexec.h       |   10 +++++++---
>  arch/x86/kernel/machine_kexec_64.c |   12 ++++++------
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 317ff17..3cf5600 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -157,9 +157,13 @@ struct kimage_arch {
>  };
>  #else
>  struct kimage_arch {
> -	pud_t *pud;
> -	pmd_t *pmd;
> -	pte_t *pte;
> +	pgd_t *pgd;
> +	pud_t *pud0;
> +	pud_t *pud1;
> +	pmd_t *pmd0;
> +	pmd_t *pmd1;
> +	pte_t *pte0;
> +	pte_t *pte1;
>  };
>  #endif
>  
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index b3ea9db..976e54b 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -137,9 +137,9 @@ out:
>  
>  static void free_transition_pgtable(struct kimage *image)
>  {
> -	free_page((unsigned long)image->arch.pud);
> -	free_page((unsigned long)image->arch.pmd);
> -	free_page((unsigned long)image->arch.pte);
> +	free_page((unsigned long)image->arch.pud0);
> +	free_page((unsigned long)image->arch.pmd0);
> +	free_page((unsigned long)image->arch.pte0);
>  }
>  
>  static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
> @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>  		pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
>  		if (!pud)
>  			goto err;
> -		image->arch.pud = pud;
> +		image->arch.pud0 = pud;
>  		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
>  	}
>  	pud = pud_offset(pgd, vaddr);
> @@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>  		pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
>  		if (!pmd)
>  			goto err;
> -		image->arch.pmd = pmd;
> +		image->arch.pmd0 = pmd;
>  		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
>  	}
>  	pmd = pmd_offset(pud, vaddr);
> @@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>  		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
>  		if (!pte)
>  			goto err;
> -		image->arch.pte = pte;
> +		image->arch.pte0 = pte;
>  		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
>  	}
>  	pte = pte_offset_kernel(pmd, vaddr);
> -- 
> 1.5.6.5





* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-20 15:04 ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Daniel Kiper
  2012-11-20 15:04   ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Daniel Kiper
@ 2012-11-20 16:40   ` Eric W. Biederman
  2012-11-21 10:52     ` Daniel Kiper
  1 sibling, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2012-11-20 16:40 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: andrew.cooper3, hpa, jbeulich, konrad.wilk, mingo, tglx, x86,
	kexec, linux-kernel, virtualization, xen-devel

Daniel Kiper <daniel.kiper@oracle.com> writes:

> Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default
> functions or require some changes in behavior of kexec/kdump generic code.
> To cope with that problem kexec_ops struct was introduced. It allows
> a developer to replace all or some functions and control some
> functionality of kexec/kdump generic code.
>
> Default behavior of kexec/kdump generic code is not changed.

Ick.

> v2 - suggestions/fixes:
>    - add comment for kexec_ops.crash_alloc_temp_store member
>      (suggested by Konrad Rzeszutek Wilk),
>    - simplify kexec_ops usage
>      (suggested by Konrad Rzeszutek Wilk).
>
> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
> ---
>  include/linux/kexec.h |   26 ++++++++++
>  kernel/kexec.c        |  131 +++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 125 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d0b8458..c8d0b35 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -116,7 +116,33 @@ struct kimage {
>  #endif
>  };
>  
> +struct kexec_ops {
> +	/*
> +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not access
> +	 * directly crash kernel memory area. In this situation they must
> +	 * allocate memory outside of it and later move contents from temporary
> +	 * storage to final resting places (usualy done by relocate_kernel()).
> +	 * Such behavior could be enforced by setting
> +	 * crash_alloc_temp_store member to true.
> +	 */

Why in the world would Xen not be able to access crash kernel memory?
As currently defined it is normal memory that the kernel chooses not to
use.

If relocate kernel can access that memory you definitely can access the
memory so the comment does not make any sense.

> +	bool crash_alloc_temp_store;
> +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
> +						unsigned int order,
> +						unsigned long limit);
> +	void (*kimage_free_pages)(struct page *page);
> +	unsigned long (*page_to_pfn)(struct page *page);
> +	struct page *(*pfn_to_page)(unsigned long pfn);
> +	unsigned long (*virt_to_phys)(volatile void *address);
> +	void *(*phys_to_virt)(unsigned long address);
> +	int (*machine_kexec_prepare)(struct kimage *image);
> +	int (*machine_kexec_load)(struct kimage *image);
> +	void (*machine_kexec_cleanup)(struct kimage *image);
> +	void (*machine_kexec_unload)(struct kimage *image);
> +	void (*machine_kexec_shutdown)(void);
> +	void (*machine_kexec)(struct kimage *image);
> +};

Ugh.  This is a nasty abstraction.

You are mixing and matching a bunch of things together here.

If you need to override machine_kexec_xxx please do that on a per
architecture basis.

Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
phys_to_virt, and friends seem completely inappropriate.

There may be a point to all of these but you are mixing and matching
things badly.


Eric


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-20 16:40   ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Eric W. Biederman
@ 2012-11-21 10:52     ` Daniel Kiper
  2012-11-22 12:15       ` Eric W. Biederman
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-21 10:52 UTC (permalink / raw)
  To: ebiederm
  Cc: andrew.cooper3, hpa, jbeulich, konrad.wilk, mingo, tglx, x86,
	kexec, linux-kernel, virtualization, xen-devel

On Tue, Nov 20, 2012 at 08:40:39AM -0800, ebiederm@xmission.com wrote:
> Daniel Kiper <daniel.kiper@oracle.com> writes:
>
> > Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default
> > functions or require some changes in behavior of kexec/kdump generic code.
> > To cope with that problem kexec_ops struct was introduced. It allows
> > a developer to replace all or some functions and control some
> > functionality of kexec/kdump generic code.
> >
> > Default behavior of kexec/kdump generic code is not changed.
>
> Ick.
>
> > v2 - suggestions/fixes:
> >    - add comment for kexec_ops.crash_alloc_temp_store member
> >      (suggested by Konrad Rzeszutek Wilk),
> >    - simplify kexec_ops usage
> >      (suggested by Konrad Rzeszutek Wilk).
> >
> > Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
> > ---
> >  include/linux/kexec.h |   26 ++++++++++
> >  kernel/kexec.c        |  131 +++++++++++++++++++++++++++++++++++++------------
> >  2 files changed, 125 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index d0b8458..c8d0b35 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -116,7 +116,33 @@ struct kimage {
> >  #endif
> >  };
> >
> > +struct kexec_ops {
> > +	/*
> > +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not access
> > +	 * directly crash kernel memory area. In this situation they must
> > +	 * allocate memory outside of it and later move contents from temporary
> > +	 * storage to final resting places (usualy done by relocate_kernel()).
> > +	 * Such behavior could be enforced by setting
> > +	 * crash_alloc_temp_store member to true.
> > +	 */
>
> Why in the world would Xen not be able to access crash kernel memory?
> As currently defined it is normal memory that the kernel chooses not to
> use.
>
> If relocate kernel can access that memory you definitely can access the
> memory so the comment does not make any sense.

Crash kernel memory is reserved by the Xen hypervisor and only the
hypervisor has access to it; dom0 does not have any mapping of this
area. However, relocate_kernel() has access to crash kernel memory
because it is executed by the Xen hypervisor, with the whole machine
memory identity mapped.

> > +	bool crash_alloc_temp_store;
> > +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
> > +						unsigned int order,
> > +						unsigned long limit);
> > +	void (*kimage_free_pages)(struct page *page);
> > +	unsigned long (*page_to_pfn)(struct page *page);
> > +	struct page *(*pfn_to_page)(unsigned long pfn);
> > +	unsigned long (*virt_to_phys)(volatile void *address);
> > +	void *(*phys_to_virt)(unsigned long address);
> > +	int (*machine_kexec_prepare)(struct kimage *image);
> > +	int (*machine_kexec_load)(struct kimage *image);
> > +	void (*machine_kexec_cleanup)(struct kimage *image);
> > +	void (*machine_kexec_unload)(struct kimage *image);
> > +	void (*machine_kexec_shutdown)(void);
> > +	void (*machine_kexec)(struct kimage *image);
> > +};
>
> Ugh.  This is a nasty abstraction.
>
> You are mixing and matching a bunch of things together here.
>
> If you need to override machine_kexec_xxx please do that on a per
> architecture basis.

Yes, that is possible, but I think it is worth doing at this level
because it could be useful for other architectures too (e.g. the Xen ARM
port is under development). Then we would not need to duplicate that
functionality in arch code. Additionally, Xen requires machine_kexec_load
and machine_kexec_unload hooks, which are not available in the current
generic kexec/kdump code.

> Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
> phys_to_virt, and friends seem completely inappropriate.

They are required in the Xen PVOPS case. If we do not do it this way,
then, at the very least, we need to duplicate almost all of the existing
generic kexec/kdump code in arch-dependent files, not to mention capturing
the relevant syscall and other things. I think that is the wrong way.

> There may be a point to all of these but you are mixing and matching
> things badly.

Do you wish to split this kexec_ops struct into something which
works with addresses and something which is responsible for
loading, unloading and executing kexec/kdump? I am able to change
that, but I would like to know a bit about your vision first.

Daniel


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-21 10:52     ` Daniel Kiper
@ 2012-11-22 12:15       ` Eric W. Biederman
  2012-11-22 17:37         ` H. Peter Anvin
                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Eric W. Biederman @ 2012-11-22 12:15 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: andrew.cooper3, hpa, jbeulich, konrad.wilk, mingo, tglx, x86,
	kexec, linux-kernel, virtualization, xen-devel

Daniel Kiper <daniel.kiper@oracle.com> writes:

> On Tue, Nov 20, 2012 at 08:40:39AM -0800, ebiederm@xmission.com wrote:
>> Daniel Kiper <daniel.kiper@oracle.com> writes:
>>
>> > Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default
>> > functions or require some changes in behavior of kexec/kdump generic code.
>> > To cope with that problem kexec_ops struct was introduced. It allows
>> > a developer to replace all or some functions and control some
>> > functionality of kexec/kdump generic code.
>> >
>> > Default behavior of kexec/kdump generic code is not changed.
>>
>> Ick.
>>
>> > v2 - suggestions/fixes:
>> >    - add comment for kexec_ops.crash_alloc_temp_store member
>> >      (suggested by Konrad Rzeszutek Wilk),
>> >    - simplify kexec_ops usage
>> >      (suggested by Konrad Rzeszutek Wilk).
>> >
>> > Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
>> > ---
>> >  include/linux/kexec.h |   26 ++++++++++
>> >  kernel/kexec.c        |  131 +++++++++++++++++++++++++++++++++++++------------
>> >  2 files changed, 125 insertions(+), 32 deletions(-)
>> >
>> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> > index d0b8458..c8d0b35 100644
>> > --- a/include/linux/kexec.h
>> > +++ b/include/linux/kexec.h
>> > @@ -116,7 +116,33 @@ struct kimage {
>> >  #endif
>> >  };
>> >
>> > +struct kexec_ops {
>> > +	/*
>> > +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not access
>> > +	 * directly crash kernel memory area. In this situation they must
>> > +	 * allocate memory outside of it and later move contents from temporary
>> > +	 * storage to final resting places (usualy done by relocate_kernel()).
>> > +	 * Such behavior could be enforced by setting
>> > +	 * crash_alloc_temp_store member to true.
>> > +	 */
>>
>> Why in the world would Xen not be able to access crash kernel memory?
>> As currently defined it is normal memory that the kernel chooses not to
>> use.
>>
>> If relocate kernel can access that memory you definitely can access the
>> memory so the comment does not make any sense.
>
> Crash kernel memory is reserved by the Xen hypervisor and only the
> hypervisor has access to it; dom0 does not have any mapping of this
> area. However, relocate_kernel() has access to crash kernel memory
> because it is executed by the Xen hypervisor, with the whole machine
> memory identity mapped.

This is all weird.  Doubly so since this code is multi-arch and you have
a set of requirements no other arch has had.

I recall that Xen uses kexec in a unique manner.  What is the hypervisor
interface and how is it used?

Is this for when the hypervisor crashes and we want a crash dump of
that?



>> > +	bool crash_alloc_temp_store;
>> > +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
>> > +						unsigned int order,
>> > +						unsigned long limit);
>> > +	void (*kimage_free_pages)(struct page *page);
>> > +	unsigned long (*page_to_pfn)(struct page *page);
>> > +	struct page *(*pfn_to_page)(unsigned long pfn);
>> > +	unsigned long (*virt_to_phys)(volatile void *address);
>> > +	void *(*phys_to_virt)(unsigned long address);
>> > +	int (*machine_kexec_prepare)(struct kimage *image);
>> > +	int (*machine_kexec_load)(struct kimage *image);
>> > +	void (*machine_kexec_cleanup)(struct kimage *image);
>> > +	void (*machine_kexec_unload)(struct kimage *image);
>> > +	void (*machine_kexec_shutdown)(void);
>> > +	void (*machine_kexec)(struct kimage *image);
>> > +};
>>
>> Ugh.  This is a nasty abstraction.
>>
>> You are mixing and matching a bunch of things together here.
>>
>> If you need to override machine_kexec_xxx please do that on a per
>> architecture basis.
>
> Yes, that is possible, but I think it is worth doing at this level
> because it could be useful for other architectures too (e.g. the Xen ARM
> port is under development). Then we would not need to duplicate that
> functionality in arch code. Additionally, Xen requires machine_kexec_load
> and machine_kexec_unload hooks, which are not available in the current
> generic kexec/kdump code.


Let me be clear.  kexec_ops as you have implemented it is absolutely
unacceptable.

Your kexec_ops is not an abstraction but a hack that enshrines in stone
implementation details.

>> Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
>> phys_to_virt, and friends seem completely inappropriate.
>
> They are required in the Xen PVOPS case. If we do not do it this way,
> then, at the very least, we need to duplicate almost all of the existing
> generic kexec/kdump code in arch-dependent files, not to mention capturing
> the relevant syscall and other things. I think that is the wrong way.

A different definition of phys_to_virt and page_to_pfn for one specific
function is total nonsense.

It may actually be better to have a completely different code path.
This looks more like code abuse than code reuse.

Successful code reuse depends upon not breaking the assumptions on which
the code relies, or modifying the code so that the new modified
assumptions are clear.  In this case you might as well define up as down
for all of the sense kexec_ops makes.

>> There may be a point to all of these but you are mixing and matching
>> things badly.
>
> Do you wish to split this kexec_ops struct into something which
> works with addresses and something which is responsible for
> loading, unloading and executing kexec/kdump? I am able to change
> that, but I would like to know a bit about your vision first.

My vision is that we should have code that makes sense.

My suspicion is that what you want is a cousin of the existing kexec
system call.  Perhaps what is needed is a flag to say use the firmware
kexec system call.

I absolutely do not understand what Xen is trying to do.  kexec by
design should not require any firmware specific hooks.  kexec at this
level should only need to care about the processor architecture.  Clearly
what you are doing with Xen requires special hooks separate even from
the normal paravirt hooks.  So I do not understand what you are trying to do.

It needs to be clear from the code what is happening differently in the
Xen case.  Otherwise the code is unmaintainable as no one will be able
to understand it.

Eric



* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 12:15       ` Eric W. Biederman
@ 2012-11-22 17:37         ` H. Peter Anvin
  2012-11-23  9:56           ` Jan Beulich
  2012-11-22 17:47         ` H. Peter Anvin
  2012-11-23  9:47         ` Daniel Kiper
  2 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2012-11-22 17:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Kiper, andrew.cooper3, jbeulich, konrad.wilk, mingo, tglx,
	x86, kexec, linux-kernel, virtualization, xen-devel

On 11/22/2012 04:15 AM, Eric W. Biederman wrote:
>
> Let me be clear.  kexec_ops as you have implemented it is absolutely
> unacceptable.
>
> Your kexec_ops is not an abstraction but a hack that enshrines in stone
> implementation details.
>

This is the kind of stuff that is absolutely endemic to the Xen 
endeavour, and which is why Xen is such a disease.  The design principle 
seems to have been "hey, let's go and replace random Linux kernel 
internals with our own stuff, and make them ABIs, so that they can never 
change.  Oh, and let's not bother documenting the constraints we're 
imposing, that might make the code manageable."

I actually talked to Ian Jackson at LCE, and mentioned among other 
things the bogosity of requiring a PUD page for three-level paging in 
Linux -- a bogosity which has spread from Xen into native.  It's a page 
wasted for no good reason, since it only contains 32 bytes worth of 
data, *inherently*.  Furthermore, contrary to popular belief, it is 
*not* a page table per se.

Ian told me: "I didn't know we did that, and we shouldn't have to." 
Here we have suffered this overhead for at least six years, because *XEN 
FUCKED UP AND NO ONE ELSE HAD ANY WAY OF KNOWING THAT*.

Now we know that it can "maybe"(!!!) be fixed, if we are willing to 
spend time working on a dying platform, whereas we have already suffered 
the damage during the height of its importance.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
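
For reference, the 32-byte figure follows from the PAE layout: the top-level page-directory-pointer table has 4 entries of 8 bytes each, so a full 4 KiB page allocated for it is almost entirely unused.

```latex
4 \times 8\,\text{B} = 32\,\text{B}, \qquad
4096\,\text{B} - 32\,\text{B} = 4064\,\text{B unused}, \qquad
\tfrac{32}{4096} \approx 0.8\%
```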



* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 12:15       ` Eric W. Biederman
  2012-11-22 17:37         ` H. Peter Anvin
@ 2012-11-22 17:47         ` H. Peter Anvin
  2012-11-22 18:07           ` Andrew Cooper
  2012-11-23  0:12           ` Andrew Cooper
  2012-11-23  9:47         ` Daniel Kiper
  2 siblings, 2 replies; 35+ messages in thread
From: H. Peter Anvin @ 2012-11-22 17:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Kiper, andrew.cooper3, jbeulich, konrad.wilk, mingo, tglx,
	x86, kexec, linux-kernel, virtualization, xen-devel

The other thing that should be considered here is how utterly 
preposterous the notion of doing in-guest crash dumping is in a system 
that contains a hypervisor.  The reason for kdump is that on bare metal 
there are no other options, but in a hypervisor system the right thing 
should be for the hypervisor to do the dump (possibly spawning a clean 
I/O domain if the I/O domain is necessary to access the media.)

There is absolutely no reason to have a crashkernel sitting around in 
each guest, consuming memory, and possibly get corrupt.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 17:47         ` H. Peter Anvin
@ 2012-11-22 18:07           ` Andrew Cooper
  2012-11-22 22:26             ` H. Peter Anvin
  2012-11-23  0:12           ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2012-11-22 18:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

On 22/11/12 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly 
> preposterous the notion of doing in-guest crash dumping is in a system 
> that contains a hypervisor.  The reason for kdump is that on bare metal 
> there are no other options, but in a hypervisor system the right thing 
> should be for the hypervisor to do the dump (possibly spawning a clean 
> I/O domain if the I/O domain is necessary to access the media.)
>
> There is absolutely no reason to have a crashkernel sitting around in 
> each guest, consuming memory, and possibly get corrupt.
>
> 	-hpa
>

I agree that regular guests should not be using kexec/kdump.
However, this patch series is required for allowing a pvops kernel to be
a crash kernel for Xen, which is very important from dom0/Xen's point of
view.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 18:07           ` Andrew Cooper
@ 2012-11-22 22:26             ` H. Peter Anvin
  2014-03-31 10:50               ` Petr Tesarik
  0 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2012-11-22 22:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

Bullshit.  This should be a separate domain.

Andrew Cooper <andrew.cooper3@citrix.com> wrote:

>On 22/11/12 17:47, H. Peter Anvin wrote:
>> The other thing that should be considered here is how utterly
>> preposterous the notion of doing in-guest crash dumping is in a system
>> that contains a hypervisor.  The reason for kdump is that on bare metal
>> there are no other options, but in a hypervisor system the right thing
>> should be for the hypervisor to do the dump (possibly spawning a clean
>> I/O domain if the I/O domain is necessary to access the media.)
>>
>> There is absolutely no reason to have a crashkernel sitting around in
>> each guest, consuming memory, and possibly get corrupt.
>>
>> 	-hpa
>>
>
>I agree that regular guests should not be using the kexec/kdump.
>However, this patch series is required for allowing a pvops kernel to be
>a crash kernel for Xen, which is very important from dom0/Xen's point of
>view.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 17:47         ` H. Peter Anvin
  2012-11-22 18:07           ` Andrew Cooper
@ 2012-11-23  0:12           ` Andrew Cooper
  2012-11-23  1:34             ` H. Peter Anvin
  2012-11-23  1:38             ` H. Peter Anvin
  1 sibling, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2012-11-23  0:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

On 22/11/2012 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly 
> preposterous the notion of doing in-guest crash dumping is in a system 
> that contains a hypervisor.  The reason for kdump is that on bare metal 
> there are no other options, but in a hypervisor system the right thing 
> should be for the hypervisor to do the dump (possibly spawning a clean 
> I/O domain if the I/O domain is necessary to access the media.)
>
> There is absolutely no reason to have a crashkernel sitting around in 
> each guest, consuming memory, and possibly get corrupt.
>
> 	-hpa
>

(Your reply to my email which I can see on the xen devel archive appears
to have gotten lost somewhere inside the citrix email system, so
apologies for replying out of order)

The kdump kernel loaded by dom0 is for when Xen crashes, not for when
dom0 crashes (although a dom0 crash does admittedly lead to a Xen crash).

There is no possible way it could be a separate domain; Xen completely
ceases to function as soon as it jumps to the entry point of the kdump image.

~Andrew

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  0:12           ` Andrew Cooper
@ 2012-11-23  1:34             ` H. Peter Anvin
  2012-11-23  1:38             ` H. Peter Anvin
  1 sibling, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2012-11-23  1:34 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

Ok... that *sort of* makes sense, but also underscores how utterly different this is from a normal kexec.

Andrew Cooper <andrew.cooper3@citrix.com> wrote:

>On 22/11/2012 17:47, H. Peter Anvin wrote:
>> The other thing that should be considered here is how utterly
>> preposterous the notion of doing in-guest crash dumping is in a system
>> that contains a hypervisor.  The reason for kdump is that on bare metal
>> there are no other options, but in a hypervisor system the right thing
>> should be for the hypervisor to do the dump (possibly spawning a clean
>> I/O domain if the I/O domain is necessary to access the media.)
>>
>> There is absolutely no reason to have a crashkernel sitting around in
>> each guest, consuming memory, and possibly get corrupt.
>>
>> 	-hpa
>>
>
>(Your reply to my email which I can see on the xen devel archive appears
>to have gotten lost somewhere inside the citrix email system, so
>apologies for replying out of order)
>
>The kdump kernel loaded by dom0 is for when Xen crashes, not for when
>dom0 crashes (although a dom0 crash does admittedly lead to a Xen crash)
>
>There is no possible way it could be a separate domain; Xen completely
>ceases to function as soon as jumps to the entry point of the kdump image.
>
>~Andrew

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  0:12           ` Andrew Cooper
  2012-11-23  1:34             ` H. Peter Anvin
@ 2012-11-23  1:38             ` H. Peter Anvin
  2012-11-23  1:56               ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2012-11-23  1:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

I still don't really get why it can't be isolated from dom0, which would make more sense to me, even for a Xen crash.

Andrew Cooper <andrew.cooper3@citrix.com> wrote:

>On 22/11/2012 17:47, H. Peter Anvin wrote:
>> The other thing that should be considered here is how utterly
>> preposterous the notion of doing in-guest crash dumping is in a system
>> that contains a hypervisor.  The reason for kdump is that on bare metal
>> there are no other options, but in a hypervisor system the right thing
>> should be for the hypervisor to do the dump (possibly spawning a clean
>> I/O domain if the I/O domain is necessary to access the media.)
>>
>> There is absolutely no reason to have a crashkernel sitting around in
>> each guest, consuming memory, and possibly get corrupt.
>>
>> 	-hpa
>>
>
>(Your reply to my email which I can see on the xen devel archive appears
>to have gotten lost somewhere inside the citrix email system, so
>apologies for replying out of order)
>
>The kdump kernel loaded by dom0 is for when Xen crashes, not for when
>dom0 crashes (although a dom0 crash does admittedly lead to a Xen crash)
>
>There is no possible way it could be a separate domain; Xen completely
>ceases to function as soon as jumps to the entry point of the kdump image.
>
>~Andrew

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  1:38             ` H. Peter Anvin
@ 2012-11-23  1:56               ` Andrew Cooper
  2012-11-23  9:53                 ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2012-11-23  1:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Eric W. Biederman, Daniel Kiper, jbeulich, konrad.wilk, mingo,
	tglx, x86, kexec, linux-kernel, virtualization, xen-devel

On 23/11/2012 01:38, H. Peter Anvin wrote:
> I still don't really get why it can't be isolated from dom0, which would make more sense to me, even for a Xen crash.
>

The crash region (as specified by crashkernel= on the Xen command line)
is isolated from dom0.

dom0 (using the kexec utility etc) has the task of locating the Xen
crash notes (using the kexec hypercall interface), constructing a binary
blob containing kernel, initramfs and gubbins, and asking Xen to put this
blob in the crash region (again, using the kexec hypercall interface).

I do not see how this is very much different from the native case
currently (although please correct me if I am misinformed).  Linux has
extra work to do by populating /proc/iomem with the Xen crash regions at
boot (so the kexec utility can reference their physical addresses when
constructing the blob), and should just act as a conduit between the
kexec system call and the kexec hypercall to load the blob.
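On the kexec-tools side, locating the crash region Andrew mentions amounts to a name match over /proc/iomem lines of the form `start-end : name`. A minimal C sketch, assuming the Xen crash region is exported under the same "Crash kernel" label used on bare metal; the helper names are illustrative and not taken from kexec-tools or this patch series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/iomem line of the form "  start-end : name\n".
 * Returns true and fills *start / *end only if the line names `want`. */
bool iomem_line_matches(const char *line, const char *want,
                        unsigned long long *start, unsigned long long *end)
{
    unsigned long long s, e;
    int name_off = 0;

    assert(line != NULL && want != NULL);
    /* The leading space skips nesting indentation; %n records where the
     * name begins and is only reached if " : " was matched. */
    if (sscanf(line, " %llx-%llx : %n", &s, &e, &name_off) != 2 ||
        name_off == 0)
        return false;

    const char *name = line + name_off;
    size_t len = strcspn(name, "\n");   /* ignore the trailing newline */
    if (len != strlen(want) || strncmp(name, want, len) != 0)
        return false;

    *start = s;
    *end = e;
    return true;
}

/* Scan an iomem-style stream (e.g. fopen("/proc/iomem", "r")) for the
 * first region called `want`. */
bool find_iomem_region(FILE *f, const char *want,
                       unsigned long long *start, unsigned long long *end)
{
    char line[256];

    while (fgets(line, sizeof line, f) != NULL)
        if (iomem_line_matches(line, want, start, end))
            return true;
    return false;
}
```

Nested entries in /proc/iomem are indented, which the leading space in the sscanf() format absorbs; the returned addresses are what kexec-tools would then use when laying out the blob.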

For within-guest kexec/kdump functionality, I agree that it is barking
mad.  However, we do see cloud operators interested in the idea so VM
administrators can look after their crashes themselves.

~Andrew

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 12:15       ` Eric W. Biederman
  2012-11-22 17:37         ` H. Peter Anvin
  2012-11-22 17:47         ` H. Peter Anvin
@ 2012-11-23  9:47         ` Daniel Kiper
  2012-11-23 20:24           ` Eric W. Biederman
  2 siblings, 1 reply; 35+ messages in thread
From: Daniel Kiper @ 2012-11-23  9:47 UTC (permalink / raw)
  To: ebiederm
  Cc: andrew.cooper3, hpa, jbeulich, konrad.wilk, mingo, tglx, x86,
	kexec, linux-kernel, virtualization, xen-devel

On Thu, Nov 22, 2012 at 04:15:48AM -0800, ebiederm@xmission.com wrote:
> Daniel Kiper <daniel.kiper@oracle.com> writes:
>
> > On Tue, Nov 20, 2012 at 08:40:39AM -0800, ebiederm@xmission.com wrote:
> >> Daniel Kiper <daniel.kiper@oracle.com> writes:
> >>
> >> > Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default
> >> > functions or require some changes in behavior of kexec/kdump generic code.
> >> > To cope with that problem kexec_ops struct was introduced. It allows
> >> > a developer to replace all or some functions and control some
> >> > functionality of kexec/kdump generic code.
> >> >
> >> > Default behavior of kexec/kdump generic code is not changed.
> >>
> >> Ick.
> >>
> >> > v2 - suggestions/fixes:
> >> >    - add comment for kexec_ops.crash_alloc_temp_store member
> >> >      (suggested by Konrad Rzeszutek Wilk),
> >> >    - simplify kexec_ops usage
> >> >      (suggested by Konrad Rzeszutek Wilk).
> >> >
> >> > Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
> >> > ---
> >> >  include/linux/kexec.h |   26 ++++++++++
> >> >  kernel/kexec.c        |  131 +++++++++++++++++++++++++++++++++++++------------
> >> >  2 files changed, 125 insertions(+), 32 deletions(-)
> >> >
> >> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> >> > index d0b8458..c8d0b35 100644
> >> > --- a/include/linux/kexec.h
> >> > +++ b/include/linux/kexec.h
> >> > @@ -116,7 +116,33 @@ struct kimage {
> >> >  #endif
> >> >  };
> >> >
> >> > +struct kexec_ops {
> >> > +	/*
> >> > +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not access
> >> > +	 * directly crash kernel memory area. In this situation they must
> >> > +	 * allocate memory outside of it and later move contents from temporary
> >> > +	 * storage to final resting places (usually done by relocate_kernel()).
> >> > +	 * Such behavior could be enforced by setting
> >> > +	 * crash_alloc_temp_store member to true.
> >> > +	 */
> >>
> >> Why in the world would Xen not be able to access crash kernel memory?
> >> As currently defined it is normal memory that the kernel chooses not to
> >> use.
> >>
> >> If relocate kernel can access that memory you definitely can access the
> >> memory so the comment does not make any sense.
> >
> > Crash kernel memory is reserved by Xen hypervisor and Xen hypervisor
> > only has access to it. dom0 does not have any mapping of this area.
> > However, relocate_kernel() has access to crash kernel memory
> > because it is executed by Xen hypervisor and whole machine
> > memory is identity mapped.
>
> This is all weird.  Doubly so since this code is multi-arch and you have
> a set of requirements no other arch has had.
>
> I recall that Xen uses kexec in a unique manner.  What is the hypervisor
> interface and how is it used?
>
> Is this for when the hypervisor crashes and we want a crash dump of
> that?

At boot, dom0 gets some information about the kexec/kdump configuration from
the Xen hypervisor (e.g. placement of the crash kernel area). Later, if you
call the kexec syscall, most things are done in the same way as on bare metal.
However, after placing the image in memory, the HYPERVISOR_kexec_op()
hypercall must be called to inform the hypervisor that the image is loaded
(the new hook machine_kexec_load is used for this; machine_kexec_unload is
used for unloading). Xen then establishes fixmap entries for the pages found
in page_list[] and returns control to dom0. If dom0 crashes or "kexec execute"
is invoked by the user, dom0 calls HYPERVISOR_kexec_op() to instruct the
hypervisor that the kexec/kdump image should be executed immediately. Xen
calls relocate_kernel() and everything runs as usual.
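The load/execute sequence described above can be modelled as a small state machine. This is a hypothetical, self-contained sketch: fake_kexec_op() stands in for HYPERVISOR_kexec_op(), and the command and state names are invented for illustration rather than copied from Xen's public kexec interface. No hypercall is made.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the dom0 <-> Xen kexec sequence.
 * Names are stand-ins, not Xen's interface. */
enum kexec_cmd { CMD_LOAD, CMD_UNLOAD, CMD_EXEC };
enum image_state { IMG_NONE, IMG_STAGED, IMG_REGISTERED, IMG_EXECUTED };

struct kexec_model {
    enum image_state state;
    int fixmapped_pages;    /* page_list[] entries Xen has mapped */
};

/* Stand-in for HYPERVISOR_kexec_op(): 0 on success, -1 if out of order. */
int fake_kexec_op(struct kexec_model *m, enum kexec_cmd cmd, int npages)
{
    assert(m != NULL);
    switch (cmd) {
    case CMD_LOAD:              /* issued from the machine_kexec_load hook */
        if (m->state != IMG_STAGED)
            return -1;
        m->fixmapped_pages = npages;
        m->state = IMG_REGISTERED;
        return 0;
    case CMD_UNLOAD:            /* issued from the machine_kexec_unload hook */
        if (m->state != IMG_REGISTERED)
            return -1;
        m->fixmapped_pages = 0;
        m->state = IMG_NONE;
        return 0;
    case CMD_EXEC:              /* dom0 crash or explicit execute */
        if (m->state != IMG_REGISTERED)
            return -1;
        m->state = IMG_EXECUTED; /* Xen would now call relocate_kernel() */
        return 0;
    }
    return -1;
}

/* dom0 side of a load: the generic syscall path copies the segments as on
 * bare metal, then the hook registers the image with the hypervisor. */
int dom0_kexec_load(struct kexec_model *m, int npages)
{
    m->state = IMG_STAGED;
    return fake_kexec_op(m, CMD_LOAD, npages);
}
```

The point the model makes is ordering: the hypervisor only accepts an execute request for an image that dom0 has both staged and explicitly registered.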

> >> > +	bool crash_alloc_temp_store;
> >> > +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
> >> > +						unsigned int order,
> >> > +						unsigned long limit);
> >> > +	void (*kimage_free_pages)(struct page *page);
> >> > +	unsigned long (*page_to_pfn)(struct page *page);
> >> > +	struct page *(*pfn_to_page)(unsigned long pfn);
> >> > +	unsigned long (*virt_to_phys)(volatile void *address);
> >> > +	void *(*phys_to_virt)(unsigned long address);
> >> > +	int (*machine_kexec_prepare)(struct kimage *image);
> >> > +	int (*machine_kexec_load)(struct kimage *image);
> >> > +	void (*machine_kexec_cleanup)(struct kimage *image);
> >> > +	void (*machine_kexec_unload)(struct kimage *image);
> >> > +	void (*machine_kexec_shutdown)(void);
> >> > +	void (*machine_kexec)(struct kimage *image);
> >> > +};
> >>
> >> Ugh.  This is a nasty abstraction.
> >>
> >> You are mixing and matching a bunch of things together here.
> >>
> >> If you need to override machine_kexec_xxx please do that on a per
> >> architecture basis.
> >
> > Yes, it is possible but I think that it is worth to do it at that
> > level because it could be useful for other archs too (e.g. Xen ARM port
> > is under development). Then we do not need to duplicate that functionality
> > in arch code. Additionally, Xen requires machine_kexec_load and
> > machine_kexec_unload hooks which are not available in current generic
> > kexec/kdump code.
>
>
> Let me be clear.  kexec_ops as you have implemented it is absolutely
> unacceptable.
>
> Your kexec_ops is not an abstraction but a hack that enshrines in stone
> implementation details.

Roger.

> >> Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
> >> phys_to_virt, and friends seem completely inappropriate.
> >
> > They are required in Xen PVOPS case. If we do not do that in that way
> > then we at least need to duplicate almost all generic kexec/kdump existing
> > code in arch-dependent files, not to mention that we need to capture
> > relevant syscalls and other things. I think that this is the wrong way.
>
> A different definition of phys_to_virt and page_to_pfn for one specific
> function is total nonsense.
>
> It may actually be better to have a completely different code path.
> This looks more like code abuse than code reuse.
>
> Successful code reuse depends upon not breaking the assumptions on which
> the code relies, or modifying the code so that the new modified
> assumptions are clear.  In this case you might as well define up as down
> for all of the sense kexec_ops makes.

Hmmm... Well, problem with above mentioned functions is that they work
on physical addresses. In Xen PVOPS (currently dom0 is PVOPS) they
are useless in kexec/kdump case. It means that physical addresses
must be converted to/from machine addresses which has a real meaning
in Xen PVOPS case. That is why those funtions were introduced.
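The pseudo-physical versus machine address distinction can be illustrated with a toy p2m table. In this sketch the table contents and the toy_* helper names are made up; in a real PV guest the pfn-to-mfn (p2m) list is maintained by the guest kernel and the reverse m2p table by Xen, and Linux has its own helpers for both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   ((uint64_t)1 << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1)

/* Toy p2m list: index = pseudo-physical frame (pfn), value = machine
 * frame (mfn). Values are invented; the frames need not be contiguous. */
static const uint64_t p2m[] = { 0x1a0, 0x033, 0x7ff, 0x100 };
#define P2M_ENTRIES (sizeof(p2m) / sizeof(p2m[0]))

/* Pseudo-physical -> machine: swap the frame, keep the in-page offset. */
uint64_t toy_phys_to_machine(uint64_t paddr)
{
    uint64_t pfn = paddr >> PAGE_SHIFT;

    assert(pfn < P2M_ENTRIES);
    return (p2m[pfn] << PAGE_SHIFT) | (paddr & OFFSET_MASK);
}

/* Machine -> pseudo-physical. Xen keeps a real m2p table; a linear scan
 * is enough here. Returns UINT64_MAX if the frame does not belong to us. */
uint64_t toy_machine_to_phys(uint64_t maddr)
{
    uint64_t mfn = maddr >> PAGE_SHIFT;

    for (size_t pfn = 0; pfn < P2M_ENTRIES; pfn++)
        if (p2m[pfn] == mfn)
            return ((uint64_t)pfn << PAGE_SHIFT) | (maddr & OFFSET_MASK);
    return UINT64_MAX;
}
```

Contiguous pseudo-physical pages (pfn 0 and 1 here) can land on unrelated machine frames, so an address computed with a plain virt_to_phys() is not something Xen's relocate_kernel() could use directly.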

> >> There may be a point to all of these but you are mixing and matching
> >> things badly.
> >
> > Do you wish to split this kexec_ops struct into something which
> > works with addresses and something which is responsible for
> > loading, unloading and executing kexec/kdump? I am able to change
> > that but I would like to know a bit about your vision first.
>
> My vision is that we should have code that makes sense.
>
> My suspicion is that what you want is a cousin of the existing kexec
> system call.  Perhaps what is needed is a flag to say use the firmware
> kexec system call.
>
> I absolutely do not understand what Xen is trying to do.  kexec by
> design should not require any firmware specific hooks.  kexec at this
> level should only need to care about the processor architecture.  Clearly
> what you are doing with Xen requires special hooks separate even from
> the normal paravirt hooks.  So I do not understand what you are trying to do.
>
> It needs to be clear from the code what is happening differently in the
> Xen case.  Otherwise the code is unmaintainable as no one will be able
> to understand it.

I agree. I could remove all machine_* hooks from kexec_ops and call Xen
specific functions from arch files. However, I need to add two new
machine calls, machine_kexec_load and machine_kexec_unload, in the same
manner as the existing machine_* calls. In general they could be used to
inform firmware (in this case Xen) that a kexec/kdump image has been loaded
or unloaded.
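One way the two proposed calls could follow the existing machine_* pattern is a weak no-op default that architecture or Xen code overrides with a strong definition at link time. A sketch under that assumption; generic_finish_load() and generic_unload() are hypothetical stand-ins for the relevant points in kernel/kexec.c, and the kernel spells the attribute __weak:

```c
#include <assert.h>
#include <stddef.h>

struct kimage;   /* opaque here; the real definition is in <linux/kexec.h> */

/* Weak no-op defaults: an arch or Xen file that defines a strong
 * machine_kexec_load()/machine_kexec_unload() replaces them at link time. */
__attribute__((weak)) int machine_kexec_load(struct kimage *image)
{
    (void)image;
    return 0;        /* nothing to tell the firmware */
}

__attribute__((weak)) void machine_kexec_unload(struct kimage *image)
{
    (void)image;
}

/* Hypothetical points in the generic code where the hooks would be called. */
int generic_finish_load(struct kimage *image)
{
    /* ... segments already copied into place ... */
    return machine_kexec_load(image);
}

void generic_unload(struct kimage *image)
{
    machine_kexec_unload(image);
    /* ... generic teardown would follow ... */
}
```

With this shape, bare-metal builds pay nothing, and the Xen-specific behaviour lives entirely in the overriding definitions rather than in a table of function pointers.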

kimage_alloc_pages, kimage_free_pages, page_to_pfn, pfn_to_page, virt_to_phys
and phys_to_virt are worse. If we cannot find a good way to replace them,
then we end up calling a Xen-specific version of kexec/kdump which would
contain a nearly full copy of the existing kexec/kdump code. Not good.

We could add some code to kernel/kexec.c which depends on CONFIG_XEN.
It could contain the above-mentioned functions, which would then be called
by the existing kexec code. To be honest, this is not nice. However, I hope
that we can find a better solution for that problem.

Daniel

* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  1:56               ` Andrew Cooper
@ 2012-11-23  9:53                 ` Jan Beulich
  2012-11-23 10:37                   ` Daniel Kiper
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2012-11-23  9:53 UTC (permalink / raw)
  To: Andrew Cooper, H. Peter Anvin
  Cc: x86, tglx, kexec, virtualization, xen-devel, Daniel Kiper,
	konrad.wilk, mingo, linux-kernel, Eric W. Biederman

>>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 23/11/2012 01:38, H. Peter Anvin wrote:
>> I still don't really get why it can't be isolated from dom0, which would 
> make more sense to me, even for a Xen crash.
>>
> 
> The crash region (as specified by crashkernel= on the Xen command line)
> is isolated from dom0.
> 
> dom0 (using the kexec utility etc) has the task of locating the Xen
> crash notes (using the kexec hypercall interface), constructing a binary
> blob containing kernel, initram and gubbins, and asking Xen to put this
> blob in the crash region (again, using the kexec hypercall interface).
> 
> I do not see how this is very much different from the native case
> currently (although please correct me if I am misinformed).  Linux has
> extra work to do by populating /proc/iomem with the Xen crash regions
> boot (so the kexec utility can reference their physical addresses when
> constructing the blob), and should just act as a conduit between the
> kexec system call and the kexec hypercall to load the blob.

But all of this _could_ be done completely independent of the
Dom0 kernel's kexec infrastructure (i.e. fully from user space,
invoking the necessary hypercalls through the privcmd driver).
It's just that parts of the kexec infrastructure can be re-used
(and hence that mechanism probably seemed the easier approach
to the implementer of the original kexec-on-Xen). If the kernel
folks dislike that re-use (quite understandably looking at how
much of it needs to be re-done), that shouldn't prevent us from
looking into the existing alternatives.

Jan


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 17:37         ` H. Peter Anvin
@ 2012-11-23  9:56           ` Jan Beulich
  2012-11-23 10:53             ` [Xen-devel] " Ian Campbell
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2012-11-23  9:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: andrew.cooper3, x86, tglx, kexec, virtualization, xen-devel,
	Daniel Kiper, konrad.wilk, mingo, linux-kernel,
	Eric W. Biederman

>>> On 22.11.12 at 18:37, "H. Peter Anvin" <hpa@zytor.com> wrote:
> I actually talked to Ian Jackson at LCE, and mentioned among other 
> things the bogosity of requiring a PUD page for three-level paging in 
> Linux -- a bogosity which has spread from Xen into native.  It's a page 
> wasted for no good reason, since it only contains 32 bytes worth of 
> data, *inherently*.  Furthermore, contrary to popular belief, it is 
> *not* a page table per se.
> 
> Ian told me: "I didn't know we did that, and we shouldn't have to." 
> Here we have suffered this overhead for at least six years, ...

Even the Xen kernel only needs the full page when running on a
64-bit hypervisor (now that we don't have a 32-bit hypervisor
anymore, that of course basically means always). But yes, I too
never liked this enforced over-allocation for native kernels (and
was surprised that it was allowed in at all).

Jan


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  9:53                 ` Jan Beulich
@ 2012-11-23 10:37                   ` Daniel Kiper
  2012-11-23 10:51                     ` [Xen-devel] " Ian Campbell
  2012-11-23 10:51                     ` Jan Beulich
  0 siblings, 2 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-23 10:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, H. Peter Anvin, x86, tglx, kexec, virtualization,
	xen-devel, konrad.wilk, mingo, linux-kernel, Eric W. Biederman

On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On 23/11/2012 01:38, H. Peter Anvin wrote:
> >> I still don't really get why it can't be isolated from dom0, which would
> > make more sense to me, even for a Xen crash.
> >>
> >
> > The crash region (as specified by crashkernel= on the Xen command line)
> > is isolated from dom0.
> >
> > dom0 (using the kexec utility etc) has the task of locating the Xen
> > crash notes (using the kexec hypercall interface), constructing a binary
> > blob containing kernel, initram and gubbins, and asking Xen to put this
> > blob in the crash region (again, using the kexec hypercall interface).
> >
> > I do not see how this is very much different from the native case
> > currently (although please correct me if I am misinformed).  Linux has
> > extra work to do by populating /proc/iomem with the Xen crash regions
> > boot (so the kexec utility can reference their physical addresses when
> > constructing the blob), and should just act as a conduit between the
> > kexec system call and the kexec hypercall to load the blob.
>
> But all of this _could_ be done completely independent of the
> Dom0 kernel's kexec infrastructure (i.e. fully from user space,
> invoking the necessary hypercalls through the privcmd driver).

No, this is impossible. kexec/kdump image lives in dom0 kernel memory
until execution. That is why privcmd driver itself is not a solution
in this case.

> It's just that parts of the kexec infrastructure can be re-used
> (and hence that mechanism probably seemed the easier approach
> to the implementer of the original kexec-on-Xen). If the kernel
> folks dislike that re-use (quite understandably looking at how
> much of it needs to be re-done), that shouldn't prevent us from
> looking into the existing alternatives.

This is the last-resort option. First, I think we should try to find
a good solution which reuses existing code as much as possible.

Daniel

* Re: [Xen-devel] [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23 10:37                   ` Daniel Kiper
@ 2012-11-23 10:51                     ` Ian Campbell
  2012-11-23 11:13                       ` Daniel Kiper
  2012-11-23 10:51                     ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Ian Campbell @ 2012-11-23 10:51 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: Jan Beulich, xen-devel, konrad.wilk, Andrew Cooper, x86, kexec,
	linux-kernel, virtualization, mingo, Eric W. Biederman,
	H. Peter Anvin, tglx

On Fri, 2012-11-23 at 10:37 +0000, Daniel Kiper wrote:
> On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> > >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > > The crash region (as specified by crashkernel= on the Xen command line)
> > > is isolated from dom0.
> > >[...]
> >
> > But all of this _could_ be done completely independent of the
> > Dom0 kernel's kexec infrastructure (i.e. fully from user space,
> > invoking the necessary hypercalls through the privcmd driver).
> 
> No, this is impossible. kexec/kdump image lives in dom0 kernel memory
> until execution.

Are you sure? I could have sworn they lived in the hypervisor owned
memory set aside by the crashkernel= parameter as Andy suggested.

Ian.


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23 10:37                   ` Daniel Kiper
  2012-11-23 10:51                     ` [Xen-devel] " Ian Campbell
@ 2012-11-23 10:51                     ` Jan Beulich
  2012-11-23 11:08                       ` Daniel Kiper
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2012-11-23 10:51 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: Andrew Cooper, x86, tglx, kexec, virtualization, xen-devel,
	konrad.wilk, mingo, linux-kernel, Eric W. Biederman,
	H. Peter Anvin

>>> On 23.11.12 at 11:37, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
>> >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> > On 23/11/2012 01:38, H. Peter Anvin wrote:
>> >> I still don't really get why it can't be isolated from dom0, which would
>> > make more sense to me, even for a Xen crash.
>> >>
>> >
>> > The crash region (as specified by crashkernel= on the Xen command line)
>> > is isolated from dom0.
>> >
>> > dom0 (using the kexec utility etc) has the task of locating the Xen
>> > crash notes (using the kexec hypercall interface), constructing a binary
>> > blob containing kernel, initram and gubbins, and asking Xen to put this
>> > blob in the crash region (again, using the kexec hypercall interface).
>> >
>> > I do not see how this is very much different from the native case
>> > currently (although please correct me if I am misinformed).  Linux has
>> > extra work to do by populating /proc/iomem with the Xen crash regions
>> > boot (so the kexec utility can reference their physical addresses when
>> > constructing the blob), and should just act as a conduit between the
>> > kexec system call and the kexec hypercall to load the blob.
>>
>> But all of this _could_ be done completely independent of the
>> Dom0 kernel's kexec infrastructure (i.e. fully from user space,
>> invoking the necessary hypercalls through the privcmd driver).
> 
> No, this is impossible. kexec/kdump image lives in dom0 kernel memory
> until execution. That is why privcmd driver itself is not a solution
> in this case.

Even if so, there's no fundamental reason why that kernel image
can't be put into Xen controlled space instead.

Jan


* Re: [Xen-devel] [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  9:56           ` Jan Beulich
@ 2012-11-23 10:53             ` Ian Campbell
  0 siblings, 0 replies; 35+ messages in thread
From: Ian Campbell @ 2012-11-23 10:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: H. Peter Anvin, xen-devel, konrad.wilk, Andrew Cooper,
	Daniel Kiper, x86, kexec, linux-kernel, virtualization, mingo,
	Eric W. Biederman, tglx

On Fri, 2012-11-23 at 09:56 +0000, Jan Beulich wrote:
> >>> On 22.11.12 at 18:37, "H. Peter Anvin" <hpa@zytor.com> wrote:
> > I actually talked to Ian Jackson at LCE, and mentioned among other 

That was me actually (this happens surprisingly often ;-)).

> > things the bogosity of requiring a PUD page for three-level paging in 
> > Linux -- a bogosity which has spread from Xen into native.  It's a page 
> > wasted for no good reason, since it only contains 32 bytes worth of 
> > data, *inherently*.  Furthermore, contrary to popular belief, it is 
> > *not* a page table per se.
> > 
> > Ian told me: "I didn't know we did that, and we shouldn't have to." 
> > Here we have suffered this overhead for at least six years, ...
> 
> Even the Xen kernel only needs the full page when running on a
> 64-bit hypervisor (now that we don't have a 32-bit hypervisor
> anymore, that of course basically means always).

I took an, admittedly very brief, look at it on the plane on the way
home and it seems like the requirement for a complete page on the
pvops-xen side comes from the !SHARED_KERNEL_PMD stuff (so still a Xen
related thing). This requires a struct page for the list_head it
contains (see pgd_list_add et al) rather than because of the use of the
page as a pgd as such.

>  But yes, I too
> never liked this enforced over-allocation for native kernels (and
> was surprised that it was allowed in at all).

Completely agreed.

I did wonder if just doing something like:
-	pgd = (pgd_t *)__get_free_page(PGALLOC_GFP);
+	if (SHARED_KERNEL_PMD)
+		pgd = some_appropriate_allocation_primitive(sizeof(*pgd));
+	else
+		pgd = (pgd_t *)__get_free_page(PGALLOC_GFP);

to pgd_alloc (+ the equivalent for the error path & free case, create
helper funcs as desired etc) would be sufficient to remove the
over-allocation for the native case, but I haven't had time to properly
investigate.

Alternatively push the allocation down into paravirt_pgd_alloc to
taste :-/
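The saving the diff above aims for is easy to quantify: a PAE top-level table holds 4 entries of 8 bytes, i.e. 32 bytes, against a 4096-byte page. A userspace model of the conditional allocation, assuming aligned_alloc() as a stand-in for the unspecified allocation primitive and a plain bool for SHARED_KERNEL_PMD (a kernel version might use a slab cache instead); PAE requires the top-level table to be 32-byte aligned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE        4096u
#define PAE_PGD_ENTRIES  4u
#define PAE_PGD_BYTES    (PAE_PGD_ENTRIES * sizeof(uint64_t))  /* 32 */

/* Userspace model of the conditional pgd allocation.
 * shared_kernel_pmd stands in for SHARED_KERNEL_PMD, aligned_alloc() for
 * the unspecified allocation primitive. The small case is 32-byte aligned
 * because a PAE top-level table must be. */
uint64_t *model_pgd_alloc(bool shared_kernel_pmd, size_t *bytes)
{
    size_t size = shared_kernel_pmd ? PAE_PGD_BYTES : PAGE_SIZE;

    assert(bytes != NULL);
    *bytes = size;
    /* aligned_alloc() wants size to be a multiple of the alignment;
     * here alignment == size in both cases. */
    return aligned_alloc(size, size);
}
```

In the shared-kernel-PMD case this saves 4096 - 32 = 4064 bytes per pgd; the full page is kept only where the !SHARED_KERNEL_PMD path needs the struct page for its pgd list.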

Ian.


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23 10:51                     ` Jan Beulich
@ 2012-11-23 11:08                       ` Daniel Kiper
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-23 11:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, x86, tglx, kexec, virtualization, xen-devel,
	konrad.wilk, mingo, linux-kernel, Eric W. Biederman,
	H. Peter Anvin

On Fri, Nov 23, 2012 at 10:51:55AM +0000, Jan Beulich wrote:
> >>> On 23.11.12 at 11:37, Daniel Kiper <daniel.kiper@oracle.com> wrote:
> > On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> >> >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> > On 23/11/2012 01:38, H. Peter Anvin wrote:
> >> >> I still don't really get why it can't be isolated from dom0, which would
> >> > make more sense to me, even for a Xen crash.
> >> >>
> >> >
> >> > The crash region (as specified by crashkernel= on the Xen command line)
> >> > is isolated from dom0.
> >> >
> >> > dom0 (using the kexec utility etc) has the task of locating the Xen
> >> > crash notes (using the kexec hypercall interface), constructing a binary
> >> > blob containing kernel, initram and gubbins, and asking Xen to put this
> >> > blob in the crash region (again, using the kexec hypercall interface).
> >> >
> >> > I do not see how this is very much different from the native case
> >> > currently (although please correct me if I am misinformed).  Linux has
> >> > extra work to do by populating /proc/iomem with the Xen crash regions
> >> > at boot (so the kexec utility can reference their physical addresses when
> >> > constructing the blob), and should just act as a conduit between the
> >> > kexec system call and the kexec hypercall to load the blob.
> >>
> >> But all of this _could_ be done completely independent of the
> >> Dom0 kernel's kexec infrastructure (i.e. fully from user space,
> >> invoking the necessary hypercalls through the privcmd driver).
> >
> > No, this is impossible. kexec/kdump image lives in dom0 kernel memory
> > until execution. That is why privcmd driver itself is not a solution
> > in this case.
>
> Even if so, there's no fundamental reason why that kernel image
> can't be put into Xen controlled space instead.

Yep, but we must change the Xen kexec interface and/or its behavior first.
If we take that option, then we could also move almost everything needed
from the dom0 kernel to Xen. That way we could simplify the Linux kernel
kexec/kdump infrastructure needed to run on Xen.

Daniel
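Andrew's description above has dom0 publishing the Xen crash region in /proc/iomem so that the kexec utility can find its physical address. Below is a minimal userspace sketch of how a tool might pick that range out of /proc/iomem. It assumes the usual `start-end : name` line layout with hexadecimal, inclusive bounds, and the resource name "Crash kernel". parse_iomem_line() and find_iomem_range() are hypothetical helpers, not kexec-tools code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct iomem_range {
	uint64_t start;
	uint64_t end;   /* inclusive, as printed by /proc/iomem */
};

/*
 * Hypothetical helper: parse one line such as
 * "  01000000-08ffffff : Crash kernel" and return 1 if its name
 * equals @name, filling @r with the bounds; return 0 otherwise.
 */
static int parse_iomem_line(const char *line, const char *name,
			    struct iomem_range *r)
{
	uint64_t start, end;
	int consumed = 0;

	assert(line && name && r);
	if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %n",
		   &start, &end, &consumed) != 2 || consumed == 0)
		return 0;       /* not a "start-end : name" line */

	/* Compare the label (without a trailing newline) with @name. */
	const char *label = line + consumed;
	size_t len = strcspn(label, "\n");
	if (len != strlen(name) || strncmp(label, name, len) != 0)
		return 0;

	r->start = start;
	r->end = end;
	return 1;
}

/* Scan an already-opened /proc/iomem stream for the first match. */
static int find_iomem_range(FILE *f, const char *name, struct iomem_range *r)
{
	char line[256];

	while (fgets(line, sizeof(line), f))
		if (parse_iomem_line(line, name, r))
			return 1;
	return 0;
}
```

Nested resources are indented in /proc/iomem, which is why the format string skips leading whitespace.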


* Re: [Xen-devel] [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23 10:51                     ` [Xen-devel] " Ian Campbell
@ 2012-11-23 11:13                       ` Daniel Kiper
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Kiper @ 2012-11-23 11:13 UTC
  To: Ian Campbell
  Cc: Jan Beulich, xen-devel, konrad.wilk, Andrew Cooper, x86, kexec,
	linux-kernel, virtualization, mingo, Eric W. Biederman,
	H. Peter Anvin, tglx

On Fri, Nov 23, 2012 at 10:51:08AM +0000, Ian Campbell wrote:
> On Fri, 2012-11-23 at 10:37 +0000, Daniel Kiper wrote:
> > On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> > > >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > > > The crash region (as specified by crashkernel= on the Xen command line)
> > > > is isolated from dom0.
> > > >[...]
> > >
> > > But all of this _could_ be done completely independent of the
> > > Dom0 kernel's kexec infrastructure (i.e. fully from user space,
> > > invoking the necessary hypercalls through the privcmd driver).
> >
> > No, this is impossible. kexec/kdump image lives in dom0 kernel memory
> > until execution.
>
> Are you sure? I could have sworn they lived in the hypervisor owned
> memory set aside by the crashkernel= parameter as Andy suggested.

I am sure. It is moved to its final resting place when
relocate_kernel() is called by the hypervisor.

Daniel
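The point above is that the image pages stay scattered in dom0 memory until relocate_kernel() copies them into place. What follows is a simplified userspace model of that copy loop. It uses the entry flags IND_DESTINATION, IND_INDIRECTION, IND_DONE and IND_SOURCE from include/linux/kexec.h, and simulated page frames stand in for physical memory. The real x86 assembly does more (page-table setup, and on x86_64 page swapping), none of which is reproduced. relocate_model() and phys_to_ptr() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)
#define NPAGES     8

/* Entry flags, same values as in include/linux/kexec.h. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

typedef uint64_t kimage_entry_t;

/* Simulated physical memory: page frame N is mem[N]. */
static uint64_t mem[NPAGES][PAGE_SIZE / sizeof(uint64_t)];

static void *phys_to_ptr(uint64_t addr)
{
	return (unsigned char *)mem[addr >> PAGE_SHIFT] + (addr & (PAGE_SIZE - 1));
}

/*
 * Walk the entry list starting at @head and copy each source page to
 * the current destination, advancing it one page at a time.
 * The list must be terminated by an IND_DONE entry.
 */
static void relocate_model(kimage_entry_t head)
{
	kimage_entry_t *ptr = NULL;
	kimage_entry_t entry = head;
	uint64_t dest = 0;

	/* As in the kernel's entry list, the head is an indirection entry. */
	assert(head & IND_INDIRECTION);

	for (;;) {
		uint64_t addr = entry & ~(PAGE_SIZE - 1);

		if (entry & IND_DESTINATION) {
			dest = addr;                    /* new copy target */
		} else if (entry & IND_INDIRECTION) {
			ptr = phys_to_ptr(addr);        /* next page of entries */
		} else if (entry & IND_SOURCE) {
			memcpy(phys_to_ptr(dest), phys_to_ptr(addr), PAGE_SIZE);
			dest += PAGE_SIZE;
		} else if (entry & IND_DONE) {
			break;
		}
		entry = *ptr++;
	}
}
```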


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-23  9:47         ` Daniel Kiper
@ 2012-11-23 20:24           ` Eric W. Biederman
  0 siblings, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2012-11-23 20:24 UTC
  To: Daniel Kiper
  Cc: andrew.cooper3, hpa, jbeulich, konrad.wilk, mingo, tglx, x86,
	kexec, linux-kernel, virtualization, xen-devel

Daniel Kiper <daniel.kiper@oracle.com> writes:

> On Thu, Nov 22, 2012 at 04:15:48AM -0800, ebiederm@xmission.com wrote:
>>
>> Is this for when the hypervisor crashes and we want a crash dump of
>> that?
>
> dom0 gets some information about the kexec/kdump configuration from the Xen
> hypervisor at boot (e.g. placement of the crash kernel area). Later, if you call
> the kexec syscall, most things are done in the same way as on bare metal. However,
> after placing the image in memory, the HYPERVISOR_kexec_op() hypercall must be
> called to inform the hypervisor that the image is loaded (the new hook
> machine_kexec_load is used for this; machine_kexec_unload is used for unload).
> Then Xen establishes a fixmap for the pages found in page_list[] and returns
> control to dom0. If dom0 crashes or "kexec execute" is used by the user, then
> dom0 calls HYPERVISOR_kexec_op() to instruct the hypervisor that the kexec/kdump
> image should be executed immediately. Xen calls relocate_kernel() and everything
> runs as usual.


Close
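The load/unload/execute sequence in the quoted paragraph can be modelled in a few lines of userspace C. Everything here is hypothetical: fake_kexec_op() stands in for HYPERVISOR_kexec_op(), and the *_model() functions only mirror the roles of the proposed machine_kexec_load()/machine_kexec_unload() hooks and of the execute path. No real hypercall numbers or argument layouts are implied.

```c
#include <assert.h>

/* Hypothetical stand-ins for kexec hypercall sub-operations. */
enum fake_kexec_op { FAKE_OP_LOAD, FAKE_OP_UNLOAD, FAKE_OP_EXEC };

/* What the simulated hypervisor currently knows about the image. */
struct fake_hv {
	int image_loaded;   /* set by LOAD, cleared by UNLOAD */
	int executed;       /* set by a successful EXEC */
};

/* Simulated HYPERVISOR_kexec_op(): 0 on success, -1 on error. */
static int fake_kexec_op(struct fake_hv *hv, enum fake_kexec_op op)
{
	assert(hv);

	switch (op) {
	case FAKE_OP_LOAD:
		hv->image_loaded = 1;   /* Xen would map page_list[] here */
		return 0;
	case FAKE_OP_UNLOAD:
		hv->image_loaded = 0;
		return 0;
	case FAKE_OP_EXEC:
		if (!hv->image_loaded)
			return -1;      /* no image to jump to */
		hv->executed = 1;       /* Xen would call relocate_kernel() */
		return 0;
	}
	return -1;
}

/* Called after the image is placed in memory. */
static int machine_kexec_load_model(struct fake_hv *hv)
{
	return fake_kexec_op(hv, FAKE_OP_LOAD);
}

/* Called when the image is unloaded. */
static int machine_kexec_unload_model(struct fake_hv *hv)
{
	return fake_kexec_op(hv, FAKE_OP_UNLOAD);
}

/* On a dom0 crash or "kexec execute", ask the hypervisor to run the image. */
static int machine_kexec_model(struct fake_hv *hv)
{
	return fake_kexec_op(hv, FAKE_OP_EXEC);
}
```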

>> Successful code reuse depends upon not breaking the assumptions on which
>> the code relies, or modifying the code so that the new modified
>> assumptions are clear.  In this case you might as well define up as down
>> for all of the sense kexec_ops makes.
>
> Hmmm... Well, the problem with the above-mentioned functions is that they work
> on physical addresses. In Xen PVOPS (dom0 is currently PVOPS) they are useless
> in the kexec/kdump case, because physical addresses must be converted to/from
> machine addresses, which are what actually has meaning in the Xen PVOPS case.
> That is why those functions were introduced.

Agreed operating on addresses that are relevant to the operation at hand
makes sense.
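In a Xen PV guest, a guest pseudo-physical frame number (pfn) has to be translated to a machine frame number (mfn) through the physical-to-machine (p2m) table before the hypervisor can use the address; Xen also maintains the reverse machine-to-physical (m2p) table. Below is a toy sketch of both directions for byte addresses. It assumes a small flat array p2m[] and the hypothetical helpers phys_to_machine_model() and machine_to_phys_model(). The real dom0 uses the kernel's own p2m helpers (such as pfn_to_mfn()), which are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define OFFSET_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define INVALID_ADDR UINT64_MAX

/* Toy p2m table: index = guest pfn, value = machine frame number (mfn). */
static const uint64_t p2m[] = { 0x1a3, 0x07f, 0x2c0, 0x001 };
#define NR_PFNS (sizeof(p2m) / sizeof(p2m[0]))

/* Hypothetical: guest pseudo-physical address -> machine address. */
static uint64_t phys_to_machine_model(uint64_t paddr)
{
	uint64_t pfn = paddr >> PAGE_SHIFT;

	if (pfn >= NR_PFNS)
		return INVALID_ADDR;    /* no mapping known for this pfn */
	/* The frame number changes; the offset within the page is kept. */
	return (p2m[pfn] << PAGE_SHIFT) | (paddr & OFFSET_MASK);
}

/* Hypothetical reverse lookup; a linear scan is enough for a toy table. */
static uint64_t machine_to_phys_model(uint64_t maddr)
{
	uint64_t mfn = maddr >> PAGE_SHIFT;

	for (uint64_t pfn = 0; pfn < NR_PFNS; pfn++)
		if (p2m[pfn] == mfn)
			return (pfn << PAGE_SHIFT) | (maddr & OFFSET_MASK);
	return INVALID_ADDR;
}
```

For example, guest address 0x1234 lies in pfn 1, which maps to mfn 0x7f, so its machine address is 0x7f234.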

>> >> There may be a point to all of these but you are mixing and matching
>> >> things badly.
>> >
>> > Do you wish to split this kexec_ops struct into something which
>> > works with addresses and something which is responsible for
>> > loading, unloading and executing kexec/kdump? I am able to change
>> > that but I would like to know a bit about your vision first.
>>
>> My vision is that we should have code that makes sense.
>>
>> My suspicion is that what you want is a cousin of the existing kexec
>> system call.  Perhaps what is needed is a flag to say use the firmware
>> kexec system call.
>>
>> I absolutely do not understand what Xen is trying to do.  kexec by
>> design should not require any firmware specific hooks.  kexec at this
>> level should only need to care about the processor architecture.  Clearly
>> what you are doing with Xen requires special hooks separate even from
>> the normal paravirt hooks.  So I do not understand what you are trying to do.
>>
>> It needs to be clear from the code what is happening differently in the
>> Xen case.  Otherwise the code is unmaintainable as no one will be able
>> to understand it.
>
> I agree. I could remove all machine_* hooks from kexec_ops and call Xen
> specific functions from arch files. However, I need to add two new
> machine calls, machine_kexec_load and machine_kexec_unload, in the same
> manner as existing machine_* calls. In general they could be used to inform
> firmware (in this case Xen) that kexec/kdump image is loaded.
>
> kimage_alloc_pages, kimage_free_pages, page_to_pfn, pfn_to_page, virt_to_phys
> and phys_to_virt are worse. If we cannot find a good way to replace them,
> then we end up calling a Xen-specific version of kexec/kdump which would
> contain a nearly full copy of the existing kexec/kdump code. Not good.
>
> We could add some code to kernel/kexec.c which depends on CONFIG_XEN.
> It could contain above mentioned functions which later will be called
> by existing kexec code. This is not nice to be honest. However, I hope
> that we could find better solution for that problem.

Since in the Xen case you are not performing a normal kexec or kdump, if
you are going to continue to use the kexec system call then another flag
(like the KEXEC_ON_CRASH flag) should be used.

The userspace flag should be something like KEXEC_HYPERVISOR.  From
there we can have a generic interface that feeds into whatever the Xen
infrastructure is.  And if any other hypervisors implement kexec like
functionality it could feed into them if we so choose.

When the choice is clearly between a Linux-only kexec and a hypervisor-level
kexec, using different functions to understand the target addresses
makes sense.

And of course /sbin/kexec can easily take an additional flag to say load
the kexec image to the hypervisor.

Eric
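The flag-based approach suggested above amounts to an explicit dispatch on the kexec_load() flags. The sketch below assumes a hypothetical KEXEC_HYPERVISOR bit whose value is invented for illustration and is not a real flag. KEXEC_ON_CRASH is 0x00000001, as in the uapi header. choose_target() and the target names are also hypothetical; the point is only that userspace selects the hypervisor path explicitly, instead of having it hidden behind kexec_ops.

```c
#include <assert.h>

#define KEXEC_ON_CRASH    0x00000001  /* value as in <linux/kexec.h> */
#define KEXEC_HYPERVISOR  0x00000100  /* hypothetical: not a real flag */

enum load_target {
	TARGET_NATIVE,          /* normal Linux kexec image */
	TARGET_NATIVE_CRASH,    /* Linux kdump image in the crashkernel= area */
	TARGET_HV,              /* image handed to the hypervisor */
	TARGET_HV_CRASH,        /* crash image handed to the hypervisor */
	TARGET_INVALID,
};

/* Hypothetical: pick the loader path from kexec_load()-style flags. */
static enum load_target choose_target(unsigned long flags, int hv_present)
{
	int crash = !!(flags & KEXEC_ON_CRASH);

	if (flags & KEXEC_HYPERVISOR) {
		if (!hv_present)
			return TARGET_INVALID;  /* no hypervisor to load into */
		return crash ? TARGET_HV_CRASH : TARGET_HV;
	}
	return crash ? TARGET_NATIVE_CRASH : TARGET_NATIVE;
}
```

On the userspace side, /sbin/kexec would then only need to OR the extra bit into the flags it already passes.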


* Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
  2012-11-22 22:26             ` H. Peter Anvin
@ 2014-03-31 10:50               ` Petr Tesarik
  0 siblings, 0 replies; 35+ messages in thread
From: Petr Tesarik @ 2014-03-31 10:50 UTC
  To: H. Peter Anvin
  Cc: Andrew Cooper, xen-devel, konrad.wilk, Daniel Kiper, x86, kexec,
	linux-kernel, virtualization, mingo, Eric W. Biederman, jbeulich,
	tglx

On Thu, 22 Nov 2012 14:26:10 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> Bullshit.  This should be a separate domain.

Thanks for top-posting, hpa...

> Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
> >On 22/11/12 17:47, H. Peter Anvin wrote:
> >> The other thing that should be considered here is how utterly
> >> preposterous the notion of doing in-guest crash dumping is in a
> >> system that contains a hypervisor.  The reason for kdump is that on
> >> bare metal there are no other options, but in a hypervisor system the
> >> right thing should be for the hypervisor to do the dump (possibly
> >> spawning a clean I/O domain if the I/O domain is necessary to access
> >> the media.)
> >>
> >> There is absolutely no reason to have a crashkernel sitting around in
> >> each guest, consuming memory, and possibly getting corrupted.
> >>
> >> 	-hpa
> >>
> >
> >I agree that regular guests should not be using kexec/kdump.
> >However, this patch series is required for allowing a pvops kernel to
> >be a crash kernel for Xen, which is very important from dom0/Xen's
> >point of view.

In fact, a normal kernel is used for dumping, so it can handle both
Dom0 crashes _and_ hypervisor crashes. If you wanted to address
hypervisor crashes, you'd have to allocate some space for that, too, so
you may view this "madness" as a way to conserve resources.

The memory area is reserved by the Xen hypervisor, and only the extents
are passed down to the Dom0 kernel. In other words, there is indeed no
physical mapping for this area.

Having said that, I see no reason why that physical mapping cannot be
created if it is needed.
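Since only the extents of the Xen-reserved area reach dom0, whatever loads the crash image still has to confine every segment to those extents. The native kernel applies a similar check to KEXEC_ON_CRASH loads against the crashkernel= reservation. Below is a minimal sketch of such a bounds check. It assumes inclusive end addresses (as /proc/iomem prints them), and struct extent, struct segment and segments_fit() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t start;
	uint64_t end;       /* inclusive */
};

struct segment {
	uint64_t mem;       /* destination physical address */
	uint64_t memsz;     /* bytes occupied at the destination */
};

/* Hypothetical: return 1 if every segment lies within @region, else 0. */
static int segments_fit(const struct segment *seg, size_t nr,
			const struct extent *region)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t mstart = seg[i].mem;
		uint64_t mend;

		if (seg[i].memsz == 0)
			continue;               /* empty segment occupies nothing */
		mend = mstart + seg[i].memsz - 1;
		if (mend < mstart)
			return 0;               /* address arithmetic wrapped */
		if (mstart < region->start || mend > region->end)
			return 0;
	}
	return 1;
}
```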

Petr T


end of thread, other threads:[~2014-03-31 10:50 UTC | newest]

Thread overview: 35+ messages
2012-11-20 15:04 [PATCH v2 00/11] xen: Initial kexec/kdump implementation Daniel Kiper
2012-11-20 15:04 ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Daniel Kiper
2012-11-20 15:04   ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Daniel Kiper
2012-11-20 15:04     ` [PATCH v2 03/11] xen: Introduce architecture independent data for kexec/kdump Daniel Kiper
2012-11-20 15:04       ` [PATCH v2 04/11] x86/xen: Introduce architecture dependent " Daniel Kiper
2012-11-20 15:04         ` [PATCH v2 05/11] x86/xen: Register resources required by kexec-tools Daniel Kiper
2012-11-20 15:04           ` [PATCH v2 06/11] x86/xen: Add i386 kexec/kdump implementation Daniel Kiper
2012-11-20 15:04             ` [PATCH v2 07/11] x86/xen: Add x86_64 " Daniel Kiper
2012-11-20 15:04               ` [PATCH v2 08/11] x86/xen: Add kexec/kdump makefile rules Daniel Kiper
2012-11-20 15:04                 ` [PATCH v2 09/11] x86/xen/enlighten: Add init and crash kexec/kdump hooks Daniel Kiper
2012-11-20 15:04                   ` [PATCH v2 10/11] drivers/xen: Export vmcoreinfo through sysfs Daniel Kiper
2012-11-20 15:04                     ` [PATCH v2 11/11] x86: Add Xen kexec control code size check to linker script Daniel Kiper
2012-11-20 15:52     ` [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE Jan Beulich
2012-11-20 16:40   ` [PATCH v2 01/11] kexec: introduce kexec_ops struct Eric W. Biederman
2012-11-21 10:52     ` Daniel Kiper
2012-11-22 12:15       ` Eric W. Biederman
2012-11-22 17:37         ` H. Peter Anvin
2012-11-23  9:56           ` Jan Beulich
2012-11-23 10:53             ` [Xen-devel] " Ian Campbell
2012-11-22 17:47         ` H. Peter Anvin
2012-11-22 18:07           ` Andrew Cooper
2012-11-22 22:26             ` H. Peter Anvin
2014-03-31 10:50               ` Petr Tesarik
2012-11-23  0:12           ` Andrew Cooper
2012-11-23  1:34             ` H. Peter Anvin
2012-11-23  1:38             ` H. Peter Anvin
2012-11-23  1:56               ` Andrew Cooper
2012-11-23  9:53                 ` Jan Beulich
2012-11-23 10:37                   ` Daniel Kiper
2012-11-23 10:51                     ` [Xen-devel] " Ian Campbell
2012-11-23 11:13                       ` Daniel Kiper
2012-11-23 10:51                     ` Jan Beulich
2012-11-23 11:08                       ` Daniel Kiper
2012-11-23  9:47         ` Daniel Kiper
2012-11-23 20:24           ` Eric W. Biederman
