* [PATCH 00/13] xen: support pv-domains larger than 512GB
@ 2015-02-18  6:51 Juergen Gross
  2015-02-18  6:51 ` [PATCH 01/13] xen: sync with xen header Juergen Gross
                   ` (12 more replies)
  0 siblings, 13 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Support 64 bit pv-domains with more than 512GB of memory.

Tested with a 64 bit dom0 on machines with 8GB and 1TB of memory and
with a 32 bit dom0 on an 8GB machine. Conflicts between the E820 map and
hypervisor populated memory areas have been tested via a fake reserved
area in the E820 map on the 8GB machine.

Juergen Gross (13):
  xen: sync with xen header
  xen: anchor linear p2m list in shared info structure
  xen: eliminate scalability issues from initial mapping setup
  xen: move static e820 map to global scope
  xen: simplify xen_set_identity_and_remap() by using global variables
  xen: detect pre-allocated memory interfering with e820 map
  xen: find unused contiguous memory area
  xen: add service function to copy physical memory areas
  xen: check for kernel memory conflicting with memory layout
  xen: check pre-allocated page tables for conflict with memory map
  xen: move initrd away from e820 non-ram area
  xen: if p2m list located in to be remapped region delay remapping
  xen: allow more than 512 GB of RAM for 64 bit pv-domains

 Documentation/kernel-parameters.txt  |   7 +
 arch/x86/include/asm/xen/interface.h |  96 ++++++-
 arch/x86/include/asm/xen/page.h      |   4 -
 arch/x86/xen/Kconfig                 |  31 +-
 arch/x86/xen/enlighten.c             |  22 ++
 arch/x86/xen/mmu.c                   | 141 ++++++++-
 arch/x86/xen/p2m.c                   |  23 +-
 arch/x86/xen/setup.c                 | 536 ++++++++++++++++++++++++++++-------
 arch/x86/xen/xen-head.S              |   2 +
 arch/x86/xen/xen-ops.h               |   4 +
 10 files changed, 717 insertions(+), 149 deletions(-)

-- 
2.1.4



* [PATCH 01/13] xen: sync with xen header
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-18  6:51 ` [PATCH 02/13] xen: anchor linear p2m list in shared info structure Juergen Gross
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Use the newest header from the xen tree to get some new structure
layouts.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/xen/interface.h | 96 ++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index 3400dba..3b88eea 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -3,12 +3,38 @@
  *
  * Guest OS interface to x86 Xen.
  *
- * Copyright (c) 2004, K A Fraser
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
  */
 
 #ifndef _ASM_X86_XEN_INTERFACE_H
 #define _ASM_X86_XEN_INTERFACE_H
 
+/*
+ * XEN_GUEST_HANDLE represents a guest pointer, when passed as a field
+ * in a struct in memory.
+ * XEN_GUEST_HANDLE_PARAM represents a guest pointer, when passed as a
+ * hypercall argument.
+ * XEN_GUEST_HANDLE_PARAM and XEN_GUEST_HANDLE are the same on X86 but
+ * they might not be on other architectures.
+ */
 #ifdef __XEN__
 #define __DEFINE_GUEST_HANDLE(name, type) \
     typedef struct { type *p; } __guest_handle_ ## name
@@ -88,13 +114,16 @@ DEFINE_GUEST_HANDLE(xen_ulong_t);
  * start of the GDT because some stupid OSes export hard-coded selector values
  * in their ABI. These hard-coded values are always near the start of the GDT,
  * so Xen places itself out of the way, at the far end of the GDT.
+ *
+ * NB The LDT is set using the MMUEXT_SET_LDT op of HYPERVISOR_mmuext_op
  */
 #define FIRST_RESERVED_GDT_PAGE  14
 #define FIRST_RESERVED_GDT_BYTE  (FIRST_RESERVED_GDT_PAGE * 4096)
 #define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)
 
 /*
- * Send an array of these to HYPERVISOR_set_trap_table()
+ * Send an array of these to HYPERVISOR_set_trap_table().
+ * Terminate the array with a sentinel entry, with traps[].address==0.
  * The privilege level specifies which modes may enter a trap via a software
  * interrupt. On x86/64, since rings 1 and 2 are unavailable, we allocate
  * privilege levels as follows:
@@ -118,10 +147,41 @@ struct trap_info {
 DEFINE_GUEST_HANDLE_STRUCT(trap_info);
 
 struct arch_shared_info {
-    unsigned long max_pfn;                  /* max pfn that appears in table */
-    /* Frame containing list of mfns containing list of mfns containing p2m. */
-    unsigned long pfn_to_mfn_frame_list_list;
-    unsigned long nmi_reason;
+	/*
+	 * Number of valid entries in the p2m table(s) anchored at
+	 * pfn_to_mfn_frame_list_list and/or p2m_vaddr.
+	 */
+	unsigned long max_pfn;
+	/*
+	 * Frame containing list of mfns containing list of mfns containing p2m.
+	 * A value of 0 indicates it has not yet been set up, ~0 indicates it
+	 * has been set to invalid e.g. due to the p2m being too large for the
+	 * 3-level p2m tree. In this case the linear mapped p2m list anchored
+	 * at p2m_vaddr is to be used.
+	 */
+	xen_pfn_t pfn_to_mfn_frame_list_list;
+	unsigned long nmi_reason;
+	/*
+	 * Following three fields are valid if p2m_cr3 contains a value
+	 * different from 0.
+	 * p2m_cr3 is the root of the address space where p2m_vaddr is valid.
+	 * p2m_cr3 is in the same format as a cr3 value in the vcpu register
+	 * state and holds the folded machine frame number (via xen_pfn_to_cr3)
+	 * of a L3 or L4 page table.
+	 * p2m_vaddr holds the virtual address of the linear p2m list. All
+	 * entries in the range [0...max_pfn[ are accessible via this pointer.
+	 * p2m_generation will be incremented by the guest before and after each
+	 * change of the mappings of the p2m list. p2m_generation starts at 0
+	 * and a value with the least significant bit set indicates that a
+	 * mapping update is in progress. This allows guest external software
+	 * (e.g. in Dom0) to verify that read mappings are consistent and
+	 * whether they have changed since the last check.
+	 * Modifying a p2m element in the linear p2m list is allowed via an
+	 * atomic write only.
+	 */
+	unsigned long p2m_cr3;		/* cr3 value of the p2m address space */
+	unsigned long p2m_vaddr;	/* virtual address of the p2m list */
+	unsigned long p2m_generation;	/* generation count of p2m mapping */
 };
 #endif	/* !__ASSEMBLY__ */
 
@@ -137,13 +197,31 @@ struct arch_shared_info {
 /*
  * The following is all CPU context. Note that the fpu_ctxt block is filled
  * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ *
+ * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
+ * for HVM and PVH guests, not all information in this structure is updated:
+ *
+ * - For HVM guests, the structures read include: fpu_ctxt (if
+ * VGCF_I387_VALID is set), flags, user_regs, debugreg[*]
+ *
+ * - PVH guests are the same as HVM guests, but additionally use ctrlreg[3] to
+ * set cr3. All other fields not used should be set to 0.
  */
 struct vcpu_guest_context {
     /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
     struct { char x[512]; } fpu_ctxt;       /* User-level FPU registers     */
-#define VGCF_I387_VALID (1<<0)
-#define VGCF_HVM_GUEST  (1<<1)
-#define VGCF_IN_KERNEL  (1<<2)
+#define VGCF_I387_VALID                (1<<0)
+#define VGCF_IN_KERNEL                 (1<<2)
+#define _VGCF_i387_valid               0
+#define VGCF_i387_valid                (1<<_VGCF_i387_valid)
+#define _VGCF_in_kernel                2
+#define VGCF_in_kernel                 (1<<_VGCF_in_kernel)
+#define _VGCF_failsafe_disables_events 3
+#define VGCF_failsafe_disables_events  (1<<_VGCF_failsafe_disables_events)
+#define _VGCF_syscall_disables_events  4
+#define VGCF_syscall_disables_events   (1<<_VGCF_syscall_disables_events)
+#define _VGCF_online                   5
+#define VGCF_online                    (1<<_VGCF_online)
     unsigned long flags;                    /* VGCF_* flags                 */
     struct cpu_user_regs user_regs;         /* User-level CPU registers     */
     struct trap_info trap_ctxt[256];        /* Virtual IDT                  */
-- 
2.1.4



* [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
  2015-02-18  6:51 ` [PATCH 01/13] xen: sync with xen header Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-18 10:32     ` David Vrabel
  2015-02-18  6:51 ` [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup Juergen Gross
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

The linear p2m list should be anchored in the shared info structure
read by the Xen tools to be able to support 64 bit pv-domains larger
than 512 GB. Additionally the linear p2m list interface includes a
generation count which is changed prior to and after each mapping
change of the p2m list. By reading the generation count the Xen tools
can detect changes of the mappings and re-read the p2m list if
necessary.
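
The generation count thus acts like a seqlock for external readers. As
a minimal sketch (illustrative only, not part of this patch; the
function name and barrier placement are assumptions), a consumer with
the list mapped could read an entry consistently like this:

static unsigned long p2m_read_consistent(struct arch_shared_info *arch,
					 const unsigned long *p2m,
					 unsigned long pfn)
{
	unsigned long gen, mfn;

	do {
		gen = ACCESS_ONCE(arch->p2m_generation);
		rmb();		/* read the generation before the entry */
		mfn = ACCESS_ONCE(p2m[pfn]);
		rmb();		/* read the entry before re-checking */
		/* An odd count means an update is in progress: retry. */
	} while ((gen & 1) || gen != ACCESS_ONCE(arch->p2m_generation));

	return mfn;
}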

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/p2m.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index f18fd1d..df73cc5 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
 	HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
 		virt_to_mfn(p2m_top_mfn);
 	HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
+	HYPERVISOR_shared_info->arch.p2m_generation = 0;
+	HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
+	HYPERVISOR_shared_info->arch.p2m_cr3 =
+		xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
 }
 
 /* Set up p2m_top to point to the domain-builder provided p2m pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
 
 		ptechk = lookup_address(vaddr, &level);
 		if (ptechk == pte_pg) {
+			HYPERVISOR_shared_info->arch.p2m_generation++;
 			set_pmd(pmdp,
 				__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+			HYPERVISOR_shared_info->arch.p2m_generation++;
 			pte_newpg[i] = NULL;
 		}
 
@@ -568,8 +574,10 @@ static bool alloc_p2m(unsigned long pfn)
 		spin_lock_irqsave(&p2m_update_lock, flags);
 
 		if (pte_pfn(*ptep) == p2m_pfn) {
+			HYPERVISOR_shared_info->arch.p2m_generation++;
 			set_pte(ptep,
 				pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
+			HYPERVISOR_shared_info->arch.p2m_generation++;
 			if (mid_mfn)
 				mid_mfn[mididx] = virt_to_mfn(p2m);
 			p2m = NULL;
@@ -621,6 +629,11 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
 		return true;
 	}
 
+	/*
+	 * The interface requires atomic updates on p2m elements.
+	 * xen_safe_write_ulong() uses __put_user(), which does an atomic
+	 * store via asm().
+	 */
 	if (likely(!xen_safe_write_ulong(xen_p2m_addr + pfn, mfn)))
 		return true;
 
-- 
2.1.4



* [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
  2015-02-18  6:51 ` [PATCH 01/13] xen: sync with xen header Juergen Gross
  2015-02-18  6:51 ` [PATCH 02/13] xen: anchor linear p2m list in shared info structure Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-18  6:51 ` [PATCH 04/13] xen: move static e820 map to global scope Juergen Gross
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB memory.
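
The two placements can be told apart at run time by comparing the
list's virtual address with __START_KERNEL_map; condensed from the
xen_setup_kernel_pagetable() hunk below, the check reads:

	if (xen_start_info->mfn_list < __START_KERNEL_map) {
		/* New: P->M table lives at a user space address. */
		max_pfn_mapped = xen_start_info->first_p2m_pfn;
	} else {
		/* Legacy: P->M table lives in the initial mapping. */
		max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
	}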

Signed-off-by: Juergen Gross <jgross@suse.com>
Based-on-patch-by: Jan Beulich <jbeulich@suse.com>
---
 arch/x86/xen/mmu.c      | 126 ++++++++++++++++++++++++++++++++++++++++++++----
 arch/x86/xen/setup.c    |  67 ++++++++++++++-----------
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index adca9e2..1ca5197 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1114,6 +1114,77 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 	xen_mc_flush();
 }
 
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+	void *vaddr = __va(paddr);
+	void *vaddr_end = vaddr + size;
+
+	for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+		make_lowmem_page_readwrite(vaddr);
+
+	memblock_free(paddr, size);
+}
+
+static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+	unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+	ClearPagePinned(virt_to_page(__va(pa)));
+	xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+	unsigned long va = vaddr & PMD_MASK;
+	unsigned long pa;
+	pgd_t *pgd = pgd_offset_k(va);
+	pud_t *pud_page = pud_offset(pgd, 0);
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned int i;
+
+	set_pgd(pgd, __pgd(0));
+	do {
+		pud = pud_page + pud_index(va);
+		if (pud_none(*pud)) {
+			va += PUD_SIZE;
+		} else if (pud_large(*pud)) {
+			pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+			xen_free_ro_pages(pa, PUD_SIZE);
+			va += PUD_SIZE;
+		} else {
+			pmd = pmd_offset(pud, va);
+			if (pmd_large(*pmd)) {
+				pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+				xen_free_ro_pages(pa, PMD_SIZE);
+			} else if (!pmd_none(*pmd)) {
+				pte = pte_offset_kernel(pmd, va);
+				for (i = 0; i < PTRS_PER_PTE; ++i) {
+					if (pte_none(pte[i]))
+						break;
+					pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+					xen_free_ro_pages(pa, PAGE_SIZE);
+				}
+				xen_cleanmfnmap_free_pgtbl(pte);
+			}
+			va += PMD_SIZE;
+			if (pmd_index(va))
+				continue;
+			xen_cleanmfnmap_free_pgtbl(pmd);
+		}
+
+	} while (pud_index(va) || pmd_index(va));
+	xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
 static void __init xen_pagetable_p2m_free(void)
 {
 	unsigned long size;
@@ -1128,18 +1199,25 @@ static void __init xen_pagetable_p2m_free(void)
 	/* using __ka address and sticking INVALID_P2M_ENTRY! */
 	memset((void *)xen_start_info->mfn_list, 0xff, size);
 
-	/* We should be in __ka space. */
-	BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
 	addr = xen_start_info->mfn_list;
-	/* We roundup to the PMD, which means that if anybody at this stage is
-	 * using the __ka address of xen_start_info or xen_start_info->shared_info
-	 * they are in going to crash. Fortunatly we have already revectored
-	 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+	/*
+	 * We could be in __ka space.
+	 * We roundup to the PMD, which means that if anybody at this stage is
+	 * using the __ka address of xen_start_info or
+	 * xen_start_info->shared_info they are going to crash. Fortunately
+	 * we have already revectored in xen_setup_kernel_pagetable and in
+	 * xen_setup_shared_info.
+	 */
 	size = roundup(size, PMD_SIZE);
-	xen_cleanhighmap(addr, addr + size);
 
-	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
-	memblock_free(__pa(xen_start_info->mfn_list), size);
+	if (addr >= __START_KERNEL_map) {
+		xen_cleanhighmap(addr, addr + size);
+		size = PAGE_ALIGN(xen_start_info->nr_pages *
+				  sizeof(unsigned long));
+		memblock_free(__pa(addr), size);
+	} else {
+		xen_cleanmfnmap(addr);
+	}
 
 	/* At this stage, cleanup_highmap has already cleaned __ka space
 	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1461,6 +1539,24 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 #else /* CONFIG_X86_64 */
 static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 {
+	unsigned long pfn;
+
+	if (xen_feature(XENFEAT_writable_page_tables) ||
+	    xen_feature(XENFEAT_auto_translated_physmap) ||
+	    xen_start_info->mfn_list >= __START_KERNEL_map)
+		return pte;
+
+	/*
+	 * Pages belonging to the initial p2m list mapped outside the default
+	 * address range must be mapped read-only. This region contains the
+	 * page tables for mapping the p2m list, too, and page tables MUST be
+	 * mapped read-only.
+	 */
+	pfn = pte_pfn(pte);
+	if (pfn >= xen_start_info->first_p2m_pfn &&
+	    pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+		pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
 	return pte;
 }
 #endif /* CONFIG_X86_64 */
@@ -1815,7 +1911,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * mappings. Considering that on Xen after the kernel mappings we
 	 * have the mappings of some pages that don't exist in pfn space, we
 	 * set max_pfn_mapped to the last real pfn mapped. */
-	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+	if (xen_start_info->mfn_list < __START_KERNEL_map)
+		max_pfn_mapped = xen_start_info->first_p2m_pfn;
+	else
+		max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
 	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1855,6 +1954,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Graft it onto L4[511][510] */
 	copy_page(level2_kernel_pgt, l2);
 
+	/* Copy the initial P->M table mappings if necessary. */
+	i = pgd_index(xen_start_info->mfn_list);
+	if (i && i < pgd_index(__START_KERNEL_map))
+		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1895,6 +1999,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* protect xen_start_info */
+	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
 	/* Revector the xen_start_info */
 	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 55f388e..adad417 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -560,6 +560,41 @@ static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
 	}
 }
 
+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * in the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * neccessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so let's just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+	if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+		memblock_reserve(__pa(xen_start_info->mfn_list),
+				 xen_start_info->pt_base -
+				 xen_start_info->mfn_list);
+		return;
+	}
+
+	memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+			 PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
@@ -684,35 +719,10 @@ char * __init xen_memory_setup(void)
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
 
-	/*
-	 * Reserve Xen bits:
-	 *  - mfn_list
-	 *  - xen_start_info
-	 * See comment above "struct start_info" in <xen/interface/xen.h>
-	 * We tried to make the the memblock_reserve more selective so
-	 * that it would be clear what region is reserved. Sadly we ran
-	 * in the problem wherein on a 64-bit hypervisor with a 32-bit
-	 * initial domain, the pt_base has the cr3 value which is not
-	 * neccessarily where the pagetable starts! As Jan put it: "
-	 * Actually, the adjustment turns out to be correct: The page
-	 * tables for a 32-on-64 dom0 get allocated in the order "first L1",
-	 * "first L2", "first L3", so the offset to the page table base is
-	 * indeed 2. When reading xen/include/public/xen.h's comment
-	 * very strictly, this is not a violation (since there nothing is said
-	 * that the first thing in the page table space is pointed to by
-	 * pt_base; I admit that this seems to be implied though, namely
-	 * do I think that it is implied that the page table space is the
-	 * range [pt_base, pt_base + nt_pt_frames), whereas that
-	 * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
-	 * which - without a priori knowledge - the kernel would have
-	 * difficulty to figure out)." - so lets just fall back to the
-	 * easy way and reserve the whole region.
-	 */
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
-
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
+	xen_reserve_xen_mfnlist();
+
 	return "Xen";
 }
 
@@ -739,8 +749,7 @@ char * __init xen_auto_xlated_memory_setup(void)
 	for (i = 0; i < memmap.nr_entries; i++)
 		e820_add_region(map[i].addr, map[i].size, map[i].type);
 
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
+	xen_reserve_xen_mfnlist();
 
 	return "Xen";
 }
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 674b2225..e89e816 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -147,6 +147,8 @@ NEXT_HYPERCALL(arch_6)
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __PAGE_OFFSET)
 #else
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __START_KERNEL_map)
+	/* Map the p2m table to a 512GB-aligned user address. */
+	ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M,       .quad PGDIR_SIZE)
 #endif
 	ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)
 	ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
-- 
2.1.4



* [PATCH 04/13] xen: move static e820 map to global scope
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (2 preceding siblings ...)
  2015-02-18  6:51 ` [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-19 17:27   ` [Xen-devel] " David Vrabel
  2015-02-18  6:51 ` [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables Juergen Gross
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Instead of using a function-local static e820 map in xen_memory_setup()
and calling various functions in the same source file with the map as a
parameter, use a map directly accessible by all functions in that file.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c | 96 +++++++++++++++++++++++++++-------------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index adad417..ab6c36e 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -38,6 +38,10 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
 /* Number of pages released from the initial allocation. */
 unsigned long xen_released_pages;
 
+/* E820 map used during setting up memory. */
+static struct e820entry xen_e820_map[E820MAX] __initdata;
+static u32 xen_e820_map_entries __initdata;
+
 /*
  * Buffer used to remap identity mapped pages. We only need the virtual space.
  * The physical page behind this address is remapped as needed to different
@@ -164,15 +168,13 @@ void __init xen_inv_extra_mem(void)
  * This function updates min_pfn with the pfn found and returns
  * the size of that range or zero if not found.
  */
-static unsigned long __init xen_find_pfn_range(
-	const struct e820entry *list, size_t map_size,
-	unsigned long *min_pfn)
+static unsigned long __init xen_find_pfn_range(unsigned long *min_pfn)
 {
-	const struct e820entry *entry;
+	const struct e820entry *entry = xen_e820_map;
 	unsigned int i;
 	unsigned long done = 0;
 
-	for (i = 0, entry = list; i < map_size; i++, entry++) {
+	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
 		unsigned long s_pfn;
 		unsigned long e_pfn;
 
@@ -356,9 +358,9 @@ static void __init xen_do_set_identity_and_remap_chunk(
  * to Xen and not remapped.
  */
 static unsigned long __init xen_set_identity_and_remap_chunk(
-        const struct e820entry *list, size_t map_size, unsigned long start_pfn,
-	unsigned long end_pfn, unsigned long nr_pages, unsigned long remap_pfn,
-	unsigned long *released, unsigned long *remapped)
+	unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
+	unsigned long remap_pfn, unsigned long *released,
+	unsigned long *remapped)
 {
 	unsigned long pfn;
 	unsigned long i = 0;
@@ -379,8 +381,7 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
 		if (cur_pfn + size > nr_pages)
 			size = nr_pages - cur_pfn;
 
-		remap_range_size = xen_find_pfn_range(list, map_size,
-						      &remap_pfn);
+		remap_range_size = xen_find_pfn_range(&remap_pfn);
 		if (!remap_range_size) {
 			pr_warning("Unable to find available pfn range, not remapping identity pages\n");
 			xen_set_identity_and_release_chunk(cur_pfn,
@@ -411,13 +412,12 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
 	return remap_pfn;
 }
 
-static void __init xen_set_identity_and_remap(
-	const struct e820entry *list, size_t map_size, unsigned long nr_pages,
-	unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long nr_pages,
+			unsigned long *released, unsigned long *remapped)
 {
 	phys_addr_t start = 0;
 	unsigned long last_pfn = nr_pages;
-	const struct e820entry *entry;
+	const struct e820entry *entry = xen_e820_map;
 	unsigned long num_released = 0;
 	unsigned long num_remapped = 0;
 	int i;
@@ -433,9 +433,9 @@ static void __init xen_set_identity_and_remap(
 	 * example) the DMI tables in a reserved region that begins on
 	 * a non-page boundary.
 	 */
-	for (i = 0, entry = list; i < map_size; i++, entry++) {
+	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
 		phys_addr_t end = entry->addr + entry->size;
-		if (entry->type == E820_RAM || i == map_size - 1) {
+		if (entry->type == E820_RAM || i == xen_e820_map_entries - 1) {
 			unsigned long start_pfn = PFN_DOWN(start);
 			unsigned long end_pfn = PFN_UP(end);
 
@@ -444,9 +444,9 @@ static void __init xen_set_identity_and_remap(
 
 			if (start_pfn < end_pfn)
 				last_pfn = xen_set_identity_and_remap_chunk(
-						list, map_size, start_pfn,
-						end_pfn, nr_pages, last_pfn,
-						&num_released, &num_remapped);
+						start_pfn, end_pfn, nr_pages,
+						last_pfn, &num_released,
+						&num_remapped);
 			start = end;
 		}
 	}
@@ -549,12 +549,12 @@ static void __init xen_align_and_add_e820_region(phys_addr_t start,
 	e820_add_region(start, end - start, type);
 }
 
-static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
+static void __init xen_ignore_unusable(void)
 {
-	struct e820entry *entry;
+	struct e820entry *entry = xen_e820_map;
 	unsigned int i;
 
-	for (i = 0, entry = list; i < map_size; i++, entry++) {
+	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
 		if (entry->type == E820_UNUSABLE)
 			entry->type = E820_RAM;
 	}
@@ -600,8 +600,6 @@ static void __init xen_reserve_xen_mfnlist(void)
  **/
 char * __init xen_memory_setup(void)
 {
-	static struct e820entry map[E820MAX] __initdata;
-
 	unsigned long max_pfn = xen_start_info->nr_pages;
 	phys_addr_t mem_end;
 	int rc;
@@ -616,7 +614,7 @@ char * __init xen_memory_setup(void)
 	mem_end = PFN_PHYS(max_pfn);
 
 	memmap.nr_entries = E820MAX;
-	set_xen_guest_handle(memmap.buffer, map);
+	set_xen_guest_handle(memmap.buffer, xen_e820_map);
 
 	op = xen_initial_domain() ?
 		XENMEM_machine_memory_map :
@@ -625,15 +623,16 @@ char * __init xen_memory_setup(void)
 	if (rc == -ENOSYS) {
 		BUG_ON(xen_initial_domain());
 		memmap.nr_entries = 1;
-		map[0].addr = 0ULL;
-		map[0].size = mem_end;
+		xen_e820_map[0].addr = 0ULL;
+		xen_e820_map[0].size = mem_end;
 		/* 8MB slack (to balance backend allocations). */
-		map[0].size += 8ULL << 20;
-		map[0].type = E820_RAM;
+		xen_e820_map[0].size += 8ULL << 20;
+		xen_e820_map[0].type = E820_RAM;
 		rc = 0;
 	}
 	BUG_ON(rc);
 	BUG_ON(memmap.nr_entries == 0);
+	xen_e820_map_entries = memmap.nr_entries;
 
 	/*
 	 * Xen won't allow a 1:1 mapping to be created to UNUSABLE
@@ -644,10 +643,11 @@ char * __init xen_memory_setup(void)
 	 * a patch in the future.
 	 */
 	if (xen_initial_domain())
-		xen_ignore_unusable(map, memmap.nr_entries);
+		xen_ignore_unusable();
 
 	/* Make sure the Xen-supplied memory map is well-ordered. */
-	sanitize_e820_map(map, memmap.nr_entries, &memmap.nr_entries);
+	sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
+			  &xen_e820_map_entries);
 
 	max_pages = xen_get_max_pages();
 	if (max_pages > max_pfn)
@@ -657,8 +657,8 @@ char * __init xen_memory_setup(void)
 	 * Set identity map on non-RAM pages and prepare remapping the
 	 * underlying RAM.
 	 */
-	xen_set_identity_and_remap(map, memmap.nr_entries, max_pfn,
-				   &xen_released_pages, &remapped_pages);
+	xen_set_identity_and_remap(max_pfn, &xen_released_pages,
+				   &remapped_pages);
 
 	extra_pages += xen_released_pages;
 	extra_pages += remapped_pages;
@@ -677,10 +677,10 @@ char * __init xen_memory_setup(void)
 	extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
 			  extra_pages);
 	i = 0;
-	while (i < memmap.nr_entries) {
-		phys_addr_t addr = map[i].addr;
-		phys_addr_t size = map[i].size;
-		u32 type = map[i].type;
+	while (i < xen_e820_map_entries) {
+		phys_addr_t addr = xen_e820_map[i].addr;
+		phys_addr_t size = xen_e820_map[i].size;
+		u32 type = xen_e820_map[i].type;
 
 		if (type == E820_RAM) {
 			if (addr < mem_end) {
@@ -696,9 +696,9 @@ char * __init xen_memory_setup(void)
 
 		xen_align_and_add_e820_region(addr, size, type);
 
-		map[i].addr += size;
-		map[i].size -= size;
-		if (map[i].size == 0)
+		xen_e820_map[i].addr += size;
+		xen_e820_map[i].size -= size;
+		if (xen_e820_map[i].size == 0)
 			i++;
 	}
 
@@ -709,7 +709,7 @@ char * __init xen_memory_setup(void)
 	 * PFNs above MAX_P2M_PFN are considered identity mapped as
 	 * well.
 	 */
-	set_phys_range_identity(map[i-1].addr / PAGE_SIZE, ~0ul);
+	set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);
 
 	/*
 	 * In domU, the ISA region is normal, usable memory, but we
@@ -731,23 +731,25 @@ char * __init xen_memory_setup(void)
  */
 char * __init xen_auto_xlated_memory_setup(void)
 {
-	static struct e820entry map[E820MAX] __initdata;
-
 	struct xen_memory_map memmap;
 	int i;
 	int rc;
 
 	memmap.nr_entries = E820MAX;
-	set_xen_guest_handle(memmap.buffer, map);
+	set_xen_guest_handle(memmap.buffer, xen_e820_map);
 
 	rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
 	if (rc < 0)
 		panic("No memory map (%d)\n", rc);
 
-	sanitize_e820_map(map, ARRAY_SIZE(map), &memmap.nr_entries);
+	xen_e820_map_entries = memmap.nr_entries;
+
+	sanitize_e820_map(xen_e820_map, ARRAY_SIZE(xen_e820_map),
+			  &xen_e820_map_entries);
 
-	for (i = 0; i < memmap.nr_entries; i++)
-		e820_add_region(map[i].addr, map[i].size, map[i].type);
+	for (i = 0; i < xen_e820_map_entries; i++)
+		e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
+				xen_e820_map[i].type);
 
 	xen_reserve_xen_mfnlist();
 
-- 
2.1.4



* [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (3 preceding siblings ...)
  2015-02-18  6:51 ` [PATCH 04/13] xen: move static e820 map to global scope Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-19 18:10   ` [Xen-devel] " David Vrabel
  2015-02-18  6:51 ` [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map Juergen Gross
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

xen_set_identity_and_remap() is used to prepare remapping of memory
conflicting with the E820 map. It tracks the pfn at which to remap new
memory via a local variable which is passed to a subfunction, which in
turn returns the new value for that variable.

Additionally the targeted maximum pfn is passed as a parameter to
subfunctions.

Simplify that construct by using global variables in the source file
for that purpose. This will also make things simpler when we need
those values later.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c | 63 +++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 33 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index ab6c36e..0dda131 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -56,6 +56,9 @@ static struct {
 } xen_remap_buf __initdata __aligned(PAGE_SIZE);
 static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 
+static unsigned long xen_remap_pfn;
+static unsigned long xen_max_pfn;
+
 /* 
  * The maximum amount of extra memory compared to the base size.  The
  * main scaling factor is the size of struct page.  At extreme ratios
@@ -223,7 +226,7 @@ static int __init xen_free_mfn(unsigned long mfn)
  * as a fallback if the remapping fails.
  */
 static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
-	unsigned long end_pfn, unsigned long nr_pages, unsigned long *released)
+				unsigned long end_pfn, unsigned long *released)
 {
 	unsigned long pfn, end;
 	int ret;
@@ -231,7 +234,7 @@ static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
 	WARN_ON(start_pfn > end_pfn);
 
 	/* Release pages first. */
-	end = min(end_pfn, nr_pages);
+	end = min(end_pfn, xen_max_pfn);
 	for (pfn = start_pfn; pfn < end; pfn++) {
 		unsigned long mfn = pfn_to_mfn(pfn);
 
@@ -302,7 +305,7 @@ static void __init xen_update_mem_tables(unsigned long pfn, unsigned long mfn)
  * its callers.
  */
 static void __init xen_do_set_identity_and_remap_chunk(
-        unsigned long start_pfn, unsigned long size, unsigned long remap_pfn)
+	unsigned long start_pfn, unsigned long size)
 {
 	unsigned long buf = (unsigned long)&xen_remap_buf;
 	unsigned long mfn_save, mfn;
@@ -317,7 +320,7 @@ static void __init xen_do_set_identity_and_remap_chunk(
 
 	mfn_save = virt_to_mfn(buf);
 
-	for (ident_pfn_iter = start_pfn, remap_pfn_iter = remap_pfn;
+	for (ident_pfn_iter = start_pfn, remap_pfn_iter = xen_remap_pfn;
 	     ident_pfn_iter < ident_end_pfn;
 	     ident_pfn_iter += REMAP_SIZE, remap_pfn_iter += REMAP_SIZE) {
 		chunk = (left < REMAP_SIZE) ? left : REMAP_SIZE;
@@ -350,17 +353,16 @@ static void __init xen_do_set_identity_and_remap_chunk(
  * This function takes a contiguous pfn range that needs to be identity mapped
  * and:
  *
- *  1) Finds a new range of pfns to use to remap based on E820 and remap_pfn.
+ *  1) Finds a new range of pfns to use to remap based on E820 and
+ *     xen_remap_pfn.
  *  2) Calls the do_ function to actually do the mapping/remapping work.
  *
  * The goal is to not allocate additional memory but to remap the existing
  * pages. In the case of an error the underlying memory is simply released back
  * to Xen and not remapped.
  */
-static unsigned long __init xen_set_identity_and_remap_chunk(
-	unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
-	unsigned long remap_pfn, unsigned long *released,
-	unsigned long *remapped)
+static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
+	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
 {
 	unsigned long pfn;
 	unsigned long i = 0;
@@ -373,30 +375,30 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
 		unsigned long remap_range_size;
 
 		/* Do not remap pages beyond the current allocation */
-		if (cur_pfn >= nr_pages) {
+		if (cur_pfn >= xen_max_pfn) {
 			/* Identity map remaining pages */
 			set_phys_range_identity(cur_pfn, cur_pfn + size);
 			break;
 		}
-		if (cur_pfn + size > nr_pages)
-			size = nr_pages - cur_pfn;
+		if (cur_pfn + size > xen_max_pfn)
+			size = xen_max_pfn - cur_pfn;
 
-		remap_range_size = xen_find_pfn_range(&remap_pfn);
+		remap_range_size = xen_find_pfn_range(&xen_remap_pfn);
 		if (!remap_range_size) {
 			pr_warning("Unable to find available pfn range, not remapping identity pages\n");
 			xen_set_identity_and_release_chunk(cur_pfn,
-				cur_pfn + left, nr_pages, released);
+						cur_pfn + left, released);
 			break;
 		}
 		/* Adjust size to fit in current e820 RAM region */
 		if (size > remap_range_size)
 			size = remap_range_size;
 
-		xen_do_set_identity_and_remap_chunk(cur_pfn, size, remap_pfn);
+		xen_do_set_identity_and_remap_chunk(cur_pfn, size);
 
 		/* Update variables to reflect new mappings. */
 		i += size;
-		remap_pfn += size;
+		xen_remap_pfn += size;
 		*remapped += size;
 	}
 
@@ -408,20 +410,19 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
 		(void)HYPERVISOR_update_va_mapping(
 			(unsigned long)__va(pfn << PAGE_SHIFT),
 			mfn_pte(pfn, PAGE_KERNEL_IO), 0);
-
-	return remap_pfn;
 }
 
-static void __init xen_set_identity_and_remap(unsigned long nr_pages,
-			unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long *released,
+					      unsigned long *remapped)
 {
 	phys_addr_t start = 0;
-	unsigned long last_pfn = nr_pages;
 	const struct e820entry *entry = xen_e820_map;
 	unsigned long num_released = 0;
 	unsigned long num_remapped = 0;
 	int i;
 
+	xen_remap_pfn = xen_max_pfn;
+
 	/*
 	 * Combine non-RAM regions and gaps until a RAM region (or the
 	 * end of the map) is reached, then set the 1:1 map and
@@ -443,10 +444,8 @@ static void __init xen_set_identity_and_remap(unsigned long nr_pages,
 				end_pfn = PFN_UP(entry->addr);
 
 			if (start_pfn < end_pfn)
-				last_pfn = xen_set_identity_and_remap_chunk(
-						start_pfn, end_pfn, nr_pages,
-						last_pfn, &num_released,
-						&num_remapped);
+				xen_set_identity_and_remap_chunk(start_pfn,
+					end_pfn, &num_released, &num_remapped);
 			start = end;
 		}
 	}
@@ -600,7 +599,6 @@ static void __init xen_reserve_xen_mfnlist(void)
  **/
 char * __init xen_memory_setup(void)
 {
-	unsigned long max_pfn = xen_start_info->nr_pages;
 	phys_addr_t mem_end;
 	int rc;
 	struct xen_memory_map memmap;
@@ -610,8 +608,8 @@ char * __init xen_memory_setup(void)
 	int i;
 	int op;
 
-	max_pfn = min(MAX_DOMAIN_PAGES, max_pfn);
-	mem_end = PFN_PHYS(max_pfn);
+	xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
+	mem_end = PFN_PHYS(xen_max_pfn);
 
 	memmap.nr_entries = E820MAX;
 	set_xen_guest_handle(memmap.buffer, xen_e820_map);
@@ -650,15 +648,14 @@ char * __init xen_memory_setup(void)
 			  &xen_e820_map_entries);
 
 	max_pages = xen_get_max_pages();
-	if (max_pages > max_pfn)
-		extra_pages += max_pages - max_pfn;
+	if (max_pages > xen_max_pfn)
+		extra_pages += max_pages - xen_max_pfn;
 
 	/*
 	 * Set identity map on non-RAM pages and prepare remapping the
 	 * underlying RAM.
 	 */
-	xen_set_identity_and_remap(max_pfn, &xen_released_pages,
-				   &remapped_pages);
+	xen_set_identity_and_remap(&xen_released_pages, &remapped_pages);
 
 	extra_pages += xen_released_pages;
 	extra_pages += remapped_pages;
@@ -674,7 +671,7 @@ char * __init xen_memory_setup(void)
 	 * the initial memory is also very large with respect to
 	 * lowmem, but we won't try to deal with that here.
 	 */
-	extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
+	extra_pages = min(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
 			  extra_pages);
 	i = 0;
 	while (i < xen_e820_map_entries) {
-- 
2.1.4



* [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (4 preceding siblings ...)
  2015-02-18  6:51 ` [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables Juergen Gross
@ 2015-02-18  6:51 ` Juergen Gross
  2015-02-19 18:07   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 07/13] xen: find unused contiguous memory area Juergen Gross
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:51 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Currently, especially for dom0, guest memory with guest pfns not
matching host areas populated with RAM is remapped to areas which are
backed by RAM as well. This is done to be able to use identity
mappings (pfn == mfn) for I/O areas.

Up to now it is not checked whether the remapped memory is already
in use. Remapping used memory will probably result in data corruption,
as the remapped memory will no longer be reserved. Any memory
allocation after the remap can claim that memory.

Add an infrastructure to check for conflicts between reserved memory
areas and the final E820 map, and to react to a conflict via an
area-specific function.

This function has 3 options:
- Panic.
- Handle the conflict by moving the data to another memory area.
  This is indicated by a return value other than 0.
- Just return 0. This will delay invalidating the conflicting memory
  area until just before doing the remap. This option is usable only
  for cases where the memory will no longer be needed when the remap
  operation is started, e.g. for the p2m list, which has already been
  copied by then.

When doing the remap, check that no reserved page is being remapped.
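
As an illustrative sketch only (not part of this patch: the real
callers are added later in this series, and xen_register_example_area,
example_start and example_size are made-up names), registering an area
with a handler implementing the panic option could look like this:

static int __init xen_area_conflict_panic(phys_addr_t start,
					  phys_addr_t size)
{
	panic("Area at %pa (size %pa) conflicts with E820 map\n",
	      &start, &size);
	return 0;	/* not reached, panic() does not return */
}

static void __init xen_register_example_area(void)
{
	/* A non-zero last argument makes xen_add_reserved_area() also
	 * memblock_reserve() the area. */
	xen_add_reserved_area(example_start, example_size,
			      xen_area_conflict_panic, 1);
}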

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c   | 185 +++++++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/xen/xen-ops.h |   2 +
 2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0dda131..a0af554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 static unsigned long xen_remap_pfn;
 static unsigned long xen_max_pfn;
 
+/*
+ * Areas with memblock_reserve()d memory to be checked against the final
+ * E820 map. Each area has an associated function to handle conflicts (by
+ * either removing the conflict or by just crashing with an appropriate
+ * message). The array has a fixed size as there are only a few areas of
+ * interest which are well known: kernel, page tables, p2m, initrd.
+ */
+#define XEN_N_RESERVED_AREAS	4
+static struct {
+	phys_addr_t	start;
+	phys_addr_t	size;
+	int		(*func)(phys_addr_t start, phys_addr_t size);
+} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
+
 /* 
  * The maximum amount of extra memory compared to the base size.  The
  * main scaling factor is the size of struct page.  At extreme ratios
@@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
 	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
 {
 	unsigned long pfn;
-	unsigned long i = 0;
+	unsigned long i;
 	unsigned long n = end_pfn - start_pfn;
 
-	while (i < n) {
+	for (i = 0; i < n; ) {
 		unsigned long cur_pfn = start_pfn + i;
 		unsigned long left = n - i;
 		unsigned long size = left;
@@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
 			(unsigned long)__va(pfn << PAGE_SHIFT),
 			mfn_pte(pfn, PAGE_KERNEL_IO), 0);
 }
+/* Check the memory area to be remapped for conflicts with reserved areas.
+ *
+ * Skip regions known to be reserved which are handled later. For these
+ * regions we have to increase the remapped counter in order to reserve
+ * extra memory space.
+ *
+ * In case a memory page already in use is to be remapped, just BUG().
+ */
+static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
+	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
+{
+	unsigned long pfn;
+	unsigned long area_start, area_end;
+	unsigned i;
+
+	for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+
+		if (!xen_reserved_area[i].size)
+			break;
+
+		area_start = PFN_DOWN(xen_reserved_area[i].start);
+		area_end = PFN_UP(xen_reserved_area[i].start +
+				  xen_reserved_area[i].size);
+		if (area_start >= end_pfn || area_end <= start_pfn)
+			continue;
+
+		if (area_start > start_pfn)
+			xen_set_identity_and_remap_chunk(start_pfn, area_start,
+							 released, remapped);
+
+		if (area_end < end_pfn)
+			xen_set_identity_and_remap_chunk(area_end, end_pfn,
+							 released, remapped);
+
+		*remapped += min(area_end, end_pfn) -
+			    max(area_start, start_pfn);
+
+		return;
+	}
+
+	/* Test for memory already in use */
+	for (pfn = start_pfn; pfn < end_pfn; pfn++)
+		BUG_ON(memblock_is_reserved(PFN_PHYS(pfn)));
+
+	xen_set_identity_and_remap_chunk(start_pfn, end_pfn,
+					 released, remapped);
+}
 
 static void __init xen_set_identity_and_remap(unsigned long *released,
 					      unsigned long *remapped)
@@ -444,7 +505,7 @@ static void __init xen_set_identity_and_remap(unsigned long *released,
 				end_pfn = PFN_UP(entry->addr);
 
 			if (start_pfn < end_pfn)
-				xen_set_identity_and_remap_chunk(start_pfn,
+				xen_set_identity_and_remap_chunk_chk(start_pfn,
 					end_pfn, &num_released, &num_remapped);
 			start = end;
 		}
@@ -456,6 +517,45 @@ static void __init xen_set_identity_and_remap(unsigned long *released,
 	pr_info("Released %ld page(s)\n", num_released);
 }
 
+static void __init xen_late_set_identity_and_remap(void)
+{
+	const struct e820entry *entry;
+	int i, e;
+	unsigned long num_released = 0;
+	unsigned long num_remapped = 0;
+
+	for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+		unsigned long area_start, area_end;
+
+		if (!xen_reserved_area[i].size)
+			return;
+
+		area_start = PFN_DOWN(xen_reserved_area[i].start);
+		area_end = PFN_UP(xen_reserved_area[i].start +
+				  xen_reserved_area[i].size);
+
+		for (e = 0, entry = xen_e820_map; e < xen_e820_map_entries; e++, entry++) {
+			unsigned long start_pfn;
+			unsigned long end_pfn;
+
+			if (entry->type == E820_RAM)
+				continue;
+
+			start_pfn = PFN_DOWN(entry->addr);
+			end_pfn = PFN_UP(entry->addr + entry->size);
+
+			if (area_start >= end_pfn || area_end <= start_pfn)
+				continue;
+
+			start_pfn = max(area_start, start_pfn);
+			end_pfn = min(area_end, end_pfn);
+
+			xen_set_identity_and_remap_chunk(start_pfn, end_pfn,
+					&num_released, &num_remapped);
+		}
+	}
+}
+
 /*
  * Remap the memory prepared in xen_do_set_identity_and_remap_chunk().
  * The remap information (which mfn remap to which pfn) is contained in the
@@ -472,6 +572,8 @@ void __init xen_remap_memory(void)
 	unsigned long pfn_s = ~0UL;
 	unsigned long len = 0;
 
+	xen_late_set_identity_and_remap();
+
 	mfn_save = virt_to_mfn(buf);
 
 	while (xen_remap_mfn != INVALID_P2M_ENTRY) {
@@ -560,6 +662,76 @@ static void __init xen_ignore_unusable(void)
 }
 
 /*
+ * Check reserved memory areas for conflicts with E820 map.
+ */
+static void __init xen_chk_e820_reserved(void)
+{
+	struct e820entry *entry;
+	unsigned areacnt, mapcnt;
+	phys_addr_t start, end;
+	int ok;
+
+	for (areacnt = 0; areacnt < XEN_N_RESERVED_AREAS; areacnt++) {
+		start = xen_reserved_area[areacnt].start;
+		end = start + xen_reserved_area[areacnt].size;
+		if (start == end)
+			return;
+
+		ok = 0;
+		entry = xen_e820_map;
+
+		for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++) {
+			if (entry->type == E820_RAM && entry->addr <= start &&
+			    (entry->addr + entry->size) >= end) {
+				ok = 1;
+				break;
+			}
+			entry++;
+		}
+
+		if (ok || !xen_reserved_area[areacnt].func(start, end - start))
+			continue;
+
+		for (mapcnt = areacnt; mapcnt < XEN_N_RESERVED_AREAS - 1;
+		     mapcnt++)
+			xen_reserved_area[mapcnt] =
+				xen_reserved_area[mapcnt + 1];
+		xen_reserved_area[mapcnt].start = 0;
+		xen_reserved_area[mapcnt].size = 0;
+
+		areacnt--;
+	}
+}
+
+void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
+	int (*func)(phys_addr_t, phys_addr_t), int reserve)
+{
+	unsigned idx;
+
+	if (!size)
+		return;
+
+	BUG_ON(xen_reserved_area[XEN_N_RESERVED_AREAS - 1].size);
+
+	for (idx = XEN_N_RESERVED_AREAS - 1; idx > 0; idx--) {
+		if (!xen_reserved_area[idx - 1].size)
+			continue;
+
+		if (start > xen_reserved_area[idx - 1].start)
+			break;
+
+		xen_reserved_area[idx] = xen_reserved_area[idx - 1];
+	}
+
+	xen_reserved_area[idx].start = start;
+	xen_reserved_area[idx].size = size;
+	xen_reserved_area[idx].func = func;
+
+	if (reserve)
+		memblock_reserve(start, size);
+}
+
+/*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
 * We tried to make the memblock_reserve more selective so
@@ -608,6 +780,8 @@ char * __init xen_memory_setup(void)
 	int i;
 	int op;
 
+	xen_reserve_xen_mfnlist();
+
 	xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
 	mem_end = PFN_PHYS(xen_max_pfn);
 
@@ -647,6 +821,9 @@ char * __init xen_memory_setup(void)
 	sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
 			  &xen_e820_map_entries);
 
+	/* Check for conflicts between used memory and memory map. */
+	xen_chk_e820_reserved();
+
 	max_pages = xen_get_max_pages();
 	if (max_pages > xen_max_pfn)
 		extra_pages += max_pages - xen_max_pfn;
@@ -718,8 +895,6 @@ char * __init xen_memory_setup(void)
 
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
-	xen_reserve_xen_mfnlist();
-
 	return "Xen";
 }
 
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 9e195c6..fee4f70 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -42,6 +42,8 @@ void xen_mm_unpin_all(void);
 unsigned long __ref xen_chk_extra_mem(unsigned long pfn);
 void __init xen_inv_extra_mem(void);
 void __init xen_remap_memory(void);
+void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
+	int (*func)(phys_addr_t, phys_addr_t), int reserve);
 char * __init xen_memory_setup(void);
 char * xen_auto_xlated_memory_setup(void);
 void __init xen_arch_setup(void);
-- 
2.1.4



* [PATCH 07/13] xen: find unused contiguous memory area
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (5 preceding siblings ...)
  2015-02-18  6:51 ` [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:31   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 08/13] xen: add service function to copy physical memory areas Juergen Gross
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

To be able to relocate pre-allocated data areas like the initrd or
the p2m list, it is mandatory to find a contiguous memory area which
is not yet in use and doesn't conflict with the memory map we want to
put in effect.

In case such an area is found, reserve it at once, as this will have
to be done in any case.
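
A hypothetical caller (the function name is invented and the size is
an arbitrary example) illustrating the contract that a successfully
returned area has already been reserved:

static void __init xen_example_find_area(void)
{
	/* Ask for a free 4 MB area; on success it has already been
	 * memblock_reserve()d by xen_find_free_area(). */
	phys_addr_t target = xen_find_free_area(4 * 1024 * 1024);

	if (!target)
		pr_warn("xen: no free 4 MB area found\n");
}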

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c   | 34 ++++++++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a0af554..9c49d71 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -732,6 +732,40 @@ void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
 }
 
 /*
+ * Find a free area in physical memory which is not yet reserved and is
+ * compliant with the E820 map.
+ * Used to relocate pre-allocated areas like the initrd or the p2m list
+ * which conflict with the E820 map to be used.
+ * In case no area is found, return 0. Otherwise return the physical address
+ * of the area, which has already been reserved for convenience.
+ */
+phys_addr_t __init xen_find_free_area(phys_addr_t size)
+{
+	unsigned mapcnt;
+	phys_addr_t addr, start;
+	struct e820entry *entry = xen_e820_map;
+
+	for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++, entry++) {
+		if (entry->type != E820_RAM || entry->size < size)
+			continue;
+		start = entry->addr;
+		for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+			if (!memblock_is_reserved(addr))
+				continue;
+			start = addr + PAGE_SIZE;
+			if (start + size > entry->addr + entry->size)
+				break;
+		}
+		if (addr >= start + size) {
+			memblock_reserve(start, size);
+			return start;
+		}
+	}
+
+	return 0;
+}
+
+/*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
 * We tried to make the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fee4f70..8181e01 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -44,6 +44,7 @@ void __init xen_inv_extra_mem(void);
 void __init xen_remap_memory(void);
 void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
 	int (*func)(phys_addr_t, phys_addr_t), int reserve);
+phys_addr_t __init xen_find_free_area(phys_addr_t size);
 char * __init xen_memory_setup(void);
 char * xen_auto_xlated_memory_setup(void);
 void __init xen_arch_setup(void);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread
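For readers who want to poke at the scan in isolation, below is a
self-contained user-space model of the first-fit search above; note how the
window's upper bound grows whenever start is pushed past a reserved page.
Everything here (ram_region, is_reserved, the sample map) is invented for
the sketch; only the restart-after-conflict loop mirrors xen_find_free_area().

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096ULL

struct ram_region {		/* stand-in for an E820_RAM e820entry */
	uint64_t addr;
	uint64_t size;
};

/* Pretend the page at 0x2000 is already reserved (memblock would know). */
static bool is_reserved(uint64_t addr)
{
	return addr == 0x2000;
}

/* First-fit scan: restart the window behind any reserved page found. */
static uint64_t find_free_area(const struct ram_region *map, int entries,
			       uint64_t size)
{
	for (int i = 0; i < entries; i++) {
		if (map[i].size < size)
			continue;
		uint64_t start = map[i].addr;
		uint64_t addr;
		for (addr = start; addr < start + size; addr += PAGE_SIZE) {
			if (!is_reserved(addr))
				continue;
			start = addr + PAGE_SIZE;
			if (start + size > map[i].addr + map[i].size)
				break;	/* window no longer fits the region */
		}
		if (addr >= start + size)
			return start;	/* the kernel would reserve it here */
	}
	return 0;
}

int main(void)
{
	struct ram_region map[] = { { 0x1000, 16 * PAGE_SIZE } };
	uint64_t found = find_free_area(map, 1, 4 * PAGE_SIZE);

	printf("free area at %#llx\n", (unsigned long long)found); /* 0x3000 */
	return 0;
}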

* [PATCH 08/13] xen: add service function to copy physical memory areas
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (6 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 07/13] xen: find unused contiguous memory area Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:35   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 09/13] xen: check for kernel memory conflicting with memory layout Juergen Gross
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

In case a pre-allocated memory area is to be moved in order to avoid
a conflict with the target E820 map, we need a way to copy data between
physical addresses.

Add a function doing this via early_memremap().

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c   | 29 +++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 9c49d71..eb219c1 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
 }
 
 /*
+ * Like memcpy, but with physical addresses for dest and src.
+ */
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
+{
+	phys_addr_t dest_off, src_off, dest_len, src_len, len;
+	void *from, *to;
+
+	while (n) {
+		dest_off = dest & ~PAGE_MASK;
+		src_off = src & ~PAGE_MASK;
+		dest_len = n;
+		if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
+			dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
+		src_len = n;
+		if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
+			src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
+		len = min(dest_len, src_len);
+		to = early_memremap(dest - dest_off, dest_len + dest_off);
+		from = early_memremap(src - src_off, src_len + src_off);
+		memcpy(to, from, len);
+		early_iounmap(to, dest_len + dest_off);
+		early_iounmap(from, src_len + src_off);
+		n -= len;
+		dest += len;
+		src += len;
+	}
+}
+
+/*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
  * We tried to make the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 8181e01..9bf9df8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void __init xen_remap_memory(void);
 void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
 	int (*func)(phys_addr_t, phys_addr_t), int reserve);
 phys_addr_t __init xen_find_free_area(phys_addr_t size);
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n);
 char * __init xen_memory_setup(void);
 char * xen_auto_xlated_memory_setup(void);
 void __init xen_arch_setup(void);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread
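The interesting part of the helper above is the chunking arithmetic: source
and destination are mapped through a limited early fixmap window and may
have different page offsets, so each iteration copies the minimum of the two
window residues. A user-space model of that calculation (the NR_FIX_BTMAPS
value of 64 is an assumption, taken as typical for x86; the function name
is made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define NR_FIX_BTMAPS	64ULL	/* assumed size of the early mapping window */

/* Chunk size xen_phys_memcpy() would pick for one loop iteration. */
static uint64_t chunk_len(uint64_t dest, uint64_t src, uint64_t n)
{
	uint64_t win = NR_FIX_BTMAPS << PAGE_SHIFT;
	uint64_t dest_off = dest & ~PAGE_MASK;
	uint64_t src_off = src & ~PAGE_MASK;
	uint64_t dest_len = n > win - dest_off ? win - dest_off : n;
	uint64_t src_len = n > win - src_off ? win - src_off : n;

	return dest_len < src_len ? dest_len : src_len;
}

int main(void)
{
	/* A misaligned 1 MiB copy: both ends exceed the mapping window. */
	uint64_t len = chunk_len(0x100800, 0x200c00, 1ULL << 20);

	/* 64 pages minus the larger offset: 0x40000 - 0xc00 = 0x3f400 */
	printf("first chunk copies %#llx bytes\n", (unsigned long long)len);
	return 0;
}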

* [PATCH 09/13] xen: check for kernel memory conflicting with memory layout
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (7 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 08/13] xen: add service function to copy physical memory areas Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:36   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map Juergen Gross
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Check whether the pre-allocated memory of the loaded kernel is in
conflict with the target memory map. If this is the case, just panic
instead of running into problems later.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index eb219c1..37a34f9 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -829,6 +829,12 @@ static void __init xen_reserve_xen_mfnlist(void)
 			 PFN_PHYS(xen_start_info->nr_p2m_frames));
 }
 
+static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+	panic("kernel is located at position conflicting with E820 map!\n");
+	return 0;
+}
+
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
@@ -843,6 +849,10 @@ char * __init xen_memory_setup(void)
 	int i;
 	int op;
 
+	xen_add_reserved_area(__pa_symbol(_text),
+			      __pa_symbol(__bss_stop) - __pa_symbol(_text),
+			      xen_kernel_mem_conflict, 0);
+
 	xen_reserve_xen_mfnlist();
 
 	xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread
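This and the next three patches all funnel their pre-allocated ranges
through the same hook. Inferring the handler contract from how the callbacks
are used in this series (xen_add_reserved_area() itself comes from patch 06,
not shown in this excerpt): the function runs when the registered range
conflicts with the target E820 map, a non-zero return apparently means the
handler resolved the conflict itself (patch 11 moves the initrd and returns
1), and handlers that cannot recover panic. A minimal user-space model of
that contract, with made-up names:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t phys_addr_t;

/* Model of the handler type taken by xen_add_reserved_area(). */
typedef int (*conflict_fn)(phys_addr_t start, phys_addr_t size);

/* Handler that cannot recover: mirrors xen_kernel_mem_conflict(). */
static int kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
{
	fprintf(stderr, "kernel at %#llx (%llu bytes) conflicts with E820 map!\n",
		(unsigned long long)start, (unsigned long long)size);
	exit(1);	/* stands in for panic() */
}

/* Model: invoke the handler for a range that turned out to conflict. */
static void report_conflict(conflict_fn func, phys_addr_t start,
			    phys_addr_t size)
{
	if (func(start, size))
		printf("handler resolved the conflict itself\n");
	else
		printf("range stays reserved as registered\n");
}

int main(void)
{
	report_conflict(kernel_mem_conflict, 0x1000000, 0x800000);
	return 0;
}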

* [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (8 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 09/13] xen: check for kernel memory conflicting with memory layout Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:37   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 11/13] xen: move initrd away from e820 non-ram area Juergen Gross
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

Check whether the page tables built by the domain builder are at
memory addresses which are in conflict with the target memory map.
If this is the case, just panic instead of running into problems
later.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/mmu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 1ca5197..6641459 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1863,6 +1863,12 @@ void __init xen_setup_machphys_mapping(void)
 #endif
 }
 
+static int __init xen_pt_memory_conflict(phys_addr_t start, phys_addr_t size)
+{
+	panic("page tables are located at position conflicting with E820 map!\n");
+	return 0;
+}
+
 #ifdef CONFIG_X86_64
 static void __init convert_pfn_mfn(void *v)
 {
@@ -1998,7 +2004,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		check_pt_base(&pt_base, &pt_end, addr[i]);
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
-	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	xen_add_reserved_area(PFN_PHYS(pt_base),
+			      (pt_end - pt_base) * PAGE_SIZE,
+			      xen_pt_memory_conflict, 1);
 	/* protect xen_start_info */
 	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
 	/* Revector the xen_start_info */
@@ -2074,8 +2082,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 			  PFN_DOWN(__pa(initial_page_table)));
 	xen_write_cr3(__pa(initial_page_table));
 
-	memblock_reserve(__pa(xen_start_info->pt_base),
-			 xen_start_info->nr_pt_frames * PAGE_SIZE);
+	xen_add_reserved_area(__pa(xen_start_info->pt_base),
+			      xen_start_info->nr_pt_frames * PAGE_SIZE,
+			      xen_pt_memory_conflict, 1);
 }
 #endif	/* CONFIG_X86_64 */
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 11/13] xen: move initrd away from e820 non-ram area
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (9 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:42   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping Juergen Gross
  2015-02-18  6:52 ` [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains Juergen Gross
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

When adapting the dom0 memory layout to that of the host, make sure
the initrd isn't moved to another pfn range, as it won't be found
there any more.

The easiest way to accomplish that is by copying the initrd to an
area which is RAM according to the E820 map.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/enlighten.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 78a881b..21c82dfd 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1530,6 +1530,25 @@ static void __init xen_pvh_early_guest_init(void)
 }
 #endif    /* CONFIG_XEN_PVH */
 
+static int __init xen_initrd_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+	phys_addr_t new;
+
+	new = xen_find_free_area(size);
+	if (!new)
+		panic("initrd is located at position conflicting with E820 map!\n");
+
+	xen_phys_memcpy(new, start, size);
+	pr_info("initrd moved from [mem %#010llx-%#010llx] to [mem %#010llx-%#010llx]\n",
+		start, start + size, new, new + size);
+	memblock_free(start, size);
+
+	boot_params.hdr.ramdisk_image = new;
+	boot_params.ext_ramdisk_image = new >> 32;
+
+	return 1;
+}
+
 /* First C function to be called on Xen boot */
 asmlinkage __visible void __init xen_start_kernel(void)
 {
@@ -1691,6 +1710,9 @@ asmlinkage __visible void __init xen_start_kernel(void)
 	boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
 	boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
 
+	xen_add_reserved_area(initrd_start, xen_start_info->mod_len,
+			      xen_initrd_mem_conflict, 0);
+
 	if (!xen_initial_domain()) {
 		add_preferred_console("xenboot", 0, NULL);
 		add_preferred_console("tty", 0, NULL);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread
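One detail in the handler above is easy to miss: ramdisk_image in the legacy
setup header is only 32 bits wide, so the upper half of the relocated
address goes into the boot protocol's ext_ramdisk_image field, and the
assignment to hdr.ramdisk_image relies on implicit truncation. A tiny
stand-alone illustration of the split (the struct and the address are
made up for this sketch):

#include <stdint.h>
#include <stdio.h>

/* Minimal stand-in for the two boot_params fields involved. */
struct fake_boot_params {
	uint32_t ramdisk_image;		/* hdr.ramdisk_image, low 32 bits */
	uint32_t ext_ramdisk_image;	/* high 32 bits, boot protocol 2.12 */
} bp;

int main(void)
{
	uint64_t new = 0x123456789000ULL;	/* made-up relocation target */

	bp.ramdisk_image = (uint32_t)new;	/* implicit in the patch */
	bp.ext_ramdisk_image = new >> 32;

	printf("low %#x high %#x\n", bp.ramdisk_image, bp.ext_ramdisk_image);
	return 0;
}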

* [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (10 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 11/13] xen: move initrd away from e820 non-ram area Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-19 17:44   ` [Xen-devel] " David Vrabel
  2015-02-18  6:52 ` [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains Juergen Gross
  12 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

When adapting the memory layout of dom0 to that of the host, care must
be taken not to remap the initial p2m list supplied by the hypervisor.

If the p2m map is detected to be in a region which is going to be
remapped, delay the remapping of that area. Not doing so can either
crash the system very early, or lead to clobbered data as the target
memory area of the remap operation will no longer be reserved.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/setup.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 37a34f9..84a6473 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -794,6 +794,20 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
 	}
 }
 
+#ifdef CONFIG_X86_64
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+	/* Delay invalidating memory. */
+	return 0;
+}
+#else
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+	panic("p2m list is located at position conflicting with E820 map!\n");
+	return 0;
+}
+#endif
+
 /*
  * Reserve Xen mfn_list.
  * See comment above "struct start_info" in <xen/interface/xen.h>
@@ -819,14 +833,16 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
 static void __init xen_reserve_xen_mfnlist(void)
 {
 	if (xen_start_info->mfn_list >= __START_KERNEL_map) {
-		memblock_reserve(__pa(xen_start_info->mfn_list),
-				 xen_start_info->pt_base -
-				 xen_start_info->mfn_list);
+		xen_add_reserved_area(__pa(xen_start_info->mfn_list),
+				      xen_start_info->pt_base -
+				      xen_start_info->mfn_list,
+				      xen_p2m_conflict, 1);
 		return;
 	}
 
-	memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
-			 PFN_PHYS(xen_start_info->nr_p2m_frames));
+	xen_add_reserved_area(PFN_PHYS(xen_start_info->first_p2m_pfn),
+			      PFN_PHYS(xen_start_info->nr_p2m_frames),
+			      xen_p2m_conflict, 1);
 }
 
 static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
                   ` (11 preceding siblings ...)
  2015-02-18  6:52 ` [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping Juergen Gross
@ 2015-02-18  6:52 ` Juergen Gross
  2015-02-18  9:21   ` Paul Bolle
  2015-02-18 11:18     ` David Vrabel
  12 siblings, 2 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  6:52 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky
  Cc: Juergen Gross

64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
main reason has been the 3 level p2m tree, which was replaced by the
virtual mapped linear p2m list. Parallel to the p2m list which is
being used by the kernel itself there is a 3 level mfn tree for usage
by the Xen tools and possibly for crash dump analysis. For this tree
the linear p2m list can serve as a replacement, too. As the kernel
can't know whether the tools are capable of dealing with the p2m list
instead of the mfn tree, the limit of 512 GB can't be dropped in all
cases.

This patch replaces the hard limit by a kernel parameter which tells
the kernel to obey the 512 GB limit or not. The default is selected by
a configuration parameter which specifies whether the 512 GB limit
should be active by default for dom0 (only crash dump analysis is
affected) and/or for domUs (additionally domain save/restore/migration
are affected).

Memory above the domain limit is returned to the hypervisor instead of
being identity mapped, which was wrong anyway.

The kernel configuration parameter to specify the maximum size of a
domain can be deleted, as it is not relevant any more.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 Documentation/kernel-parameters.txt |  7 ++++
 arch/x86/include/asm/xen/page.h     |  4 ---
 arch/x86/xen/Kconfig                | 31 +++++++++++-----
 arch/x86/xen/p2m.c                  | 10 +++---
 arch/x86/xen/setup.c                | 72 ++++++++++++++++++++++++++++++-------
 5 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a89e326..7bf6342 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3959,6 +3959,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			plus one apbt timer for broadcast timer.
 			x86_intel_mid_timer=apbt_only | lapic_and_apbt
 
+	xen_512gb_limit		[KNL,X86-64,XEN]
+			Restricts the kernel running paravirtualized under Xen
+			to use only up to 512 GB of RAM. The reason to do so is
+			that crash analysis tools and the Xen tools for domain
+			save/restore/migration may not be able to handle larger
+			domains.
+
 	xen_emul_unplug=		[HW,X86,XEN]
 			Unplug Xen emulated devices
 			Format: [unplug0,][unplug1]
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 358dcd3..18a11f2 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -35,10 +35,6 @@ typedef struct xpaddr {
 #define FOREIGN_FRAME(m)	((m) | FOREIGN_FRAME_BIT)
 #define IDENTITY_FRAME(m)	((m) | IDENTITY_FRAME_BIT)
 
-/* Maximum amount of memory we can handle in a domain in pages */
-#define MAX_DOMAIN_PAGES						\
-    ((unsigned long)((u64)CONFIG_XEN_MAX_DOMAIN_MEMORY * 1024 * 1024 * 1024 / PAGE_SIZE))
-
 extern unsigned long *machine_to_phys_mapping;
 extern unsigned long  machine_to_phys_nr;
 extern unsigned long *xen_p2m_addr;
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index e88fda8..b61a15e 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,29 @@ config XEN_PVHVM
 	def_bool y
 	depends on XEN && PCI && X86_LOCAL_APIC
 
-config XEN_MAX_DOMAIN_MEMORY
-       int
-       default 500 if X86_64
-       default 64 if X86_32
-       depends on XEN
-       help
-         This only affects the sizing of some bss arrays, the unused
-         portions of which are freed.
+if X86_64
+choice
+	prompt "Support pv-domains larger than 512GB"
+	default XEN_512GB_NONE
+	help
+	  Support paravirtualized domains with more than 512GB of RAM.
+
+	  The Xen tools and crash dump analysis tools might not support
+	  pv-domains with more than 512 GB of RAM. This option selects the
+	  default setting of the kernel: limit itself to 512 GB or not.
+	  It is always possible to change the default by specifying the
+	  boot parameter "xen_512gb_limit".
+
+	config XEN_512GB_NONE
+		bool "neither dom0 nor domUs can be larger than 512GB"
+	config XEN_512GB_DOM0
+		bool "dom0 can be larger than 512GB, domUs not"
+	config XEN_512GB_DOMU
+		bool "domUs can be larger than 512GB, dom0 not"
+	config XEN_512GB_ALL
+		bool "dom0 and domUs can be larger than 512GB"
+endchoice
+endif
 
 config XEN_SAVE_RESTORE
        bool
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index df73cc5..12a1e98 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -502,7 +502,7 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
  */
 static bool alloc_p2m(unsigned long pfn)
 {
-	unsigned topidx, mididx;
+	unsigned topidx;
 	unsigned long *top_mfn_p, *mid_mfn;
 	pte_t *ptep, *pte_pg;
 	unsigned int level;
@@ -510,9 +510,6 @@ static bool alloc_p2m(unsigned long pfn)
 	unsigned long addr = (unsigned long)(xen_p2m_addr + pfn);
 	unsigned long p2m_pfn;
 
-	topidx = p2m_top_index(pfn);
-	mididx = p2m_mid_index(pfn);
-
 	ptep = lookup_address(addr, &level);
 	BUG_ON(!ptep || level != PG_LEVEL_4K);
 	pte_pg = (pte_t *)((unsigned long)ptep & ~(PAGE_SIZE - 1));
@@ -524,7 +521,8 @@ static bool alloc_p2m(unsigned long pfn)
 			return false;
 	}
 
-	if (p2m_top_mfn) {
+	if (p2m_top_mfn && pfn < MAX_P2M_PFN) {
+		topidx = p2m_top_index(pfn);
 		top_mfn_p = &p2m_top_mfn[topidx];
 		mid_mfn = ACCESS_ONCE(p2m_top_mfn_p[topidx]);
 
@@ -579,7 +577,7 @@ static bool alloc_p2m(unsigned long pfn)
 				pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
 			HYPERVISOR_shared_info->arch.p2m_generation++;
 			if (mid_mfn)
-				mid_mfn[mididx] = virt_to_mfn(p2m);
+				mid_mfn[p2m_mid_index(pfn)] = virt_to_mfn(p2m);
 			p2m = NULL;
 		}
 
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 84a6473..16d94de 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -32,6 +32,8 @@
 #include "p2m.h"
 #include "mmu.h"
 
+#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
+
 /* Amount of extra memory space we add to the e820 ranges */
 struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
 
@@ -85,6 +87,27 @@ static struct {
  */
 #define EXTRA_MEM_RATIO		(10)
 
+static bool xen_dom0_512gb_limit __initdata =
+	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
+static bool xen_domu_512gb_limit __initdata =
+	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
+
+static int __init xen_parse_512gb(char *arg)
+{
+	bool val = false;
+
+	if (!arg)
+		val = true;
+	else if (strtobool(arg, &val))
+		return 1;
+
+	xen_dom0_512gb_limit = val;
+	xen_domu_512gb_limit = val;
+
+	return 0;
+}
+early_param("xen_512gb_limit", xen_parse_512gb);
+
 static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
 {
 	int i;
@@ -242,14 +265,13 @@ static int __init xen_free_mfn(unsigned long mfn)
 static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
 				unsigned long end_pfn, unsigned long *released)
 {
-	unsigned long pfn, end;
+	unsigned long pfn;
 	int ret;
 
 	WARN_ON(start_pfn > end_pfn);
 
 	/* Release pages first. */
-	end = min(end_pfn, xen_max_pfn);
-	for (pfn = start_pfn; pfn < end; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		unsigned long mfn = pfn_to_mfn(pfn);
 
 		/* Make sure pfn exists to start with */
@@ -390,8 +412,9 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
 
 		/* Do not remap pages beyond the current allocation */
 		if (cur_pfn >= xen_max_pfn) {
-			/* Identity map remaining pages */
-			set_phys_range_identity(cur_pfn, cur_pfn + size);
+			/* Release remaining pages */
+			xen_set_identity_and_release_chunk(cur_pfn,
+				cur_pfn + size, released);
 			break;
 		}
 		if (cur_pfn + size > xen_max_pfn)
@@ -612,12 +635,34 @@ void __init xen_remap_memory(void)
 	pr_info("Remapped %ld page(s)\n", remapped);
 }
 
+static unsigned long __init xen_get_pages_limit(void)
+{
+	unsigned long limit;
+
+#ifdef CONFIG_X86_32
+	limit = GB(64) / PAGE_SIZE;
+#else
+	limit = ~0ul;
+	if (xen_initial_domain()) {
+		if (xen_dom0_512gb_limit)
+			limit = GB(512) / PAGE_SIZE;
+	} else {
+		if (xen_domu_512gb_limit)
+			limit = GB(512) / PAGE_SIZE;
+	}
+#endif
+	return limit;
+}
+
 static unsigned long __init xen_get_max_pages(void)
 {
-	unsigned long max_pages = MAX_DOMAIN_PAGES;
+	unsigned long max_pages, limit;
 	domid_t domid = DOMID_SELF;
 	int ret;
 
+	limit = xen_get_pages_limit();
+	max_pages = limit;
+
 	/*
 	 * For the initial domain we use the maximum reservation as
 	 * the maximum page.
@@ -633,7 +678,7 @@ static unsigned long __init xen_get_max_pages(void)
 			max_pages = ret;
 	}
 
-	return min(max_pages, MAX_DOMAIN_PAGES);
+	return min(max_pages, limit);
 }
 
 static void __init xen_align_and_add_e820_region(phys_addr_t start,
@@ -871,7 +916,8 @@ char * __init xen_memory_setup(void)
 
 	xen_reserve_xen_mfnlist();
 
-	xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
+	xen_max_pfn = xen_get_pages_limit();
+	xen_max_pfn = min(xen_max_pfn, xen_start_info->nr_pages);
 	mem_end = PFN_PHYS(xen_max_pfn);
 
 	memmap.nr_entries = E820MAX;
@@ -933,12 +979,15 @@ char * __init xen_memory_setup(void)
 	 * is limited to the max size of lowmem, so that it doesn't
 	 * get completely filled.
 	 *
+	 * Make sure we have no memory above max_pages, as this area
+	 * isn't handled by the p2m management.
+	 *
 	 * In principle there could be a problem in lowmem systems if
 	 * the initial memory is also very large with respect to
 	 * lowmem, but we won't try to deal with that here.
 	 */
-	extra_pages = min(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
-			  extra_pages);
+	extra_pages = min3(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
+			   extra_pages, max_pages - xen_max_pfn);
 	i = 0;
 	while (i < xen_e820_map_entries) {
 		phys_addr_t addr = xen_e820_map[i].addr;
@@ -968,9 +1017,6 @@ char * __init xen_memory_setup(void)
 	/*
 	 * Set the rest as identity mapped, in case PCI BARs are
 	 * located here.
-	 *
-	 * PFNs above MAX_P2M_PFN are considered identity mapped as
-	 * well.
 	 */
 	set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 59+ messages in thread
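The net effect of the early_param() hook above: a bare "xen_512gb_limit" on
the boot line enables the limit, an explicit boolean value selects it, and
an unparseable value is rejected. A user-space model (strtobool_model()
approximates the kernel's strtobool() of that era, which accepts 1/y/Y and
0/n/N; all names here are invented):

#include <stdbool.h>
#include <stdio.h>

/* Approximation of the kernel's strtobool() as of 2015. */
static int strtobool_model(const char *s, bool *res)
{
	switch (s[0]) {
	case 'y': case 'Y': case '1': *res = true;  return 0;
	case 'n': case 'N': case '0': *res = false; return 0;
	default: return -1;
	}
}

/* Mirrors xen_parse_512gb(): a bare option (arg == NULL) means "limit". */
static void parse(const char *arg)
{
	bool val = false;

	if (!arg)
		val = true;
	else if (strtobool_model(arg, &val)) {
		printf("xen_512gb_limit=%s: rejected\n", arg);
		return;
	}
	printf("xen_512gb_limit%s%s -> 512GB limit %s\n",
	       arg ? "=" : "", arg ? arg : "", val ? "on" : "off");
}

int main(void)
{
	parse(NULL);	/* xen_512gb_limit     -> limit on  */
	parse("0");	/* xen_512gb_limit=0   -> limit off */
	parse("y");	/* xen_512gb_limit=y   -> limit on  */
	parse("x");	/* xen_512gb_limit=x   -> rejected  */
	return 0;
}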

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  6:52 ` [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains Juergen Gross
@ 2015-02-18  9:21   ` Paul Bolle
  2015-02-18  9:37     ` Juergen Gross
  2015-02-18 11:18     ` David Vrabel
  1 sibling, 1 reply; 59+ messages in thread
From: Paul Bolle @ 2015-02-18  9:21 UTC (permalink / raw)
  To: Juergen Gross
  Cc: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky

On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
> 64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
> main reason has been the 3 level p2m tree, which was replaced by the
> virtual mapped linear p2m list. Parallel to the p2m list which is
> being used by the kernel itself there is a 3 level mfn tree for usage
> by the Xen tools and possibly for crash dump analysis. For this tree
> the linear p2m list can serve as a replacement, too. As the kernel
> can't know whether the tools are capable of dealing with the p2m list
> instead of the mfn tree, the limit of 512 GB can't be dropped in all
> cases.
> 
> This patch replaces the hard limit by a kernel parameter which tells
> the kernel to obey the 512 GB limit or not. The default is selected by
> a configuration parameter which specifies whether the 512 GB limit
> should be active by default for dom0 (only crash dump analysis is
> affected) and/or for domUs (additionally domain save/restore/migration
> are affected).
> 
> Memory above the domain limit is returned to the hypervisor instead of
> being identity mapped, which was wrong anyway.
> 
> The kernel configuration parameter to specify the maximum size of a
> domain can be deleted, as it is not relevant any more.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  Documentation/kernel-parameters.txt |  7 ++++
>  arch/x86/include/asm/xen/page.h     |  4 ---
>  arch/x86/xen/Kconfig                | 31 +++++++++++-----
>  arch/x86/xen/p2m.c                  | 10 +++---
>  arch/x86/xen/setup.c                | 72 ++++++++++++++++++++++++++++++-------
>  5 files changed, 93 insertions(+), 31 deletions(-)

[...]

> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -23,14 +23,29 @@ config XEN_PVHVM
>  	def_bool y
>  	depends on XEN && PCI && X86_LOCAL_APIC
>  
> -config XEN_MAX_DOMAIN_MEMORY
> -       int
> -       default 500 if X86_64
> -       default 64 if X86_32
> -       depends on XEN
> -       help
> -         This only affects the sizing of some bss arrays, the unused
> -         portions of which are freed.
> +if X86_64

Not
    && XEN
?

> +choice
> +	prompt "Support pv-domains larger than 512GB"
> +	default XEN_512GB_NONE
> +	help
> +	  Support paravirtualized domains with more than 512GB of RAM.
> +
> +	  The Xen tools and crash dump analysis tools might not support
> +	  pv-domains with more than 512 GB of RAM. This option selects the
> +	  default setting of the kernel: limit itself to 512 GB or not.
> +	  It is always possible to change the default by specifying the
> +	  boot parameter "xen_512gb_limit".
> +
> +	config XEN_512GB_NONE
> +		bool "neither dom0 nor domUs can be larger than 512GB"
> +	config XEN_512GB_DOM0
> +		bool "dom0 can be larger than 512GB, domUs not"
> +	config XEN_512GB_DOMU
> +		bool "domUs can be larger than 512GB, dom0 not"
> +	config XEN_512GB_ALL
> +		bool "dom0 and domUs can be larger than 512GB"
> +endchoice

So there are actually two independent limits, configured through a
choice with four entries. Would using just two separate Kconfig symbols
(XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?
Because ...

> +endif
>  
>  config XEN_SAVE_RESTORE
>         bool

[...]
 
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 84a6473..16d94de 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -32,6 +32,8 @@
>  #include "p2m.h"
>  #include "mmu.h"
>  
> +#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
> +
>  /* Amount of extra memory space we add to the e820 ranges */
>  struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
>  
> @@ -85,6 +87,27 @@ static struct {
>   */
>  #define EXTRA_MEM_RATIO		(10)
>  
> +static bool xen_dom0_512gb_limit __initdata =
> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);

... then this could be something like:
    static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);

> +static bool xen_domu_512gb_limit __initdata =
> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> +

and this likewise:
    static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);

Correct?

> +static int __init xen_parse_512gb(char *arg)
> +{
> +	bool val = false;
> +
> +	if (!arg)
> +		val = true;
> +	else if (strtobool(arg, &val))
> +		return 1;
> +
> +	xen_dom0_512gb_limit = val;
> +	xen_domu_512gb_limit = val;
> +
> +	return 0;
> +}
> +early_param("xen_512gb_limit", xen_parse_512gb);
> +
>  static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
>  {
>  	int i;

So one can configure these two limits separately, but the kernel
parameter is used for both. Any particular reason?

Thanks,


Paul Bolle


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  9:21   ` Paul Bolle
@ 2015-02-18  9:37     ` Juergen Gross
  2015-02-18  9:49         ` Jan Beulich
                         ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  9:37 UTC (permalink / raw)
  To: Paul Bolle
  Cc: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky

On 02/18/2015 10:21 AM, Paul Bolle wrote:
> On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
>> 64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
>> main reason has been the 3 level p2m tree, which was replaced by the
>> virtual mapped linear p2m list. Parallel to the p2m list which is
>> being used by the kernel itself there is a 3 level mfn tree for usage
>> by the Xen tools and possibly for crash dump analysis. For this tree
>> the linear p2m list can serve as a replacement, too. As the kernel
>> can't know whether the tools are capable of dealing with the p2m list
>> instead of the mfn tree, the limit of 512 GB can't be dropped in all
>> cases.
>>
>> This patch replaces the hard limit by a kernel parameter which tells
>> the kernel to obey the 512 GB limit or not. The default is selected by
>> a configuration parameter which specifies whether the 512 GB limit
>> should be active by default for dom0 (only crash dump analysis is
>> affected) and/or for domUs (additionally domain save/restore/migration
>> are affected).
>>
>> Memory above the domain limit is returned to the hypervisor instead of
>> being identity mapped, which was wrong anyway.
>>
>> The kernel configuration parameter to specify the maximum size of a
>> domain can be deleted, as it is not relevant any more.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>   Documentation/kernel-parameters.txt |  7 ++++
>>   arch/x86/include/asm/xen/page.h     |  4 ---
>>   arch/x86/xen/Kconfig                | 31 +++++++++++-----
>>   arch/x86/xen/p2m.c                  | 10 +++---
>>   arch/x86/xen/setup.c                | 72 ++++++++++++++++++++++++++++++-------
>>   5 files changed, 93 insertions(+), 31 deletions(-)
>
> [...]
>
>> --- a/arch/x86/xen/Kconfig
>> +++ b/arch/x86/xen/Kconfig
>> @@ -23,14 +23,29 @@ config XEN_PVHVM
>>   	def_bool y
>>   	depends on XEN && PCI && X86_LOCAL_APIC
>>
>> -config XEN_MAX_DOMAIN_MEMORY
>> -       int
>> -       default 500 if X86_64
>> -       default 64 if X86_32
>> -       depends on XEN
>> -       help
>> -         This only affects the sizing of some bss arrays, the unused
>> -         portions of which are freed.
>> +if X86_64
>
> Not
>      && XEN
> ?

The complete directory is made only if CONFIG_XEN is set.

>
>> +choice
>> +	prompt "Support pv-domains larger than 512GB"
>> +	default XEN_512GB_NONE
>> +	help
>> +	  Support paravirtualized domains with more than 512GB of RAM.
>> +
>> +	  The Xen tools and crash dump analysis tools might not support
>> +	  pv-domains with more than 512 GB of RAM. This option selects the
>> +	  default setting of the kernel: limit itself to 512 GB or not.
>> +	  It is always possible to change the default by specifying the
>> +	  boot parameter "xen_512gb_limit".
>> +
>> +	config XEN_512GB_NONE
>> +		bool "neither dom0 nor domUs can be larger than 512GB"
>> +	config XEN_512GB_DOM0
>> +		bool "dom0 can be larger than 512GB, domUs not"
>> +	config XEN_512GB_DOMU
>> +		bool "domUs can be larger than 512GB, dom0 not"
>> +	config XEN_512GB_ALL
>> +		bool "dom0 and domUs can be larger than 512GB"
>> +endchoice
>
> So there are actually two independent limits, configured through a
> choice with four entries. Would using just two separate Kconfig symbols
> (XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?

Yes.

> Because ...
>
>> +endif
>>
>>   config XEN_SAVE_RESTORE
>>          bool
>
> [...]
>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 84a6473..16d94de 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -32,6 +32,8 @@
>>   #include "p2m.h"
>>   #include "mmu.h"
>>
>> +#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
>> +
>>   /* Amount of extra memory space we add to the e820 ranges */
>>   struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
>>
>> @@ -85,6 +87,27 @@ static struct {
>>    */
>>   #define EXTRA_MEM_RATIO		(10)
>>
>> +static bool xen_dom0_512gb_limit __initdata =
>> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
>
> ... then this could be something like:
>      static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);
>
>> +static bool xen_domu_512gb_limit __initdata =
>> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
>> +
>
> and this likewise:
>      static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);
>
> Correct?

Yes.

That's a matter of taste, I think.

>
>> +static int __init xen_parse_512gb(char *arg)
>> +{
>> +	bool val = false;
>> +
>> +	if (!arg)
>> +		val = true;
>> +	else if (strtobool(arg, &val))
>> +		return 1;
>> +
>> +	xen_dom0_512gb_limit = val;
>> +	xen_domu_512gb_limit = val;
>> +
>> +	return 0;
>> +}
>> +early_param("xen_512gb_limit", xen_parse_512gb);
>> +
>>   static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
>>   {
>>   	int i;
>
> So one can configure these two limits separately, but the kernel
> parameter is used for both. Any particular reason?

Yes. A kernel runs either as Dom0 or as domU at any given time.
Having two parameters here would be nonsense, as only one could apply.

And being able to configure both limits separately does make sense,
of course.


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  9:37     ` Juergen Gross
@ 2015-02-18  9:49         ` Jan Beulich
       [not found]       ` <54E46E3C0200007800060F98@suse.com>
  2015-02-18 10:35       ` Paul Bolle
  2 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2015-02-18  9:49 UTC (permalink / raw)
  To: Juergen Gross
  Cc: david.vrabel, xen-devel, boris.ostrovsky, Paul Bolle, linux-kernel

>>> On 18.02.15 at 10:37, <JGross@suse.com> wrote:
> On 02/18/2015 10:21 AM, Paul Bolle wrote:
>> On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -23,14 +23,29 @@ config XEN_PVHVM
>>>   	def_bool y
>>>   	depends on XEN && PCI && X86_LOCAL_APIC
>>>
>>> -config XEN_MAX_DOMAIN_MEMORY
>>> -       int
>>> -       default 500 if X86_64
>>> -       default 64 if X86_32
>>> -       depends on XEN
>>> -       help
>>> -         This only affects the sizing of some bss arrays, the unused
>>> -         portions of which are freed.
>>> +if X86_64
>>
>> Not
>>      && XEN
>> ?
> 
> The complete directory is made only if CONFIG_XEN is set.

But that doesn't mean this file gets used only when XEN is enabled.
I would think though that an eventual "if XEN" should have wider
scope than just this option (i.e. likely almost the entire file).

Jan


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
       [not found]       ` <54E46E3C0200007800060F98@suse.com>
@ 2015-02-18  9:59         ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18  9:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: david.vrabel, xen-devel, boris.ostrovsky, Paul Bolle, linux-kernel

On 02/18/2015 10:49 AM, Jan Beulich wrote:
>>>> On 18.02.15 at 10:37, <JGross@suse.com> wrote:
>> On 02/18/2015 10:21 AM, Paul Bolle wrote:
>>> On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
>>>> --- a/arch/x86/xen/Kconfig
>>>> +++ b/arch/x86/xen/Kconfig
>>>> @@ -23,14 +23,29 @@ config XEN_PVHVM
>>>>    	def_bool y
>>>>    	depends on XEN && PCI && X86_LOCAL_APIC
>>>>
>>>> -config XEN_MAX_DOMAIN_MEMORY
>>>> -       int
>>>> -       default 500 if X86_64
>>>> -       default 64 if X86_32
>>>> -       depends on XEN
>>>> -       help
>>>> -         This only affects the sizing of some bss arrays, the unused
>>>> -         portions of which are freed.
>>>> +if X86_64
>>>
>>> Not
>>>       && XEN
>>> ?
>>
>> The complete directory is made only if CONFIG_XEN is set.
>
> But that doesn't mean this file gets used only when XEN is enabled.

Oh, you are right. I seem to have mixed up make and Kconfig of the
directory.

> I would think though that an eventual "if XEN" should have wider
> scope than just this option (i.e. likely almost the entire file).

Indeed.

So I'll either add the XEN dependency for the new option or do
another patch adding "if XEN" just below the XEN config option and
removing the XEN dependencies from the rest of the entries.

As Luis is just doing a rework of XEN Kconfig stuff, I think I'll add
the XEN dependency.


Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18  6:51 ` [PATCH 02/13] xen: anchor linear p2m list in shared info structure Juergen Gross
@ 2015-02-18 10:32     ` David Vrabel
  0 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-18 10:32 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/15 06:51, Juergen Gross wrote:
> The linear p2m list should be anchored in the shared info structure

I'm not really sure what you mean by "anchored".

> read by the Xen tools to be able to support 64 bit pv-domains larger
> than 512 GB. Additionally the linear p2m list interface includes a
> generation count which is changed prior to and after each mapping
> change of the p2m list. Reading the generation count the Xen tools can
> detect changes of the mappings and re-read the p2m list if needed.
[...]
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
>  	HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
>  		virt_to_mfn(p2m_top_mfn);
>  	HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
> +	HYPERVISOR_shared_info->arch.p2m_generation = 0;
> +	HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
> +	HYPERVISOR_shared_info->arch.p2m_cr3 =
> +		xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
>  }
>  
>  /* Set up p2m_top to point to the domain-builder provided p2m pages */
> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
>  
>  		ptechk = lookup_address(vaddr, &level);
>  		if (ptechk == pte_pg) {
> +			HYPERVISOR_shared_info->arch.p2m_generation++;
>  			set_pmd(pmdp,
>  				__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
> +			HYPERVISOR_shared_info->arch.p2m_generation++;

Do these increments of p2m_generation need to be atomic?

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  9:37     ` Juergen Gross
  2015-02-18  9:49         ` Jan Beulich
       [not found]       ` <54E46E3C0200007800060F98@suse.com>
@ 2015-02-18 10:35       ` Paul Bolle
  2 siblings, 0 replies; 59+ messages in thread
From: Paul Bolle @ 2015-02-18 10:35 UTC (permalink / raw)
  To: Juergen Gross
  Cc: linux-kernel, xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky

On Wed, 2015-02-18 at 10:37 +0100, Juergen Gross wrote:
> On 02/18/2015 10:21 AM, Paul Bolle wrote:
> > On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
> >> +choice
> >> +	prompt "Support pv-domains larger than 512GB"
> >> +	default XEN_512GB_NONE
> >> +	help
> >> +	  Support paravirtualized domains with more than 512GB of RAM.
> >> +
> >> +	  The Xen tools and crash dump analysis tools might not support
> >> +	  pv-domains with more than 512 GB of RAM. This option selects the
> >> +	  default setting of the kernel: limit itself to 512 GB or not.
> >> +	  It is always possible to change the default by specifying the
> >> +	  boot parameter "xen_512gb_limit".
> >> +
> >> +	config XEN_512GB_NONE
> >> +		bool "neither dom0 nor domUs can be larger than 512GB"
> >> +	config XEN_512GB_DOM0
> >> +		bool "dom0 can be larger than 512GB, domUs not"
> >> +	config XEN_512GB_DOMU
> >> +		bool "domUs can be larger than 512GB, dom0 not"
> >> +	config XEN_512GB_ALL
> >> +		bool "dom0 and domUs can be larger than 512GB"
> >> +endchoice
> >
> > So there are actually two independent limits, configured through a
> > choice with four entries. Would using just two separate Kconfig symbols
> > (XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?
> 
> Yes.
> 
> > Because ...
> >
> >> +endif

[...]

> >> @@ -85,6 +87,27 @@ static struct {
> >>    */
> >>   #define EXTRA_MEM_RATIO		(10)
> >>
> >> +static bool xen_dom0_512gb_limit __initdata =
> >> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
> >
> > ... then this could be something like:
> >      static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> >
> >> +static bool xen_domu_512gb_limit __initdata =
> >> +	IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> >> +
> >
> > and this likewise:
> >      static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);
> >
> > Correct?
> 
> Yes.
> 
> That's a matter of taste, I think.

Well, my suggestion does look simpler. Anyhow, I'll be glad to let the
maintainers decide.

> >
> >> +static int __init xen_parse_512gb(char *arg)
> >> +{
> >> +	bool val = false;
> >> +
> >> +	if (!arg)
> >> +		val = true;
> >> +	else if (strtobool(arg, &val))
> >> +		return 1;
> >> +
> >> +	xen_dom0_512gb_limit = val;
> >> +	xen_domu_512gb_limit = val;
> >> +
> >> +	return 0;
> >> +}
> >> +early_param("xen_512gb_limit", xen_parse_512gb);
> >> +
> >>   static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
> >>   {
> >>   	int i;
> >
> > So one can configure these two limits separately, but the kernel
> > parameter is used for both. Any particular reason?
> 
> Yes. A kernel is running only either as Dom0 or as domU at a given time.
> Having two parameters here would be nonsense, as only one could apply.

I see.

> And being able to configure both limits separately does make sense,
> of course.

Thanks,


Paul Bolle


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:32     ` David Vrabel
  (?)
@ 2015-02-18 10:42     ` Juergen Gross
  2015-02-18 10:50         ` Andrew Cooper
  2015-02-18 10:51         ` David Vrabel
  -1 siblings, 2 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18 10:42 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky

On 02/18/2015 11:32 AM, David Vrabel wrote:
> On 18/02/15 06:51, Juergen Gross wrote:
>> The linear p2m list should be anchored in the shared info structure
>
> I'm not really sure what you mean by "anchored".

Bad wording? What about:

The virtual address of the linear p2m list should be stored in the
shared info structure.

>
>> read by the Xen tools to be able to support 64 bit pv-domains larger
>> than 512 GB. Additionally the linear p2m list interface includes a
>> generation count which is changed prior to and after each mapping
>> change of the p2m list. Reading the generation count the Xen tools can
>> detect changes of the mappings and re-read the p2m list if needed.
> [...]
>> --- a/arch/x86/xen/p2m.c
>> +++ b/arch/x86/xen/p2m.c
>> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
>>   	HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
>>   		virt_to_mfn(p2m_top_mfn);
>>   	HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
>> +	HYPERVISOR_shared_info->arch.p2m_generation = 0;
>> +	HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
>> +	HYPERVISOR_shared_info->arch.p2m_cr3 =
>> +		xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
>>   }
>>
>>   /* Set up p2m_top to point to the domain-builder provided p2m pages */
>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
>>
>>   		ptechk = lookup_address(vaddr, &level);
>>   		if (ptechk == pte_pg) {
>> +			HYPERVISOR_shared_info->arch.p2m_generation++;
>>   			set_pmd(pmdp,
>>   				__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>> +			HYPERVISOR_shared_info->arch.p2m_generation++;
>
> Do these increments of p2m_generation need to be atomic?

Hmm, they are done under lock. I don't think the compiler is allowed to
reorder the writes to p2m_generation across set_pmd().


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:42     ` Juergen Gross
@ 2015-02-18 10:50         ` Andrew Cooper
  2015-02-18 10:51         ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: Andrew Cooper @ 2015-02-18 10:50 UTC (permalink / raw)
  To: Juergen Gross, David Vrabel, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 18/02/15 10:42, Juergen Gross wrote:
>
>>>   /* Set up p2m_top to point to the domain-builder provided p2m
>>> pages */
>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>> pte_t *pte_pg)
>>>
>>>           ptechk = lookup_address(vaddr, &level);
>>>           if (ptechk == pte_pg) {
>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>               set_pmd(pmdp,
>>>                   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>
>> Do these increments of p2m_generation need to be atomic?
>
> Hmm, they are done under lock. I don't think the compiler is allowed to
> reorder the writes to p2m_generation across set_pmd().

They do need smp_wmb() to guarantee that the increment is visible before
the update occurs, just as the toolstack will need smp_rmb() to read.

They also need to be protected from concurrent update inside the kernel,
for which the existing lock appears to suffice.

~Andrew


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:42     ` Juergen Gross
@ 2015-02-18 10:51         ` David Vrabel
  2015-02-18 10:51         ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-18 10:51 UTC (permalink / raw)
  To: Juergen Gross, David Vrabel, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 18/02/15 10:42, Juergen Gross wrote:
> On 02/18/2015 11:32 AM, David Vrabel wrote:
>> On 18/02/15 06:51, Juergen Gross wrote:
>>> The linear p2m list should be anchored in the shared info structure
>>
>> I'm not really sure what you mean by "anchored".
> 
> Bad wording? What about:
> 
> The virtual address of the linear p2m list should be stored in the
> shared info structure.

This is better.

>>> read by the Xen tools to be able to support 64 bit pv-domains larger
>>> than 512 GB. Additionally the linear p2m list interface includes a
>>> generation count which is changed prior to and after each mapping
>>> change of the p2m list. Reading the generation count the Xen tools can
>>> detect changes of the mappings and re-read the p2m list if needed.
>> [...]
>>> --- a/arch/x86/xen/p2m.c
>>> +++ b/arch/x86/xen/p2m.c
>>> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
>>>       HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
>>>           virt_to_mfn(p2m_top_mfn);
>>>       HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
>>> +    HYPERVISOR_shared_info->arch.p2m_generation = 0;
>>> +    HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned
>>> long)xen_p2m_addr;
>>> +    HYPERVISOR_shared_info->arch.p2m_cr3 =
>>> +        xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
>>>   }
>>>
>>>   /* Set up p2m_top to point to the domain-builder provided p2m pages */
>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>> pte_t *pte_pg)
>>>
>>>           ptechk = lookup_address(vaddr, &level);
>>>           if (ptechk == pte_pg) {
>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>               set_pmd(pmdp,
>>>                   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>
>> Do these increments of p2m_generation need to be atomic?
> 
> Hmm, they are done under lock.

Ok, atomic isn't necessary.

> I don't think the compiler is allowed to
> reorder the writes to p2m_generation across set_pmd().

Ok, but I think you also need to prevent the processor reordering the
writes, so I think some write barriers are needed here.  (The toolstack
would then need the corresponding read barriers when checking the
p2m_generation.)

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:50         ` Andrew Cooper
@ 2015-02-18 10:54           ` David Vrabel
  -1 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-18 10:54 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 18/02/15 10:50, Andrew Cooper wrote:
> On 18/02/15 10:42, Juergen Gross wrote:
>>
>>>>   /* Set up p2m_top to point to the domain-builder provided p2m
>>>> pages */
>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>> pte_t *pte_pg)
>>>>
>>>>           ptechk = lookup_address(vaddr, &level);
>>>>           if (ptechk == pte_pg) {
>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>               set_pmd(pmdp,
>>>>                   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>
>>> Do these increments of p2m_generation need to be atomic?
>>
>> Hmm, they are done under lock. I don't think the compiler is allowed to
>> reorder the writes to p2m_generation across set_pmd().
> 
> They do need smp_wmb() to guarantee that the increment is visible before
> the update occurs, just as the toolstack will need smp_rmb() to read.

smp_wmb() isn't good enough since you need the barrier even on non-smp
-- you need a wmb().

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:50         ` Andrew Cooper
  (?)
  (?)
@ 2015-02-18 10:56         ` Juergen Gross
  -1 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18 10:56 UTC (permalink / raw)
  To: Andrew Cooper, David Vrabel, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 02/18/2015 11:50 AM, Andrew Cooper wrote:
> On 18/02/15 10:42, Juergen Gross wrote:
>>
>>>>    /* Set up p2m_top to point to the domain-builder provided p2m
>>>> pages */
>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>> pte_t *pte_pg)
>>>>
>>>>            ptechk = lookup_address(vaddr, &level);
>>>>            if (ptechk == pte_pg) {
>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>                set_pmd(pmdp,
>>>>                    __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>
>>> Do these increments of p2m_generation need to be atomic?
>>
>> Hmm, they are done under lock. I don't think the compiler is allowed to
>> reorder the writes to p2m_generation across set_pmd().
>
> They do need smp_wmb() to guarantee that the increment is visible before
> the update occurs, just as the toolstack will need smp_rmb() to read.

Okay, I'll add smp_wmb() before and after calling set_pmd().


Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:54           ` David Vrabel
@ 2015-02-18 10:57             ` Andrew Cooper
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Cooper @ 2015-02-18 10:57 UTC (permalink / raw)
  To: David Vrabel, Juergen Gross, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 18/02/15 10:54, David Vrabel wrote:
> On 18/02/15 10:50, Andrew Cooper wrote:
>> On 18/02/15 10:42, Juergen Gross wrote:
>>>>>   /* Set up p2m_top to point to the domain-builder provided p2m
>>>>> pages */
>>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>>> pte_t *pte_pg)
>>>>>
>>>>>           ptechk = lookup_address(vaddr, &level);
>>>>>           if (ptechk == pte_pg) {
>>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>>               set_pmd(pmdp,
>>>>>                   __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>> Do these increments of p2m_generation need to be atomic?
>>> Hmm, they are done under lock. I don't think the compiler is allowed to
>>> reorder the writes to p2m_generation across set_pmd().
>> They do need smp_wmb() to guarantee that the increment is visible before
>> the update occurs, just as the toolstack will need smp_rmb() to read.
> smp_wmb() isn't good enough since you need the barrier even on non-smp
> -- you need a wmb().

Ah yes.  I was thinking in the wrong context for smp.  In this case we
need to guarantee interdomain consistency even with a UP guest kernel.
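
For completeness, a sketch of the matching read side in the toolstack
("shinfo" standing in for however the shared info ends up mapped):

	do {
		gen = shinfo->arch.p2m_generation;
		rmb();	/* read generation before the p2m entries */
		/* ... read the p2m entries of interest ... */
		rmb();	/* read the p2m entries before rechecking */
	} while ((gen & 1) || gen != shinfo->arch.p2m_generation);

i.e. retry while an update is in flight (odd count) or has happened in
between.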

~Andrew

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure
  2015-02-18 10:54           ` David Vrabel
  (?)
  (?)
@ 2015-02-18 10:57           ` Juergen Gross
  -1 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18 10:57 UTC (permalink / raw)
  To: David Vrabel, Andrew Cooper, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky

On 02/18/2015 11:54 AM, David Vrabel wrote:
> On 18/02/15 10:50, Andrew Cooper wrote:
>> On 18/02/15 10:42, Juergen Gross wrote:
>>>
>>>>>    /* Set up p2m_top to point to the domain-builder provided p2m
>>>>> pages */
>>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>>> pte_t *pte_pg)
>>>>>
>>>>>            ptechk = lookup_address(vaddr, &level);
>>>>>            if (ptechk == pte_pg) {
>>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>>                set_pmd(pmdp,
>>>>>                    __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>>> +            HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>
>>>> Do these increments of p2m_generation need to be atomic?
>>>
>>> Hmm, they are done under lock. I don't think the compiler is allowed to
>>> reorder the writes to p2m_generation across set_pmd().
>>
>> They do need smp_wmb() to guarantee that the increment is visible before
>> the update occurs, just as the toolstack will need smp_rmb() to read.
>
> smp_wmb() isn't good enough since you need the barrier even on non-smp
> -- you need a wmb().

Okay, will do.
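
For reference, a minimal sketch of the resulting pattern (assuming the
wmb()s directly bracket the set_pmd() as discussed):

	HYPERVISOR_shared_info->arch.p2m_generation++;
	wmb();	/* odd generation visible before the mapping changes */
	set_pmd(pmdp, __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
	wmb();	/* mapping change visible before generation is even again */
	HYPERVISOR_shared_info->arch.p2m_generation++;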

Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18  6:52 ` [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains Juergen Gross
@ 2015-02-18 11:18     ` David Vrabel
  2015-02-18 11:18     ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-18 11:18 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/15 06:52, Juergen Gross wrote:
> 
> +if X86_64
> +choice
> +	prompt "Support pv-domains larger than 512GB"
> +	default XEN_512GB_NONE
> +	help
> +	  Support paravirtualized domains with more than 512GB of RAM.
> +
> +	  The Xen tools and crash dump analysis tools might not support
> +	  pv-domains with more than 512 GB of RAM. This option controls the
> +	  default setting of the kernel to use only up to 512 GB or more.
> +	  It is always possible to change the default via specifying the
> +	  boot parameter "xen_512gb_limit".
> +
> +	config XEN_512GB_NONE
> +		bool "neither dom0 nor domUs can be larger than 512GB"
> +	config XEN_512GB_DOM0
> +		bool "dom0 can be larger than 512GB, domUs not"
> +	config XEN_512GB_DOMU
> +		bool "domUs can be larger than 512GB, dom0 not"
> +	config XEN_512GB_ALL
> +		bool "dom0 and domUs can be larger than 512GB"
> +endchoice
> +endif

This configuration option doesn't look useful to me.  Can we get rid of
it with runtime checking?  e.g.,

If dom0, enable >512G.
If domU, enable >512G if requested by command line option /or/ toolstack
indicates that it supports the linear p2m.

And

If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
supports the linear p2m.

People using crash analysis tools that need the 3-level p2m can clamp
dom0 memory with the Xen command line option.  FWIW, the tool we use
doesn't need this.
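
In pseudo-code, the policy would be something like this (all names made
up; the toolstack indication mechanism is still open):

	if (xen_initial_domain())
		allow_large = true;
	else
		allow_large = xen_512gb_opt ||
			      toolstack_supports_linear_p2m();

	use_3level_p2m = max_pfn < MAX_PFN_512GB &&
			 !toolstack_supports_linear_p2m();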

David


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18 11:18     ` David Vrabel
  (?)
@ 2015-02-18 11:51     ` Juergen Gross
  2015-02-18 11:56       ` Andrew Cooper
  2015-02-19 17:51       ` David Vrabel
  -1 siblings, 2 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18 11:51 UTC (permalink / raw)
  To: xen-devel

On 02/18/2015 12:18 PM, David Vrabel wrote:
> On 18/02/15 06:52, Juergen Gross wrote:
>>
>> +if X86_64
>> +choice
>> +	prompt "Support pv-domains larger than 512GB"
>> +	default XEN_512GB_NONE
>> +	help
>> +	  Support paravirtualized domains with more than 512GB of RAM.
>> +
>> +	  The Xen tools and crash dump analysis tools might not support
>> +	  pv-domains with more than 512 GB of RAM. This option controls the
>> +	  default setting of the kernel to use only up to 512 GB or more.
>> +	  It is always possible to change the default via specifying the
>> +	  boot parameter "xen_512gb_limit".
>> +
>> +	config XEN_512GB_NONE
>> +		bool "neither dom0 nor domUs can be larger than 512GB"
>> +	config XEN_512GB_DOM0
>> +		bool "dom0 can be larger than 512GB, domUs not"
>> +	config XEN_512GB_DOMU
>> +		bool "domUs can be larger than 512GB, dom0 not"
>> +	config XEN_512GB_ALL
>> +		bool "dom0 and domUs can be larger than 512GB"
>> +endchoice
>> +endif
>
> This configuration option doesn't look useful to me.  Can we get rid of
> it with runtime checking.  e.g.,
>
> If dom0, enable >512G.
> If domU, enable >512G if requested by command line option /or/ toolstack
> indicates that it supports the linear p2m.

How is the toolstack supposed to indicate this?

I'd be more than happy to get rid of that option. For Dom0 you seem to
have changed your mind (you rejected enabling >512GB as default last
year).

Doing some more tests I found the command line option is problematic:
The option seems to be evaluated only after it is needed (I did the
first tests using the config option). Can we get rid of the option
even for domU? Or do I have to pre-scan the command line for the option?
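
One way out might be an explicit early scan of boot_command_line, e.g.
with cmdline_find_option_bool() from arch/x86/lib/cmdline.c (sketch
only, the flag variable is made up):

	if (cmdline_find_option_bool(boot_command_line, "xen_512gb_limit"))
		xen_512gb_limit = true;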

> And
>
> If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
> supports the linear p2m.

What about Dom0?

> People using crash analysis tools that need the 3-level p2m can clamp
> dom0 memory with the Xen command line option.  FWIW, the tool we use
> doesn't need this.

Interesting. Which tool are you using?


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18 11:18     ` David Vrabel
  (?)
  (?)
@ 2015-02-18 11:53     ` Juergen Gross
  -1 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-18 11:53 UTC (permalink / raw)
  To: xen-devel, David Vrabel, Konrad Rzeszutek Wilk, Boris Ostrovsky

Sorry, used Reply instead of Reply-all

On 02/18/2015 12:18 PM, David Vrabel wrote:
> On 18/02/15 06:52, Juergen Gross wrote:
>>
>> +if X86_64
>> +choice
>> +	prompt "Support pv-domains larger than 512GB"
>> +	default XEN_512GB_NONE
>> +	help
>> +	  Support paravirtualized domains with more than 512GB of RAM.
>> +
>> +	  The Xen tools and crash dump analysis tools might not support
>> +	  pv-domains with more than 512 GB of RAM. This option controls the
>> +	  default setting of the kernel to use only up to 512 GB or more.
>> +	  It is always possible to change the default via specifying the
>> +	  boot parameter "xen_512gb_limit".
>> +
>> +	config XEN_512GB_NONE
>> +		bool "neither dom0 nor domUs can be larger than 512GB"
>> +	config XEN_512GB_DOM0
>> +		bool "dom0 can be larger than 512GB, domUs not"
>> +	config XEN_512GB_DOMU
>> +		bool "domUs can be larger than 512GB, dom0 not"
>> +	config XEN_512GB_ALL
>> +		bool "dom0 and domUs can be larger than 512GB"
>> +endchoice
>> +endif
>
> This configuration option doesn't look useful to me.  Can we get rid of
> it with runtime checking.  e.g.,
>
> If dom0, enable >512G.
> If domU, enable >512G if requested by command line option /or/ toolstack
> indicates that it supports the linear p2m.

How is the toolstack supposed to indicate this?

I'd be more than happy to get rid of that option. For Dom0 you seem to
have changed your mind (you rejected enabling >512GB as default last
year).

Doing some more tests I found the command line option is problematic:
The option seems to be evaluated only after it is needed (I did the
first tests using the config option). Can we get rid of the option
even for domU? Or do I have to pre-scan the command line for the option?

> And
>
> If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
> supports the linear p2m.

What about Dom0?

> People using crash analysis tools that need the 3-level p2m can clamp
> dom0 memory with the Xen command line option.  FWIW, the tool we use
> doesn't need this.

Interesting. Which tool are you using?


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18 11:51     ` Juergen Gross
@ 2015-02-18 11:56       ` Andrew Cooper
  2015-02-19 17:51       ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: Andrew Cooper @ 2015-02-18 11:56 UTC (permalink / raw)
  To: Juergen Gross, xen-devel

On 18/02/15 11:51, Juergen Gross wrote:
>
>
>> People using crash analysis tools that need the 3-level p2m can clamp
>> dom0 memory with the Xen command line option.  FWIW, the tool we use
>> doesn't need this.
>
> Interesting. Which tool are you using?

https://github.com/xenserver/xen-crashdump-analyser

It walks dom0 by locating the real pagetables from struct vcpu, so needs
no p2m/m2p fiddling at all.

~Andrew

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 04/13] xen: move static e820 map to global scope
  2015-02-18  6:51 ` [PATCH 04/13] xen: move static e820 map to global scope Juergen Gross
@ 2015-02-19 17:27   ` David Vrabel
  0 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:27 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:51, Juergen Gross wrote:
> Instead of using a function local static e820 map in xen_memory_setup()
> and calling various functions in the same source with the map as a
> parameter use a map directly accessible by all functions in the source.

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 07/13] xen: find unused contiguous memory area
  2015-02-18  6:52 ` [PATCH 07/13] xen: find unused contiguous memory area Juergen Gross
@ 2015-02-19 17:31   ` David Vrabel
  0 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:31 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:52, Juergen Gross wrote:
> For being able to relocate pre-allocated data areas like initrd or
> p2m list it is mandatory to find a contiguous memory area which is
> not yet in use and doesn't conflict with the memory map we want to
> be in effect.
>
> In case such an area is found reserve it at once as this will be
> required to be done in any case.

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 08/13] xen: add service function to copy physical memory areas
  2015-02-18  6:52 ` [PATCH 08/13] xen: add service function to copy physical memory areas Juergen Gross
@ 2015-02-19 17:35   ` David Vrabel
  2015-02-24  6:34     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:35 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:52, Juergen Gross wrote:
> In case a pre-allocated memory area is to be moved in order to avoid
> a conflict with the target E820 map we need a way to copy data between
> physical addresses.
>
> Add a function doing this via early_memremap().
[...]
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
>   }
>
>   /*
> + * Like memcpy, but with physical addresses for dest and src.
> + */
> +void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
> +{
> +	phys_addr_t dest_off, src_off, dest_len, src_len, len;
> +	void *from, *to;
> +
> +	while (n) {
> +		dest_off = dest & ~PAGE_MASK;
> +		src_off = src & ~PAGE_MASK;
> +		dest_len = n;
> +		if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
> +			dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
> +		src_len = n;
> +		if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
> +			src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
> +		len = min(dest_len, src_len);
> +		to = early_memremap(dest - dest_off, dest_len + dest_off);
> +		from = early_memremap(src - src_off, src_len + src_off);
> +		memcpy(to, from, len);
> +		early_iounmap(to, dest_len + dest_off);
> +		early_iounmap(from, src_len + src_off);

early_memunmap surely?
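
i.e. (sketch):

	early_memunmap(to, dest_len + dest_off);
	early_memunmap(from, src_len + src_off);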

Otherwise,

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 09/13] xen: check for kernel memory conflicting with memory layout
  2015-02-18  6:52 ` [PATCH 09/13] xen: check for kernel memory conflicting with memory layout Juergen Gross
@ 2015-02-19 17:36   ` David Vrabel
  2015-02-24  6:45     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:36 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:52, Juergen Gross wrote:
> Checks whether the pre-allocated memory of the loaded kernel is in
> conflict with the target memory map. If this is the case, just panic
> instead of run into problems later.

What ensures this doesn't actually happen?

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map
  2015-02-18  6:52 ` [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map Juergen Gross
@ 2015-02-19 17:37   ` David Vrabel
  2015-02-24  6:45     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:37 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky



On 18/02/2015 06:52, Juergen Gross wrote:
> Check whether the page tables built by the domain builder are at
> memory addresses which are in conflict with the target memory map.
> If this is the case just panic instead of running into problems
> later.

Again, what ensures this never actually happens?

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 11/13] xen: move initrd away from e820 non-ram area
  2015-02-18  6:52 ` [PATCH 11/13] xen: move initrd away from e820 non-ram area Juergen Gross
@ 2015-02-19 17:42   ` David Vrabel
  0 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:42 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky



On 18/02/2015 06:52, Juergen Gross wrote:
> When adapting the dom0 memory layout to that of the host make sure
> the initrd isn't moved to another pfn range, as it won't be found
> there any more.
>
> The easiest way to accomplish that is by copying the initrd to an
> area which is RAM according to the E820 map.

Reviewed-by: David Vrabel <david.vrabel@citrix.com>

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping
  2015-02-18  6:52 ` [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping Juergen Gross
@ 2015-02-19 17:44   ` David Vrabel
  2015-02-24  7:01     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:44 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:52, Juergen Gross wrote:
> With adapting the memory layout of dom0 to that of the host care must
> be taken not to remap the initial p2m list supported by the hypervisor.

"...supplied by the hypervisor" ?

> If the p2m map is detected to be in a region which is going to be
> remapped, delay the remapping of that area. Not doing so can either
> crash the system very early, or lead to clobbered data as the target
> memory area of the remap operation will no longer be reserved.

Would it be better to relocate the p2m before remapping memory?  If not, 
explain why in the commit message.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-18 11:51     ` Juergen Gross
  2015-02-18 11:56       ` Andrew Cooper
@ 2015-02-19 17:51       ` David Vrabel
  2015-02-24  7:28         ` Juergen Gross
  1 sibling, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 17:51 UTC (permalink / raw)
  To: Juergen Gross, xen-devel

On 18/02/2015 11:51, Juergen Gross wrote:
> On 02/18/2015 12:18 PM, David Vrabel wrote:
>> On 18/02/15 06:52, Juergen Gross wrote:
>>>
>>> +if X86_64
>>> +choice
>>> +    prompt "Support pv-domains larger than 512GB"
>>> +    default XEN_512GB_NONE
>>> +    help
>>> +      Support paravirtualized domains with more than 512GB of RAM.
>>> +
>>> +      The Xen tools and crash dump analysis tools might not support
>>> +      pv-domains with more than 512 GB of RAM. This option controls the
>>> +      default setting of the kernel to use only up to 512 GB or more.
>>> +      It is always possible to change the default via specifying the
>>> +      boot parameter "xen_512gb_limit".
>>> +
>>> +    config XEN_512GB_NONE
>>> +        bool "neither dom0 nor domUs can be larger than 512GB"
>>> +    config XEN_512GB_DOM0
>>> +        bool "dom0 can be larger than 512GB, domUs not"
>>> +    config XEN_512GB_DOMU
>>> +        bool "domUs can be larger than 512GB, dom0 not"
>>> +    config XEN_512GB_ALL
>>> +        bool "dom0 and domUs can be larger than 512GB"
>>> +endchoice
>>> +endif
>>
>> This configuration option doesn't look useful to me.  Can we get rid of
>> it with runtime checking.  e.g.,
>>
>> If dom0, enable >512G.
>> If domU, enable >512G if requested by command line option /or/ toolstack
>> indicates that it supports the linear p2m.
>
> How is the toolstack supposed to indicate this?

I don't know.  A bit in the shared info perhaps?

> I'd be more than happy to get rid of that option. For Dom0 you seem to
> have changed your mind (you rejected enabling >512GB as default last
> year).

Since for dom0 it's only crash analysis that might be impacted, removing 
the option is fine.  Anyone using such a tool probably knows what they're
doing.

> Doing some more tests I found the command line option is problematic:
> The option seems to be evaluated only after it is needed (I did the
> first tests using the config option). Can we get rid of the option
> even for domU? Or do I have to pre-scan the command line for the option?

I don't want people to be able to inadvertently create domUs that cannot
be saved/migrated.  Can you think of another mechanism?

>> And
>>
>> If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
>> supports the linear p2m.
>
> What about Dom0?

This bit is for dom0 and domU.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-18  6:51 ` [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map Juergen Gross
@ 2015-02-19 18:07   ` David Vrabel
  2015-02-24  6:27     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 18:07 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 18/02/2015 06:51, Juergen Gross wrote:
> Currently especially for dom0 guest memory with guest pfns not
> matching host areas populated with RAM are remapped to areas which
> are RAM native as well. This is done to be able to use identity
> mappings (pfn == mfn) for I/O areas.
>
> Up to now it is not checked whether the remapped memory is already
> in use. Remapping used memory will probably result in data corruption,
> as the remapped memory will no longer be reserved. Any memory
> allocation after the remap can claim that memory.
>
> Add an infrastructure to check for conflicts of reserved memory areas
> and in case of a conflict to react via an area specific function.
>
> This function has 3 options:
> - Panic
> - Handle the conflict by moving the data to another memory area.
>    This is indicated by a return value other than 0.
> - Just return 0. This will delay invalidating the conflicting memory
>    area to just before doing the remap. This option will be usable for
>    cases only where the memory will no longer be needed when the remap
>    operation will be started, e.g. for the p2m list, which is already
>    copied then.
>
> When doing the remap, check for not remapping a reserved page.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>   arch/x86/xen/setup.c   | 185 +++++++++++++++++++++++++++++++++++++++++++++++--
>   arch/x86/xen/xen-ops.h |   2 +
>   2 files changed, 182 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 0dda131..a0af554 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
>   static unsigned long xen_remap_pfn;
>   static unsigned long xen_max_pfn;
>
> +/*
> + * Areas with memblock_reserve()d memory to be checked against final E820 map.
> + * Each area has an associated function to handle conflicts (by either
> + * removing the conflict or by just crashing with an appropriate message).
> + * The array has a fixed size as there are only few areas of interest which are
> + * well known: kernel, page tables, p2m, initrd.
> + */
> +#define XEN_N_RESERVED_AREAS	4
> +static struct {
> +	phys_addr_t	start;
> +	phys_addr_t	size;
> +	int		(*func)(phys_addr_t start, phys_addr_t size);
> +} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
> +
>   /*
>    * The maximum amount of extra memory compared to the base size.  The
>    * main scaling factor is the size of struct page.  At extreme ratios
> @@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>   	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
>   {
>   	unsigned long pfn;
> -	unsigned long i = 0;
> +	unsigned long i;
>   	unsigned long n = end_pfn - start_pfn;
>
> -	while (i < n) {
> +	for (i = 0; i < n; ) {
>   		unsigned long cur_pfn = start_pfn + i;
>   		unsigned long left = n - i;
>   		unsigned long size = left;
> @@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>   			(unsigned long)__va(pfn << PAGE_SHIFT),
>   			mfn_pte(pfn, PAGE_KERNEL_IO), 0);
>   }
> +/* Check to be remapped memory area for conflicts with reserved areas.
> + *
> + * Skip regions known to be reserved which are handled later. For these
> + * regions we have to increase the remapped counter in order to reserve
> + * extra memory space.
> + *
> + * In case a memory page already in use is to be remapped, just BUG().
> + */
> +static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
> +	unsigned long end_pfn, unsigned long *released, unsigned long *remapped)

...remap_chunk_checked() ?

> +{
> +	unsigned long pfn;
> +	unsigned long area_start, area_end;
> +	unsigned i;
> +
> +	for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
> +
> +		if (!xen_reserved_area[i].size)
> +			break;
> +
> +		area_start = PFN_DOWN(xen_reserved_area[i].start);
> +		area_end = PFN_UP(xen_reserved_area[i].start +
> +				  xen_reserved_area[i].size);
> +		if (area_start >= end_pfn || area_end <= start_pfn)
> +			continue;
> +
> +		if (area_start > start_pfn)
> +			xen_set_identity_and_remap_chunk(start_pfn, area_start,
> +							 released, remapped);
> +
> +		if (area_end < end_pfn)
> +			xen_set_identity_and_remap_chunk(area_end, end_pfn,
> +							 released, remapped);
> +
> +		*remapped += min(area_end, end_pfn) -
> +			    max(area_start, start_pfn);
> +
> +		return;

Why not defer the whole chunk that conflicts?  Or for that matter defer 
all this remapping to the last minute.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables
  2015-02-18  6:51 ` [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables Juergen Gross
@ 2015-02-19 18:10   ` David Vrabel
  2015-02-24  6:15     ` Juergen Gross
  0 siblings, 1 reply; 59+ messages in thread
From: David Vrabel @ 2015-02-19 18:10 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky



On 18/02/2015 06:51, Juergen Gross wrote:
> xen_set_identity_and_remap() is used to prepare remapping of memory
> conflicting with the E820 map. It is tracking the pfn where to remap
> new memory via a local variable which is passed to a subfunction
> which in turn returns the new value for that variable.
>
> Additionally the targeted maximum pfn is passed as a parameter to
> sub functions.
>
> Simplify that construct by using just global variables in the
> source for that purpose. This will make things simpler when we need
> those values later, too.

I'm not convinced this actually simplifies anything.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables
  2015-02-19 18:10   ` [Xen-devel] " David Vrabel
@ 2015-02-24  6:15     ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  6:15 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 07:10 PM, David Vrabel wrote:
>
>
> On 18/02/2015 06:51, Juergen Gross wrote:
>> xen_set_identity_and_remap() is used to prepare remapping of memory
>> conflicting with the E820 map. It is tracking the pfn where to remap
>> new memory via a local variable which is passed to a subfunction
>> which in turn returns the new value for that variable.
>>
>> Additionally the targeted maximum pfn is passed as a parameter to
>> sub functions.
>>
>> Simplify that construct by using just global variables in the
>> source for that purpose. This will make things simpler when we need
>> those values later, too.
>
> I'm not convinced this actually simplifies anything.

Perhaps I should have emphasised the last sentence a bit more. I really
need the global variables when deferring the remap operation.

Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-19 18:07   ` [Xen-devel] " David Vrabel
@ 2015-02-24  6:27     ` Juergen Gross
  2015-02-25 14:24         ` David Vrabel
  0 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  6:27 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 07:07 PM, David Vrabel wrote:
> On 18/02/2015 06:51, Juergen Gross wrote:
>> Currently especially for dom0 guest memory with guest pfns not
>> matching host areas populated with RAM are remapped to areas which
>> are RAM native as well. This is done to be able to use identity
>> mappings (pfn == mfn) for I/O areas.
>>
>> Up to now it is not checked whether the remapped memory is already
>> in use. Remapping used memory will probably result in data corruption,
>> as the remapped memory will no longer be reserved. Any memory
>> allocation after the remap can claim that memory.
>>
>> Add an infrastructure to check for conflicts of reserved memory areas
>> and in case of a conflict to react via an area specific function.
>>
>> This function has 3 options:
>> - Panic
>> - Handle the conflict by moving the data to another memory area.
>>    This is indicated by a return value other than 0.
>> - Just return 0. This will delay invalidating the conflicting memory
>>    area to just before doing the remap. This option will be usable for
>>    cases only where the memory will no longer be needed when the remap
>>    operation will be started, e.g. for the p2m list, which is already
>>    copied then.
>>
>> When doing the remap, check for not remapping a reserved page.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>   arch/x86/xen/setup.c   | 185
>> +++++++++++++++++++++++++++++++++++++++++++++++--
>>   arch/x86/xen/xen-ops.h |   2 +
>>   2 files changed, 182 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 0dda131..a0af554 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata =
>> INVALID_P2M_ENTRY;
>>   static unsigned long xen_remap_pfn;
>>   static unsigned long xen_max_pfn;
>>
>> +/*
>> + * Areas with memblock_reserve()d memory to be checked against final
>> E820 map.
>> + * Each area has an associated function to handle conflicts (by either
>> + * removing the conflict or by just crashing with an appropriate
>> message).
>> + * The array has a fixed size as there are only few areas of interest
>> which are
>> + * well known: kernel, page tables, p2m, initrd.
>> + */
>> +#define XEN_N_RESERVED_AREAS    4
>> +static struct {
>> +    phys_addr_t    start;
>> +    phys_addr_t    size;
>> +    int        (*func)(phys_addr_t start, phys_addr_t size);
>> +} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
>> +
>>   /*
>>    * The maximum amount of extra memory compared to the base size.  The
>>    * main scaling factor is the size of struct page.  At extreme ratios
>> @@ -365,10 +379,10 @@ static void __init
>> xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>>       unsigned long end_pfn, unsigned long *released, unsigned long
>> *remapped)
>>   {
>>       unsigned long pfn;
>> -    unsigned long i = 0;
>> +    unsigned long i;
>>       unsigned long n = end_pfn - start_pfn;
>>
>> -    while (i < n) {
>> +    for (i = 0; i < n; ) {
>>           unsigned long cur_pfn = start_pfn + i;
>>           unsigned long left = n - i;
>>           unsigned long size = left;
>> @@ -411,6 +425,53 @@ static void __init
>> xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>>               (unsigned long)__va(pfn << PAGE_SHIFT),
>>               mfn_pte(pfn, PAGE_KERNEL_IO), 0);
>>   }
>> +/* Check to be remapped memory area for conflicts with reserved areas.
>> + *
>> + * Skip regions known to be reserved which are handled later. For these
>> + * regions we have to increase the remapped counter in order to reserve
>> + * extra memory space.
>> + *
>> + * In case a memory page already in use is to be remapped, just BUG().
>> + */
>> +static void __init xen_set_identity_and_remap_chunk_chk(unsigned long
>> start_pfn,
>> +    unsigned long end_pfn, unsigned long *released, unsigned long
>> *remapped)
>
> ...remap_chunk_checked() ?

I just wanted to avoid making the function name even longer. OTOH I
really don't mind using your suggestion. :-)

>
>> +{
>> +    unsigned long pfn;
>> +    unsigned long area_start, area_end;
>> +    unsigned i;
>> +
>> +    for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>> +
>> +        if (!xen_reserved_area[i].size)
>> +            break;
>> +
>> +        area_start = PFN_DOWN(xen_reserved_area[i].start);
>> +        area_end = PFN_UP(xen_reserved_area[i].start +
>> +                  xen_reserved_area[i].size);
>> +        if (area_start >= end_pfn || area_end <= start_pfn)
>> +            continue;
>> +
>> +        if (area_start > start_pfn)
>> +            xen_set_identity_and_remap_chunk(start_pfn, area_start,
>> +                             released, remapped);
>> +
>> +        if (area_end < end_pfn)
>> +            xen_set_identity_and_remap_chunk(area_end, end_pfn,
>> +                             released, remapped);
>> +
>> +        *remapped += min(area_end, end_pfn) -
>> +                max(area_start, start_pfn);
>> +
>> +        return;
>
> Why not defer the whole chunk that conflicts?  Or for that matter defer
> all this remapping to the last minute.

There are two problems arising from this:

- In the initrd case remapping would be deferred too long: the initrd
   data is still in use when device initialization is running. And we
   really want the remap to have happened before PCI space is being used.

- Delaying all remapping to the point where the new p2m list is in place
   would either result in a p2m list with all memory holes covered with
   individual entries as the new list is built with those holes still
   populated, or we would have to perform the check for being able to use
   the pre-built "invalid" or "identity" p2m page in set_phys_to_machine.
   The first option could easily waste significant amounts of memory (on
   my test machine with 1TB RAM this would have been about 1GB), while
   the second option would be performance critical.


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 08/13] xen: add service function to copy physical memory areas
  2015-02-19 17:35   ` [Xen-devel] " David Vrabel
@ 2015-02-24  6:34     ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  6:34 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 06:35 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> In case a pre-allocated memory area is to be moved in order to avoid
>> a conflict with the target E820 map we need a way to copy data between
>> physical addresses.
>>
>> Add a function doing this via early_memremap().
> [...]
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t
>> size)
>>   }
>>
>>   /*
>> + * Like memcpy, but with physical addresses for dest and src.
>> + */
>> +void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src,
>> phys_addr_t n)
>> +{
>> +    phys_addr_t dest_off, src_off, dest_len, src_len, len;
>> +    void *from, *to;
>> +
>> +    while (n) {
>> +        dest_off = dest & ~PAGE_MASK;
>> +        src_off = src & ~PAGE_MASK;
>> +        dest_len = n;
>> +        if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
>> +            dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
>> +        src_len = n;
>> +        if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
>> +            src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
>> +        len = min(dest_len, src_len);
>> +        to = early_memremap(dest - dest_off, dest_len + dest_off);
>> +        from = early_memremap(src - src_off, src_len + src_off);
>> +        memcpy(to, from, len);
>> +        early_iounmap(to, dest_len + dest_off);
>> +        early_iounmap(from, src_len + src_off);
>
> early_memunmap surely?

Hmm, yes, sure. I'll update the patch and send another one for
correcting the code where I took the usage from (relocate_initrd).

Juergen

>
> Otherwise,
>
> Reviewed-by: David Vrabel <david.vrabel@citrix.com>
>
> David


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 09/13] xen: check for kernel memory conflicting with memory layout
  2015-02-19 17:36   ` [Xen-devel] " David Vrabel
@ 2015-02-24  6:45     ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  6:45 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 06:36 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> Checks whether the pre-allocated memory of the loaded kernel is in
>> conflict with the target memory map. If this is the case, just panic
>> instead of run into problems later.
>
> What ensures this doesn't actually happen?

Nothing.

We have basically three options here:

- Die early (my patch).
- Issue a warning that the critical situation has been detected and hope
   for the best by not doing the remap, but probably run into strange
   situations when the discrepancy between the E820 map and actual memory
   usage becomes problematic.
- Do the remap like today and die eventually when the relocated page is
   used for p2m/m2p translations (no real option).

Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map
  2015-02-19 17:37   ` [Xen-devel] " David Vrabel
@ 2015-02-24  6:45     ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  6:45 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 06:37 PM, David Vrabel wrote:
>
>
> On 18/02/2015 06:52, Juergen Gross wrote:
>> Check whether the page tables built by the domain builder are at
>> memory addresses which are in conflict with the target memory map.
>> If this is the case just panic instead of running into problems
>> later.
>
> Again, what ensures this never actually happens?

Same answer as before: nothing.

Juergen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping
  2015-02-19 17:44   ` [Xen-devel] " David Vrabel
@ 2015-02-24  7:01     ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  7:01 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, david.vrabel,
	boris.ostrovsky

On 02/19/2015 06:44 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> With adapting the memory layout of dom0 to that of the host care must
>> be taken not to remap the initial p2m list supported by the hypervisor.
>
> "...supplied by the hypervisor" ?

Yes, of course.

>
>> If the p2m map is detected to be in a region which is going to be
>> remapped, delay the remapping of that area. Not doing so can either
>> crash the system very early, or lead to clobbered data as the target
>> memory area of the remap operation will no longer be reserved.
>
> Would it be better to relocate the p2m before remapping memory?  If not,
> explain why in the commit message.

Okay, will do.


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains
  2015-02-19 17:51       ` David Vrabel
@ 2015-02-24  7:28         ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-02-24  7:28 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: keir, Ian.Campbell, Andrew Cooper, Tim Deegan, Jan Beulich, Ian Jackson

On 02/19/2015 06:51 PM, David Vrabel wrote:
> On 18/02/2015 11:51, Juergen Gross wrote:
>> On 02/18/2015 12:18 PM, David Vrabel wrote:
>>> On 18/02/15 06:52, Juergen Gross wrote:
>>>>
>>>> +if X86_64
>>>> +choice
>>>> +    prompt "Support pv-domains larger than 512GB"
>>>> +    default XEN_512GB_NONE
>>>> +    help
>>>> +      Support paravirtualized domains with more than 512GB of RAM.
>>>> +
>>>> +      The Xen tools and crash dump analysis tools might not support
>>>> +      pv-domains with more than 512 GB of RAM. This option controls
>>>> the
>>>> +      default setting of the kernel to use only up to 512 GB or more.
>>>> +      It is always possible to change the default via specifying the
>>>> +      boot parameter "xen_512gb_limit".
>>>> +
>>>> +    config XEN_512GB_NONE
>>>> +        bool "neither dom0 nor domUs can be larger than 512GB"
>>>> +    config XEN_512GB_DOM0
>>>> +        bool "dom0 can be larger than 512GB, domUs not"
>>>> +    config XEN_512GB_DOMU
>>>> +        bool "domUs can be larger than 512GB, dom0 not"
>>>> +    config XEN_512GB_ALL
>>>> +        bool "dom0 and domUs can be larger than 512GB"
>>>> +endchoice
>>>> +endif
>>>
>>> This configuration option doesn't look useful to me.  Can we get rid of
>>> it with runtime checking.  e.g.,
>>>
>>> If dom0, enable >512G.
>>> If domU, enable >512G if requested by command line option /or/ toolstack
>>> indicates that it supports the linear p2m.
>>
>> How is the toolstack supposed to indicate this?
>
> I don't know.  A bit in the shared info perhaps?

Okay, I'd be fine with this. CC'ed the hypervisor maintainers to enable
them to object early. :-)

>
>> I'd be more than happy to get rid of that option. For Dom0 you seem to
>> have changed your mind (you rejected enabling >512GB as default last
>> year).
>
> Since for dom0 it's only crash analysis that might be impacted, removing
> the option is fine.  Anyone using such a tool probably know what they're
> doing.

Fine, I'll remove it.

>
>> Doing some more tests I found the command line option is problematic:
>> The option seems to be evaluated only after it is needed (I did the
>> first tests using the config option). Can we get rid of the option
>> even for domU? Or do I have to pre-scan the command line for the option?
>
> I don't want people to be able to inadvertently create domUs that cannot
> be saved/migrated.  Can you think of another mechanism?

No, I hoped you'd have an idea.

I'll try to parse the command line options a little bit earlier.

>
>>> And
>>>
>>> If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
>>> supports the linear p2m.
>>
>> What about Dom0?
>
> This bit is for dom0 and domU.

Okay, but which toolstack would set the indicator for dom0? The
hypervisor? Would it set this bit in any case, or depending on a
command line option?


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-24  6:27     ` Juergen Gross
@ 2015-02-25 14:24         ` David Vrabel
  0 siblings, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-02-25 14:24 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky

On 24/02/15 06:27, Juergen Gross wrote:
> On 02/19/2015 07:07 PM, David Vrabel wrote:
>> On 18/02/2015 06:51, Juergen Gross wrote:
>>> +{
>>> +    unsigned long pfn;
>>> +    unsigned long area_start, area_end;
>>> +    unsigned i;
>>> +
>>> +    for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>>> +
>>> +        if (!xen_reserved_area[i].size)
>>> +            break;
>>> +
>>> +        area_start = PFN_DOWN(xen_reserved_area[i].start);
>>> +        area_end = PFN_UP(xen_reserved_area[i].start +
>>> +                  xen_reserved_area[i].size);
>>> +        if (area_start >= end_pfn || area_end <= start_pfn)
>>> +            continue;
>>> +
>>> +        if (area_start > start_pfn)
>>> +            xen_set_identity_and_remap_chunk(start_pfn, area_start,
>>> +                             released, remapped);
>>> +
>>> +        if (area_end < end_pfn)
>>> +            xen_set_identity_and_remap_chunk(area_end, end_pfn,
>>> +                             released, remapped);
>>> +
>>> +        *remapped += min(area_end, end_pfn) -
>>> +                max(area_start, start_pfn);
>>> +
>>> +        return;
>>
>> Why not defer the whole chunk that conflicts?  Or for that matter defer
>> all this remapping to the last minute.
> 
> There are two problems arising from this:
> 
> - In the initrd case remapping would be deferred too long: the initrd
>   data is still in use when device initialization is running. And we
>   really want the remap to have happened before PCI space is being used.

I'm not sure I understand what you're saying here.

I'm suggesting:

1. Reserve all holes.

2. Relocate (if necessary) all modules (initrd, etc.) to regions that
are RAM in the e820.

3. Rebuild the p2m in RAM.

4. Relocate frames from E820 holes/reserved to the end, free p2m pages
from the holes and replacing them with the read-only 1:1 page (where
possible).
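
In outline (function names purely illustrative):

	xen_reserve_holes();		/* 1: memblock_reserve() non-RAM areas */
	xen_relocate_modules();		/* 2: initrd etc. into E820 RAM */
	xen_rebuild_p2m();		/* 3: build the new p2m in RAM */
	xen_remap_hole_frames();	/* 4: move frames out, reclaim per-hole
					 * p2m pages, map the RO identity page
					 */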

> - Delaying all remapping to the point where the new p2m list is in place
>   would either result in a p2m list with all memory holes covered with
>   individual entries as the new list is built with those holes still
>   populated, ...
>   The first option could easily waste significant amounts of memory (on
>   my test machine with 1TB RAM this would have been about 1GB), while
>   the second option would be performance critical.

I don't understand how this wastes memory.   When you relocate the
frames from the holes you can reclaim the p2m pages for the holes (and
replace them with the r/o mapped identity p2m page).

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-25 14:24         ` David Vrabel
  (?)
@ 2015-02-25 16:00         ` Juergen Gross
  2015-03-30 10:00           ` Juergen Gross
  -1 siblings, 1 reply; 59+ messages in thread
From: Juergen Gross @ 2015-02-25 16:00 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky

On 02/25/2015 03:24 PM, David Vrabel wrote:
> On 24/02/15 06:27, Juergen Gross wrote:
>> On 02/19/2015 07:07 PM, David Vrabel wrote:
>>> On 18/02/2015 06:51, Juergen Gross wrote:
>>>> +{
>>>> +    unsigned long pfn;
>>>> +    unsigned long area_start, area_end;
>>>> +    unsigned i;
>>>> +
>>>> +    for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>>>> +
>>>> +        if (!xen_reserved_area[i].size)
>>>> +            break;
>>>> +
>>>> +        area_start = PFN_DOWN(xen_reserved_area[i].start);
>>>> +        area_end = PFN_UP(xen_reserved_area[i].start +
>>>> +                  xen_reserved_area[i].size);
>>>> +        if (area_start >= end_pfn || area_end <= start_pfn)
>>>> +            continue;
>>>> +
>>>> +        if (area_start > start_pfn)
>>>> +            xen_set_identity_and_remap_chunk(start_pfn, area_start,
>>>> +                             released, remapped);
>>>> +
>>>> +        if (area_end < end_pfn)
>>>> +            xen_set_identity_and_remap_chunk(area_end, end_pfn,
>>>> +                             released, remapped);
>>>> +
>>>> +        *remapped += min(area_end, end_pfn) -
>>>> +                max(area_start, start_pfn);
>>>> +
>>>> +        return;
>>>
>>> Why not defer the whole chunk that conflicts?  Or for that matter defer
>>> all this remapping to the last minute.
>>
>> There are two problems arising from this:
>>
>> - In the initrd case remapping would be deferred too long: the initrd
>>    data is still in use when device initialization is running. And we
>>    really want the remap to have happened before PCI space is being used.
>
> I'm not sure I understand what you're saying here.

I thought you wanted to defer the remapping to the point where the
initrd memory is no longer being used. But the suggestion below is
clearer.

>
> I'm suggesting:
>
> 1. Reserve all holes.
>
> 2. Relocate (if necessary) all modules (initrd, etc.) to regions that
> are RAM in the e820.
>
> 3. Rebuild the p2m in RAM.
>
> 4. Relocate frames from E820 holes/reserved to the end, free p2m pages
> from the holes and replace them with the read-only 1:1 page (where
> possible).
>
>> - Delaying all remapping to the point where the new p2m list is in place
>>    would either result in a p2m list with all memory holes covered with
>>    individual entries as the new list is built with those holes still
>>    populated, ...
>>    The first option could easily waste significant amounts of memory (on
>>    my test machine with 1TB RAM this would have been about 1GB), while
>>    the second option would be performance critical.
>
> I don't understand how this wastes memory.  When you relocate the
> frames from the holes you can reclaim the p2m pages for the holes (and
> replace them with the r/o mapped identity p2m page).

Okay, this would work, I guess.

I'll have a try with some new patches...


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map
  2015-02-25 16:00         ` Juergen Gross
@ 2015-03-30 10:00           ` Juergen Gross
  0 siblings, 0 replies; 59+ messages in thread
From: Juergen Gross @ 2015-03-30 10:00 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky

On 02/25/2015 05:00 PM, Juergen Gross wrote:
> On 02/25/2015 03:24 PM, David Vrabel wrote:
>> On 24/02/15 06:27, Juergen Gross wrote:
>>> On 02/19/2015 07:07 PM, David Vrabel wrote:
>>>> On 18/02/2015 06:51, Juergen Gross wrote:
>>>>> +{
>>>>> +    unsigned long pfn;
>>>>> +    unsigned long area_start, area_end;
>>>>> +    unsigned i;
>>>>> +
>>>>> +    for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>>>>> +
>>>>> +        if (!xen_reserved_area[i].size)
>>>>> +            break;
>>>>> +
>>>>> +        area_start = PFN_DOWN(xen_reserved_area[i].start);
>>>>> +        area_end = PFN_UP(xen_reserved_area[i].start +
>>>>> +                  xen_reserved_area[i].size);
>>>>> +        if (area_start >= end_pfn || area_end <= start_pfn)
>>>>> +            continue;
>>>>> +
>>>>> +        if (area_start > start_pfn)
>>>>> +            xen_set_identity_and_remap_chunk(start_pfn, area_start,
>>>>> +                             released, remapped);
>>>>> +
>>>>> +        if (area_end < end_pfn)
>>>>> +            xen_set_identity_and_remap_chunk(area_end, end_pfn,
>>>>> +                             released, remapped);
>>>>> +
>>>>> +        *remapped += min(area_end, end_pfn) -
>>>>> +                max(area_start, start_pfn);
>>>>> +
>>>>> +        return;
>>>>
>>>> Why not defer the whole chunk that conflicts?  Or for that matter defer
>>>> all this remapping to the last minute.
>>>
>>> There are two problems arising from this:
>>>
>>> - In the initrd case remapping would be deferred too long: the initrd
>>>    data is still in use when device initialization is running. And we
>>>    really want the remap to have happened before PCI space is being
>>> used.
>>
>> I'm not sure I understand what you're saying here.
>
> I thought you wanted to defer the remapping to the point where the
> initrd memory is no longer being used. But the suggestion below is
> more clear.
>
>>
>> I'm suggesting:
>>
>> 1. Reserve all holes.
>>
>> 2. Relocate (if necessary) all modules (initrd, etc.) to regions that
>> are RAM in the e820.
>>
>> 3. Rebuild the p2m in RAM.
>>
>> 4. Relocate frames from E820 holes/reserved to the end, free p2m pages
>> from the holes and replace them with the read-only 1:1 page (where
>> possible).
>>
>>> - Delaying all remapping to the point where the new p2m list is in place
>>>    would either result in a p2m list with all memory holes covered with
>>>    individual entries as the new list is built with those holes still
>>>    populated, ...
>>>    The first option could easily waste significant amounts of memory (on
>>>    my test machine with 1TB RAM this would have been about 1GB), while
>>>    the second option would be performance critical.
>>
>> I don't understand how this wastes memory.  When you relocate the
>> frames from the holes you can reclaim the p2m pages for the holes (and
>> replace them with the r/o mapped identity p2m page).
>
> Okay, this would work, I guess.
>
> I'll have a try with some new patches...

I tried your approach and hit a problem I can't solve without a major
rework of the kernel's init sequence:

dmi_scan_machine() (and possibly other functions like probe_roms())
needs the identity mappings of BIOS, ACPI or PCI memory; otherwise
SMBIOS, DMI and extension ROMs won't be discovered.

This can be solved only by either a complete rework of the sequence of
called init functions (not desirable, I guess) or by doing the unmap
part of the remapping as early as today.
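
A rough sketch of that second option (keep the unmap/identity part
early, defer only the repopulation): set_phys_range_identity() is a
real function in arch/x86/xen/p2m.c, while the deferred half is a
purely hypothetical placeholder:

static void __init xen_handle_nonram_range_early(unsigned long start_pfn,
                                                 unsigned long end_pfn)
{
        /*
         * Phase 1 (early): record the machine frames currently backing
         * this range (omitted here), then mark the pfns 1:1 in the p2m
         * so BIOS/ACPI/PCI areas stay readable for early users such as
         * dmi_scan_machine() and probe_roms().
         */
        set_phys_range_identity(start_pfn, end_pfn);

        /*
         * Phase 2 (deferred, invented name): hand the recorded machine
         * frames back as usable RAM by remapping them above the current
         * end of memory once the new p2m list is in place.
         */
        xen_queue_deferred_remap(start_pfn, end_pfn);
}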

This means, of course, I was just lucky that resolving the conflict
between the p2m table and the E820 map by merely delaying the remapping
of this memory area worked at all: had the p2m table collided with an
area that needs to be identity mapped early, the machine wouldn't have
been able to boot my kernel. So I really need to relocate the p2m list,
even if that is not as easy as delaying the remapping.
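
The relocation could plausibly build on the service functions added
earlier in this series ("xen: find unused contiguous memory area" and
"xen: add service function to copy physical memory areas"); the names
and signatures below are invented for this sketch, as the real ones are
not shown in this thread:

static void __init xen_relocate_p2m_list(phys_addr_t p2m_start,
                                         phys_addr_t p2m_size)
{
        phys_addr_t new_area;

        /* Find an unused, contiguous area of RAM large enough. */
        new_area = xen_find_free_area(p2m_size);
        if (!new_area)
                panic("no space to relocate the p2m list");

        /*
         * Copy the list, then switch all users to the new location
         * before the old area is unmapped/remapped.
         */
        xen_phys_memcpy(new_area, p2m_start, p2m_size);
        xen_switch_p2m_base(new_area);  /* hypothetical */
}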


Juergen

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2015-03-30 10:00 UTC | newest]

Thread overview: 59+ messages
2015-02-18  6:51 [PATCH 00/13] xen: support pv-domains larger than 512GB Juergen Gross
2015-02-18  6:51 ` [PATCH 01/13] xen: sync with xen header Juergen Gross
2015-02-18  6:51 ` [PATCH 02/13] xen: anchor linear p2m list in shared info structure Juergen Gross
2015-02-18 10:32   ` [Xen-devel] " David Vrabel
2015-02-18 10:32     ` David Vrabel
2015-02-18 10:42     ` Juergen Gross
2015-02-18 10:50       ` Andrew Cooper
2015-02-18 10:50         ` Andrew Cooper
2015-02-18 10:54         ` David Vrabel
2015-02-18 10:54           ` David Vrabel
2015-02-18 10:57           ` Andrew Cooper
2015-02-18 10:57             ` Andrew Cooper
2015-02-18 10:57           ` Juergen Gross
2015-02-18 10:56         ` Juergen Gross
2015-02-18 10:51       ` David Vrabel
2015-02-18 10:51         ` David Vrabel
2015-02-18  6:51 ` [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup Juergen Gross
2015-02-18  6:51 ` [PATCH 04/13] xen: move static e820 map to global scope Juergen Gross
2015-02-19 17:27   ` [Xen-devel] " David Vrabel
2015-02-18  6:51 ` [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables Juergen Gross
2015-02-19 18:10   ` [Xen-devel] " David Vrabel
2015-02-24  6:15     ` Juergen Gross
2015-02-18  6:51 ` [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map Juergen Gross
2015-02-19 18:07   ` [Xen-devel] " David Vrabel
2015-02-24  6:27     ` Juergen Gross
2015-02-25 14:24       ` David Vrabel
2015-02-25 16:00         ` Juergen Gross
2015-03-30 10:00           ` Juergen Gross
2015-02-18  6:52 ` [PATCH 07/13] xen: find unused contiguous memory area Juergen Gross
2015-02-19 17:31   ` [Xen-devel] " David Vrabel
2015-02-18  6:52 ` [PATCH 08/13] xen: add service function to copy physical memory areas Juergen Gross
2015-02-19 17:35   ` [Xen-devel] " David Vrabel
2015-02-24  6:34     ` Juergen Gross
2015-02-18  6:52 ` [PATCH 09/13] xen: check for kernel memory conflicting with memory layout Juergen Gross
2015-02-19 17:36   ` [Xen-devel] " David Vrabel
2015-02-24  6:45     ` Juergen Gross
2015-02-18  6:52 ` [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map Juergen Gross
2015-02-19 17:37   ` [Xen-devel] " David Vrabel
2015-02-24  6:45     ` Juergen Gross
2015-02-18  6:52 ` [PATCH 11/13] xen: move initrd away from e820 non-ram area Juergen Gross
2015-02-19 17:42   ` [Xen-devel] " David Vrabel
2015-02-18  6:52 ` [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping Juergen Gross
2015-02-19 17:44   ` [Xen-devel] " David Vrabel
2015-02-24  7:01     ` Juergen Gross
2015-02-18  6:52 ` [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains Juergen Gross
2015-02-18  9:21   ` Paul Bolle
2015-02-18  9:37     ` Juergen Gross
2015-02-18  9:49       ` [Xen-devel] " Jan Beulich
2015-02-18  9:49         ` Jan Beulich
     [not found]       ` <54E46E3C0200007800060F98@suse.com>
2015-02-18  9:59         ` Juergen Gross
2015-02-18 10:35       ` Paul Bolle
2015-02-18 11:18   ` [Xen-devel] " David Vrabel
2015-02-18 11:18     ` David Vrabel
2015-02-18 11:51     ` Juergen Gross
2015-02-18 11:56       ` Andrew Cooper
2015-02-19 17:51       ` David Vrabel
2015-02-24  7:28         ` Juergen Gross
2015-02-18 11:53     ` Juergen Gross
