* [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
@ 2012-07-26 20:47 Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 1/7] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
                   ` (8 more replies)
  0 siblings, 9 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel

These depend on the "documentation, refactor and cleanups (v2)" patches
I posted earlier (https://lkml.org/lkml/2012/7/26/469).

The details of this problem are nicely explained in:

 [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
 [PATCH 6/7] xen/mmu: Copy and revector the P2M tree.
 [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables.

and the supporting patches are just nice optimizations. Pasting in
what those patches mentioned:


During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000
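
(Sanity-checking the phys2mach segment above: a 128GB guest has
128GB / 4KB = 0x2000000 PFNs, and at 8 bytes per MFN entry the flat
P2M list needs 0x2000000 * 8 = 0x10000000 bytes - exactly the
0x10000 pages, i.e. 256MB, that xc_dom_alloc_segment reserves.)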

So the physical memory and the virtual layout (using __START_KERNEL_map
addresses) look as follows:

  phys                             __ka
/------------\                   /-------------------\
| 0          | empty             | 0xffffffff80000000|
| ..         |                   | ..                |
| 16MB       | <= kernel starts  | 0xffffffff81000000|
| ..         |                   |                   |
| 30MB       | <= kernel ends => | 0xffffffff81e43000|
| ..         |  & ramdisk starts | ..                |
| 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
| ..         |  & P2M starts     | ..                |
| ..         |                   | ..                |
| 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
| ..         | start_info        | 0xffffffffa25c7000|
| ..         | xenstore          | 0xffffffffa25c8000|
| ..         | console           | 0xffffffffa25c9000|
| 549MB      | <= page tables => | 0xffffffffa25ca000|
| ..         |                   |                   |
| 550MB      | <= PGT end     => | 0xffffffffa26e1000|
| ..         | boot stack        |                   |
\------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables take up a fair
chunk of the __ka address space. That is a problem, since
MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
right in there! During bootup this makes it impossible to
load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. (The WARN above fires because module_alloc hands
vmap a range in MODULES_VADDR whose page-table entries are already
populated by the P2M mapping, so the new mapping cannot be
installed.) There are two ways of fixing this:

 1) Make all P2M lookups use the __va address instead of the
    __ka address. This means we can safely erase from the
    __ka space the PMD entries that point to the PFNs of the
    P2M array and be OK.
 2) Allocate a new array, copy the existing P2M into it,
    revector the P2M tree to use that, and return the old
    P2M to the memory allocator. This has the advantage that
    it sets the stage for using the XEN_ELF_NOTE_INIT_P2M
    feature. That feature allows us to set the exact virtual
    address space we want for the P2M - and allows us to
    boot as the initial domain on large machines.

So we pick option 2).
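
To make "revector" concrete: the same physical pages are reachable
through two virtual aliases, and we switch the P2M pointers from the
__ka alias to the __va one. A minimal kernel-context sketch of the
rebasing (a hypothetical helper, not in the series - the patches
open-code it as __va(__pa(...))):

static void *ka_to_va(void *ka_ptr)
{
	/* __pa() accepts either alias; __va() rebases the physical
	 * address into the 1:1 direct map. */
	return __va(__pa(ka_ptr));
}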

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

-- xen/mmu: Copy and revector the P2M tree:

The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
copies the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use the new __va addresses of the new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.

-- xen/mmu: Remove from __ka space PMD entries for pagetables:

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

The xen_revector_p2m_tree code and the code around it have ripped out
the __ka for the old P2M array.

Here we continue, doing the same to the region where the Xen
page-tables were. It is safe to do, as the page-tables are addressed
using __va. For good measure we delete anything from MODULES_VADDR
up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/7] xen/mmu: use copy_page instead of memcpy.
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-27  7:35   ` Jan Beulich
  2012-07-27  7:35   ` [Xen-devel] " Jan Beulich
  2012-07-26 20:47 ` [PATCH 2/7] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

After all, this is what it is there for.
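
A hedged aside on why this is a drop-in replacement in the 64-bit
hunks: copy_page(dst, src) copies exactly PAGE_SIZE bytes, and on
x86-64 a PMD table is exactly one page (illustrative arithmetic, not
part of the patch):

/* Why copy_page() is equivalent here on x86-64:
 *   sizeof(pmd_t) * PTRS_PER_PMD == 8 * 512 == 4096 == PAGE_SIZE
 * so each memcpy being replaced already copied one whole page. */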

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6ba6100..7247e5a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1754,14 +1754,14 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
-	memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_ident_pgt, l2);
 	/* Graft it onto L4[511][511] */
-	memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_kernel_pgt, l2);
 
 	/* Get [511][510] and graft that in level2_fixmap_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_fixmap_pgt, l2);
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
@@ -1821,8 +1821,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
 	 */
 	swapper_kernel_pmd =
 		extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
-	memcpy(swapper_kernel_pmd, initial_kernel_pmd,
-	       sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(swapper_kernel_pmd, initial_kernel_pmd);
 	swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
 	set_page_prot(swapper_kernel_pmd, PAGE_KERNEL_RO);
@@ -1851,11 +1850,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 				  512*1024);
 
 	kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd);
-	memcpy(initial_kernel_pmd, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(initial_kernel_pmd, kernel_pmd);
 
 	xen_map_identity_early(initial_kernel_pmd, max_pfn);
 
-	memcpy(initial_page_table, pgd, sizeof(pgd_t) * PTRS_PER_PGD);
+	copy_page(initial_page_table, pgd);
 	initial_page_table[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/7] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 1/7] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back Konrad Rzeszutek Wilk
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Because we do not need it. During startup Xen provides us
with all of the mapped memory that we need to function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7247e5a..a59070b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -84,6 +84,7 @@
  */
 DEFINE_SPINLOCK(xen_reservation_lock);
 
+#ifdef CONFIG_X86_32
 /*
  * Identity map, in addition to plain kernel map.  This needs to be
  * large enough to allocate page table pages to allocate the rest.
@@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
  */
 #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
 static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
-
+#endif
 #ifdef CONFIG_X86_64
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
@@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
 	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
 		BUG();
 }
-
+#ifdef CONFIG_X86_32
 static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 {
 	unsigned pmdidx, pteidx;
@@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 
 	set_page_prot(pmd, PAGE_KERNEL_RO);
 }
-
+#endif
 void __init xen_setup_machphys_mapping(void)
 {
 	struct xen_machphys_mapping mapping;
@@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
-	/* Set up identity map */
-	xen_map_identity_early(level2_ident_pgt, max_pfn);
-
 	/* Make pagetable pieces RO */
 	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back.
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 1/7] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 2/7] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-27 11:37   ` [Xen-devel] " Stefano Stabellini
  2012-07-26 20:47 ` [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Since we are not using it, and somebody else could.
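
A sketch of the bookkeeping below, assuming the pgd is the first of
the nr_pt_frames pagetable frames Xen hands us (this holds for
toolstack-built guests; patch 4/7 revisits the dom0 layout):

/* Xen-provided pagetable frames, nr_pt_frames pages at pt_base:
 *   [ L4 | L3 | L2 | L1 | L1 | ... ]
 * After cr3 is switched to init_level4_pgt the L4 frame is dead
 * weight, so reserve one page less starting one page higher, and
 * hand the first frame back as an ordinary RW page
 * (set_page_prot + clear_page). */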

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a59070b..48bdc9f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1782,20 +1782,21 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Unpin Xen-provided one */
 	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
-	/* Switch over */
-	pgd = init_level4_pgt;
-
 	/*
 	 * At this stage there can be no user pgd, and no page
 	 * structure to attach it to, so make sure we just set kernel
 	 * pgd.
 	 */
 	xen_mc_batch();
-	__xen_write_cr3(true, __pa(pgd));
+	__xen_write_cr3(true, __pa(init_level4_pgt));
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
 
-	memblock_reserve(__pa(xen_start_info->pt_base),
-			 xen_start_info->nr_pt_frames * PAGE_SIZE);
+	/* Offset by one page since the original pgd is going bye bye */
+	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
+			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
+	/* and also RW it so it can actually be used. */
+	set_page_prot(pgd, PAGE_KERNEL);
+	clear_page(pgd);
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2012-07-26 20:47 ` [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-27 11:45   ` [Xen-devel] " Stefano Stabellini
  2012-07-26 20:47 ` [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

As we are not using them: we end up only using the L1 pagetables
and grafting those onto our page-tables.
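
As a worked example of the trimming loop below (hedged - the exact
frame order depends on whether we boot as dom0 or as a
toolstack-built guest): if the pgd happens to be the first frame in
[pt_base, pt_end) and the L3 and L2 are the last two, the scan
advances pt_base by one and pulls pt_end down by two, so the region
finally handed to memblock_reserve is three pages smaller - matching
the "(by three pages) smaller" comment in the code.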

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   38 ++++++++++++++++++++++++++++++++------
 1 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 48bdc9f..7f54b75 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1724,6 +1724,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
 	pud_t *l3;
 	pmd_t *l2;
+	unsigned long addr[3];
+	unsigned long pt_base, pt_end;
+	unsigned i;
 
 	/* max_pfn_mapped is the last pfn mapped in the initial memory
 	 * mappings. Considering that on Xen after the kernel mappings we
@@ -1731,6 +1734,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * set max_pfn_mapped to the last real pfn mapped. */
 	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
+	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
+	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
+
 	/* Zap identity mapping */
 	init_level4_pgt[0] = __pgd(0);
 
@@ -1749,6 +1755,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
 
+	addr[0] = (unsigned long)pgd;
+	addr[1] = (unsigned long)l2;
+	addr[2] = (unsigned long)l3;
 	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
 	 * Both L4[272][0] and L4[511][511] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
@@ -1791,12 +1800,29 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	__xen_write_cr3(true, __pa(init_level4_pgt));
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
 
-	/* Offset by one page since the original pgd is going bye bye */
-	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
-			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
-	/* and also RW it so it can actually be used. */
-	set_page_prot(pgd, PAGE_KERNEL);
-	clear_page(pgd);
+	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
+	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
+	 * the initial domain. For guests using the toolstack, they are in:
+	 * [L4], [L3], [L2], [L1], [L1], order .. */
+	for (i = 0; i < ARRAY_SIZE(addr); i++) {
+		unsigned j;
+		/* No idea about the order the addr are in, so just do them twice. */
+		for (j = 0; j < ARRAY_SIZE(addr); j++) {
+			if (pt_base == PFN_DOWN(__pa(addr[j]))) {
+				set_page_prot((void *)addr[j], PAGE_KERNEL);
+				clear_page((void *)addr[j]);
+				pt_base++;
+
+			}
+			if (pt_end == PFN_DOWN(__pa(addr[j]))) {
+				set_page_prot((void *)addr[j], PAGE_KERNEL);
+				clear_page((void *)addr[j]);
+				pt_end--;
+			}
+		}
+	}
+	/* Our (by three pages) smaller Xen pagetable that we are using */
+	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2012-07-26 20:47 ` [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-27 11:18   ` [Xen-devel] " Stefano Stabellini
  2012-07-26 20:47 ` [PATCH 6/7] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and the virtual layout (using __START_KERNEL_map
addresses) look as follows:

  phys                             __ka
/------------\                   /-------------------\
| 0          | empty             | 0xffffffff80000000|
| ..         |                   | ..                |
| 16MB       | <= kernel starts  | 0xffffffff81000000|
| ..         |                   |                   |
| 30MB       | <= kernel ends => | 0xffffffff81e43000|
| ..         |  & ramdisk starts | ..                |
| 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
| ..         |  & P2M starts     | ..                |
| ..         |                   | ..                |
| 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
| ..         | start_info        | 0xffffffffa25c7000|
| ..         | xenstore          | 0xffffffffa25c8000|
| ..         | console           | 0xffffffffa25c9000|
| 549MB      | <= page tables => | 0xffffffffa25ca000|
| ..         |                   |                   |
| 550MB      | <= PGT end     => | 0xffffffffa26e1000|
| ..         | boot stack        |                   |
\------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables take up a fair
chunk of the __ka address space. That is a problem, since
MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
right in there! During bootup this makes it impossible to
load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:

 1) Make all P2M lookups use the __va address instead of the
    __ka address. This means we can safely erase from the
    __ka space the PMD entries that point to the PFNs of the
    P2M array and be OK.
 2) Allocate a new array, copy the existing P2M into it,
    revector the P2M tree to use that, and return the old
    P2M to the memory allocator. This has the advantage that
    it sets the stage for using the XEN_ELF_NOTE_INIT_P2M
    feature. That feature allows us to set the exact virtual
    address space we want for the P2M - and allows us to
    boot as the initial domain on large machines.

So we pick option 2).

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
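
For reference, a sketch of the three-level lookup that
xen_revector_p2m_tree() below walks (names as in arch/x86/xen/p2m.c;
each leaf page holds P2M_PER_PAGE == PAGE_SIZE / sizeof(unsigned long)
== 512 entries on 64-bit, i.e. covers 2MB of guest pseudo-physical
space). Illustrative only - the real get_phys_to_machine() also
handles the missing/identity special cases on top of this:

static unsigned long p2m_lookup_sketch(unsigned long pfn)
{
	unsigned topidx = p2m_top_index(pfn);
	unsigned mididx = p2m_mid_index(pfn);
	unsigned idx    = p2m_index(pfn);

	return p2m_top[topidx][mididx][idx];
}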

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h |    1 +
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 6a2bfa4..bbfd085 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -394,7 +394,77 @@ void __init xen_build_dynamic_phys_to_machine(void)
 	 * Xen provided pagetable). Do it later in xen_reserve_internals.
 	 */
 }
+#ifdef CONFIG_X86_64
+#include <linux/bootmem.h>
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	unsigned long va_start;
+	unsigned long va_end;
+	unsigned long pfn;
+	unsigned long *mfn_list = NULL;
+	unsigned long size;
+
+	va_start = xen_start_info->mfn_list;
+	/*We copy in increments of P2M_PER_PAGE * sizeof(unsigned long),
+	 * so make sure it is rounded up to that */
+	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+	va_end = va_start + size;
+
+	/* If we were revectored already, don't do it again. */
+	if (va_start <= __START_KERNEL_map && va_start >= __PAGE_OFFSET)
+		return 0;
+
+	mfn_list = alloc_bootmem_align(size, PAGE_SIZE);
+	if (!mfn_list) {
+		pr_warn("Could not allocate space for a new P2M tree!\n");
+		return xen_start_info->mfn_list;
+	}
+	/* Fill it out with INVALID_P2M_ENTRY value */
+	memset(mfn_list, 0xFF, size);
+
+	for (pfn = 0; pfn < ALIGN(MAX_DOMAIN_PAGES, P2M_PER_PAGE); pfn += P2M_PER_PAGE) {
+		unsigned topidx = p2m_top_index(pfn);
+		unsigned mididx;
+		unsigned long *mid_p;
+
+		if (!p2m_top[topidx])
+			continue;
+
+		if (p2m_top[topidx] == p2m_mid_missing)
+			continue;
+
+		mididx = p2m_mid_index(pfn);
+		mid_p = p2m_top[topidx][mididx];
+		if (!mid_p)
+			continue;
+		if ((mid_p == p2m_missing) || (mid_p == p2m_identity))
+			continue;
+
+		if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
+			continue;
+
+		/* The old va. Rebase it on mfn_list */
+		if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
+			unsigned long *new;
+
+			new = &mfn_list[pfn];
+
+			copy_page(new, mid_p);
+			p2m_top[topidx][mididx] = &mfn_list[pfn];
+			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
 
+		}
+		/* This should be the leafs allocated for identity from _brk. */
+	}
+	return (unsigned long)mfn_list;
+
+}
+#else
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	return 0;
+}
+#endif
 unsigned long get_phys_to_machine(unsigned long pfn)
 {
 	unsigned topidx, mididx, idx;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2230f57..bb5a810 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
 void __init xen_build_dynamic_phys_to_machine(void);
+unsigned long __init xen_revector_p2m_tree(void);
 
 void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/7] xen/mmu: Copy and revector the P2M tree.
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
                   ` (4 preceding siblings ...)
  2012-07-26 20:47 ` [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-26 20:47 ` [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Please first read the description in the "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.

The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
copies the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use the new __va addresses of the new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7f54b75..05e8492 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1183,9 +1183,64 @@ static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
 
 static void xen_post_allocator_init(void);
 
+#ifdef CONFIG_X86_64
+void __init xen_cleanhighmap(unsigned long vaddr, unsigned long vaddr_end)
+{
+	unsigned long kernel_end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
+	pmd_t *pmd = level2_kernel_pgt + pmd_index(vaddr);
+
+	/* NOTE: The loop is more greedy than the cleanup_highmap variant.
+	 * We include the PMD passed in on _both_ boundaries. */
+	for (; vaddr <= vaddr_end && (pmd < (level2_kernel_pgt + PAGE_SIZE));
+			pmd++, vaddr += PMD_SIZE) {
+		if (pmd_none(*pmd))
+			continue;
+		if (vaddr < (unsigned long) _text || vaddr > kernel_end)
+			set_pmd(pmd, __pmd(0));
+	}
+	/* In case we did something silly, we should crash in this function
+	 * instead of somewhere later and be confusing. */
+	xen_mc_flush();
+}
+#else
+void __init xen_cleanhighmap(unsigned long vaddr, unsigned long vaddr_end)
+{
+}
+#endif
 static void __init xen_pagetable_setup_done(pgd_t *base)
 {
+	unsigned long size;
+	unsigned long addr;
+
 	xen_setup_shared_info();
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		unsigned long new_mfn_list;
+
+		size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+
+		new_mfn_list = xen_revector_p2m_tree();
+
+		/* On 32-bit, we get zero so this never gets executed. */
+		if (new_mfn_list && new_mfn_list != xen_start_info->mfn_list) {
+			/* using __ka address! */
+			memset((void *)xen_start_info->mfn_list, 0, size);
+
+			/* We should be in __ka space. */
+			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
+			addr = xen_start_info->mfn_list;
+			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+			/* We roundup to the PMD, which means that if anybody at this stage is
+			 * using the __ka address of xen_start_info or xen_start_info->shared_info
+			 * they are going to crash. Fortunately we have already revectored
+			 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+			size = roundup(size, PMD_SIZE);
+			xen_cleanhighmap(addr, addr + size);
+
+			memblock_free(__pa(xen_start_info->mfn_list), size);
+			/* And revector! Bye bye old array */
+			xen_start_info->mfn_list = new_mfn_list;
+		}
+	}
 	xen_post_allocator_init();
 }
 
@@ -1823,6 +1878,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	}
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* Revector the xen_start_info */
+	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables.
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
                   ` (5 preceding siblings ...)
  2012-07-26 20:47 ` [PATCH 6/7] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
@ 2012-07-26 20:47 ` Konrad Rzeszutek Wilk
  2012-07-27 11:31   ` [Xen-devel] " Stefano Stabellini
  2012-07-27  7:34 ` [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Jan Beulich
  2012-07-27  7:34 ` [Xen-devel] " Jan Beulich
  8 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-26 20:47 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Please first read the description in the "xen/mmu: Copy and revector the
P2M tree" patch.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

The xen_revector_p2m_tree code and the code around it have ripped out
the __ka for the old P2M array.

Here we continue, doing the same to the region where the Xen
page-tables were. It is safe to do, as the page-tables are addressed
using __va. For good measure we delete anything from MODULES_VADDR
up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.
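
To put numbers on the round-up (using the 128GB example from patch
5/7): the boot log shows 0x117 (279) pagetable frames, i.e. roughly
1.1MB, so roundup(nr_pt_frames * PAGE_SIZE, PMD_SIZE) cleans a full
2MB PMD - which is also why the boot stack page right after the
pagetables loses its __ka mapping, as the comment in the hunk below
warns.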

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 05e8492..738feca 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1241,6 +1241,26 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			xen_start_info->mfn_list = new_mfn_list;
 		}
 	}
+#ifdef CONFIG_X86_64
+	/* At this stage, cleanup_highmap has already cleaned __ka space
+	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
+	 * the ramdisk). We continue on, erasing PMD entries that point to page
+	 * tables - do note that they are accessible at this stage via __va.
+	 * For good measure we also round up to the PMD - which means that if
+	 * anybody is using a __ka address for the initial boot-stack and
+	 * tries to use it, they are going to crash. The xen_start_info has
+	 * been taken care of already in xen_setup_kernel_pagetable. */
+	addr = xen_start_info->pt_base;
+	size = roundup(xen_start_info->nr_pt_frames * PAGE_SIZE, PMD_SIZE);
+
+	xen_cleanhighmap(addr, addr + size);
+	xen_start_info->pt_base = (unsigned long)__va(__pa(xen_start_info->pt_base));
+
+	/* This is superfluous and shouldn't be necessary, but you know what,
+	 * let's do it. The MODULES_VADDR -> MODULES_END range should be clear
+	 * of anything at this stage. */
+	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
+#endif
 	xen_post_allocator_init();
 }
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
  2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
                   ` (7 preceding siblings ...)
  2012-07-27  7:34 ` [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Jan Beulich
@ 2012-07-27  7:34 ` Jan Beulich
  2012-07-27 10:00   ` Ian Campbell
  2012-07-27 10:00   ` [Xen-devel] " Ian Campbell
  8 siblings, 2 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-27  7:34 UTC (permalink / raw)
  To: Ian Jackson, Konrad Rzeszutek Wilk; +Cc: xen-devel, linux-kernel

>>> On 26.07.12 at 22:47, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>  2). Allocate a new array, copy the existing P2M into it,
>     revector the P2M tree to use that, and return the old
>     P2M to the memory allocate. This has the advantage that
>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>     feature. That feature allows us to set the exact virtual
>     address space we want for the P2M - and allows us to
>     boot as initial domain on large machines.

And I would hope that the tools would get updated to recognize
this note too, so that huge DomU-s would become possible as
well.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 1/7] xen/mmu: use copy_page instead of memcpy.
  2012-07-26 20:47 ` [PATCH 1/7] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
  2012-07-27  7:35   ` Jan Beulich
@ 2012-07-27  7:35   ` Jan Beulich
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-27  7:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, linux-kernel

>>> On 26.07.12 at 22:47, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> After all, this is what it is there for.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

> ---
>  arch/x86/xen/mmu.c |   13 ++++++-------
>  1 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 6ba6100..7247e5a 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1754,14 +1754,14 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, 
> unsigned long max_pfn)
>  	 * it will be also modified in the __ka space! (But if you just
>  	 * modify the PMD table to point to other PTE's or none, then you
>  	 * are OK - which is what cleanup_highmap does) */
> -	memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
> +	copy_page(level2_ident_pgt, l2);
>  	/* Graft it onto L4[511][511] */
> -	memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
> +	copy_page(level2_kernel_pgt, l2);
>  
>  	/* Get [511][510] and graft that in level2_fixmap_pgt */
>  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
>  	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
> -	memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
> +	copy_page(level2_fixmap_pgt, l2);
>  	/* Note that we don't do anything with level1_fixmap_pgt which
>  	 * we don't need. */
>  
> @@ -1821,8 +1821,7 @@ static void __init xen_write_cr3_init(unsigned long 
> cr3)
>  	 */
>  	swapper_kernel_pmd =
>  		extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
> -	memcpy(swapper_kernel_pmd, initial_kernel_pmd,
> -	       sizeof(pmd_t) * PTRS_PER_PMD);
> +	copy_page(swapper_kernel_pmd, initial_kernel_pmd);
>  	swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
>  		__pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
>  	set_page_prot(swapper_kernel_pmd, PAGE_KERNEL_RO);
> @@ -1851,11 +1850,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, 
> unsigned long max_pfn)
>  				  512*1024);
>  
>  	kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd);
> -	memcpy(initial_kernel_pmd, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD);
> +	copy_page(initial_kernel_pmd, kernel_pmd);
>  
>  	xen_map_identity_early(initial_kernel_pmd, max_pfn);
>  
> -	memcpy(initial_page_table, pgd, sizeof(pgd_t) * PTRS_PER_PGD);
> +	copy_page(initial_page_table, pgd);
>  	initial_page_table[KERNEL_PGD_BOUNDARY] =
>  		__pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
>  



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
  2012-07-27  7:34 ` [Xen-devel] " Jan Beulich
  2012-07-27 10:00   ` Ian Campbell
@ 2012-07-27 10:00   ` Ian Campbell
  2012-07-27 10:17     ` Jan Beulich
  2012-07-27 10:17     ` Jan Beulich
  1 sibling, 2 replies; 37+ messages in thread
From: Ian Campbell @ 2012-07-27 10:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Konrad Rzeszutek Wilk, linux-kernel, xen-devel

On Fri, 2012-07-27 at 08:34 +0100, Jan Beulich wrote:
> >>> On 26.07.12 at 22:47, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >  2). Allocate a new array, copy the existing P2M into it,
> >     revector the P2M tree to use that, and return the old
> >     P2M to the memory allocate. This has the advantage that
> >     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
> >     feature. That feature allows us to set the exact virtual
> >     address space we want for the P2M - and allows us to
> >     boot as initial domain on large machines.
> 
> And I would hope that the tools would get updated to recognize
> this note too, so that huge DomU-s would become possible as
> well.

Does this help us with >160GB 32-bit PV guests too? I'm guessing not,
since the real limitation there is the relatively small amount of kernel
address space.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
  2012-07-27 10:00   ` [Xen-devel] " Ian Campbell
@ 2012-07-27 10:17     ` Jan Beulich
  2012-07-27 10:21       ` Ian Campbell
  2012-07-27 10:21       ` Ian Campbell
  2012-07-27 10:17     ` Jan Beulich
  1 sibling, 2 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-27 10:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 27.07.12 at 12:00, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2012-07-27 at 08:34 +0100, Jan Beulich wrote:
>> >>> On 26.07.12 at 22:47, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> >  2). Allocate a new array, copy the existing P2M into it,
>> >     revector the P2M tree to use that, and return the old
>> >     P2M to the memory allocate. This has the advantage that
>> >     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>> >     feature. That feature allows us to set the exact virtual
>> >     address space we want for the P2M - and allows us to
>> >     boot as initial domain on large machines.
>> 
>> And I would hope that the tools would get updated to recognize
>> this note too, so that huge DomU-s would become possible as
>> well.
> 
> Does this help us with >160GB 32 bit PV guests too? I'm guessing not
> since the real limitation there is the relatively small amount of kernel
> address space.

Correct - 32-bit PV guests are limited anyway (and it's for a
reason that the Dom0 support in the hypervisor only deals with
64-bit ones). And honestly, considering the huge page-information
table such a memory amount would require, I doubt this big a PV
guest would even boot (or if it does, be of any use).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
  2012-07-27 10:17     ` Jan Beulich
@ 2012-07-27 10:21       ` Ian Campbell
  2012-07-27 10:33         ` Jan Beulich
  2012-07-27 10:33         ` [Xen-devel] " Jan Beulich
  2012-07-27 10:21       ` Ian Campbell
  1 sibling, 2 replies; 37+ messages in thread
From: Ian Campbell @ 2012-07-27 10:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, xen-devel, Konrad Rzeszutek Wilk, linux-kernel

On Fri, 2012-07-27 at 11:17 +0100, Jan Beulich wrote:
> >>> On 27.07.12 at 12:00, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2012-07-27 at 08:34 +0100, Jan Beulich wrote:
> >> >>> On 26.07.12 at 22:47, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >> >  2). Allocate a new array, copy the existing P2M into it,
> >> >     revector the P2M tree to use that, and return the old
> >> >     P2M to the memory allocate. This has the advantage that
> >> >     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
> >> >     feature. That feature allows us to set the exact virtual
> >> >     address space we want for the P2M - and allows us to
> >> >     boot as initial domain on large machines.
> >> 
> >> And I would hope that the tools would get updated to recognize
> >> this note too, so that huge DomU-s would become possible as
> >> well.
> > 
> > Does this help us with >160GB 32 bit PV guests too? I'm guessing not
> > since the real limitation there is the relatively small amount of kernel
> > address space.
> 
> Correct - 32-bit PV guests are limited anyway (and it's for a
> reason the Dom0 support in the hypervisor only deals with
> 64-bit ones). And honestly, considering the huge page
> information table such a memory amount would require, I
> doubt this big a PV guest would even boot (or if it does, be
> of any use).

Right.

I was actually thinking of the issue with 32-bit PV guests accessing
MFN space > 160G, even if they are themselves small, which is a
separate concern.

Ian.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7
  2012-07-27 10:21       ` Ian Campbell
  2012-07-27 10:33         ` Jan Beulich
@ 2012-07-27 10:33         ` Jan Beulich
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-27 10:33 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 27.07.12 at 12:21, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> I was actually think of the issue with 32 bit PV guests accessing MFN
> space > 160G, even if they are themselves small, which is a separate
> concern.

That can be made to work if really needed, but not via the
mechanism we're talking about here. The question is whether
it's worth it.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-26 20:47 ` [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
@ 2012-07-27 11:18   ` Stefano Stabellini
  2012-07-27 11:47     ` Jan Beulich
  2012-07-27 11:47     ` Jan Beulich
  0 siblings, 2 replies; 37+ messages in thread
From: Stefano Stabellini @ 2012-07-27 11:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> During bootup Xen supplies us with a P2M array. It sticks
> it right after the ramdisk, as can be seen with a 128GB PV guest:
> 
> (certain parts removed for clarity):
> xc_dom_build_image: called
> xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
> xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
> xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
> xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
> xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
> xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
> nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
> nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
> nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
> nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
> xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
> xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
> xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
> xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000
> 
> So the physical memory and virtual (using __START_KERNEL_map addresses)
> layout looks as so:
> 
>   phys                             __ka
> /------------\                   /-------------------\
> | 0          | empty             | 0xffffffff80000000|
> | ..         |                   | ..                |
> | 16MB       | <= kernel starts  | 0xffffffff81000000|
> | ..         |                   |                   |
> | 30MB       | <= kernel ends => | 0xffffffff81e43000|
> | ..         |  & ramdisk starts | ..                |
> | 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
> | ..         |  & P2M starts     | ..                |
> | ..         |                   | ..                |
> | 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
> | ..         | start_info        | 0xffffffffa25c7000|
> | ..         | xenstore          | 0xffffffffa25c8000|
> | ..         | console           | 0xffffffffa25c9000|
> | 549MB      | <= page tables => | 0xffffffffa25ca000|
> | ..         |                   |                   |
> | 550MB      | <= PGT end     => | 0xffffffffa26e1000|
> | ..         | boot stack        |                   |
> \------------/                   \-------------------/
> 
> As can be seen, the ramdisk, P2M and pagetables take up
> a bit of the __ka address space, which is a problem since
> MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
> right in there! During bootup this results in the inability to
> load modules, with this error:
> 
> ------------[ cut here ]------------
> WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
> Call Trace:
>  [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
>  [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
>  [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
>  [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
>  [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
>  [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
>  [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
>  [<ffffffff810c6186>] ? load_module+0x66/0x19c0
>  [<ffffffff8105cadc>] module_alloc+0x5c/0x60
>  [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
>  [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
>  [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
>  [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
>  [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
>  [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
> ---[ end trace fd8f7704fdea0291 ]---
> vmalloc: allocation failure, allocated 16384 of 20480 bytes
> modprobe: page allocation failure: order:0, mode:0xd2
> 
> Since the __va and __ka are 1:1 up to MODULES_VADDR and
> cleanup_highmap rids __ka of the ramdisk mapping, what
> we want to do is similar - get rid of the P2M in the __ka
> address space. There are two ways of fixing this:
> 
>  1) All P2M lookups instead of using the __ka address would
>     use the __va address. This means we can safely erase from
>     __ka space the PMD pointers that point to the PFNs for
>     P2M array and be OK.
>  2). Allocate a new array, copy the existing P2M into it,
>     revector the P2M tree to use that, and return the old
>     P2M to the memory allocator. This has the advantage that
>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>     feature. That feature allows us to set the exact virtual
>     address space we want for the P2M - and allows us to
>     boot as initial domain on large machines.
> 
> So we pick option 2).

1) looks like a decent option that requires less code.
Is the problem with 1) that we might want to access the P2M before we
have __va addresses ready?



> This patch only lays the groundwork in the P2M code. The patch
> that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/p2m.c     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/xen/xen-ops.h |    1 +
>  2 files changed, 71 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
> index 6a2bfa4..bbfd085 100644
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -394,7 +394,77 @@ void __init xen_build_dynamic_phys_to_machine(void)
>  	 * Xen provided pagetable). Do it later in xen_reserve_internals.
>  	 */
>  }
> +#ifdef CONFIG_X86_64
> +#include <linux/bootmem.h>
> +unsigned long __init xen_revector_p2m_tree(void)
> +{
> +	unsigned long va_start;
> +	unsigned long va_end;
> +	unsigned long pfn;
> +	unsigned long *mfn_list = NULL;
> +	unsigned long size;
> +
> +	va_start = xen_start_info->mfn_list;
> +	/* We copy in increments of P2M_PER_PAGE * sizeof(unsigned long),
> +	 * so make sure it is rounded up to that */
> +	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
> +	va_end = va_start + size;
> +
> +	/* If we were revectored already, don't do it again. */
> +	if (va_start <= __START_KERNEL_map && va_start >= __PAGE_OFFSET)
> +		return 0;
> +
> +	mfn_list = alloc_bootmem_align(size, PAGE_SIZE);
> +	if (!mfn_list) {
> +		pr_warn("Could not allocate space for a new P2M tree!\n");
> +		return xen_start_info->mfn_list;
> +	}
> +	/* Fill it out with INVALID_P2M_ENTRY value */
> +	memset(mfn_list, 0xFF, size);
> +
> +	for (pfn = 0; pfn < ALIGN(MAX_DOMAIN_PAGES, P2M_PER_PAGE); pfn += P2M_PER_PAGE) {
> +		unsigned topidx = p2m_top_index(pfn);
> +		unsigned mididx;
> +		unsigned long *mid_p;
> +
> +		if (!p2m_top[topidx])
> +			continue;
> +
> +		if (p2m_top[topidx] == p2m_mid_missing)
> +			continue;
> +
> +		mididx = p2m_mid_index(pfn);
> +		mid_p = p2m_top[topidx][mididx];
> +		if (!mid_p)
> +			continue;
> +		if ((mid_p == p2m_missing) || (mid_p == p2m_identity))
> +			continue;
> +
> +		if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
> +			continue;
> +
> +		/* The old va. Rebase it on mfn_list */
> +		if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
> +			unsigned long *new;
> +
> +			new = &mfn_list[pfn];
> +
> +			copy_page(new, mid_p);
> +			p2m_top[topidx][mididx] = &mfn_list[pfn];
> +			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
>  
> +		}
> +		/* This should be the leafs allocated for identity from _brk. */
> +	}
> +	return (unsigned long)mfn_list;
> +
> +}
> +#else
> +unsigned long __init xen_revector_p2m_tree(void)
> +{
> +	return 0;
> +}
> +#endif
>  unsigned long get_phys_to_machine(unsigned long pfn)
>  {
>  	unsigned topidx, mididx, idx;
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index 2230f57..bb5a810 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -45,6 +45,7 @@ void xen_hvm_init_shared_info(void);
>  void xen_unplug_emulated_devices(void);
>  
>  void __init xen_build_dynamic_phys_to_machine(void);
> +unsigned long __init xen_revector_p2m_tree(void);
>  
>  void xen_init_irq_ops(void);
>  void xen_setup_timer(int cpu);
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread
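
[Aside: the companion patch 6/7, "xen/mmu: Copy and revector the P2M
tree.", consumes the return value of xen_revector_p2m_tree() roughly
along the following lines. This is a sketch pieced together from the
hunks quoted elsewhere in this thread, not the verbatim patch:

	unsigned long new_mfn_list;

	new_mfn_list = xen_revector_p2m_tree();
	/* 0 means nothing to do (32-bit, or already revectored); the old
	 * address comes back when the bootmem allocation failed. */
	if (new_mfn_list && new_mfn_list != xen_start_info->mfn_list) {
		/* ... rip the old P2M's __ka mapping out of the pagetables ... */
		xen_start_info->mfn_list = new_mfn_list;
	}
]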

* Re: [Xen-devel] [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables.
  2012-07-26 20:47 ` [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
@ 2012-07-27 11:31   ` Stefano Stabellini
  2012-07-27 17:42     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-07-27 11:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> Please first read the description in "xen/mmu: Copy and revector the
> P2M tree."
> 
> At this stage, the __ka address space (which is what the old
> P2M tree was using) is partially disassembled. The cleanup_highmap
> has removed the PMD entries from 0-16MB and anything past _brk_end
> up to the max_pfn_mapped (which is the end of the ramdisk).
> 
> The xen_remove_p2m_tree and the code around it have ripped out the __ka for
> the old P2M array.
> 
> Here we continue on doing it to where the Xen page-tables were.
> It is safe to do it, as the page-tables are addressed using __va.
> For good measure we delete anything that is within MODULES_VADDR
> and up to the end of the PMD.
> 
> At this point the __ka only contains PMD entries for the start
> of the kernel up to __brk.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |   20 ++++++++++++++++++++
>  1 files changed, 20 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 05e8492..738feca 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1241,6 +1241,26 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
>  			xen_start_info->mfn_list = new_mfn_list;
>  		}
>  	}
> +#ifdef CONFIG_X86_64
> +	/* At this stage, cleanup_highmap has already cleaned __ka space
> +	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
> +	 * the ramdisk). We continue on, erasing PMD entries that point to page
> +	 * tables - do note that they are accessible at this stage via __va.
> +	 * For good measure we also round up to the PMD - which means that if
> +	 * anybody is using a __ka address for the initial boot-stack - and tries
> +	 * to use it - they are going to crash. The xen_start_info has been
> +	 * taken care of already in xen_setup_kernel_pagetable. */
> +	addr = xen_start_info->pt_base;
> +	size = roundup(xen_start_info->nr_pt_frames * PAGE_SIZE, PMD_SIZE);
> +
> +	xen_cleanhighmap(addr, addr + size);
> +	xen_start_info->pt_base = (unsigned long)__va(__pa(xen_start_info->pt_base));
> +
> +	/* This is superfluous and shouldn't be necessary, but you know what,
> +	 * let's do it. The MODULES_VADDR -> MODULES_END should be clear of
> +	 * anything at this stage. */
> +	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);

I would stick an #ifdef CONFIG_DEBUG of some kind around it


> +#endif
>  	xen_post_allocator_init();
>  }
>  
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back.
  2012-07-26 20:47 ` [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back Konrad Rzeszutek Wilk
@ 2012-07-27 11:37   ` Stefano Stabellini
  2012-07-27 17:35       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-07-27 11:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> Since we are not using it and somebody else could use it.

makes sense, except it is almost entirely rewritten by the following
patch...

> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |   13 +++++++------
>  1 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index a59070b..48bdc9f 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1782,20 +1782,21 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	/* Unpin Xen-provided one */
>  	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
>  
> -	/* Switch over */
> -	pgd = init_level4_pgt;
> -
>  	/*
>  	 * At this stage there can be no user pgd, and no page
>  	 * structure to attach it to, so make sure we just set kernel
>  	 * pgd.
>  	 */
>  	xen_mc_batch();
> -	__xen_write_cr3(true, __pa(pgd));
> +	__xen_write_cr3(true, __pa(init_level4_pgt));
>  	xen_mc_issue(PARAVIRT_LAZY_CPU);
>  
> -	memblock_reserve(__pa(xen_start_info->pt_base),
> -			 xen_start_info->nr_pt_frames * PAGE_SIZE);
> +	/* Offset by one page since the original pgd is going bye bye */
> +	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> +			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> +	/* and also RW it so it can actually be used. */
> +	set_page_prot(pgd, PAGE_KERNEL);
> +	clear_page(pgd);
>  }
>  #else	/* !CONFIG_X86_64 */
>  static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-07-26 20:47 ` [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
@ 2012-07-27 11:45   ` Stefano Stabellini
  2012-07-27 17:38       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Stefano Stabellini @ 2012-07-27 11:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> As we are not using them. We end up only using the L1 pagetables
> and grafting those to our page-tables.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |   38 ++++++++++++++++++++++++++++++++------
>  1 files changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 48bdc9f..7f54b75 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1724,6 +1724,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  {
>  	pud_t *l3;
>  	pmd_t *l2;
> +	unsigned long addr[3];
> +	unsigned long pt_base, pt_end;
> +	unsigned i;
>  
>  	/* max_pfn_mapped is the last pfn mapped in the initial memory
>  	 * mappings. Considering that on Xen after the kernel mappings we
> @@ -1731,6 +1734,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	 * set max_pfn_mapped to the last real pfn mapped. */
>  	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
>  
> +	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
> +	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
> +
>  	/* Zap identity mapping */
>  	init_level4_pgt[0] = __pgd(0);
>  
> @@ -1749,6 +1755,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
>  	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
>  
> +	addr[0] = (unsigned long)pgd;
> +	addr[1] = (unsigned long)l2;
> +	addr[2] = (unsigned long)l3;
>  	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
>  	 * Both L4[272][0] and L4[511][511] have entries that point to the same
>  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
> @@ -1791,12 +1800,29 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	__xen_write_cr3(true, __pa(init_level4_pgt));
>  	xen_mc_issue(PARAVIRT_LAZY_CPU);
>  
> -	/* Offset by one page since the original pgd is going bye bye */
> -	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> -			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> -	/* and also RW it so it can actually be used. */
> -	set_page_prot(pgd, PAGE_KERNEL);
> -	clear_page(pgd);
> +	/* We can't rip out the L3 and L2 that easily, as the Xen pagetables
> +	 * are laid out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
> +	 * the initial domain. For guests using the toolstack, they are in
> +	 * [L4], [L3], [L2], [L1], [L1], ... order. */
> +	for (i = 0; i < ARRAY_SIZE(addr); i++) {
> +		unsigned j;
> +		/* No idea about the order the addr are in, so just do them twice. */
> +		for (j = 0; j < ARRAY_SIZE(addr); j++) {

I don't think I understand this double loop.
Shouldn't we be looping on pt_base or pt_end?


> +			if (pt_base == PFN_DOWN(__pa(addr[j]))) {
> +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> +				clear_page((void *)addr[j]);
> +				pt_base++;
> +
> +			}
> +			if (pt_end == PFN_DOWN(__pa(addr[j]))) {
> +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> +				clear_page((void *)addr[j]);
> +				pt_end--;
> +			}
> +		}
> +	}
> +	/* Our Xen pagetables, now smaller by three pages, that we are still using */
> +	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-27 11:18   ` [Xen-devel] " Stefano Stabellini
@ 2012-07-27 11:47     ` Jan Beulich
  2012-07-27 17:34         ` Konrad Rzeszutek Wilk
  2012-07-27 11:47     ` Jan Beulich
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2012-07-27 11:47 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
>>  1) All P2M lookups instead of using the __ka address would
>>     use the __va address. This means we can safely erase from
>>     __ka space the PMD pointers that point to the PFNs for
>>     P2M array and be OK.
>>  2). Allocate a new array, copy the existing P2M into it,
>>     revector the P2M tree to use that, and return the old
>>     P2M to the memory allocator. This has the advantage that
>>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>>     feature. That feature allows us to set the exact virtual
>>     address space we want for the P2M - and allows us to
>>     boot as initial domain on large machines.
>> 
>> So we pick option 2).
> 
> 1) looks like a decent option that requires less code.
> Is the problem with 1) that we might want to access the P2M before we
> have __va addresses ready?

AIUI 1) has no easy way of subsequently accommodating support
for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
want/need to reclaim the originally used VA space - if nothing else,
then for forward compatibility with the rest of the kernel).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-27 11:18   ` [Xen-devel] " Stefano Stabellini
  2012-07-27 11:47     ` Jan Beulich
@ 2012-07-27 11:47     ` Jan Beulich
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-27 11:47 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Konrad Rzeszutek Wilk, linux-kernel, xen-devel

>>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
>>  1) All P2M lookups instead of using the __ka address would
>>     use the __va address. This means we can safely erase from
>>     __ka space the PMD pointers that point to the PFNs for
>>     P2M array and be OK.
>>  2). Allocate a new array, copy the existing P2M into it,
>>     revector the P2M tree to use that, and return the old
>>     P2M to the memory allocator. This has the advantage that
>>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>>     feature. That feature allows us to set the exact virtual
>>     address space we want for the P2M - and allows us to
>>     boot as initial domain on large machines.
>> 
>> So we pick option 2).
> 
> 1) looks like a decent option that requires less code.
> Is the problem with 1) that we might want to access the P2M before we
> have __va addresses ready?

AIUI 1) has no easy way of subsequently accommodating support
for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
want/need to reclaim the originally used VA space - if nothing else,
then for forward compatibility with the rest of the kernel).

Jan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-27 11:47     ` Jan Beulich
@ 2012-07-27 17:34         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, linux-kernel, xen-devel

On Fri, Jul 27, 2012 at 12:47:47PM +0100, Jan Beulich wrote:
> >>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> >>  1) All P2M lookups instead of using the __ka address would
> >>     use the __va address. This means we can safely erase from
> >>     __ka space the PMD pointers that point to the PFNs for
> >>     P2M array and be OK.
> >>  2). Allocate a new array, copy the existing P2M into it,
> >>     revector the P2M tree to use that, and return the old
> >>     P2M to the memory allocator. This has the advantage that
> >>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
> >>     feature. That feature allows us to set the exact virtual
> >>     address space we want for the P2M - and allows us to
> >>     boot as initial domain on large machines.
> >> 
> >> So we pick option 2).
> > 
> > 1) looks like a decent option that requires less code.
> > Is the problem with 1) that we might want to access the P2M before we
> > have __va addresses ready?
> 
> AIUI 1) has no easy way of subsequently accommodating support
> for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
> want/need to reclaim the originally used VA space - if nothing else,
> then for forward compatibility with the rest of the kernel).

<nods> That was my thinking - this way we can boot dom0 (since
the hypervisor is the only one that implements
XEN_ELF_NOTE_INIT_P2M) with a large amount of memory. Granted, booting
with more than 500GB would require adding another layer to the P2M
tree. But somehow I thought that we were limited in the hypervisor
to 500GB?
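
[Aside: the ~500GB figure matches the geometry of a three-level P2M
tree, assuming 4KiB pages and 8-byte entries:

	P2M_PER_PAGE = PAGE_SIZE / sizeof(unsigned long) = 4096 / 8 = 512
	3 levels     : 512 * 512 * 512 pfns = 2^27 pfns
	2^27 pfns * 4KiB/page = 512GiB

so anything beyond that needs either another tree level or a
differently shaped top level.]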

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
@ 2012-07-27 17:34         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Konrad Rzeszutek Wilk, xen-devel, linux-kernel, Stefano Stabellini

On Fri, Jul 27, 2012 at 12:47:47PM +0100, Jan Beulich wrote:
> >>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> >>  1) All P2M lookups instead of using the __ka address would
> >>     use the __va address. This means we can safely erase from
> >>     __ka space the PMD pointers that point to the PFNs for
> >>     P2M array and be OK.
> >>  2). Allocate a new array, copy the existing P2M into it,
> >>     revector the P2M tree to use that, and return the old
> >>     P2M to the memory allocator. This has the advantage that
> >>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
> >>     feature. That feature allows us to set the exact virtual
> >>     address space we want for the P2M - and allows us to
> >>     boot as initial domain on large machines.
> >> 
> >> So we pick option 2).
> > 
> > 1) looks like a decent option that requires less code.
> > Is the problem with 1) that we might want to access the P2M before we
> > have __va addresses ready?
> 
> AIUI 1) has no easy way of subsequently accommodating support
> for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
> want/need to reclaim the originally used VA space - if nothing else,
> then for forward compatibility with the rest of the kernel).

<nods> That was my thinking - this way we can boot dom0 (since
the hypervisor is the only one that implements
XEN_ELF_NOTE_INIT_P2M) with a large amount of memory. Granted, booting
with more than 500GB would require adding another layer to the P2M
tree. But somehow I thought that we were limited in the hypervisor
to 500GB?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back.
  2012-07-27 11:37   ` [Xen-devel] " Stefano Stabellini
@ 2012-07-27 17:35       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:35 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Fri, Jul 27, 2012 at 12:37:24PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> > Since we are not using it and somebody else could use it.
> 
> makes sense, except it is almost entirely rewritten by the following
> patch...

Yeah, I should squash them.
> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   13 +++++++------
> >  1 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index a59070b..48bdc9f 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1782,20 +1782,21 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	/* Unpin Xen-provided one */
> >  	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
> >  
> > -	/* Switch over */
> > -	pgd = init_level4_pgt;
> > -
> >  	/*
> >  	 * At this stage there can be no user pgd, and no page
> >  	 * structure to attach it to, so make sure we just set kernel
> >  	 * pgd.
> >  	 */
> >  	xen_mc_batch();
> > -	__xen_write_cr3(true, __pa(pgd));
> > +	__xen_write_cr3(true, __pa(init_level4_pgt));
> >  	xen_mc_issue(PARAVIRT_LAZY_CPU);
> >  
> > -	memblock_reserve(__pa(xen_start_info->pt_base),
> > -			 xen_start_info->nr_pt_frames * PAGE_SIZE);
> > +	/* Offset by one page since the original pgd is going bye bye */
> > +	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> > +			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> > +	/* and also RW it so it can actually be used. */
> > +	set_page_prot(pgd, PAGE_KERNEL);
> > +	clear_page(pgd);
> >  }
> >  #else	/* !CONFIG_X86_64 */
> >  static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back.
@ 2012-07-27 17:35       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:35 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, linux-kernel, Konrad Rzeszutek Wilk

On Fri, Jul 27, 2012 at 12:37:24PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> > Since we are not using it and somebody else could use it.
> 
> makes sense, except it is almost entirely rewritten by the following
> patch...

Yeah, I should squash them.
> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   13 +++++++------
> >  1 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index a59070b..48bdc9f 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1782,20 +1782,21 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	/* Unpin Xen-provided one */
> >  	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
> >  
> > -	/* Switch over */
> > -	pgd = init_level4_pgt;
> > -
> >  	/*
> >  	 * At this stage there can be no user pgd, and no page
> >  	 * structure to attach it to, so make sure we just set kernel
> >  	 * pgd.
> >  	 */
> >  	xen_mc_batch();
> > -	__xen_write_cr3(true, __pa(pgd));
> > +	__xen_write_cr3(true, __pa(init_level4_pgt));
> >  	xen_mc_issue(PARAVIRT_LAZY_CPU);
> >  
> > -	memblock_reserve(__pa(xen_start_info->pt_base),
> > -			 xen_start_info->nr_pt_frames * PAGE_SIZE);
> > +	/* Offset by one page since the original pgd is going bye bye */
> > +	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> > +			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> > +	/* and also RW it so it can actually be used. */
> > +	set_page_prot(pgd, PAGE_KERNEL);
> > +	clear_page(pgd);
> >  }
> >  #else	/* !CONFIG_X86_64 */
> >  static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-07-27 11:45   ` [Xen-devel] " Stefano Stabellini
@ 2012-07-27 17:38       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:38 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Fri, Jul 27, 2012 at 12:45:38PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> > As we are not using them. We end up only using the L1 pagetables
> > and grafting those to our page-tables.
> > 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   38 ++++++++++++++++++++++++++++++++------
> >  1 files changed, 32 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index 48bdc9f..7f54b75 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1724,6 +1724,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  {
> >  	pud_t *l3;
> >  	pmd_t *l2;
> > +	unsigned long addr[3];
> > +	unsigned long pt_base, pt_end;
> > +	unsigned i;
> >  
> >  	/* max_pfn_mapped is the last pfn mapped in the initial memory
> >  	 * mappings. Considering that on Xen after the kernel mappings we
> > @@ -1731,6 +1734,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	 * set max_pfn_mapped to the last real pfn mapped. */
> >  	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
> >  
> > +	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
> > +	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
> > +
> >  	/* Zap identity mapping */
> >  	init_level4_pgt[0] = __pgd(0);
> >  
> > @@ -1749,6 +1755,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
> >  	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
> >  
> > +	addr[0] = (unsigned long)pgd;
> > +	addr[1] = (unsigned long)l2;
> > +	addr[2] = (unsigned long)l3;
> >  	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
> >  	 * Both L4[272][0] and L4[511][511] have entries that point to the same
> >  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
> > @@ -1791,12 +1800,29 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	__xen_write_cr3(true, __pa(init_level4_pgt));
> >  	xen_mc_issue(PARAVIRT_LAZY_CPU);
> >  
> > -	/* Offset by one page since the original pgd is going bye bye */
> > -	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> > -			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> > -	/* and also RW it so it can actually be used. */
> > -	set_page_prot(pgd, PAGE_KERNEL);
> > -	clear_page(pgd);
> > +	/* We can't rip out the L3 and L2 that easily, as the Xen pagetables
> > +	 * are laid out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
> > +	 * the initial domain. For guests using the toolstack, they are in
> > +	 * [L4], [L3], [L2], [L1], [L1], ... order. */
> > +	for (i = 0; i < ARRAY_SIZE(addr); i++) {
> > +		unsigned j;
> > +		/* No idea about the order the addr are in, so just do them twice. */
> > +		for (j = 0; j < ARRAY_SIZE(addr); j++) {
> 
> I don't think I understand this double loop.

So with the Xen toolstack, the order is L4, L3, L2, L1s.., and with
the hypervisor it is L4, L1,... but in the future the order might
(potentially?) be L1, L1, ..., L1, L2, L3, L4 - so this double loop
walks the addresses twice to catch that case should we ever
get it.

> Shouldn't we be looping on pt_base or pt_end?

So two loops - and it could be put in a separate function. That
would make this easier to read. Yeah, let me do it that way.
Thanks!
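
[Aside: a minimal sketch of what such a separate helper could look
like; the name, signature and placement are hypothetical here,
reconstructed from the loop body quoted above rather than taken from a
posted patch:

	/* If @addr is the frame at either end of the not-yet-reserved
	 * pagetable range, hand it back: make it RW, scrub it, and
	 * shrink the range by one page. */
	static void __init check_pt_base(unsigned long *pt_base,
					 unsigned long *pt_end,
					 unsigned long addr)
	{
		if (*pt_base == PFN_DOWN(__pa(addr))) {
			set_page_prot((void *)addr, PAGE_KERNEL);
			clear_page((void *)addr);
			(*pt_base)++;
		}
		if (*pt_end == PFN_DOWN(__pa(addr))) {
			set_page_prot((void *)addr, PAGE_KERNEL);
			clear_page((void *)addr);
			(*pt_end)--;
		}
	}

The caller then becomes a single pass over addr[] followed by the
memblock_reserve() of whatever range remains:

	for (i = 0; i < ARRAY_SIZE(addr); i++)
		check_pt_base(&pt_base, &pt_end, addr[i]);
	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
]
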
> 
> 
> > +			if (pt_base == PFN_DOWN(__pa(addr[j]))) {
> > +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> > +				clear_page((void *)addr[j]);
> > +				pt_base++;
> > +
> > +			}
> > +			if (pt_end == PFN_DOWN(__pa(addr[j]))) {
> > +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> > +				clear_page((void *)addr[j]);
> > +				pt_end--;
> > +			}
> > +		}
> > +	}
> > +	/* Our Xen pagetables, now smaller by three pages, that we are still using */
> > +	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
@ 2012-07-27 17:38       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:38 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, linux-kernel, Konrad Rzeszutek Wilk

On Fri, Jul 27, 2012 at 12:45:38PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> > As we are not using them. We end up only using the L1 pagetables
> > and grafting those to our page-tables.
> > 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   38 ++++++++++++++++++++++++++++++++------
> >  1 files changed, 32 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index 48bdc9f..7f54b75 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1724,6 +1724,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  {
> >  	pud_t *l3;
> >  	pmd_t *l2;
> > +	unsigned long addr[3];
> > +	unsigned long pt_base, pt_end;
> > +	unsigned i;
> >  
> >  	/* max_pfn_mapped is the last pfn mapped in the initial memory
> >  	 * mappings. Considering that on Xen after the kernel mappings we
> > @@ -1731,6 +1734,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	 * set max_pfn_mapped to the last real pfn mapped. */
> >  	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
> >  
> > +	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
> > +	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
> > +
> >  	/* Zap identity mapping */
> >  	init_level4_pgt[0] = __pgd(0);
> >  
> > @@ -1749,6 +1755,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
> >  	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
> >  
> > +	addr[0] = (unsigned long)pgd;
> > +	addr[1] = (unsigned long)l2;
> > +	addr[2] = (unsigned long)l3;
> >  	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
> >  	 * Both L4[272][0] and L4[511][511] have entries that point to the same
> >  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
> > @@ -1791,12 +1800,29 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	__xen_write_cr3(true, __pa(init_level4_pgt));
> >  	xen_mc_issue(PARAVIRT_LAZY_CPU);
> >  
> > -	/* Offset by one page since the original pgd is going bye bye */
> > -	memblock_reserve(__pa(xen_start_info->pt_base + PAGE_SIZE),
> > -			 (xen_start_info->nr_pt_frames * PAGE_SIZE) - PAGE_SIZE);
> > -	/* and also RW it so it can actually be used. */
> > -	set_page_prot(pgd, PAGE_KERNEL);
> > -	clear_page(pgd);
> > +	/* We can't rip out the L3 and L2 that easily, as the Xen pagetables
> > +	 * are laid out this way: [L4], [L1], [L2], [L3], [L1], [L1] ... for
> > +	 * the initial domain. For guests using the toolstack, they are in
> > +	 * [L4], [L3], [L2], [L1], [L1], ... order. */
> > +	for (i = 0; i < ARRAY_SIZE(addr); i++) {
> > +		unsigned j;
> > +		/* No idea about the order the addr are in, so just do them twice. */
> > +		for (j = 0; j < ARRAY_SIZE(addr); j++) {
> 
> I don't think I understand this double loop.

So with the Xen toolstack, the order is L4, L3, L2, L1s.., and with
the hypervisor it is L4, L1,... but in the future the order might
(potentially?) be L1, L1, ..., L1, L2, L3, L4 - so this double loop
walks the addresses twice to catch that case should we ever
get it.

> Shouldn't we be looping on pt_base or pt_end?

So two loops - and it could be put in a separate function. That
would make this easier to read. Yeah, let me do it that way.
Thanks!
> 
> 
> > +			if (pt_base == PFN_DOWN(__pa(addr[j]))) {
> > +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> > +				clear_page((void *)addr[j]);
> > +				pt_base++;
> > +
> > +			}
> > +			if (pt_end == PFN_DOWN(__pa(addr[j]))) {
> > +				set_page_prot((void *)addr[j], PAGE_KERNEL);
> > +				clear_page((void *)addr[j]);
> > +				pt_end--;
> > +			}
> > +		}
> > +	}
> > +	/* Our Xen pagetables, now smaller by three pages, that we are still using */
> > +	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables.
  2012-07-27 11:31   ` [Xen-devel] " Stefano Stabellini
@ 2012-07-27 17:42     ` Konrad Rzeszutek Wilk
  2012-07-31 14:37       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-27 17:42 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Fri, Jul 27, 2012 at 12:31:17PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
> > Please first read the description in "xen/mmu: Copy and revector the
> > P2M tree."
> > 
> > At this stage, the __ka address space (which is what the old
> > P2M tree was using) is partially disassembled. The cleanup_highmap
> > has removed the PMD entries from 0-16MB and anything past _brk_end
> > up to the max_pfn_mapped (which is the end of the ramdisk).
> > 
> > The xen_remove_p2m_tree and the code around it have ripped out the __ka for
> > the old P2M array.
> > 
> > Here we continue on doing it to where the Xen page-tables were.
> > It is safe to do it, as the page-tables are addressed using __va.
> > For good measure we delete anything that is within MODULES_VADDR
> > and up to the end of the PMD.
> > 
> > At this point the __ka only contains PMD entries for the start
> > of the kernel up to __brk.
> > 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   20 ++++++++++++++++++++
> >  1 files changed, 20 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index 05e8492..738feca 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1241,6 +1241,26 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
> >  			xen_start_info->mfn_list = new_mfn_list;
> >  		}
> >  	}
> > +#ifdef CONFIG_X86_64
> > +	/* At this stage, cleanup_highmap has already cleaned __ka space
> > +	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
> > +	 * the ramdisk). We continue on, erasing PMD entries that point to page
> > +	 * tables - do note that they are accessible at this stage via __va.
> > +	 * For good measure we also round up to the PMD - which means that if
> > +	 * anybody is using a __ka address for the initial boot-stack - and tries
> > +	 * to use it - they are going to crash. The xen_start_info has been
> > +	 * taken care of already in xen_setup_kernel_pagetable. */
> > +	addr = xen_start_info->pt_base;
> > +	size = roundup(xen_start_info->nr_pt_frames * PAGE_SIZE, PMD_SIZE);
> > +
> > +	xen_cleanhighmap(addr, addr + size);
> > +	xen_start_info->pt_base = (unsigned long)__va(__pa(xen_start_info->pt_base));
> > +
> > +	/* This is superfluous and shouldn't be necessary, but you know what,
> > +	 * let's do it. The MODULES_VADDR -> MODULES_END should be clear of
> > +	 * anything at this stage. */
> > +	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
> 
> I would stick an #ifdef CONFIG_DEBUG of some kind around it

I am not really sure why, but we seem to have PMDs filled in after the Xen
pagetables. I thought it was the bootstack, but it just looked like we
were filling up to the next PMD (so the 'roundup' right above this code
should take care of that). But let me double check that - to reproduce
this module loading problem I hacked the hypervisor to create a huge P2M
array and I might not have seen this issue when I was doing a proper bootup
of a PV guest with 220GB.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-27 17:34         ` Konrad Rzeszutek Wilk
  (?)
  (?)
@ 2012-07-30  7:10         ` Jan Beulich
  -1 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-30  7:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, xen-devel, Konrad Rzeszutek Wilk, linux-kernel

>>> On 27.07.12 at 19:34, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> On Fri, Jul 27, 2012 at 12:47:47PM +0100, Jan Beulich wrote:
>> >>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>> > On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
>> >>  1) All P2M lookups instead of using the __ka address would
>> >>     use the __va address. This means we can safely erase from
>> >>     __ka space the PMD pointers that point to the PFNs for
>> >>     P2M array and be OK.
>> >>  2). Allocate a new array, copy the existing P2M into it,
>> >>     revector the P2M tree to use that, and return the old
>> >>     P2M to the memory allocator. This has the advantage that
>> >>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>> >>     feature. That feature allows us to set the exact virtual
>> >>     address space we want for the P2M - and allows us to
>> >>     boot as initial domain on large machines.
>> >> 
>> >> So we pick option 2).
>> > 
>> > 1) looks like a decent option that requires less code.
>> > Is the problem with 1) that we might want to access the P2M before we
>> > have __va addresses ready?
>> 
>> AIUI 1) has no easy way of subsequently accommodating support
>> for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
>> want/need to reclaim the originally used VA space - if nothing else,
>> then for forward compatibility with the rest of the kernel).
> 
> <nods> That was my thinking - this way we can boot dom0 (since
> the hypervisor is the only one that implements
> XEN_ELF_NOTE_INIT_P2M) with a large amount of memory. Granted, booting
> with more than 500GB would require adding another layer to the P2M
> tree. But somehow I thought that we were limited in the hypervisor
> to 500GB?

The only limitation is that kexec (with the current specification)
would not work beyond 512GB, but that's a non-issue for
upstream since kexec doesn't work there yet anyway. Our
kernels come up fine even on 5TB now (which is the current
limit in the hypervisor).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-07-27 17:34         ` Konrad Rzeszutek Wilk
  (?)
@ 2012-07-30  7:10         ` Jan Beulich
  -1 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2012-07-30  7:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, linux-kernel, Konrad Rzeszutek Wilk, Stefano Stabellini

>>> On 27.07.12 at 19:34, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> On Fri, Jul 27, 2012 at 12:47:47PM +0100, Jan Beulich wrote:
>> >>> On 27.07.12 at 13:18, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>> > On Thu, 26 Jul 2012, Konrad Rzeszutek Wilk wrote:
>> >>  1) All P2M lookups instead of using the __ka address would
>> >>     use the __va address. This means we can safely erase from
>> >>     __ka space the PMD pointers that point to the PFNs for
>> >>     P2M array and be OK.
>> >>  2). Allocate a new array, copy the existing P2M into it,
>> >>     revector the P2M tree to use that, and return the old
>> >>     P2M to the memory allocator. This has the advantage that
>> >>     it sets the stage for using XEN_ELF_NOTE_INIT_P2M
>> >>     feature. That feature allows us to set the exact virtual
>> >>     address space we want for the P2M - and allows us to
>> >>     boot as initial domain on large machines.
>> >> 
>> >> So we pick option 2).
>> > 
>> > 1) looks like a decent option that requires less code.
>> > Is the problem with 1) that we might want to access the P2M before we
>> > have __va addresses ready?
>> 
>> AIUI 1) has no easy way of subsequently accommodating support
>> for XEN_ELF_NOTE_INIT_P2M (where you almost definitely will
>> want/need to reclaim the originally used VA space - if nothing else,
>> then for forward compatibility with the rest of the kernel).
> 
> <nods> That was my thinking - this way we can boot dom0 (since
> the hypervisor is the only one that implements
> XEN_ELF_NOTE_INIT_P2M) with a large amount of memory. Granted, booting
> with more than 500GB would require adding another layer to the P2M
> tree. But somehow I thought that we were limited in the hypervisor
> to 500GB?

The only limitation is that kexec (with the current specification)
would not work beyond 512GB, but that's a non-issue for
upstream since kexec doesn't work there yet anyway. Our
kernels come up fine even on 5TB now (which is the current
limit in the hypervisor).

Jan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables.
  2012-07-27 17:42     ` Konrad Rzeszutek Wilk
@ 2012-07-31 14:37       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-31 14:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Stefano Stabellini, xen-devel, linux-kernel

> > > +	/* This is superfluous and shouldn't be necessary, but you know what,
> > > +	 * let's do it. The MODULES_VADDR -> MODULES_END should be clear of
> > > +	 * anything at this stage. */
> > > +	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
> > 
> > I would stick an #ifdef CONFIG_DEBUG of some kind around it
> 
> I am not really sure why, but we seem to have PMDs filled in after the Xen
> pagetables. I thought it was the bootstack, but it just looked like we
> were filling up to the next PMD (so the 'roundup' right above this code
> should take care of that). But let me double check that - to reproduce
> this module loading problem I hacked the hypervisor to create a huge P2M

I am not seeing this anymore, so #ifdef DEBUG it is!
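
[Aside: in code terms that presumably comes out roughly as the sketch
below; whether DEBUG ends up as a local define or a Kconfig option is
left open in the thread:

#ifdef DEBUG
	/* MODULES_VADDR -> MODULES_END should already be clear of anything
	 * at this stage; scrub it anyway so a stray __ka user faults early. */
	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
#endif
]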

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Xen-devel] [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-07-27 17:38       ` Konrad Rzeszutek Wilk
  (?)
@ 2012-07-31 14:39       ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-31 14:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Stefano Stabellini, xen-devel, linux-kernel

> > > +	for (i = 0; i < ARRAY_SIZE(addr); i++) {
> > > +		unsigned j;
> > > +		/* No idea about the order the addr are in, so just do them twice. */
> > > +		for (j = 0; j < ARRAY_SIZE(addr); j++) {
> > 
> > I don't think I understand this double loop.
> 
> So with Xen toolstack, the order is L4, L3, L2, L1s.. and with
> the hypervisor it is L4, L1,... but in the future the order might
> be L1, L1 ..., L1, L2, L3, L4 (potentially?) so this double loop
> will loop around the addresses twice to catch this in case we get
> it like this.

Which we would only get if the toolstack ever decided to put those
pages in L4, L2, L3 order. Since the toolstack puts them in L4, L3, L2
order and the hypervisor puts them in L4, L1, L3, L2 order, we might
as well just simplify this and not do the extra loop. Posting patches
shortly with this.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread

Thread overview: 37+ messages
2012-07-26 20:47 [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Konrad Rzeszutek Wilk
2012-07-26 20:47 ` [PATCH 1/7] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
2012-07-27  7:35   ` Jan Beulich
2012-07-27  7:35   ` [Xen-devel] " Jan Beulich
2012-07-26 20:47 ` [PATCH 2/7] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
2012-07-26 20:47 ` [PATCH 3/7] xen/mmu: Release the Xen provided L4 (PGD) back Konrad Rzeszutek Wilk
2012-07-27 11:37   ` [Xen-devel] " Stefano Stabellini
2012-07-27 17:35     ` Konrad Rzeszutek Wilk
2012-07-27 17:35       ` Konrad Rzeszutek Wilk
2012-07-26 20:47 ` [PATCH 4/7] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
2012-07-27 11:45   ` [Xen-devel] " Stefano Stabellini
2012-07-27 17:38     ` Konrad Rzeszutek Wilk
2012-07-27 17:38       ` Konrad Rzeszutek Wilk
2012-07-31 14:39       ` [Xen-devel] " Konrad Rzeszutek Wilk
2012-07-26 20:47 ` [PATCH 5/7] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
2012-07-27 11:18   ` [Xen-devel] " Stefano Stabellini
2012-07-27 11:47     ` Jan Beulich
2012-07-27 17:34       ` Konrad Rzeszutek Wilk
2012-07-27 17:34         ` Konrad Rzeszutek Wilk
2012-07-30  7:10         ` Jan Beulich
2012-07-30  7:10         ` [Xen-devel] " Jan Beulich
2012-07-27 11:47     ` Jan Beulich
2012-07-26 20:47 ` [PATCH 6/7] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
2012-07-26 20:47 ` [PATCH 7/7] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
2012-07-27 11:31   ` [Xen-devel] " Stefano Stabellini
2012-07-27 17:42     ` Konrad Rzeszutek Wilk
2012-07-31 14:37       ` Konrad Rzeszutek Wilk
2012-07-27  7:34 ` [RFC PATCH] Boot PV guests with more than 128GB (v1) for 3.7 Jan Beulich
2012-07-27  7:34 ` [Xen-devel] " Jan Beulich
2012-07-27 10:00   ` Ian Campbell
2012-07-27 10:00   ` [Xen-devel] " Ian Campbell
2012-07-27 10:17     ` Jan Beulich
2012-07-27 10:21       ` Ian Campbell
2012-07-27 10:33         ` Jan Beulich
2012-07-27 10:33         ` [Xen-devel] " Jan Beulich
2012-07-27 10:21       ` Ian Campbell
2012-07-27 10:17     ` Jan Beulich
