linux-kernel.vger.kernel.org archive mirror
* [PATCH] Boot PV guests with more than 128GB (v3) for v3.7.
@ 2012-08-16 16:03 Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree Konrad Rzeszutek Wilk
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel

Since (v2): http://lists.xen.org/archives/html/xen-devel/2012-07/msg01864.html
 - fixed a bug when a guest booted with a non-PMD-aligned memory size (say, 899MB).
 - fixed smatch warnings
 - moved a memset(xen_start_info->mfn_list, 0xff,.. ) from one patch to another.
Since v1: http://lists.xen.org/archives/html/xen-devel/2012-07/msg01561.html
 - added more comments and #ifdefs
 - squashed the L4, L3, and L2 recycle patches together
 - Added Acked-by's

These patches are quite well baked by now. If you have already looked at
this series before, feel free to skip ahead and just look at the last patch.

The explanation of these patches is exactly as it was in v1:

The details of this problem are nicely explained in:

 [PATCH 4/6] xen/p2m: Add logic to revector a P2M tree to use __va
 [PATCH 5/6] xen/mmu: Copy and revector the P2M tree.
 [PATCH 6/6] xen/mmu: Remove from __ka space PMD entries for pagetables

and the supporting patches are just nice optimizations. Pasting in
what those patches mention:


During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and the virtual layout (using __START_KERNEL_map
addresses) look as follows:

  phys                             __ka
/------------\                   /-------------------\
| 0          | empty             | 0xffffffff80000000|
| ..         |                   | ..                |
| 16MB       | <= kernel starts  | 0xffffffff81000000|
| ..         |                   |                   |
| 30MB       | <= kernel ends => | 0xffffffff81e43000|
| ..         |  & ramdisk starts | ..                |
| 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
| ..         |  & P2M starts     | ..                |
| ..         |                   | ..                |
| 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
| ..         | start_info        | 0xffffffffa25c7000|
| ..         | xenstore          | 0xffffffffa25c8000|
| ..         | console           | 0xffffffffa25c9000|
| 549MB      | <= page tables => | 0xffffffffa25ca000|
| ..         |                   |                   |
| 550MB      | <= PGT end     => | 0xffffffffa26e1000|
| ..         | boot stack        |                   |
\------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables take up
a fair chunk of the __ka address space. That is a problem, since
MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
right in there! During bootup this makes it impossible to
load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2
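
A quick arithmetic check makes the collision concrete. The constants below
are copied from the xc_dom log and the MODULES_VADDR value mentioned above;
the program itself is just an illustrative sketch, not part of the series:

	/* Sketch: how far the 128GB guest's P2M reaches past MODULES_VADDR. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long p2m_end = 0xffffffffa25c7000UL; /* P2M end (from log) */
		unsigned long modules = 0xffffffffa0000000UL; /* MODULES_VADDR      */

		if (p2m_end > modules)
			printf("P2M overlaps module space by %lu MB\n",
			       (p2m_end - modules) >> 20); /* prints 37 */
		return 0;
	}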

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:

 1) Make all P2M lookups use the __va address instead of the __ka
    address. Then we can safely erase from the __ka space the PMD
    pointers that point to the PFNs of the P2M array and be OK.
 2) Allocate a new array, copy the existing P2M into it,
    revector the P2M tree to use that, and return the old
    P2M to the memory allocator. This has the advantage that
    it sets the stage for using the XEN_ELF_NOTE_INIT_P2M
    feature. That feature allows us to set the exact virtual
    address space we want for the P2M - and allows us to
    boot as the initial domain on large machines.

So we pick option 2).
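
In outline, option 2) amounts to the following (a condensed sketch of what
the revector patch further down actually does - see "xen/p2m: Add logic to
revector a P2M tree to use __va leafs" for the real code and error handling):

	/* Condensed sketch of the revector. */
	mfn_list = alloc_bootmem_align(size, PAGE_SIZE); /* the new __va home */
	memset(mfn_list, 0xFF, size);                    /* INVALID_P2M_ENTRY */
	for (pfn = 0; pfn < max_pfn; pfn += P2M_PER_PAGE) {
		unsigned long *leaf = p2m_top[p2m_top_index(pfn)][p2m_mid_index(pfn)];

		/* Only leafs that live inside the old __ka array move. */
		if (leaf >= (unsigned long *)va_start && leaf < (unsigned long *)va_end) {
			copy_page(&mfn_list[pfn], leaf);
			p2m_top[p2m_top_index(pfn)][p2m_mid_index(pfn)] = &mfn_list[pfn];
		}
	}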

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

-- xen/mmu: Copy and revector the P2M tree:

The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
copies the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use the new shiny __va addresses for the new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka mappings for the old P2M array.

-- xen/mmu: Remove from __ka space PMD entries for pagetables:

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

The xen_remove_p2m_tree and the code around it have ripped out the __ka
mappings for the old P2M array.

Here we continue, doing the same to where the Xen page-tables were.
It is safe to do so, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.



* [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-17 17:29   ` [Xen-devel] " Stefano Stabellini
  2012-08-16 16:03 ` [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas Konrad Rzeszutek Wilk
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

It mixed up the p2m_mid_missing with p2m_missing. Also
remove some extra spaces.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 64effdc..e4adbfb 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -22,7 +22,7 @@
  *
  * P2M_PER_PAGE depends on the architecture, as a mfn is always
  * unsigned long (8 bytes on 64-bit, 4 bytes on 32), leading to
- * 512 and 1024 entries respectively. 
+ * 512 and 1024 entries respectively.
  *
  * In short, these structures contain the Machine Frame Number (MFN) of the PFN.
  *
@@ -139,11 +139,11 @@
  *      /    | ~0, ~0, ....  |
  *     |     \---------------/
  *     |
- *     p2m_missing             p2m_missing
- * /------------------\     /------------\
- * | [p2m_mid_missing]+---->| ~0, ~0, ~0 |
- * | [p2m_mid_missing]+---->| ..., ~0    |
- * \------------------/     \------------/
+ *   p2m_mid_missing           p2m_missing
+ * /-----------------\     /------------\
+ * | [p2m_missing]   +---->| ~0, ~0, ~0 |
+ * | [p2m_missing]   +---->| ..., ~0    |
+ * \-----------------/     \------------/
  *
  * where ~0 is INVALID_P2M_ENTRY. IDENTITY is (PFN | IDENTITY_BIT)
  */
@@ -423,7 +423,7 @@ static void free_p2m_page(void *p)
 	free_page((unsigned long)p);
 }
 
-/* 
+/*
  * Fully allocate the p2m structure for a given pfn.  We need to check
  * that both the top and mid levels are allocated, and make sure the
  * parallel mfn tree is kept in sync.  We may race with other cpus, so
-- 
1.7.7.6



* [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-17 17:35   ` [Xen-devel] " Stefano Stabellini
  2012-08-16 16:03 ` [PATCH 03/11] xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything Konrad Rzeszutek Wilk
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

instead of one big memblock_reserve. This way we can be more
selective in freeing regions (and it also makes it easier
to understand what is where).

[v1: Move the auto_translate_physmap to proper line]
[v2: Per Stefano suggestion add more comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/xen/p2m.c       |    5 ++++
 arch/x86/xen/setup.c     |    9 --------
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index ff962d4..e532eb5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -998,7 +998,54 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
 
 	return ret;
 }
+/*
+ * If the MFN is not in the m2p (provided to us by the hypervisor) this
+ * function won't do anything. In practice this means that the XenBus
+ * MFN won't be available for the initial domain. */
+static void __init xen_reserve_mfn(unsigned long mfn)
+{
+	unsigned long pfn;
+
+	if (!mfn)
+		return;
+	pfn = mfn_to_pfn(mfn);
+	if (phys_to_machine_mapping_valid(pfn))
+		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
+}
+static void __init xen_reserve_internals(void)
+{
+	unsigned long size;
+
+	if (!xen_pv_domain())
+		return;
+
+	/* xen_start_info does not exist in the M2P, hence can't use
+	 * xen_reserve_mfn. */
+	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
+
+	xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info));
+	xen_reserve_mfn(xen_start_info->store_mfn);
 
+	if (!xen_initial_domain())
+		xen_reserve_mfn(xen_start_info->console.domU.mfn);
+
+	if (xen_feature(XENFEAT_auto_translated_physmap))
+		return;
+
+	/*
+	 * ALIGN up to compensate for the p2m_page pointing to an array that
+	 * can be partially filled (look in xen_build_dynamic_phys_to_machine).
+	 */
+
+	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+
+	/* We could use xen_reserve_mfn here, but would end up looping quite
+	 * a lot (and call memblock_reserve for each PAGE), so let's just use
+	 * the easy way and reserve it wholesale. */
+	memblock_reserve(__pa(xen_start_info->mfn_list), size);
+
+	/* The pagetables are reserved in mmu.c */
+}
 void xen_setup_shared_info(void)
 {
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
@@ -1362,6 +1409,7 @@ asmlinkage void __init xen_start_kernel(void)
 	xen_raw_console_write("mapping kernel into physical memory\n");
 	pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
 
+	xen_reserve_internals();
 	/* Allocate and initialize top and mid mfn levels for p2m structure */
 	xen_build_mfn_list_list();
 
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index e4adbfb..6a2bfa4 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -388,6 +388,11 @@ void __init xen_build_dynamic_phys_to_machine(void)
 	}
 
 	m2p_override_init();
+
+	/* NOTE: We cannot call memblock_reserve here for the mfn_list as there
+	 * aren't enough pieces to make it work (for one - we are still using the
+	 * Xen provided pagetable). Do it later in xen_reserve_internals.
+	 */
 }
 
 unsigned long get_phys_to_machine(unsigned long pfn)
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a4790bf..9efca75 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -424,15 +424,6 @@ char * __init xen_memory_setup(void)
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
 
-	/*
-	 * Reserve Xen bits:
-	 *  - mfn_list
-	 *  - xen_start_info
-	 * See comment above "struct start_info" in <xen/interface/xen.h>
-	 */
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
-
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
 	return "Xen";
-- 
1.7.7.6



* [PATCH 03/11] xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 04/11] xen/mmu: Provide comments describing the _ka and _va aliasing issue Konrad Rzeszutek Wilk
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

We don't need to return the new PGD - as we do not use it.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c |    5 +----
 arch/x86/xen/mmu.c       |   10 ++--------
 arch/x86/xen/xen-ops.h   |    2 +-
 3 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index e532eb5..993e2a5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1305,7 +1305,6 @@ asmlinkage void __init xen_start_kernel(void)
 {
 	struct physdev_set_iopl set_iopl;
 	int rc;
-	pgd_t *pgd;
 
 	if (!xen_start_info)
 		return;
@@ -1397,8 +1396,6 @@ asmlinkage void __init xen_start_kernel(void)
 	acpi_numa = -1;
 #endif
 
-	pgd = (pgd_t *)xen_start_info->pt_base;
-
 	/* Don't do the full vcpu_info placement stuff until we have a
 	   possible map and a non-dummy shared_info. */
 	per_cpu(xen_vcpu, 0) = &HYPERVISOR_shared_info->vcpu_info[0];
@@ -1407,7 +1404,7 @@ asmlinkage void __init xen_start_kernel(void)
 	early_boot_irqs_disabled = true;
 
 	xen_raw_console_write("mapping kernel into physical memory\n");
-	pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
+	xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base, xen_start_info->nr_pages);
 
 	xen_reserve_internals();
 	/* Allocate and initialize top and mid mfn levels for p2m structure */
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 3a73785..4ac21a4 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1719,8 +1719,7 @@ static void convert_pfn_mfn(void *v)
  * of the physical mapping once some sort of allocator has been set
  * up.
  */
-pgd_t * __init xen_setup_kernel_pagetable(pgd_t *pgd,
-					 unsigned long max_pfn)
+void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
 	pud_t *l3;
 	pmd_t *l2;
@@ -1781,8 +1780,6 @@ pgd_t * __init xen_setup_kernel_pagetable(pgd_t *pgd,
 
 	memblock_reserve(__pa(xen_start_info->pt_base),
 			 xen_start_info->nr_pt_frames * PAGE_SIZE);
-
-	return pgd;
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
@@ -1825,8 +1822,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
 	pv_mmu_ops.write_cr3 = &xen_write_cr3;
 }
 
-pgd_t * __init xen_setup_kernel_pagetable(pgd_t *pgd,
-					 unsigned long max_pfn)
+void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
 	pmd_t *kernel_pmd;
 
@@ -1858,8 +1854,6 @@ pgd_t * __init xen_setup_kernel_pagetable(pgd_t *pgd,
 
 	memblock_reserve(__pa(xen_start_info->pt_base),
 			 xen_start_info->nr_pt_frames * PAGE_SIZE);
-
-	return initial_page_table;
 }
 #endif	/* CONFIG_X86_64 */
 
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 202d4c1..2230f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -27,7 +27,7 @@ void xen_setup_mfn_list_list(void);
 void xen_setup_shared_info(void);
 void xen_build_mfn_list_list(void);
 void xen_setup_machphys_mapping(void);
-pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
+void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
 void xen_reserve_top(void);
 extern unsigned long xen_max_p2m_pfn;
 
-- 
1.7.7.6



* [PATCH 04/11] xen/mmu: Provide comments describing the _ka and _va aliasing issue
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 03/11] xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 05/11] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Which is that the level2_kernel_pgt (__ka virtual addresses)
and level2_ident_pgt (__va virtual addresses) contain the same
PMD entries. So if you modify a PTE in __ka, it will be reflected
in __va (and vice-versa).
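
Concretely (a hypothetical walk-through using the table indices from the
comments added below; this is not code from the patch):

	/* Both page-table walks end at the same L1 (PTE) page X:
	 *   __va: L4[272] -> level3_ident_pgt[0]    -> level2_ident_pgt[n]  -> X
	 *   __ka: L4[511] -> level3_kernel_pgt[510] -> level2_kernel_pgt[n] -> X
	 * level2_ident_pgt[n] and level2_kernel_pgt[n] hold identical PMD
	 * entries (both copied from the same Xen-provided L2), so a set_pte()
	 * reached through one alias is visible through the other. Changing a
	 * PMD entry itself only affects that one view - which is why
	 * cleanup_highmap can safely zap PMDs. */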

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4ac21a4..6ba6100 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1734,19 +1734,36 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	init_level4_pgt[0] = __pgd(0);
 
 	/* Pre-constructed entries are in pfn, so convert to mfn */
+	/* L4[272] -> level3_ident_pgt
+	 * L4[511] -> level3_kernel_pgt */
 	convert_pfn_mfn(init_level4_pgt);
+
+	/* L3_i[0] -> level2_ident_pgt */
 	convert_pfn_mfn(level3_ident_pgt);
+	/* L3_k[510] -> level2_kernel_pgt
+	 * L3_i[511] -> level2_fixmap_pgt */
 	convert_pfn_mfn(level3_kernel_pgt);
 
+	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
 
+	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
+	 * Both L4[272][0] and L4[511][511] have entries that point to the same
+	 * L2 (PMD) tables. Meaning that if you modify it in __va space
+	 * it will be also modified in the __ka space! (But if you just
+	 * modify the PMD table to point to other PTE's or none, then you
+	 * are OK - which is what cleanup_highmap does) */
 	memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	/* Graft it onto L4[511][511] */
 	memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
 
+	/* Get [511][510] and graft that in level2_fixmap_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
 	memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	/* Note that we don't do anything with level1_fixmap_pgt which
+	 * we don't need. */
 
 	/* Set up identity map */
 	xen_map_identity_early(level2_ident_pgt, max_pfn);
-- 
1.7.7.6



* [PATCH 05/11] xen/mmu: use copy_page instead of memcpy.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 04/11] xen/mmu: Provide comments describing the _ka and _va aliasing issue Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

After all, this is what it is there for.
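
For reference, each of the copied tables is exactly one page in size -
e.g. on 64-bit:

	PTRS_PER_PMD * sizeof(pmd_t) = 512 * 8 = 4096 = PAGE_SIZE

so copy_page(), the arch-optimized whole-page copy, is a drop-in
replacement for these memcpy() calls.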

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6ba6100..7247e5a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1754,14 +1754,14 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
-	memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_ident_pgt, l2);
 	/* Graft it onto L4[511][511] */
-	memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_kernel_pgt, l2);
 
 	/* Get [511][510] and graft that in level2_fixmap_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_fixmap_pgt, l2);
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
@@ -1821,8 +1821,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
 	 */
 	swapper_kernel_pmd =
 		extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
-	memcpy(swapper_kernel_pmd, initial_kernel_pmd,
-	       sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(swapper_kernel_pmd, initial_kernel_pmd);
 	swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
 	set_page_prot(swapper_kernel_pmd, PAGE_KERNEL_RO);
@@ -1851,11 +1850,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 				  512*1024);
 
 	kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd);
-	memcpy(initial_kernel_pmd, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(initial_kernel_pmd, kernel_pmd);
 
 	xen_map_identity_early(initial_kernel_pmd, max_pfn);
 
-	memcpy(initial_page_table, pgd, sizeof(pgd_t) * PTRS_PER_PGD);
+	copy_page(initial_page_table, pgd);
 	initial_page_table[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
 
-- 
1.7.7.6



* [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (4 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 05/11] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-17 17:41   ` [Xen-devel] " Stefano Stabellini
  2012-08-16 16:03 ` [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Because we do not need it. During startup Xen provides
us with all the memory mappings that we need to function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7247e5a..a59070b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -84,6 +84,7 @@
  */
 DEFINE_SPINLOCK(xen_reservation_lock);
 
+#ifdef CONFIG_X86_32
 /*
  * Identity map, in addition to plain kernel map.  This needs to be
  * large enough to allocate page table pages to allocate the rest.
@@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
  */
 #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
 static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
-
+#endif
 #ifdef CONFIG_X86_64
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
@@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
 	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
 		BUG();
 }
-
+#ifdef CONFIG_X86_32
 static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 {
 	unsigned pmdidx, pteidx;
@@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 
 	set_page_prot(pmd, PAGE_KERNEL_RO);
 }
-
+#endif
 void __init xen_setup_machphys_mapping(void)
 {
 	struct xen_machphys_mapping mapping;
@@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
-	/* Set up identity map */
-	xen_map_identity_early(level2_ident_pgt, max_pfn);
-
 	/* Make pagetable pieces RO */
 	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
 
-- 
1.7.7.6



* [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (5 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-17 18:07   ` [Xen-devel] " Stefano Stabellini
  2012-08-16 16:03 ` [PATCH 08/11] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

As we are not using them - we end up only using the L1 pagetables,
grafting those onto our page-tables.
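
The recycling itself is mechanical - a sketch of what the new
check_pt_base() in the diff below does for each recyclable page:

	set_page_prot((void *)addr, PAGE_KERNEL); /* lift the RO pinning   */
	clear_page((void *)addr);                 /* scrub the old entries */
	(*pt_base)++; /* or (*pt_end)--; either way the [pt_base, pt_end)
	               * window that later gets memblock_reserve()d shrinks
	               * and the page is free for general use. */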

[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop]
[v3: Fix smatch warnings]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   40 +++++++++++++++++++++++++++++++++-------
 1 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a59070b..bd92c82 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1708,7 +1708,20 @@ static void convert_pfn_mfn(void *v)
 	for (i = 0; i < PTRS_PER_PTE; i++)
 		pte[i] = xen_make_pte(pte[i].pte);
 }
-
+static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
+				 unsigned long addr)
+{
+	if (*pt_base == PFN_DOWN(__pa(addr))) {
+		set_page_prot((void *)addr, PAGE_KERNEL);
+		clear_page((void *)addr);
+		(*pt_base)++;
+	}
+	if (*pt_end == PFN_DOWN(__pa(addr))) {
+		set_page_prot((void *)addr, PAGE_KERNEL);
+		clear_page((void *)addr);
+		(*pt_end)--;
+	}
+}
 /*
  * Set up the initial kernel pagetable.
  *
@@ -1724,6 +1737,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
 	pud_t *l3;
 	pmd_t *l2;
+	unsigned long addr[3];
+	unsigned long pt_base, pt_end;
+	unsigned i;
 
 	/* max_pfn_mapped is the last pfn mapped in the initial memory
 	 * mappings. Considering that on Xen after the kernel mappings we
@@ -1731,6 +1747,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * set max_pfn_mapped to the last real pfn mapped. */
 	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
+	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
+	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
+
 	/* Zap identity mapping */
 	init_level4_pgt[0] = __pgd(0);
 
@@ -1749,6 +1768,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
 
+	addr[0] = (unsigned long)pgd;
+	addr[1] = (unsigned long)l3;
+	addr[2] = (unsigned long)l2;
 	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
 	 * Both L4[272][0] and L4[511][511] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
@@ -1782,20 +1804,24 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Unpin Xen-provided one */
 	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
-	/* Switch over */
-	pgd = init_level4_pgt;
-
 	/*
 	 * At this stage there can be no user pgd, and no page
 	 * structure to attach it to, so make sure we just set kernel
 	 * pgd.
 	 */
 	xen_mc_batch();
-	__xen_write_cr3(true, __pa(pgd));
+	__xen_write_cr3(true, __pa(init_level4_pgt));
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
 
-	memblock_reserve(__pa(xen_start_info->pt_base),
-			 xen_start_info->nr_pt_frames * PAGE_SIZE);
+	/* We can't rip out L3 and L2 that easily, as the Xen pagetables are
+	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
+	 * the initial domain. For guests using the toolstack, they are in
+	 * [L4], [L3], [L2], [L1], [L1] ... order. */
+	for (i = 0; i < ARRAY_SIZE(addr); i++)
+		check_pt_base(&pt_base, &pt_end, addr[i]);
+
+	/* Our (by three pages) smaller Xen pagetable that we are using */
+	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6



* [PATCH 08/11] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (6 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 09/11] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and the virtual layout (using __START_KERNEL_map
addresses) look as follows:

  phys                             __ka
/------------\                   /-------------------\
| 0          | empty             | 0xffffffff80000000|
| ..         |                   | ..                |
| 16MB       | <= kernel starts  | 0xffffffff81000000|
| ..         |                   |                   |
| 30MB       | <= kernel ends => | 0xffffffff81e43000|
| ..         |  & ramdisk starts | ..                |
| 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
| ..         |  & P2M starts     | ..                |
| ..         |                   | ..                |
| 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
| ..         | start_info        | 0xffffffffa25c7000|
| ..         | xenstore          | 0xffffffffa25c8000|
| ..         | console           | 0xffffffffa25c9000|
| 549MB      | <= page tables => | 0xffffffffa25ca000|
| ..         |                   |                   |
| 550MB      | <= PGT end     => | 0xffffffffa26e1000|
| ..         | boot stack        |                   |
\------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables take up
a fair chunk of the __ka address space. That is a problem, since
MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
right in there! During bootup this makes it impossible to
load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:

 1) Make all P2M lookups use the __va address instead of the __ka
    address. Then we can safely erase from the __ka space the PMD
    pointers that point to the PFNs of the P2M array and be OK.
 2) Allocate a new array, copy the existing P2M into it,
    revector the P2M tree to use that, and return the old
    P2M to the memory allocator. This has the advantage that
    it sets the stage for using the XEN_ELF_NOTE_INIT_P2M
    feature. That feature allows us to set the exact virtual
    address space we want for the P2M - and allows us to
    boot as the initial domain on large machines.

So we pick option 2).

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h |    1 +
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 6a2bfa4..bbfd085 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -394,7 +394,77 @@ void __init xen_build_dynamic_phys_to_machine(void)
 	 * Xen provided pagetable). Do it later in xen_reserve_internals.
 	 */
 }
+#ifdef CONFIG_X86_64
+#include <linux/bootmem.h>
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	unsigned long va_start;
+	unsigned long va_end;
+	unsigned long pfn;
+	unsigned long *mfn_list = NULL;
+	unsigned long size;
+
+	va_start = xen_start_info->mfn_list;
+	/* We copy in increments of P2M_PER_PAGE * sizeof(unsigned long),
+	 * so make sure it is rounded up to that */
+	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+	va_end = va_start + size;
+
+	/* If we were revectored already, don't do it again. */
+	if (va_start <= __START_KERNEL_map && va_start >= __PAGE_OFFSET)
+		return 0;
+
+	mfn_list = alloc_bootmem_align(size, PAGE_SIZE);
+	if (!mfn_list) {
+		pr_warn("Could not allocate space for a new P2M tree!\n");
+		return xen_start_info->mfn_list;
+	}
+	/* Fill it out with INVALID_P2M_ENTRY value */
+	memset(mfn_list, 0xFF, size);
+
+	for (pfn = 0; pfn < ALIGN(MAX_DOMAIN_PAGES, P2M_PER_PAGE); pfn += P2M_PER_PAGE) {
+		unsigned topidx = p2m_top_index(pfn);
+		unsigned mididx;
+		unsigned long *mid_p;
+
+		if (!p2m_top[topidx])
+			continue;
+
+		if (p2m_top[topidx] == p2m_mid_missing)
+			continue;
+
+		mididx = p2m_mid_index(pfn);
+		mid_p = p2m_top[topidx][mididx];
+		if (!mid_p)
+			continue;
+		if ((mid_p == p2m_missing) || (mid_p == p2m_identity))
+			continue;
+
+		if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
+			continue;
+
+		/* The old va. Rebase it on mfn_list */
+		if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
+			unsigned long *new;
+
+			new = &mfn_list[pfn];
+
+			copy_page(new, mid_p);
+			p2m_top[topidx][mididx] = &mfn_list[pfn];
+			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
 
+		}
+		/* This should be the leafs allocated for identity from _brk. */
+	}
+	return (unsigned long)mfn_list;
+
+}
+#else
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	return 0;
+}
+#endif
 unsigned long get_phys_to_machine(unsigned long pfn)
 {
 	unsigned topidx, mididx, idx;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2230f57..bb5a810 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
 void __init xen_build_dynamic_phys_to_machine(void);
+unsigned long __init xen_revector_p2m_tree(void);
 
 void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
-- 
1.7.7.6



* [PATCH 09/11] xen/mmu: Copy and revector the P2M tree.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (7 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 08/11] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 10/11] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Please first read the description in the "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.

The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
copies the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use the new shiny __va addresses for the new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka mappings for the old P2M array.

[v1: Fix smatch warnings]
[v2: memset was doing 0 instead of 0xff]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index bd92c82..e0919c5 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1183,9 +1183,64 @@ static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
 
 static void xen_post_allocator_init(void);
 
+#ifdef CONFIG_X86_64
+static void __init xen_cleanhighmap(unsigned long vaddr,
+				    unsigned long vaddr_end)
+{
+	unsigned long kernel_end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
+	pmd_t *pmd = level2_kernel_pgt + pmd_index(vaddr);
+
+	/* NOTE: The loop is more greedy than the cleanup_highmap variant.
+	 * We include the PMD passed in on _both_ boundaries. */
+	for (; vaddr <= vaddr_end && (pmd < (level2_kernel_pgt + PAGE_SIZE));
+			pmd++, vaddr += PMD_SIZE) {
+		if (pmd_none(*pmd))
+			continue;
+		if (vaddr < (unsigned long) _text || vaddr > kernel_end)
+			set_pmd(pmd, __pmd(0));
+	}
+	/* In case we did something silly, we should crash in this function
+	 * instead of somewhere later and be confusing. */
+	xen_mc_flush();
+}
+#endif
 static void __init xen_pagetable_setup_done(pgd_t *base)
 {
+#ifdef CONFIG_X86_64
+	unsigned long size;
+	unsigned long addr;
+#endif
+
 	xen_setup_shared_info();
+#ifdef CONFIG_X86_64
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		unsigned long new_mfn_list;
+
+		size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+
+		/* On 32-bit, we get zero so this never gets executed. */
+		new_mfn_list = xen_revector_p2m_tree();
+		if (new_mfn_list && new_mfn_list != xen_start_info->mfn_list) {
+			/* using __ka address and sticking INVALID_P2M_ENTRY! */
+			memset((void *)xen_start_info->mfn_list, 0xff, size);
+
+			/* We should be in __ka space. */
+			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
+			addr = xen_start_info->mfn_list;
+			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+			/* We roundup to the PMD, which means that if anybody at this stage is
+			 * using the __ka address of xen_start_info or xen_start_info->shared_info
+			 * they are going to crash. Fortunately we have already revectored
+			 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+			size = roundup(size, PMD_SIZE);
+			xen_cleanhighmap(addr, addr + size);
+
+			memblock_free(__pa(xen_start_info->mfn_list), size);
+			/* And revector! Bye bye old array */
+			xen_start_info->mfn_list = new_mfn_list;
+		}
+	}
+#endif
 	xen_post_allocator_init();
 }
 
@@ -1822,6 +1877,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* Revector the xen_start_info */
+	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6



* [PATCH 10/11] xen/mmu: Remove from __ka space PMD entries for pagetables.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (8 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 09/11] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-16 16:03 ` [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables Konrad Rzeszutek Wilk
  2012-08-17 17:39 ` [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

Please first read the description in the "xen/mmu: Copy and revector the
P2M tree" patch.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

The xen_remove_p2m_tree and the code around it have ripped out the __ka
mappings for the old P2M array.

Here we continue, doing the same to where the Xen page-tables were.
It is safe to do so, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.

[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e0919c5..5a880b8 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1240,6 +1240,25 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			xen_start_info->mfn_list = new_mfn_list;
 		}
 	}
+	/* At this stage, cleanup_highmap has already cleaned __ka space
+	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
+	 * the ramdisk). We continue on, erasing PMD entries that point to page
+	 * tables - do note that they are accessible at this stage via __va.
+	 * For good measure we also round up to the PMD - which means that if
+	 * anybody is using a __ka address for the initial boot-stack and
+	 * tries to use it, they are going to crash. The xen_start_info has been
+	 * taken care of already in xen_setup_kernel_pagetable. */
+	addr = xen_start_info->pt_base;
+	size = roundup(xen_start_info->nr_pt_frames * PAGE_SIZE, PMD_SIZE);
+
+	xen_cleanhighmap(addr, addr + size);
+	xen_start_info->pt_base = (unsigned long)__va(__pa(xen_start_info->pt_base));
+#ifdef DEBUG
+	/* This is superfluous and not necessary, but you know what,
+	 * let's do it. The MODULES_VADDR -> MODULES_END range should be
+	 * clear of anything at this stage. */
+	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
+#endif
 #endif
 	xen_post_allocator_init();
 }
-- 
1.7.7.6



* [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (9 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 10/11] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
@ 2012-08-16 16:03 ` Konrad Rzeszutek Wilk
  2012-08-21 14:18   ` [Xen-devel] " Stefano Stabellini
  2012-09-17 18:06   ` William Dauchy
  2012-08-17 17:39 ` [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
  11 siblings, 2 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-16 16:03 UTC (permalink / raw)
  To: linux-kernel, xen-devel; +Cc: Konrad Rzeszutek Wilk

We call memblock_reserve for [start of mfn list] -> [PMD aligned end
of mfn list] instead of [start of mfn list] -> [page aligned end of mfn list].

This has the disastrous effect that if at bootup the end of mfn_list is
not PMD aligned we end up returning to memblock parts of the region
past the mfn_list array - and those parts are the PTE tables - with
the result that we see this at bootup:

Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 1860k freed
Freeing unused kernel memory: 200k freed
(XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
...
(XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
(XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
.. and so on.
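
A worked example with the 899MB guest mentioned in the cover letter
(hypothetical but representative numbers):

	nr_pages = 899MB / 4KB                      = 230144 (0x38300) pages
	size     = nr_pages * sizeof(unsigned long) = 0x1c1800 bytes
	PAGE_ALIGN(size)                            = 0x1c2000  <- should be freed
	roundup(size, PMD_SIZE)                     = 0x200000  <- was freed

i.e. 0x3e000 bytes (248KB) past the end of mfn_list - which happen to be
the PTE tables - were handed back to memblock by mistake.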

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5a880b8..6019c22 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1227,7 +1227,6 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			/* We should be in __ka space. */
 			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
 			addr = xen_start_info->mfn_list;
-			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
 			/* We roundup to the PMD, which means that if anybody at this stage is
 			 * using the __ka address of xen_start_info or xen_start_info->shared_info
 			 * they are going to crash. Fortunately we have already revectored
@@ -1235,6 +1234,7 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			size = roundup(size, PMD_SIZE);
 			xen_cleanhighmap(addr, addr + size);
 
+			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
 			memblock_free(__pa(xen_start_info->mfn_list), size);
 			/* And revector! Bye bye old array */
 			xen_start_info->mfn_list = new_mfn_list;
-- 
1.7.7.6



* Re: [Xen-devel] [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree.
  2012-08-16 16:03 ` [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree Konrad Rzeszutek Wilk
@ 2012-08-17 17:29   ` Stefano Stabellini
  0 siblings, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-17 17:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> It mixed up the p2m_mid_missing with p2m_missing. Also
> remove some extra spaces.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

>  arch/x86/xen/p2m.c |   14 +++++++-------
>  1 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
> index 64effdc..e4adbfb 100644
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -22,7 +22,7 @@
>   *
>   * P2M_PER_PAGE depends on the architecture, as a mfn is always
>   * unsigned long (8 bytes on 64-bit, 4 bytes on 32), leading to
> - * 512 and 1024 entries respectively. 
> + * 512 and 1024 entries respectively.
>   *
>   * In short, these structures contain the Machine Frame Number (MFN) of the PFN.
>   *
> @@ -139,11 +139,11 @@
>   *      /    | ~0, ~0, ....  |
>   *     |     \---------------/
>   *     |
> - *     p2m_missing             p2m_missing
> - * /------------------\     /------------\
> - * | [p2m_mid_missing]+---->| ~0, ~0, ~0 |
> - * | [p2m_mid_missing]+---->| ..., ~0    |
> - * \------------------/     \------------/
> + *   p2m_mid_missing           p2m_missing
> + * /-----------------\     /------------\
> + * | [p2m_missing]   +---->| ~0, ~0, ~0 |
> + * | [p2m_missing]   +---->| ..., ~0    |
> + * \-----------------/     \------------/
>   *
>   * where ~0 is INVALID_P2M_ENTRY. IDENTITY is (PFN | IDENTITY_BIT)
>   */
> @@ -423,7 +423,7 @@ static void free_p2m_page(void *p)
>  	free_page((unsigned long)p);
>  }
>  
> -/* 
> +/*
>   * Fully allocate the p2m structure for a given pfn.  We need to check
>   * that both the top and mid levels are allocated, and make sure the
>   * parallel mfn tree is kept in sync.  We may race with other cpus, so
> -- 
> 1.7.7.6
> 
> 
> 


* Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-16 16:03 ` [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas Konrad Rzeszutek Wilk
@ 2012-08-17 17:35   ` Stefano Stabellini
  2012-08-20 14:13     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-17 17:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> instead of one big memblock_reserve. This way we can be more
> selective in freeing regions (and it also makes it easier
> to understand what is where).
> 
> [v1: Move the auto_translate_physmap to proper line]
> [v2: Per Stefano suggestion add more comments]
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

much better now!

>  arch/x86/xen/enlighten.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/xen/p2m.c       |    5 ++++
>  arch/x86/xen/setup.c     |    9 --------
>  3 files changed, 53 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index ff962d4..e532eb5 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -998,7 +998,54 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
>  
>  	return ret;
>  }
> +/*
> + * If the MFN is not in the m2p (provided to us by the hypervisor) this
> + * function won't do anything. In practice this means that the XenBus
> + * MFN won't be available for the initial domain. */
> +static void __init xen_reserve_mfn(unsigned long mfn)
> +{
> +	unsigned long pfn;
> +
> +	if (!mfn)
> +		return;
> +	pfn = mfn_to_pfn(mfn);
> +	if (phys_to_machine_mapping_valid(pfn))
> +		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
> +}
> +static void __init xen_reserve_internals(void)
> +{
> +	unsigned long size;
> +
> +	if (!xen_pv_domain())
> +		return;
> +
> +	/* xen_start_info does not exist in the M2P, hence can't use
> +	 * xen_reserve_mfn. */
> +	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
> +
> +	xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info));
> +	xen_reserve_mfn(xen_start_info->store_mfn);
>  
> +	if (!xen_initial_domain())
> +		xen_reserve_mfn(xen_start_info->console.domU.mfn);
> +
> +	if (xen_feature(XENFEAT_auto_translated_physmap))
> +		return;
> +
> +	/*
> +	 * ALIGN up to compensate for the p2m_page pointing to an array that
> + * can be partially filled (look in xen_build_dynamic_phys_to_machine).
> +	 */
> +
> +	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
> +
> +	/* We could use xen_reserve_mfn here, but would end up looping quite
> +	 * a lot (and calling memblock_reserve for each PAGE), so let's just use
> +	 * the easy way and reserve it wholesale. */
> +	memblock_reserve(__pa(xen_start_info->mfn_list), size);
> +
> +	/* The pagetables are reserved in mmu.c */
> +}
>  void xen_setup_shared_info(void)
>  {
>  	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> @@ -1362,6 +1409,7 @@ asmlinkage void __init xen_start_kernel(void)
>  	xen_raw_console_write("mapping kernel into physical memory\n");
>  	pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
>  
> +	xen_reserve_internals();
>  	/* Allocate and initialize top and mid mfn levels for p2m structure */
>  	xen_build_mfn_list_list();
>  
> diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
> index e4adbfb..6a2bfa4 100644
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -388,6 +388,11 @@ void __init xen_build_dynamic_phys_to_machine(void)
>  	}
>  
>  	m2p_override_init();
> +
> +	/* NOTE: We cannot call memblock_reserve here for the mfn_list as there
> +	 * aren't enough pieces to make it work (for one - we are still using the
> +	 * Xen provided pagetable). Do it later in xen_reserve_internals.
> +	 */
>  }
>  
>  unsigned long get_phys_to_machine(unsigned long pfn)
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index a4790bf..9efca75 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -424,15 +424,6 @@ char * __init xen_memory_setup(void)
>  	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
>  			E820_RESERVED);
>  
> -	/*
> -	 * Reserve Xen bits:
> -	 *  - mfn_list
> -	 *  - xen_start_info
> -	 * See comment above "struct start_info" in <xen/interface/xen.h>
> -	 */
> -	memblock_reserve(__pa(xen_start_info->mfn_list),
> -			 xen_start_info->pt_base - xen_start_info->mfn_list);
> -
>  	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
>  
>  	return "Xen";
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 
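The round-trip that xen_reserve_mfn depends on can be spelled out as a
short sketch (mfn_to_pfn and phys_to_machine_mapping_valid are the real
helpers; the wrapper itself is illustrative, not part of the patch):

	/* An MFN the M2P does not know about yields a PFN whose P2M
	 * entry is invalid, so the reservation is skipped - which is
	 * why the XenBus MFN is not reserved in the initial domain. */
	static bool __init mfn_is_in_m2p(unsigned long mfn)
	{
		unsigned long pfn = mfn_to_pfn(mfn);

		return phys_to_machine_mapping_valid(pfn);
	}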

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Boot PV guests with more than 128GB (v3) for v3.7.
  2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
                   ` (10 preceding siblings ...)
  2012-08-16 16:03 ` [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables Konrad Rzeszutek Wilk
@ 2012-08-17 17:39 ` Konrad Rzeszutek Wilk
  11 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-17 17:39 UTC (permalink / raw)
  To: linux-kernel, xen-devel

On Thu, Aug 16, 2012 at 12:03:18PM -0400, Konrad Rzeszutek Wilk wrote:
> Since (v2): http://lists.xen.org/archives/html/xen-devel/2012-07/msg01864.html
>  - fixed a bug if guest booted with non-PMD aligned size (say, 899MB).
>  - fixed smack warnings
>  - moved a memset(xen_start_info->mfn_list, 0xff,.. ) from one patch to another.

And two more bug-fixes:

>From f042050664c97a365e98daf5783f682d734e35f8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 16 Aug 2012 16:38:55 -0400
Subject: [PATCH 1/2] xen/p2m: When revectoring deal with holes in the P2M
 array.

When we free the PFNs and then subsequently populate them back
during bootup:

Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added

We end up having the P2M array (that is, the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY entries. The patch titled

"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."

recycles said slots and replaces the P2M tree leafs that point
to &mfn_list[xx] with p2m_identity or p2m_missing.

And re-uses those P2M array sections for other P2M tree leafs.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:

P2M[0][256][0] -> P2M[0][257][0] get turned into IDENTITY_FRAME.

We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:

Populating 1acb8a-1f20e9 pfn range: 283999 pages added

we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.

That is all OK, except that when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated one. Since that is no longer the case - we have
holes in the P2M that point to p2m_missing or p2m_identity - we
have to take that into account.

[v2: Check for overflow]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index bbfd085..3b5bd7e 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -401,6 +401,7 @@ unsigned long __init xen_revector_p2m_tree(void)
 	unsigned long va_start;
 	unsigned long va_end;
 	unsigned long pfn;
+	unsigned long pfn_free = 0;
 	unsigned long *mfn_list = NULL;
 	unsigned long size;
 
@@ -443,15 +444,22 @@ unsigned long __init xen_revector_p2m_tree(void)
 		if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
 			continue;
 
+		if ((pfn_free + P2M_PER_PAGE) * sizeof(unsigned long) > size) {
+			WARN(1, "Only allocated for %ld entries, but we want %ld!\n",
+			     size / sizeof(unsigned long), pfn_free + P2M_PER_PAGE);
+			return 0;
+		}
 		/* The old va. Rebase it on mfn_list */
 		if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
 			unsigned long *new;
 
-			new = &mfn_list[pfn];
+			new = &mfn_list[pfn_free];
 
 			copy_page(new, mid_p);
-			p2m_top[topidx][mididx] = &mfn_list[pfn];
-			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
+			p2m_top[topidx][mididx] = &mfn_list[pfn_free];
+			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn_free]);
+
+			pfn_free += P2M_PER_PAGE;
 
 		}
 		/* This should be the leafs allocated for identity from _brk. */
-- 
1.7.7.6
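The index arithmetic behind the P2M[0][256][0] example in the commit
message, written out (64-bit, so P2M_PER_PAGE and P2M_MID_PER_PAGE are
both 512; a worked sketch, not code from the patch):

	unsigned long pfn    = 0x20000;			/* 131072	*/
	unsigned long topidx = pfn / (512 * 512);	/* 0		*/
	unsigned long mididx = (pfn / 512) % 512;	/* 256		*/
	unsigned long idx    = pfn % 512;		/* 0		*/
	/* => P2M[0][256][0]; the 512-entry 1-1 range 0x20000->0x20200
	 * fills exactly one leaf page, which is why it can be replaced
	 * wholesale by p2m_identity. */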



>From 60a9b396456a2990bb3490671ca832d36e8dd6aa Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 17 Aug 2012 09:35:31 -0400
Subject: [PATCH 2/2] xen/mmu: If the revector fails, don't attempt to
 revector anything else.

If the P2M revectoring fails, we would try to continue on by
cleaning the PMDs for the L1 (PTE) page-tables. xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share a PMD, we would wipe out part of the __ka space
that is still used in the P2M tree to point to P2M leafs.

This fixes it by skipping that cleanup and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6019c22..0dac3d2 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1238,7 +1238,8 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			memblock_free(__pa(xen_start_info->mfn_list), size);
 			/* And revector! Bye bye old array */
 			xen_start_info->mfn_list = new_mfn_list;
-		}
+		} else
+			goto skip;
 	}
 	/* At this stage, cleanup_highmap has already cleaned __ka space
 	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1259,6 +1260,7 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 	 * anything at this stage. */
 	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
 #endif
+skip:
 #endif
 	xen_post_allocator_init();
 }
-- 
1.7.7.6
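Schematically, the situation the 'goto skip' above guards against looks
like this (addresses illustrative):

	/*   __ka:  ...[ PMD n-1 ][ PMD n ][ PMD n+1 ]...
	 *
	 * The old mfn_list can end part-way through [ PMD n ], and
	 * live P2M leafs grafted from it can still occupy the tail of
	 * that same PMD.  xen_cleanhighmap() rounds both ends of its
	 * range to PMD boundaries, so running it after a failed
	 * revector would tear down [ PMD n ] - and the still-referenced
	 * leafs with it. */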


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-16 16:03 ` [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
@ 2012-08-17 17:41   ` Stefano Stabellini
  2012-08-17 17:45     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-17 17:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> B/c we do not need it. During startup Xen provides
> us with all the memory mappings that we need to function.

> Shouldn't we check to make sure that is actually true (I am thinking of
nr_pt_frames)?
Or is it actually stated somewhere in the Xen headers?



> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 7247e5a..a59070b 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -84,6 +84,7 @@
>   */
>  DEFINE_SPINLOCK(xen_reservation_lock);
>  
> +#ifdef CONFIG_X86_32
>  /*
>   * Identity map, in addition to plain kernel map.  This needs to be
>   * large enough to allocate page table pages to allocate the rest.
> @@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
>   */
>  #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
>  static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
> -
> +#endif
>  #ifdef CONFIG_X86_64
>  /* l3 pud for userspace vsyscall mapping */
>  static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
> @@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
>  	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
>  		BUG();
>  }
> -
> +#ifdef CONFIG_X86_32
>  static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
>  {
>  	unsigned pmdidx, pteidx;
> @@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
>  
>  	set_page_prot(pmd, PAGE_KERNEL_RO);
>  }
> -
> +#endif
>  void __init xen_setup_machphys_mapping(void)
>  {
>  	struct xen_machphys_mapping mapping;
> @@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	/* Note that we don't do anything with level1_fixmap_pgt which
>  	 * we don't need. */
>  
> -	/* Set up identity map */
> -	xen_map_identity_early(level2_ident_pgt, max_pfn);
> -
>  	/* Make pagetable pieces RO */
>  	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
> +	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
>  
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 
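The kind of check being asked about could look roughly like the sketch
below (purely illustrative - nothing like it is in the patch; it only
tests that the Xen-provided pagetable frames end at or beyond what the
kernel assumes is mapped):

	WARN_ON(PFN_DOWN(__pa(xen_start_info->pt_base)) +
		xen_start_info->nr_pt_frames < max_pfn_mapped);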

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-17 17:41   ` [Xen-devel] " Stefano Stabellini
@ 2012-08-17 17:45     ` Konrad Rzeszutek Wilk
  2012-08-20 11:45       ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-17 17:45 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: linux-kernel, xen-devel

On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > B/c we do not need it. During startup Xen provides
> > us with all the memory mappings that we need to function.
> 
> Shouldn't we check to make sure that is actually true (I am thinking of
> nr_pt_frames)?

I was looking at the source code (hypervisor) to figure it out and
that is certainly true.


> Or is it actually stated somewhere in the Xen headers?

Couldn't find it, but after looking so long at the source code
I didn't even bother looking for it.

Though to be honest - I only looked at how the 64-bit pagetables
were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef

> 
> 
> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   11 +++++------
> >  1 files changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index 7247e5a..a59070b 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -84,6 +84,7 @@
> >   */
> >  DEFINE_SPINLOCK(xen_reservation_lock);
> >  
> > +#ifdef CONFIG_X86_32
> >  /*
> >   * Identity map, in addition to plain kernel map.  This needs to be
> >   * large enough to allocate page table pages to allocate the rest.
> > @@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
> >   */
> >  #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
> >  static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
> > -
> > +#endif
> >  #ifdef CONFIG_X86_64
> >  /* l3 pud for userspace vsyscall mapping */
> >  static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
> > @@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
> >  	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
> >  		BUG();
> >  }
> > -
> > +#ifdef CONFIG_X86_32
> >  static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
> >  {
> >  	unsigned pmdidx, pteidx;
> > @@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
> >  
> >  	set_page_prot(pmd, PAGE_KERNEL_RO);
> >  }
> > -
> > +#endif
> >  void __init xen_setup_machphys_mapping(void)
> >  {
> >  	struct xen_machphys_mapping mapping;
> > @@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	/* Note that we don't do anything with level1_fixmap_pgt which
> >  	 * we don't need. */
> >  
> > -	/* Set up identity map */
> > -	xen_map_identity_early(level2_ident_pgt, max_pfn);
> > -
> >  	/* Make pagetable pieces RO */
> >  	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
> >  	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
> >  	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
> >  	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
> > +	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
> >  	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
> >  	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
> >  
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-08-17 18:07   ` [Xen-devel] " Stefano Stabellini
@ 2012-08-17 18:05     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-17 18:05 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: linux-kernel, xen-devel

On Fri, Aug 17, 2012 at 07:07:28PM +0100, Stefano Stabellini wrote:
> On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > As we are not using them - we end up only using the L1 pagetables
> > and grafting those onto our page-tables.
> > 
> > [v1: Per Stefano's suggestion squashed two commits]
> > [v2: Per Stefano's suggestion simplified loop]
> > [v3: Fix smatch warnings]
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |   40 +++++++++++++++++++++++++++++++++-------
> >  1 files changed, 33 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index a59070b..bd92c82 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1708,7 +1708,20 @@ static void convert_pfn_mfn(void *v)
> >  	for (i = 0; i < PTRS_PER_PTE; i++)
> >  		pte[i] = xen_make_pte(pte[i].pte);
> >  }
> > -
> > +static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
> > +				 unsigned long addr)
> > +{
> > +	if (*pt_base == PFN_DOWN(__pa(addr))) {
> > +		set_page_prot((void *)addr, PAGE_KERNEL);
> > +		clear_page((void *)addr);
> > +		(*pt_base)++;
> > +	}
> > +	if (*pt_end == PFN_DOWN(__pa(addr))) {
> > +		set_page_prot((void *)addr, PAGE_KERNEL);
> > +		clear_page((void *)addr);
> > +		(*pt_end)--;
> > +	}
> > +}
> >  /*
> >   * Set up the initial kernel pagetable.
> >   *
> > @@ -1724,6 +1737,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  {
> >  	pud_t *l3;
> >  	pmd_t *l2;
> > +	unsigned long addr[3];
> > +	unsigned long pt_base, pt_end;
> > +	unsigned i;
> >  
> >  	/* max_pfn_mapped is the last pfn mapped in the initial memory
> >  	 * mappings. Considering that on Xen after the kernel mappings we
> > @@ -1731,6 +1747,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	 * set max_pfn_mapped to the last real pfn mapped. */
> >  	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
> >  
> > +	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
> > +	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
> 

or just do:

	pt_end = pt_base + xen_start_info->nr_pt_frames;

> code style
> 
> >  	/* Zap identity mapping */
> >  	init_level4_pgt[0] = __pgd(0);
> >  
> > @@ -1749,6 +1768,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
> >  	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
> >  
> > +	addr[0] = (unsigned long)pgd;
> > +	addr[1] = (unsigned long)l3;
> > +	addr[2] = (unsigned long)l2;
> >  	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
> >  	 * Both L4[272][0] and L4[511][511] have entries that point to the same
> >  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
> > @@ -1782,20 +1804,24 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
> >  	/* Unpin Xen-provided one */
> >  	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
> >  
> > -	/* Switch over */
> > -	pgd = init_level4_pgt;
> > -
> >  	/*
> >  	 * At this stage there can be no user pgd, and no page
> >  	 * structure to attach it to, so make sure we just set kernel
> >  	 * pgd.
> >  	 */
> >  	xen_mc_batch();
> > -	__xen_write_cr3(true, __pa(pgd));
> > +	__xen_write_cr3(true, __pa(init_level4_pgt));
> >  	xen_mc_issue(PARAVIRT_LAZY_CPU);
> >  
> > -	memblock_reserve(__pa(xen_start_info->pt_base),
> > -			 xen_start_info->nr_pt_frames * PAGE_SIZE);
> > +	/* We can't easily rip out the L3 and L2, as the Xen pagetables are
> > +	 * laid out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
> > +	 * the initial domain. For guests using the toolstack, they are in
> > +	 * [L4], [L3], [L2], [L1], [L1], ... order. */
> > +	for (i = 0; i < ARRAY_SIZE(addr); i++)
> > +		check_pt_base(&pt_base, &pt_end, addr[i]);
> 
> It is much clearer now, but if the comment is correct, doesn't it mean
> that we are going to be able to free pgd, l3 and l2 only in the non-dom0
> case?

And in the dom0 case, only the PGD.

> If so it might be worth saying it explicitly.

OK.
> 
> Other than that, it is fine by me.
> 
> 
> > +	/* Our Xen pagetable, now up to three pages smaller, that we are using */
> > +	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
> >  }
> >  #else	/* !CONFIG_X86_64 */
> >  static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
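Stefano's simplification, spelled out (pt_base and pt_end are already
frame numbers, so the end of the range needs no __pa()/PFN_DOWN
round-trip):

	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
	pt_end  = pt_base + xen_start_info->nr_pt_frames;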

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  2012-08-16 16:03 ` [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
@ 2012-08-17 18:07   ` Stefano Stabellini
  2012-08-17 18:05     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-17 18:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> As we are not using them - we end up only using the L1 pagetables
> and grafting those onto our page-tables.
> 
> [v1: Per Stefano's suggestion squashed two commits]
> [v2: Per Stefano's suggestion simplified loop]
> [v3: Fix smatch warnings]
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |   40 +++++++++++++++++++++++++++++++++-------
>  1 files changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index a59070b..bd92c82 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1708,7 +1708,20 @@ static void convert_pfn_mfn(void *v)
>  	for (i = 0; i < PTRS_PER_PTE; i++)
>  		pte[i] = xen_make_pte(pte[i].pte);
>  }
> -
> +static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
> +				 unsigned long addr)
> +{
> +	if (*pt_base == PFN_DOWN(__pa(addr))) {
> +		set_page_prot((void *)addr, PAGE_KERNEL);
> +		clear_page((void *)addr);
> +		(*pt_base)++;
> +	}
> +	if (*pt_end == PFN_DOWN(__pa(addr))) {
> +		set_page_prot((void *)addr, PAGE_KERNEL);
> +		clear_page((void *)addr);
> +		(*pt_end)--;
> +	}
> +}
>  /*
>   * Set up the initial kernel pagetable.
>   *
> @@ -1724,6 +1737,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  {
>  	pud_t *l3;
>  	pmd_t *l2;
> +	unsigned long addr[3];
> +	unsigned long pt_base, pt_end;
> +	unsigned i;
>  
>  	/* max_pfn_mapped is the last pfn mapped in the initial memory
>  	 * mappings. Considering that on Xen after the kernel mappings we
> @@ -1731,6 +1747,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	 * set max_pfn_mapped to the last real pfn mapped. */
>  	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
>  
> +	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
> +	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));

code style

>  	/* Zap identity mapping */
>  	init_level4_pgt[0] = __pgd(0);
>  
> @@ -1749,6 +1768,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
>  	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
>  
> +	addr[0] = (unsigned long)pgd;
> +	addr[1] = (unsigned long)l3;
> +	addr[2] = (unsigned long)l2;
>  	/* Graft it onto L4[272][0]. Note that we are creating an aliasing problem:
>  	 * Both L4[272][0] and L4[511][511] have entries that point to the same
>  	 * L2 (PMD) tables. Meaning that if you modify it in __va space
> @@ -1782,20 +1804,24 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	/* Unpin Xen-provided one */
>  	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
>  
> -	/* Switch over */
> -	pgd = init_level4_pgt;
> -
>  	/*
>  	 * At this stage there can be no user pgd, and no page
>  	 * structure to attach it to, so make sure we just set kernel
>  	 * pgd.
>  	 */
>  	xen_mc_batch();
> -	__xen_write_cr3(true, __pa(pgd));
> +	__xen_write_cr3(true, __pa(init_level4_pgt));
>  	xen_mc_issue(PARAVIRT_LAZY_CPU);
>  
> -	memblock_reserve(__pa(xen_start_info->pt_base),
> -			 xen_start_info->nr_pt_frames * PAGE_SIZE);
> +	/* We can't easily rip out the L3 and L2, as the Xen pagetables are
> +	 * laid out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
> +	 * the initial domain. For guests using the toolstack, they are in
> +	 * [L4], [L3], [L2], [L1], [L1], ... order. */
> +	for (i = 0; i < ARRAY_SIZE(addr); i++)
> +		check_pt_base(&pt_base, &pt_end, addr[i]);

It is much clearer now, but if the comment is correct, doesn't it mean
that we are going to be able to free pgd, l3 and l2 only in the non-dom0
case?
If so it might be worth saying it explicitly.

Other than that, it is fine by me.


> +	/* Our Xen pagetable, now up to three pages smaller, that we are using */
> +	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
>  }
>  #else	/* !CONFIG_X86_64 */
>  static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 
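The dom0/domU asymmetry raised above, schematically (frame order as
stated in the patch comment; which frames peel off is an inference from
that comment, not from the patch itself):

	/* domU:  pt_base -> [L4][L3][L2][L1][L1]...
	 *         pgd, l3 and l2 sit at the front, so check_pt_base()
	 *         advances pt_base three times and all three frames
	 *         are recycled.
	 * dom0:   pt_base -> [L4][L1][L2][L3][L1][L1]...
	 *         only the [L4] at pt_base touches a boundary; l2 and
	 *         l3 sit in the middle of the range and cannot shrink
	 *         it, so just the one frame is recycled. */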

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-17 17:45     ` Konrad Rzeszutek Wilk
@ 2012-08-20 11:45       ` Stefano Stabellini
  2012-08-20 11:53         ` Ian Campbell
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-20 11:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, linux-kernel, xen-devel, Jan Beulich

On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > B/c we do not need it. During startup Xen provides
> > > us with all the memory mappings that we need to function.
> > 
> > Shouldn't we check to make sure that is actually true (I am thinking of
> > nr_pt_frames)?
> 
> I was looking at the source code (hypervisor) to figure it out and
> that is certainly true.
> 
> 
> > Or is it actually stated somewhere in the Xen headers?
> 
> Couldn't find it, but after looking so long at the source code
> I didn't even bother looking for it.
> 
> Though to be honest - I only looked at how the 64-bit pagetables
> were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef

I think that we need to involve some Xen maintainers and get this
written down somewhere in the public headers, otherwise we have no
guarantees that it is going to stay as it is in the next Xen versions.

Maybe we just need to add a couple of lines of comment to
xen/include/public/xen.h.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-20 11:45       ` Stefano Stabellini
@ 2012-08-20 11:53         ` Ian Campbell
  2012-08-20 11:58           ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Ian Campbell @ 2012-08-20 11:53 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Mon, 2012-08-20 at 12:45 +0100, Stefano Stabellini wrote:
> On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > B/c we do not need it. During startup Xen provides
> > > > us with all the memory mappings that we need to function.
> > > 
> > > Shouldn't we check to make sure that is actually true (I am thinking of
> > > nr_pt_frames)?
> > 
> > I was looking at the source code (hypervisor) to figure it out and
> > that is certainly true.
> > 
> > 
> > > Or is it actually stated somewhere in the Xen headers?
> > 
> > Couldn't find it, but after looking so long at the source code
> > I didn't even bother looking for it.
> > 
> > Though to be honest - I only looked at how the 64-bit pagetables
> > were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef
> 
> I think that we need to involve some Xen maintainers and get this
> written down somewhere in the public headers, otherwise we have no
> guarantees that it is going to stay as it is in the next Xen versions.
> 
> Maybe we just need to add a couple of lines of comment to
> xen/include/public/xen.h.

The start of day memory layout for PV guests is written down in the
comment just before struct start_info at
http://xenbits.xen.org/docs/unstable/hypercall/include,public,xen.h.html#Struct_start_info

(I haven't read this thread to determine if what is documented matches
what you guys are talking about relying on)

Ian.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-20 11:53         ` Ian Campbell
@ 2012-08-20 11:58           ` Stefano Stabellini
  2012-08-20 12:06             ` Konrad Rzeszutek Wilk
  2012-08-23 15:40             ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-20 11:58 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Mon, 20 Aug 2012, Ian Campbell wrote:
> On Mon, 2012-08-20 at 12:45 +0100, Stefano Stabellini wrote:
> > On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > B/c we do not need it. During startup Xen provides
> > > > > us with all the memory mappings that we need to function.
> > > > 
> > > > Shouldn't we check to make sure that is actually true (I am thinking of
> > > > nr_pt_frames)?
> > > 
> > > I was looking at the source code (hypervisor) to figure it out and
> > > that is certainly true.
> > > 
> > > 
> > > > Or is it actually stated somewhere in the Xen headers?
> > > 
> > > Couldn't find it, but after looking so long at the source code
> > > I didn't even bother looking for it.
> > > 
> > > Though to be honest - I only looked at how the 64-bit pagetables
> > > were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef
> > 
> > I think that we need to involve some Xen maintainers and get this
> > written down somewhere in the public headers, otherwise we have no
> > guarantees that it is going to stay as it is in the next Xen versions.
> > 
> > Maybe we just need to add a couple of lines of comment to
> > xen/include/public/xen.h.
> 
> The start of day memory layout for PV guests is written down in the
> comment just before struct start_info at
> http://xenbits.xen.org/docs/unstable/hypercall/include,public,xen.h.html#Struct_start_info
> 
> (I haven't read this thread to determine if what is documented matches
> what you guys are talking about relying on)

but it is not written down how much physical memory is going to be
mapped in the bootstrap page tables.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-20 11:58           ` Stefano Stabellini
@ 2012-08-20 12:06             ` Konrad Rzeszutek Wilk
  2012-08-20 12:19               ` Stefano Stabellini
  2012-08-23 15:40             ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-20 12:06 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Ian Campbell, linux-kernel, xen-devel

On Mon, Aug 20, 2012 at 12:58:37PM +0100, Stefano Stabellini wrote:
> On Mon, 20 Aug 2012, Ian Campbell wrote:
> > On Mon, 2012-08-20 at 12:45 +0100, Stefano Stabellini wrote:
> > > On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > > > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > > B/c we do not need it. During startup Xen provides
> > > > > > us with all the memory mappings that we need to function.
> > > > > 
> > > > > Shouldn't we check to make sure that is actually true (I am thinking of
> > > > > nr_pt_frames)?
> > > > 
> > > > I was looking at the source code (hypervisor) to figure it out and
> > > > that is certainly true.
> > > > 
> > > > 
> > > > > Or is it actually stated somewhere in the Xen headers?
> > > > 
> > > > Couldn't find it, but after looking so long at the source code
> > > > I didn't even bother looking for it.
> > > > 
> > > > Though to be honest - I only looked at how the 64-bit pagetables
> > > > were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef
> > > 
> > > I think that we need to involve some Xen maintainers and get this
> > > written down somewhere in the public headers, otherwise we have no
> > > guarantees that it is going to stay as it is in the next Xen versions.
> > > 
> > > Maybe we just need to add a couple of lines of comment to
> > > xen/include/public/xen.h.
> > 
> > The start of day memory layout for PV guests is written down in the
> > comment just before struct start_info at
> > http://xenbits.xen.org/docs/unstable/hypercall/include,public,xen.h.html#Struct_start_info
> > 
> > (I haven't read this thread to determine if what is documented matches
> > what you guys are talking about relying on)
> 
> but it is not written down how much physical memory is going to be
> mapped in the bootstrap page tables.

Considering that only the pvops kernel has this change, I think we are OK?
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-20 12:06             ` Konrad Rzeszutek Wilk
@ 2012-08-20 12:19               ` Stefano Stabellini
  0 siblings, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-20 12:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Ian Campbell, linux-kernel, xen-devel

On Mon, 20 Aug 2012, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 20, 2012 at 12:58:37PM +0100, Stefano Stabellini wrote:
> > On Mon, 20 Aug 2012, Ian Campbell wrote:
> > > On Mon, 2012-08-20 at 12:45 +0100, Stefano Stabellini wrote:
> > > > On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > > > > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > > > B/c we do not need it. During startup Xen provides
> > > > > > > us with all the memory mappings that we need to function.
> > > > > > 
> > > > > > Shouldn't we check to make sure that is actually true (I am thinking of
> > > > > > nr_pt_frames)?
> > > > > 
> > > > > I was looking at the source code (hypervisor) to figure it out and
> > > > > that is certainly true.
> > > > > 
> > > > > 
> > > > > > Or is it actually stated somewhere in the Xen headers?
> > > > > 
> > > > > Couldn't find it, but after looking so long at the source code
> > > > > I didn't even bother looking for it.
> > > > > 
> > > > > Though to be honest - I only looked at how the 64-bit pagetables
> > > > > were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef
> > > > 
> > > > I think that we need to involve some Xen maintainers and get this
> > > > written down somewhere in the public headers, otherwise we have no
> > > > guarantees that it is going to stay as it is in the next Xen versions.
> > > > 
> > > > Maybe we just need to add a couple of lines of comment to
> > > > xen/include/public/xen.h.
> > > 
> > > The start of day memory layout for PV guests is written down in the
> > > comment just before struct start_info at
> > > http://xenbits.xen.org/docs/unstable/hypercall/include,public,xen.h.html#Struct_start_info
> > > 
> > > (I haven't read this thread to determine if what is documented matches
> > > what you guys are talking about relying on)
> > 
> > but it is not written down how much physical memory is going to be
> > mapped in the bootstrap page tables.
> 
> Considering that only the pvops kernel has this change, I think we are OK?

We are OK if we write it down :)
Otherwise it might change in the future and we won't even know what the
correct behavior is supposed to be.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-17 17:35   ` [Xen-devel] " Stefano Stabellini
@ 2012-08-20 14:13     ` Konrad Rzeszutek Wilk
  2012-08-21 17:27       ` Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-20 14:13 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: linux-kernel, xen-devel

On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > instead of a big memblock_reserve. This way we can be more
> > selective in freeing regions (and it also makes it easier
> > to understand what is where).
> > 
> > [v1: Move the auto_translate_physmap to proper line]
> > [v2: Per Stefano suggestion add more comments]
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> much better now!

Though interestingly enough it breaks 32-bit dom0s (and only dom0s).
Will have a revised patch posted shortly.

> 
> >  arch/x86/xen/enlighten.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++
> >  arch/x86/xen/p2m.c       |    5 ++++
> >  arch/x86/xen/setup.c     |    9 --------
> >  3 files changed, 53 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> > index ff962d4..e532eb5 100644
> > --- a/arch/x86/xen/enlighten.c
> > +++ b/arch/x86/xen/enlighten.c
> > @@ -998,7 +998,54 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
> >  
> >  	return ret;
> >  }
> > +/*
> > + * If the MFN is not in the m2p (provided to us by the hypervisor) this
> > + * function won't do anything. In practice this means that the XenBus
> > + * MFN won't be available for the initial domain. */
> > +static void __init xen_reserve_mfn(unsigned long mfn)
> > +{
> > +	unsigned long pfn;
> > +
> > +	if (!mfn)
> > +		return;
> > +	pfn = mfn_to_pfn(mfn);
> > +	if (phys_to_machine_mapping_valid(pfn))
> > +		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
> > +}
> > +static void __init xen_reserve_internals(void)
> > +{
> > +	unsigned long size;
> > +
> > +	if (!xen_pv_domain())
> > +		return;
> > +
> > +	/* xen_start_info does not exist in the M2P, hence can't use
> > +	 * xen_reserve_mfn. */
> > +	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
> > +
> > +	xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info));
> > +	xen_reserve_mfn(xen_start_info->store_mfn);
> >  
> > +	if (!xen_initial_domain())
> > +		xen_reserve_mfn(xen_start_info->console.domU.mfn);
> > +
> > +	if (xen_feature(XENFEAT_auto_translated_physmap))
> > +		return;
> > +
> > +	/*
> > +	 * ALIGN up to compensate for the p2m_page pointing to an array that
> > +	 * can be partially filled (look in xen_build_dynamic_phys_to_machine).
> > +	 */
> > +
> > +	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
> > +
> > +	/* We could use xen_reserve_mfn here, but would end up looping quite
> > +	 * a lot (and calling memblock_reserve for each PAGE), so let's just use
> > +	 * the easy way and reserve it wholesale. */
> > +	memblock_reserve(__pa(xen_start_info->mfn_list), size);
> > +
> > +	/* The pagetables are reserved in mmu.c */
> > +}
> >  void xen_setup_shared_info(void)
> >  {
> >  	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> > @@ -1362,6 +1409,7 @@ asmlinkage void __init xen_start_kernel(void)
> >  	xen_raw_console_write("mapping kernel into physical memory\n");
> >  	pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
> >  
> > +	xen_reserve_internals();
> >  	/* Allocate and initialize top and mid mfn levels for p2m structure */
> >  	xen_build_mfn_list_list();
> >  
> > diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
> > index e4adbfb..6a2bfa4 100644
> > --- a/arch/x86/xen/p2m.c
> > +++ b/arch/x86/xen/p2m.c
> > @@ -388,6 +388,11 @@ void __init xen_build_dynamic_phys_to_machine(void)
> >  	}
> >  
> >  	m2p_override_init();
> > +
> > +	/* NOTE: We cannot call memblock_reserve here for the mfn_list as there
> > +	 * aren't enough pieces to make it work (for one - we are still using the
> > +	 * Xen provided pagetable). Do it later in xen_reserve_internals.
> > +	 */
> >  }
> >  
> >  unsigned long get_phys_to_machine(unsigned long pfn)
> > diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> > index a4790bf..9efca75 100644
> > --- a/arch/x86/xen/setup.c
> > +++ b/arch/x86/xen/setup.c
> > @@ -424,15 +424,6 @@ char * __init xen_memory_setup(void)
> >  	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
> >  			E820_RESERVED);
> >  
> > -	/*
> > -	 * Reserve Xen bits:
> > -	 *  - mfn_list
> > -	 *  - xen_start_info
> > -	 * See comment above "struct start_info" in <xen/interface/xen.h>
> > -	 */
> > -	memblock_reserve(__pa(xen_start_info->mfn_list),
> > -			 xen_start_info->pt_base - xen_start_info->mfn_list);
> > -
> >  	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
> >  
> >  	return "Xen";
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-08-16 16:03 ` [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables Konrad Rzeszutek Wilk
@ 2012-08-21 14:18   ` Stefano Stabellini
  2012-08-21 14:57     ` Konrad Rzeszutek Wilk
  2012-09-17 18:06   ` William Dauchy
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-21 14:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> We call memblock_free for [start of mfn list] -> [PMD aligned end
> of mfn list] instead of [start of mfn list] -> [page aligned end of mfn list].
> 
> This has the disastrous effect that if at bootup the end of mfn_list is
> not PMD aligned we end up returning to memblock parts of the region
> past the mfn_list array. And those parts are the PTE tables, with
> the result that we see this at bootup:

This patch looks wrong to me.

Aren't you changing the way mfn_list is reserved using memblock in patch
#3? Moreover it really seems to me that you are PAGE_ALIGN'ing size
rather than PMD_ALIGN'ing it there.


> Write protecting the kernel read-only data: 10240k
> Freeing unused kernel memory: 1860k freed
> Freeing unused kernel memory: 200k freed
> (XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
> ...
> (XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
> (XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
> .. and so on.
>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/mmu.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 5a880b8..6019c22 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1227,7 +1227,6 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
>  			/* We should be in __ka space. */
>  			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
>  			addr = xen_start_info->mfn_list;
> -			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
>  			/* We roundup to the PMD, which means that if anybody at this stage is
>  			 * using the __ka address of xen_start_info or xen_start_info->shared_info
>  			 * they are going to crash. Fortunately we have already revectored
> @@ -1235,6 +1234,7 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
>  			size = roundup(size, PMD_SIZE);
>  			xen_cleanhighmap(addr, addr + size);
>  
> +			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
>  			memblock_free(__pa(xen_start_info->mfn_list), size);
>  			/* And revector! Bye bye old array */
>  			xen_start_info->mfn_list = new_mfn_list;
> -- 
> 1.7.7.6
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-08-21 14:18   ` [Xen-devel] " Stefano Stabellini
@ 2012-08-21 14:57     ` Konrad Rzeszutek Wilk
  2012-08-21 15:27       ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-21 14:57 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: linux-kernel, xen-devel

On Tue, Aug 21, 2012 at 03:18:26PM +0100, Stefano Stabellini wrote:
> On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > We call memblock_free for [start of mfn list] -> [PMD aligned end
> > of mfn list] instead of [start of mfn list] -> [page aligned end of mfn list].
> > 
> > This has the disastrous effect that if at bootup the end of mfn_list is
> > not PMD aligned we end up returning to memblock parts of the region
> > past the mfn_list array. And those parts are the PTE tables, with
> > the result that we see this at bootup:
> 
> This patch looks wrong to me.

It's easier to see if you apply the patch to the code. The 'size =' part
was actually also done earlier.
> 
> Aren't you changing the way mfn_list is reserved using memblock in patch
> #3? Moreover it really seems to me that you are PAGE_ALIGN'ing size
> rather than PMD_ALIGN'ing it there.

Correct. That is the proper way of doing it. We want to PMD_ALIGN for xen_cleanhighmap
to remove the pesky virtual addresses, but we want PAGE_ALIGN (so exactly the
same way memblock_reserve was called) for memblock_free.
> 
> 
> > Write protecting the kernel read-only data: 10240k
> > Freeing unused kernel memory: 1860k freed
> > Freeing unused kernel memory: 200k freed
> > (XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
> > ...
> > (XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
> > (XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
> > .. and so on.
> >
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> >  arch/x86/xen/mmu.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index 5a880b8..6019c22 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1227,7 +1227,6 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
> >  			/* We should be in __ka space. */
> >  			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
> >  			addr = xen_start_info->mfn_list;
> > -			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
> >  			/* We roundup to the PMD, which means that if anybody at this stage is
> >  			 * using the __ka address of xen_start_info or xen_start_info->shared_info
> >  			 * they are going to crash. Fortunately we have already revectored
> > @@ -1235,6 +1234,7 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
> >  			size = roundup(size, PMD_SIZE);
> >  			xen_cleanhighmap(addr, addr + size);
> >  
> > +			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
> >  			memblock_free(__pa(xen_start_info->mfn_list), size);
> >  			/* And revector! Bye bye old array */
> >  			xen_start_info->mfn_list = new_mfn_list;
> > -- 
> > 1.7.7.6
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
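The two alignments side by side, as they end up in the patched function
('addr' and 'size' as defined there):

	/* PMD-rounded range: xen_cleanhighmap() only operates on whole
	 * 2 MB PMD entries of the __ka alias. */
	size = roundup(size, PMD_SIZE);
	xen_cleanhighmap(addr, addr + size);

	/* Page-rounded size: mirrors the PAGE_ALIGN'd memblock_reserve
	 * done at boot, so memblock_free() never returns pagetable
	 * pages it was not given. */
	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
	memblock_free(__pa(xen_start_info->mfn_list), size);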

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-08-21 14:57     ` Konrad Rzeszutek Wilk
@ 2012-08-21 15:27       ` Stefano Stabellini
  0 siblings, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-21 15:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Stefano Stabellini, linux-kernel, xen-devel

On Tue, 21 Aug 2012, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 21, 2012 at 03:18:26PM +0100, Stefano Stabellini wrote:
> > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > We call memblock_free for [start of mfn list] -> [PMD aligned end
> > > of mfn list] instead of [start of mfn list] -> [page aligned end of mfn list].
> > > 
> > > This has the disastrous effect that if at bootup the end of mfn_list is
> > > not PMD aligned we end up returning to memblock parts of the region
> > > past the mfn_list array. And those parts are the PTE tables, with
> > > the result that we see this at bootup:
> > 
> > This patch looks wrong to me.
> 
> It's easier to see if you apply the patch to the code. The 'size =' part
> was actually also done earlier.

Yes, you are right, I see that now.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-20 14:13     ` Konrad Rzeszutek Wilk
@ 2012-08-21 17:27       ` Konrad Rzeszutek Wilk
  2012-08-21 19:03         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-21 17:27 UTC (permalink / raw)
  To: Stefano Stabellini, JBeulich; +Cc: linux-kernel, xen-devel

On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > instead of a big memblock_reserve. This way we can be more
> > > selective in freeing regions (and it also makes it easier
> > > to understand what is where).
> > > 
> > > [v1: Move the auto_translate_physmap to proper line]
> > > [v2: Per Stefano suggestion add more comments]
> > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > 
> > much better now!
> 
> > Though interestingly enough it breaks 32-bit dom0s (and only dom0s).
> Will have a revised patch posted shortly.

Jan, I noticed something odd. Part of this code replaces this:

	memblock_reserve(__pa(xen_start_info->mfn_list),
		xen_start_info->pt_base - xen_start_info->mfn_list);

with a more region-by-region approach. What I found out is that if I boot this
as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
actually wrong.

Specifically this is what bootup says:

(good working case - 32-bit hypervisor with 32-bit dom0):
(XEN)  Loaded kernel: c1000000->c1a23000
(XEN)  Init. ramdisk: c1a23000->cf730e00
(XEN)  Phys-Mach map: cf731000->cf831000
(XEN)  Start info:    cf831000->cf83147c
(XEN)  Page tables:   cf832000->cf8b5000
(XEN)  Boot stack:    cf8b5000->cf8b6000
(XEN)  TOTAL:         c0000000->cfc00000

[    0.000000] PT: cf832000 (f832000)
[    0.000000] Reserving PT: f832000->f8b5000

And with a 64-bit hypervisor:

(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
(XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
(XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
(XEN)  Start info:    00000000cf831000->00000000cf8314b4
(XEN)  Page tables:   00000000cf832000->00000000cf8b6000
(XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
(XEN)  TOTAL:         00000000c0000000->00000000cfc00000
(XEN)  ENTRY ADDRESS: 00000000c16bb22c

[    0.000000] PT: cf834000 (f834000)
[    0.000000] Reserving PT: f834000->f8b8000

So the pt_base is offset by two pages. And looking at c/s 13257
it's not clear to me why this two-page offset was added?

The toolstack works fine - launching 32-bit guests either
under a 32-bit or a 64-bit hypervisor works:
] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 0xcf885000  (pfn 0xf805 + 0x80 pages)
[    0.000000] PT: cf805000 (f805000)
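The two-page offset falls straight out of the logs above (values copied
from the 64-bit-hypervisor boot; the variable names are just for the
arithmetic):

	unsigned long xen_pt_start  = 0xcf832000; /* Xen "Page tables:" line */
	unsigned long reported_base = 0xcf834000; /* xen_start_info->pt_base */
	unsigned long gap = reported_base - xen_pt_start; /* 0x2000 */
	/* i.e. exactly 2 * PAGE_SIZE - the two pages the follow-up
	 * patch has to account for. */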


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-21 17:27       ` Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
@ 2012-08-21 19:03         ` Konrad Rzeszutek Wilk
  2012-08-22 10:48           ` Stefano Stabellini
                             ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-21 19:03 UTC (permalink / raw)
  To: Stefano Stabellini, JBeulich; +Cc: linux-kernel, xen-devel

On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > instead of a big memblock_reserve. This way we can be more
> > > > selective in freeing regions (and it also makes it easier
> > > > to understand what is where).
> > > > 
> > > > [v1: Move the auto_translate_physmap to proper line]
> > > > [v2: Per Stefano suggestion add more comments]
> > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > 
> > > much better now!
> > 
> > Though interestingly enough it breaks 32-bit dom0s (and only dom0s).
> > Will have a revised patch posted shortly.
> 
> Jan, I noticed something odd. Part of this code replaces this:
> 
> 	memblock_reserve(__pa(xen_start_info->mfn_list),
> 		xen_start_info->pt_base - xen_start_info->mfn_list);
> 
> with a more region-by-region approach. What I found out is that if I boot this
> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
> actually wrong.
> 
> Specifically this is what bootup says:
> 
> (good working case - 32-bit hypervisor with 32-bit dom0):
> (XEN)  Loaded kernel: c1000000->c1a23000
> (XEN)  Init. ramdisk: c1a23000->cf730e00
> (XEN)  Phys-Mach map: cf731000->cf831000
> (XEN)  Start info:    cf831000->cf83147c
> (XEN)  Page tables:   cf832000->cf8b5000
> (XEN)  Boot stack:    cf8b5000->cf8b6000
> (XEN)  TOTAL:         c0000000->cfc00000
> 
> [    0.000000] PT: cf832000 (f832000)
> [    0.000000] Reserving PT: f832000->f8b5000
> 
> And with a 64-bit hypervisor:
> 
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
> 
> [    0.000000] PT: cf834000 (f834000)
> [    0.000000] Reserving PT: f834000->f8b8000
> 
> So the pt_base is offset by two pages. And looking at c/s 13257
> it's not clear to me why this two-page offset was added?
> 
> The toolstack works fine - launching 32-bit guests either
> under a 32-bit or a 64-bit hypervisor works:
> ] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 0xcf885000  (pfn 0xf805 + 0x80 pages)
> [    0.000000] PT: cf805000 (f805000)
> 

And this patch on top of the others fixes this..


>From 806c312e50f122c47913145cf884f53dd09d9199 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 21 Aug 2012 14:31:24 -0400
Subject: [PATCH] xen/x86: Workaround 64-bit hypervisor and 32-bit initial
 domain.

If a 64-bit hypervisor is booted with a 32-bit initial domain,
the hypervisor deals with the initial domain as "compat" and
does some extra adjustments (like pagetable entries are 4 bytes instead
of 8). It also adjusts the xen_start_info->pt_base incorrectly.

When booted with a 32-bit hypervisor (32-bit initial domain):
..
(XEN)  Start info:    cf831000->cf83147c
(XEN)  Page tables:   cf832000->cf8b5000
..
[    0.000000] PT: cf832000 (f832000)
[    0.000000] Reserving PT: f832000->f8b5000

And with a 64-bit hypervisor:
(XEN)  Start info:    00000000cf831000->00000000cf8314b4
(XEN)  Page tables:   00000000cf832000->00000000cf8b6000

[    0.000000] PT: cf834000 (f834000)
[    0.000000] Reserving PT: f834000->f8b8000

To deal with this, we keep track of the highest physical
address we have reserved via memblock_reserve. If that address
does not overlap with pt_base, we have a gap which we reserve.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c |   30 +++++++++++++++++++++---------
 1 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index e532eb5..511f92d 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1002,19 +1002,24 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
  * If the MFN is not in the m2p (provided to us by the hypervisor) this
  * function won't do anything. In practice this means that the XenBus
  * MFN won't be available for the initial domain. */
-static void __init xen_reserve_mfn(unsigned long mfn)
+static unsigned long __init xen_reserve_mfn(unsigned long mfn)
 {
-	unsigned long pfn;
+	unsigned long pfn, end_pfn = 0;
 
 	if (!mfn)
-		return;
+		return end_pfn;
+
 	pfn = mfn_to_pfn(mfn);
-	if (phys_to_machine_mapping_valid(pfn))
-		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
+	if (phys_to_machine_mapping_valid(pfn)) {
+		end_pfn = PFN_PHYS(pfn) + PAGE_SIZE;
+		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
+	}
+	return end_pfn;
 }
 static void __init xen_reserve_internals(void)
 {
 	unsigned long size;
+	unsigned long last_phys = 0;
 
 	if (!xen_pv_domain())
 		return;
@@ -1022,12 +1027,13 @@ static void __init xen_reserve_internals(void)
 	/* xen_start_info does not exist in the M2P, hence can't use
 	 * xen_reserve_mfn. */
 	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
+	last_phys = __pa(xen_start_info) + PAGE_SIZE;
 
-	xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info));
-	xen_reserve_mfn(xen_start_info->store_mfn);
+	last_phys = max(xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info)), last_phys);
+	last_phys = max(xen_reserve_mfn(xen_start_info->store_mfn), last_phys);
 
 	if (!xen_initial_domain())
-		xen_reserve_mfn(xen_start_info->console.domU.mfn);
+		last_phys = max(xen_reserve_mfn(xen_start_info->console.domU.mfn), last_phys);
 
 	if (xen_feature(XENFEAT_auto_translated_physmap))
 		return;
@@ -1043,8 +1049,14 @@ static void __init xen_reserve_internals(void)
 	 * a lot (and call memblock_reserve for each PAGE), so lets just use
 	 * the easy way and reserve it wholesale. */
 	memblock_reserve(__pa(xen_start_info->mfn_list), size);
-
+	last_phys = max(__pa(xen_start_info->mfn_list) + size, last_phys);
 	/* The pagetables are reserved in mmu.c */
+
+	/* Under a 64-bit hypervisor with a 32-bit domain, the hypervisor
+	 * offsets the pt_base by two pages. Hence the reservation that is done
+	 * in mmu.c misses two pages. We correct it here if we detect this. */
+	if (last_phys < __pa(xen_start_info->pt_base))
+		memblock_reserve(last_phys, __pa(xen_start_info->pt_base) - last_phys);
 }
 void xen_setup_shared_info(void)
 {
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-21 19:03         ` Konrad Rzeszutek Wilk
@ 2012-08-22 10:48           ` Stefano Stabellini
  2012-08-22 14:00             ` Konrad Rzeszutek Wilk
  2012-08-22 14:12           ` Jan Beulich
  2012-08-22 15:59           ` Jan Beulich
  2 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-22 10:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, JBeulich, linux-kernel, xen-devel

On Tue, 21 Aug 2012, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> > > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > instead of a big memblock_reserve. This way we can be more
> > > > > selective in freeing regions (and it also makes it easier
> > > > > to understand what is where).
> > > > > 
> > > > > [v1: Move the auto_translate_physmap to proper line]
> > > > > [v2: Per Stefano suggestion add more comments]
> > > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > 
> > > > much better now!
> > > 
> > > Though, interestingly enough, it breaks 32-bit dom0s (and only dom0s).
> > > Will have a revised patch posted shortly.
> > 
> > Jan, I noticed something odd. Part of this code replaces this:
> > 
> > 	memblock_reserve(__pa(xen_start_info->mfn_list),
> > 		xen_start_info->pt_base - xen_start_info->mfn_list);
> > 
> > with a more region-by-region approach. What I found out is that if I boot this
> > as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
> > actually wrong.
> > 
> > Specifically this is what bootup says:
> > 
> > (good working case - 32bit hypervisor with 32-bit dom0):
> > (XEN)  Loaded kernel: c1000000->c1a23000
> > (XEN)  Init. ramdisk: c1a23000->cf730e00
> > (XEN)  Phys-Mach map: cf731000->cf831000
> > (XEN)  Start info:    cf831000->cf83147c
> > (XEN)  Page tables:   cf832000->cf8b5000
> > (XEN)  Boot stack:    cf8b5000->cf8b6000
> > (XEN)  TOTAL:         c0000000->cfc00000
> > 
> > [    0.000000] PT: cf832000 (f832000)
> > [    0.000000] Reserving PT: f832000->f8b5000
> > 
> > And with a 64-bit hypervisor:
> > 
> > (XEN) VIRTUAL MEMORY ARRANGEMENT:
> > (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
> > (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
> > (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
> > (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> > (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> > (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
> > (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
> > (XEN)  ENTRY ADDRESS: 00000000c16bb22c
> > 
> > [    0.000000] PT: cf834000 (f834000)
> > [    0.000000] Reserving PT: f834000->f8b8000
> > 
> > So the pt_base is offset by two pages. And looking at c/s 13257,
> > it's not clear to me why this two-page offset was added?
> > 
> > The toolstack works fine - so launching 32-bit guests under
> > either a 32-bit or a 64-bit hypervisor works fine:
> > ] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 0xcf885000  (pfn 0xf805 + 0x80 pages)
> > [    0.000000] PT: cf805000 (f805000)
> > 
> 
> And this patch on top of the others fixes this..
> 
> 
> From 806c312e50f122c47913145cf884f53dd09d9199 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Tue, 21 Aug 2012 14:31:24 -0400
> Subject: [PATCH] xen/x86: Workaround 64-bit hypervisor and 32-bit initial
>  domain.
> 
> If a 64-bit hypervisor is booted with a 32-bit initial domain,
> the hypervisor deals with the initial domain as "compat" and
> does some extra adjustments (like pagetable entries are 4 bytes instead
> of 8). It also adjusts the xen_start_info->pt_base incorrectly.
> 
> When booted with a 32-bit hypervisor (32-bit initial domain):
> ..
> (XEN)  Start info:    cf831000->cf83147c
> (XEN)  Page tables:   cf832000->cf8b5000
> ..
> [    0.000000] PT: cf832000 (f832000)
> [    0.000000] Reserving PT: f832000->f8b5000
> 
> And with a 64-bit hypervisor:
> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> 
> [    0.000000] PT: cf834000 (f834000)
> [    0.000000] Reserving PT: f834000->f8b8000
> 
> To deal with this, we keep track of the highest physical
> address we have reserved via memblock_reserve. If that address
> does not overlap with pt_base, we have a gap which we reserve.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/enlighten.c |   30 +++++++++++++++++++++---------
>  1 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index e532eb5..511f92d 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1002,19 +1002,24 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
>   * If the MFN is not in the m2p (provided to us by the hypervisor) this
>   * function won't do anything. In practice this means that the XenBus
>   * MFN won't be available for the initial domain. */
> -static void __init xen_reserve_mfn(unsigned long mfn)
> +static unsigned long __init xen_reserve_mfn(unsigned long mfn)
>  {
> -	unsigned long pfn;
> +	unsigned long pfn, end_pfn = 0;
>  
>  	if (!mfn)
> -		return;
> +		return end_pfn;
> +
>  	pfn = mfn_to_pfn(mfn);
> -	if (phys_to_machine_mapping_valid(pfn))
> -		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
> +	if (phys_to_machine_mapping_valid(pfn)) {
> +		end_pfn = PFN_PHYS(pfn) + PAGE_SIZE;
> +		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
> +	}
> +	return end_pfn;
>  }
>  static void __init xen_reserve_internals(void)
>  {
>  	unsigned long size;
> +	unsigned long last_phys = 0;
>  
>  	if (!xen_pv_domain())
>  		return;
> @@ -1022,12 +1027,13 @@ static void __init xen_reserve_internals(void)
>  	/* xen_start_info does not exist in the M2P, hence can't use
>  	 * xen_reserve_mfn. */
>  	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
> +	last_phys = __pa(xen_start_info) + PAGE_SIZE;
>  
> -	xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info));
> -	xen_reserve_mfn(xen_start_info->store_mfn);
> +	last_phys = max(xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info)), last_phys);
> +	last_phys = max(xen_reserve_mfn(xen_start_info->store_mfn), last_phys);
>  
>  	if (!xen_initial_domain())
> -		xen_reserve_mfn(xen_start_info->console.domU.mfn);
> +		last_phys = max(xen_reserve_mfn(xen_start_info->console.domU.mfn), last_phys);
>  
>  	if (xen_feature(XENFEAT_auto_translated_physmap))
>  		return;
> @@ -1043,8 +1049,14 @@ static void __init xen_reserve_internals(void)
>  	 * a lot (and call memblock_reserve for each PAGE), so lets just use
>  	 * the easy way and reserve it wholesale. */
>  	memblock_reserve(__pa(xen_start_info->mfn_list), size);
> -
> +	last_phys = max(__pa(xen_start_info->mfn_list) + size, last_phys);
>  	/* The pagetables are reserved in mmu.c */
> +
> +	/* Under a 64-bit hypervisor with a 32-bit domain, the hypervisor
> +	 * offsets the pt_base by two pages. Hence the reservation that is done
> +	 * in mmu.c misses two pages. We correct it here if we detect this. */
> +	if (last_phys < __pa(xen_start_info->pt_base))
> +		memblock_reserve(last_phys, __pa(xen_start_info->pt_base) - last_phys);
>  }

What are these two pages used for? They are not documented in xen.h; why
should we reserve them?

After all we still have:

memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);

that should protect what we are interested in anyway...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 10:48           ` Stefano Stabellini
@ 2012-08-22 14:00             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-22 14:00 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: JBeulich, linux-kernel, xen-devel

> > +	/* Under a 64-bit hypervisor with a 32-bit domain, the hypervisor
> > +	 * offsets the pt_base by two pages. Hence the reservation that is done
> > +	 * in mmu.c misses two pages. We correct it here if we detect this. */
> > +	if (last_phys < __pa(xen_start_info->pt_base))
> > +		memblock_reserve(last_phys, __pa(xen_start_info->pt_base) - last_phys);
> >  }
> 
> What are these two pages used for? They are not documented in xen.h; why
> should we reserve them?
> 
> After all we still have:
> 
> memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
> 
> that should protect what we are interested in anyway...

You are looking at the x86_64 piece of code. This issue only appears
on 32-bit, which was not modified by my patches and has:

2003         memblock_reserve(__pa(xen_start_info->pt_base),
2004                          xen_start_info->nr_pt_frames * PAGE_SIZE);

and as I found out, the pt_base is wrong. The cr3 we load and use is actually
two pages back!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-21 19:03         ` Konrad Rzeszutek Wilk
  2012-08-22 10:48           ` Stefano Stabellini
@ 2012-08-22 14:12           ` Jan Beulich
  2012-08-22 14:41             ` Stefano Stabellini
  2012-08-22 14:57             ` Konrad Rzeszutek Wilk
  2012-08-22 15:59           ` Jan Beulich
  2 siblings, 2 replies; 43+ messages in thread
From: Jan Beulich @ 2012-08-22 14:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Stefano Stabellini, xen-devel, linux-kernel

>>> On 21.08.12 at 21:03, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
>> Jan, I noticed something odd. Part of this code replaces this:
>> 
>> 	memblock_reserve(__pa(xen_start_info->mfn_list),
>> 		xen_start_info->pt_base - xen_start_info->mfn_list);
>> 
>> with a more region-by-region approach. What I found out is that if I boot this
>> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
>> actually wrong.
>> 
>> Specifically this is what bootup says:
>> 
>> (good working case - 32bit hypervisor with 32-bit dom0):
>> (XEN)  Loaded kernel: c1000000->c1a23000
>> (XEN)  Init. ramdisk: c1a23000->cf730e00
>> (XEN)  Phys-Mach map: cf731000->cf831000
>> (XEN)  Start info:    cf831000->cf83147c
>> (XEN)  Page tables:   cf832000->cf8b5000
>> (XEN)  Boot stack:    cf8b5000->cf8b6000
>> (XEN)  TOTAL:         c0000000->cfc00000
>> 
>> [    0.000000] PT: cf832000 (f832000)
>> [    0.000000] Reserving PT: f832000->f8b5000
>> 
>> And with a 64-bit hypervisor:
>> 
>> (XEN) VIRTUAL MEMORY ARRANGEMENT:
>> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
>> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
>> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
>> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
>> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
>> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
>> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
>> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
>> 
>> [    0.000000] PT: cf834000 (f834000)
>> [    0.000000] Reserving PT: f834000->f8b8000
>> 
>> So the pt_base is offset by two pages. And looking at c/s 13257,
>> it's not clear to me why this two-page offset was added?

Honestly, without looking through this in greater detail I don't
recall. That'll have to wait possibly until after the summit, though.
I can't exclude that this is just a forgotten leftover from an earlier
version of the patch. I would have thought this was to account
for the L4 tables that the guest doesn't see, but
- this should only be a single page
- this should then also (or rather instead) be subtracted from
  nr_pt_frames
so that's likely not it.

>> The toolstack works fine - so launching 32-bit guests under
>> either a 32-bit or a 64-bit hypervisor works fine:
>> ] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 
> 0xcf885000  (pfn 0xf805 + 0x80 pages)
>> [    0.000000] PT: cf805000 (f805000)
>> 
> 
> And this patch on top of the others fixes this..

I didn't look at this in too close detail, but I started to get
afraid that you might be making the code dependent on
many hypervisor implementation details. And should the
above turn out to be a bug in the hypervisor, I hope your
kernel side changes won't make it impossible to fix that bug.

Jan


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 14:12           ` Jan Beulich
@ 2012-08-22 14:41             ` Stefano Stabellini
  2012-08-22 14:57             ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-22 14:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Konrad Rzeszutek Wilk, Stefano Stabellini, xen-devel, linux-kernel

On Wed, 22 Aug 2012, Jan Beulich wrote:
> >> The toolstack works fine - so launching 32-bit guests under
> >> either a 32-bit or a 64-bit hypervisor works fine:
> >> ] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 
> > 0xcf885000  (pfn 0xf805 + 0x80 pages)
> >> [    0.000000] PT: cf805000 (f805000)
> >> 
> > 
> > And this patch on top of the others fixes this..
> 
> I didn't look at this in too close detail, but I started to get
> afraid that you might be making the code dependent on
> many hypervisor implementation details. And should the
> above turn out to be a bug in the hypervisor, I hope your
> kernel side changes won't make it impossible to fix that bug.

I agree

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 14:12           ` Jan Beulich
  2012-08-22 14:41             ` Stefano Stabellini
@ 2012-08-22 14:57             ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-22 14:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Stefano Stabellini, xen-devel, linux-kernel

On Wed, Aug 22, 2012 at 03:12:46PM +0100, Jan Beulich wrote:
> >>> On 21.08.12 at 21:03, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
> >> Jan, I noticed something odd. Part of this code replaces this:
> >> 
> >> 	memblock_reserve(__pa(xen_start_info->mfn_list),
> >> 		xen_start_info->pt_base - xen_start_info->mfn_list);
> >> 
> >> with a more region-by-region approach. What I found out is that if I boot this
> >> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
> >> actually wrong.
> >> 
> >> Specifically this is what bootup says:
> >> 
> >> (good working case - 32bit hypervisor with 32-bit dom0):
> >> (XEN)  Loaded kernel: c1000000->c1a23000
> >> (XEN)  Init. ramdisk: c1a23000->cf730e00
> >> (XEN)  Phys-Mach map: cf731000->cf831000
> >> (XEN)  Start info:    cf831000->cf83147c
> >> (XEN)  Page tables:   cf832000->cf8b5000
> >> (XEN)  Boot stack:    cf8b5000->cf8b6000
> >> (XEN)  TOTAL:         c0000000->cfc00000
> >> 
> >> [    0.000000] PT: cf832000 (f832000)
> >> [    0.000000] Reserving PT: f832000->f8b5000
> >> 
> >> And with a 64-bit hypervisor:
> >> 
> >> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> >> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
> >> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
> >> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
> >> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> >> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> >> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
> >> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
> >> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
> >> 
> >> [    0.000000] PT: cf834000 (f834000)
> >> [    0.000000] Reserving PT: f834000->f8b8000
> >> 
> >> So the pt_base is offset by two pages. And looking at c/s 13257,
> >> it's not clear to me why this two-page offset was added?
> 
> Honestly, without looking through this in greater detail I don't
> recall. That'll have to wait possibly until after the summit, though.

I figured it was baked into the API, so it was not really worth pursuing
a fix; I'll just leave it as is.

> I can't exclude that this is just a forgotten leftover from an earlier
> version of the patch. I would have thought this was to account
> for the L4 tables that the guest doesn't see, but
> - this should only be a single page
> - this should then also (or rather instead) be subtracted from
>   nr_pt_frames
> so that's likely not it.
> 
> >> The toolstack works fine - so launching 32-bit guests under
> >> either a 32-bit or a 64-bit hypervisor works fine:
> >> ] domainbuilder: detail: xc_dom_alloc_segment:   page tables  : 0xcf805000 -> 
> > 0xcf885000  (pfn 0xf805 + 0x80 pages)
> >> [    0.000000] PT: cf805000 (f805000)
> >> 
> > 
> > And this patch on top of the others fixes this..
> 
> I didn't look at this in too close detail, but I started to get
> afraid that you might be making the code dependent on
> many hypervisor implementation details. And should the
> above turn out to be a bug in the hypervisor, I hope your
> kernel side changes won't make it impossible to fix that bug.

Actually they will work OK. I've tested it with and without the
hypervisor bug-fix and it worked nicely.

But this "make the memblock_reserve easier to see" effort is getting
out of hand :-(

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-21 19:03         ` Konrad Rzeszutek Wilk
  2012-08-22 10:48           ` Stefano Stabellini
  2012-08-22 14:12           ` Jan Beulich
@ 2012-08-22 15:59           ` Jan Beulich
  2012-08-22 16:21             ` Konrad Rzeszutek Wilk
  2012-08-22 18:55             ` [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
  2 siblings, 2 replies; 43+ messages in thread
From: Jan Beulich @ 2012-08-22 15:59 UTC (permalink / raw)
  To: Stefano Stabellini, Konrad Rzeszutek Wilk; +Cc: xen-devel, linux-kernel

>>> On 21.08.12 at 21:03, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
>> On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
>> > On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
>> > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
>> > > > instead of a big memblock_reserve. This way we can be more
>> > > > selective in freeing regions (and it also makes it easier
>> > > > to understand what is where).
>> > > > 
>> > > > [v1: Move the auto_translate_physmap to proper line]
>> > > > [v2: Per Stefano suggestion add more comments]
>> > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > > 
>> > > much better now!
>> > 
>> > Though, interestingly enough, it breaks 32-bit dom0s (and only dom0s).
>> > Will have a revised patch posted shortly.
>> 
>> Jan, I noticed something odd. Part of this code replaces this:
>> 
>> 	memblock_reserve(__pa(xen_start_info->mfn_list),
>> 		xen_start_info->pt_base - xen_start_info->mfn_list);
>> 
>> with a more region-by-region approach. What I found out is that if I boot this
>> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
>> actually wrong.
>> 
>> Specifically this is what bootup says:
>> 
>> (good working case - 32bit hypervisor with 32-bit dom0):
>> (XEN)  Loaded kernel: c1000000->c1a23000
>> (XEN)  Init. ramdisk: c1a23000->cf730e00
>> (XEN)  Phys-Mach map: cf731000->cf831000
>> (XEN)  Start info:    cf831000->cf83147c
>> (XEN)  Page tables:   cf832000->cf8b5000
>> (XEN)  Boot stack:    cf8b5000->cf8b6000
>> (XEN)  TOTAL:         c0000000->cfc00000
>> 
>> [    0.000000] PT: cf832000 (f832000)
>> [    0.000000] Reserving PT: f832000->f8b5000
>> 
>> And with a 64-bit hypervisor:
>> 
>> (XEN) VIRTUAL MEMORY ARRANGEMENT:
>> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
>> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
>> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
>> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
>> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
>> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
>> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
>> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
>> 
>> [    0.000000] PT: cf834000 (f834000)
>> [    0.000000] Reserving PT: f834000->f8b8000
>> 
>> So the pt_base is offset by two pages. And looking at c/s 13257,
>> it's not clear to me why this two-page offset was added?

Actually, the adjustment turns out to be correct: The page
tables for a 32-on-64 dom0 get allocated in the order "first L1",
"first L2", "first L3", so the offset to the page table base is
indeed 2. When reading xen/include/public/xen.h's comment
very strictly, this is not a violation (since nothing is said there
that the first thing in the page table space is pointed to by
pt_base; I admit that this seems to be implied though, namely
I do think that it is implied that the page table space is the
range [pt_base, pt_base + nr_pt_frames), whereas that
range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
which - without a priori knowledge - the kernel would have
difficulty figuring out).
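
A minimal userspace sketch of that arithmetic (not from any patch in
this thread; the constants come from the 64-bit hypervisor boot log
quoted above, and is_32on64 is an illustrative flag, not a real API):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000ULL

int main(void)
{
	/* From the boot log: the kernel sees PT: cf834000 and reserves
	 * 0x84 frames starting there (f834000->f8b8000). */
	uint64_t pt_base      = 0xcf834000;  /* start_info->pt_base */
	uint64_t nr_pt_frames = 0x84;        /* (0xf8b8000 - 0xf834000) / 0x1000 */
	int      is_32on64    = 1;           /* 32-bit dom0 on a 64-bit hypervisor */

	/* First L1, first L2, then the L3 that pt_base/CR3 names, so the
	 * allocation really starts two pages below pt_base in compat mode. */
	uint64_t start = pt_base - (is_32on64 ? 2 : 0) * PAGE_SIZE;
	uint64_t end   = start + nr_pt_frames * PAGE_SIZE;

	/* Prints cf832000 -> cf8b6000, matching the hypervisor's
	 * "Page tables: cf832000->cf8b6000" line above. */
	printf("page tables: %llx -> %llx\n",
	       (unsigned long long)start, (unsigned long long)end);
	return 0;
}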

Below is a debugging patch I used to see the full picture, if you
want to double check.

One thing I also noticed is that nr_pt_frames apparently is
one too high in this case, as the L4 is not really part of the
page tables from the kernel's perspective (and not represented
anywhere in the corresponding VA range).

Jan

--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -940,6 +940,7 @@ int __init construct_dom0(
     si->flags       |= (xen_processor_pmbits << 8) & SIF_PM_MASK;
     si->pt_base      = vpt_start + 2 * PAGE_SIZE * !!is_pv_32on64_domain(d);
     si->nr_pt_frames = nr_pt_pages;
+printk("PT#%lx\n", si->nr_pt_frames);//temp
     si->mfn_list     = vphysmap_start;
     snprintf(si->magic, sizeof(si->magic), "xen-3.0-x86_%d%s",
              elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
@@ -1115,6 +1116,10 @@ int __init construct_dom0(
                 process_pending_softirqs();
         }
     }
+show_page_walk(vpt_start);//temp
+show_page_walk(si->pt_base);//temp
+show_page_walk(v_start);//temp
+show_page_walk(v_end - 1);//temp
 
     if ( initrd_len != 0 )
     {


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Q:pt_base in COMPAT mode offset by two pages. Was:Re: [Xen-devel] [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 15:59           ` Jan Beulich
@ 2012-08-22 16:21             ` Konrad Rzeszutek Wilk
  2012-08-22 18:55             ` [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-22 16:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Stefano Stabellini, xen-devel, linux-kernel

On Wed, Aug 22, 2012 at 04:59:11PM +0100, Jan Beulich wrote:
> >>> On 21.08.12 at 21:03, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
> >> On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
> >> > On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> >> > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> >> > > > instead of a big memblock_reserve. This way we can be more
> >> > > > selective in freeing regions (and it also makes it easier
> >> > > > to understand what is where).
> >> > > > 
> >> > > > [v1: Move the auto_translate_physmap to proper line]
> >> > > > [v2: Per Stefano suggestion add more comments]
> >> > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> > > 
> >> > > much better now!
> >> > 
> >> > Though, interestingly enough, it breaks 32-bit dom0s (and only dom0s).
> >> > Will have a revised patch posted shortly.
> >> 
> >> Jan, I noticed something odd. Part of this code replaces this:
> >> 
> >> 	memblock_reserve(__pa(xen_start_info->mfn_list),
> >> 		xen_start_info->pt_base - xen_start_info->mfn_list);
> >> 
> >> with a more region-by-region approach. What I found out is that if I boot this
> >> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
> >> actually wrong.
> >> 
> >> Specifically this is what bootup says:
> >> 
> >> (good working case - 32bit hypervisor with 32-bit dom0):
> >> (XEN)  Loaded kernel: c1000000->c1a23000
> >> (XEN)  Init. ramdisk: c1a23000->cf730e00
> >> (XEN)  Phys-Mach map: cf731000->cf831000
> >> (XEN)  Start info:    cf831000->cf83147c
> >> (XEN)  Page tables:   cf832000->cf8b5000
> >> (XEN)  Boot stack:    cf8b5000->cf8b6000
> >> (XEN)  TOTAL:         c0000000->cfc00000
> >> 
> >> [    0.000000] PT: cf832000 (f832000)
> >> [    0.000000] Reserving PT: f832000->f8b5000
> >> 
> >> And with a 64-bit hypervisor:
> >> 
> >> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> >> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
> >> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
> >> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
> >> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> >> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> >> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
> >> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
> >> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
> >> 
> >> [    0.000000] PT: cf834000 (f834000)
> >> [    0.000000] Reserving PT: f834000->f8b8000
> >> 
> >> So the pt_base is offset by two pages. And looking at c/s 13257,
> >> it's not clear to me why this two-page offset was added?
> 
> Actually, the adjustment turns out to be correct: The page
> tables for a 32-on-64 dom0 get allocated in the order "first L1",
> "first L2", "first L3", so the offset to the page table base is
> indeed 2. When reading xen/include/public/xen.h's comment
> very strictly, this is not a violation (since nothing is said there
> that the first thing in the page table space is pointed to by
> pt_base; I admit that this seems to be implied though, namely
> I do think that it is implied that the page table space is the
> range [pt_base, pt_base + nr_pt_frames), whereas that
> range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
> which - without a priori knowledge - the kernel would have
> difficulty figuring out).

And only in compat mode. <sigh> Well, I am happy that we have found
this so we can document it more thoroughly, but I think I will
step away from those memblock patches for a while, as the earlier

 "let's just reserve everything from mfn_list up to the pt_base"

and then in the mmu:
 "reserve everything from pt_base up to nr_pt_frames*PAGE_SIZE"

works.

And document a bit more in the Linux kernel why we want to
do that.
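
Concretely, that fallback is just the two coarse reservations below
(both quoted elsewhere in this thread). The first one is what makes the
compat quirk harmless: the two hidden page-table pages lie between
mfn_list and the reported pt_base, so they get swallowed by it.

	/* Everything from the P2M list up to pt_base; this covers the
	 * start info page and, for 32-on-64, the two page-table pages
	 * that sit below the reported pt_base. */
	memblock_reserve(__pa(xen_start_info->mfn_list),
			 xen_start_info->pt_base - xen_start_info->mfn_list);

	/* And separately, in mmu.c, the page-table frames themselves: */
	memblock_reserve(__pa(xen_start_info->pt_base),
			 xen_start_info->nr_pt_frames * PAGE_SIZE);
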
> 
> Below is a debugging patch I used to see the full picture, if you
> want to double check.

I trust you; the production of said pages in the L1, L2, L3 order
is closely related to how the 64-bit case does it, which is L4, L1, L2, L3,
and then the L1's follow.

The toolstack does it in L4, L3, L2, L1 order.
> 
> One thing I also noticed is that nr_pt_frames apparently is
> one too high in this case, as the L4 is not really part of the
> page tables from the kernel's perspective (and not represented
> anywhere in the corresponding VA range).
> 
> Jan
> 
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -940,6 +940,7 @@ int __init construct_dom0(
>      si->flags       |= (xen_processor_pmbits << 8) & SIF_PM_MASK;
>      si->pt_base      = vpt_start + 2 * PAGE_SIZE * !!is_pv_32on64_domain(d);
>      si->nr_pt_frames = nr_pt_pages;
> +printk("PT#%lx\n", si->nr_pt_frames);//temp
>      si->mfn_list     = vphysmap_start;
>      snprintf(si->magic, sizeof(si->magic), "xen-3.0-x86_%d%s",
>               elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
> @@ -1115,6 +1116,10 @@ int __init construct_dom0(
>                  process_pending_softirqs();
>          }
>      }
> +show_page_walk(vpt_start);//temp
> +show_page_walk(si->pt_base);//temp
> +show_page_walk(v_start);//temp
> +show_page_walk(v_end - 1);//temp
>  
>      if ( initrd_len != 0 )
>      {

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 15:59           ` Jan Beulich
  2012-08-22 16:21             ` Konrad Rzeszutek Wilk
@ 2012-08-22 18:55             ` Konrad Rzeszutek Wilk
  2012-08-23  6:23               ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-22 18:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Konrad Rzeszutek Wilk, xen-devel, linux-kernel

On Wed, Aug 22, 2012 at 04:59:11PM +0100, Jan Beulich wrote:
> >>> On 21.08.12 at 21:03, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Tue, Aug 21, 2012 at 01:27:32PM -0400, Konrad Rzeszutek Wilk wrote:
> >> On Mon, Aug 20, 2012 at 10:13:05AM -0400, Konrad Rzeszutek Wilk wrote:
> >> > On Fri, Aug 17, 2012 at 06:35:12PM +0100, Stefano Stabellini wrote:
> >> > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> >> > > > instead of a big memblock_reserve. This way we can be more
> >> > > > selective in freeing regions (and it also makes it easier
> >> > > > to understand what is where).
> >> > > > 
> >> > > > [v1: Move the auto_translate_physmap to proper line]
> >> > > > [v2: Per Stefano suggestion add more comments]
> >> > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> > > 
> >> > > much better now!
> >> > 
> >> > Though, interestingly enough, it breaks 32-bit dom0s (and only dom0s).
> >> > Will have a revised patch posted shortly.
> >> 
> >> Jan, I noticed something odd. Part of this code replaces this:
> >> 
> >> 	memblock_reserve(__pa(xen_start_info->mfn_list),
> >> 		xen_start_info->pt_base - xen_start_info->mfn_list);
> >> 
> >> with a more region-by-region approach. What I found out is that if I boot this
> >> as a 32-bit guest with a 64-bit hypervisor, the xen_start_info->pt_base is
> >> actually wrong.
> >> 
> >> Specifically this is what bootup says:
> >> 
> >> (good working case - 32bit hypervisor with 32-bit dom0):
> >> (XEN)  Loaded kernel: c1000000->c1a23000
> >> (XEN)  Init. ramdisk: c1a23000->cf730e00
> >> (XEN)  Phys-Mach map: cf731000->cf831000
> >> (XEN)  Start info:    cf831000->cf83147c
> >> (XEN)  Page tables:   cf832000->cf8b5000
> >> (XEN)  Boot stack:    cf8b5000->cf8b6000
> >> (XEN)  TOTAL:         c0000000->cfc00000
> >> 
> >> [    0.000000] PT: cf832000 (f832000)
> >> [    0.000000] Reserving PT: f832000->f8b5000
> >> 
> >> And with a 64-bit hypervisor:
> >> 
> >> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> >> (XEN)  Loaded kernel: 00000000c1000000->00000000c1a23000
> >> (XEN)  Init. ramdisk: 00000000c1a23000->00000000cf730e00
> >> (XEN)  Phys-Mach map: 00000000cf731000->00000000cf831000
> >> (XEN)  Start info:    00000000cf831000->00000000cf8314b4
> >> (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
> >> (XEN)  Boot stack:    00000000cf8b6000->00000000cf8b7000
> >> (XEN)  TOTAL:         00000000c0000000->00000000cfc00000
> >> (XEN)  ENTRY ADDRESS: 00000000c16bb22c
> >> 
> >> [    0.000000] PT: cf834000 (f834000)
> >> [    0.000000] Reserving PT: f834000->f8b8000
> >> 
> >> So the pt_base is offset by two pages. And looking at c/s 13257,
> >> it's not clear to me why this two-page offset was added?
> 
> Actually, the adjustment turns out to be correct: The page
> tables for a 32-on-64 dom0 get allocated in the order "first L1",
> "first L2", "first L3", so the offset to the page table base is
> indeed 2. When reading xen/include/public/xen.h's comment
> very strictly, this is not a violation (since nothing is said there
> that the first thing in the page table space is pointed to by
> pt_base; I admit that this seems to be implied though, namely
> I do think that it is implied that the page table space is the
> range [pt_base, pt_base + nr_pt_frames), whereas that
> range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
> which - without a priori knowledge - the kernel would have
> difficulty figuring out).
> 
> Below is a debugging patch I used to see the full picture, if you
> want to double check.

Thinking of just sticking this patch in, then:

From 9aa58784b163ee435ff5596cf3ec059b57ab85e1 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 22 Aug 2012 13:00:10 -0400
Subject: [PATCH] Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit
 initial domain." and "xen/x86: Use memblock_reserve for
 sensitive areas."

This reverts commit 806c312e50f122c47913145cf884f53dd09d9199 and
commit 59b294403e9814e7c1154043567f0d71bac7a511.

It also documents in setup.c why we want to do it that way, which
is that we tried to make the memblock_reserve more selective so
that it would be clear what region is reserved. Sadly we ran
into the problem wherein on a 64-bit hypervisor with a 32-bit
initial domain, the pt_base has the cr3 value, which is not
necessarily where the pagetable starts! As Jan put it: "
Actually, the adjustment turns out to be correct: The page
tables for a 32-on-64 dom0 get allocated in the order "first L1",
"first L2", "first L3", so the offset to the page table base is
indeed 2. When reading xen/include/public/xen.h's comment
very strictly, this is not a violation (since nothing is said there
that the first thing in the page table space is pointed to by
pt_base; I admit that this seems to be implied though, namely
I do think that it is implied that the page table space is the
range [pt_base, pt_base + nr_pt_frames), whereas that
range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
which - without a priori knowledge - the kernel would have
difficulty figuring out)." - so let's just fall back to the
easy way and reserve the whole region.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c |   60 ----------------------------------------------
 arch/x86/xen/p2m.c       |    5 ----
 arch/x86/xen/setup.c     |   27 ++++++++++++++++++++
 3 files changed, 27 insertions(+), 65 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index c1e940d..d87a038 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -998,66 +998,7 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
 
 	return ret;
 }
-/*
- * If the MFN is not in the m2p (provided to us by the hypervisor) this
- * function won't do anything. In practice this means that the XenBus
- * MFN won't be available for the initial domain. */
-static unsigned long __init xen_reserve_mfn(unsigned long mfn)
-{
-	unsigned long pfn, end_pfn = 0;
-
-	if (!mfn)
-		return end_pfn;
-
-	pfn = mfn_to_pfn(mfn);
-	if (phys_to_machine_mapping_valid(pfn)) {
-		end_pfn = PFN_PHYS(pfn) + PAGE_SIZE;
-		memblock_reserve(PFN_PHYS(pfn), PAGE_SIZE);
-	}
-	return end_pfn;
-}
-static void __init xen_reserve_internals(void)
-{
-	unsigned long size;
-	unsigned long last_phys = 0;
-
-	if (!xen_pv_domain())
-		return;
-
-	/* xen_start_info does not exist in the M2P, hence can't use
-	 * xen_reserve_mfn. */
-	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
-	last_phys = __pa(xen_start_info) + PAGE_SIZE;
-
-	last_phys = max(xen_reserve_mfn(PFN_DOWN(xen_start_info->shared_info)), last_phys);
-	last_phys = max(xen_reserve_mfn(xen_start_info->store_mfn), last_phys);
 
-	if (!xen_initial_domain())
-		last_phys = max(xen_reserve_mfn(xen_start_info->console.domU.mfn), last_phys);
-
-	if (xen_feature(XENFEAT_auto_translated_physmap))
-		return;
-
-	/*
-	 * ALIGN up to compensate for the p2m_page pointing to an array that
-	 * can partially filled (look in xen_build_dynamic_phys_to_machine).
-	 */
-
-	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
-
-	/* We could use xen_reserve_mfn here, but would end up looping quite
-	 * a lot (and call memblock_reserve for each PAGE), so lets just use
-	 * the easy way and reserve it wholesale. */
-	memblock_reserve(__pa(xen_start_info->mfn_list), size);
-	last_phys = max(__pa(xen_start_info->mfn_list) + size, last_phys);
-	/* The pagetables are reserved in mmu.c */
-
-	/* Under a 64-bit hypervisor with a 32-bit domain, the hypervisor
-	 * offsets the pt_base by two pages. Hence the reservation that is done
-	 * in mmu.c misses two pages. We correct it here if we detect this. */
-	if (last_phys < __pa(xen_start_info->pt_base))
-		memblock_reserve(last_phys, __pa(xen_start_info->pt_base) - last_phys);
-}
 void xen_setup_shared_info(void)
 {
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
@@ -1418,7 +1359,6 @@ asmlinkage void __init xen_start_kernel(void)
 	xen_raw_console_write("mapping kernel into physical memory\n");
 	xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base, xen_start_info->nr_pages);
 
-	xen_reserve_internals();
 	/* Allocate and initialize top and mid mfn levels for p2m structure */
 	xen_build_mfn_list_list();
 
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 95c3ea2..c3e9291 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -388,11 +388,6 @@ void __init xen_build_dynamic_phys_to_machine(void)
 	}
 
 	m2p_override_init();
-
-	/* NOTE: We cannot call memblock_reserve here for the mfn_list as there
-	 * isn't enough pieces to make it work (for one - we are still using the
-	 * Xen provided pagetable). Do it later in xen_reserve_internals.
-	 */
 }
 #ifdef CONFIG_X86_64
 #include <linux/bootmem.h>
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 9efca75..740517b 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -424,6 +424,33 @@ char * __init xen_memory_setup(void)
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
 
+	/*
+	 * Reserve Xen bits:
+	 *  - mfn_list
+	 *  - xen_start_info
+	 * See comment above "struct start_info" in <xen/interface/xen.h>
+	 * We tried to make the memblock_reserve more selective so
+	 * that it would be clear what region is reserved. Sadly we ran
+	 * into the problem wherein on a 64-bit hypervisor with a 32-bit
+	 * initial domain, the pt_base has the cr3 value, which is not
+	 * necessarily where the pagetable starts! As Jan put it: "
+	 * Actually, the adjustment turns out to be correct: The page
+	 * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+	 * "first L2", "first L3", so the offset to the page table base is
+	 * indeed 2. When reading xen/include/public/xen.h's comment
+	 * very strictly, this is not a violation (since nothing is said there
+	 * that the first thing in the page table space is pointed to by
+	 * pt_base; I admit that this seems to be implied though, namely
+	 * I do think that it is implied that the page table space is the
+	 * range [pt_base, pt_base + nr_pt_frames), whereas that
+	 * range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
+	 * which - without a priori knowledge - the kernel would have
+	 * difficulty figuring out)." - so let's just fall back to the
+	 * easy way and reserve the whole region.
+	 */
+	memblock_reserve(__pa(xen_start_info->mfn_list),
+			 xen_start_info->pt_base - xen_start_info->mfn_list);
+
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
 	return "Xen";
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas.
  2012-08-22 18:55             ` [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
@ 2012-08-23  6:23               ` Jan Beulich
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2012-08-23  6:23 UTC (permalink / raw)
  To: konrad; +Cc: stefano.stabellini, xen-devel, konrad.wilk, linux-kernel

>>> Konrad Rzeszutek Wilk <konrad@kernel.org> 08/22/12 9:05 PM >>>
>Thinking of just sticking this patch in, then:

Yeah, that may be best for the moment. Although I see no reason why you
shouldn't be able to use your more selective logic: make it either
deal only with the pt_base == start-of-page-tables case (falling back to
the current logic otherwise), or figure out the true range.
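
The latter could look roughly like this - a sketch only, where
xen_pv_32on64() is a hypothetical predicate for "32-bit PV domain on a
64-bit hypervisor" (no such helper exists), and the two-page constant
assumes the allocation order described earlier in this thread:

	/* Sketch: xen_pv_32on64() is hypothetical, standing in for a
	 * "32-bit PV domain on a 64-bit hypervisor" check. */
	static unsigned long __init xen_true_pt_base(void)
	{
		unsigned long pt_base = (unsigned long)xen_start_info->pt_base;

		/* Compat guests get pt_base pointing at the L3, which is
		 * allocated after the first L1 and the first L2. */
		if (xen_pv_32on64())
			pt_base -= 2 * PAGE_SIZE;

		return pt_base;
	}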

I'm nevertheless considering re-arranging the allocation order in the
hypervisor (and to remove the superfluously reserved page covering
what would be the L4), so going with the former, simpler case for the
kernel would be a reasonable option.

Jan


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-20 11:58           ` Stefano Stabellini
  2012-08-20 12:06             ` Konrad Rzeszutek Wilk
@ 2012-08-23 15:40             ` Konrad Rzeszutek Wilk
  2012-08-23 15:57               ` Stefano Stabellini
  1 sibling, 1 reply; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-08-23 15:40 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Ian Campbell, linux-kernel, xen-devel

On Mon, Aug 20, 2012 at 12:58:37PM +0100, Stefano Stabellini wrote:
> On Mon, 20 Aug 2012, Ian Campbell wrote:
> > On Mon, 2012-08-20 at 12:45 +0100, Stefano Stabellini wrote:
> > > On Fri, 17 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > On Fri, Aug 17, 2012 at 06:41:23PM +0100, Stefano Stabellini wrote:
> > > > > On Thu, 16 Aug 2012, Konrad Rzeszutek Wilk wrote:
> > > > > > B/c we do not need it. During the startup the Xen provides
> > > > > > us with all the memory mapped that we need to function.
> > > > > 
> > > > > Shouldn't we check to make sure that is actually true (I am thinking of
> > > > > nr_pt_frames)?
> > > > 
> > > > I was looking at the source code (hypervisor) to figure it out and
> > > > that is certainly true.
> > > > 
> > > > 
> > > > > Or is it actually stated somewhere in the Xen headers?
> > > > 
> > > > Couldn't find it, but after looking so long at the source code
> > > > I didn't even bother looking for it.
> > > > 
> > > > Though to be honest - I only looked at how the 64-bit pagetables
> > > > were set up, so I didn't dare to touch the 32-bit. Hence the #ifdef
> > > 
> > > I think that we need to involve some Xen maintainers and get this
> > > written down somewhere in the public headers, otherwise we have no
> > > guarantees that it is going to stay as it is in the next Xen versions.
> > > 
> > > Maybe we just need to add a couple of lines of comment to
> > > xen/include/public/xen.h.
> > 
> > The start of day memory layout for PV guests is written down in the
> > comment just before struct start_info at
> > http://xenbits.xen.org/docs/unstable/hypercall/include,public,xen.h.html#Struct_start_info
> > 
> > (I haven't read this thread to determine if what is documented matches
> > what you guys are talking about relying on)
> 
> but it is not written down how much physical memory is going to be
> mapped in the bootstrap page tables.

How about this:


From 43fd7a5d9ecd31c2fc26851523cd4b5f7650fb39 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 12 Jul 2012 13:59:36 -0400
Subject: [PATCH] xen/mmu: For 64-bit do not call xen_map_identity_early

B/c we do not need it. During the startup the Xen provides
us with all the initial memory mapped that we need to function.

The initial memory mapped goes up to the bootstrap stack, which means
we can reference memory using __ka up to item 4.f):

(from xen/interface/xen.h):

 4. This is the order of bootstrap elements in the initial virtual region:
   a. relocated kernel image
   b. initial ram disk              [mod_start, mod_len]
   c. list of allocated page frames [mfn_list, nr_pages]
   d. start_info_t structure        [register ESI (x86)]
   e. bootstrap page tables         [pt_base, CR3 (x86)]
   f. bootstrap stack               [register ESP (x86)]

(initial ram disk may be omitted).

[v1: More comments in git commit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7247e5a..a59070b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -84,6 +84,7 @@
  */
 DEFINE_SPINLOCK(xen_reservation_lock);
 
+#ifdef CONFIG_X86_32
 /*
  * Identity map, in addition to plain kernel map.  This needs to be
  * large enough to allocate page table pages to allocate the rest.
@@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
  */
 #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
 static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
-
+#endif
 #ifdef CONFIG_X86_64
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
@@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
 	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
 		BUG();
 }
-
+#ifdef CONFIG_X86_32
 static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 {
 	unsigned pmdidx, pteidx;
@@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 
 	set_page_prot(pmd, PAGE_KERNEL_RO);
 }
-
+#endif
 void __init xen_setup_machphys_mapping(void)
 {
 	struct xen_machphys_mapping mapping;
@@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
-	/* Set up identity map */
-	xen_map_identity_early(level2_ident_pgt, max_pfn);
-
 	/* Make pagetable pieces RO */
 	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early
  2012-08-23 15:40             ` Konrad Rzeszutek Wilk
@ 2012-08-23 15:57               ` Stefano Stabellini
  0 siblings, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2012-08-23 15:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Ian Campbell, linux-kernel, xen-devel

On Thu, 23 Aug 2012, Konrad Rzeszutek Wilk wrote:
> From 43fd7a5d9ecd31c2fc26851523cd4b5f7650fb39 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Thu, 12 Jul 2012 13:59:36 -0400
> Subject: [PATCH] xen/mmu: For 64-bit do not call xen_map_identity_early
> 
> B/c we do not need it. During the startup the Xen provides
                                            ^ remove the
> us with all the initial memory mapped that we need to function.
> 
> The initial memory mapped goes up to the bootstrap stack, which means
> we can reference memory using __ka up to item 4.f):

Thanks, that is what I was looking for.

> (from xen/interface/xen.h):
> 
>  4. This is the order of bootstrap elements in the initial virtual region:
>    a. relocated kernel image
>    b. initial ram disk              [mod_start, mod_len]
>    c. list of allocated page frames [mfn_list, nr_pages]
>    d. start_info_t structure        [register ESI (x86)]
>    e. bootstrap page tables         [pt_base, CR3 (x86)]
>    f. bootstrap stack               [register ESP (x86)]
> 
> (initial ram disk may be omitted).
> 
> [v1: More comments in git commit]
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>


> ---
>  arch/x86/xen/mmu.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 7247e5a..a59070b 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -84,6 +84,7 @@
>   */
>  DEFINE_SPINLOCK(xen_reservation_lock);
>  
> +#ifdef CONFIG_X86_32
>  /*
>   * Identity map, in addition to plain kernel map.  This needs to be
>   * large enough to allocate page table pages to allocate the rest.
> @@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
>   */
>  #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
>  static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
> -
> +#endif
>  #ifdef CONFIG_X86_64
>  /* l3 pud for userspace vsyscall mapping */
>  static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
> @@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
>  	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
>  		BUG();
>  }
> -
> +#ifdef CONFIG_X86_32
>  static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
>  {
>  	unsigned pmdidx, pteidx;
> @@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
>  
>  	set_page_prot(pmd, PAGE_KERNEL_RO);
>  }
> -
> +#endif
>  void __init xen_setup_machphys_mapping(void)
>  {
>  	struct xen_machphys_mapping mapping;
> @@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>  	/* Note that we don't do anything with level1_fixmap_pgt which
>  	 * we don't need. */
>  
> -	/* Set up identity map */
> -	xen_map_identity_early(level2_ident_pgt, max_pfn);
> -
>  	/* Make pagetable pieces RO */
>  	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
> +	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
>  	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
>  
> -- 
> 1.7.7.6
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-08-16 16:03 ` [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables Konrad Rzeszutek Wilk
  2012-08-21 14:18   ` [Xen-devel] " Stefano Stabellini
@ 2012-09-17 18:06   ` William Dauchy
  2012-09-17 18:18     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 43+ messages in thread
From: William Dauchy @ 2012-09-17 18:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: linux-kernel, xen-devel

Hello Konrad,

On Thu, Aug 16, 2012 at 6:03 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> (XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
> (XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0

This is something I also have on my 3.4.x branch. Since then mmu.c has
changed a lot; is there an easy fix to backport to 3.4.x?

-- 
William

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Xen-devel] [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  2012-09-17 18:06   ` William Dauchy
@ 2012-09-17 18:18     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-09-17 18:18 UTC (permalink / raw)
  To: William Dauchy; +Cc: linux-kernel, xen-devel

On Mon, Sep 17, 2012 at 08:06:26PM +0200, William Dauchy wrote:
> Hello Konrad,
> 
> On Thu, Aug 16, 2012 at 6:03 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > (XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
> > (XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
> 
> This is something I also have on my 3.4.x branch. Since then mmu.c has
> change a lot; is there an easy fix to make a backport for 3.4.x?

Uhh, can you open a new email thread with the error you are seeing, and
please include the full serial log along with the dmidecode output?

> 
> -- 
> William

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-09-17 18:29 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-16 16:03 [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 01/11] xen/p2m: Fix the comment describing the P2M tree Konrad Rzeszutek Wilk
2012-08-17 17:29   ` [Xen-devel] " Stefano Stabellini
2012-08-16 16:03 ` [PATCH 02/11] xen/x86: Use memblock_reserve for sensitive areas Konrad Rzeszutek Wilk
2012-08-17 17:35   ` [Xen-devel] " Stefano Stabellini
2012-08-20 14:13     ` Konrad Rzeszutek Wilk
2012-08-21 17:27       ` Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
2012-08-21 19:03         ` Konrad Rzeszutek Wilk
2012-08-22 10:48           ` Stefano Stabellini
2012-08-22 14:00             ` Konrad Rzeszutek Wilk
2012-08-22 14:12           ` Jan Beulich
2012-08-22 14:41             ` Stefano Stabellini
2012-08-22 14:57             ` Konrad Rzeszutek Wilk
2012-08-22 15:59           ` Jan Beulich
2012-08-22 16:21             ` Konrad Rzeszutek Wilk
2012-08-22 18:55             ` [Xen-devel] Q:pt_base in COMPAT mode offset by two pages. Was:Re: " Konrad Rzeszutek Wilk
2012-08-23  6:23               ` Jan Beulich
2012-08-16 16:03 ` [PATCH 03/11] xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 04/11] xen/mmu: Provide comments describing the _ka and _va aliasing issue Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 05/11] xen/mmu: use copy_page instead of memcpy Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 06/11] xen/mmu: For 64-bit do not call xen_map_identity_early Konrad Rzeszutek Wilk
2012-08-17 17:41   ` [Xen-devel] " Stefano Stabellini
2012-08-17 17:45     ` Konrad Rzeszutek Wilk
2012-08-20 11:45       ` Stefano Stabellini
2012-08-20 11:53         ` Ian Campbell
2012-08-20 11:58           ` Stefano Stabellini
2012-08-20 12:06             ` Konrad Rzeszutek Wilk
2012-08-20 12:19               ` Stefano Stabellini
2012-08-23 15:40             ` Konrad Rzeszutek Wilk
2012-08-23 15:57               ` Stefano Stabellini
2012-08-16 16:03 ` [PATCH 07/11] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages Konrad Rzeszutek Wilk
2012-08-17 18:07   ` [Xen-devel] " Stefano Stabellini
2012-08-17 18:05     ` Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 08/11] xen/p2m: Add logic to revector a P2M tree to use __va leafs Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 09/11] xen/mmu: Copy and revector the P2M tree Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 10/11] xen/mmu: Remove from __ka space PMD entries for pagetables Konrad Rzeszutek Wilk
2012-08-16 16:03 ` [PATCH 11/11] xen/mmu: Release just the MFN list, not MFN list and part of pagetables Konrad Rzeszutek Wilk
2012-08-21 14:18   ` [Xen-devel] " Stefano Stabellini
2012-08-21 14:57     ` Konrad Rzeszutek Wilk
2012-08-21 15:27       ` Stefano Stabellini
2012-09-17 18:06   ` William Dauchy
2012-09-17 18:18     ` Konrad Rzeszutek Wilk
2012-08-17 17:39 ` [PATCH] Boot PV guests with more than 128GB (v3) for v3.7 Konrad Rzeszutek Wilk
