* [PATCH 00/11] x86: support up to 16Tb
@ 2013-01-22 10:45 Jan Beulich
  2013-01-22 10:50 ` [PATCH 02/11] x86: extend frame table virtual space Jan Beulich
                   ` (12 more replies)
  0 siblings, 13 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:45 UTC (permalink / raw)
  To: xen-devel

This series enables Xen to support up to 16Tb.

01: x86: introduce virt_to_xen_l1e()
02: x86: extend frame table virtual space
03: x86: re-introduce map_domain_page() et al
04: x86: properly use map_domain_page() when building Dom0
05: x86: consolidate initialization of PV guest L4 page tables
06: x86: properly use map_domain_page() during domain creation/destruction
07: x86: properly use map_domain_page() during page table manipulation
08: x86: properly use map_domain_page() in nested HVM code
09: x86: properly use map_domain_page() in miscellaneous places
10: tmem: partial adjustments for x86 16Tb support
11: x86: support up to 16Tb

As I don't have a 16Tb system around, I used the following
debugging patch to simulate the most critical effect the changes
above have on a system with that much memory: not all of the 1:1
mapping being accessible while in PV guest context. To that end,
the patch adds a command line option to pull the split point down.
It is provided in the raw form I used it in; the pieces that are
properly formatted and not marked "//temp" might be worth
considering for inclusion. The remaining pieces are likely less
worthwhile, but if others think differently I could certainly put
them into "normal" shape as well.

12: x86: debugging code for testing 16Tb support on smaller memory systems
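
Roughly speaking (the names below are made up for illustration and
need not match the actual patch), the option amounts to no more than
a boot time override of the MFN limit below which the 1:1 mapping is
assumed to be usable in PV context:

static unsigned int __initdata split_gb; /* hypothetical "split-gb=" option */
integer_param("split-gb", split_gb);

static unsigned long __read_mostly directmap_split_mfn;

static void __init apply_split_override(void) /* hypothetical helper */
{
    directmap_split_mfn = PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1));
    if ( split_gb &&
         ((unsigned long)split_gb << (30 - PAGE_SHIFT)) < directmap_split_mfn )
        directmap_split_mfn = (unsigned long)split_gb << (30 - PAGE_SHIFT);
}

map_domain_page() and friends would then compare MFNs against
directmap_split_mfn instead of against the compile-time constant.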

Signed-off-by: Jan Beulich <jbeulich@suse.com>


* [PATCH 02/11] x86: extend frame table virtual space
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
@ 2013-01-22 10:50 ` Jan Beulich
  2013-01-22 10:50 ` [PATCH 03/11] x86: re-introduce map_domain_page() et al Jan Beulich
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:50 UTC (permalink / raw)
  To: xen-devel


... to allow frames for up to 16Tb.

At the same time, add the super page frame table coordinates to the
comment describing the address space layout.
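
For reference, with sizeof(struct page_info) being 32 and
sizeof(struct spage_info) being 8 on x86-64, the sizing works out as
follows (the BUILD_BUG_ON()s below merely illustrate the arithmetic,
they are not part of the patch):

/* 16Tb = 2^44 bytes => 2^44 >> PAGE_SHIFT = 2^32 page frames.
 * 2^32 frames * 32 bytes each = 128Gb, hence FRAMETABLE_SIZE = GB(128).
 * 2^32 frames >> (SUPERPAGE_SHIFT - PAGE_SHIFT) = 2^23 2Mb super pages,
 * 2^23 super pages * 8 bytes each = 64Mb, hence the 64Mb super-page
 * information array.
 */
BUILD_BUG_ON(FRAMETABLE_NR < ((16UL << 40) >> PAGE_SHIFT));
BUILD_BUG_ON(SPAGETABLE_SIZE > MB(64));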

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -146,8 +146,7 @@ unsigned long max_page;
 unsigned long total_pages;
 
 unsigned long __read_mostly pdx_group_valid[BITS_TO_LONGS(
-    (FRAMETABLE_SIZE / sizeof(*frame_table) + PDX_GROUP_COUNT - 1)
-    / PDX_GROUP_COUNT)] = { [0] = 1 };
+    (FRAMETABLE_NR + PDX_GROUP_COUNT - 1) / PDX_GROUP_COUNT)] = { [0] = 1 };
 
 bool_t __read_mostly machine_to_phys_mapping_valid = 0;
 
@@ -218,7 +217,7 @@ static void __init init_spagetable(void)
     BUILD_BUG_ON(XEN_VIRT_END > SPAGETABLE_VIRT_START);
 
     init_frametable_chunk(spage_table,
-                          mem_hotplug ? (void *)SPAGETABLE_VIRT_END
+                          mem_hotplug ? spage_table + SPAGETABLE_NR
                                       : pdx_to_spage(max_pdx - 1) + 1);
 }
 
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -378,8 +378,8 @@ static void __init setup_max_pdx(void)
     if ( max_pdx > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
         max_pdx = DIRECTMAP_SIZE >> PAGE_SHIFT;
 
-    if ( max_pdx > FRAMETABLE_SIZE / sizeof(*frame_table) )
-        max_pdx = FRAMETABLE_SIZE / sizeof(*frame_table);
+    if ( max_pdx > FRAMETABLE_NR )
+        max_pdx = FRAMETABLE_NR;
 
     max_page = pdx_to_pfn(max_pdx - 1) + 1;
 }
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -958,7 +958,7 @@ static int extend_frame_table(struct mem
     nidx = cidx = pfn_to_pdx(spfn)/PDX_GROUP_COUNT;
 
     ASSERT( pfn_to_pdx(epfn) <= (DIRECTMAP_SIZE >> PAGE_SHIFT) &&
-         (pfn_to_pdx(epfn) <= FRAMETABLE_SIZE / sizeof(struct page_info)) );
+            pfn_to_pdx(epfn) <= FRAMETABLE_NR );
 
     if ( test_bit(cidx, pdx_group_valid) )
         cidx = find_next_zero_bit(pdx_group_valid, eidx, cidx);
@@ -1406,7 +1406,7 @@ int mem_hotadd_check(unsigned long spfn,
     if ( (spfn >= epfn) )
         return 0;
 
-    if (pfn_to_pdx(epfn) > (FRAMETABLE_SIZE / sizeof(*frame_table)))
+    if (pfn_to_pdx(epfn) > FRAMETABLE_NR)
         return 0;
 
     if ( (spfn | epfn) & ((1UL << PAGETABLE_ORDER) - 1) )
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -152,9 +152,11 @@ extern unsigned char boot_edid_info[128]
  *    High read-only compatibility machine-to-phys translation table.
  *  0xffff82c480000000 - 0xffff82c4bfffffff [1GB,   2^30 bytes, PML4:261]
  *    Xen text, static data, bss.
- *  0xffff82c4c0000000 - 0xffff82f5ffffffff [197GB,             PML4:261]
+ *  0xffff82c4c0000000 - 0xffff82dffbffffff [109GB - 64MB,      PML4:261]
  *    Reserved for future use.
- *  0xffff82f600000000 - 0xffff82ffffffffff [40GB,  2^38 bytes, PML4:261]
+ *  0xffff82dffc000000 - 0xffff82dfffffffff [64MB,  2^26 bytes, PML4:261]
+ *    Super-page information array.
+ *  0xffff82e000000000 - 0xffff82ffffffffff [128GB, 2^37 bytes, PML4:261]
  *    Page-frame information array.
  *  0xffff830000000000 - 0xffff87ffffffffff [5TB, 5*2^40 bytes, PML4:262-271]
  *    1:1 direct mapping of all physical memory.
@@ -218,15 +220,17 @@ extern unsigned char boot_edid_info[128]
 /* Slot 261: xen text, static data and bss (1GB). */
 #define XEN_VIRT_START          (HIRO_COMPAT_MPT_VIRT_END)
 #define XEN_VIRT_END            (XEN_VIRT_START + GB(1))
-/* Slot 261: superpage information array (20MB). */
+/* Slot 261: superpage information array (64MB). */
 #define SPAGETABLE_VIRT_END     FRAMETABLE_VIRT_START
-#define SPAGETABLE_SIZE         ((DIRECTMAP_SIZE >> SUPERPAGE_SHIFT) * \
-                                 sizeof(struct spage_info))
-#define SPAGETABLE_VIRT_START   (SPAGETABLE_VIRT_END - SPAGETABLE_SIZE)
-/* Slot 261: page-frame information array (40GB). */
+#define SPAGETABLE_NR           (((FRAMETABLE_NR - 1) >> (SUPERPAGE_SHIFT - \
+                                                          PAGE_SHIFT)) + 1)
+#define SPAGETABLE_SIZE         (SPAGETABLE_NR * sizeof(struct spage_info))
+#define SPAGETABLE_VIRT_START   ((SPAGETABLE_VIRT_END - SPAGETABLE_SIZE) & \
+                                 (-1UL << SUPERPAGE_SHIFT))
+/* Slot 261: page-frame information array (128GB). */
 #define FRAMETABLE_VIRT_END     DIRECTMAP_VIRT_START
-#define FRAMETABLE_SIZE         ((DIRECTMAP_SIZE >> PAGE_SHIFT) * \
-                                 sizeof(struct page_info))
+#define FRAMETABLE_SIZE         GB(128)
+#define FRAMETABLE_NR           (FRAMETABLE_SIZE / sizeof(*frame_table))
 #define FRAMETABLE_VIRT_START   (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE)
 /* Slot 262-271: A direct 1:1 mapping of all of physical memory. */
 #define DIRECTMAP_VIRT_START    (PML4_ADDR(262))




* [PATCH 03/11] x86: re-introduce map_domain_page() et al
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
  2013-01-22 10:50 ` [PATCH 02/11] x86: extend frame table virtual space Jan Beulich
@ 2013-01-22 10:50 ` Jan Beulich
  2013-01-22 10:51 ` [PATCH 04/11] x86: properly use map_domain_page() when building Dom0 Jan Beulich
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:50 UTC (permalink / raw)
  To: xen-devel


This is being done mostly in the form previously used on x86-32,
utilizing the second L3 page table slot within the per-domain mapping
area for those mappings. It remains to be determined whether that
concept is really suitable, or whether instead re-implementing at least
the non-global variant from scratch would be better.

Also add the helpers {clear,copy}_domain_page() as well as initial uses
of them.

One question is whether, to exercise the non-trivial code paths, we
shouldn't make the trivial shortcuts conditional upon NDEBUG being
defined. See the debugging patch at the end of the series.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -19,6 +19,7 @@ obj-bin-y += dmi_scan.init.o
 obj-y += domctl.o
 obj-y += domain.o
 obj-bin-y += domain_build.init.o
+obj-y += domain_page.o
 obj-y += e820.o
 obj-y += extable.o
 obj-y += flushtlb.o
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -397,10 +397,14 @@ int vcpu_initialise(struct vcpu *v)
             return -ENOMEM;
         clear_page(page_to_virt(pg));
         perdomain_pt_page(d, idx) = pg;
-        d->arch.mm_perdomain_l2[l2_table_offset(PERDOMAIN_VIRT_START)+idx]
+        d->arch.mm_perdomain_l2[0][l2_table_offset(PERDOMAIN_VIRT_START)+idx]
             = l2e_from_page(pg, __PAGE_HYPERVISOR);
     }
 
+    rc = mapcache_vcpu_init(v);
+    if ( rc )
+        return rc;
+
     paging_vcpu_init(v);
 
     v->arch.perdomain_ptes = perdomain_ptes(d, v);
@@ -526,8 +530,8 @@ int arch_domain_create(struct domain *d,
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l2 = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l2);
+    d->arch.mm_perdomain_l2[0] = page_to_virt(pg);
+    clear_page(d->arch.mm_perdomain_l2[0]);
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
@@ -535,8 +539,10 @@ int arch_domain_create(struct domain *d,
     d->arch.mm_perdomain_l3 = page_to_virt(pg);
     clear_page(d->arch.mm_perdomain_l3);
     d->arch.mm_perdomain_l3[l3_table_offset(PERDOMAIN_VIRT_START)] =
-        l3e_from_page(virt_to_page(d->arch.mm_perdomain_l2),
-                            __PAGE_HYPERVISOR);
+        l3e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l2[0]),
+                     __PAGE_HYPERVISOR);
+
+    mapcache_domain_init(d);
 
     HYPERVISOR_COMPAT_VIRT_START(d) =
         is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
@@ -609,8 +615,9 @@ int arch_domain_create(struct domain *d,
     free_xenheap_page(d->shared_info);
     if ( paging_initialised )
         paging_final_teardown(d);
-    if ( d->arch.mm_perdomain_l2 )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2));
+    mapcache_domain_exit(d);
+    if ( d->arch.mm_perdomain_l2[0] )
+        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
     if ( d->arch.mm_perdomain_l3 )
         free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
     if ( d->arch.mm_perdomain_pt_pages )
@@ -633,13 +640,15 @@ void arch_domain_destroy(struct domain *
 
     paging_final_teardown(d);
 
+    mapcache_domain_exit(d);
+
     for ( i = 0; i < PDPT_L2_ENTRIES; ++i )
     {
         if ( perdomain_pt_page(d, i) )
             free_domheap_page(perdomain_pt_page(d, i));
     }
     free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2));
+    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
     free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
 
     free_xenheap_page(d->shared_info);
--- /dev/null
+++ b/xen/arch/x86/domain_page.c
@@ -0,0 +1,471 @@
+/******************************************************************************
+ * domain_page.h
+ *
+ * Allow temporary mapping of domain pages.
+ *
+ * Copyright (c) 2003-2006, Keir Fraser <keir@xensource.com>
+ */
+
+#include <xen/domain_page.h>
+#include <xen/mm.h>
+#include <xen/perfc.h>
+#include <xen/pfn.h>
+#include <xen/sched.h>
+#include <asm/current.h>
+#include <asm/flushtlb.h>
+#include <asm/hardirq.h>
+
+static inline struct vcpu *mapcache_current_vcpu(void)
+{
+    /* In the common case we use the mapcache of the running VCPU. */
+    struct vcpu *v = current;
+
+    /*
+     * When current isn't properly set up yet, this is equivalent to
+     * running in an idle vCPU (callers must check for NULL).
+     */
+    if ( v == (struct vcpu *)0xfffff000 )
+        return NULL;
+
+    /*
+     * If guest_table is NULL, and we are running a paravirtualised guest,
+     * then it means we are running on the idle domain's page table and must
+     * therefore use its mapcache.
+     */
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    {
+        /* If we really are idling, perform lazy context switch now. */
+        if ( (v = idle_vcpu[smp_processor_id()]) == current )
+            sync_local_execstate();
+        /* We must now be running on the idle page table. */
+        ASSERT(read_cr3() == __pa(idle_pg_table));
+    }
+
+    return v;
+}
+
+#define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
+#define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
+#define DCACHE_L1ENT(dc, idx) \
+    ((dc)->l1tab[(idx) >> PAGETABLE_ORDER] \
+                [(idx) & ((1 << PAGETABLE_ORDER) - 1)])
+
+void *map_domain_page(unsigned long mfn)
+{
+    unsigned long flags;
+    unsigned int idx, i;
+    struct vcpu *v;
+    struct mapcache_domain *dcache;
+    struct mapcache_vcpu *vcache;
+    struct vcpu_maphash_entry *hashent;
+
+    if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return mfn_to_virt(mfn);
+
+    v = mapcache_current_vcpu();
+    if ( !v || is_hvm_vcpu(v) )
+        return mfn_to_virt(mfn);
+
+    dcache = &v->domain->arch.pv_domain.mapcache;
+    vcache = &v->arch.pv_vcpu.mapcache;
+    if ( !dcache->l1tab )
+        return mfn_to_virt(mfn);
+
+    perfc_incr(map_domain_page_count);
+
+    local_irq_save(flags);
+
+    hashent = &vcache->hash[MAPHASH_HASHFN(mfn)];
+    if ( hashent->mfn == mfn )
+    {
+        idx = hashent->idx;
+        ASSERT(idx < dcache->entries);
+        hashent->refcnt++;
+        ASSERT(hashent->refcnt);
+        ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, idx)) == mfn);
+        goto out;
+    }
+
+    spin_lock(&dcache->lock);
+
+    /* Has some other CPU caused a wrap? We must flush if so. */
+    if ( unlikely(dcache->epoch != vcache->shadow_epoch) )
+    {
+        vcache->shadow_epoch = dcache->epoch;
+        if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) )
+        {
+            perfc_incr(domain_page_tlb_flush);
+            flush_tlb_local();
+        }
+    }
+
+    idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
+    if ( unlikely(idx >= dcache->entries) )
+    {
+        unsigned long accum = 0;
+
+        /* /First/, clean the garbage map and update the inuse list. */
+        for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
+        {
+            dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
+            accum |= ~dcache->inuse[i];
+        }
+
+        if ( accum )
+            idx = find_first_zero_bit(dcache->inuse, dcache->entries);
+        else
+        {
+            /* Replace a hash entry instead. */
+            i = MAPHASH_HASHFN(mfn);
+            do {
+                hashent = &vcache->hash[i];
+                if ( hashent->idx != MAPHASHENT_NOTINUSE && !hashent->refcnt )
+                {
+                    idx = hashent->idx;
+                    ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, idx)) ==
+                           hashent->mfn);
+                    l1e_write(&DCACHE_L1ENT(dcache, idx), l1e_empty());
+                    hashent->idx = MAPHASHENT_NOTINUSE;
+                    hashent->mfn = ~0UL;
+                    break;
+                }
+                if ( ++i == MAPHASH_ENTRIES )
+                    i = 0;
+            } while ( i != MAPHASH_HASHFN(mfn) );
+        }
+        BUG_ON(idx >= dcache->entries);
+
+        /* /Second/, flush TLBs. */
+        perfc_incr(domain_page_tlb_flush);
+        flush_tlb_local();
+        vcache->shadow_epoch = ++dcache->epoch;
+        dcache->tlbflush_timestamp = tlbflush_current_time();
+    }
+
+    set_bit(idx, dcache->inuse);
+    dcache->cursor = idx + 1;
+
+    spin_unlock(&dcache->lock);
+
+    l1e_write(&DCACHE_L1ENT(dcache, idx),
+              l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
+
+ out:
+    local_irq_restore(flags);
+    return (void *)MAPCACHE_VIRT_START + pfn_to_paddr(idx);
+}
+
+void unmap_domain_page(const void *ptr)
+{
+    unsigned int idx;
+    struct vcpu *v;
+    struct mapcache_domain *dcache;
+    unsigned long va = (unsigned long)ptr, mfn, flags;
+    struct vcpu_maphash_entry *hashent;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return;
+
+    ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
+
+    v = mapcache_current_vcpu();
+    ASSERT(v && !is_hvm_vcpu(v));
+
+    dcache = &v->domain->arch.pv_domain.mapcache;
+    ASSERT(dcache->l1tab);
+
+    idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
+    mfn = l1e_get_pfn(DCACHE_L1ENT(dcache, idx));
+    hashent = &v->arch.pv_vcpu.mapcache.hash[MAPHASH_HASHFN(mfn)];
+
+    local_irq_save(flags);
+
+    if ( hashent->idx == idx )
+    {
+        ASSERT(hashent->mfn == mfn);
+        ASSERT(hashent->refcnt);
+        hashent->refcnt--;
+    }
+    else if ( !hashent->refcnt )
+    {
+        if ( hashent->idx != MAPHASHENT_NOTINUSE )
+        {
+            /* /First/, zap the PTE. */
+            ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, hashent->idx)) ==
+                   hashent->mfn);
+            l1e_write(&DCACHE_L1ENT(dcache, hashent->idx), l1e_empty());
+            /* /Second/, mark as garbage. */
+            set_bit(hashent->idx, dcache->garbage);
+        }
+
+        /* Add newly-freed mapping to the maphash. */
+        hashent->mfn = mfn;
+        hashent->idx = idx;
+    }
+    else
+    {
+        /* /First/, zap the PTE. */
+        l1e_write(&DCACHE_L1ENT(dcache, idx), l1e_empty());
+        /* /Second/, mark as garbage. */
+        set_bit(idx, dcache->garbage);
+    }
+
+    local_irq_restore(flags);
+}
+
+void clear_domain_page(unsigned long mfn)
+{
+    void *ptr = map_domain_page(mfn);
+
+    clear_page(ptr);
+    unmap_domain_page(ptr);
+}
+
+void copy_domain_page(unsigned long dmfn, unsigned long smfn)
+{
+    const void *src = map_domain_page(smfn);
+    void *dst = map_domain_page(dmfn);
+
+    copy_page(dst, src);
+    unmap_domain_page(dst);
+    unmap_domain_page(src);
+}
+
+int mapcache_domain_init(struct domain *d)
+{
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    unsigned int i, bitmap_pages, memf = MEMF_node(domain_to_node(d));
+    unsigned long *end;
+
+    if ( is_hvm_domain(d) || is_idle_domain(d) )
+        return 0;
+
+    if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return 0;
+
+    dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
+    d->arch.mm_perdomain_l2[MAPCACHE_SLOT] = alloc_xenheap_pages(0, memf);
+    if ( !dcache->l1tab || !d->arch.mm_perdomain_l2[MAPCACHE_SLOT] )
+        return -ENOMEM;
+
+    clear_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
+    d->arch.mm_perdomain_l3[l3_table_offset(MAPCACHE_VIRT_START)] =
+        l3e_from_paddr(__pa(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]),
+                       __PAGE_HYPERVISOR);
+
+    BUILD_BUG_ON(MAPCACHE_VIRT_END + 3 +
+                 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)) >
+                 MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
+    bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
+    dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
+    dcache->garbage = dcache->inuse +
+                      (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
+    end = dcache->garbage + bitmap_pages * PAGE_SIZE / sizeof(long);
+
+    for ( i = l2_table_offset((unsigned long)dcache->inuse);
+          i <= l2_table_offset((unsigned long)(end - 1)); ++i )
+    {
+        ASSERT(i <= MAPCACHE_L2_ENTRIES);
+        dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
+        if ( !dcache->l1tab[i] )
+            return -ENOMEM;
+        clear_page(dcache->l1tab[i]);
+        d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
+            l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+    }
+
+    spin_lock_init(&dcache->lock);
+
+    return 0;
+}
+
+void mapcache_domain_exit(struct domain *d)
+{
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+
+    if ( is_hvm_domain(d) )
+        return;
+
+    if ( dcache->l1tab )
+    {
+        unsigned long i;
+
+        for ( i = (unsigned long)dcache->inuse; ; i += PAGE_SIZE )
+        {
+            l1_pgentry_t *pl1e;
+
+            if ( l2_table_offset(i) > MAPCACHE_L2_ENTRIES ||
+                 !dcache->l1tab[l2_table_offset(i)] )
+                break;
+
+            pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+            if ( l1e_get_flags(*pl1e) )
+                free_domheap_page(l1e_get_page(*pl1e));
+        }
+
+        for ( i = 0; i < MAPCACHE_L2_ENTRIES + 1; ++i )
+            free_xenheap_page(dcache->l1tab[i]);
+
+        xfree(dcache->l1tab);
+    }
+    free_xenheap_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
+}
+
+int mapcache_vcpu_init(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    unsigned long i;
+    unsigned int memf = MEMF_node(vcpu_to_node(v));
+
+    if ( is_hvm_vcpu(v) || !dcache->l1tab )
+        return 0;
+
+    while ( dcache->entries < d->max_vcpus * MAPCACHE_VCPU_ENTRIES )
+    {
+        unsigned int ents = dcache->entries + MAPCACHE_VCPU_ENTRIES;
+        l1_pgentry_t *pl1e;
+
+        /* Populate page tables. */
+        if ( !dcache->l1tab[i = mapcache_l2_entry(ents - 1)] )
+        {
+            dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
+            if ( !dcache->l1tab[i] )
+                return -ENOMEM;
+            clear_page(dcache->l1tab[i]);
+            d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
+                l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+        }
+
+        /* Populate bit maps. */
+        i = (unsigned long)(dcache->inuse + BITS_TO_LONGS(ents));
+        pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+        if ( !l1e_get_flags(*pl1e) )
+        {
+            struct page_info *pg = alloc_domheap_page(NULL, memf);
+
+            if ( !pg )
+                return -ENOMEM;
+            clear_domain_page(page_to_mfn(pg));
+            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+
+            i = (unsigned long)(dcache->garbage + BITS_TO_LONGS(ents));
+            pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+            ASSERT(!l1e_get_flags(*pl1e));
+
+            pg = alloc_domheap_page(NULL, memf);
+            if ( !pg )
+                return -ENOMEM;
+            clear_domain_page(page_to_mfn(pg));
+            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+        }
+
+        dcache->entries = ents;
+    }
+
+    /* Mark all maphash entries as not in use. */
+    BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
+    for ( i = 0; i < MAPHASH_ENTRIES; i++ )
+    {
+        struct vcpu_maphash_entry *hashent = &v->arch.pv_vcpu.mapcache.hash[i];
+
+        hashent->mfn = ~0UL; /* never valid to map */
+        hashent->idx = MAPHASHENT_NOTINUSE;
+    }
+
+    return 0;
+}
+
+#define GLOBALMAP_BITS (GLOBALMAP_GBYTES << (30 - PAGE_SHIFT))
+static unsigned long inuse[BITS_TO_LONGS(GLOBALMAP_BITS)];
+static unsigned long garbage[BITS_TO_LONGS(GLOBALMAP_BITS)];
+static unsigned int inuse_cursor;
+static DEFINE_SPINLOCK(globalmap_lock);
+
+void *map_domain_page_global(unsigned long mfn)
+{
+    l1_pgentry_t *pl1e;
+    unsigned int idx, i;
+    unsigned long va;
+
+    ASSERT(!in_irq() && local_irq_is_enabled());
+
+    if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return mfn_to_virt(mfn);
+
+    spin_lock(&globalmap_lock);
+
+    idx = find_next_zero_bit(inuse, GLOBALMAP_BITS, inuse_cursor);
+    va = GLOBALMAP_VIRT_START + pfn_to_paddr(idx);
+    if ( unlikely(va >= GLOBALMAP_VIRT_END) )
+    {
+        /* /First/, clean the garbage map and update the inuse list. */
+        for ( i = 0; i < ARRAY_SIZE(garbage); i++ )
+            inuse[i] &= ~xchg(&garbage[i], 0);
+
+        /* /Second/, flush all TLBs to get rid of stale garbage mappings. */
+        flush_tlb_all();
+
+        idx = find_first_zero_bit(inuse, GLOBALMAP_BITS);
+        va = GLOBALMAP_VIRT_START + pfn_to_paddr(idx);
+        if ( unlikely(va >= GLOBALMAP_VIRT_END) )
+        {
+            spin_unlock(&globalmap_lock);
+            return NULL;
+        }
+    }
+
+    set_bit(idx, inuse);
+    inuse_cursor = idx + 1;
+
+    spin_unlock(&globalmap_lock);
+
+    pl1e = virt_to_xen_l1e(va);
+    if ( !pl1e )
+        return NULL;
+    l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
+
+    return (void *)va;
+}
+
+void unmap_domain_page_global(const void *ptr)
+{
+    unsigned long va = (unsigned long)ptr;
+    l1_pgentry_t *pl1e;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return;
+
+    ASSERT(va >= GLOBALMAP_VIRT_START && va < GLOBALMAP_VIRT_END);
+
+    /* /First/, we zap the PTE. */
+    pl1e = virt_to_xen_l1e(va);
+    BUG_ON(!pl1e);
+    l1e_write(pl1e, l1e_empty());
+
+    /* /Second/, we add to the garbage map. */
+    set_bit(PFN_DOWN(va - GLOBALMAP_VIRT_START), garbage);
+}
+
+/* Translate a map-domain-page'd address to the underlying MFN */
+unsigned long domain_page_map_to_mfn(const void *ptr)
+{
+    unsigned long va = (unsigned long)ptr;
+    const l1_pgentry_t *pl1e;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return virt_to_mfn(ptr);
+
+    if ( va >= GLOBALMAP_VIRT_START && va < GLOBALMAP_VIRT_END )
+    {
+        pl1e = virt_to_xen_l1e(va);
+        BUG_ON(!pl1e);
+    }
+    else
+    {
+        ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
+        pl1e = &__linear_l1_table[l1_linear_offset(va)];
+    }
+
+    return l1e_get_pfn(*pl1e);
+}
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2661,9 +2661,6 @@ static inline int vcpumask_to_pcpumask(
     }
 }
 
-#define fixmap_domain_page(mfn) mfn_to_virt(mfn)
-#define fixunmap_domain_page(ptr) ((void)(ptr))
-
 long do_mmuext_op(
     XEN_GUEST_HANDLE_PARAM(mmuext_op_t) uops,
     unsigned int count,
@@ -2983,7 +2980,6 @@ long do_mmuext_op(
 
         case MMUEXT_CLEAR_PAGE: {
             struct page_info *page;
-            unsigned char *ptr;
 
             page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
             if ( !page || !get_page_type(page, PGT_writable_page) )
@@ -2998,9 +2994,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being cleared. */
             paging_mark_dirty(d, page_to_mfn(page));
 
-            ptr = fixmap_domain_page(page_to_mfn(page));
-            clear_page(ptr);
-            fixunmap_domain_page(ptr);
+            clear_domain_page(page_to_mfn(page));
 
             put_page_and_type(page);
             break;
@@ -3008,8 +3002,6 @@ long do_mmuext_op(
 
         case MMUEXT_COPY_PAGE:
         {
-            const unsigned char *src;
-            unsigned char *dst;
             struct page_info *src_page, *dst_page;
 
             src_page = get_page_from_gfn(d, op.arg2.src_mfn, NULL, P2M_ALLOC);
@@ -3034,11 +3026,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being copied to. */
             paging_mark_dirty(d, page_to_mfn(dst_page));
 
-            src = __map_domain_page(src_page);
-            dst = fixmap_domain_page(page_to_mfn(dst_page));
-            copy_page(dst, src);
-            fixunmap_domain_page(dst);
-            unmap_domain_page(src);
+            copy_domain_page(page_to_mfn(dst_page), page_to_mfn(src_page));
 
             put_page_and_type(dst_page);
             put_page(src_page);
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -27,6 +27,7 @@
 #define CONFIG_DISCONTIGMEM 1
 #define CONFIG_NUMA_EMU 1
 #define CONFIG_PAGEALLOC_MAX_ORDER (2 * PAGETABLE_ORDER)
+#define CONFIG_DOMAIN_PAGE 1
 
 /* Intel P4 currently has largest cache line (L2 line size is 128 bytes). */
 #define CONFIG_X86_L1_CACHE_SHIFT 7
@@ -147,12 +148,14 @@ extern unsigned char boot_edid_info[128]
  *  0xffff82c000000000 - 0xffff82c3ffffffff [16GB,  2^34 bytes, PML4:261]
  *    vmap()/ioremap()/fixmap area.
  *  0xffff82c400000000 - 0xffff82c43fffffff [1GB,   2^30 bytes, PML4:261]
- *    Compatibility machine-to-phys translation table.
+ *    Global domain page map area.
  *  0xffff82c440000000 - 0xffff82c47fffffff [1GB,   2^30 bytes, PML4:261]
- *    High read-only compatibility machine-to-phys translation table.
+ *    Compatibility machine-to-phys translation table.
  *  0xffff82c480000000 - 0xffff82c4bfffffff [1GB,   2^30 bytes, PML4:261]
+ *    High read-only compatibility machine-to-phys translation table.
+ *  0xffff82c4c0000000 - 0xffff82c4ffffffff [1GB,   2^30 bytes, PML4:261]
  *    Xen text, static data, bss.
- *  0xffff82c4c0000000 - 0xffff82dffbffffff [109GB - 64MB,      PML4:261]
+ *  0xffff82c500000000 - 0xffff82dffbffffff [108GB - 64MB,      PML4:261]
  *    Reserved for future use.
  *  0xffff82dffc000000 - 0xffff82dfffffffff [64MB,  2^26 bytes, PML4:261]
  *    Super-page information array.
@@ -201,18 +204,24 @@ extern unsigned char boot_edid_info[128]
 /* Slot 259: linear page table (shadow table). */
 #define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
 #define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 260: per-domain mappings. */
+/* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
-#define PERDOMAIN_VIRT_END      (PERDOMAIN_VIRT_START + (PERDOMAIN_MBYTES<<20))
-#define PERDOMAIN_MBYTES        (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOTS         2
+#define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
+                                 (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 261: machine-to-phys conversion table (256GB). */
 #define RDWR_MPT_VIRT_START     (PML4_ADDR(261))
 #define RDWR_MPT_VIRT_END       (RDWR_MPT_VIRT_START + MPT_VIRT_SIZE)
 /* Slot 261: vmap()/ioremap()/fixmap area (16GB). */
 #define VMAP_VIRT_START         RDWR_MPT_VIRT_END
 #define VMAP_VIRT_END           (VMAP_VIRT_START + GB(16))
+/* Slot 261: global domain page map area (1GB). */
+#define GLOBALMAP_GBYTES        1
+#define GLOBALMAP_VIRT_START    VMAP_VIRT_END
+#define GLOBALMAP_VIRT_END      (GLOBALMAP_VIRT_START + (GLOBALMAP_GBYTES<<30))
 /* Slot 261: compatibility machine-to-phys conversion table (1GB). */
-#define RDWR_COMPAT_MPT_VIRT_START VMAP_VIRT_END
+#define RDWR_COMPAT_MPT_VIRT_START GLOBALMAP_VIRT_END
 #define RDWR_COMPAT_MPT_VIRT_END (RDWR_COMPAT_MPT_VIRT_START + GB(1))
 /* Slot 261: high read-only compat machine-to-phys conversion table (1GB). */
 #define HIRO_COMPAT_MPT_VIRT_START RDWR_COMPAT_MPT_VIRT_END
@@ -279,9 +288,9 @@ extern unsigned long xen_phys_start;
 /* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
 #define GDT_LDT_VCPU_SHIFT       5
 #define GDT_LDT_VCPU_VA_SHIFT    (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
-#define GDT_LDT_MBYTES           PERDOMAIN_MBYTES
+#define GDT_LDT_MBYTES           PERDOMAIN_SLOT_MBYTES
 #define MAX_VIRT_CPUS            (GDT_LDT_MBYTES << (20-GDT_LDT_VCPU_VA_SHIFT))
-#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_START
+#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_SLOT(0)
 #define GDT_LDT_VIRT_END         (GDT_LDT_VIRT_START + (GDT_LDT_MBYTES << 20))
 
 /* The address of a particular VCPU's GDT or LDT. */
@@ -290,8 +299,16 @@ extern unsigned long xen_phys_start;
 #define LDT_VIRT_START(v)    \
     (GDT_VIRT_START(v) + (64*1024))
 
+/* map_domain_page() map cache. The last per-domain-mapping sub-area. */
+#define MAPCACHE_VCPU_ENTRIES    (CONFIG_PAGING_LEVELS * CONFIG_PAGING_LEVELS)
+#define MAPCACHE_ENTRIES         (MAX_VIRT_CPUS * MAPCACHE_VCPU_ENTRIES)
+#define MAPCACHE_SLOT            (PERDOMAIN_SLOTS - 1)
+#define MAPCACHE_VIRT_START      PERDOMAIN_VIRT_SLOT(MAPCACHE_SLOT)
+#define MAPCACHE_VIRT_END        (MAPCACHE_VIRT_START + \
+                                  MAPCACHE_ENTRIES * PAGE_SIZE)
+
 #define PDPT_L1_ENTRIES       \
-    ((PERDOMAIN_VIRT_END - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
+    ((PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS - 1) - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
 #define PDPT_L2_ENTRIES       \
     ((PDPT_L1_ENTRIES + (1 << PAGETABLE_ORDER) - 1) >> PAGETABLE_ORDER)
 
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -39,7 +39,7 @@ struct trap_bounce {
 
 #define MAPHASH_ENTRIES 8
 #define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1))
-#define MAPHASHENT_NOTINUSE ((u16)~0U)
+#define MAPHASHENT_NOTINUSE ((u32)~0U)
 struct mapcache_vcpu {
     /* Shadow of mapcache_domain.epoch. */
     unsigned int shadow_epoch;
@@ -47,16 +47,15 @@ struct mapcache_vcpu {
     /* Lock-free per-VCPU hash of recently-used mappings. */
     struct vcpu_maphash_entry {
         unsigned long mfn;
-        uint16_t      idx;
-        uint16_t      refcnt;
+        uint32_t      idx;
+        uint32_t      refcnt;
     } hash[MAPHASH_ENTRIES];
 };
 
-#define MAPCACHE_ORDER   10
-#define MAPCACHE_ENTRIES (1 << MAPCACHE_ORDER)
 struct mapcache_domain {
     /* The PTEs that provide the mappings, and a cursor into the array. */
-    l1_pgentry_t *l1tab;
+    l1_pgentry_t **l1tab;
+    unsigned int entries;
     unsigned int cursor;
 
     /* Protects map_domain_page(). */
@@ -67,12 +66,13 @@ struct mapcache_domain {
     u32 tlbflush_timestamp;
 
     /* Which mappings are in use, and which are garbage to reap next epoch? */
-    unsigned long inuse[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
-    unsigned long garbage[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
+    unsigned long *inuse;
+    unsigned long *garbage;
 };
 
-void mapcache_domain_init(struct domain *);
-void mapcache_vcpu_init(struct vcpu *);
+int mapcache_domain_init(struct domain *);
+void mapcache_domain_exit(struct domain *);
+int mapcache_vcpu_init(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);
@@ -229,6 +229,9 @@ struct pv_domain
      * unmask the event channel */
     bool_t auto_unmask;
 
+    /* map_domain_page() mapping cache. */
+    struct mapcache_domain mapcache;
+
     /* Pseudophysical e820 map (XENMEM_memory_map).  */
     spinlock_t e820_lock;
     struct e820entry *e820;
@@ -238,7 +241,7 @@ struct pv_domain
 struct arch_domain
 {
     struct page_info **mm_perdomain_pt_pages;
-    l2_pgentry_t *mm_perdomain_l2;
+    l2_pgentry_t *mm_perdomain_l2[PERDOMAIN_SLOTS];
     l3_pgentry_t *mm_perdomain_l3;
 
     unsigned int hv_compat_vstart;
@@ -324,6 +327,9 @@ struct arch_domain
 
 struct pv_vcpu
 {
+    /* map_domain_page() mapping cache. */
+    struct mapcache_vcpu mapcache;
+
     struct trap_info *trap_ctxt;
 
     unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
--- a/xen/include/xen/domain_page.h
+++ b/xen/include/xen/domain_page.h
@@ -25,11 +25,16 @@ void *map_domain_page(unsigned long mfn)
  */
 void unmap_domain_page(const void *va);
 
+/*
+ * Clear a given page frame, or copy between two of them.
+ */
+void clear_domain_page(unsigned long mfn);
+void copy_domain_page(unsigned long dmfn, unsigned long smfn);
 
 /* 
  * Given a VA from map_domain_page(), return its underlying MFN.
  */
-unsigned long domain_page_map_to_mfn(void *va);
+unsigned long domain_page_map_to_mfn(const void *va);
 
 /*
  * Similar to the above calls, except the mapping is accessible in all
@@ -107,6 +112,9 @@ domain_mmap_cache_destroy(struct domain_
 #define map_domain_page(mfn)                mfn_to_virt(mfn)
 #define __map_domain_page(pg)               page_to_virt(pg)
 #define unmap_domain_page(va)               ((void)(va))
+#define clear_domain_page(mfn)              clear_page(mfn_to_virt(mfn))
+#define copy_domain_page(dmfn, smfn)        copy_page(mfn_to_virt(dmfn), \
+                                                      mfn_to_virt(smfn))
 #define domain_page_map_to_mfn(va)          virt_to_mfn((unsigned long)(va))
 
 #define map_domain_page_global(mfn)         mfn_to_virt(mfn)



@@ -2998,9 +2994,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being cleared. */
             paging_mark_dirty(d, page_to_mfn(page));
 
-            ptr = fixmap_domain_page(page_to_mfn(page));
-            clear_page(ptr);
-            fixunmap_domain_page(ptr);
+            clear_domain_page(page_to_mfn(page));
 
             put_page_and_type(page);
             break;
@@ -3008,8 +3002,6 @@ long do_mmuext_op(
 
         case MMUEXT_COPY_PAGE:
         {
-            const unsigned char *src;
-            unsigned char *dst;
             struct page_info *src_page, *dst_page;
 
             src_page = get_page_from_gfn(d, op.arg2.src_mfn, NULL, P2M_ALLOC);
@@ -3034,11 +3026,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being copied to. */
             paging_mark_dirty(d, page_to_mfn(dst_page));
 
-            src = __map_domain_page(src_page);
-            dst = fixmap_domain_page(page_to_mfn(dst_page));
-            copy_page(dst, src);
-            fixunmap_domain_page(dst);
-            unmap_domain_page(src);
+            copy_domain_page(page_to_mfn(dst_page), page_to_mfn(src_page));
 
             put_page_and_type(dst_page);
             put_page(src_page);
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -27,6 +27,7 @@
 #define CONFIG_DISCONTIGMEM 1
 #define CONFIG_NUMA_EMU 1
 #define CONFIG_PAGEALLOC_MAX_ORDER (2 * PAGETABLE_ORDER)
+#define CONFIG_DOMAIN_PAGE 1
 
 /* Intel P4 currently has largest cache line (L2 line size is 128 bytes). */
 #define CONFIG_X86_L1_CACHE_SHIFT 7
@@ -147,12 +148,14 @@ extern unsigned char boot_edid_info[128]
  *  0xffff82c000000000 - 0xffff82c3ffffffff [16GB,  2^34 bytes, PML4:261]
  *    vmap()/ioremap()/fixmap area.
  *  0xffff82c400000000 - 0xffff82c43fffffff [1GB,   2^30 bytes, PML4:261]
- *    Compatibility machine-to-phys translation table.
+ *    Global domain page map area.
  *  0xffff82c440000000 - 0xffff82c47fffffff [1GB,   2^30 bytes, PML4:261]
- *    High read-only compatibility machine-to-phys translation table.
+ *    Compatibility machine-to-phys translation table.
  *  0xffff82c480000000 - 0xffff82c4bfffffff [1GB,   2^30 bytes, PML4:261]
+ *    High read-only compatibility machine-to-phys translation table.
+ *  0xffff82c4c0000000 - 0xffff82c4ffffffff [1GB,   2^30 bytes, PML4:261]
  *    Xen text, static data, bss.
- *  0xffff82c4c0000000 - 0xffff82dffbffffff [109GB - 64MB,      PML4:261]
+ *  0xffff82c500000000 - 0xffff82dffbffffff [108GB - 64MB,      PML4:261]
  *    Reserved for future use.
  *  0xffff82dffc000000 - 0xffff82dfffffffff [64MB,  2^26 bytes, PML4:261]
  *    Super-page information array.
@@ -201,18 +204,24 @@ extern unsigned char boot_edid_info[128]
 /* Slot 259: linear page table (shadow table). */
 #define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
 #define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 260: per-domain mappings. */
+/* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
-#define PERDOMAIN_VIRT_END      (PERDOMAIN_VIRT_START + (PERDOMAIN_MBYTES<<20))
-#define PERDOMAIN_MBYTES        (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOTS         2
+#define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
+                                 (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 261: machine-to-phys conversion table (256GB). */
 #define RDWR_MPT_VIRT_START     (PML4_ADDR(261))
 #define RDWR_MPT_VIRT_END       (RDWR_MPT_VIRT_START + MPT_VIRT_SIZE)
 /* Slot 261: vmap()/ioremap()/fixmap area (16GB). */
 #define VMAP_VIRT_START         RDWR_MPT_VIRT_END
 #define VMAP_VIRT_END           (VMAP_VIRT_START + GB(16))
+/* Slot 261: global domain page map area (1GB). */
+#define GLOBALMAP_GBYTES        1
+#define GLOBALMAP_VIRT_START    VMAP_VIRT_END
+#define GLOBALMAP_VIRT_END      (GLOBALMAP_VIRT_START + (GLOBALMAP_GBYTES<<30))
 /* Slot 261: compatibility machine-to-phys conversion table (1GB). */
-#define RDWR_COMPAT_MPT_VIRT_START VMAP_VIRT_END
+#define RDWR_COMPAT_MPT_VIRT_START GLOBALMAP_VIRT_END
 #define RDWR_COMPAT_MPT_VIRT_END (RDWR_COMPAT_MPT_VIRT_START + GB(1))
 /* Slot 261: high read-only compat machine-to-phys conversion table (1GB). */
 #define HIRO_COMPAT_MPT_VIRT_START RDWR_COMPAT_MPT_VIRT_END
@@ -279,9 +288,9 @@ extern unsigned long xen_phys_start;
 /* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
 #define GDT_LDT_VCPU_SHIFT       5
 #define GDT_LDT_VCPU_VA_SHIFT    (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
-#define GDT_LDT_MBYTES           PERDOMAIN_MBYTES
+#define GDT_LDT_MBYTES           PERDOMAIN_SLOT_MBYTES
 #define MAX_VIRT_CPUS            (GDT_LDT_MBYTES << (20-GDT_LDT_VCPU_VA_SHIFT))
-#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_START
+#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_SLOT(0)
 #define GDT_LDT_VIRT_END         (GDT_LDT_VIRT_START + (GDT_LDT_MBYTES << 20))
 
 /* The address of a particular VCPU's GDT or LDT. */
@@ -290,8 +299,16 @@ extern unsigned long xen_phys_start;
 #define LDT_VIRT_START(v)    \
     (GDT_VIRT_START(v) + (64*1024))
 
+/* map_domain_page() map cache. The last per-domain-mapping sub-area. */
+#define MAPCACHE_VCPU_ENTRIES    (CONFIG_PAGING_LEVELS * CONFIG_PAGING_LEVELS)
+#define MAPCACHE_ENTRIES         (MAX_VIRT_CPUS * MAPCACHE_VCPU_ENTRIES)
+#define MAPCACHE_SLOT            (PERDOMAIN_SLOTS - 1)
+#define MAPCACHE_VIRT_START      PERDOMAIN_VIRT_SLOT(MAPCACHE_SLOT)
+#define MAPCACHE_VIRT_END        (MAPCACHE_VIRT_START + \
+                                  MAPCACHE_ENTRIES * PAGE_SIZE)
+
 #define PDPT_L1_ENTRIES       \
-    ((PERDOMAIN_VIRT_END - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
+    ((PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS - 1) - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
 #define PDPT_L2_ENTRIES       \
     ((PDPT_L1_ENTRIES + (1 << PAGETABLE_ORDER) - 1) >> PAGETABLE_ORDER)
 
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -39,7 +39,7 @@ struct trap_bounce {
 
 #define MAPHASH_ENTRIES 8
 #define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1))
-#define MAPHASHENT_NOTINUSE ((u16)~0U)
+#define MAPHASHENT_NOTINUSE ((u32)~0U)
 struct mapcache_vcpu {
     /* Shadow of mapcache_domain.epoch. */
     unsigned int shadow_epoch;
@@ -47,16 +47,15 @@ struct mapcache_vcpu {
     /* Lock-free per-VCPU hash of recently-used mappings. */
     struct vcpu_maphash_entry {
         unsigned long mfn;
-        uint16_t      idx;
-        uint16_t      refcnt;
+        uint32_t      idx;
+        uint32_t      refcnt;
     } hash[MAPHASH_ENTRIES];
 };
 
-#define MAPCACHE_ORDER   10
-#define MAPCACHE_ENTRIES (1 << MAPCACHE_ORDER)
 struct mapcache_domain {
     /* The PTEs that provide the mappings, and a cursor into the array. */
-    l1_pgentry_t *l1tab;
+    l1_pgentry_t **l1tab;
+    unsigned int entries;
     unsigned int cursor;
 
     /* Protects map_domain_page(). */
@@ -67,12 +66,13 @@ struct mapcache_domain {
     u32 tlbflush_timestamp;
 
     /* Which mappings are in use, and which are garbage to reap next epoch? */
-    unsigned long inuse[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
-    unsigned long garbage[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
+    unsigned long *inuse;
+    unsigned long *garbage;
 };
 
-void mapcache_domain_init(struct domain *);
-void mapcache_vcpu_init(struct vcpu *);
+int mapcache_domain_init(struct domain *);
+void mapcache_domain_exit(struct domain *);
+int mapcache_vcpu_init(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);
@@ -229,6 +229,9 @@ struct pv_domain
      * unmask the event channel */
     bool_t auto_unmask;
 
+    /* map_domain_page() mapping cache. */
+    struct mapcache_domain mapcache;
+
     /* Pseudophysical e820 map (XENMEM_memory_map).  */
     spinlock_t e820_lock;
     struct e820entry *e820;
@@ -238,7 +241,7 @@ struct pv_domain
 struct arch_domain
 {
     struct page_info **mm_perdomain_pt_pages;
-    l2_pgentry_t *mm_perdomain_l2;
+    l2_pgentry_t *mm_perdomain_l2[PERDOMAIN_SLOTS];
     l3_pgentry_t *mm_perdomain_l3;
 
     unsigned int hv_compat_vstart;
@@ -324,6 +327,9 @@ struct arch_domain
 
 struct pv_vcpu
 {
+    /* map_domain_page() mapping cache. */
+    struct mapcache_vcpu mapcache;
+
     struct trap_info *trap_ctxt;
 
     unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
--- a/xen/include/xen/domain_page.h
+++ b/xen/include/xen/domain_page.h
@@ -25,11 +25,16 @@ void *map_domain_page(unsigned long mfn)
  */
 void unmap_domain_page(const void *va);
 
+/*
+ * Clear a given page frame, or copy between two of them.
+ */
+void clear_domain_page(unsigned long mfn);
+void copy_domain_page(unsigned long dmfn, unsigned long smfn);
 
 /* 
  * Given a VA from map_domain_page(), return its underlying MFN.
  */
-unsigned long domain_page_map_to_mfn(void *va);
+unsigned long domain_page_map_to_mfn(const void *va);
 
 /*
  * Similar to the above calls, except the mapping is accessible in all
@@ -107,6 +112,9 @@ domain_mmap_cache_destroy(struct domain_
 #define map_domain_page(mfn)                mfn_to_virt(mfn)
 #define __map_domain_page(pg)               page_to_virt(pg)
 #define unmap_domain_page(va)               ((void)(va))
+#define clear_domain_page(mfn)              clear_page(mfn_to_virt(mfn))
+#define copy_domain_page(dmfn, smfn)        copy_page(mfn_to_virt(dmfn), \
+                                                      mfn_to_virt(smfn))
 #define domain_page_map_to_mfn(va)          virt_to_mfn((unsigned long)(va))
 
 #define map_domain_page_global(mfn)         mfn_to_virt(mfn)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 04/11] x86: properly use map_domain_page() when building Dom0
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
  2013-01-22 10:50 ` [PATCH 02/11] x86: extend frame table virtual space Jan Beulich
  2013-01-22 10:50 ` [PATCH 03/11] x86: re-introduce map_domain_page() et al Jan Beulich
@ 2013-01-22 10:51 ` Jan Beulich
  2013-01-22 10:52 ` [PATCH 05/11] x86: consolidate initialization of PV guest L4 page tables Jan Beulich
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:51 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 7782 bytes --]

This requires a minor hack to allow the correct page tables to be used
while running on Dom0's page tables (as they can't be determined from
"current" at that time).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
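
For reference, the resulting call pattern in construct_dom0(), condensed
into a short sketch (an illustration only, not an additional hunk; all
identifiers are taken from the hunks that follow):

    /* Switch onto dom0's page tables; "current" still points at the idle
     * vCPU, so tell the mapcache explicitly whose per-domain area to use. */
    write_ptbase(v);
    mapcache_override_current(v);

    /* Guest page tables are now reached via map_domain_page(). */
    l4start = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
    /* ... walk and extend the initial P->M table mappings ... */
    unmap_domain_page(l4start);

    /* Drop the override before returning to the idle page tables. */
    mapcache_override_current(NULL);
    write_ptbase(current);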

--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -621,8 +621,10 @@ int __init construct_dom0(
         maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
     }
-    copy_page(l4tab, idle_pg_table);
-    l4tab[0] = l4e_empty(); /* zap trampoline mapping */
+    clear_page(l4tab);
+    for ( i = l4_table_offset(HYPERVISOR_VIRT_START);
+          i < l4_table_offset(HYPERVISOR_VIRT_END); ++i )
+        l4tab[i] = idle_pg_table[i];
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_paddr(__pa(l4start), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
@@ -766,6 +768,7 @@ int __init construct_dom0(
 
     /* We run on dom0's page tables for the final part of the build process. */
     write_ptbase(v);
+    mapcache_override_current(v);
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest = (void*)vkern_start;
@@ -782,6 +785,7 @@ int __init construct_dom0(
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
+            mapcache_override_current(NULL);
             write_ptbase(current);
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -1;
@@ -811,6 +815,10 @@ int __init construct_dom0(
              elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
 
     count = d->tot_pages;
+    l4start = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+    l3tab = NULL;
+    l2tab = NULL;
+    l1tab = NULL;
     /* Set up the phys->machine table if not part of the initial mapping. */
     if ( parms.p2m_base != UNSET_ADDR )
     {
@@ -825,6 +833,21 @@ int __init construct_dom0(
                                  >> PAGE_SHIFT) + 3 > nr_pages )
                 panic("Dom0 allocation too small for initial P->M table.\n");
 
+            if ( l1tab )
+            {
+                unmap_domain_page(l1tab);
+                l1tab = NULL;
+            }
+            if ( l2tab )
+            {
+                unmap_domain_page(l2tab);
+                l2tab = NULL;
+            }
+            if ( l3tab )
+            {
+                unmap_domain_page(l3tab);
+                l3tab = NULL;
+            }
             l4tab = l4start + l4_table_offset(va);
             if ( !l4e_get_intpte(*l4tab) )
             {
@@ -835,10 +858,11 @@ int __init construct_dom0(
                 page->count_info = PGC_allocated | 2;
                 page->u.inuse.type_info =
                     PGT_l3_page_table | PGT_validated | 1;
-                clear_page(page_to_virt(page));
+                l3tab = __map_domain_page(page);
+                clear_page(l3tab);
                 *l4tab = l4e_from_page(page, L4_PROT);
-            }
-            l3tab = page_to_virt(l4e_get_page(*l4tab));
+            } else
+                l3tab = map_domain_page(l4e_get_pfn(*l4tab));
             l3tab += l3_table_offset(va);
             if ( !l3e_get_intpte(*l3tab) )
             {
@@ -857,17 +881,16 @@ int __init construct_dom0(
                 }
                 if ( (page = alloc_domheap_page(d, 0)) == NULL )
                     break;
-                else
-                {
-                    /* No mapping, PGC_allocated + page-table page. */
-                    page->count_info = PGC_allocated | 2;
-                    page->u.inuse.type_info =
-                        PGT_l2_page_table | PGT_validated | 1;
-                    clear_page(page_to_virt(page));
-                    *l3tab = l3e_from_page(page, L3_PROT);
-                }
+                /* No mapping, PGC_allocated + page-table page. */
+                page->count_info = PGC_allocated | 2;
+                page->u.inuse.type_info =
+                    PGT_l2_page_table | PGT_validated | 1;
+                l2tab = __map_domain_page(page);
+                clear_page(l2tab);
+                *l3tab = l3e_from_page(page, L3_PROT);
             }
-            l2tab = page_to_virt(l3e_get_page(*l3tab));
+            else
+               l2tab = map_domain_page(l3e_get_pfn(*l3tab));
             l2tab += l2_table_offset(va);
             if ( !l2e_get_intpte(*l2tab) )
             {
@@ -887,17 +910,16 @@ int __init construct_dom0(
                 }
                 if ( (page = alloc_domheap_page(d, 0)) == NULL )
                     break;
-                else
-                {
-                    /* No mapping, PGC_allocated + page-table page. */
-                    page->count_info = PGC_allocated | 2;
-                    page->u.inuse.type_info =
-                        PGT_l1_page_table | PGT_validated | 1;
-                    clear_page(page_to_virt(page));
-                    *l2tab = l2e_from_page(page, L2_PROT);
-                }
+                /* No mapping, PGC_allocated + page-table page. */
+                page->count_info = PGC_allocated | 2;
+                page->u.inuse.type_info =
+                    PGT_l1_page_table | PGT_validated | 1;
+                l1tab = __map_domain_page(page);
+                clear_page(l1tab);
+                *l2tab = l2e_from_page(page, L2_PROT);
             }
-            l1tab = page_to_virt(l2e_get_page(*l2tab));
+            else
+                l1tab = map_domain_page(l2e_get_pfn(*l2tab));
             l1tab += l1_table_offset(va);
             BUG_ON(l1e_get_intpte(*l1tab));
             page = alloc_domheap_page(d, 0);
@@ -911,6 +933,14 @@ int __init construct_dom0(
             panic("Not enough RAM for DOM0 P->M table.\n");
     }
 
+    if ( l1tab )
+        unmap_domain_page(l1tab);
+    if ( l2tab )
+        unmap_domain_page(l2tab);
+    if ( l3tab )
+        unmap_domain_page(l3tab);
+    unmap_domain_page(l4start);
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
@@ -1000,6 +1030,7 @@ int __init construct_dom0(
         xlat_start_info(si, XLAT_start_info_console_dom0);
 
     /* Return to idle domain's page tables. */
+    mapcache_override_current(NULL);
     write_ptbase(current);
 
     update_domain_wallclock_time(d);
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -15,10 +15,12 @@
 #include <asm/flushtlb.h>
 #include <asm/hardirq.h>
 
+static struct vcpu *__read_mostly override;
+
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
     /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = current;
+    struct vcpu *v = override ?: current;
 
     /*
      * When current isn't properly set up yet, this is equivalent to
@@ -44,6 +46,11 @@ static inline struct vcpu *mapcache_curr
     return v;
 }
 
+void __init mapcache_override_current(struct vcpu *v)
+{
+    override = v;
+}
+
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
 #define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
 #define DCACHE_L1ENT(dc, idx) \
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -73,6 +73,7 @@ struct mapcache_domain {
 int mapcache_domain_init(struct domain *);
 void mapcache_domain_exit(struct domain *);
 int mapcache_vcpu_init(struct vcpu *);
+void mapcache_override_current(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);




^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 05/11] x86: consolidate initialization of PV guest L4 page tables
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (2 preceding siblings ...)
  2013-01-22 10:51 ` [PATCH 04/11] x86: properly use map_domain_page() when building Dom0 Jan Beulich
@ 2013-01-22 10:52 ` Jan Beulich
  2013-01-22 10:53 ` [PATCH 06/11] x86: properly use map_domain_page() during domain creation/destruction Jan Beulich
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:52 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3583 bytes --]

So far this has been repeated in 3 places, requiring one to remember to
update all of them whenever a change is made.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
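
As an illustration (not an additional hunk; all identifiers are taken
from the hunks below), setting up a fresh guest L4 in setup_compat_l4()
now boils down to:

    l4_pgentry_t *l4tab = page_to_virt(pg);

    clear_page(l4tab);
    /* Copies the Xen slots and installs the linear-PT and per-domain
     * entries in a single place. */
    init_guest_l4_table(l4tab, v->domain);

    v->arch.guest_table = pagetable_from_page(pg);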

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -290,13 +290,8 @@ static int setup_compat_l4(struct vcpu *
     pg->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
 
     l4tab = page_to_virt(pg);
-    copy_page(l4tab, idle_pg_table);
-    l4tab[0] = l4e_empty();
-    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_page(pg, __PAGE_HYPERVISOR);
-    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_paddr(__pa(v->domain->arch.mm_perdomain_l3),
-                       __PAGE_HYPERVISOR);
+    clear_page(l4tab);
+    init_guest_l4_table(l4tab, v->domain);
 
     v->arch.guest_table = pagetable_from_page(pg);
     v->arch.guest_table_user = v->arch.guest_table;
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -622,13 +622,7 @@ int __init construct_dom0(
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
     }
     clear_page(l4tab);
-    for ( i = l4_table_offset(HYPERVISOR_VIRT_START);
-          i < l4_table_offset(HYPERVISOR_VIRT_END); ++i )
-        l4tab[i] = idle_pg_table[i];
-    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_paddr(__pa(l4start), __PAGE_HYPERVISOR);
-    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_paddr(__pa(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+    init_guest_l4_table(l4tab, d);
     v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
     if ( is_pv_32on64_domain(d) )
         v->arch.guest_table_user = v->arch.guest_table;
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1315,6 +1315,18 @@ static int alloc_l3_table(struct page_in
     return rc > 0 ? 0 : rc;
 }
 
+void init_guest_l4_table(l4_pgentry_t l4tab[], const struct domain *d)
+{
+    /* Xen private mappings. */
+    memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
+           &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
+           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
+    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
+        l4e_from_pfn(virt_to_mfn(l4tab), __PAGE_HYPERVISOR);
+    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
+        l4e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+}
+
 static int alloc_l4_table(struct page_info *page, int preemptible)
 {
     struct domain *d = page_get_owner(page);
@@ -1358,15 +1370,7 @@ static int alloc_l4_table(struct page_in
         adjust_guest_l4e(pl4e[i], d);
     }
 
-    /* Xen private mappings. */
-    memcpy(&pl4e[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
-    pl4e[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_pfn(pfn, __PAGE_HYPERVISOR);
-    pl4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(virt_to_page(d->arch.mm_perdomain_l3),
-                      __PAGE_HYPERVISOR);
+    init_guest_l4_table(pl4e, d);
 
     return rc > 0 ? 0 : rc;
 }
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -316,6 +316,8 @@ static inline void *__page_to_virt(const
 int free_page_type(struct page_info *page, unsigned long type,
                    int preemptible);
 
+void init_guest_l4_table(l4_pgentry_t[], const struct domain *);
+
 int is_iomem_page(unsigned long mfn);
 
 void clear_superpage_mark(struct page_info *page);





^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 06/11] x86: properly use map_domain_page() during domain creation/destruction
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (3 preceding siblings ...)
  2013-01-22 10:52 ` [PATCH 05/11] x86: consolidate initialization of PV guest L4 page tables Jan Beulich
@ 2013-01-22 10:53 ` Jan Beulich
  2013-01-22 10:55 ` [PATCH 07/11] x86: properly use map_domain_page() during page table manipulation Jan Beulich
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:53 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 16745 bytes --]

This involves no longer storing virtual addresses of the per-domain
mapping L2 and L3 page tables.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
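
The recurring pattern introduced here, shown as a sketch rather than an
extra hunk (all identifiers are taken from the hunks below), is to keep
only the struct page_info pointers and map them on demand:

    l3_pgentry_t *l3tab = __map_domain_page(d->arch.perdomain_l3_pg);

    l3tab[l3_table_offset(PERDOMAIN_VIRT_START)] =
        l3e_from_page(d->arch.perdomain_l2_pg[0], __PAGE_HYPERVISOR);
    unmap_domain_page(l3tab);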

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -289,9 +289,10 @@ static int setup_compat_l4(struct vcpu *
     /* This page needs to look like a pagetable so that it can be shadowed */
     pg->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
 
-    l4tab = page_to_virt(pg);
+    l4tab = __map_domain_page(pg);
     clear_page(l4tab);
     init_guest_l4_table(l4tab, v->domain);
+    unmap_domain_page(l4tab);
 
     v->arch.guest_table = pagetable_from_page(pg);
     v->arch.guest_table_user = v->arch.guest_table;
@@ -383,17 +384,22 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.flags = TF_kernel_mode;
 
-    idx = perdomain_pt_pgidx(v);
-    if ( !perdomain_pt_page(d, idx) )
+    idx = perdomain_pt_idx(v);
+    if ( !d->arch.perdomain_pts[idx] )
     {
-        struct page_info *pg;
-        pg = alloc_domheap_page(NULL, MEMF_node(vcpu_to_node(v)));
-        if ( !pg )
+        void *pt;
+        l2_pgentry_t *l2tab;
+
+        pt = alloc_xenheap_pages(0, MEMF_node(vcpu_to_node(v)));
+        if ( !pt )
             return -ENOMEM;
-        clear_page(page_to_virt(pg));
-        perdomain_pt_page(d, idx) = pg;
-        d->arch.mm_perdomain_l2[0][l2_table_offset(PERDOMAIN_VIRT_START)+idx]
-            = l2e_from_page(pg, __PAGE_HYPERVISOR);
+        clear_page(pt);
+        d->arch.perdomain_pts[idx] = pt;
+
+        l2tab = __map_domain_page(d->arch.perdomain_l2_pg[0]);
+        l2tab[l2_table_offset(PERDOMAIN_VIRT_START) + idx]
+            = l2e_from_paddr(__pa(pt), __PAGE_HYPERVISOR);
+        unmap_domain_page(l2tab);
     }
 
     rc = mapcache_vcpu_init(v);
@@ -484,6 +490,7 @@ void vcpu_destroy(struct vcpu *v)
 int arch_domain_create(struct domain *d, unsigned int domcr_flags)
 {
     struct page_info *pg;
+    l3_pgentry_t *l3tab;
     int i, paging_initialised = 0;
     int rc = -ENOMEM;
 
@@ -514,28 +521,29 @@ int arch_domain_create(struct domain *d,
                d->domain_id);
     }
 
-    BUILD_BUG_ON(PDPT_L2_ENTRIES * sizeof(*d->arch.mm_perdomain_pt_pages)
+    BUILD_BUG_ON(PDPT_L2_ENTRIES * sizeof(*d->arch.perdomain_pts)
                  != PAGE_SIZE);
-    pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
-    if ( !pg )
+    d->arch.perdomain_pts =
+        alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
+    if ( !d->arch.perdomain_pts )
         goto fail;
-    d->arch.mm_perdomain_pt_pages = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_pt_pages);
+    clear_page(d->arch.perdomain_pts);
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l2[0] = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l2[0]);
+    d->arch.perdomain_l2_pg[0] = pg;
+    clear_domain_page(page_to_mfn(pg));
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l3 = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l3);
-    d->arch.mm_perdomain_l3[l3_table_offset(PERDOMAIN_VIRT_START)] =
-        l3e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l2[0]),
-                     __PAGE_HYPERVISOR);
+    d->arch.perdomain_l3_pg = pg;
+    l3tab = __map_domain_page(pg);
+    clear_page(l3tab);
+    l3tab[l3_table_offset(PERDOMAIN_VIRT_START)] =
+        l3e_from_page(d->arch.perdomain_l2_pg[0], __PAGE_HYPERVISOR);
+    unmap_domain_page(l3tab);
 
     mapcache_domain_init(d);
 
@@ -611,12 +619,12 @@ int arch_domain_create(struct domain *d,
     if ( paging_initialised )
         paging_final_teardown(d);
     mapcache_domain_exit(d);
-    if ( d->arch.mm_perdomain_l2[0] )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
-    if ( d->arch.mm_perdomain_l3 )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
-    if ( d->arch.mm_perdomain_pt_pages )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
+    for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
+        if ( d->arch.perdomain_l2_pg[i] )
+            free_domheap_page(d->arch.perdomain_l2_pg[i]);
+    if ( d->arch.perdomain_l3_pg )
+        free_domheap_page(d->arch.perdomain_l3_pg);
+    free_xenheap_page(d->arch.perdomain_pts);
     return rc;
 }
 
@@ -638,13 +646,12 @@ void arch_domain_destroy(struct domain *
     mapcache_domain_exit(d);
 
     for ( i = 0; i < PDPT_L2_ENTRIES; ++i )
-    {
-        if ( perdomain_pt_page(d, i) )
-            free_domheap_page(perdomain_pt_page(d, i));
-    }
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
+        free_xenheap_page(d->arch.perdomain_pts[i]);
+    free_xenheap_page(d->arch.perdomain_pts);
+    for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
+        if ( d->arch.perdomain_l2_pg[i] )
+            free_domheap_page(d->arch.perdomain_l2_pg[i]);
+    free_domheap_page(d->arch.perdomain_l3_pg);
 
     free_xenheap_page(d->shared_info);
     cleanup_domain_irq_mapping(d);
@@ -810,9 +817,10 @@ int arch_set_info_guest(
                 fail |= xen_pfn_to_cr3(pfn) != c.nat->ctrlreg[1];
             }
         } else {
-            l4_pgentry_t *l4tab = __va(pfn_to_paddr(pfn));
+            l4_pgentry_t *l4tab = map_domain_page(pfn);
 
             pfn = l4e_get_pfn(*l4tab);
+            unmap_domain_page(l4tab);
             fail = compat_pfn_to_cr3(pfn) != c.cmp->ctrlreg[3];
         }
 
@@ -951,9 +959,10 @@ int arch_set_info_guest(
             return -EINVAL;
         }
 
-        l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+        l4tab = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
         *l4tab = l4e_from_pfn(page_to_mfn(cr3_page),
             _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
+        unmap_domain_page(l4tab);
     }
 
     if ( v->vcpu_id == 0 )
@@ -1971,12 +1980,13 @@ static int relinquish_memory(
 static void vcpu_destroy_pagetables(struct vcpu *v)
 {
     struct domain *d = v->domain;
-    unsigned long pfn;
+    unsigned long pfn = pagetable_get_pfn(v->arch.guest_table);
 
     if ( is_pv_32on64_vcpu(v) )
     {
-        pfn = l4e_get_pfn(*(l4_pgentry_t *)
-                          __va(pagetable_get_paddr(v->arch.guest_table)));
+        l4_pgentry_t *l4tab = map_domain_page(pfn);
+
+        pfn = l4e_get_pfn(*l4tab);
 
         if ( pfn != 0 )
         {
@@ -1986,15 +1996,12 @@ static void vcpu_destroy_pagetables(stru
                 put_page_and_type(mfn_to_page(pfn));
         }
 
-        l4e_write(
-            (l4_pgentry_t *)__va(pagetable_get_paddr(v->arch.guest_table)),
-            l4e_empty());
+        l4e_write(l4tab, l4e_empty());
 
         v->arch.cr3 = 0;
         return;
     }
 
-    pfn = pagetable_get_pfn(v->arch.guest_table);
     if ( pfn != 0 )
     {
         if ( paging_mode_refcounts(d) )
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -241,6 +241,8 @@ void copy_domain_page(unsigned long dmfn
 int mapcache_domain_init(struct domain *d)
 {
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    l3_pgentry_t *l3tab;
+    l2_pgentry_t *l2tab;
     unsigned int i, bitmap_pages, memf = MEMF_node(domain_to_node(d));
     unsigned long *end;
 
@@ -251,14 +253,18 @@ int mapcache_domain_init(struct domain *
         return 0;
 
     dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
-    d->arch.mm_perdomain_l2[MAPCACHE_SLOT] = alloc_xenheap_pages(0, memf);
-    if ( !dcache->l1tab || !d->arch.mm_perdomain_l2[MAPCACHE_SLOT] )
+    d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf);
+    if ( !dcache->l1tab || !d->arch.perdomain_l2_pg[MAPCACHE_SLOT] )
         return -ENOMEM;
 
-    clear_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
-    d->arch.mm_perdomain_l3[l3_table_offset(MAPCACHE_VIRT_START)] =
-        l3e_from_paddr(__pa(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]),
-                       __PAGE_HYPERVISOR);
+    clear_domain_page(page_to_mfn(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]));
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    l3tab[l3_table_offset(MAPCACHE_VIRT_START)] =
+        l3e_from_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT],
+                      __PAGE_HYPERVISOR);
+    unmap_domain_page(l3tab);
+
+    l2tab = __map_domain_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]);
 
     BUILD_BUG_ON(MAPCACHE_VIRT_END + 3 +
                  2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)) >
@@ -275,12 +281,16 @@ int mapcache_domain_init(struct domain *
         ASSERT(i <= MAPCACHE_L2_ENTRIES);
         dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
         if ( !dcache->l1tab[i] )
+        {
+            unmap_domain_page(l2tab);
             return -ENOMEM;
+        }
         clear_page(dcache->l1tab[i]);
-        d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
-            l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+        l2tab[i] = l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
     }
 
+    unmap_domain_page(l2tab);
+
     spin_lock_init(&dcache->lock);
 
     return 0;
@@ -315,19 +325,21 @@ void mapcache_domain_exit(struct domain 
 
         xfree(dcache->l1tab);
     }
-    free_xenheap_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
 }
 
 int mapcache_vcpu_init(struct vcpu *v)
 {
     struct domain *d = v->domain;
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    l2_pgentry_t *l2tab;
     unsigned long i;
     unsigned int memf = MEMF_node(vcpu_to_node(v));
 
     if ( is_hvm_vcpu(v) || !dcache->l1tab )
         return 0;
 
+    l2tab = __map_domain_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]);
+
     while ( dcache->entries < d->max_vcpus * MAPCACHE_VCPU_ENTRIES )
     {
         unsigned int ents = dcache->entries + MAPCACHE_VCPU_ENTRIES;
@@ -338,10 +350,13 @@ int mapcache_vcpu_init(struct vcpu *v)
         {
             dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
             if ( !dcache->l1tab[i] )
+            {
+                unmap_domain_page(l2tab);
                 return -ENOMEM;
+            }
             clear_page(dcache->l1tab[i]);
-            d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
-                l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+            l2tab[i] = l2e_from_paddr(__pa(dcache->l1tab[i]),
+                                      __PAGE_HYPERVISOR);
         }
 
         /* Populate bit maps. */
@@ -351,18 +366,22 @@ int mapcache_vcpu_init(struct vcpu *v)
         {
             struct page_info *pg = alloc_domheap_page(NULL, memf);
 
+            if ( pg )
+            {
+                clear_domain_page(page_to_mfn(pg));
+                *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+                pg = alloc_domheap_page(NULL, memf);
+            }
             if ( !pg )
+            {
+                unmap_domain_page(l2tab);
                 return -ENOMEM;
-            clear_domain_page(page_to_mfn(pg));
-            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+            }
 
             i = (unsigned long)(dcache->garbage + BITS_TO_LONGS(ents));
             pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
             ASSERT(!l1e_get_flags(*pl1e));
 
-            pg = alloc_domheap_page(NULL, memf);
-            if ( !pg )
-                return -ENOMEM;
             clear_domain_page(page_to_mfn(pg));
             *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
         }
@@ -370,6 +389,8 @@ int mapcache_vcpu_init(struct vcpu *v)
         dcache->entries = ents;
     }
 
+    unmap_domain_page(l2tab);
+
     /* Mark all maphash entries as not in use. */
     BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
     for ( i = 0; i < MAPHASH_ENTRIES; i++ )
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1322,9 +1322,9 @@ void init_guest_l4_table(l4_pgentry_t l4
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_pfn(virt_to_mfn(l4tab), __PAGE_HYPERVISOR);
+        l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR);
 }
 
 static int alloc_l4_table(struct page_info *page, int preemptible)
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -369,7 +369,7 @@ static void hap_install_xen_entries_in_l
 
     /* Install the per-domain mappings for this domain */
     l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_pfn(mfn_x(page_to_mfn(virt_to_page(d->arch.mm_perdomain_l3))),
+        l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
                      __PAGE_HYPERVISOR);
 
     /* Install a linear mapping */
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1449,7 +1449,7 @@ void sh_install_xen_entries_in_l4(struct
 
     /* Install the per-domain mappings for this domain */
     sl4e[shadow_l4_table_offset(PERDOMAIN_VIRT_START)] =
-        shadow_l4e_from_mfn(page_to_mfn(virt_to_page(d->arch.mm_perdomain_l3)),
+        shadow_l4e_from_mfn(page_to_mfn(d->arch.perdomain_l3_pg),
                             __PAGE_HYPERVISOR);
 
     /* Shadow linear mapping for 4-level shadows.  N.B. for 3-level
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -823,9 +823,8 @@ void __init setup_idle_pagetable(void)
 {
     /* Install per-domain mappings for idle domain. */
     l4e_write(&idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)],
-              l4e_from_page(
-                  virt_to_page(idle_vcpu[0]->domain->arch.mm_perdomain_l3),
-                  __PAGE_HYPERVISOR));
+              l4e_from_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg,
+                            __PAGE_HYPERVISOR));
 }
 
 void __init zap_low_mappings(void)
@@ -850,21 +849,18 @@ void *compat_arg_xlat_virt_base(void)
 int setup_compat_arg_xlat(struct vcpu *v)
 {
     unsigned int order = get_order_from_bytes(COMPAT_ARG_XLAT_SIZE);
-    struct page_info *pg;
 
-    pg = alloc_domheap_pages(NULL, order, 0);
-    if ( pg == NULL )
-        return -ENOMEM;
+    v->arch.compat_arg_xlat = alloc_xenheap_pages(order,
+                                                  MEMF_node(vcpu_to_node(v)));
 
-    v->arch.compat_arg_xlat = page_to_virt(pg);
-    return 0;
+    return v->arch.compat_arg_xlat ? 0 : -ENOMEM;
 }
 
 void free_compat_arg_xlat(struct vcpu *v)
 {
     unsigned int order = get_order_from_bytes(COMPAT_ARG_XLAT_SIZE);
-    if ( v->arch.compat_arg_xlat != NULL )
-        free_domheap_pages(virt_to_page(v->arch.compat_arg_xlat), order);
+
+    free_xenheap_pages(v->arch.compat_arg_xlat, order);
     v->arch.compat_arg_xlat = NULL;
 }
 
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -241,9 +241,9 @@ struct pv_domain
 
 struct arch_domain
 {
-    struct page_info **mm_perdomain_pt_pages;
-    l2_pgentry_t *mm_perdomain_l2[PERDOMAIN_SLOTS];
-    l3_pgentry_t *mm_perdomain_l3;
+    void **perdomain_pts;
+    struct page_info *perdomain_l2_pg[PERDOMAIN_SLOTS];
+    struct page_info *perdomain_l3_pg;
 
     unsigned int hv_compat_vstart;
 
@@ -318,13 +318,11 @@ struct arch_domain
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 #define has_arch_mmios(d)    (!rangeset_is_empty((d)->iomem_caps))
 
-#define perdomain_pt_pgidx(v) \
+#define perdomain_pt_idx(v) \
       ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT))
 #define perdomain_ptes(d, v) \
-    ((l1_pgentry_t *)page_to_virt((d)->arch.mm_perdomain_pt_pages \
-      [perdomain_pt_pgidx(v)]) + (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & \
-                                  (L1_PAGETABLE_ENTRIES - 1)))
-#define perdomain_pt_page(d, n) ((d)->arch.mm_perdomain_pt_pages[n])
+    ((l1_pgentry_t *)(d)->arch.perdomain_pts[perdomain_pt_idx(v)] + \
+     (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & (L1_PAGETABLE_ENTRIES - 1)))
 
 struct pv_vcpu
 {



[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 07/11] x86: properly use map_domain_page() during page table manipulation
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (4 preceding siblings ...)
  2013-01-22 10:53 ` [PATCH 06/11] x86: properly use map_domain_page() during domain creation/destruction Jan Beulich
@ 2013-01-22 10:55 ` Jan Beulich
  2013-01-22 10:55 ` [PATCH 08/11] x86: properly use map_domain_page() in nested HVM code Jan Beulich
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:55 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 12317 bytes --]

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -98,8 +98,9 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
 
     if ( pgd3val == 0 )
     {
-        l4t = mfn_to_virt(mfn);
+        l4t = map_domain_page(mfn);
         l4e = l4t[l4_table_offset(vaddr)];
+        unmap_domain_page(l4t);
         mfn = l4e_get_pfn(l4e);
         DBGP2("l4t:%p l4to:%lx l4e:%lx mfn:%lx\n", l4t, 
               l4_table_offset(vaddr), l4e, mfn);
@@ -109,20 +110,23 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
             return INVALID_MFN;
         }
 
-        l3t = mfn_to_virt(mfn);
+        l3t = map_domain_page(mfn);
         l3e = l3t[l3_table_offset(vaddr)];
+        unmap_domain_page(l3t);
         mfn = l3e_get_pfn(l3e);
         DBGP2("l3t:%p l3to:%lx l3e:%lx mfn:%lx\n", l3t, 
               l3_table_offset(vaddr), l3e, mfn);
-        if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
+        if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
+             (l3e_get_flags(l3e) & _PAGE_PSE) )
         {
             DBGP1("l3 PAGE not present. vaddr:%lx cr3:%lx\n", vaddr, cr3);
             return INVALID_MFN;
         }
     }
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(vaddr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     DBGP2("l2t:%p l2to:%lx l2e:%lx mfn:%lx\n", l2t, l2_table_offset(vaddr),
           l2e, mfn);
@@ -132,8 +136,9 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
         DBGP1("l2 PAGE not present. vaddr:%lx cr3:%lx\n", vaddr, cr3);
         return INVALID_MFN;
     }
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(vaddr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     DBGP2("l1t:%p l1to:%lx l1e:%lx mfn:%lx\n", l1t, l1_table_offset(vaddr),
           l1e, mfn);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1331,7 +1331,7 @@ static int alloc_l4_table(struct page_in
 {
     struct domain *d = page_get_owner(page);
     unsigned long  pfn = page_to_mfn(page);
-    l4_pgentry_t  *pl4e = page_to_virt(page);
+    l4_pgentry_t  *pl4e = map_domain_page(pfn);
     unsigned int   i;
     int            rc = 0, partial = page->partial_pte;
 
@@ -1365,12 +1365,16 @@ static int alloc_l4_table(struct page_in
                     put_page_from_l4e(pl4e[i], pfn, 0, 0);
         }
         if ( rc < 0 )
+        {
+            unmap_domain_page(pl4e);
             return rc;
+        }
 
         adjust_guest_l4e(pl4e[i], d);
     }
 
     init_guest_l4_table(pl4e, d);
+    unmap_domain_page(pl4e);
 
     return rc > 0 ? 0 : rc;
 }
@@ -1464,7 +1468,7 @@ static int free_l4_table(struct page_inf
 {
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
-    l4_pgentry_t *pl4e = page_to_virt(page);
+    l4_pgentry_t *pl4e = map_domain_page(pfn);
     int rc = 0, partial = page->partial_pte;
     unsigned int  i = page->nr_validated_ptes - !partial;
 
@@ -1487,6 +1491,9 @@ static int free_l4_table(struct page_inf
         page->partial_pte = 0;
         rc = -EAGAIN;
     }
+
+    unmap_domain_page(pl4e);
+
     return rc > 0 ? 0 : rc;
 }
 
@@ -4983,15 +4990,23 @@ int mmio_ro_do_page_fault(struct vcpu *v
     return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
 }
 
-void free_xen_pagetable(void *v)
+void *alloc_xen_pagetable(void)
 {
-    if ( system_state == SYS_STATE_early_boot )
-        return;
+    if ( system_state != SYS_STATE_early_boot )
+    {
+        void *ptr = alloc_xenheap_page();
 
-    if ( is_xen_heap_page(virt_to_page(v)) )
+        BUG_ON(!dom0 && !ptr);
+        return ptr;
+    }
+
+    return mfn_to_virt(alloc_boot_pages(1, 1));
+}
+
+void free_xen_pagetable(void *v)
+{
+    if ( system_state != SYS_STATE_early_boot )
         free_xenheap_page(v);
-    else
-        free_domheap_page(virt_to_page(v));
 }
 
 /* Convert to from superpage-mapping flags for map_pages_to_xen(). */
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -180,6 +180,11 @@ static void show_guest_stack(struct vcpu
         printk(" %p", _p(addr));
         stack++;
     }
+    if ( mask == PAGE_SIZE )
+    {
+        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+        unmap_domain_page(stack);
+    }
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
--- a/xen/arch/x86/x86_64/compat/traps.c
+++ b/xen/arch/x86/x86_64/compat/traps.c
@@ -56,6 +56,11 @@ void compat_show_guest_stack(struct vcpu
         printk(" %08x", addr);
         stack++;
     }
+    if ( mask == PAGE_SIZE )
+    {
+        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+        unmap_domain_page(stack);
+    }
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -65,22 +65,6 @@ int __mfn_valid(unsigned long mfn)
                            pdx_group_valid));
 }
 
-void *alloc_xen_pagetable(void)
-{
-    unsigned long mfn;
-
-    if ( system_state != SYS_STATE_early_boot )
-    {
-        struct page_info *pg = alloc_domheap_page(NULL, 0);
-
-        BUG_ON(!dom0 && !pg);
-        return pg ? page_to_virt(pg) : NULL;
-    }
-
-    mfn = alloc_boot_pages(1, 1);
-    return mfn_to_virt(mfn);
-}
-
 l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
 {
     l4_pgentry_t *pl4e;
@@ -154,35 +138,45 @@ void *do_page_walk(struct vcpu *v, unsig
     if ( is_hvm_vcpu(v) )
         return NULL;
 
-    l4t = mfn_to_virt(mfn);
+    l4t = map_domain_page(mfn);
     l4e = l4t[l4_table_offset(addr)];
-    mfn = l4e_get_pfn(l4e);
+    unmap_domain_page(l4t);
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
         return NULL;
 
-    l3t = mfn_to_virt(mfn);
+    l3t = map_l3t_from_l4e(l4e);
     l3e = l3t[l3_table_offset(addr)];
+    unmap_domain_page(l3t);
     mfn = l3e_get_pfn(l3e);
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
     if ( (l3e_get_flags(l3e) & _PAGE_PSE) )
-        return mfn_to_virt(mfn) + (addr & ((1UL << L3_PAGETABLE_SHIFT) - 1));
+    {
+        mfn += PFN_DOWN(addr & ((1UL << L3_PAGETABLE_SHIFT) - 1));
+        goto ret;
+    }
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(addr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
     if ( (l2e_get_flags(l2e) & _PAGE_PSE) )
-        return mfn_to_virt(mfn) + (addr & ((1UL << L2_PAGETABLE_SHIFT) - 1));
+    {
+        mfn += PFN_DOWN(addr & ((1UL << L2_PAGETABLE_SHIFT) - 1));
+        goto ret;
+    }
 
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(addr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
 
-    return mfn_to_virt(mfn) + (addr & ~PAGE_MASK);
+ ret:
+    return map_domain_page(mfn) + (addr & ~PAGE_MASK);
 }
 
 void __init pfn_pdx_hole_setup(unsigned long mask)
@@ -519,10 +513,9 @@ static int setup_compat_m2p_table(struct
 static int setup_m2p_table(struct mem_hotadd_info *info)
 {
     unsigned long i, va, smap, emap;
-    unsigned int n, memflags;
+    unsigned int n;
     l2_pgentry_t *l2_ro_mpt = NULL;
     l3_pgentry_t *l3_ro_mpt = NULL;
-    struct page_info *l2_pg;
     int ret = 0;
 
     ASSERT(l4e_get_flags(idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)])
@@ -560,7 +553,6 @@ static int setup_m2p_table(struct mem_ho
         }
 
         va = RO_MPT_VIRT_START + i * sizeof(*machine_to_phys_mapping);
-        memflags = MEMF_node(phys_to_nid(i << PAGE_SHIFT));
 
         for ( n = 0; n < CNT; ++n)
             if ( mfn_valid(i + n * PDX_GROUP_COUNT) )
@@ -587,19 +579,18 @@ static int setup_m2p_table(struct mem_ho
                   l2_table_offset(va);
             else
             {
-                l2_pg = alloc_domheap_page(NULL, memflags);
-
-                if (!l2_pg)
+                l2_ro_mpt = alloc_xen_pagetable();
+                if ( !l2_ro_mpt )
                 {
                     ret = -ENOMEM;
                     goto error;
                 }
 
-                l2_ro_mpt = page_to_virt(l2_pg);
                 clear_page(l2_ro_mpt);
                 l3e_write(&l3_ro_mpt[l3_table_offset(va)],
-                  l3e_from_page(l2_pg, __PAGE_HYPERVISOR | _PAGE_USER));
-               l2_ro_mpt += l2_table_offset(va);
+                          l3e_from_paddr(__pa(l2_ro_mpt),
+                                         __PAGE_HYPERVISOR | _PAGE_USER));
+                l2_ro_mpt += l2_table_offset(va);
             }
 
             /* NB. Cannot be GLOBAL as shadow_mode_translate reuses this area. */
@@ -762,12 +753,12 @@ void __init paging_init(void)
                  l4_table_offset(HIRO_COMPAT_MPT_VIRT_START));
     l3_ro_mpt = l4e_to_l3e(idle_pg_table[l4_table_offset(
         HIRO_COMPAT_MPT_VIRT_START)]);
-    if ( (l2_pg = alloc_domheap_page(NULL, 0)) == NULL )
+    if ( (l2_ro_mpt = alloc_xen_pagetable()) == NULL )
         goto nomem;
-    compat_idle_pg_table_l2 = l2_ro_mpt = page_to_virt(l2_pg);
+    compat_idle_pg_table_l2 = l2_ro_mpt;
     clear_page(l2_ro_mpt);
     l3e_write(&l3_ro_mpt[l3_table_offset(HIRO_COMPAT_MPT_VIRT_START)],
-              l3e_from_page(l2_pg, __PAGE_HYPERVISOR));
+              l3e_from_paddr(__pa(l2_ro_mpt), __PAGE_HYPERVISOR));
     l2_ro_mpt += l2_table_offset(HIRO_COMPAT_MPT_VIRT_START);
     /* Allocate and map the compatibility mode machine-to-phys table. */
     mpt_size = (mpt_size >> 1) + (1UL << (L2_PAGETABLE_SHIFT - 1));
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -175,8 +175,9 @@ void show_page_walk(unsigned long addr)
 
     printk("Pagetable walk from %016lx:\n", addr);
 
-    l4t = mfn_to_virt(mfn);
+    l4t = map_domain_page(mfn);
     l4e = l4t[l4_table_offset(addr)];
+    unmap_domain_page(l4t);
     mfn = l4e_get_pfn(l4e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -186,8 +187,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l3t = mfn_to_virt(mfn);
+    l3t = map_domain_page(mfn);
     l3e = l3t[l3_table_offset(addr)];
+    unmap_domain_page(l3t);
     mfn = l3e_get_pfn(l3e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -199,8 +201,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(addr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -212,8 +215,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(addr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -172,6 +172,10 @@ static inline l4_pgentry_t l4e_from_padd
 #define l3e_to_l2e(x)              ((l2_pgentry_t *)__va(l3e_get_paddr(x)))
 #define l4e_to_l3e(x)              ((l3_pgentry_t *)__va(l4e_get_paddr(x)))
 
+#define map_l1t_from_l2e(x)        ((l1_pgentry_t *)map_domain_page(l2e_get_pfn(x)))
+#define map_l2t_from_l3e(x)        ((l2_pgentry_t *)map_domain_page(l3e_get_pfn(x)))
+#define map_l3t_from_l4e(x)        ((l3_pgentry_t *)map_domain_page(l4e_get_pfn(x)))
+
 /* Given a virtual address, get an entry offset into a page table. */
 #define l1_table_offset(a)         \
     (((a) >> L1_PAGETABLE_SHIFT) & (L1_PAGETABLE_ENTRIES - 1))



[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 08/11] x86: properly use map_domain_page() in nested HVM code
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (5 preceding siblings ...)
  2013-01-22 10:55 ` [PATCH 07/11] x86: properly use map_domain_page() during page table manipulation Jan Beulich
@ 2013-01-22 10:55 ` Jan Beulich
  2013-01-22 10:56 ` [PATCH 09/11] x86: properly use map_domain_page() in miscellaneous places Jan Beulich
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:55 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 10885 bytes --]

This eliminates a couple of incorrect/inconsistent uses of
map_domain_page() from VT-x code.

Note that this does _not_ add error handling where none was present
before, even though I think NULL returns from any of the mapping
operations touched here need to properly be handled. I just don't know
this code well enough to figure out what the right action in each case
would be.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
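
As a minimal usage sketch of the reworked mapping interface (not part of the
patch itself; the variables below are illustrative and, per the note above,
error handling is elided):

    /* Transient mapping: taken and released within a single operation. */
    void *p = hvm_map_guest_frame_ro(gfn, 0);
    if ( p )
    {
        /* ... inspect the guest frame ... */
        hvm_unmap_guest_frame(p, 0);
    }

    /* Permanent mapping: backed by __map_domain_page_global(), so the
     * pointer may be cached across VM entries (e.g. a nested VMCB/VMCS)
     * and must later be torn down with the matching flag. */
    nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT, 1);
    /* ... use nv->nv_vvmcx across exits ... */
    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
    nv->nv_vvmcx = NULL;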

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1966,7 +1966,8 @@ int hvm_virtual_to_linear_addr(
 
 /* On non-NULL return, we leave this function holding an additional 
  * ref on the underlying mfn, if any */
-static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable)
+static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable,
+                                   bool_t permanent)
 {
     void *map;
     p2m_type_t p2mt;
@@ -1991,28 +1992,41 @@ static void *__hvm_map_guest_frame(unsig
     if ( writable )
         paging_mark_dirty(d, page_to_mfn(page));
 
-    map = __map_domain_page(page);
+    if ( !permanent )
+        return __map_domain_page(page);
+
+    map = __map_domain_page_global(page);
+    if ( !map )
+        put_page(page);
+
     return map;
 }
 
-void *hvm_map_guest_frame_rw(unsigned long gfn)
+void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent)
 {
-    return __hvm_map_guest_frame(gfn, 1);
+    return __hvm_map_guest_frame(gfn, 1, permanent);
 }
 
-void *hvm_map_guest_frame_ro(unsigned long gfn)
+void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent)
 {
-    return __hvm_map_guest_frame(gfn, 0);
+    return __hvm_map_guest_frame(gfn, 0, permanent);
 }
 
-void hvm_unmap_guest_frame(void *p)
+void hvm_unmap_guest_frame(void *p, bool_t permanent)
 {
-    if ( p )
-    {
-        unsigned long mfn = domain_page_map_to_mfn(p);
+    unsigned long mfn;
+
+    if ( !p )
+        return;
+
+    mfn = domain_page_map_to_mfn(p);
+
+    if ( !permanent )
         unmap_domain_page(p);
-        put_page(mfn_to_page(mfn));
-    }
+    else
+        unmap_domain_page_global(p);
+
+    put_page(mfn_to_page(mfn));
 }
 
 static void *hvm_map_entry(unsigned long va)
@@ -2038,7 +2052,7 @@ static void *hvm_map_entry(unsigned long
     if ( (pfec == PFEC_page_paged) || (pfec == PFEC_page_shared) )
         goto fail;
 
-    v = hvm_map_guest_frame_rw(gfn);
+    v = hvm_map_guest_frame_rw(gfn, 0);
     if ( v == NULL )
         goto fail;
 
@@ -2051,7 +2065,7 @@ static void *hvm_map_entry(unsigned long
 
 static void hvm_unmap_entry(void *p)
 {
-    hvm_unmap_guest_frame(p);
+    hvm_unmap_guest_frame(p, 0);
 }
 
 static int hvm_load_segment_selector(
--- a/xen/arch/x86/hvm/nestedhvm.c
+++ b/xen/arch/x86/hvm/nestedhvm.c
@@ -53,8 +53,7 @@ nestedhvm_vcpu_reset(struct vcpu *v)
     nv->nv_ioport80 = 0;
     nv->nv_ioportED = 0;
 
-    if (nv->nv_vvmcx)
-        hvm_unmap_guest_frame(nv->nv_vvmcx);
+    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
     nv->nv_vvmcx = NULL;
     nv->nv_vvmcxaddr = VMCX_EADDR;
     nv->nv_flushp2m = 0;
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -69,15 +69,14 @@ int nestedsvm_vmcb_map(struct vcpu *v, u
     struct nestedvcpu *nv = &vcpu_nestedhvm(v);
 
     if (nv->nv_vvmcx != NULL && nv->nv_vvmcxaddr != vmcbaddr) {
-        ASSERT(nv->nv_vvmcx != NULL);
         ASSERT(nv->nv_vvmcxaddr != VMCX_EADDR);
-        hvm_unmap_guest_frame(nv->nv_vvmcx);
+        hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
         nv->nv_vvmcx = NULL;
         nv->nv_vvmcxaddr = VMCX_EADDR;
     }
 
     if (nv->nv_vvmcx == NULL) {
-        nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT);
+        nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT, 1);
         if (nv->nv_vvmcx == NULL)
             return 0;
         nv->nv_vvmcxaddr = vmcbaddr;
@@ -141,6 +140,8 @@ void nsvm_vcpu_destroy(struct vcpu *v)
                            get_order_from_bytes(MSRPM_SIZE));
         svm->ns_merged_msrpm = NULL;
     }
+    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
+    nv->nv_vvmcx = NULL;
     if (nv->nv_n2vmcx) {
         free_vmcb(nv->nv_n2vmcx);
         nv->nv_n2vmcx = NULL;
@@ -358,11 +359,11 @@ static int nsvm_vmrun_permissionmap(stru
     svm->ns_oiomap_pa = svm->ns_iomap_pa;
     svm->ns_iomap_pa = ns_vmcb->_iopm_base_pa;
 
-    ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT);
+    ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT, 0);
     ASSERT(ns_viomap != NULL);
     ioport_80 = test_bit(0x80, ns_viomap);
     ioport_ed = test_bit(0xed, ns_viomap);
-    hvm_unmap_guest_frame(ns_viomap);
+    hvm_unmap_guest_frame(ns_viomap, 0);
 
     svm->ns_iomap = nestedhvm_vcpu_iomap_get(ioport_80, ioport_ed);
 
@@ -888,7 +889,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t 
         break;
     }
 
-    io_bitmap = hvm_map_guest_frame_ro(gfn);
+    io_bitmap = hvm_map_guest_frame_ro(gfn, 0);
     if (io_bitmap == NULL) {
         gdprintk(XENLOG_ERR,
             "IOIO intercept: mapping of permission map failed\n");
@@ -896,7 +897,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t 
     }
 
     enabled = test_bit(port, io_bitmap);
-    hvm_unmap_guest_frame(io_bitmap);
+    hvm_unmap_guest_frame(io_bitmap, 0);
 
     if (!enabled)
         return NESTEDHVM_VMEXIT_HOST;
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -569,18 +569,20 @@ void nvmx_update_exception_bitmap(struct
 static void nvmx_update_apic_access_address(struct vcpu *v)
 {
     struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
-    u64 apic_gpfn, apic_mfn;
     u32 ctrl;
-    void *apic_va;
 
     ctrl = __n2_secondary_exec_control(v);
     if ( ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES )
     {
+        p2m_type_t p2mt;
+        unsigned long apic_gpfn;
+        struct page_info *apic_pg;
+
         apic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, APIC_ACCESS_ADDR) >> PAGE_SHIFT;
-        apic_va = hvm_map_guest_frame_ro(apic_gpfn);
-        apic_mfn = virt_to_mfn(apic_va);
-        __vmwrite(APIC_ACCESS_ADDR, (apic_mfn << PAGE_SHIFT));
-        hvm_unmap_guest_frame(apic_va); 
+        apic_pg = get_page_from_gfn(v->domain, apic_gpfn, &p2mt, P2M_ALLOC);
+        ASSERT(apic_pg && !p2m_is_paging(p2mt));
+        __vmwrite(APIC_ACCESS_ADDR, page_to_maddr(apic_pg));
+        put_page(apic_pg);
     }
     else
         __vmwrite(APIC_ACCESS_ADDR, 0);
@@ -589,18 +591,20 @@ static void nvmx_update_apic_access_addr
 static void nvmx_update_virtual_apic_address(struct vcpu *v)
 {
     struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
-    u64 vapic_gpfn, vapic_mfn;
     u32 ctrl;
-    void *vapic_va;
 
     ctrl = __n2_exec_control(v);
     if ( ctrl & CPU_BASED_TPR_SHADOW )
     {
+        p2m_type_t p2mt;
+        unsigned long vapic_gpfn;
+        struct page_info *vapic_pg;
+
         vapic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, VIRTUAL_APIC_PAGE_ADDR) >> PAGE_SHIFT;
-        vapic_va = hvm_map_guest_frame_ro(vapic_gpfn);
-        vapic_mfn = virt_to_mfn(vapic_va);
-        __vmwrite(VIRTUAL_APIC_PAGE_ADDR, (vapic_mfn << PAGE_SHIFT));
-        hvm_unmap_guest_frame(vapic_va); 
+        vapic_pg = get_page_from_gfn(v->domain, vapic_gpfn, &p2mt, P2M_ALLOC);
+        ASSERT(vapic_pg && !p2m_is_paging(p2mt));
+        __vmwrite(VIRTUAL_APIC_PAGE_ADDR, page_to_maddr(vapic_pg));
+        put_page(vapic_pg);
     }
     else
         __vmwrite(VIRTUAL_APIC_PAGE_ADDR, 0);
@@ -641,9 +645,9 @@ static void __map_msr_bitmap(struct vcpu
     unsigned long gpa;
 
     if ( nvmx->msrbitmap )
-        hvm_unmap_guest_frame (nvmx->msrbitmap); 
+        hvm_unmap_guest_frame(nvmx->msrbitmap, 1);
     gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, MSR_BITMAP);
-    nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT);
+    nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1);
 }
 
 static void __map_io_bitmap(struct vcpu *v, u64 vmcs_reg)
@@ -654,9 +658,9 @@ static void __map_io_bitmap(struct vcpu 
 
     index = vmcs_reg == IO_BITMAP_A ? 0 : 1;
     if (nvmx->iobitmap[index])
-        hvm_unmap_guest_frame (nvmx->iobitmap[index]); 
+        hvm_unmap_guest_frame(nvmx->iobitmap[index], 1);
     gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, vmcs_reg);
-    nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT);
+    nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1);
 }
 
 static inline void map_io_bitmap_all(struct vcpu *v)
@@ -673,17 +677,17 @@ static void nvmx_purge_vvmcs(struct vcpu
 
     __clear_current_vvmcs(v);
     if ( nvcpu->nv_vvmcxaddr != VMCX_EADDR )
-        hvm_unmap_guest_frame(nvcpu->nv_vvmcx);
+        hvm_unmap_guest_frame(nvcpu->nv_vvmcx, 1);
     nvcpu->nv_vvmcx = NULL;
     nvcpu->nv_vvmcxaddr = VMCX_EADDR;
     for (i=0; i<2; i++) {
         if ( nvmx->iobitmap[i] ) {
-            hvm_unmap_guest_frame(nvmx->iobitmap[i]); 
+            hvm_unmap_guest_frame(nvmx->iobitmap[i], 1);
             nvmx->iobitmap[i] = NULL;
         }
     }
     if ( nvmx->msrbitmap ) {
-        hvm_unmap_guest_frame(nvmx->msrbitmap);
+        hvm_unmap_guest_frame(nvmx->msrbitmap, 1);
         nvmx->msrbitmap = NULL;
     }
 }
@@ -1289,7 +1293,7 @@ int nvmx_handle_vmptrld(struct cpu_user_
 
     if ( nvcpu->nv_vvmcxaddr == VMCX_EADDR )
     {
-        nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT);
+        nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 1);
         nvcpu->nv_vvmcxaddr = gpa;
         map_io_bitmap_all (v);
         __map_msr_bitmap(v);
@@ -1350,10 +1354,10 @@ int nvmx_handle_vmclear(struct cpu_user_
     else 
     {
         /* Even if this VMCS isn't the current one, we must clear it. */
-        vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT);
+        vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 0);
         if ( vvmcs ) 
             __set_vvmcs(vvmcs, NVMX_LAUNCH_STATE, 0);
-        hvm_unmap_guest_frame(vvmcs);
+        hvm_unmap_guest_frame(vvmcs, 0);
     }
 
     vmreturn(regs, VMSUCCEED);
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -423,9 +423,9 @@ int hvm_virtual_to_linear_addr(
     unsigned int addr_size,
     unsigned long *linear_addr);
 
-void *hvm_map_guest_frame_rw(unsigned long gfn);
-void *hvm_map_guest_frame_ro(unsigned long gfn);
-void hvm_unmap_guest_frame(void *p);
+void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent);
+void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent);
+void hvm_unmap_guest_frame(void *p, bool_t permanent);
 
 static inline void hvm_set_info_guest(struct vcpu *v)
 {



[-- Attachment #2: x86-map-domain-nhvm.patch --]
[-- Type: text/plain, Size: 10939 bytes --]

x86: properly use map_domain_page() in nested HVM code

This eliminates a couple of incorrect/inconsistent uses of
map_domain_page() from VT-x code.

Note that this does _not_ add error handling where none was present
before, even though I think NULL returns from any of the mapping
operations touched here need to properly be handled. I just don't know
this code well enough to figure out what the right action in each case
would be.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1966,7 +1966,8 @@ int hvm_virtual_to_linear_addr(
 
 /* On non-NULL return, we leave this function holding an additional 
  * ref on the underlying mfn, if any */
-static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable)
+static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable,
+                                   bool_t permanent)
 {
     void *map;
     p2m_type_t p2mt;
@@ -1991,28 +1992,41 @@ static void *__hvm_map_guest_frame(unsig
     if ( writable )
         paging_mark_dirty(d, page_to_mfn(page));
 
-    map = __map_domain_page(page);
+    if ( !permanent )
+        return __map_domain_page(page);
+
+    map = __map_domain_page_global(page);
+    if ( !map )
+        put_page(page);
+
     return map;
 }
 
-void *hvm_map_guest_frame_rw(unsigned long gfn)
+void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent)
 {
-    return __hvm_map_guest_frame(gfn, 1);
+    return __hvm_map_guest_frame(gfn, 1, permanent);
 }
 
-void *hvm_map_guest_frame_ro(unsigned long gfn)
+void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent)
 {
-    return __hvm_map_guest_frame(gfn, 0);
+    return __hvm_map_guest_frame(gfn, 0, permanent);
 }
 
-void hvm_unmap_guest_frame(void *p)
+void hvm_unmap_guest_frame(void *p, bool_t permanent)
 {
-    if ( p )
-    {
-        unsigned long mfn = domain_page_map_to_mfn(p);
+    unsigned long mfn;
+
+    if ( !p )
+        return;
+
+    mfn = domain_page_map_to_mfn(p);
+
+    if ( !permanent )
         unmap_domain_page(p);
-        put_page(mfn_to_page(mfn));
-    }
+    else
+        unmap_domain_page_global(p);
+
+    put_page(mfn_to_page(mfn));
 }
 
 static void *hvm_map_entry(unsigned long va)
@@ -2038,7 +2052,7 @@ static void *hvm_map_entry(unsigned long
     if ( (pfec == PFEC_page_paged) || (pfec == PFEC_page_shared) )
         goto fail;
 
-    v = hvm_map_guest_frame_rw(gfn);
+    v = hvm_map_guest_frame_rw(gfn, 0);
     if ( v == NULL )
         goto fail;
 
@@ -2051,7 +2065,7 @@ static void *hvm_map_entry(unsigned long
 
 static void hvm_unmap_entry(void *p)
 {
-    hvm_unmap_guest_frame(p);
+    hvm_unmap_guest_frame(p, 0);
 }
 
 static int hvm_load_segment_selector(
--- a/xen/arch/x86/hvm/nestedhvm.c
+++ b/xen/arch/x86/hvm/nestedhvm.c
@@ -53,8 +53,7 @@ nestedhvm_vcpu_reset(struct vcpu *v)
     nv->nv_ioport80 = 0;
     nv->nv_ioportED = 0;
 
-    if (nv->nv_vvmcx)
-        hvm_unmap_guest_frame(nv->nv_vvmcx);
+    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
     nv->nv_vvmcx = NULL;
     nv->nv_vvmcxaddr = VMCX_EADDR;
     nv->nv_flushp2m = 0;
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -69,15 +69,14 @@ int nestedsvm_vmcb_map(struct vcpu *v, u
     struct nestedvcpu *nv = &vcpu_nestedhvm(v);
 
     if (nv->nv_vvmcx != NULL && nv->nv_vvmcxaddr != vmcbaddr) {
-        ASSERT(nv->nv_vvmcx != NULL);
         ASSERT(nv->nv_vvmcxaddr != VMCX_EADDR);
-        hvm_unmap_guest_frame(nv->nv_vvmcx);
+        hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
         nv->nv_vvmcx = NULL;
         nv->nv_vvmcxaddr = VMCX_EADDR;
     }
 
     if (nv->nv_vvmcx == NULL) {
-        nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT);
+        nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT, 1);
         if (nv->nv_vvmcx == NULL)
             return 0;
         nv->nv_vvmcxaddr = vmcbaddr;
@@ -141,6 +140,8 @@ void nsvm_vcpu_destroy(struct vcpu *v)
                            get_order_from_bytes(MSRPM_SIZE));
         svm->ns_merged_msrpm = NULL;
     }
+    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
+    nv->nv_vvmcx = NULL;
     if (nv->nv_n2vmcx) {
         free_vmcb(nv->nv_n2vmcx);
         nv->nv_n2vmcx = NULL;
@@ -358,11 +359,11 @@ static int nsvm_vmrun_permissionmap(stru
     svm->ns_oiomap_pa = svm->ns_iomap_pa;
     svm->ns_iomap_pa = ns_vmcb->_iopm_base_pa;
 
-    ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT);
+    ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT, 0);
     ASSERT(ns_viomap != NULL);
     ioport_80 = test_bit(0x80, ns_viomap);
     ioport_ed = test_bit(0xed, ns_viomap);
-    hvm_unmap_guest_frame(ns_viomap);
+    hvm_unmap_guest_frame(ns_viomap, 0);
 
     svm->ns_iomap = nestedhvm_vcpu_iomap_get(ioport_80, ioport_ed);
 
@@ -888,7 +889,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t 
         break;
     }
 
-    io_bitmap = hvm_map_guest_frame_ro(gfn);
+    io_bitmap = hvm_map_guest_frame_ro(gfn, 0);
     if (io_bitmap == NULL) {
         gdprintk(XENLOG_ERR,
             "IOIO intercept: mapping of permission map failed\n");
@@ -896,7 +897,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t 
     }
 
     enabled = test_bit(port, io_bitmap);
-    hvm_unmap_guest_frame(io_bitmap);
+    hvm_unmap_guest_frame(io_bitmap, 0);
 
     if (!enabled)
         return NESTEDHVM_VMEXIT_HOST;
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -569,18 +569,20 @@ void nvmx_update_exception_bitmap(struct
 static void nvmx_update_apic_access_address(struct vcpu *v)
 {
     struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
-    u64 apic_gpfn, apic_mfn;
     u32 ctrl;
-    void *apic_va;
 
     ctrl = __n2_secondary_exec_control(v);
     if ( ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES )
     {
+        p2m_type_t p2mt;
+        unsigned long apic_gpfn;
+        struct page_info *apic_pg;
+
         apic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, APIC_ACCESS_ADDR) >> PAGE_SHIFT;
-        apic_va = hvm_map_guest_frame_ro(apic_gpfn);
-        apic_mfn = virt_to_mfn(apic_va);
-        __vmwrite(APIC_ACCESS_ADDR, (apic_mfn << PAGE_SHIFT));
-        hvm_unmap_guest_frame(apic_va); 
+        apic_pg = get_page_from_gfn(v->domain, apic_gpfn, &p2mt, P2M_ALLOC);
+        ASSERT(apic_pg && !p2m_is_paging(p2mt));
+        __vmwrite(APIC_ACCESS_ADDR, page_to_maddr(apic_pg));
+        put_page(apic_pg);
     }
     else
         __vmwrite(APIC_ACCESS_ADDR, 0);
@@ -589,18 +591,20 @@ static void nvmx_update_apic_access_addr
 static void nvmx_update_virtual_apic_address(struct vcpu *v)
 {
     struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
-    u64 vapic_gpfn, vapic_mfn;
     u32 ctrl;
-    void *vapic_va;
 
     ctrl = __n2_exec_control(v);
     if ( ctrl & CPU_BASED_TPR_SHADOW )
     {
+        p2m_type_t p2mt;
+        unsigned long vapic_gpfn;
+        struct page_info *vapic_pg;
+
         vapic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, VIRTUAL_APIC_PAGE_ADDR) >> PAGE_SHIFT;
-        vapic_va = hvm_map_guest_frame_ro(vapic_gpfn);
-        vapic_mfn = virt_to_mfn(vapic_va);
-        __vmwrite(VIRTUAL_APIC_PAGE_ADDR, (vapic_mfn << PAGE_SHIFT));
-        hvm_unmap_guest_frame(vapic_va); 
+        vapic_pg = get_page_from_gfn(v->domain, vapic_gpfn, &p2mt, P2M_ALLOC);
+        ASSERT(vapic_pg && !p2m_is_paging(p2mt));
+        __vmwrite(VIRTUAL_APIC_PAGE_ADDR, page_to_maddr(vapic_pg));
+        put_page(vapic_pg);
     }
     else
         __vmwrite(VIRTUAL_APIC_PAGE_ADDR, 0);
@@ -641,9 +645,9 @@ static void __map_msr_bitmap(struct vcpu
     unsigned long gpa;
 
     if ( nvmx->msrbitmap )
-        hvm_unmap_guest_frame (nvmx->msrbitmap); 
+        hvm_unmap_guest_frame(nvmx->msrbitmap, 1);
     gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, MSR_BITMAP);
-    nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT);
+    nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1);
 }
 
 static void __map_io_bitmap(struct vcpu *v, u64 vmcs_reg)
@@ -654,9 +658,9 @@ static void __map_io_bitmap(struct vcpu 
 
     index = vmcs_reg == IO_BITMAP_A ? 0 : 1;
     if (nvmx->iobitmap[index])
-        hvm_unmap_guest_frame (nvmx->iobitmap[index]); 
+        hvm_unmap_guest_frame(nvmx->iobitmap[index], 1);
     gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, vmcs_reg);
-    nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT);
+    nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1);
 }
 
 static inline void map_io_bitmap_all(struct vcpu *v)
@@ -673,17 +677,17 @@ static void nvmx_purge_vvmcs(struct vcpu
 
     __clear_current_vvmcs(v);
     if ( nvcpu->nv_vvmcxaddr != VMCX_EADDR )
-        hvm_unmap_guest_frame(nvcpu->nv_vvmcx);
+        hvm_unmap_guest_frame(nvcpu->nv_vvmcx, 1);
     nvcpu->nv_vvmcx = NULL;
     nvcpu->nv_vvmcxaddr = VMCX_EADDR;
     for (i=0; i<2; i++) {
         if ( nvmx->iobitmap[i] ) {
-            hvm_unmap_guest_frame(nvmx->iobitmap[i]); 
+            hvm_unmap_guest_frame(nvmx->iobitmap[i], 1);
             nvmx->iobitmap[i] = NULL;
         }
     }
     if ( nvmx->msrbitmap ) {
-        hvm_unmap_guest_frame(nvmx->msrbitmap);
+        hvm_unmap_guest_frame(nvmx->msrbitmap, 1);
         nvmx->msrbitmap = NULL;
     }
 }
@@ -1289,7 +1293,7 @@ int nvmx_handle_vmptrld(struct cpu_user_
 
     if ( nvcpu->nv_vvmcxaddr == VMCX_EADDR )
     {
-        nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT);
+        nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 1);
         nvcpu->nv_vvmcxaddr = gpa;
         map_io_bitmap_all (v);
         __map_msr_bitmap(v);
@@ -1350,10 +1354,10 @@ int nvmx_handle_vmclear(struct cpu_user_
     else 
     {
         /* Even if this VMCS isn't the current one, we must clear it. */
-        vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT);
+        vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 0);
         if ( vvmcs ) 
             __set_vvmcs(vvmcs, NVMX_LAUNCH_STATE, 0);
-        hvm_unmap_guest_frame(vvmcs);
+        hvm_unmap_guest_frame(vvmcs, 0);
     }
 
     vmreturn(regs, VMSUCCEED);
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -423,9 +423,9 @@ int hvm_virtual_to_linear_addr(
     unsigned int addr_size,
     unsigned long *linear_addr);
 
-void *hvm_map_guest_frame_rw(unsigned long gfn);
-void *hvm_map_guest_frame_ro(unsigned long gfn);
-void hvm_unmap_guest_frame(void *p);
+void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent);
+void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent);
+void hvm_unmap_guest_frame(void *p, bool_t permanent);
 
 static inline void hvm_set_info_guest(struct vcpu *v)
 {
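
A reading aid, not part of the patch: judging by the callers changed
above, the new boolean argument separates short-lived mappings, created
and released within a single operation, from mappings such as the
virtual VMCS/VMCB and the MSR/I/O bitmaps, which have to stay valid
across VM exits until explicitly torn down. A minimal caller sketch
using only the prototypes above; gfn, port, vmcbaddr and nv are
placeholders borrowed from the hunks earlier in this patch:

    void *m = hvm_map_guest_frame_ro(gfn, 0);   /* transient mapping */
    if ( m != NULL )
    {
        bool_t set = test_bit(port, m);         /* consume the data...   */
        hvm_unmap_guest_frame(m, 0);            /* ...and unmap promptly */
        /* act on 'set' afterwards */
    }

    /* Long-lived mapping, kept in nested-virt state across VM exits: */
    nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT, 1);
    /* ... much later, e.g. on vcpu teardown ... */
    hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
    nv->nv_vvmcx = NULL;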


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 09/11] x86: properly use map_domain_page() in miscellaneous places
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (6 preceding siblings ...)
  2013-01-22 10:55 ` [PATCH 08/11] x86: properly use map_domain_page() in nested HVM code Jan Beulich
@ 2013-01-22 10:56 ` Jan Beulich
  2013-01-22 10:57 ` [PATCH 10/11] tmem: partial adjustments for x86 16Tb support Jan Beulich
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:56 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 4472 bytes --]

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -150,7 +150,7 @@ long arch_do_domctl(
                 ret = -ENOMEM;
                 break;
             }
-            arr = page_to_virt(page);
+            arr = __map_domain_page(page);
 
             for ( n = ret = 0; n < num; )
             {
@@ -220,7 +220,9 @@ long arch_do_domctl(
                 n += k;
             }
 
-            free_domheap_page(virt_to_page(arr));
+            page = mfn_to_page(domain_page_map_to_mfn(arr));
+            unmap_domain_page(arr);
+            free_domheap_page(page);
 
             break;
         }
@@ -1347,8 +1349,11 @@ void arch_get_info_guest(struct vcpu *v,
         }
         else
         {
-            l4_pgentry_t *l4e = __va(pagetable_get_paddr(v->arch.guest_table));
+            const l4_pgentry_t *l4e =
+                map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+
             c.cmp->ctrlreg[3] = compat_pfn_to_cr3(l4e_get_pfn(*l4e));
+            unmap_domain_page(l4e);
 
             /* Merge shadow DR7 bits into real DR7. */
             c.cmp->debugreg[7] |= c.cmp->debugreg[5];
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2538,14 +2538,18 @@ int new_guest_cr3(unsigned long mfn)
 
     if ( is_pv_32on64_domain(d) )
     {
+        unsigned long gt_mfn = pagetable_get_pfn(curr->arch.guest_table);
+        l4_pgentry_t *pl4e = map_domain_page(gt_mfn);
+
         okay = paging_mode_refcounts(d)
             ? 0 /* Old code was broken, but what should it be? */
             : mod_l4_entry(
-                    __va(pagetable_get_paddr(curr->arch.guest_table)),
+                    pl4e,
                     l4e_from_pfn(
                         mfn,
                         (_PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED)),
-                    pagetable_get_pfn(curr->arch.guest_table), 0, 0, curr) == 0;
+                    gt_mfn, 0, 0, curr) == 0;
+        unmap_domain_page(pl4e);
         if ( unlikely(!okay) )
         {
             MEM_LOG("Error while installing new compat baseptr %lx", mfn);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -3543,6 +3543,9 @@ int shadow_track_dirty_vram(struct domai
     }
     else
     {
+        unsigned long map_mfn = INVALID_MFN;
+        void *map_sl1p = NULL;
+
         /* Iterate over VRAM to track dirty bits. */
         for ( i = 0; i < nr; i++ ) {
             mfn_t mfn = get_gfn_query_unlocked(d, begin_pfn + i, &t);
@@ -3576,7 +3579,17 @@ int shadow_track_dirty_vram(struct domai
                     {
                         /* Hopefully the most common case: only one mapping,
                          * whose dirty bit we can use. */
-                        l1_pgentry_t *sl1e = maddr_to_virt(sl1ma);
+                        l1_pgentry_t *sl1e;
+                        unsigned long sl1mfn = paddr_to_pfn(sl1ma);
+
+                        if ( sl1mfn != map_mfn )
+                        {
+                            if ( map_sl1p )
+                                sh_unmap_domain_page(map_sl1p);
+                            map_sl1p = sh_map_domain_page(_mfn(sl1mfn));
+                            map_mfn = sl1mfn;
+                        }
+                        sl1e = map_sl1p + (sl1ma & ~PAGE_MASK);
 
                         if ( l1e_get_flags(*sl1e) & _PAGE_DIRTY )
                         {
@@ -3603,6 +3616,9 @@ int shadow_track_dirty_vram(struct domai
             }
         }
 
+        if ( map_sl1p )
+            sh_unmap_domain_page(map_sl1p);
+
         rc = -EFAULT;
         if ( copy_to_guest(dirty_bitmap, dirty_vram->dirty_bitmap, dirty_size) == 0 ) {
             memset(dirty_vram->dirty_bitmap, 0, dirty_size);
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2255,7 +2255,11 @@ static int emulate_privileged_op(struct 
             }
             else
             {
-                mfn = l4e_get_pfn(*(l4_pgentry_t *)__va(pagetable_get_paddr(v->arch.guest_table)));
+                l4_pgentry_t *pl4e =
+                    map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+
+                mfn = l4e_get_pfn(*pl4e);
+                unmap_domain_page(pl4e);
                 *reg = compat_pfn_to_cr3(mfn_to_gmfn(
                     v->domain, mfn));
             }
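
A side note on the domctl hunk above (a sketch, not taken verbatim from
the patch): once a domheap page may sit outside the always-mapped range,
virt_to_page() on its mapping can no longer be used, so the struct
page_info has to be recovered from the transient mapping before that
mapping is dropped. Condensed, the allocate/use/free sequence becomes:

    struct page_info *pg = alloc_domheap_page(NULL, 0);
    uint32_t *arr = __map_domain_page(pg);           /* transient mapping */

    /* ... fill in / copy out arr ... */

    pg = mfn_to_page(domain_page_map_to_mfn(arr));   /* recover the page first */
    unmap_domain_page(arr);                          /* then drop the mapping  */
    free_domheap_page(pg);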




^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 10/11] tmem: partial adjustments for x86 16Tb support
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (7 preceding siblings ...)
  2013-01-22 10:56 ` [PATCH 09/11] x86: properly use map_domain_page() in miscellaneous places Jan Beulich
@ 2013-01-22 10:57 ` Jan Beulich
  2013-01-22 17:55   ` Dan Magenheimer
  2013-01-22 10:57 ` [PATCH 11/11] x86: support up to 16Tb Jan Beulich
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:57 UTC (permalink / raw)
  To: xen-devel; +Cc: dan.magenheimer

[-- Attachment #1: Type: text/plain, Size: 2875 bytes --]

Despite the changes below, tmem still has code that assumes it can
directly access all memory, or that maps arbitrary amounts of memory
which is not directly accessible. I cannot see how to fix this without
converting _all_ of its domheap allocations to xenheap ones. And even
then I wouldn't be certain that there are no other cases where the
"all memory is always mapped" assumption would be broken. Therefore,
the next patch disables tmem for the time being if the full 1:1
mapping isn't always visible.
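
A condensed before/after sketch of the per-CPU buffer handling changed
below (names as in the patch, error handling omitted):

    /* Before: page_to_virt() is only safe while all RAM is in the 1:1 map. */
    struct page_info *p = alloc_domheap_pages(0, dstmem_order, 0);
    per_cpu(dstmem, cpu) = p ? page_to_virt(p) : NULL;

    /* After: xenheap memory is always directly addressable (and gets
     * restricted to the always-mapped range by the next patch). */
    per_cpu(dstmem, cpu) = alloc_xenheap_pages(dstmem_order, 0);
    /* ... */
    free_xenheap_pages(per_cpu(dstmem, cpu), dstmem_order);
    per_cpu(dstmem, cpu) = NULL;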

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/tmem_xen.c
+++ b/xen/common/tmem_xen.c
@@ -393,7 +393,8 @@ static void tmh_persistent_pool_page_put
     struct page_info *pi;
 
     ASSERT(IS_PAGE_ALIGNED(page_va));
-    pi = virt_to_page(page_va);
+    pi = mfn_to_page(domain_page_map_to_mfn(page_va));
+    unmap_domain_page(page_va);
     ASSERT(IS_VALID_PAGE(pi));
     _tmh_free_page_thispool(pi);
 }
@@ -441,39 +442,28 @@ static int cpu_callback(
     {
     case CPU_UP_PREPARE: {
         if ( per_cpu(dstmem, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_pages(0, dstmem_order, 0);
-            per_cpu(dstmem, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(dstmem, cpu) = alloc_xenheap_pages(dstmem_order, 0);
         if ( per_cpu(workmem, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_pages(0, workmem_order, 0);
-            per_cpu(workmem, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(workmem, cpu) = alloc_xenheap_pages(workmem_order, 0);
         if ( per_cpu(scratch_page, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_page(NULL, 0);
-            per_cpu(scratch_page, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(scratch_page, cpu) = alloc_xenheap_page();
         break;
     }
     case CPU_DEAD:
     case CPU_UP_CANCELED: {
         if ( per_cpu(dstmem, cpu) != NULL )
         {
-            struct page_info *p = virt_to_page(per_cpu(dstmem, cpu));
-            free_domheap_pages(p, dstmem_order);
+            free_xenheap_pages(per_cpu(dstmem, cpu), dstmem_order);
             per_cpu(dstmem, cpu) = NULL;
         }
         if ( per_cpu(workmem, cpu) != NULL )
         {
-            struct page_info *p = virt_to_page(per_cpu(workmem, cpu));
-            free_domheap_pages(p, workmem_order);
+            free_xenheap_pages(per_cpu(workmem, cpu), workmem_order);
             per_cpu(workmem, cpu) = NULL;
         }
         if ( per_cpu(scratch_page, cpu) != NULL )
         {
-            free_domheap_page(virt_to_page(per_cpu(scratch_page, cpu)));
+            free_xenheap_page(per_cpu(scratch_page, cpu));
             per_cpu(scratch_page, cpu) = NULL;
         }
         break;





^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 11/11] x86: support up to 16Tb
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (8 preceding siblings ...)
  2013-01-22 10:57 ` [PATCH 10/11] tmem: partial adjustments for x86 16Tb support Jan Beulich
@ 2013-01-22 10:57 ` Jan Beulich
  2013-01-22 15:20   ` Dan Magenheimer
  2013-01-22 10:58 ` [PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems Jan Beulich
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:57 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 9911 bytes --]

This mainly involves adjusting the number of L4 entries that need to be
copied between page tables (which now differs between PV and HVM/idle
domains), and changing the cutoff point and the method used when more
memory than the supported amount is found in a system.

Since TMEM doesn't currently cope with the full 1:1 map not always
being visible, it gets forcibly disabled in that case.
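
For reference, the 16Tb figure follows from two limits that both work
out to roughly 2^32 page frames (the first assuming struct page_info
remains 32 bytes on x86-64, the second from the 32-bit page list links
behind PAGE_LIST_NULL):

    FRAMETABLE_SIZE / sizeof(struct page_info)  =  128GiB / 32B  =  2^32 frames
    max_pdx clamped to PAGE_LIST_NULL - 1       =>  just below 2^32 frames
    2^32 frames * 4KiB per frame                =  16TiB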

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/efi/boot.c
+++ b/xen/arch/x86/efi/boot.c
@@ -1591,7 +1591,7 @@ void __init efi_init_memory(void)
 
     /* Insert Xen mappings. */
     for ( i = l4_table_offset(HYPERVISOR_VIRT_START);
-          i < l4_table_offset(HYPERVISOR_VIRT_END); ++i )
+          i < l4_table_offset(DIRECTMAP_VIRT_END); ++i )
         efi_l4_pgtable[i] = idle_pg_table[i];
 #endif
 }
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1320,7 +1320,7 @@ void init_guest_l4_table(l4_pgentry_t l4
     /* Xen private mappings. */
     memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
+           ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t));
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -25,6 +25,7 @@
 #include <xen/dmi.h>
 #include <xen/pfn.h>
 #include <xen/nodemask.h>
+#include <xen/tmem_xen.h> /* for opt_tmem only */
 #include <public/version.h>
 #include <compat/platform.h>
 #include <compat/xen.h>
@@ -381,6 +382,11 @@ static void __init setup_max_pdx(void)
     if ( max_pdx > FRAMETABLE_NR )
         max_pdx = FRAMETABLE_NR;
 
+#ifdef PAGE_LIST_NULL
+    if ( max_pdx >= PAGE_LIST_NULL )
+        max_pdx = PAGE_LIST_NULL - 1;
+#endif
+
     max_page = pdx_to_pfn(max_pdx - 1) + 1;
 }
 
@@ -1031,9 +1037,23 @@ void __init __start_xen(unsigned long mb
         /* Create new mappings /before/ passing memory to the allocator. */
         if ( map_e < e )
         {
-            map_pages_to_xen((unsigned long)__va(map_e), map_e >> PAGE_SHIFT,
-                             (e - map_e) >> PAGE_SHIFT, PAGE_HYPERVISOR);
-            init_boot_pages(map_e, e);
+            uint64_t limit = __pa(HYPERVISOR_VIRT_END - 1) + 1;
+            uint64_t end = min(e, limit);
+
+            if ( map_e < end )
+            {
+                map_pages_to_xen((unsigned long)__va(map_e), PFN_DOWN(map_e),
+                                 PFN_DOWN(end - map_e), PAGE_HYPERVISOR);
+                init_boot_pages(map_e, end);
+                map_e = end;
+            }
+        }
+        if ( map_e < e )
+        {
+            /* This range must not be passed to the boot allocator and
+             * must also not be mapped with _PAGE_GLOBAL. */
+            map_pages_to_xen((unsigned long)__va(map_e), PFN_DOWN(map_e),
+                             PFN_DOWN(e - map_e), __PAGE_HYPERVISOR);
         }
         if ( s < map_s )
         {
@@ -1104,6 +1124,34 @@ void __init __start_xen(unsigned long mb
     end_boot_allocator();
     system_state = SYS_STATE_boot;
 
+    if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
+    {
+        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
+        uint64_t mask = PAGE_SIZE - 1;
+
+        xenheap_max_mfn(limit);
+
+        /* Pass the remaining memory to the allocator. */
+        for ( i = 0; i < boot_e820.nr_map; i++ )
+        {
+            uint64_t s, e;
+
+            s = (boot_e820.map[i].addr + mask) & ~mask;
+            e = (boot_e820.map[i].addr + boot_e820.map[i].size) & ~mask;
+            if ( PFN_DOWN(e) <= limit )
+                continue;
+            if ( PFN_DOWN(s) <= limit )
+                s = pfn_to_paddr(limit + 1);
+            init_domheap_pages(s, e);
+        }
+
+        if ( opt_tmem )
+        {
+           printk(XENLOG_WARNING "Forcing TMEM off\n");
+           opt_tmem = 0;
+        }
+    }
+
     vm_init();
     vesa_init();
 
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -1471,10 +1471,23 @@ int memory_add(unsigned long spfn, unsig
         return -EINVAL;
     }
 
-    ret =  map_pages_to_xen((unsigned long)mfn_to_virt(spfn), spfn,
-                            epfn - spfn, PAGE_HYPERVISOR);
-     if ( ret )
-        return ret;
+    i = virt_to_mfn(HYPERVISOR_VIRT_END - 1) + 1;
+    if ( spfn < i )
+    {
+        ret = map_pages_to_xen((unsigned long)mfn_to_virt(spfn), spfn,
+                               min(epfn, i) - spfn, PAGE_HYPERVISOR);
+        if ( ret )
+            return ret;
+    }
+    if ( i < epfn )
+    {
+        if ( i < spfn )
+            i = spfn;
+        ret = map_pages_to_xen((unsigned long)mfn_to_virt(i), i,
+                               epfn - i, __PAGE_HYPERVISOR);
+        if ( ret )
+            return ret;
+    }
 
     old_node_start = NODE_DATA(node)->node_start_pfn;
     old_node_span = NODE_DATA(node)->node_spanned_pages;
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -255,6 +255,9 @@ static unsigned long init_node_heap(int 
     unsigned long needed = (sizeof(**_heap) +
                             sizeof(**avail) * NR_ZONES +
                             PAGE_SIZE - 1) >> PAGE_SHIFT;
+#ifdef DIRECTMAP_VIRT_END
+    unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);
+#endif
     int i, j;
 
     if ( !first_node_initialised )
@@ -266,14 +269,14 @@ static unsigned long init_node_heap(int 
     }
 #ifdef DIRECTMAP_VIRT_END
     else if ( *use_tail && nr >= needed &&
-              (mfn + nr) <= (virt_to_mfn(DIRECTMAP_VIRT_END - 1) + 1) )
+              (mfn + nr) <= (virt_to_mfn(eva - 1) + 1) )
     {
         _heap[node] = mfn_to_virt(mfn + nr - needed);
         avail[node] = mfn_to_virt(mfn + nr - 1) +
                       PAGE_SIZE - sizeof(**avail) * NR_ZONES;
     }
     else if ( nr >= needed &&
-              (mfn + needed) <= (virt_to_mfn(DIRECTMAP_VIRT_END - 1) + 1) )
+              (mfn + needed) <= (virt_to_mfn(eva - 1) + 1) )
     {
         _heap[node] = mfn_to_virt(mfn);
         avail[node] = mfn_to_virt(mfn + needed - 1) +
@@ -1205,6 +1208,13 @@ void free_xenheap_pages(void *v, unsigne
 
 #else
 
+static unsigned int __read_mostly xenheap_bits;
+
+void __init xenheap_max_mfn(unsigned long mfn)
+{
+    xenheap_bits = fls(mfn) + PAGE_SHIFT - 1;
+}
+
 void init_xenheap_pages(paddr_t ps, paddr_t pe)
 {
     init_domheap_pages(ps, pe);
@@ -1217,6 +1227,11 @@ void *alloc_xenheap_pages(unsigned int o
 
     ASSERT(!in_irq());
 
+    if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits )
+        memflags &= ~MEMF_bits(~0);
+    if ( !(memflags >> _MEMF_bits) )
+        memflags |= MEMF_bits(xenheap_bits);
+
     pg = alloc_domheap_pages(NULL, order, memflags);
     if ( unlikely(pg == NULL) )
         return NULL;
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -163,8 +163,12 @@ extern unsigned char boot_edid_info[128]
  *    Page-frame information array.
  *  0xffff830000000000 - 0xffff87ffffffffff [5TB, 5*2^40 bytes, PML4:262-271]
  *    1:1 direct mapping of all physical memory.
- *  0xffff880000000000 - 0xffffffffffffffff [120TB, PML4:272-511]
- *    Guest-defined use.
+ *  0xffff880000000000 - 0xffffffffffffffff [120TB,             PML4:272-511]
+ *    PV: Guest-defined use.
+ *  0xffff880000000000 - 0xffffff7fffffffff [119.5TB,           PML4:272-510]
+ *    HVM/idle: continuation of 1:1 mapping
+ *  0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes  PML4:511]
+ *    HVM/idle: unused
  *
  * Compatibility guest area layout:
  *  0x0000000000000000 - 0x00000000f57fffff [3928MB,            PML4:0]
@@ -183,6 +187,8 @@ extern unsigned char boot_edid_info[128]
 #define ROOT_PAGETABLE_FIRST_XEN_SLOT 256
 #define ROOT_PAGETABLE_LAST_XEN_SLOT  271
 #define ROOT_PAGETABLE_XEN_SLOTS \
+    (L4_PAGETABLE_ENTRIES - ROOT_PAGETABLE_FIRST_XEN_SLOT - 1)
+#define ROOT_PAGETABLE_PV_XEN_SLOTS \
     (ROOT_PAGETABLE_LAST_XEN_SLOT - ROOT_PAGETABLE_FIRST_XEN_SLOT + 1)
 
 /* Hypervisor reserves PML4 slots 256 to 271 inclusive. */
@@ -241,9 +247,9 @@ extern unsigned char boot_edid_info[128]
 #define FRAMETABLE_SIZE         GB(128)
 #define FRAMETABLE_NR           (FRAMETABLE_SIZE / sizeof(*frame_table))
 #define FRAMETABLE_VIRT_START   (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE)
-/* Slot 262-271: A direct 1:1 mapping of all of physical memory. */
+/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */
 #define DIRECTMAP_VIRT_START    (PML4_ADDR(262))
-#define DIRECTMAP_SIZE          (PML4_ENTRY_BYTES*10)
+#define DIRECTMAP_SIZE          (PML4_ENTRY_BYTES * (511 - 262))
 #define DIRECTMAP_VIRT_END      (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
 
 #ifndef __ASSEMBLY__
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -43,6 +43,7 @@ void end_boot_allocator(void);
 
 /* Xen suballocator. These functions are interrupt-safe. */
 void init_xenheap_pages(paddr_t ps, paddr_t pe);
+void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
@@ -111,7 +112,7 @@ struct page_list_head
 /* These must only have instances in struct page_info. */
 # define page_list_entry
 
-#define PAGE_LIST_NULL (~0)
+# define PAGE_LIST_NULL ((typeof(((struct page_info){}).list.next))~0)
 
 # if !defined(pdx_to_page) && !defined(page_to_pdx)
 #  if defined(__page_to_mfn) || defined(__mfn_to_page)
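
The alloc_xenheap_pages() hunk above only changes behaviour once
xenheap_max_mfn() has been called. A sketch of the effect, using an
illustrative width of 42 bits rather than one computed from this patch:

    alloc_xenheap_pages(0, MEMF_bits(64)); /* wider than allowed: clamped to 42 bits    */
    alloc_xenheap_pages(0, 0);             /* no width requested: MEMF_bits(42) applied */
    alloc_xenheap_pages(0, MEMF_bits(32)); /* narrower request: left untouched          */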




^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (9 preceding siblings ...)
  2013-01-22 10:57 ` [PATCH 11/11] x86: support up to 16Tb Jan Beulich
@ 2013-01-22 10:58 ` Jan Beulich
  2013-01-23 14:26   ` [PATCH v2] " Jan Beulich
  2013-01-22 20:13 ` [PATCH 00/11] x86: support up to 16Tb Keir Fraser
  2013-01-23  9:33 ` Keir Fraser
  12 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 10:58 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 9375 bytes --]

DO NOT APPLY AS IS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -66,8 +66,10 @@ void *map_domain_page(unsigned long mfn)
     struct mapcache_vcpu *vcache;
     struct vcpu_maphash_entry *hashent;
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     v = mapcache_current_vcpu();
     if ( !v || is_hvm_vcpu(v) )
@@ -139,6 +141,14 @@ void *map_domain_page(unsigned long mfn)
                 if ( ++i == MAPHASH_ENTRIES )
                     i = 0;
             } while ( i != MAPHASH_HASHFN(mfn) );
+if(idx >= dcache->entries) {//temp
+ mapcache_domain_dump(v->domain);
+ for(i = 0; i < ARRAY_SIZE(vcache->hash); ++i) {
+  hashent = &vcache->hash[i];
+  if(hashent->idx != MAPHASHENT_NOTINUSE)
+   printk("vc[%u]: ref=%u idx=%04x mfn=%08lx\n", i, hashent->refcnt, hashent->idx, hashent->mfn);
+ }
+}
         }
         BUG_ON(idx >= dcache->entries);
 
@@ -249,8 +259,10 @@ int mapcache_domain_init(struct domain *
     if ( is_hvm_domain(d) || is_idle_domain(d) )
         return 0;
 
+#ifdef NDEBUG
     if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return 0;
+#endif
 
     dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
     d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf);
@@ -418,8 +430,10 @@ void *map_domain_page_global(unsigned lo
 
     ASSERT(!in_irq() && local_irq_is_enabled());
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     spin_lock(&globalmap_lock);
 
@@ -497,3 +511,26 @@ unsigned long domain_page_map_to_mfn(con
 
     return l1e_get_pfn(*pl1e);
 }
+
+void mapcache_domain_dump(struct domain *d) {//temp
+ unsigned i, n = 0;
+ const struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+ const struct vcpu *v;
+ if(is_hvm_domain(d) || is_idle_domain(d))
+  return;
+ for_each_vcpu(d, v) {
+  const struct mapcache_vcpu *vcache = &v->arch.pv_vcpu.mapcache;
+  for(i = 0; i < ARRAY_SIZE(vcache->hash); ++i)
+   n += (vcache->hash[i].idx != MAPHASHENT_NOTINUSE);
+ }
+ printk("Dom%d mc (#=%u v=%u) [%p]:\n", d->domain_id, n, d->max_vcpus, __builtin_return_address(0));
+ for(i = 0; i < BITS_TO_LONGS(dcache->entries); ++i)
+  printk("dcu[%02x]: %016lx\n", i, dcache->inuse[i]);
+ for(i = 0; i < BITS_TO_LONGS(dcache->entries); ++i)
+  printk("dcg[%02x]: %016lx\n", i, dcache->garbage[i]);
+ for(i = 0; i < dcache->entries; ++i) {
+  l1_pgentry_t l1e = DCACHE_L1ENT(dcache, i);
+  if((test_bit(i, dcache->inuse) && !test_bit(i, dcache->garbage)) || (l1e_get_flags(l1e) & _PAGE_PRESENT))
+   printk("dc[%04x]: %"PRIpte"\n", i, l1e_get_intpte(l1e));
+ }
+}
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -250,6 +250,14 @@ void __init init_frametable(void)
         init_spagetable();
 }
 
+#ifndef NDEBUG
+static unsigned int __read_mostly root_pgt_pv_xen_slots
+    = ROOT_PAGETABLE_PV_XEN_SLOTS;
+static l4_pgentry_t __read_mostly split_l4e;
+#else
+#define root_pgt_pv_xen_slots ROOT_PAGETABLE_PV_XEN_SLOTS
+#endif
+
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -344,6 +352,41 @@ void __init arch_init_memory(void)
     efi_init_memory();
 
     mem_sharing_init();
+
+#ifndef NDEBUG
+    if ( split_gb )
+    {
+        paddr_t split_pa = split_gb * GB(1);
+        unsigned long split_va = (unsigned long)__va(split_pa);
+
+        if ( split_va < HYPERVISOR_VIRT_END &&
+             split_va - 1 == (unsigned long)__va(split_pa - 1) )
+        {
+            root_pgt_pv_xen_slots = l4_table_offset(split_va) -
+                                    ROOT_PAGETABLE_FIRST_XEN_SLOT;
+            ASSERT(root_pgt_pv_xen_slots < ROOT_PAGETABLE_PV_XEN_SLOTS);
+            if ( l4_table_offset(split_va) == l4_table_offset(split_va - 1) )
+            {
+                l3_pgentry_t *l3tab = alloc_xen_pagetable();
+
+                if ( l3tab )
+                {
+                    const l3_pgentry_t *l3idle =
+                        l4e_to_l3e(idle_pg_table[l4_table_offset(split_va)]);
+
+                    for ( i = 0; i < l3_table_offset(split_va); ++i )
+                        l3tab[i] = l3idle[i];
+                    for ( ; i < L3_PAGETABLE_ENTRIES; ++i )
+                        l3tab[i] = l3e_empty();
+                    split_l4e = l4e_from_pfn(virt_to_mfn(l3tab),
+                                             __PAGE_HYPERVISOR);
+                }
+                else
+                    ++root_pgt_pv_xen_slots;
+            }
+        }
+    }
+#endif
 }
 
 int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
@@ -1320,7 +1363,12 @@ void init_guest_l4_table(l4_pgentry_t l4
     /* Xen private mappings. */
     memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t));
+           root_pgt_pv_xen_slots * sizeof(l4_pgentry_t));
+#ifndef NDEBUG
+    if ( l4e_get_intpte(split_l4e) )
+        l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT + root_pgt_pv_xen_slots] =
+            split_l4e;
+#endif
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
 s8 __read_mostly xen_cpuidle = -1;
 boolean_param("cpuidle", xen_cpuidle);
 
+#ifndef NDEBUG
+unsigned int __initdata split_gb;
+integer_param("split-gb", split_gb);
+#endif
+
 cpumask_t __read_mostly cpu_present_map;
 
 unsigned long __read_mostly xen_phys_start;
@@ -789,6 +794,11 @@ void __init __start_xen(unsigned long mb
     modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
     bootstrap_map(NULL);
 
+#ifndef split_gb /* Don't allow split below 4Gb. */
+    if ( split_gb < 4 )
+        split_gb = 0;
+#endif
+
     for ( i = boot_e820.nr_map-1; i >= 0; i-- )
     {
         uint64_t s, e, mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
@@ -917,6 +927,9 @@ void __init __start_xen(unsigned long mb
             /* Don't overlap with other modules. */
             end = consider_modules(s, e, size, mod, mbi->mods_count, j);
 
+            if ( split_gb && end > split_gb * GB(1) )
+                continue;
+
             if ( s < end &&
                  (headroom ||
                   ((end - size) >> PAGE_SHIFT) > mod[j].mod_start) )
@@ -958,6 +971,8 @@ void __init __start_xen(unsigned long mb
     kexec_reserve_area(&boot_e820);
 
     setup_max_pdx();
+    if ( split_gb )
+        xenheap_max_mfn(split_gb << (30 - PAGE_SHIFT));
 
     /*
      * Walk every RAM region and map it in its entirety (on x86/64, at least)
@@ -1129,7 +1144,8 @@ void __init __start_xen(unsigned long mb
         unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
         uint64_t mask = PAGE_SIZE - 1;
 
-        xenheap_max_mfn(limit);
+        if ( !split_gb )
+            xenheap_max_mfn(limit);
 
         /* Pass the remaining memory to the allocator. */
         for ( i = 0; i < boot_e820.nr_map; i++ )
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -45,6 +45,7 @@
 #include <asm/flushtlb.h>
 #ifdef CONFIG_X86
 #include <asm/p2m.h>
+#include <asm/setup.h> /* for split_gb only */
 #else
 #define p2m_pod_offline_or_broken_hit(pg) 0
 #define p2m_pod_offline_or_broken_replace(pg) BUG_ON(pg != NULL)
@@ -203,6 +204,25 @@ unsigned long __init alloc_boot_pages(
         pg = (r->e - nr_pfns) & ~(pfn_align - 1);
         if ( pg < r->s )
             continue;
+
+#if defined(CONFIG_X86) && !defined(NDEBUG)
+        /*
+         * Filtering pfn_align == 1 since the only allocations using a bigger
+         * alignment are the ones used for setting up the frame table chunks.
+         * Those allocations get remapped anyway, i.e. them not having 1:1
+         * mappings always accessible is not a problem.
+         */
+        if ( split_gb && pfn_align == 1 &&
+             r->e > (split_gb << (30 - PAGE_SHIFT)) )
+        {
+            pg = r->s;
+            if ( pg + nr_pfns > (split_gb << (30 - PAGE_SHIFT)) )
+                continue;
+            r->s = pg + nr_pfns;
+            return pg;
+        }
+#endif
+
         _e = r->e;
         r->e = pg;
         bootmem_region_add(pg + nr_pfns, _e);
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -72,6 +72,7 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *);
 void mapcache_domain_exit(struct domain *);
+void mapcache_domain_dump(struct domain *);//temp
 int mapcache_vcpu_init(struct vcpu *);
 void mapcache_override_current(struct vcpu *);
 
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -43,4 +43,10 @@ void microcode_grab_module(
 
 extern uint8_t kbd_shift_flags;
 
+#ifdef NDEBUG
+# define split_gb 0
+#else
+extern unsigned int split_gb;
+#endif
+
 #endif
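
Usage sketch (my reading of the code above; debug builds only): booting
with, say,

    split-gb=64

on a host carrying more than 64GiB of RAM keeps the xenheap, the boot
modules and the direct-map L4 slots copied into PV guest page tables
below the 64GiB mark, so memory above it is only reachable through
map_domain_page() and the paths adjusted earlier in the series actually
get exercised. Values below 4 are ignored.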



 
 extern uint8_t kbd_shift_flags;
 
+#ifdef NDEBUG
+# define split_gb 0
+#else
+extern unsigned int split_gb;
+#endif
+
 #endif


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/11] x86: support up to 16Tb
  2013-01-22 10:57 ` [PATCH 11/11] x86: support up to 16Tb Jan Beulich
@ 2013-01-22 15:20   ` Dan Magenheimer
  2013-01-22 15:31     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Magenheimer @ 2013-01-22 15:20 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Konrad Wilk

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
> 
> Since TMEM doesn't currently cope with the full 1:1 map not always
> being visible, it gets forcefully disabled in that case.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I agree this is the correct short-term (and maybe mid-term)
answer.  Anyone who can afford to fill their machine with
more than 5TiB of RAM is likely not very interested in
memory overcommit technologies :-) at least for the next
year or three.  Cloud providers may be an exception but
I'd imagine most of those are buying small- to mid-range
machines to optimize cost/performance, rather than
behemoths that expand to 5TiB+ which are highly performant
but often not cost-effective.

Longer term, zcache in Linux (which is a tmem-based technology)
successfully uses kmap/kunmap to run on 32-bit Linux OS's
so I'd imagine a similar technique could be used in Xen?
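
(For illustration only -- a minimal sketch, not code from this series: the
Xen counterpart of that kmap/kunmap pattern would be the transient
map_domain_page() interface the series re-introduces, roughly

    static void touch_one_page(unsigned long mfn)  /* made-up helper name */
    {
        void *va = map_domain_page(mfn);   /* kmap() equivalent */

        clear_page(va);                    /* any access goes through va */
        unmap_domain_page(va);             /* kunmap() equivalent */
    }

with the data page itself staying on the domheap.)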

In any case, thanks Jan for remembering to handle tmem.

One nit below...

Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>


> +        if ( opt_tmem )
> +        {
> +           printk(XENLOG_WARNING "Forcing TMEM off\n");
> +           opt_tmem = 0;
> +        }
> +    }

Maybe a bit more descriptive? I.e. "TMEM physical RAM limit
exceeded, disabling TMEM".
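
For illustration, with that wording folded in the hunk might read roughly as
follows (a sketch, not the updated patch itself):

        if ( opt_tmem )
        {
            printk(XENLOG_WARNING
                   "TMEM physical RAM limit exceeded, disabling TMEM\n");
            opt_tmem = 0;
        }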

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/11] x86: support up to 16Tb
  2013-01-22 15:20   ` Dan Magenheimer
@ 2013-01-22 15:31     ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-22 15:31 UTC (permalink / raw)
  To: xen-devel, Dan Magenheimer; +Cc: Konrad Wilk

>>> On 22.01.13 at 16:20, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Subject: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
>> 
>> Since TMEM doesn't currently cope with the full 1:1 map not always
>> being visible, it gets forcefully disabled in that case.
>> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I agree this is the correct short-term (and maybe mid-term)
> answer.  Anyone who can afford to fill their machine with
> more than 5TiB of RAM is likely not very interested in
> memory overcommit technologies :-) at least for the next
> year or three.  Cloud providers may be an exception but
> I'd imagine most of those are buying small- to mid-range
> machines to optimize cost/performance, rather than
> behemoths that expand to 5TiB+ which are highly performant
> but often not cost-effective.
> 
> Longer term, zcache in Linux (which is a tmem-based technology)
> successfully uses kmap/kunmap to run on 32-bit Linux OS's
> so I'd imagine a similar technique could be used in Xen?
> 
> In any case, thanks Jan for remembering to handle tmem.
> 
> One nit below...
> 
> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

Hmm, an ack on this patch is sort of unexpected from you; I
would have hoped you would ack patch 10...

>> +        if ( opt_tmem )
>> +        {
>> +           printk(XENLOG_WARNING "Forcing TMEM off\n");
>> +           opt_tmem = 0;
>> +        }
>> +    }
> 
> Maybe a bit more descriptive? I.e. "TMEM physical RAM limit
> exceeded, disabling TMEM".

Fine with me, patch updated.

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 10/11] tmem: partial adjustments for x86 16Tb support
  2013-01-22 10:57 ` [PATCH 10/11] tmem: partial adjustments for x86 16Tb support Jan Beulich
@ 2013-01-22 17:55   ` Dan Magenheimer
  0 siblings, 0 replies; 24+ messages in thread
From: Dan Magenheimer @ 2013-01-22 17:55 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Konrad Wilk

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 22, 2013 8:32 AM
> To: xen-devel; Dan Magenheimer
> Cc: Konrad Wilk
> Subject: RE: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
> 
> > Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
> 
> Hmm, an ack on this patch is sort of unexpected from you; I
> would have hoped you would ack patch 10...

Heh.  I was intrigued by the new domain_page_map_to_mfn()
and wanted to look deeper before acking patch 10.  So...

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: [PATCH 10/11] tmem: partial adjustments for x86 16Tb support
> 
> Despite the changes below, tmem still has code assuming to be able to
> directly access all memory, or mapping arbitrary amounts of not
> directly accessible memory. I cannot see how to fix this without
> converting _all_ its domheap allocations to xenheap ones. And even then
> I wouldn't be certain about there not being other cases where the "all
> memory is always mapped" assumption would be broken. Therefore, tmem
> gets disabled by the next patch for the time being if the full 1:1
> mapping isn't always visible.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

IIUC, all the metadata will need to be allocated from the xenheap
and all "wholepage" accesses will need some kind of wrapper.
This will get messier with compression/deduplication, but
I'm thinking it will still be doable... sometime in the future
if/when users want/need memory overcommit on huge RAM systems.
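
For illustration, such a wrapper could look roughly like the sketch below --
struct tmem_pg and tmem_copy_from_pg() are made-up names, not actual tmem
code: the descriptor ("metadata") lives in always-mapped xenheap memory,
and the domheap data page is only touched under a transient mapping.

    struct tmem_pg {                /* metadata: xenheap, always mapped */
        struct page_info *data;     /* payload: domheap, may lack a 1:1 mapping */
    };

    static void tmem_copy_from_pg(void *dst, const struct tmem_pg *pgp)
    {
        void *src = map_domain_page(page_to_mfn(pgp->data));

        memcpy(dst, src, PAGE_SIZE);
        unmap_domain_page(src);
    }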

In any case...

Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/11] x86: support up to 16Tb
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (10 preceding siblings ...)
  2013-01-22 10:58 ` [PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems Jan Beulich
@ 2013-01-22 20:13 ` Keir Fraser
  2013-01-23  9:33 ` Keir Fraser
  12 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2013-01-22 20:13 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 22/01/2013 10:45, "Jan Beulich" <JBeulich@suse.com> wrote:

> This series enables Xen to support up to 16Tb.
> 
> 01: x86: introduce virt_to_xen_l1e()
> 02: x86: extend frame table virtual space
> 03: x86: re-introduce map_domain_page() et al
> 04: x86: properly use map_domain_page() when building Dom0
> 05: x86: consolidate initialization of PV guest L4 page tables
> 06: x86: properly use map_domain_page() during domain creation/destruction
> 07: x86: properly use map_domain_page() during page table manipulation
> 08: x86: properly use map_domain_page() in nested HVM code
> 09: x86: properly use map_domain_page() in miscellaneous places
> 10: tmem: partial adjustments for x86 16Tb support
> 11: x86: support up to 16Tb

I will take a look at these tomorrow.

 -- Keir

> As I don't have a 16Tb system around, I used the following
> debugging patch to simulate the most critical aspect the changes
> above would have on a system with this much memory: Not all of
> the 1:1 mapping being accessible when in PV guest context. To do
> so, a command line option to pull the split point down is being
> added. The patch is being provided in the raw form I used it, but
> has pieces properly formatted and not marked "//temp" which I
> would think might be worth considering to add. The other pieces
> are likely less worthwhile, but if others think differently I could
> certainly also put them into "normal" shape.
> 
> 12: x86: debugging code for testing 16Tb support on smaller memory systems
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/11] x86: support up to 16Tb
  2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
                   ` (11 preceding siblings ...)
  2013-01-22 20:13 ` [PATCH 00/11] x86: support up to 16Tb Keir Fraser
@ 2013-01-23  9:33 ` Keir Fraser
  2013-01-23  9:56   ` Jan Beulich
  12 siblings, 1 reply; 24+ messages in thread
From: Keir Fraser @ 2013-01-23  9:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 22/01/2013 10:45, "Jan Beulich" <JBeulich@suse.com> wrote:

> This series enables Xen to support up to 16Tb.
> 
> 01: x86: introduce virt_to_xen_l1e()
> 02: x86: extend frame table virtual space
> 03: x86: re-introduce map_domain_page() et al
> 04: x86: properly use map_domain_page() when building Dom0
> 05: x86: consolidate initialization of PV guest L4 page tables
> 06: x86: properly use map_domain_page() during domain creation/destruction
> 07: x86: properly use map_domain_page() during page table manipulation
> 08: x86: properly use map_domain_page() in nested HVM code
> 09: x86: properly use map_domain_page() in miscellaneous places
> 10: tmem: partial adjustments for x86 16Tb support
> 11: x86: support up to 16Tb

Acked-by: Keir Fraser <keir@xen.org>

There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
needed?

> As I don't have a 16Tb system around, I used the following
> debugging patch to simulate the most critical aspect the changes
> above would have on a system with this much memory: Not all of
> the 1:1 mapping being accessible when in PV guest context. To do
> so, a command line option to pull the split point down is being
> added. The patch is being provided in the raw form I used it, but
> has pieces properly formatted and not marked "//temp" which I
> would think might be worth considering to add. The other pieces
> are likely less worthwhile, but if others think differently I could
> certainly also put them into "normal" shape.
> 
> 12: x86: debugging code for testing 16Tb support on smaller memory systems

Make split-gb a size_param, and rename to something more meaningful like
highmem_start.
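
Concretely, that would amount to something like

    unsigned long __initdata highmem_start;
    size_param("highmem-start", highmem_start);

which is what the v2 patch further down the thread ends up doing.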

 -- Keir

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/11] x86: support up to 16Tb
  2013-01-23  9:33 ` Keir Fraser
@ 2013-01-23  9:56   ` Jan Beulich
  2013-01-23 10:16     ` Keir Fraser
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-01-23  9:56 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> On 23.01.13 at 10:33, Keir Fraser <keir@xen.org> wrote:
> There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
> needed?

That's there so that, once we go beyond 16Tb, the code won't need
to change. In particular, because of the implied growth of struct
page_info, I'm envisioning such support becoming optional (to be
enabled at build time).
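
(As a hypothetical illustration of the kind of check such a guard would
cover -- not the actual patch 11 hunk: with the compact 32-bit pdx page
list links, PAGE_LIST_NULL is the sentinel link value, so the valid page
index space has to stay below it, e.g.

    #ifdef PAGE_LIST_NULL /* compact (32-bit pdx) page list links in use */
        if ( max_pdx >= PAGE_LIST_NULL )
            max_pdx = PAGE_LIST_NULL; /* keep the sentinel unreachable */
    #endif

and the guarded check simply compiles away once struct page_info grows to
hold full pointers, PAGE_LIST_NULL then no longer being defined.)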

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/11] x86: support up to 16Tb
  2013-01-23  9:56   ` Jan Beulich
@ 2013-01-23 10:16     ` Keir Fraser
  0 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2013-01-23 10:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 23/01/2013 09:56, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 23.01.13 at 10:33, Keir Fraser <keir@xen.org> wrote:
>> There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
>> needed?
> 
> That's there so that once we go beyond 16Tb the code won't need
> to change. In particular, because of the implied growth of struct
> page_info, I'm envisioning such support to become optional (to be
> enabled at build time).

Defer the ifdef until it's needed; then, when it's added, it's in the sensible
place (i.e. where PAGE_LIST_NULL really does become build-time optional).

 -- Keir

> Jan
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-22 10:58 ` [PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems Jan Beulich
@ 2013-01-23 14:26   ` Jan Beulich
  2013-01-23 15:18     ` Keir Fraser
  2013-01-24 11:36     ` Tim Deegan
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2013-01-23 14:26 UTC (permalink / raw)
  To: xen-devel

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Removed unwanted bits and switched to byte-granular "highmem-start"
    option.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -546,6 +546,12 @@ Paging (HAP).
 ### hvm\_port80
 > `= <boolean>`
 
+### highmem-start
+> `= <size>`
+
+Specify the memory boundary past which memory will be treated as highmem (x86
+debug hypervisor only).
+
 ### idle\_latency\_factor
 > `= <integer>`
 
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -66,8 +66,10 @@ void *map_domain_page(unsigned long mfn)
     struct mapcache_vcpu *vcache;
     struct vcpu_maphash_entry *hashent;
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     v = mapcache_current_vcpu();
     if ( !v || is_hvm_vcpu(v) )
@@ -249,8 +251,10 @@ int mapcache_domain_init(struct domain *
     if ( is_hvm_domain(d) || is_idle_domain(d) )
         return 0;
 
+#ifdef NDEBUG
     if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return 0;
+#endif
 
     dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
     d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf);
@@ -418,8 +422,10 @@ void *map_domain_page_global(unsigned lo
 
     ASSERT(!in_irq() && local_irq_is_enabled());
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     spin_lock(&globalmap_lock);
 
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -250,6 +250,14 @@ void __init init_frametable(void)
         init_spagetable();
 }
 
+#ifndef NDEBUG
+static unsigned int __read_mostly root_pgt_pv_xen_slots
+    = ROOT_PAGETABLE_PV_XEN_SLOTS;
+static l4_pgentry_t __read_mostly split_l4e;
+#else
+#define root_pgt_pv_xen_slots ROOT_PAGETABLE_PV_XEN_SLOTS
+#endif
+
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -344,6 +352,40 @@ void __init arch_init_memory(void)
     efi_init_memory();
 
     mem_sharing_init();
+
+#ifndef NDEBUG
+    if ( highmem_start )
+    {
+        unsigned long split_va = (unsigned long)__va(highmem_start);
+
+        if ( split_va < HYPERVISOR_VIRT_END &&
+             split_va - 1 == (unsigned long)__va(highmem_start - 1) )
+        {
+            root_pgt_pv_xen_slots = l4_table_offset(split_va) -
+                                    ROOT_PAGETABLE_FIRST_XEN_SLOT;
+            ASSERT(root_pgt_pv_xen_slots < ROOT_PAGETABLE_PV_XEN_SLOTS);
+            if ( l4_table_offset(split_va) == l4_table_offset(split_va - 1) )
+            {
+                l3_pgentry_t *l3tab = alloc_xen_pagetable();
+
+                if ( l3tab )
+                {
+                    const l3_pgentry_t *l3idle =
+                        l4e_to_l3e(idle_pg_table[l4_table_offset(split_va)]);
+
+                    for ( i = 0; i < l3_table_offset(split_va); ++i )
+                        l3tab[i] = l3idle[i];
+                    for ( ; i < L3_PAGETABLE_ENTRIES; ++i )
+                        l3tab[i] = l3e_empty();
+                    split_l4e = l4e_from_pfn(virt_to_mfn(l3tab),
+                                             __PAGE_HYPERVISOR);
+                }
+                else
+                    ++root_pgt_pv_xen_slots;
+            }
+        }
+    }
+#endif
 }
 
 int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
@@ -1320,7 +1362,12 @@ void init_guest_l4_table(l4_pgentry_t l4
     /* Xen private mappings. */
     memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t));
+           root_pgt_pv_xen_slots * sizeof(l4_pgentry_t));
+#ifndef NDEBUG
+    if ( l4e_get_intpte(split_l4e) )
+        l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT + root_pgt_pv_xen_slots] =
+            split_l4e;
+#endif
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
 s8 __read_mostly xen_cpuidle = -1;
 boolean_param("cpuidle", xen_cpuidle);
 
+#ifndef NDEBUG
+unsigned long __initdata highmem_start;
+size_param("highmem-start", highmem_start);
+#endif
+
 cpumask_t __read_mostly cpu_present_map;
 
 unsigned long __read_mostly xen_phys_start;
@@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
     modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
     bootstrap_map(NULL);
 
+#ifndef highmem_start
+    /* Don't allow split below 4Gb. */
+    if ( highmem_start < GB(4) )
+        highmem_start = 0;
+    else /* align to L3 entry boundary */
+        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
+#endif
+
     for ( i = boot_e820.nr_map-1; i >= 0; i-- )
     {
         uint64_t s, e, mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
@@ -915,6 +928,9 @@ void __init __start_xen(unsigned long mb
             /* Don't overlap with other modules. */
             end = consider_modules(s, e, size, mod, mbi->mods_count, j);
 
+            if ( highmem_start && end > highmem_start )
+                continue;
+
             if ( s < end &&
                  (headroom ||
                   ((end - size) >> PAGE_SHIFT) > mod[j].mod_start) )
@@ -956,6 +972,8 @@ void __init __start_xen(unsigned long mb
     kexec_reserve_area(&boot_e820);
 
     setup_max_pdx();
+    if ( highmem_start )
+        xenheap_max_mfn(PFN_DOWN(highmem_start));
 
     /*
      * Walk every RAM region and map it in its entirety (on x86/64, at least)
@@ -1127,7 +1145,8 @@ void __init __start_xen(unsigned long mb
         unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
         uint64_t mask = PAGE_SIZE - 1;
 
-        xenheap_max_mfn(limit);
+        if ( !highmem_start )
+            xenheap_max_mfn(limit);
 
         /* Pass the remaining memory to the allocator. */
         for ( i = 0; i < boot_e820.nr_map; i++ )
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -45,6 +45,7 @@
 #include <asm/flushtlb.h>
 #ifdef CONFIG_X86
 #include <asm/p2m.h>
+#include <asm/setup.h> /* for highmem_start only */
 #else
 #define p2m_pod_offline_or_broken_hit(pg) 0
 #define p2m_pod_offline_or_broken_replace(pg) BUG_ON(pg != NULL)
@@ -203,6 +204,25 @@ unsigned long __init alloc_boot_pages(
         pg = (r->e - nr_pfns) & ~(pfn_align - 1);
         if ( pg < r->s )
             continue;
+
+#if defined(CONFIG_X86) && !defined(NDEBUG)
+        /*
+         * Filtering pfn_align == 1 since the only allocations using a bigger
+         * alignment are the ones used for setting up the frame table chunks.
+         * Those allocations get remapped anyway, i.e. them not having 1:1
+         * mappings always accessible is not a problem.
+         */
+        if ( highmem_start && pfn_align == 1 &&
+             r->e > PFN_DOWN(highmem_start) )
+        {
+            pg = r->s;
+            if ( pg + nr_pfns > PFN_DOWN(highmem_start) )
+                continue;
+            r->s = pg + nr_pfns;
+            return pg;
+        }
+#endif
+
         _e = r->e;
         r->e = pg;
         bootmem_region_add(pg + nr_pfns, _e);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -43,4 +43,10 @@ void microcode_grab_module(
 
 extern uint8_t kbd_shift_flags;
 
+#ifdef NDEBUG
+# define highmem_start 0
+#else
+extern unsigned long highmem_start;
+#endif
+
 #endif
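
For illustration (not part of the posted mail): on a debug build the new
option is a size_param, so the split point would be exercised simply by
appending something like

    highmem-start=8G

to the hypervisor command line; per the code above it is rounded down to a
1Gb (L3 entry) boundary and ignored if below 4Gb.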




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-23 14:26   ` [PATCH v2] " Jan Beulich
@ 2013-01-23 15:18     ` Keir Fraser
  2013-01-24 11:36     ` Tim Deegan
  1 sibling, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2013-01-23 15:18 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 23/01/2013 14:26, "Jan Beulich" <JBeulich@suse.com> wrote:

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Keir Fraser <keir@xen.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-23 14:26   ` [PATCH v2] " Jan Beulich
  2013-01-23 15:18     ` Keir Fraser
@ 2013-01-24 11:36     ` Tim Deegan
  2013-01-24 12:23       ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Tim Deegan @ 2013-01-24 11:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
>  s8 __read_mostly xen_cpuidle = -1;
>  boolean_param("cpuidle", xen_cpuidle);
>  
> +#ifndef NDEBUG
> +unsigned long __initdata highmem_start;
> +size_param("highmem-start", highmem_start);
> +#endif
> +
>  cpumask_t __read_mostly cpu_present_map;
>  
>  unsigned long __read_mostly xen_phys_start;
> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
>      bootstrap_map(NULL);
>  
> +#ifndef highmem_start
> +    /* Don't allow split below 4Gb. */
> +    if ( highmem_start < GB(4) )
> +        highmem_start = 0;
> +    else /* align to L3 entry boundary */
> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
> +#endif

DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
macro is strictly correct but it seems more vulnerable to later changes,
esp. since this:

> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -43,4 +43,10 @@ void microcode_grab_module(
>  
>  extern uint8_t kbd_shift_flags;
>  
> +#ifdef NDEBUG
> +# define highmem_start 0
> +#else
> +extern unsigned long highmem_start;
> +#endif

happens so far away. 

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-24 11:36     ` Tim Deegan
@ 2013-01-24 12:23       ` Jan Beulich
  2013-01-24 12:36         ` Tim Deegan
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-01-24 12:23 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

>>> On 24.01.13 at 12:36, Tim Deegan <tim@xen.org> wrote:
> At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
>>  s8 __read_mostly xen_cpuidle = -1;
>>  boolean_param("cpuidle", xen_cpuidle);
>>  
>> +#ifndef NDEBUG
>> +unsigned long __initdata highmem_start;
>> +size_param("highmem-start", highmem_start);
>> +#endif
>> +
>>  cpumask_t __read_mostly cpu_present_map;
>>  
>>  unsigned long __read_mostly xen_phys_start;
>> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
>>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
>>      bootstrap_map(NULL);
>>  
>> +#ifndef highmem_start
>> +    /* Don't allow split below 4Gb. */
>> +    if ( highmem_start < GB(4) )
>> +        highmem_start = 0;
>> +    else /* align to L3 entry boundary */
>> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
>> +#endif
> 
> DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
> macro is strictly correct

I intended it to be that way, because there could be other uses
for having the symbol #define-d/real.

> but it seems more vulnerable to later changes,
> esp. since this:
> 
>> --- a/xen/include/asm-x86/setup.h
>> +++ b/xen/include/asm-x86/setup.h
>> @@ -43,4 +43,10 @@ void microcode_grab_module(
>>  
>>  extern uint8_t kbd_shift_flags;
>>  
>> +#ifdef NDEBUG
>> +# define highmem_start 0
>> +#else
>> +extern unsigned long highmem_start;
>> +#endif
> 
> happens so far away. 

I realize that, but the two getting out of sync is not a problem the
way it is coded now. The distance between the two would really be
more of a problem imo if the condition here got changed (which
would then also require changing it up there).

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
  2013-01-24 12:23       ` Jan Beulich
@ 2013-01-24 12:36         ` Tim Deegan
  0 siblings, 0 replies; 24+ messages in thread
From: Tim Deegan @ 2013-01-24 12:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

At 12:23 +0000 on 24 Jan (1359030221), Jan Beulich wrote:
> >>> On 24.01.13 at 12:36, Tim Deegan <tim@xen.org> wrote:
> > At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
> >>  s8 __read_mostly xen_cpuidle = -1;
> >>  boolean_param("cpuidle", xen_cpuidle);
> >>  
> >> +#ifndef NDEBUG
> >> +unsigned long __initdata highmem_start;
> >> +size_param("highmem-start", highmem_start);
> >> +#endif
> >> +
> >>  cpumask_t __read_mostly cpu_present_map;
> >>  
> >>  unsigned long __read_mostly xen_phys_start;
> >> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
> >>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
> >>      bootstrap_map(NULL);
> >>  
> >> +#ifndef highmem_start
> >> +    /* Don't allow split below 4Gb. */
> >> +    if ( highmem_start < GB(4) )
> >> +        highmem_start = 0;
> >> +    else /* align to L3 entry boundary */
> >> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
> >> +#endif
> > 
> > DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
> > macro is strictly correct
> 
> I intended it to be that way, because there could be other uses
> for having the symbol #define-d/real.

Yes - but if it ever ends up being a #define _and_ user-settable, these
checks will silently disappear.  Since there's no indication in the
places where you might make it a #define that doing so will remove these
checks, I'd be inclined to leave it gated on NDEBUG so it'll fail in an
obvious way.

Or add a #define CONFIG_HIGHMEM_START (defaulting to !NDEBUG), and gate
everything on that?
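
(Roughly, as a sketch of that alternative under assumed placement -- not
code from the thread:

    /* e.g. in xen/include/asm-x86/config.h */
    #ifndef NDEBUG
    # define CONFIG_HIGHMEM_START 1
    #endif

    /* xen/include/asm-x86/setup.h */
    #ifdef CONFIG_HIGHMEM_START
    extern unsigned long highmem_start;
    #else
    # define highmem_start 0
    #endif

so the option registration and the 4Gb/alignment check in __start_xen()
would all gate on the one obvious CONFIG_ symbol.)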

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-01-24 12:36 UTC | newest]

Thread overview: 24+ messages
2013-01-22 10:45 [PATCH 00/11] x86: support up to 16Tb Jan Beulich
2013-01-22 10:50 ` [PATCH 02/11] x86: extend frame table virtual space Jan Beulich
2013-01-22 10:50 ` [PATCH 03/11] x86: re-introduce map_domain_page() et al Jan Beulich
2013-01-22 10:51 ` [PATCH 04/11] x86: properly use map_domain_page() when building Dom0 Jan Beulich
2013-01-22 10:52 ` [PATCH 05/11] x86: consolidate initialization of PV guest L4 page tables Jan Beulich
2013-01-22 10:53 ` [PATCH 06/11] x86: properly use map_domain_page() during domain creation/destruction Jan Beulich
2013-01-22 10:55 ` [PATCH 07/11] x86: properly use map_domain_page() during page table manipulation Jan Beulich
2013-01-22 10:55 ` [PATCH 08/11] x86: properly use map_domain_page() in nested HVM code Jan Beulich
2013-01-22 10:56 ` [PATCH 09/11] x86: properly use map_domain_page() in miscellaneous places Jan Beulich
2013-01-22 10:57 ` [PATCH 10/11] tmem: partial adjustments for x86 16Tb support Jan Beulich
2013-01-22 17:55   ` Dan Magenheimer
2013-01-22 10:57 ` [PATCH 11/11] x86: support up to 16Tb Jan Beulich
2013-01-22 15:20   ` Dan Magenheimer
2013-01-22 15:31     ` Jan Beulich
2013-01-22 10:58 ` [PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems Jan Beulich
2013-01-23 14:26   ` [PATCH v2] " Jan Beulich
2013-01-23 15:18     ` Keir Fraser
2013-01-24 11:36     ` Tim Deegan
2013-01-24 12:23       ` Jan Beulich
2013-01-24 12:36         ` Tim Deegan
2013-01-22 20:13 ` [PATCH 00/11] x86: support up to 16Tb Keir Fraser
2013-01-23  9:33 ` Keir Fraser
2013-01-23  9:56   ` Jan Beulich
2013-01-23 10:16     ` Keir Fraser
