* [PATCH v2 0/4] Refactor x86 dom0 builder
@ 2017-03-16 17:54 Wei Liu
  2017-03-16 17:54 ` [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c Wei Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-16 17:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Wei Liu (4):
  x86: rename domain_build.c to dom0_build.c
  x86: split PV dom0 builder to pv/dom0_build.c
  x86: split PVH dom0 builder to hvm/dom0_build.c
  x86: clean up header files in dom0_build.c

 xen/arch/x86/Makefile            |    2 +-
 xen/arch/x86/dom0_build.c        |  470 ++++++++
 xen/arch/x86/domain_build.c      | 2438 --------------------------------------
 xen/arch/x86/hvm/Makefile        |    1 +
 xen/arch/x86/hvm/dom0_build.c    | 1112 +++++++++++++++++
 xen/arch/x86/pv/Makefile         |    1 +
 xen/arch/x86/pv/dom0_build.c     |  910 ++++++++++++++
 xen/include/asm-x86/dom0_build.h |   41 +
 8 files changed, 2536 insertions(+), 2439 deletions(-)
 create mode 100644 xen/arch/x86/dom0_build.c
 delete mode 100644 xen/arch/x86/domain_build.c
 create mode 100644 xen/arch/x86/hvm/dom0_build.c
 create mode 100644 xen/arch/x86/pv/dom0_build.c
 create mode 100644 xen/include/asm-x86/dom0_build.h
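
For orientation, the common entry point stays in xen/arch/x86/dom0_build.c and
ends up dispatching to the per-guest-type builder. A rough sketch of its shape
after patch 2 (the dispatch line is taken from the hunks below; the elided body
and the exact parameter list are assumptions, and patch 3 moves the PVH side
out in the same way):

    /* xen/arch/x86/dom0_build.c -- illustrative sketch, not verbatim */
    int __init construct_dom0(struct domain *d, const module_t *image,
                              unsigned long image_headroom, module_t *initrd,
                              void *(*bootstrap_map)(const module_t *),
                              char *cmdline)
    {
        /* ... common sanity checks and setup elided ... */
        process_pending_softirqs();

        /* PV and PVH dom0 construction now live in separate files. */
        return (is_hvm_domain(d) ? construct_dom0_pvh : dom0_construct_pv)
               (d, image, image_headroom, initrd, bootstrap_map, cmdline);
    }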

-- 
2.11.0



* [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c
  2017-03-16 17:54 [PATCH v2 0/4] Refactor x86 dom0 builder Wei Liu
@ 2017-03-16 17:54 ` Wei Liu
  2017-03-16 18:00   ` Andrew Cooper
  2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_build.c Wei Liu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Wei Liu @ 2017-03-16 17:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Rename it to reflect the true nature of this file. No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
 xen/arch/x86/Makefile                         | 2 +-
 xen/arch/x86/{domain_build.c => dom0_build.c} | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
 rename xen/arch/x86/{domain_build.c => dom0_build.c} (99%)

diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index f75eca07f7..adc768f0a8 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -23,7 +23,7 @@ obj-y += delay.o
 obj-bin-y += dmi_scan.init.o
 obj-y += domctl.o
 obj-y += domain.o
-obj-bin-y += domain_build.init.o
+obj-bin-y += dom0_build.init.o
 obj-y += domain_page.o
 obj-y += e820.o
 obj-y += extable.o
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/dom0_build.c
similarity index 99%
rename from xen/arch/x86/domain_build.c
rename to xen/arch/x86/dom0_build.c
index 00ce5f24e9..1c723c9ef1 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -1,5 +1,5 @@
 /******************************************************************************
- * domain_build.c
+ * dom0_build.c
  * 
  * Copyright (c) 2002-2005, K A Fraser
  */
-- 
2.11.0



* [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_build.c
  2017-03-16 17:54 [PATCH v2 0/4] Refactor x86 dom0 builder Wei Liu
  2017-03-16 17:54 ` [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c Wei Liu
@ 2017-03-16 17:54 ` Wei Liu
  2017-03-16 18:12   ` Roger Pau Monné
                     ` (2 more replies)
  2017-03-16 17:54 ` [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c Wei Liu
  2017-03-16 17:54 ` [PATCH v2 4/4] x86: clean up header files in dom0_build.c Wei Liu
  3 siblings, 3 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-16 17:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

In the long term we want to be able to disentangle PV and HVM code. Move
the PV domain builder to a dedicated file.

This in turn requires exposing a few functions and variables via a new
header, dom0_build.h. These functions and variables are now prefixed with
"dom0_" if they weren't already.

No functional change.
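
For reference, a rough sketch of the kind of declarations the new
asm-x86/dom0_build.h carries (the exact header contents are not reproduced in
this mail; names and signatures are taken from the hunks below, while the
include list and guard are assumptions):

    /* xen/include/asm-x86/dom0_build.h -- illustrative sketch only */
    #ifndef _ASM_X86_DOM0_BUILD_H_
    #define _ASM_X86_DOM0_BUILD_H_

    #include <xen/libelf.h>
    #include <xen/multiboot.h>
    #include <xen/sched.h>

    /* Formerly static in domain_build.c, now shared with the PV builder. */
    extern unsigned int dom0_memflags;
    extern cpumask_t dom0_cpus;

    unsigned long dom0_compute_nr_pages(struct domain *d,
                                        struct elf_dom_parms *parms,
                                        unsigned long initrd_len);
    struct vcpu *dom0_setup_vcpu(struct domain *d, unsigned int vcpu_id,
                                 unsigned int cpu);
    int dom0_setup_permissions(struct domain *d);

    /* Entry point of the PV builder, now in pv/dom0_build.c. */
    int dom0_construct_pv(struct domain *d, const module_t *image,
                          unsigned long image_headroom, module_t *initrd,
                          void *(*bootstrap_map)(const module_t *),
                          char *cmdline);

    #endif /* _ASM_X86_DOM0_BUILD_H_ */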

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
v2:
1. put file under pv directory
2. use dom0_ prefix
3. header now in asm-x86
---
 xen/arch/x86/dom0_build.c        | 911 +--------------------------------------
 xen/arch/x86/pv/Makefile         |   1 +
 xen/arch/x86/pv/dom0_build.c     | 910 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/dom0_build.h |  32 ++
 4 files changed, 960 insertions(+), 894 deletions(-)
 create mode 100644 xen/arch/x86/pv/dom0_build.c
 create mode 100644 xen/include/asm-x86/dom0_build.h

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 1c723c9ef1..c74f9b52e0 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -28,6 +28,7 @@
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/desc.h>
+#include <asm/dom0_build.h>
 #include <asm/i387.h>
 #include <asm/paging.h>
 #include <asm/p2m.h>
@@ -154,11 +155,11 @@ static void __init parse_dom0_nodes(const char *s)
 }
 custom_param("dom0_nodes", parse_dom0_nodes);
 
-static cpumask_t __initdata dom0_cpus;
+cpumask_t __initdata dom0_cpus;
 
-static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
-                                           unsigned int vcpu_id,
-                                           unsigned int cpu)
+struct vcpu *__init dom0_setup_vcpu(struct domain *d,
+                                    unsigned int vcpu_id,
+                                    unsigned int cpu)
 {
     struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
 
@@ -215,7 +216,7 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
+    return dom0_setup_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -258,66 +259,7 @@ string_param("dom0_ioports_disable", opt_dom0_ioports_disable);
 static bool_t __initdata ro_hpet = 1;
 boolean_param("ro-hpet", ro_hpet);
 
-/* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
-#define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
-#define L1_PROT (BASE_PROT|_PAGE_GUEST_KERNEL)
-/* ... except for compatibility mode guests. */
-#define COMPAT_L1_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED)
-#define L2_PROT (BASE_PROT|_PAGE_DIRTY)
-#define L3_PROT (BASE_PROT|_PAGE_DIRTY)
-#define L4_PROT (BASE_PROT|_PAGE_DIRTY)
-
-static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
-
-static struct page_info * __init alloc_chunk(
-    struct domain *d, unsigned long max_pages)
-{
-    static unsigned int __initdata last_order = MAX_ORDER;
-    struct page_info *page;
-    unsigned int order = get_order_from_pages(max_pages), free_order;
-
-    if ( order > last_order )
-        order = last_order;
-    else if ( max_pages & (max_pages - 1) )
-        --order;
-    while ( (page = alloc_domheap_pages(d, order, memflags)) == NULL )
-        if ( order-- == 0 )
-            break;
-    if ( page )
-        last_order = order;
-    else if ( memflags )
-    {
-        /*
-         * Allocate up to 2MB at a time: It prevents allocating very large
-         * chunks from DMA pools before the >4GB pool is fully depleted.
-         */
-        last_order = 21 - PAGE_SHIFT;
-        memflags = 0;
-        return alloc_chunk(d, max_pages);
-    }
-
-    /*
-     * Make a reasonable attempt at finding a smaller chunk at a higher
-     * address, to avoid allocating from low memory as much as possible.
-     */
-    for ( free_order = order; !memflags && page && order--; )
-    {
-        struct page_info *pg2;
-
-        if ( d->tot_pages + (1 << order) > d->max_pages )
-            continue;
-        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
-        if ( pg2 > page )
-        {
-            free_domheap_pages(page, free_order);
-            page = pg2;
-            free_order = order;
-        }
-        else if ( pg2 )
-            free_domheap_pages(pg2, order);
-    }
-    return page;
-}
+unsigned int __initdata dom0_memflags = MEMF_no_dma|MEMF_exact_node;
 
 static unsigned long __init dom0_paging_pages(const struct domain *d,
                                               unsigned long nr_pages)
@@ -330,7 +272,7 @@ static unsigned long __init dom0_paging_pages(const struct domain *d,
     return ((memkb + 1023) / 1024) << (20 - PAGE_SHIFT);
 }
 
-static unsigned long __init compute_dom0_nr_pages(
+unsigned long __init dom0_compute_nr_pages(
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
     nodeid_t node;
@@ -467,199 +409,7 @@ static void __init process_dom0_ioports_disable(struct domain *dom0)
     }
 }
 
-static __init void dom0_update_physmap(struct domain *d, unsigned long pfn,
-                                   unsigned long mfn, unsigned long vphysmap_s)
-{
-    if ( !is_pv_32bit_domain(d) )
-        ((unsigned long *)vphysmap_s)[pfn] = mfn;
-    else
-        ((unsigned int *)vphysmap_s)[pfn] = mfn;
-
-    set_gpfn_from_mfn(mfn, pfn);
-}
-
-static __init void mark_pv_pt_pages_rdonly(struct domain *d,
-                                           l4_pgentry_t *l4start,
-                                           unsigned long vpt_start,
-                                           unsigned long nr_pt_pages)
-{
-    unsigned long count;
-    struct page_info *page;
-    l4_pgentry_t *pl4e;
-    l3_pgentry_t *pl3e;
-    l2_pgentry_t *pl2e;
-    l1_pgentry_t *pl1e;
-
-    pl4e = l4start + l4_table_offset(vpt_start);
-    pl3e = l4e_to_l3e(*pl4e);
-    pl3e += l3_table_offset(vpt_start);
-    pl2e = l3e_to_l2e(*pl3e);
-    pl2e += l2_table_offset(vpt_start);
-    pl1e = l2e_to_l1e(*pl2e);
-    pl1e += l1_table_offset(vpt_start);
-    for ( count = 0; count < nr_pt_pages; count++ )
-    {
-        l1e_remove_flags(*pl1e, _PAGE_RW);
-        page = mfn_to_page(l1e_get_pfn(*pl1e));
-
-        /* Read-only mapping + PGC_allocated + page-table page. */
-        page->count_info         = PGC_allocated | 3;
-        page->u.inuse.type_info |= PGT_validated | 1;
-
-        /* Top-level p.t. is pinned. */
-        if ( (page->u.inuse.type_info & PGT_type_mask) ==
-             (!is_pv_32bit_domain(d) ?
-              PGT_l4_page_table : PGT_l3_page_table) )
-        {
-            page->count_info        += 1;
-            page->u.inuse.type_info += 1 | PGT_pinned;
-        }
-
-        /* Iterate. */
-        if ( !((unsigned long)++pl1e & (PAGE_SIZE - 1)) )
-        {
-            if ( !((unsigned long)++pl2e & (PAGE_SIZE - 1)) )
-            {
-                if ( !((unsigned long)++pl3e & (PAGE_SIZE - 1)) )
-                    pl3e = l4e_to_l3e(*++pl4e);
-                pl2e = l3e_to_l2e(*pl3e);
-            }
-            pl1e = l2e_to_l1e(*pl2e);
-        }
-    }
-}
-
-static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
-                                    unsigned long v_start, unsigned long v_end,
-                                    unsigned long vphysmap_start,
-                                    unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
-{
-    struct page_info *page = NULL;
-    l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
-    l3_pgentry_t *pl3e = NULL;
-    l2_pgentry_t *pl2e = NULL;
-    l1_pgentry_t *pl1e = NULL;
-
-    if ( v_start <= vphysmap_end && vphysmap_start <= v_end )
-        panic("DOM0 P->M table overlaps initial mapping");
-
-    while ( vphysmap_start < vphysmap_end )
-    {
-        if ( d->tot_pages + ((round_pgup(vphysmap_end) - vphysmap_start)
-                             >> PAGE_SHIFT) + 3 > nr_pages )
-            panic("Dom0 allocation too small for initial P->M table");
-
-        if ( pl1e )
-        {
-            unmap_domain_page(pl1e);
-            pl1e = NULL;
-        }
-        if ( pl2e )
-        {
-            unmap_domain_page(pl2e);
-            pl2e = NULL;
-        }
-        if ( pl3e )
-        {
-            unmap_domain_page(pl3e);
-            pl3e = NULL;
-        }
-        pl4e = l4start + l4_table_offset(vphysmap_start);
-        if ( !l4e_get_intpte(*pl4e) )
-        {
-            page = alloc_domheap_page(d, 0);
-            if ( !page )
-                break;
-
-            /* No mapping, PGC_allocated + page-table page. */
-            page->count_info = PGC_allocated | 2;
-            page->u.inuse.type_info = PGT_l3_page_table | PGT_validated | 1;
-            pl3e = __map_domain_page(page);
-            clear_page(pl3e);
-            *pl4e = l4e_from_page(page, L4_PROT);
-        } else
-            pl3e = map_domain_page(_mfn(l4e_get_pfn(*pl4e)));
-
-        pl3e += l3_table_offset(vphysmap_start);
-        if ( !l3e_get_intpte(*pl3e) )
-        {
-            if ( cpu_has_page1gb &&
-                 !(vphysmap_start & ((1UL << L3_PAGETABLE_SHIFT) - 1)) &&
-                 vphysmap_end >= vphysmap_start + (1UL << L3_PAGETABLE_SHIFT) &&
-                 (page = alloc_domheap_pages(d,
-                                             L3_PAGETABLE_SHIFT - PAGE_SHIFT,
-                                             0)) != NULL )
-            {
-                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
-                vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
-                continue;
-            }
-            if ( (page = alloc_domheap_page(d, 0)) == NULL )
-                break;
-
-            /* No mapping, PGC_allocated + page-table page. */
-            page->count_info = PGC_allocated | 2;
-            page->u.inuse.type_info = PGT_l2_page_table | PGT_validated | 1;
-            pl2e = __map_domain_page(page);
-            clear_page(pl2e);
-            *pl3e = l3e_from_page(page, L3_PROT);
-        }
-        else
-            pl2e = map_domain_page(_mfn(l3e_get_pfn(*pl3e)));
-
-        pl2e += l2_table_offset(vphysmap_start);
-        if ( !l2e_get_intpte(*pl2e) )
-        {
-            if ( !(vphysmap_start & ((1UL << L2_PAGETABLE_SHIFT) - 1)) &&
-                 vphysmap_end >= vphysmap_start + (1UL << L2_PAGETABLE_SHIFT) &&
-                 (page = alloc_domheap_pages(d,
-                                             L2_PAGETABLE_SHIFT - PAGE_SHIFT,
-                                             0)) != NULL )
-            {
-                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
-                if ( opt_allow_superpage )
-                    get_superpage(page_to_mfn(page), d);
-                vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
-                continue;
-            }
-            if ( (page = alloc_domheap_page(d, 0)) == NULL )
-                break;
-
-            /* No mapping, PGC_allocated + page-table page. */
-            page->count_info = PGC_allocated | 2;
-            page->u.inuse.type_info = PGT_l1_page_table | PGT_validated | 1;
-            pl1e = __map_domain_page(page);
-            clear_page(pl1e);
-            *pl2e = l2e_from_page(page, L2_PROT);
-        }
-        else
-            pl1e = map_domain_page(_mfn(l2e_get_pfn(*pl2e)));
-
-        pl1e += l1_table_offset(vphysmap_start);
-        BUG_ON(l1e_get_intpte(*pl1e));
-        page = alloc_domheap_page(d, 0);
-        if ( !page )
-            break;
-
-        *pl1e = l1e_from_page(page, L1_PROT|_PAGE_DIRTY);
-        vphysmap_start += PAGE_SIZE;
-        vphysmap_start &= PAGE_MASK;
-    }
-    if ( !page )
-        panic("Not enough RAM for DOM0 P->M table");
-
-    if ( pl1e )
-        unmap_domain_page(pl1e);
-    if ( pl2e )
-        unmap_domain_page(pl2e);
-    if ( pl3e )
-        unmap_domain_page(pl3e);
-
-    unmap_domain_page(l4start);
-}
-
-static int __init setup_permissions(struct domain *d)
+int __init dom0_setup_permissions(struct domain *d)
 {
     unsigned long mfn;
     unsigned int i;
@@ -741,633 +491,6 @@ static int __init setup_permissions(struct domain *d)
     return rc;
 }
 
-static int __init construct_dom0_pv(
-    struct domain *d,
-    const module_t *image, unsigned long image_headroom,
-    module_t *initrd,
-    void *(*bootstrap_map)(const module_t *),
-    char *cmdline)
-{
-    int i, cpu, rc, compatible, compat32, order, machine;
-    struct cpu_user_regs *regs;
-    unsigned long pfn, mfn;
-    unsigned long nr_pages;
-    unsigned long nr_pt_pages;
-    unsigned long alloc_spfn;
-    unsigned long alloc_epfn;
-    unsigned long initrd_pfn = -1, initrd_mfn = 0;
-    unsigned long count;
-    struct page_info *page = NULL;
-    start_info_t *si;
-    struct vcpu *v = d->vcpu[0];
-    unsigned long long value;
-    void *image_base = bootstrap_map(image);
-    unsigned long image_len = image->mod_end;
-    void *image_start = image_base + image_headroom;
-    unsigned long initrd_len = initrd ? initrd->mod_end : 0;
-    l4_pgentry_t *l4tab = NULL, *l4start = NULL;
-    l3_pgentry_t *l3tab = NULL, *l3start = NULL;
-    l2_pgentry_t *l2tab = NULL, *l2start = NULL;
-    l1_pgentry_t *l1tab = NULL, *l1start = NULL;
-
-    /*
-     * This fully describes the memory layout of the initial domain. All 
-     * *_start address are page-aligned, except v_start (and v_end) which are 
-     * superpage-aligned.
-     */
-    struct elf_binary elf;
-    struct elf_dom_parms parms;
-    unsigned long vkern_start;
-    unsigned long vkern_end;
-    unsigned long vinitrd_start;
-    unsigned long vinitrd_end;
-    unsigned long vphysmap_start;
-    unsigned long vphysmap_end;
-    unsigned long vstartinfo_start;
-    unsigned long vstartinfo_end;
-    unsigned long vstack_start;
-    unsigned long vstack_end;
-    unsigned long vpt_start;
-    unsigned long vpt_end;
-    unsigned long v_start;
-    unsigned long v_end;
-
-    /* Machine address of next candidate page-table page. */
-    paddr_t mpt_alloc;
-
-    printk("*** LOADING DOMAIN 0 ***\n");
-
-    d->max_pages = ~0U;
-
-    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
-        return rc;
-
-    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
-        return rc;
-#ifdef CONFIG_VERBOSE_DEBUG
-    elf_set_verbose(&elf);
-#endif
-    elf_parse_binary(&elf);
-    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
-        goto out;
-
-    /* compatibility check */
-    compatible = 0;
-    compat32   = 0;
-    machine = elf_uval(&elf, elf.ehdr, e_machine);
-    printk(" Xen  kernel: 64-bit, lsb, compat32\n");
-    if (elf_32bit(&elf) && parms.pae == XEN_PAE_BIMODAL)
-        parms.pae = XEN_PAE_EXTCR3;
-    if (elf_32bit(&elf) && parms.pae && machine == EM_386)
-    {
-        compat32 = 1;
-        compatible = 1;
-    }
-    if (elf_64bit(&elf) && machine == EM_X86_64)
-        compatible = 1;
-    printk(" Dom0 kernel: %s%s, %s, paddr %#" PRIx64 " -> %#" PRIx64 "\n",
-           elf_64bit(&elf) ? "64-bit" : "32-bit",
-           parms.pae       ? ", PAE"  : "",
-           elf_msb(&elf)   ? "msb"    : "lsb",
-           elf.pstart, elf.pend);
-    if ( elf.bsd_symtab_pstart )
-        printk(" Dom0 symbol map %#" PRIx64 " -> %#" PRIx64 "\n",
-               elf.bsd_symtab_pstart, elf.bsd_symtab_pend);
-
-    if ( !compatible )
-    {
-        printk("Mismatch between Xen and DOM0 kernel\n");
-        rc = -EINVAL;
-        goto out;
-    }
-
-    if ( parms.elf_notes[XEN_ELFNOTE_SUPPORTED_FEATURES].type != XEN_ENT_NONE )
-    {
-        if ( !test_bit(XENFEAT_dom0, parms.f_supported) )
-        {
-            printk("Kernel does not support Dom0 operation\n");
-            rc = -EINVAL;
-            goto out;
-        }
-    }
-
-    if ( compat32 )
-    {
-        d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 1;
-        v->vcpu_info = (void *)&d->shared_info->compat.vcpu_info[0];
-        if ( setup_compat_arg_xlat(v) != 0 )
-            BUG();
-    }
-
-    nr_pages = compute_dom0_nr_pages(d, &parms, initrd_len);
-
-    if ( parms.pae == XEN_PAE_EXTCR3 )
-            set_bit(VMASST_TYPE_pae_extended_cr3, &d->vm_assist);
-
-    if ( (parms.virt_hv_start_low != UNSET_ADDR) && elf_32bit(&elf) )
-    {
-        unsigned long mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
-        value = (parms.virt_hv_start_low + mask) & ~mask;
-        BUG_ON(!is_pv_32bit_domain(d));
-        if ( value > __HYPERVISOR_COMPAT_VIRT_START )
-            panic("Domain 0 expects too high a hypervisor start address");
-        HYPERVISOR_COMPAT_VIRT_START(d) =
-            max_t(unsigned int, m2p_compat_vstart, value);
-    }
-
-    if ( (parms.p2m_base != UNSET_ADDR) && elf_32bit(&elf) )
-    {
-        printk(XENLOG_WARNING "P2M table base ignored\n");
-        parms.p2m_base = UNSET_ADDR;
-    }
-
-    domain_set_alloc_bitsize(d);
-
-    /*
-     * Why do we need this? The number of page-table frames depends on the 
-     * size of the bootstrap address space. But the size of the address space 
-     * depends on the number of page-table frames (since each one is mapped 
-     * read-only). We have a pair of simultaneous equations in two unknowns, 
-     * which we solve by exhaustive search.
-     */
-    v_start          = parms.virt_base;
-    vkern_start      = parms.virt_kstart;
-    vkern_end        = parms.virt_kend;
-    if ( parms.unmapped_initrd )
-    {
-        vinitrd_start  = vinitrd_end = 0;
-        vphysmap_start = round_pgup(vkern_end);
-    }
-    else
-    {
-        vinitrd_start  = round_pgup(vkern_end);
-        vinitrd_end    = vinitrd_start + initrd_len;
-        vphysmap_start = round_pgup(vinitrd_end);
-    }
-    vphysmap_end     = vphysmap_start + (nr_pages * (!is_pv_32bit_domain(d) ?
-                                                     sizeof(unsigned long) :
-                                                     sizeof(unsigned int)));
-    if ( parms.p2m_base != UNSET_ADDR )
-        vphysmap_end = vphysmap_start;
-    vstartinfo_start = round_pgup(vphysmap_end);
-    vstartinfo_end   = (vstartinfo_start +
-                        sizeof(struct start_info) +
-                        sizeof(struct dom0_vga_console_info));
-
-    vpt_start        = round_pgup(vstartinfo_end);
-    for ( nr_pt_pages = 2; ; nr_pt_pages++ )
-    {
-        vpt_end          = vpt_start + (nr_pt_pages * PAGE_SIZE);
-        vstack_start     = vpt_end;
-        vstack_end       = vstack_start + PAGE_SIZE;
-        v_end            = (vstack_end + (1UL<<22)-1) & ~((1UL<<22)-1);
-        if ( (v_end - vstack_end) < (512UL << 10) )
-            v_end += 1UL << 22; /* Add extra 4MB to get >= 512kB padding. */
-#define NR(_l,_h,_s) \
-    (((((_h) + ((1UL<<(_s))-1)) & ~((1UL<<(_s))-1)) - \
-       ((_l) & ~((1UL<<(_s))-1))) >> (_s))
-        if ( (!is_pv_32bit_domain(d) + /* # L4 */
-              NR(v_start, v_end, L4_PAGETABLE_SHIFT) + /* # L3 */
-              (!is_pv_32bit_domain(d) ?
-               NR(v_start, v_end, L3_PAGETABLE_SHIFT) : /* # L2 */
-               4) + /* # compat L2 */
-              NR(v_start, v_end, L2_PAGETABLE_SHIFT))  /* # L1 */
-             <= nr_pt_pages )
-            break;
-    }
-
-    count = v_end - v_start;
-    if ( vinitrd_start )
-        count -= PAGE_ALIGN(initrd_len);
-    order = get_order_from_bytes(count);
-    if ( (1UL << order) + PFN_UP(initrd_len) > nr_pages )
-        panic("Domain 0 allocation is too small for kernel image");
-
-    if ( parms.p2m_base != UNSET_ADDR )
-    {
-        vphysmap_start = parms.p2m_base;
-        vphysmap_end   = vphysmap_start + nr_pages * sizeof(unsigned long);
-    }
-    page = alloc_domheap_pages(d, order, 0);
-    if ( page == NULL )
-        panic("Not enough RAM for domain 0 allocation");
-    alloc_spfn = page_to_mfn(page);
-    alloc_epfn = alloc_spfn + d->tot_pages;
-
-    if ( initrd_len )
-    {
-        initrd_pfn = vinitrd_start ?
-                     (vinitrd_start - v_start) >> PAGE_SHIFT :
-                     d->tot_pages;
-        initrd_mfn = mfn = initrd->mod_start;
-        count = PFN_UP(initrd_len);
-        if ( d->arch.physaddr_bitsize &&
-             ((mfn + count - 1) >> (d->arch.physaddr_bitsize - PAGE_SHIFT)) )
-        {
-            order = get_order_from_pages(count);
-            page = alloc_domheap_pages(d, order, 0);
-            if ( !page )
-                panic("Not enough RAM for domain 0 initrd");
-            for ( count = -count; order--; )
-                if ( count & (1UL << order) )
-                {
-                    free_domheap_pages(page, order);
-                    page += 1UL << order;
-                }
-            memcpy(page_to_virt(page), mfn_to_virt(initrd->mod_start),
-                   initrd_len);
-            mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
-            init_domheap_pages(mpt_alloc,
-                               mpt_alloc + PAGE_ALIGN(initrd_len));
-            initrd->mod_start = initrd_mfn = page_to_mfn(page);
-        }
-        else
-        {
-            while ( count-- )
-                if ( assign_pages(d, mfn_to_page(mfn++), 0, 0) )
-                    BUG();
-        }
-        initrd->mod_end = 0;
-    }
-
-    printk("PHYSICAL MEMORY ARRANGEMENT:\n"
-           " Dom0 alloc.:   %"PRIpaddr"->%"PRIpaddr,
-           pfn_to_paddr(alloc_spfn), pfn_to_paddr(alloc_epfn));
-    if ( d->tot_pages < nr_pages )
-        printk(" (%lu pages to be allocated)",
-               nr_pages - d->tot_pages);
-    if ( initrd )
-    {
-        mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
-        printk("\n Init. ramdisk: %"PRIpaddr"->%"PRIpaddr,
-               mpt_alloc, mpt_alloc + initrd_len);
-    }
-    printk("\nVIRTUAL MEMORY ARRANGEMENT:\n"
-           " Loaded kernel: %p->%p\n"
-           " Init. ramdisk: %p->%p\n"
-           " Phys-Mach map: %p->%p\n"
-           " Start info:    %p->%p\n"
-           " Page tables:   %p->%p\n"
-           " Boot stack:    %p->%p\n"
-           " TOTAL:         %p->%p\n",
-           _p(vkern_start), _p(vkern_end),
-           _p(vinitrd_start), _p(vinitrd_end),
-           _p(vphysmap_start), _p(vphysmap_end),
-           _p(vstartinfo_start), _p(vstartinfo_end),
-           _p(vpt_start), _p(vpt_end),
-           _p(vstack_start), _p(vstack_end),
-           _p(v_start), _p(v_end));
-    printk(" ENTRY ADDRESS: %p\n", _p(parms.virt_entry));
-
-    process_pending_softirqs();
-
-    mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
-    if ( vinitrd_start )
-        mpt_alloc -= PAGE_ALIGN(initrd_len);
-
-    /* Overlap with Xen protected area? */
-    if ( !is_pv_32bit_domain(d) ?
-         ((v_start < HYPERVISOR_VIRT_END) &&
-          (v_end > HYPERVISOR_VIRT_START)) :
-         (v_end > HYPERVISOR_COMPAT_VIRT_START(d)) )
-    {
-        printk("DOM0 image overlaps with Xen private area.\n");
-        rc = -EINVAL;
-        goto out;
-    }
-
-    if ( is_pv_32bit_domain(d) )
-    {
-        v->arch.pv_vcpu.failsafe_callback_cs = FLAT_COMPAT_KERNEL_CS;
-        v->arch.pv_vcpu.event_callback_cs    = FLAT_COMPAT_KERNEL_CS;
-    }
-
-    /* WARNING: The new domain must have its 'processor' field filled in! */
-    if ( !is_pv_32bit_domain(d) )
-    {
-        maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
-        l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-    }
-    else
-    {
-        page = alloc_domheap_page(d, MEMF_no_owner);
-        if ( !page )
-            panic("Not enough RAM for domain 0 PML4");
-        page->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
-        l4start = l4tab = page_to_virt(page);
-        maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
-        l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-    }
-    clear_page(l4tab);
-    init_guest_l4_table(l4tab, d, 0);
-    v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
-    if ( is_pv_32bit_domain(d) )
-        v->arch.guest_table_user = v->arch.guest_table;
-
-    l4tab += l4_table_offset(v_start);
-    pfn = alloc_spfn;
-    for ( count = 0; count < ((v_end-v_start)>>PAGE_SHIFT); count++ )
-    {
-        if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
-        {
-            maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
-            l1start = l1tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-            clear_page(l1tab);
-            if ( count == 0 )
-                l1tab += l1_table_offset(v_start);
-            if ( !((unsigned long)l2tab & (PAGE_SIZE-1)) )
-            {
-                maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
-                l2start = l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-                clear_page(l2tab);
-                if ( count == 0 )
-                    l2tab += l2_table_offset(v_start);
-                if ( !((unsigned long)l3tab & (PAGE_SIZE-1)) )
-                {
-                    if ( count || !l3start )
-                    {
-                        maddr_to_page(mpt_alloc)->u.inuse.type_info =
-                            PGT_l3_page_table;
-                        l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-                    }
-                    l3tab = l3start;
-                    clear_page(l3tab);
-                    if ( count == 0 )
-                        l3tab += l3_table_offset(v_start);
-                    *l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
-                    l4tab++;
-                }
-                *l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
-                l3tab++;
-            }
-            *l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
-            l2tab++;
-        }
-        if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
-            mfn = pfn++;
-        else
-            mfn = initrd_mfn++;
-        *l1tab = l1e_from_pfn(mfn, (!is_pv_32bit_domain(d) ?
-                                    L1_PROT : COMPAT_L1_PROT));
-        l1tab++;
-
-        if ( !paging_mode_translate(d) )
-        {
-            page = mfn_to_page(mfn);
-            if ( !page->u.inuse.type_info &&
-                 !get_page_and_type(page, d, PGT_writable_page) )
-                BUG();
-        }
-    }
-
-    if ( is_pv_32bit_domain(d) )
-    {
-        /* Ensure the first four L3 entries are all populated. */
-        for ( i = 0, l3tab = l3start; i < 4; ++i, ++l3tab )
-        {
-            if ( !l3e_get_intpte(*l3tab) )
-            {
-                maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
-                l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-                clear_page(l2tab);
-                *l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
-            }
-            if ( i == 3 )
-                l3e_get_page(*l3tab)->u.inuse.type_info |= PGT_pae_xen_l2;
-        }
-        /* Install read-only guest visible MPT mapping. */
-        l2tab = l3e_to_l2e(l3start[3]);
-        memcpy(&l2tab[COMPAT_L2_PAGETABLE_FIRST_XEN_SLOT(d)],
-               &compat_idle_pg_table_l2[l2_table_offset(HIRO_COMPAT_MPT_VIRT_START)],
-               COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
-    }
-
-    /* Pages that are part of page tables must be read only. */
-    if  ( is_pv_domain(d) )
-        mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
-
-    /* Mask all upcalls... */
-    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
-        shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
-
-    printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
-
-    cpu = v->processor;
-    for ( i = 1; i < d->max_vcpus; i++ )
-    {
-        cpu = cpumask_cycle(cpu, &dom0_cpus);
-        setup_dom0_vcpu(d, i, cpu);
-    }
-
-    d->arch.paging.mode = 0;
-
-    /* Set up CR3 value for write_ptbase */
-    if ( paging_mode_enabled(d) )
-        paging_update_paging_modes(v);
-    else
-        update_cr3(v);
-
-    /* We run on dom0's page tables for the final part of the build process. */
-    write_ptbase(v);
-    mapcache_override_current(v);
-
-    /* Copy the OS image and free temporary buffer. */
-    elf.dest_base = (void*)vkern_start;
-    elf.dest_size = vkern_end - vkern_start;
-    elf_set_vcpu(&elf, v);
-    rc = elf_load_binary(&elf);
-    if ( rc < 0 )
-    {
-        printk("Failed to load the kernel binary\n");
-        goto out;
-    }
-    bootstrap_map(NULL);
-
-    if ( UNSET_ADDR != parms.virt_hypercall )
-    {
-        if ( (parms.virt_hypercall < v_start) ||
-             (parms.virt_hypercall >= v_end) )
-        {
-            mapcache_override_current(NULL);
-            write_ptbase(current);
-            printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
-            rc = -1;
-            goto out;
-        }
-        hypercall_page_initialise(
-            d, (void *)(unsigned long)parms.virt_hypercall);
-    }
-
-    /* Free temporary buffers. */
-    discard_initial_images();
-
-    /* Set up start info area. */
-    si = (start_info_t *)vstartinfo_start;
-    clear_page(si);
-    si->nr_pages = nr_pages;
-
-    si->shared_info = virt_to_maddr(d->shared_info);
-
-    si->flags        = SIF_PRIVILEGED | SIF_INITDOMAIN;
-    if ( !vinitrd_start && initrd_len )
-        si->flags   |= SIF_MOD_START_PFN;
-    si->flags       |= (xen_processor_pmbits << 8) & SIF_PM_MASK;
-    si->pt_base      = vpt_start;
-    si->nr_pt_frames = nr_pt_pages;
-    si->mfn_list     = vphysmap_start;
-    snprintf(si->magic, sizeof(si->magic), "xen-3.0-x86_%d%s",
-             elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
-
-    count = d->tot_pages;
-
-    /* Set up the phys->machine table if not part of the initial mapping. */
-    if ( is_pv_domain(d) && parms.p2m_base != UNSET_ADDR )
-    {
-        pfn = pagetable_get_pfn(v->arch.guest_table);
-        setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
-    }
-
-    /* Write the phys->machine and machine->phys table entries. */
-    for ( pfn = 0; pfn < count; pfn++ )
-    {
-        mfn = pfn + alloc_spfn;
-        if ( pfn >= initrd_pfn )
-        {
-            if ( pfn < initrd_pfn + PFN_UP(initrd_len) )
-                mfn = initrd->mod_start + (pfn - initrd_pfn);
-            else
-                mfn -= PFN_UP(initrd_len);
-        }
-#ifndef NDEBUG
-#define REVERSE_START ((v_end - v_start) >> PAGE_SHIFT)
-        if ( pfn > REVERSE_START && (vinitrd_start || pfn < initrd_pfn) )
-            mfn = alloc_epfn - (pfn - REVERSE_START);
-#endif
-        dom0_update_physmap(d, pfn, mfn, vphysmap_start);
-        if (!(pfn & 0xfffff))
-            process_pending_softirqs();
-    }
-    si->first_p2m_pfn = pfn;
-    si->nr_p2m_frames = d->tot_pages - count;
-    page_list_for_each ( page, &d->page_list )
-    {
-        mfn = page_to_mfn(page);
-        BUG_ON(SHARED_M2P(get_gpfn_from_mfn(mfn)));
-        if ( get_gpfn_from_mfn(mfn) >= count )
-        {
-            BUG_ON(is_pv_32bit_domain(d));
-            if ( !paging_mode_translate(d) && !page->u.inuse.type_info &&
-                 !get_page_and_type(page, d, PGT_writable_page) )
-                BUG();
-
-            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
-            ++pfn;
-            if (!(pfn & 0xfffff))
-                process_pending_softirqs();
-        }
-    }
-    BUG_ON(pfn != d->tot_pages);
-#ifndef NDEBUG
-    alloc_epfn += PFN_UP(initrd_len) + si->nr_p2m_frames;
-#endif
-    while ( pfn < nr_pages )
-    {
-        if ( (page = alloc_chunk(d, nr_pages - d->tot_pages)) == NULL )
-            panic("Not enough RAM for DOM0 reservation");
-        while ( pfn < d->tot_pages )
-        {
-            mfn = page_to_mfn(page);
-#ifndef NDEBUG
-#define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
-#endif
-            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
-#undef pfn
-            page++; pfn++;
-            if (!(pfn & 0xfffff))
-                process_pending_softirqs();
-        }
-    }
-
-    if ( initrd_len != 0 )
-    {
-        si->mod_start = vinitrd_start ?: initrd_pfn;
-        si->mod_len   = initrd_len;
-    }
-
-    memset(si->cmd_line, 0, sizeof(si->cmd_line));
-    if ( cmdline != NULL )
-        strlcpy((char *)si->cmd_line, cmdline, sizeof(si->cmd_line));
-
-    if ( fill_console_start_info((void *)(si + 1)) )
-    {
-        si->console.dom0.info_off  = sizeof(struct start_info);
-        si->console.dom0.info_size = sizeof(struct dom0_vga_console_info);
-    }
-
-    if ( is_pv_32bit_domain(d) )
-        xlat_start_info(si, XLAT_start_info_console_dom0);
-
-    /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    write_ptbase(current);
-
-    update_domain_wallclock_time(d);
-
-    v->is_initialised = 1;
-    clear_bit(_VPF_down, &v->pause_flags);
-
-    /*
-     * Initial register values:
-     *  DS,ES,FS,GS = FLAT_KERNEL_DS
-     *       CS:rIP = FLAT_KERNEL_CS:start_pc
-     *       SS:rSP = FLAT_KERNEL_SS:start_stack
-     *          rSI = start_info
-     *  [rAX,rBX,rCX,rDX,rDI,rBP,R8-R15 are zero]
-     */
-    regs = &v->arch.user_regs;
-    regs->ds = regs->es = regs->fs = regs->gs =
-        !is_pv_32bit_domain(d) ? FLAT_KERNEL_DS : FLAT_COMPAT_KERNEL_DS;
-    regs->ss = (!is_pv_32bit_domain(d) ?
-                FLAT_KERNEL_SS : FLAT_COMPAT_KERNEL_SS);
-    regs->cs = (!is_pv_32bit_domain(d) ?
-                FLAT_KERNEL_CS : FLAT_COMPAT_KERNEL_CS);
-    regs->rip = parms.virt_entry;
-    regs->rsp = vstack_end;
-    regs->rsi = vstartinfo_start;
-    regs->eflags = X86_EFLAGS_IF;
-
-#ifdef CONFIG_SHADOW_PAGING
-    if ( opt_dom0_shadow && paging_enable(d, PG_SH_enable) == 0 )
-        paging_update_paging_modes(v);
-#endif
-
-    /*
-     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
-     * different meaning.
-     */
-    if ( test_bit(XENFEAT_supervisor_mode_kernel, parms.f_required) )
-        panic("Dom0 requires supervisor-mode execution");
-
-    rc = setup_permissions(d);
-    BUG_ON(rc != 0);
-
-    if ( elf_check_broken(&elf) )
-        printk(" Xen warning: dom0 kernel broken ELF: %s\n",
-               elf_check_broken(&elf));
-
-    if ( d->domain_id == hardware_domid )
-        iommu_hwdom_init(d);
-
-    return 0;
-
-out:
-    if ( elf_check_broken(&elf) )
-        printk(" Xen dom0 kernel broken ELF: %s\n",
-               elf_check_broken(&elf));
-
-    return rc;
-}
 
 static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
                                        unsigned long nr_pages, const bool map)
@@ -1411,13 +534,13 @@ static int __init pvh_populate_memory_range(struct domain *d,
         unsigned int range_order = get_order_from_pages(nr_pages + 1);
 
         order = min(range_order ? range_order - 1 : 0, order);
-        page = alloc_domheap_pages(d, order, memflags);
+        page = alloc_domheap_pages(d, order, dom0_memflags);
         if ( page == NULL )
         {
-            if ( order == 0 && memflags )
+            if ( order == 0 && dom0_memflags )
             {
-                /* Try again without any memflags. */
-                memflags = 0;
+                /* Try again without any dom0_memflags. */
+                dom0_memflags = 0;
                 order = MAX_ORDER;
                 continue;
             }
@@ -1691,7 +814,7 @@ static int __init pvh_setup_p2m(struct domain *d)
     bool preempted;
 #define MB1_PAGES PFN_DOWN(MB(1))
 
-    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
+    nr_pages = dom0_compute_nr_pages(d, NULL, 0);
 
     pvh_setup_e820(d, nr_pages);
     do {
@@ -1903,7 +1026,7 @@ static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
     for ( i = 1; i < d->max_vcpus; i++ )
     {
         cpu = cpumask_cycle(cpu, &dom0_cpus);
-        setup_dom0_vcpu(d, i, cpu);
+        dom0_setup_vcpu(d, i, cpu);
     }
 
     rc = arch_set_info_hvm_guest(v, &cpu_ctx);
@@ -1913,7 +1036,7 @@ static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
         return rc;
     }
 
-    rc = setup_permissions(d);
+    rc = dom0_setup_permissions(d);
     if ( rc )
     {
         panic("Unable to setup Dom0 permissions: %d\n", rc);
@@ -2423,7 +1546,7 @@ int __init construct_dom0(struct domain *d, const module_t *image,
 
     process_pending_softirqs();
 
-    return (is_hvm_domain(d) ? construct_dom0_pvh : construct_dom0_pv)
+    return (is_hvm_domain(d) ? construct_dom0_pvh : dom0_construct_pv)
            (d, image, image_headroom, initrd,bootstrap_map, cmdline);
 }
 
diff --git a/xen/arch/x86/pv/Makefile b/xen/arch/x86/pv/Makefile
index de21937129..ea94599438 100644
--- a/xen/arch/x86/pv/Makefile
+++ b/xen/arch/x86/pv/Makefile
@@ -1 +1,2 @@
 obj-y += hypercall.o
+obj-bin-y += dom0_build.init.o
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
new file mode 100644
index 0000000000..bb82c33a4c
--- /dev/null
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -0,0 +1,910 @@
+/******************************************************************************
+ * pv/dom0_build.c
+ *
+ * Copyright (c) 2002-2005, K A Fraser
+ */
+
+#include <xen/console.h>
+#include <xen/domain.h>
+#include <xen/domain_page.h>
+#include <xen/init.h>
+#include <xen/libelf.h>
+#include <xen/multiboot.h>
+#include <xen/paging.h>
+#include <xen/pfn.h>
+#include <xen/sched.h>
+#include <xen/softirq.h>
+
+#include <asm/bzimage.h>
+#include <asm/dom0_build.h>
+#include <asm/page.h>
+#include <asm/setup.h>
+
+/* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
+#define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
+#define L1_PROT (BASE_PROT|_PAGE_GUEST_KERNEL)
+/* ... except for compatibility mode guests. */
+#define COMPAT_L1_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED)
+#define L2_PROT (BASE_PROT|_PAGE_DIRTY)
+#define L3_PROT (BASE_PROT|_PAGE_DIRTY)
+#define L4_PROT (BASE_PROT|_PAGE_DIRTY)
+
+static __init void dom0_update_physmap(struct domain *d, unsigned long pfn,
+                                   unsigned long mfn, unsigned long vphysmap_s)
+{
+    if ( !is_pv_32bit_domain(d) )
+        ((unsigned long *)vphysmap_s)[pfn] = mfn;
+    else
+        ((unsigned int *)vphysmap_s)[pfn] = mfn;
+
+    set_gpfn_from_mfn(mfn, pfn);
+}
+
+static __init void mark_pv_pt_pages_rdonly(struct domain *d,
+                                           l4_pgentry_t *l4start,
+                                           unsigned long vpt_start,
+                                           unsigned long nr_pt_pages)
+{
+    unsigned long count;
+    struct page_info *page;
+    l4_pgentry_t *pl4e;
+    l3_pgentry_t *pl3e;
+    l2_pgentry_t *pl2e;
+    l1_pgentry_t *pl1e;
+
+    pl4e = l4start + l4_table_offset(vpt_start);
+    pl3e = l4e_to_l3e(*pl4e);
+    pl3e += l3_table_offset(vpt_start);
+    pl2e = l3e_to_l2e(*pl3e);
+    pl2e += l2_table_offset(vpt_start);
+    pl1e = l2e_to_l1e(*pl2e);
+    pl1e += l1_table_offset(vpt_start);
+    for ( count = 0; count < nr_pt_pages; count++ )
+    {
+        l1e_remove_flags(*pl1e, _PAGE_RW);
+        page = mfn_to_page(l1e_get_pfn(*pl1e));
+
+        /* Read-only mapping + PGC_allocated + page-table page. */
+        page->count_info         = PGC_allocated | 3;
+        page->u.inuse.type_info |= PGT_validated | 1;
+
+        /* Top-level p.t. is pinned. */
+        if ( (page->u.inuse.type_info & PGT_type_mask) ==
+             (!is_pv_32bit_domain(d) ?
+              PGT_l4_page_table : PGT_l3_page_table) )
+        {
+            page->count_info        += 1;
+            page->u.inuse.type_info += 1 | PGT_pinned;
+        }
+
+        /* Iterate. */
+        if ( !((unsigned long)++pl1e & (PAGE_SIZE - 1)) )
+        {
+            if ( !((unsigned long)++pl2e & (PAGE_SIZE - 1)) )
+            {
+                if ( !((unsigned long)++pl3e & (PAGE_SIZE - 1)) )
+                    pl3e = l4e_to_l3e(*++pl4e);
+                pl2e = l3e_to_l2e(*pl3e);
+            }
+            pl1e = l2e_to_l1e(*pl2e);
+        }
+    }
+}
+
+static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
+                                    unsigned long v_start, unsigned long v_end,
+                                    unsigned long vphysmap_start,
+                                    unsigned long vphysmap_end,
+                                    unsigned long nr_pages)
+{
+    struct page_info *page = NULL;
+    l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
+    l3_pgentry_t *pl3e = NULL;
+    l2_pgentry_t *pl2e = NULL;
+    l1_pgentry_t *pl1e = NULL;
+
+    if ( v_start <= vphysmap_end && vphysmap_start <= v_end )
+        panic("DOM0 P->M table overlaps initial mapping");
+
+    while ( vphysmap_start < vphysmap_end )
+    {
+        if ( d->tot_pages + ((round_pgup(vphysmap_end) - vphysmap_start)
+                             >> PAGE_SHIFT) + 3 > nr_pages )
+            panic("Dom0 allocation too small for initial P->M table");
+
+        if ( pl1e )
+        {
+            unmap_domain_page(pl1e);
+            pl1e = NULL;
+        }
+        if ( pl2e )
+        {
+            unmap_domain_page(pl2e);
+            pl2e = NULL;
+        }
+        if ( pl3e )
+        {
+            unmap_domain_page(pl3e);
+            pl3e = NULL;
+        }
+        pl4e = l4start + l4_table_offset(vphysmap_start);
+        if ( !l4e_get_intpte(*pl4e) )
+        {
+            page = alloc_domheap_page(d, 0);
+            if ( !page )
+                break;
+
+            /* No mapping, PGC_allocated + page-table page. */
+            page->count_info = PGC_allocated | 2;
+            page->u.inuse.type_info = PGT_l3_page_table | PGT_validated | 1;
+            pl3e = __map_domain_page(page);
+            clear_page(pl3e);
+            *pl4e = l4e_from_page(page, L4_PROT);
+        } else
+            pl3e = map_domain_page(_mfn(l4e_get_pfn(*pl4e)));
+
+        pl3e += l3_table_offset(vphysmap_start);
+        if ( !l3e_get_intpte(*pl3e) )
+        {
+            if ( cpu_has_page1gb &&
+                 !(vphysmap_start & ((1UL << L3_PAGETABLE_SHIFT) - 1)) &&
+                 vphysmap_end >= vphysmap_start + (1UL << L3_PAGETABLE_SHIFT) &&
+                 (page = alloc_domheap_pages(d,
+                                             L3_PAGETABLE_SHIFT - PAGE_SHIFT,
+                                             0)) != NULL )
+            {
+                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
+                continue;
+            }
+            if ( (page = alloc_domheap_page(d, 0)) == NULL )
+                break;
+
+            /* No mapping, PGC_allocated + page-table page. */
+            page->count_info = PGC_allocated | 2;
+            page->u.inuse.type_info = PGT_l2_page_table | PGT_validated | 1;
+            pl2e = __map_domain_page(page);
+            clear_page(pl2e);
+            *pl3e = l3e_from_page(page, L3_PROT);
+        }
+        else
+            pl2e = map_domain_page(_mfn(l3e_get_pfn(*pl3e)));
+
+        pl2e += l2_table_offset(vphysmap_start);
+        if ( !l2e_get_intpte(*pl2e) )
+        {
+            if ( !(vphysmap_start & ((1UL << L2_PAGETABLE_SHIFT) - 1)) &&
+                 vphysmap_end >= vphysmap_start + (1UL << L2_PAGETABLE_SHIFT) &&
+                 (page = alloc_domheap_pages(d,
+                                             L2_PAGETABLE_SHIFT - PAGE_SHIFT,
+                                             0)) != NULL )
+            {
+                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                if ( opt_allow_superpage )
+                    get_superpage(page_to_mfn(page), d);
+                vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
+                continue;
+            }
+            if ( (page = alloc_domheap_page(d, 0)) == NULL )
+                break;
+
+            /* No mapping, PGC_allocated + page-table page. */
+            page->count_info = PGC_allocated | 2;
+            page->u.inuse.type_info = PGT_l1_page_table | PGT_validated | 1;
+            pl1e = __map_domain_page(page);
+            clear_page(pl1e);
+            *pl2e = l2e_from_page(page, L2_PROT);
+        }
+        else
+            pl1e = map_domain_page(_mfn(l2e_get_pfn(*pl2e)));
+
+        pl1e += l1_table_offset(vphysmap_start);
+        BUG_ON(l1e_get_intpte(*pl1e));
+        page = alloc_domheap_page(d, 0);
+        if ( !page )
+            break;
+
+        *pl1e = l1e_from_page(page, L1_PROT|_PAGE_DIRTY);
+        vphysmap_start += PAGE_SIZE;
+        vphysmap_start &= PAGE_MASK;
+    }
+    if ( !page )
+        panic("Not enough RAM for DOM0 P->M table");
+
+    if ( pl1e )
+        unmap_domain_page(pl1e);
+    if ( pl2e )
+        unmap_domain_page(pl2e);
+    if ( pl3e )
+        unmap_domain_page(pl3e);
+
+    unmap_domain_page(l4start);
+}
+
+static struct page_info * __init alloc_chunk(
+    struct domain *d, unsigned long max_pages)
+{
+    static unsigned int __initdata last_order = MAX_ORDER;
+    struct page_info *page;
+    unsigned int order = get_order_from_pages(max_pages), free_order;
+
+    if ( order > last_order )
+        order = last_order;
+    else if ( max_pages & (max_pages - 1) )
+        --order;
+    while ( (page = alloc_domheap_pages(d, order, dom0_memflags)) == NULL )
+        if ( order-- == 0 )
+            break;
+    if ( page )
+        last_order = order;
+    else if ( dom0_memflags )
+    {
+        /*
+         * Allocate up to 2MB at a time: It prevents allocating very large
+         * chunks from DMA pools before the >4GB pool is fully depleted.
+         */
+        last_order = 21 - PAGE_SHIFT;
+        dom0_memflags = 0;
+        return alloc_chunk(d, max_pages);
+    }
+
+    /*
+     * Make a reasonable attempt at finding a smaller chunk at a higher
+     * address, to avoid allocating from low memory as much as possible.
+     */
+    for ( free_order = order; !dom0_memflags && page && order--; )
+    {
+        struct page_info *pg2;
+
+        if ( d->tot_pages + (1 << order) > d->max_pages )
+            continue;
+        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
+        if ( pg2 > page )
+        {
+            free_domheap_pages(page, free_order);
+            page = pg2;
+            free_order = order;
+        }
+        else if ( pg2 )
+            free_domheap_pages(pg2, order);
+    }
+    return page;
+}
+
+int __init dom0_construct_pv(
+    struct domain *d,
+    const module_t *image, unsigned long image_headroom,
+    module_t *initrd,
+    void *(*bootstrap_map)(const module_t *),
+    char *cmdline)
+{
+    int i, cpu, rc, compatible, compat32, order, machine;
+    struct cpu_user_regs *regs;
+    unsigned long pfn, mfn;
+    unsigned long nr_pages;
+    unsigned long nr_pt_pages;
+    unsigned long alloc_spfn;
+    unsigned long alloc_epfn;
+    unsigned long initrd_pfn = -1, initrd_mfn = 0;
+    unsigned long count;
+    struct page_info *page = NULL;
+    start_info_t *si;
+    struct vcpu *v = d->vcpu[0];
+    unsigned long long value;
+    void *image_base = bootstrap_map(image);
+    unsigned long image_len = image->mod_end;
+    void *image_start = image_base + image_headroom;
+    unsigned long initrd_len = initrd ? initrd->mod_end : 0;
+    l4_pgentry_t *l4tab = NULL, *l4start = NULL;
+    l3_pgentry_t *l3tab = NULL, *l3start = NULL;
+    l2_pgentry_t *l2tab = NULL, *l2start = NULL;
+    l1_pgentry_t *l1tab = NULL, *l1start = NULL;
+
+    /*
+     * This fully describes the memory layout of the initial domain. All
+     * *_start address are page-aligned, except v_start (and v_end) which are
+     * superpage-aligned.
+     */
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    unsigned long vkern_start;
+    unsigned long vkern_end;
+    unsigned long vinitrd_start;
+    unsigned long vinitrd_end;
+    unsigned long vphysmap_start;
+    unsigned long vphysmap_end;
+    unsigned long vstartinfo_start;
+    unsigned long vstartinfo_end;
+    unsigned long vstack_start;
+    unsigned long vstack_end;
+    unsigned long vpt_start;
+    unsigned long vpt_end;
+    unsigned long v_start;
+    unsigned long v_end;
+
+    /* Machine address of next candidate page-table page. */
+    paddr_t mpt_alloc;
+
+    printk("*** LOADING DOMAIN 0 ***\n");
+
+    d->max_pages = ~0U;
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+        return rc;
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+        return rc;
+#ifdef CONFIG_VERBOSE_DEBUG
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+        goto out;
+
+    /* compatibility check */
+    compatible = 0;
+    compat32   = 0;
+    machine = elf_uval(&elf, elf.ehdr, e_machine);
+    printk(" Xen  kernel: 64-bit, lsb, compat32\n");
+    if (elf_32bit(&elf) && parms.pae == XEN_PAE_BIMODAL)
+        parms.pae = XEN_PAE_EXTCR3;
+    if (elf_32bit(&elf) && parms.pae && machine == EM_386)
+    {
+        compat32 = 1;
+        compatible = 1;
+    }
+    if (elf_64bit(&elf) && machine == EM_X86_64)
+        compatible = 1;
+    printk(" Dom0 kernel: %s%s, %s, paddr %#" PRIx64 " -> %#" PRIx64 "\n",
+           elf_64bit(&elf) ? "64-bit" : "32-bit",
+           parms.pae       ? ", PAE"  : "",
+           elf_msb(&elf)   ? "msb"    : "lsb",
+           elf.pstart, elf.pend);
+    if ( elf.bsd_symtab_pstart )
+        printk(" Dom0 symbol map %#" PRIx64 " -> %#" PRIx64 "\n",
+               elf.bsd_symtab_pstart, elf.bsd_symtab_pend);
+
+    if ( !compatible )
+    {
+        printk("Mismatch between Xen and DOM0 kernel\n");
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if ( parms.elf_notes[XEN_ELFNOTE_SUPPORTED_FEATURES].type != XEN_ENT_NONE )
+    {
+        if ( !test_bit(XENFEAT_dom0, parms.f_supported) )
+        {
+            printk("Kernel does not support Dom0 operation\n");
+            rc = -EINVAL;
+            goto out;
+        }
+    }
+
+    if ( compat32 )
+    {
+        d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 1;
+        v->vcpu_info = (void *)&d->shared_info->compat.vcpu_info[0];
+        if ( setup_compat_arg_xlat(v) != 0 )
+            BUG();
+    }
+
+    nr_pages = dom0_compute_nr_pages(d, &parms, initrd_len);
+
+    if ( parms.pae == XEN_PAE_EXTCR3 )
+            set_bit(VMASST_TYPE_pae_extended_cr3, &d->vm_assist);
+
+    if ( (parms.virt_hv_start_low != UNSET_ADDR) && elf_32bit(&elf) )
+    {
+        unsigned long mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
+        value = (parms.virt_hv_start_low + mask) & ~mask;
+        BUG_ON(!is_pv_32bit_domain(d));
+        if ( value > __HYPERVISOR_COMPAT_VIRT_START )
+            panic("Domain 0 expects too high a hypervisor start address");
+        HYPERVISOR_COMPAT_VIRT_START(d) =
+            max_t(unsigned int, m2p_compat_vstart, value);
+    }
+
+    if ( (parms.p2m_base != UNSET_ADDR) && elf_32bit(&elf) )
+    {
+        printk(XENLOG_WARNING "P2M table base ignored\n");
+        parms.p2m_base = UNSET_ADDR;
+    }
+
+    domain_set_alloc_bitsize(d);
+
+    /*
+     * Why do we need this? The number of page-table frames depends on the
+     * size of the bootstrap address space. But the size of the address space
+     * depends on the number of page-table frames (since each one is mapped
+     * read-only). We have a pair of simultaneous equations in two unknowns,
+     * which we solve by exhaustive search.
+     */
+    v_start          = parms.virt_base;
+    vkern_start      = parms.virt_kstart;
+    vkern_end        = parms.virt_kend;
+    if ( parms.unmapped_initrd )
+    {
+        vinitrd_start  = vinitrd_end = 0;
+        vphysmap_start = round_pgup(vkern_end);
+    }
+    else
+    {
+        vinitrd_start  = round_pgup(vkern_end);
+        vinitrd_end    = vinitrd_start + initrd_len;
+        vphysmap_start = round_pgup(vinitrd_end);
+    }
+    vphysmap_end     = vphysmap_start + (nr_pages * (!is_pv_32bit_domain(d) ?
+                                                     sizeof(unsigned long) :
+                                                     sizeof(unsigned int)));
+    if ( parms.p2m_base != UNSET_ADDR )
+        vphysmap_end = vphysmap_start;
+    vstartinfo_start = round_pgup(vphysmap_end);
+    vstartinfo_end   = (vstartinfo_start +
+                        sizeof(struct start_info) +
+                        sizeof(struct dom0_vga_console_info));
+
+    vpt_start        = round_pgup(vstartinfo_end);
+    for ( nr_pt_pages = 2; ; nr_pt_pages++ )
+    {
+        vpt_end          = vpt_start + (nr_pt_pages * PAGE_SIZE);
+        vstack_start     = vpt_end;
+        vstack_end       = vstack_start + PAGE_SIZE;
+        v_end            = (vstack_end + (1UL<<22)-1) & ~((1UL<<22)-1);
+        if ( (v_end - vstack_end) < (512UL << 10) )
+            v_end += 1UL << 22; /* Add extra 4MB to get >= 512kB padding. */
+#define NR(_l,_h,_s) \
+    (((((_h) + ((1UL<<(_s))-1)) & ~((1UL<<(_s))-1)) - \
+       ((_l) & ~((1UL<<(_s))-1))) >> (_s))
+        if ( (!is_pv_32bit_domain(d) + /* # L4 */
+              NR(v_start, v_end, L4_PAGETABLE_SHIFT) + /* # L3 */
+              (!is_pv_32bit_domain(d) ?
+               NR(v_start, v_end, L3_PAGETABLE_SHIFT) : /* # L2 */
+               4) + /* # compat L2 */
+              NR(v_start, v_end, L2_PAGETABLE_SHIFT))  /* # L1 */
+             <= nr_pt_pages )
+            break;
+    }
+
+    count = v_end - v_start;
+    if ( vinitrd_start )
+        count -= PAGE_ALIGN(initrd_len);
+    order = get_order_from_bytes(count);
+    if ( (1UL << order) + PFN_UP(initrd_len) > nr_pages )
+        panic("Domain 0 allocation is too small for kernel image");
+
+    if ( parms.p2m_base != UNSET_ADDR )
+    {
+        vphysmap_start = parms.p2m_base;
+        vphysmap_end   = vphysmap_start + nr_pages * sizeof(unsigned long);
+    }
+    page = alloc_domheap_pages(d, order, 0);
+    if ( page == NULL )
+        panic("Not enough RAM for domain 0 allocation");
+    alloc_spfn = page_to_mfn(page);
+    alloc_epfn = alloc_spfn + d->tot_pages;
+
+    if ( initrd_len )
+    {
+        initrd_pfn = vinitrd_start ?
+                     (vinitrd_start - v_start) >> PAGE_SHIFT :
+                     d->tot_pages;
+        initrd_mfn = mfn = initrd->mod_start;
+        count = PFN_UP(initrd_len);
+        if ( d->arch.physaddr_bitsize &&
+             ((mfn + count - 1) >> (d->arch.physaddr_bitsize - PAGE_SHIFT)) )
+        {
+            order = get_order_from_pages(count);
+            page = alloc_domheap_pages(d, order, 0);
+            if ( !page )
+                panic("Not enough RAM for domain 0 initrd");
+            for ( count = -count; order--; )
+                if ( count & (1UL << order) )
+                {
+                    free_domheap_pages(page, order);
+                    page += 1UL << order;
+                }
+            memcpy(page_to_virt(page), mfn_to_virt(initrd->mod_start),
+                   initrd_len);
+            mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
+            init_domheap_pages(mpt_alloc,
+                               mpt_alloc + PAGE_ALIGN(initrd_len));
+            initrd->mod_start = initrd_mfn = page_to_mfn(page);
+        }
+        else
+        {
+            while ( count-- )
+                if ( assign_pages(d, mfn_to_page(mfn++), 0, 0) )
+                    BUG();
+        }
+        initrd->mod_end = 0;
+    }
+
+    printk("PHYSICAL MEMORY ARRANGEMENT:\n"
+           " Dom0 alloc.:   %"PRIpaddr"->%"PRIpaddr,
+           pfn_to_paddr(alloc_spfn), pfn_to_paddr(alloc_epfn));
+    if ( d->tot_pages < nr_pages )
+        printk(" (%lu pages to be allocated)",
+               nr_pages - d->tot_pages);
+    if ( initrd )
+    {
+        mpt_alloc = (paddr_t)initrd->mod_start << PAGE_SHIFT;
+        printk("\n Init. ramdisk: %"PRIpaddr"->%"PRIpaddr,
+               mpt_alloc, mpt_alloc + initrd_len);
+    }
+    printk("\nVIRTUAL MEMORY ARRANGEMENT:\n"
+           " Loaded kernel: %p->%p\n"
+           " Init. ramdisk: %p->%p\n"
+           " Phys-Mach map: %p->%p\n"
+           " Start info:    %p->%p\n"
+           " Page tables:   %p->%p\n"
+           " Boot stack:    %p->%p\n"
+           " TOTAL:         %p->%p\n",
+           _p(vkern_start), _p(vkern_end),
+           _p(vinitrd_start), _p(vinitrd_end),
+           _p(vphysmap_start), _p(vphysmap_end),
+           _p(vstartinfo_start), _p(vstartinfo_end),
+           _p(vpt_start), _p(vpt_end),
+           _p(vstack_start), _p(vstack_end),
+           _p(v_start), _p(v_end));
+    printk(" ENTRY ADDRESS: %p\n", _p(parms.virt_entry));
+
+    process_pending_softirqs();
+
+    mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
+    if ( vinitrd_start )
+        mpt_alloc -= PAGE_ALIGN(initrd_len);
+
+    /* Overlap with Xen protected area? */
+    if ( !is_pv_32bit_domain(d) ?
+         ((v_start < HYPERVISOR_VIRT_END) &&
+          (v_end > HYPERVISOR_VIRT_START)) :
+         (v_end > HYPERVISOR_COMPAT_VIRT_START(d)) )
+    {
+        printk("DOM0 image overlaps with Xen private area.\n");
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if ( is_pv_32bit_domain(d) )
+    {
+        v->arch.pv_vcpu.failsafe_callback_cs = FLAT_COMPAT_KERNEL_CS;
+        v->arch.pv_vcpu.event_callback_cs    = FLAT_COMPAT_KERNEL_CS;
+    }
+
+    /* WARNING: The new domain must have its 'processor' field filled in! */
+    if ( !is_pv_32bit_domain(d) )
+    {
+        maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
+        l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+    }
+    else
+    {
+        page = alloc_domheap_page(d, MEMF_no_owner);
+        if ( !page )
+            panic("Not enough RAM for domain 0 PML4");
+        page->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
+        l4start = l4tab = page_to_virt(page);
+        maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
+        l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+    }
+    clear_page(l4tab);
+    init_guest_l4_table(l4tab, d, 0);
+    v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+    if ( is_pv_32bit_domain(d) )
+        v->arch.guest_table_user = v->arch.guest_table;
+
+    l4tab += l4_table_offset(v_start);
+    pfn = alloc_spfn;
+    for ( count = 0; count < ((v_end-v_start)>>PAGE_SHIFT); count++ )
+    {
+        if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
+        {
+            maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
+            l1start = l1tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+            clear_page(l1tab);
+            if ( count == 0 )
+                l1tab += l1_table_offset(v_start);
+            if ( !((unsigned long)l2tab & (PAGE_SIZE-1)) )
+            {
+                maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
+                l2start = l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+                clear_page(l2tab);
+                if ( count == 0 )
+                    l2tab += l2_table_offset(v_start);
+                if ( !((unsigned long)l3tab & (PAGE_SIZE-1)) )
+                {
+                    if ( count || !l3start )
+                    {
+                        maddr_to_page(mpt_alloc)->u.inuse.type_info =
+                            PGT_l3_page_table;
+                        l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+                    }
+                    l3tab = l3start;
+                    clear_page(l3tab);
+                    if ( count == 0 )
+                        l3tab += l3_table_offset(v_start);
+                    *l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
+                    l4tab++;
+                }
+                *l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
+                l3tab++;
+            }
+            *l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
+            l2tab++;
+        }
+        if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
+            mfn = pfn++;
+        else
+            mfn = initrd_mfn++;
+        *l1tab = l1e_from_pfn(mfn, (!is_pv_32bit_domain(d) ?
+                                    L1_PROT : COMPAT_L1_PROT));
+        l1tab++;
+
+        if ( !paging_mode_translate(d) )
+        {
+            page = mfn_to_page(mfn);
+            if ( !page->u.inuse.type_info &&
+                 !get_page_and_type(page, d, PGT_writable_page) )
+                BUG();
+        }
+    }
+
+    if ( is_pv_32bit_domain(d) )
+    {
+        /* Ensure the first four L3 entries are all populated. */
+        for ( i = 0, l3tab = l3start; i < 4; ++i, ++l3tab )
+        {
+            if ( !l3e_get_intpte(*l3tab) )
+            {
+                maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
+                l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+                clear_page(l2tab);
+                *l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
+            }
+            if ( i == 3 )
+                l3e_get_page(*l3tab)->u.inuse.type_info |= PGT_pae_xen_l2;
+        }
+        /* Install read-only guest visible MPT mapping. */
+        l2tab = l3e_to_l2e(l3start[3]);
+        memcpy(&l2tab[COMPAT_L2_PAGETABLE_FIRST_XEN_SLOT(d)],
+               &compat_idle_pg_table_l2[l2_table_offset(HIRO_COMPAT_MPT_VIRT_START)],
+               COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
+    }
+
+    /* Pages that are part of page tables must be read only. */
+    if  ( is_pv_domain(d) )
+        mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+
+    /* Mask all upcalls... */
+    for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
+        shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
+
+    printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        dom0_setup_vcpu(d, i, cpu);
+    }
+
+    d->arch.paging.mode = 0;
+
+    /* Set up CR3 value for write_ptbase */
+    if ( paging_mode_enabled(d) )
+        paging_update_paging_modes(v);
+    else
+        update_cr3(v);
+
+    /* We run on dom0's page tables for the final part of the build process. */
+    write_ptbase(v);
+    mapcache_override_current(v);
+
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void*)vkern_start;
+    elf.dest_size = vkern_end - vkern_start;
+    elf_set_vcpu(&elf, v);
+    rc = elf_load_binary(&elf);
+    if ( rc < 0 )
+    {
+        printk("Failed to load the kernel binary\n");
+        goto out;
+    }
+    bootstrap_map(NULL);
+
+    if ( UNSET_ADDR != parms.virt_hypercall )
+    {
+        if ( (parms.virt_hypercall < v_start) ||
+             (parms.virt_hypercall >= v_end) )
+        {
+            mapcache_override_current(NULL);
+            write_ptbase(current);
+            printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
+            rc = -1;
+            goto out;
+        }
+        hypercall_page_initialise(
+            d, (void *)(unsigned long)parms.virt_hypercall);
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    /* Set up start info area. */
+    si = (start_info_t *)vstartinfo_start;
+    clear_page(si);
+    si->nr_pages = nr_pages;
+
+    si->shared_info = virt_to_maddr(d->shared_info);
+
+    si->flags        = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    if ( !vinitrd_start && initrd_len )
+        si->flags   |= SIF_MOD_START_PFN;
+    si->flags       |= (xen_processor_pmbits << 8) & SIF_PM_MASK;
+    si->pt_base      = vpt_start;
+    si->nr_pt_frames = nr_pt_pages;
+    si->mfn_list     = vphysmap_start;
+    snprintf(si->magic, sizeof(si->magic), "xen-3.0-x86_%d%s",
+             elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
+
+    count = d->tot_pages;
+
+    /* Set up the phys->machine table if not part of the initial mapping. */
+    if ( is_pv_domain(d) && parms.p2m_base != UNSET_ADDR )
+    {
+        pfn = pagetable_get_pfn(v->arch.guest_table);
+        setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
+                         nr_pages);
+    }
+
+    /* Write the phys->machine and machine->phys table entries. */
+    for ( pfn = 0; pfn < count; pfn++ )
+    {
+        mfn = pfn + alloc_spfn;
+        if ( pfn >= initrd_pfn )
+        {
+            if ( pfn < initrd_pfn + PFN_UP(initrd_len) )
+                mfn = initrd->mod_start + (pfn - initrd_pfn);
+            else
+                mfn -= PFN_UP(initrd_len);
+        }
+#ifndef NDEBUG
+#define REVERSE_START ((v_end - v_start) >> PAGE_SHIFT)
+        if ( pfn > REVERSE_START && (vinitrd_start || pfn < initrd_pfn) )
+            mfn = alloc_epfn - (pfn - REVERSE_START);
+#endif
+        dom0_update_physmap(d, pfn, mfn, vphysmap_start);
+        if (!(pfn & 0xfffff))
+            process_pending_softirqs();
+    }
+    si->first_p2m_pfn = pfn;
+    si->nr_p2m_frames = d->tot_pages - count;
+    page_list_for_each ( page, &d->page_list )
+    {
+        mfn = page_to_mfn(page);
+        BUG_ON(SHARED_M2P(get_gpfn_from_mfn(mfn)));
+        if ( get_gpfn_from_mfn(mfn) >= count )
+        {
+            BUG_ON(is_pv_32bit_domain(d));
+            if ( !paging_mode_translate(d) && !page->u.inuse.type_info &&
+                 !get_page_and_type(page, d, PGT_writable_page) )
+                BUG();
+
+            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
+            ++pfn;
+            if (!(pfn & 0xfffff))
+                process_pending_softirqs();
+        }
+    }
+    BUG_ON(pfn != d->tot_pages);
+#ifndef NDEBUG
+    alloc_epfn += PFN_UP(initrd_len) + si->nr_p2m_frames;
+#endif
+    while ( pfn < nr_pages )
+    {
+        if ( (page = alloc_chunk(d, nr_pages - d->tot_pages)) == NULL )
+            panic("Not enough RAM for DOM0 reservation");
+        while ( pfn < d->tot_pages )
+        {
+            mfn = page_to_mfn(page);
+#ifndef NDEBUG
+#define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
+#endif
+            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
+#undef pfn
+            page++; pfn++;
+            if (!(pfn & 0xfffff))
+                process_pending_softirqs();
+        }
+    }
+
+    if ( initrd_len != 0 )
+    {
+        si->mod_start = vinitrd_start ?: initrd_pfn;
+        si->mod_len   = initrd_len;
+    }
+
+    memset(si->cmd_line, 0, sizeof(si->cmd_line));
+    if ( cmdline != NULL )
+        strlcpy((char *)si->cmd_line, cmdline, sizeof(si->cmd_line));
+
+    if ( fill_console_start_info((void *)(si + 1)) )
+    {
+        si->console.dom0.info_off  = sizeof(struct start_info);
+        si->console.dom0.info_size = sizeof(struct dom0_vga_console_info);
+    }
+
+    if ( is_pv_32bit_domain(d) )
+        xlat_start_info(si, XLAT_start_info_console_dom0);
+
+    /* Return to idle domain's page tables. */
+    mapcache_override_current(NULL);
+    write_ptbase(current);
+
+    update_domain_wallclock_time(d);
+
+    v->is_initialised = 1;
+    clear_bit(_VPF_down, &v->pause_flags);
+
+    /*
+     * Initial register values:
+     *  DS,ES,FS,GS = FLAT_KERNEL_DS
+     *       CS:rIP = FLAT_KERNEL_CS:start_pc
+     *       SS:rSP = FLAT_KERNEL_SS:start_stack
+     *          rSI = start_info
+     *  [rAX,rBX,rCX,rDX,rDI,rBP,R8-R15 are zero]
+     */
+    regs = &v->arch.user_regs;
+    regs->ds = regs->es = regs->fs = regs->gs =
+        !is_pv_32bit_domain(d) ? FLAT_KERNEL_DS : FLAT_COMPAT_KERNEL_DS;
+    regs->ss = (!is_pv_32bit_domain(d) ?
+                FLAT_KERNEL_SS : FLAT_COMPAT_KERNEL_SS);
+    regs->cs = (!is_pv_32bit_domain(d) ?
+                FLAT_KERNEL_CS : FLAT_COMPAT_KERNEL_CS);
+    regs->rip = parms.virt_entry;
+    regs->rsp = vstack_end;
+    regs->rsi = vstartinfo_start;
+    regs->eflags = X86_EFLAGS_IF;
+
+#ifdef CONFIG_SHADOW_PAGING
+    if ( opt_dom0_shadow && paging_enable(d, PG_SH_enable) == 0 )
+        paging_update_paging_modes(v);
+#endif
+
+    /*
+     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
+     * different meaning.
+     */
+    if ( test_bit(XENFEAT_supervisor_mode_kernel, parms.f_required) )
+        panic("Dom0 requires supervisor-mode execution");
+
+    rc = dom0_setup_permissions(d);
+    BUG_ON(rc != 0);
+
+    if ( elf_check_broken(&elf) )
+        printk(" Xen warning: dom0 kernel broken ELF: %s\n",
+               elf_check_broken(&elf));
+
+    if ( d->domain_id == hardware_domid )
+        iommu_hwdom_init(d);
+
+    return 0;
+
+out:
+    if ( elf_check_broken(&elf) )
+        printk(" Xen dom0 kernel broken ELF: %s\n",
+               elf_check_broken(&elf));
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/dom0_build.h b/xen/include/asm-x86/dom0_build.h
new file mode 100644
index 0000000000..2ba349e445
--- /dev/null
+++ b/xen/include/asm-x86/dom0_build.h
@@ -0,0 +1,32 @@
+#ifndef _DOM0_BUILD_H_
+#define _DOM0_BUILD_H_
+
+#include <asm/setup.h>
+
+extern unsigned int dom0_memflags;
+extern cpumask_t dom0_cpus;
+
+unsigned long dom0_compute_nr_pages(
+    struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len);
+struct vcpu *dom0_setup_vcpu(struct domain *d, unsigned int vcpu_id,
+                                    unsigned int cpu);
+int dom0_setup_permissions(struct domain *d);
+
+int dom0_construct_pv(
+    struct domain *d,
+    const module_t *image, unsigned long image_headroom,
+    module_t *initrd,
+    void *(*bootstrap_map)(const module_t *),
+    char *cmdline);
+
+#endif	/* _DOM0_BUILD_H_ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.11.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c
  2017-03-16 17:54 [PATCH v2 0/4] Refactor x86 dom0 builder Wei Liu
  2017-03-16 17:54 ` [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c Wei Liu
  2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c Wei Liu
@ 2017-03-16 17:54 ` Wei Liu
  2017-03-16 18:16   ` Roger Pau Monné
  2017-03-16 17:54 ` [PATCH v2 4/4] x86: clean up header files in dom0_build.c Wei Liu
  3 siblings, 1 reply; 18+ messages in thread
From: Wei Liu @ 2017-03-16 17:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

In the long term we want to be able to disentangle PV and HVM code.
Move the PVH dom0 builder to a dedicated file.

Lift the required function declarations into dom0_build.h and rename
them where necessary.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
v2:
1. add copyright header
---
 xen/arch/x86/dom0_build.c        | 1069 +-----------------------------------
 xen/arch/x86/hvm/Makefile        |    1 +
 xen/arch/x86/hvm/dom0_build.c    | 1112 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/dom0_build.h |    9 +
 4 files changed, 1125 insertions(+), 1066 deletions(-)
 create mode 100644 xen/arch/x86/hvm/dom0_build.c

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index c74f9b52e0..772f077d5a 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -50,27 +50,6 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
-/*
- * Have the TSS cover the ISA port range, which makes it
- * - 104 bytes base structure
- * - 32 bytes interrupt redirection bitmap
- * - 128 bytes I/O bitmap
- * - one trailing byte
- * or a total of 265 bytes.
- *
- * NB: as PVHv2 Dom0 doesn't have legacy devices (ISA), it shouldn't have any
- * business in accessing the ISA port range, much less in real mode, and due to
- * the lack of firmware it shouldn't also execute any INT instruction. This is
- * done just for consistency with what hvmloader does.
- */
-#define HVM_VM86_TSS_SIZE 265
-
-static unsigned int __initdata acpi_intr_overrides;
-static struct acpi_madt_interrupt_override __initdata *intsrcovr;
-
-static unsigned int __initdata acpi_nmi_sources;
-static struct acpi_madt_nmi_source __initdata *nmisrc;
-
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -261,8 +240,8 @@ boolean_param("ro-hpet", ro_hpet);
 
 unsigned int __initdata dom0_memflags = MEMF_no_dma|MEMF_exact_node;
 
-static unsigned long __init dom0_paging_pages(const struct domain *d,
-                                              unsigned long nr_pages)
+unsigned long __init dom0_paging_pages(const struct domain *d,
+                                       unsigned long nr_pages)
 {
     /* Copied from: libxl_get_required_shadow_memory() */
     unsigned long memkb = nr_pages * (PAGE_SIZE / 1024);
@@ -492,1048 +471,6 @@ int __init dom0_setup_permissions(struct domain *d)
 }
 
 
-static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
-                                       unsigned long nr_pages, const bool map)
-{
-    int rc;
-
-    for ( ; ; )
-    {
-        rc = (map ? map_mmio_regions : unmap_mmio_regions)
-             (d, _gfn(pfn), nr_pages, _mfn(pfn));
-        if ( rc == 0 )
-            break;
-        if ( rc < 0 )
-        {
-            printk(XENLOG_WARNING
-                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
-                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
-            break;
-        }
-        nr_pages -= rc;
-        pfn += rc;
-        process_pending_softirqs();
-    }
-
-    return rc;
-}
-
-/* Populate a HVM memory range using the biggest possible order. */
-static int __init pvh_populate_memory_range(struct domain *d,
-                                            unsigned long start,
-                                            unsigned long nr_pages)
-{
-    unsigned int order, i = 0;
-    struct page_info *page;
-    int rc;
-#define MAP_MAX_ITER 64
-
-    order = MAX_ORDER;
-    while ( nr_pages != 0 )
-    {
-        unsigned int range_order = get_order_from_pages(nr_pages + 1);
-
-        order = min(range_order ? range_order - 1 : 0, order);
-        page = alloc_domheap_pages(d, order, dom0_memflags);
-        if ( page == NULL )
-        {
-            if ( order == 0 && dom0_memflags )
-            {
-                /* Try again without any dom0_memflags. */
-                dom0_memflags = 0;
-                order = MAX_ORDER;
-                continue;
-            }
-            if ( order == 0 )
-            {
-                printk("Unable to allocate memory with order 0!\n");
-                return -ENOMEM;
-            }
-            order--;
-            continue;
-        }
-
-        rc = guest_physmap_add_page(d, _gfn(start), _mfn(page_to_mfn(page)),
-                                    order);
-        if ( rc != 0 )
-        {
-            printk("Failed to populate memory: [%#lx,%lx): %d\n",
-                   start, start + (1UL << order), rc);
-            return -ENOMEM;
-        }
-        start += 1UL << order;
-        nr_pages -= 1UL << order;
-        if ( (++i % MAP_MAX_ITER) == 0 )
-            process_pending_softirqs();
-    }
-
-    return 0;
-#undef MAP_MAX_ITER
-}
-
-/* Steal RAM from the end of a memory region. */
-static int __init pvh_steal_ram(struct domain *d, unsigned long size,
-                                unsigned long align, paddr_t limit,
-                                paddr_t *addr)
-{
-    unsigned int i = d->arch.nr_e820;
-
-    /*
-     * Alignment 0 should be set to 1, so it doesn't wrap around in the
-     * calculations below.
-     */
-    align = align ? : 1;
-    while ( i-- )
-    {
-        struct e820entry *entry = &d->arch.e820[i];
-
-        if ( entry->type != E820_RAM || entry->addr + entry->size > limit ||
-             entry->addr < MB(1) )
-            continue;
-
-        *addr = (entry->addr + entry->size - size) & ~(align - 1);
-        if ( *addr < entry->addr )
-            continue;
-
-        entry->size = *addr - entry->addr;
-        return 0;
-    }
-
-    return -ENOMEM;
-}
-
-/* NB: memory map must be sorted at all times for this to work correctly. */
-static int __init pvh_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
-                                    unsigned int type)
-{
-    struct e820entry *map;
-    unsigned int i;
-
-    for ( i = 0; i < d->arch.nr_e820; i++ )
-    {
-        uint64_t rs = d->arch.e820[i].addr;
-        uint64_t re = rs + d->arch.e820[i].size;
-
-        if ( rs == e && d->arch.e820[i].type == type )
-        {
-            d->arch.e820[i].addr = s;
-            return 0;
-        }
-
-        if ( re == s && d->arch.e820[i].type == type &&
-             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
-        {
-            d->arch.e820[i].size += e - s;
-            return 0;
-        }
-
-        if ( rs >= e )
-            break;
-
-        if ( re > s )
-            return -EEXIST;
-    }
-
-    map = xzalloc_array(struct e820entry, d->arch.nr_e820 + 1);
-    if ( !map )
-    {
-        printk(XENLOG_WARNING "E820: out of memory to add region\n");
-        return -ENOMEM;
-    }
-
-    memcpy(map, d->arch.e820, i * sizeof(*d->arch.e820));
-    memcpy(map + i + 1, d->arch.e820 + i,
-           (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
-    map[i].addr = s;
-    map[i].size = e - s;
-    map[i].type = type;
-    xfree(d->arch.e820);
-    d->arch.e820 = map;
-    d->arch.nr_e820++;
-
-    return 0;
-}
-
-static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
-{
-    p2m_type_t p2mt;
-    uint32_t rc, *ident_pt;
-    mfn_t mfn;
-    paddr_t gaddr;
-    struct vcpu *v = d->vcpu[0];
-
-    /*
-     * Steal some space from the last RAM region below 4GB and use it to
-     * store the real-mode TSS. It needs to be aligned to 128 so that the
-     * TSS structure (which accounts for the first 104b) doesn't cross
-     * a page boundary.
-     */
-    if ( !pvh_steal_ram(d, HVM_VM86_TSS_SIZE, 128, GB(4), &gaddr) )
-    {
-        if ( hvm_copy_to_guest_phys(gaddr, NULL, HVM_VM86_TSS_SIZE, v) !=
-             HVMCOPY_okay )
-            printk("Unable to zero VM86 TSS area\n");
-        d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS_SIZED] =
-            VM86_TSS_UPDATED | ((uint64_t)HVM_VM86_TSS_SIZE << 32) | gaddr;
-        if ( pvh_add_mem_range(d, gaddr, gaddr + HVM_VM86_TSS_SIZE,
-                               E820_RESERVED) )
-            printk("Unable to set VM86 TSS as reserved in the memory map\n");
-    }
-    else
-        printk("Unable to allocate VM86 TSS area\n");
-
-    /* Steal some more RAM for the identity page tables. */
-    if ( pvh_steal_ram(d, PAGE_SIZE, PAGE_SIZE, GB(4), &gaddr) )
-    {
-        printk("Unable to find memory to stash the identity page tables\n");
-        return -ENOMEM;
-    }
-
-    /*
-     * Identity-map page table is required for running with CR0.PG=0
-     * when using Intel EPT. Create a 32-bit non-PAE page directory of
-     * superpages.
-     */
-    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
-                              &mfn, &p2mt, 0, &rc);
-    if ( ident_pt == NULL )
-    {
-        printk("Unable to map identity page tables\n");
-        return -ENOMEM;
-    }
-    write_32bit_pse_identmap(ident_pt);
-    unmap_domain_page(ident_pt);
-    put_page(mfn_to_page(mfn_x(mfn)));
-    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
-    if ( pvh_add_mem_range(d, gaddr, gaddr + PAGE_SIZE, E820_RESERVED) )
-            printk("Unable to set identity page tables as reserved in the memory map\n");
-
-    return 0;
-}
-
-/* Assign the low 1MB to Dom0. */
-static void __init pvh_steal_low_ram(struct domain *d, unsigned long start,
-                                     unsigned long nr_pages)
-{
-    unsigned long mfn;
-
-    ASSERT(start + nr_pages <= PFN_DOWN(MB(1)));
-
-    for ( mfn = start; mfn < start + nr_pages; mfn++ )
-    {
-        struct page_info *pg = mfn_to_page(mfn);
-        int rc;
-
-        rc = unshare_xen_page_with_guest(pg, dom_io);
-        if ( rc )
-        {
-            printk("Unable to unshare Xen mfn %#lx: %d\n", mfn, rc);
-            continue;
-        }
-
-        share_xen_page_with_guest(pg, d, XENSHARE_writable);
-        rc = guest_physmap_add_entry(d, _gfn(mfn), _mfn(mfn), 0, p2m_ram_rw);
-        if ( rc )
-            printk("Unable to add mfn %#lx to p2m: %d\n", mfn, rc);
-    }
-}
-
-static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
-{
-    struct e820entry *entry, *entry_guest;
-    unsigned int i;
-    unsigned long pages, cur_pages = 0;
-    uint64_t start, end;
-
-    /*
-     * Craft the e820 memory map for Dom0 based on the hardware e820 map.
-     */
-    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
-    if ( !d->arch.e820 )
-        panic("Unable to allocate memory for Dom0 e820 map");
-    entry_guest = d->arch.e820;
-
-    /* Clamp e820 memory map to match the memory assigned to Dom0 */
-    for ( i = 0, entry = e820.map; i < e820.nr_map; i++, entry++ )
-    {
-        if ( entry->type != E820_RAM )
-        {
-            *entry_guest = *entry;
-            goto next;
-        }
-
-        if ( nr_pages == cur_pages )
-        {
-            /*
-             * We already have all the assigned memory,
-             * skip this entry
-             */
-            continue;
-        }
-
-        /*
-         * Make sure the start and length are aligned to PAGE_SIZE, because
-         * that's the minimum granularity of the 2nd stage translation. Since
-         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
-         * order to prevent this code from getting out of sync.
-         */
-        start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
-        end = (entry->addr + entry->size) &
-              ~((PAGE_SIZE << PAGE_ORDER_4K) - 1);
-        if ( start >= end )
-            continue;
-
-        entry_guest->type = E820_RAM;
-        entry_guest->addr = start;
-        entry_guest->size = end - start;
-        pages = PFN_DOWN(entry_guest->size);
-        if ( (cur_pages + pages) > nr_pages )
-        {
-            /* Truncate region */
-            entry_guest->size = (nr_pages - cur_pages) << PAGE_SHIFT;
-            cur_pages = nr_pages;
-        }
-        else
-        {
-            cur_pages += pages;
-        }
- next:
-        d->arch.nr_e820++;
-        entry_guest++;
-    }
-    ASSERT(cur_pages == nr_pages);
-    ASSERT(d->arch.nr_e820 <= e820.nr_map);
-}
-
-static int __init pvh_setup_p2m(struct domain *d)
-{
-    struct vcpu *v = d->vcpu[0];
-    unsigned long nr_pages;
-    unsigned int i;
-    int rc;
-    bool preempted;
-#define MB1_PAGES PFN_DOWN(MB(1))
-
-    nr_pages = dom0_compute_nr_pages(d, NULL, 0);
-
-    pvh_setup_e820(d, nr_pages);
-    do {
-        preempted = false;
-        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
-                              &preempted);
-        process_pending_softirqs();
-    } while ( preempted );
-
-    /*
-     * Memory below 1MB is identity mapped.
-     * NB: this only makes sense when booted from legacy BIOS.
-     */
-    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
-    if ( rc )
-    {
-        printk("Failed to identity map low 1MB: %d\n", rc);
-        return rc;
-    }
-
-    /* Populate memory map. */
-    for ( i = 0; i < d->arch.nr_e820; i++ )
-    {
-        unsigned long addr, size;
-
-        if ( d->arch.e820[i].type != E820_RAM )
-            continue;
-
-        addr = PFN_DOWN(d->arch.e820[i].addr);
-        size = PFN_DOWN(d->arch.e820[i].size);
-
-        if ( addr >= MB1_PAGES )
-            rc = pvh_populate_memory_range(d, addr, size);
-        else
-        {
-            ASSERT(addr + size < MB1_PAGES);
-            pvh_steal_low_ram(d, addr, size);
-        }
-
-        if ( rc )
-            return rc;
-    }
-
-    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
-    {
-        /*
-         * Since Dom0 cannot be migrated, we will only setup the
-         * unrestricted guest helpers if they are needed by the current
-         * hardware we are running on.
-         */
-        rc = pvh_setup_vmx_realmode_helpers(d);
-        if ( rc )
-            return rc;
-    }
-
-    return 0;
-#undef MB1_PAGES
-}
-
-static int __init pvh_load_kernel(struct domain *d, const module_t *image,
-                                  unsigned long image_headroom,
-                                  module_t *initrd, void *image_base,
-                                  char *cmdline, paddr_t *entry,
-                                  paddr_t *start_info_addr)
-{
-    void *image_start = image_base + image_headroom;
-    unsigned long image_len = image->mod_end;
-    struct elf_binary elf;
-    struct elf_dom_parms parms;
-    paddr_t last_addr;
-    struct hvm_start_info start_info = { 0 };
-    struct hvm_modlist_entry mod = { 0 };
-    struct vcpu *v = d->vcpu[0];
-    int rc;
-
-    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
-    {
-        printk("Error trying to detect bz compressed kernel\n");
-        return rc;
-    }
-
-    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
-    {
-        printk("Unable to init ELF\n");
-        return rc;
-    }
-#ifdef VERBOSE
-    elf_set_verbose(&elf);
-#endif
-    elf_parse_binary(&elf);
-    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
-    {
-        printk("Unable to parse kernel for ELFNOTES\n");
-        return rc;
-    }
-
-    if ( parms.phys_entry == UNSET_ADDR32 )
-    {
-        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
-        return -EINVAL;
-    }
-
-    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
-           parms.guest_ver, parms.loader,
-           elf_64bit(&elf) ? "64-bit" : "32-bit");
-
-    /* Copy the OS image and free temporary buffer. */
-    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
-    elf.dest_size = parms.virt_kend - parms.virt_kstart;
-
-    elf_set_vcpu(&elf, v);
-    rc = elf_load_binary(&elf);
-    if ( rc < 0 )
-    {
-        printk("Failed to load kernel: %d\n", rc);
-        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
-        return rc;
-    }
-
-    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
-
-    if ( initrd != NULL )
-    {
-        rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
-                                    initrd->mod_end, v);
-        if ( rc )
-        {
-            printk("Unable to copy initrd to guest\n");
-            return rc;
-        }
-
-        mod.paddr = last_addr;
-        mod.size = initrd->mod_end;
-        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
-    }
-
-    /* Free temporary buffers. */
-    discard_initial_images();
-
-    if ( cmdline != NULL )
-    {
-        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1, v);
-        if ( rc )
-        {
-            printk("Unable to copy guest command line\n");
-            return rc;
-        }
-        start_info.cmdline_paddr = last_addr;
-        /*
-         * Round up to 32/64 bits (depending on the guest kernel bitness) so
-         * the modlist/start_info is aligned.
-         */
-        last_addr += ROUNDUP(strlen(cmdline) + 1, elf_64bit(&elf) ? 8 : 4);
-    }
-    if ( initrd != NULL )
-    {
-        rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod), v);
-        if ( rc )
-        {
-            printk("Unable to copy guest modules\n");
-            return rc;
-        }
-        start_info.modlist_paddr = last_addr;
-        start_info.nr_modules = 1;
-        last_addr += sizeof(mod);
-    }
-
-    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
-    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
-    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info), v);
-    if ( rc )
-    {
-        printk("Unable to copy start info to guest\n");
-        return rc;
-    }
-
-    *entry = parms.phys_entry;
-    *start_info_addr = last_addr;
-
-    return 0;
-}
-
-static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
-                                 paddr_t start_info)
-{
-    struct vcpu *v = d->vcpu[0];
-    unsigned int cpu, i;
-    int rc;
-    /* 
-     * This sets the vCPU state according to the state described in
-     * docs/misc/hvmlite.markdown.
-     */
-    vcpu_hvm_context_t cpu_ctx = {
-        .mode = VCPU_HVM_MODE_32B,
-        .cpu_regs.x86_32.ebx = start_info,
-        .cpu_regs.x86_32.eip = entry,
-        .cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET,
-        .cpu_regs.x86_32.cs_limit = ~0u,
-        .cpu_regs.x86_32.ds_limit = ~0u,
-        .cpu_regs.x86_32.ss_limit = ~0u,
-        .cpu_regs.x86_32.tr_limit = 0x67,
-        .cpu_regs.x86_32.cs_ar = 0xc9b,
-        .cpu_regs.x86_32.ds_ar = 0xc93,
-        .cpu_regs.x86_32.ss_ar = 0xc93,
-        .cpu_regs.x86_32.tr_ar = 0x8b,
-    };
-
-    cpu = v->processor;
-    for ( i = 1; i < d->max_vcpus; i++ )
-    {
-        cpu = cpumask_cycle(cpu, &dom0_cpus);
-        dom0_setup_vcpu(d, i, cpu);
-    }
-
-    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
-    if ( rc )
-    {
-        printk("Unable to setup Dom0 BSP context: %d\n", rc);
-        return rc;
-    }
-
-    rc = dom0_setup_permissions(d);
-    if ( rc )
-    {
-        panic("Unable to setup Dom0 permissions: %d\n", rc);
-        return rc;
-    }
-
-    update_domain_wallclock_time(d);
-
-    clear_bit(_VPF_down, &v->pause_flags);
-
-    return 0;
-}
-
-static int __init acpi_count_intr_ovr(struct acpi_subtable_header *header,
-                                     const unsigned long end)
-{
-
-    acpi_intr_overrides++;
-    return 0;
-}
-
-static int __init acpi_set_intr_ovr(struct acpi_subtable_header *header,
-                                    const unsigned long end)
-{
-    const struct acpi_madt_interrupt_override *intr =
-        container_of(header, struct acpi_madt_interrupt_override, header);
-
-    *intsrcovr = *intr;
-    intsrcovr++;
-
-    return 0;
-}
-
-static int __init acpi_count_nmi_src(struct acpi_subtable_header *header,
-                                     const unsigned long end)
-{
-
-    acpi_nmi_sources++;
-    return 0;
-}
-
-static int __init acpi_set_nmi_src(struct acpi_subtable_header *header,
-                                   const unsigned long end)
-{
-    const struct acpi_madt_nmi_source *src =
-        container_of(header, struct acpi_madt_nmi_source, header);
-
-    *nmisrc = *src;
-    nmisrc++;
-
-    return 0;
-}
-
-static int __init pvh_setup_acpi_madt(struct domain *d, paddr_t *addr)
-{
-    struct acpi_table_madt *madt;
-    struct acpi_table_header *table;
-    struct acpi_madt_io_apic *io_apic;
-    struct acpi_madt_local_x2apic *x2apic;
-    acpi_status status;
-    unsigned long size;
-    unsigned int i, max_vcpus;
-    int rc;
-
-    /* Count number of interrupt overrides in the MADT. */
-    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
-                          acpi_count_intr_ovr, UINT_MAX);
-
-    /* Count number of NMI sources in the MADT. */
-    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_count_nmi_src,
-                          UINT_MAX);
-
-    max_vcpus = dom0_max_vcpus();
-    /* Calculate the size of the crafted MADT. */
-    size = sizeof(*madt);
-    /*
-     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
-     * per domain. This must be fixed in order to provide the same amount of
-     * IO APICs as available on bare metal.
-     */
-    size += sizeof(*io_apic);
-    size += sizeof(*intsrcovr) * acpi_intr_overrides;
-    size += sizeof(*nmisrc) * acpi_nmi_sources;
-    size += sizeof(*x2apic) * max_vcpus;
-
-    madt = xzalloc_bytes(size);
-    if ( !madt )
-    {
-        printk("Unable to allocate memory for MADT table\n");
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* Copy the native MADT table header. */
-    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
-    if ( !ACPI_SUCCESS(status) )
-    {
-        printk("Failed to get MADT ACPI table, aborting.\n");
-        rc = -EINVAL;
-        goto out;
-    }
-    madt->header = *table;
-    madt->address = APIC_DEFAULT_PHYS_BASE;
-    /*
-     * NB: this is currently set to 4, which is the revision in the ACPI
-     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
-     * tables described in the headers.
-     */
-    madt->header.revision = min_t(unsigned char, table->revision, 4);
-
-    /*
-     * Setup the IO APIC entry.
-     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
-     * per domain. This must be fixed in order to provide the same amount of
-     * IO APICs as available on bare metal, and with the same IDs as found in
-     * the native IO APIC MADT entries.
-     */
-    if ( nr_ioapics > 1 )
-        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
-               nr_ioapics);
-    io_apic = (void *)(madt + 1);
-    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
-    io_apic->header.length = sizeof(*io_apic);
-    io_apic->id = domain_vioapic(d)->id;
-    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
-
-    x2apic = (void *)(io_apic + 1);
-    for ( i = 0; i < max_vcpus; i++ )
-    {
-        x2apic->header.type = ACPI_MADT_TYPE_LOCAL_X2APIC;
-        x2apic->header.length = sizeof(*x2apic);
-        x2apic->uid = i;
-        x2apic->local_apic_id = i * 2;
-        x2apic->lapic_flags = ACPI_MADT_ENABLED;
-        x2apic++;
-    }
-
-    /* Setup interrupt overrides. */
-    intsrcovr = (void *)x2apic;
-    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ovr,
-                          acpi_intr_overrides);
-
-    /* Setup NMI sources. */
-    nmisrc = (void *)intsrcovr;
-    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_set_nmi_src,
-                          acpi_nmi_sources);
-
-    ASSERT(((void *)nmisrc - (void *)madt) == size);
-    madt->header.length = size;
-    /*
-     * Calling acpi_tb_checksum here is a layering violation, but
-     * introducing a wrapper for such simple usage seems overkill.
-     */
-    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt), size);
-
-    /* Place the new MADT in guest memory space. */
-    if ( pvh_steal_ram(d, size, 0, GB(4), addr) )
-    {
-        printk("Unable to find allocate guest RAM for MADT\n");
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* Mark this region as E820_ACPI. */
-    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
-        printk("Unable to add MADT region to memory map\n");
-
-    rc = hvm_copy_to_guest_phys(*addr, madt, size, d->vcpu[0]);
-    if ( rc )
-    {
-        printk("Unable to copy MADT into guest memory\n");
-        goto out;
-    }
-
-    rc = 0;
-
- out:
-    xfree(madt);
-
-    return rc;
-}
-
-static bool __init acpi_memory_banned(unsigned long address,
-                                      unsigned long size)
-{
-    unsigned long mfn, nr_pages, i;
-
-    mfn = PFN_DOWN(address);
-    nr_pages = PFN_UP((address & ~PAGE_MASK) + size);
-    for ( i = 0 ; i < nr_pages; i++ )
-        if ( !page_is_ram_type(mfn + i, RAM_TYPE_RESERVED) &&
-             !page_is_ram_type(mfn + i, RAM_TYPE_ACPI) )
-            return true;
-
-    return false;
-}
-
-static bool __init pvh_acpi_table_allowed(const char *sig)
-{
-    static const char __initconst banned_tables[][ACPI_NAME_SIZE] = {
-        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
-        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
-    unsigned int i;
-
-    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
-        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
-            return false;
-
-    /* Make sure table doesn't reside in a RAM region. */
-    if ( acpi_memory_banned(acpi_gbl_root_table_list.tables[i].address,
-                            acpi_gbl_root_table_list.tables[i].length) )
-    {
-        printk("Skipping table %.4s because resides in a non-ACPI, non-reserved region\n",
-               sig);
-        return false;
-    }
-
-    return true;
-}
-
-static int __init pvh_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
-                                      paddr_t *addr)
-{
-    struct acpi_table_xsdt *xsdt;
-    struct acpi_table_header *table;
-    struct acpi_table_rsdp *rsdp;
-    unsigned long size = sizeof(*xsdt);
-    unsigned int i, j, num_tables = 0;
-    paddr_t xsdt_paddr;
-    int rc;
-
-    /*
-     * Restore original DMAR table signature, we are going to filter it from
-     * the new XSDT that is presented to the guest, so it is no longer
-     * necessary to have it's signature zapped.
-     */
-    acpi_dmar_reinstate();
-
-    /* Count the number of tables that will be added to the XSDT. */
-    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
-    {
-        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
-
-        if ( pvh_acpi_table_allowed(sig) )
-            num_tables++;
-    }
-
-    /*
-     * No need to add or subtract anything because struct acpi_table_xsdt
-     * includes one array slot already, and we have filtered out the original
-     * MADT and we are going to add a custom built MADT.
-     */
-    size += num_tables * sizeof(xsdt->table_offset_entry[0]);
-
-    xsdt = xzalloc_bytes(size);
-    if ( !xsdt )
-    {
-        printk("Unable to allocate memory for XSDT table\n");
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* Copy the native XSDT table header. */
-    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
-    if ( !rsdp )
-    {
-        printk("Unable to map RSDP\n");
-        rc = -EINVAL;
-        goto out;
-    }
-    xsdt_paddr = rsdp->xsdt_physical_address;
-    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
-    table = acpi_os_map_memory(xsdt_paddr, sizeof(*table));
-    if ( !table )
-    {
-        printk("Unable to map XSDT\n");
-        rc = -EINVAL;
-        goto out;
-    }
-    xsdt->header = *table;
-    acpi_os_unmap_memory(table, sizeof(*table));
-
-    /* Add the custom MADT. */
-    xsdt->table_offset_entry[0] = madt_addr;
-
-    /* Copy the addresses of the rest of the allowed tables. */
-    for( i = 0, j = 1; i < acpi_gbl_root_table_list.count; i++ )
-    {
-        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
-
-        if ( pvh_acpi_table_allowed(sig) )
-            xsdt->table_offset_entry[j++] =
-                                acpi_gbl_root_table_list.tables[i].address;
-    }
-
-    xsdt->header.revision = 1;
-    xsdt->header.length = size;
-    /*
-     * Calling acpi_tb_checksum here is a layering violation, but
-     * introducing a wrapper for such simple usage seems overkill.
-     */
-    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
-
-    /* Place the new XSDT in guest memory space. */
-    if ( pvh_steal_ram(d, size, 0, GB(4), addr) )
-    {
-        printk("Unable to find guest RAM for XSDT\n");
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* Mark this region as E820_ACPI. */
-    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
-        printk("Unable to add XSDT region to memory map\n");
-
-    rc = hvm_copy_to_guest_phys(*addr, xsdt, size, d->vcpu[0]);
-    if ( rc )
-    {
-        printk("Unable to copy XSDT into guest memory\n");
-        goto out;
-    }
-
-    rc = 0;
-
- out:
-    xfree(xsdt);
-
-    return rc;
-}
-
-static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
-{
-    unsigned long pfn, nr_pages;
-    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
-    unsigned int i;
-    int rc;
-    struct acpi_table_rsdp *native_rsdp, rsdp = {
-        .signature = ACPI_SIG_RSDP,
-        .revision = 2,
-        .length = sizeof(rsdp),
-    };
-
-
-    /* Scan top-level tables and add their regions to the guest memory map. */
-    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
-    {
-        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
-        unsigned long addr = acpi_gbl_root_table_list.tables[i].address;
-        unsigned long size = acpi_gbl_root_table_list.tables[i].length;
-
-        /*
-         * Make sure the original MADT is also mapped, so that Dom0 can
-         * properly access the data returned by _MAT methods in case it's
-         * re-using MADT memory.
-         */
-        if ( strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE)
-             ? pvh_acpi_table_allowed(sig)
-             : !acpi_memory_banned(addr, size) )
-             pvh_add_mem_range(d, addr, addr + size, E820_ACPI);
-    }
-
-    /* Identity map ACPI e820 regions. */
-    for ( i = 0; i < d->arch.nr_e820; i++ )
-    {
-        if ( d->arch.e820[i].type != E820_ACPI &&
-             d->arch.e820[i].type != E820_NVS )
-            continue;
-
-        pfn = PFN_DOWN(d->arch.e820[i].addr);
-        nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
-                          d->arch.e820[i].size);
-
-        rc = modify_identity_mmio(d, pfn, nr_pages, true);
-        if ( rc )
-        {
-            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
-                   pfn, pfn + nr_pages);
-            return rc;
-        }
-    }
-
-    rc = pvh_setup_acpi_madt(d, &madt_paddr);
-    if ( rc )
-        return rc;
-
-    rc = pvh_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
-    if ( rc )
-        return rc;
-
-    /* Craft a custom RSDP. */
-    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));
-    if ( !native_rsdp )
-    {
-        printk("Failed to map native RSDP\n");
-        return -ENOMEM;
-    }
-    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
-    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
-    rsdp.xsdt_physical_address = xsdt_paddr;
-    /*
-     * Calling acpi_tb_checksum here is a layering violation, but
-     * introducing a wrapper for such simple usage seems overkill.
-     */
-    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
-                                      ACPI_RSDP_REV0_SIZE);
-    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
-                                               sizeof(rsdp));
-
-    /*
-     * Place the new RSDP in guest memory space.
-     *
-     * NB: this RSDP is not going to replace the original RSDP, which should
-     * still be accessible to the guest. However that RSDP is going to point to
-     * the native RSDT, and should not be used for the Dom0 kernel's boot
-     * purposes (we keep it visible for post boot access).
-     */
-    if ( pvh_steal_ram(d, sizeof(rsdp), 0, GB(4), &rsdp_paddr) )
-    {
-        printk("Unable to allocate guest RAM for RSDP\n");
-        return -ENOMEM;
-    }
-
-    /* Mark this region as E820_ACPI. */
-    if ( pvh_add_mem_range(d, rsdp_paddr, rsdp_paddr + sizeof(rsdp),
-                           E820_ACPI) )
-        printk("Unable to add RSDP region to memory map\n");
-
-    /* Copy RSDP into guest memory. */
-    rc = hvm_copy_to_guest_phys(rsdp_paddr, &rsdp, sizeof(rsdp), d->vcpu[0]);
-    if ( rc )
-    {
-        printk("Unable to copy RSDP into guest memory\n");
-        return rc;
-    }
-
-    /* Copy RSDP address to start_info. */
-    rc = hvm_copy_to_guest_phys(start_info +
-                                offsetof(struct hvm_start_info, rsdp_paddr),
-                                &rsdp_paddr,
-                                sizeof(((struct hvm_start_info *)
-                                        0)->rsdp_paddr),
-                                d->vcpu[0]);
-    if ( rc )
-    {
-        printk("Unable to copy RSDP into guest memory\n");
-        return rc;
-    }
-
-    return 0;
-}
-
-static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
-                                     unsigned long image_headroom,
-                                     module_t *initrd,
-                                     void *(*bootstrap_map)(const module_t *),
-                                     char *cmdline)
-{
-    paddr_t entry, start_info;
-    int rc;
-
-    printk("** Building a PVH Dom0 **\n");
-
-    iommu_hwdom_init(d);
-
-    rc = pvh_setup_p2m(d);
-    if ( rc )
-    {
-        printk("Failed to setup Dom0 physical memory map\n");
-        return rc;
-    }
-
-    rc = pvh_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
-                         cmdline, &entry, &start_info);
-    if ( rc )
-    {
-        printk("Failed to load Dom0 kernel\n");
-        return rc;
-    }
-
-    rc = pvh_setup_cpus(d, entry, start_info);
-    if ( rc )
-    {
-        printk("Failed to setup Dom0 CPUs: %d\n", rc);
-        return rc;
-    }
-
-    rc = pvh_setup_acpi(d, start_info);
-    if ( rc )
-    {
-        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
-        return rc;
-    }
-
-    panic("Building a PVHv2 Dom0 is not yet supported.");
-    return 0;
-}
-
 int __init construct_dom0(struct domain *d, const module_t *image,
                           unsigned long image_headroom, module_t *initrd,
                           void *(*bootstrap_map)(const module_t *),
@@ -1546,7 +483,7 @@ int __init construct_dom0(struct domain *d, const module_t *image,
 
     process_pending_softirqs();
 
-    return (is_hvm_domain(d) ? construct_dom0_pvh : dom0_construct_pv)
+    return (is_hvm_domain(d) ? dom0_construct_pvh : dom0_construct_pv)
            (d, image, image_headroom, initrd,bootstrap_map, cmdline);
 }
 
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index ec0daae0ea..0a3d0f4f7e 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -3,6 +3,7 @@ subdir-y += vmx
 
 obj-y += asid.o
 obj-y += dm.o
+obj-bin-y += dom0_build.init.o
 obj-y += emulate.o
 obj-y += hpet.o
 obj-y += hvm.o
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
new file mode 100644
index 0000000000..0256054e54
--- /dev/null
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -0,0 +1,1112 @@
+/*
+ * hvm/dom0_build.c
+ *
+ * Dom0 builder for PVH guest.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/acpi.h>
+#include <xen/init.h>
+#include <xen/libelf.h>
+#include <xen/multiboot.h>
+#include <xen/softirq.h>
+
+#include <acpi/actables.h>
+
+#include <asm/bzimage.h>
+#include <asm/dom0_build.h>
+#include <asm/hvm/support.h>
+#include <asm/io_apic.h>
+#include <asm/p2m.h>
+#include <asm/paging.h>
+#include <asm/setup.h>
+
+#include <public/arch-x86/hvm/start_info.h>
+#include <public/hvm/hvm_info_table.h>
+#include <public/hvm/hvm_vcpu.h>
+
+/*
+ * Have the TSS cover the ISA port range, which makes it
+ * - 104 bytes base structure
+ * - 32 bytes interrupt redirection bitmap
+ * - 128 bytes I/O bitmap
+ * - one trailing byte
+ * or a total of 265 bytes.
+ *
+ * NB: as PVHv2 Dom0 doesn't have legacy devices (ISA), it shouldn't have any
+ * business in accessing the ISA port range, much less in real mode, and due to
+ * the lack of firmware it shouldn't also execute any INT instruction. This is
+ * done just for consistency with what hvmloader does.
+ */
+#define HVM_VM86_TSS_SIZE 265
+
+static unsigned int __initdata acpi_intr_overrides;
+static struct acpi_madt_interrupt_override __initdata *intsrcovr;
+
+static unsigned int __initdata acpi_nmi_sources;
+static struct acpi_madt_nmi_source __initdata *nmisrc;
+
+static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
+                                       unsigned long nr_pages, const bool map)
+{
+    int rc;
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            break;
+        }
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
+/* Populate a HVM memory range using the biggest possible order. */
+static int __init pvh_populate_memory_range(struct domain *d,
+                                            unsigned long start,
+                                            unsigned long nr_pages)
+{
+    unsigned int order, i = 0;
+    struct page_info *page;
+    int rc;
+#define MAP_MAX_ITER 64
+
+    order = MAX_ORDER;
+    while ( nr_pages != 0 )
+    {
+        unsigned int range_order = get_order_from_pages(nr_pages + 1);
+
+        order = min(range_order ? range_order - 1 : 0, order);
+        page = alloc_domheap_pages(d, order, dom0_memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && dom0_memflags )
+            {
+                /* Try again without any dom0_memflags. */
+                dom0_memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+            {
+                printk("Unable to allocate memory with order 0!\n");
+                return -ENOMEM;
+            }
+            order--;
+            continue;
+        }
+
+        rc = guest_physmap_add_page(d, _gfn(start), _mfn(page_to_mfn(page)),
+                                    order);
+        if ( rc != 0 )
+        {
+            printk("Failed to populate memory: [%#lx,%lx): %d\n",
+                   start, start + (1UL << order), rc);
+            return -ENOMEM;
+        }
+        start += 1UL << order;
+        nr_pages -= 1UL << order;
+        if ( (++i % MAP_MAX_ITER) == 0 )
+            process_pending_softirqs();
+    }
+
+    return 0;
+#undef MAP_MAX_ITER
+}
+
+/* Steal RAM from the end of a memory region. */
+static int __init pvh_steal_ram(struct domain *d, unsigned long size,
+                                unsigned long align, paddr_t limit,
+                                paddr_t *addr)
+{
+    unsigned int i = d->arch.nr_e820;
+
+    /*
+     * Alignment 0 should be set to 1, so it doesn't wrap around in the
+     * calculations below.
+     */
+    align = align ? : 1;
+    while ( i-- )
+    {
+        struct e820entry *entry = &d->arch.e820[i];
+
+        if ( entry->type != E820_RAM || entry->addr + entry->size > limit ||
+             entry->addr < MB(1) )
+            continue;
+
+        *addr = (entry->addr + entry->size - size) & ~(align - 1);
+        if ( *addr < entry->addr )
+            continue;
+
+        entry->size = *addr - entry->addr;
+        return 0;
+    }
+
+    return -ENOMEM;
+}
+
+/* NB: memory map must be sorted at all times for this to work correctly. */
+static int __init pvh_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
+                                    unsigned int type)
+{
+    struct e820entry *map;
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        uint64_t rs = d->arch.e820[i].addr;
+        uint64_t re = rs + d->arch.e820[i].size;
+
+        if ( rs == e && d->arch.e820[i].type == type )
+        {
+            d->arch.e820[i].addr = s;
+            return 0;
+        }
+
+        if ( re == s && d->arch.e820[i].type == type &&
+             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
+        {
+            d->arch.e820[i].size += e - s;
+            return 0;
+        }
+
+        if ( rs >= e )
+            break;
+
+        if ( re > s )
+            return -EEXIST;
+    }
+
+    map = xzalloc_array(struct e820entry, d->arch.nr_e820 + 1);
+    if ( !map )
+    {
+        printk(XENLOG_WARNING "E820: out of memory to add region\n");
+        return -ENOMEM;
+    }
+
+    memcpy(map, d->arch.e820, i * sizeof(*d->arch.e820));
+    memcpy(map + i + 1, d->arch.e820 + i,
+           (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
+    map[i].addr = s;
+    map[i].size = e - s;
+    map[i].type = type;
+    xfree(d->arch.e820);
+    d->arch.e820 = map;
+    d->arch.nr_e820++;
+
+    return 0;
+}
+
+static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
+{
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    mfn_t mfn;
+    paddr_t gaddr;
+    struct vcpu *v = d->vcpu[0];
+
+    /*
+     * Steal some space from the last RAM region below 4GB and use it to
+     * store the real-mode TSS. It needs to be aligned to 128 so that the
+     * TSS structure (which accounts for the first 104b) doesn't cross
+     * a page boundary.
+     */
+    if ( !pvh_steal_ram(d, HVM_VM86_TSS_SIZE, 128, GB(4), &gaddr) )
+    {
+        if ( hvm_copy_to_guest_phys(gaddr, NULL, HVM_VM86_TSS_SIZE, v) !=
+             HVMCOPY_okay )
+            printk("Unable to zero VM86 TSS area\n");
+        d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS_SIZED] =
+            VM86_TSS_UPDATED | ((uint64_t)HVM_VM86_TSS_SIZE << 32) | gaddr;
+        if ( pvh_add_mem_range(d, gaddr, gaddr + HVM_VM86_TSS_SIZE,
+                               E820_RESERVED) )
+            printk("Unable to set VM86 TSS as reserved in the memory map\n");
+    }
+    else
+        printk("Unable to allocate VM86 TSS area\n");
+
+    /* Steal some more RAM for the identity page tables. */
+    if ( pvh_steal_ram(d, PAGE_SIZE, PAGE_SIZE, GB(4), &gaddr) )
+    {
+        printk("Unable to find memory to stash the identity page tables\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    write_32bit_pse_identmap(ident_pt);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+    if ( pvh_add_mem_range(d, gaddr, gaddr + PAGE_SIZE, E820_RESERVED) )
+            printk("Unable to set identity page tables as reserved in the memory map\n");
+
+    return 0;
+}
+
+/* Assign the low 1MB to Dom0. */
+static void __init pvh_steal_low_ram(struct domain *d, unsigned long start,
+                                     unsigned long nr_pages)
+{
+    unsigned long mfn;
+
+    ASSERT(start + nr_pages <= PFN_DOWN(MB(1)));
+
+    for ( mfn = start; mfn < start + nr_pages; mfn++ )
+    {
+        struct page_info *pg = mfn_to_page(mfn);
+        int rc;
+
+        rc = unshare_xen_page_with_guest(pg, dom_io);
+        if ( rc )
+        {
+            printk("Unable to unshare Xen mfn %#lx: %d\n", mfn, rc);
+            continue;
+        }
+
+        share_xen_page_with_guest(pg, d, XENSHARE_writable);
+        rc = guest_physmap_add_entry(d, _gfn(mfn), _mfn(mfn), 0, p2m_ram_rw);
+        if ( rc )
+            printk("Unable to add mfn %#lx to p2m: %d\n", mfn, rc);
+    }
+}
+
+static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
+{
+    struct e820entry *entry, *entry_guest;
+    unsigned int i;
+    unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
+
+    /*
+     * Craft the e820 memory map for Dom0 based on the hardware e820 map.
+     */
+    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
+    if ( !d->arch.e820 )
+        panic("Unable to allocate memory for Dom0 e820 map");
+    entry_guest = d->arch.e820;
+
+    /* Clamp e820 memory map to match the memory assigned to Dom0 */
+    for ( i = 0, entry = e820.map; i < e820.nr_map; i++, entry++ )
+    {
+        if ( entry->type != E820_RAM )
+        {
+            *entry_guest = *entry;
+            goto next;
+        }
+
+        if ( nr_pages == cur_pages )
+        {
+            /*
+             * We already have all the assigned memory,
+             * skip this entry
+             */
+            continue;
+        }
+
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation. Since
+         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
+         * order to prevent this code from getting out of sync.
+         */
+        start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
+        end = (entry->addr + entry->size) &
+              ~((PAGE_SIZE << PAGE_ORDER_4K) - 1);
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
+        if ( (cur_pages + pages) > nr_pages )
+        {
+            /* Truncate region */
+            entry_guest->size = (nr_pages - cur_pages) << PAGE_SHIFT;
+            cur_pages = nr_pages;
+        }
+        else
+        {
+            cur_pages += pages;
+        }
+ next:
+        d->arch.nr_e820++;
+        entry_guest++;
+    }
+    ASSERT(cur_pages == nr_pages);
+    ASSERT(d->arch.nr_e820 <= e820.nr_map);
+}
+
+static int __init pvh_setup_p2m(struct domain *d)
+{
+    struct vcpu *v = d->vcpu[0];
+    unsigned long nr_pages;
+    unsigned int i;
+    int rc;
+    bool preempted;
+#define MB1_PAGES PFN_DOWN(MB(1))
+
+    nr_pages = dom0_compute_nr_pages(d, NULL, 0);
+
+    pvh_setup_e820(d, nr_pages);
+    do {
+        preempted = false;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Memory below 1MB is identity mapped.
+     * NB: this only makes sense when booted from legacy BIOS.
+     */
+    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
+    if ( rc )
+    {
+        printk("Failed to identity map low 1MB: %d\n", rc);
+        return rc;
+    }
+
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        unsigned long addr, size;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        addr = PFN_DOWN(d->arch.e820[i].addr);
+        size = PFN_DOWN(d->arch.e820[i].size);
+
+        if ( addr >= MB1_PAGES )
+            rc = pvh_populate_memory_range(d, addr, size);
+        else
+        {
+            ASSERT(addr + size < MB1_PAGES);
+            pvh_steal_low_ram(d, addr, size);
+        }
+
+        if ( rc )
+            return rc;
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = pvh_setup_vmx_realmode_helpers(d);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+#undef MB1_PAGES
+}
+
+static int __init pvh_load_kernel(struct domain *d, const module_t *image,
+                                  unsigned long image_headroom,
+                                  module_t *initrd, void *image_base,
+                                  char *cmdline, paddr_t *entry,
+                                  paddr_t *start_info_addr)
+{
+    void *image_start = image_base + image_headroom;
+    unsigned long image_len = image->mod_end;
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    paddr_t last_addr;
+    struct hvm_start_info start_info = { 0 };
+    struct hvm_modlist_entry mod = { 0 };
+    struct vcpu *v = d->vcpu[0];
+    int rc;
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    {
+        printk("Error trying to detect bz compressed kernel\n");
+        return rc;
+    }
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+    {
+        printk("Unable to init ELF\n");
+        return rc;
+    }
+#ifdef VERBOSE
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+    {
+        printk("Unable to parse kernel for ELFNOTES\n");
+        return rc;
+    }
+
+    if ( parms.phys_entry == UNSET_ADDR32 )
+    {
+        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
+        return -EINVAL;
+    }
+
+    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
+           parms.guest_ver, parms.loader,
+           elf_64bit(&elf) ? "64-bit" : "32-bit");
+
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
+    elf.dest_size = parms.virt_kend - parms.virt_kstart;
+
+    elf_set_vcpu(&elf, v);
+    rc = elf_load_binary(&elf);
+    if ( rc < 0 )
+    {
+        printk("Failed to load kernel: %d\n", rc);
+        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
+        return rc;
+    }
+
+    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
+                                    initrd->mod_end, v);
+        if ( rc )
+        {
+            printk("Unable to copy initrd to guest\n");
+            return rc;
+        }
+
+        mod.paddr = last_addr;
+        mod.size = initrd->mod_end;
+        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    if ( cmdline != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1, v);
+        if ( rc )
+        {
+            printk("Unable to copy guest command line\n");
+            return rc;
+        }
+        start_info.cmdline_paddr = last_addr;
+        /*
+         * Round up to 32/64 bits (depending on the guest kernel bitness) so
+         * the modlist/start_info is aligned.
+         */
+        last_addr += ROUNDUP(strlen(cmdline) + 1, elf_64bit(&elf) ? 8 : 4);
+    }
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod), v);
+        if ( rc )
+        {
+            printk("Unable to copy guest modules\n");
+            return rc;
+        }
+        start_info.modlist_paddr = last_addr;
+        start_info.nr_modules = 1;
+        last_addr += sizeof(mod);
+    }
+
+    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
+    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info), v);
+    if ( rc )
+    {
+        printk("Unable to copy start info to guest\n");
+        return rc;
+    }
+
+    *entry = parms.phys_entry;
+    *start_info_addr = last_addr;
+
+    return 0;
+}
+
+static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
+                                 paddr_t start_info)
+{
+    struct vcpu *v = d->vcpu[0];
+    unsigned int cpu, i;
+    int rc;
+    /*
+     * This sets the vCPU state according to the state described in
+     * docs/misc/hvmlite.markdown.
+     */
+    vcpu_hvm_context_t cpu_ctx = {
+        .mode = VCPU_HVM_MODE_32B,
+        .cpu_regs.x86_32.ebx = start_info,
+        .cpu_regs.x86_32.eip = entry,
+        .cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET,
+        .cpu_regs.x86_32.cs_limit = ~0u,
+        .cpu_regs.x86_32.ds_limit = ~0u,
+        .cpu_regs.x86_32.ss_limit = ~0u,
+        .cpu_regs.x86_32.tr_limit = 0x67,
+        .cpu_regs.x86_32.cs_ar = 0xc9b,
+        .cpu_regs.x86_32.ds_ar = 0xc93,
+        .cpu_regs.x86_32.ss_ar = 0xc93,
+        .cpu_regs.x86_32.tr_ar = 0x8b,
+    };
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        dom0_setup_vcpu(d, i, cpu);
+    }
+
+    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
+    if ( rc )
+    {
+        printk("Unable to setup Dom0 BSP context: %d\n", rc);
+        return rc;
+    }
+
+    rc = dom0_setup_permissions(d);
+    if ( rc )
+    {
+        panic("Unable to setup Dom0 permissions: %d\n", rc);
+        return rc;
+    }
+
+    update_domain_wallclock_time(d);
+
+    clear_bit(_VPF_down, &v->pause_flags);
+
+    return 0;
+}
+
+static int __init acpi_count_intr_ovr(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+
+    acpi_intr_overrides++;
+    return 0;
+}
+
+static int __init acpi_set_intr_ovr(struct acpi_subtable_header *header,
+                                    const unsigned long end)
+{
+    const struct acpi_madt_interrupt_override *intr =
+        container_of(header, struct acpi_madt_interrupt_override, header);
+
+    *intsrcovr = *intr;
+    intsrcovr++;
+
+    return 0;
+}
+
+static int __init acpi_count_nmi_src(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+
+    acpi_nmi_sources++;
+    return 0;
+}
+
+static int __init acpi_set_nmi_src(struct acpi_subtable_header *header,
+                                   const unsigned long end)
+{
+    const struct acpi_madt_nmi_source *src =
+        container_of(header, struct acpi_madt_nmi_source, header);
+
+    *nmisrc = *src;
+    nmisrc++;
+
+    return 0;
+}
+
+static int __init pvh_setup_acpi_madt(struct domain *d, paddr_t *addr)
+{
+    struct acpi_table_madt *madt;
+    struct acpi_table_header *table;
+    struct acpi_madt_io_apic *io_apic;
+    struct acpi_madt_local_x2apic *x2apic;
+    acpi_status status;
+    unsigned long size;
+    unsigned int i, max_vcpus;
+    int rc;
+
+    /* Count number of interrupt overrides in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
+                          acpi_count_intr_ovr, UINT_MAX);
+
+    /* Count number of NMI sources in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_count_nmi_src,
+                          UINT_MAX);
+
+    max_vcpus = dom0_max_vcpus();
+    /* Calculate the size of the crafted MADT. */
+    size = sizeof(*madt);
+    /*
+     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
+     * per domain. This must be fixed in order to provide the same amount of
+     * IO APICs as available on bare metal.
+     */
+    size += sizeof(*io_apic);
+    size += sizeof(*intsrcovr) * acpi_intr_overrides;
+    size += sizeof(*nmisrc) * acpi_nmi_sources;
+    size += sizeof(*x2apic) * max_vcpus;
+
+    madt = xzalloc_bytes(size);
+    if ( !madt )
+    {
+        printk("Unable to allocate memory for MADT table\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Copy the native MADT table header. */
+    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
+    if ( !ACPI_SUCCESS(status) )
+    {
+        printk("Failed to get MADT ACPI table, aborting.\n");
+        rc = -EINVAL;
+        goto out;
+    }
+    madt->header = *table;
+    madt->address = APIC_DEFAULT_PHYS_BASE;
+    /*
+     * NB: this is currently set to 4, which is the revision in the ACPI
+     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
+     * tables described in the headers.
+     */
+    madt->header.revision = min_t(unsigned char, table->revision, 4);
+
+    /*
+     * Setup the IO APIC entry.
+     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
+     * per domain. This must be fixed in order to provide the same amount of
+     * IO APICs as available on bare metal, and with the same IDs as found in
+     * the native IO APIC MADT entries.
+     */
+    if ( nr_ioapics > 1 )
+        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
+               nr_ioapics);
+    io_apic = (void *)(madt + 1);
+    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
+    io_apic->header.length = sizeof(*io_apic);
+    io_apic->id = domain_vioapic(d)->id;
+    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
+
+    x2apic = (void *)(io_apic + 1);
+    for ( i = 0; i < max_vcpus; i++ )
+    {
+        x2apic->header.type = ACPI_MADT_TYPE_LOCAL_X2APIC;
+        x2apic->header.length = sizeof(*x2apic);
+        x2apic->uid = i;
+        x2apic->local_apic_id = i * 2;
+        x2apic->lapic_flags = ACPI_MADT_ENABLED;
+        x2apic++;
+    }
+
+    /* Setup interrupt overrides. */
+    intsrcovr = (void *)x2apic;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ovr,
+                          acpi_intr_overrides);
+
+    /* Setup NMI sources. */
+    nmisrc = (void *)intsrcovr;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_set_nmi_src,
+                          acpi_nmi_sources);
+
+    ASSERT(((void *)nmisrc - (void *)madt) == size);
+    madt->header.length = size;
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt), size);
+
+    /* Place the new MADT in guest memory space. */
+    if ( pvh_steal_ram(d, size, 0, GB(4), addr) )
+    {
+        printk("Unable to find allocate guest RAM for MADT\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add MADT region to memory map\n");
+
+    rc = hvm_copy_to_guest_phys(*addr, madt, size, d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy MADT into guest memory\n");
+        goto out;
+    }
+
+    rc = 0;
+
+ out:
+    xfree(madt);
+
+    return rc;
+}
+
+static bool __init acpi_memory_banned(unsigned long address,
+                                      unsigned long size)
+{
+    unsigned long mfn, nr_pages, i;
+
+    mfn = PFN_DOWN(address);
+    nr_pages = PFN_UP((address & ~PAGE_MASK) + size);
+    for ( i = 0 ; i < nr_pages; i++ )
+        if ( !page_is_ram_type(mfn + i, RAM_TYPE_RESERVED) &&
+             !page_is_ram_type(mfn + i, RAM_TYPE_ACPI) )
+            return true;
+
+    return false;
+}
+
+static bool __init pvh_acpi_table_allowed(const char *sig)
+{
+    static const char __initconst banned_tables[][ACPI_NAME_SIZE] = {
+        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
+        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
+    unsigned int i;
+
+    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
+        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
+            return false;
+
+    /* Make sure table doesn't reside in a RAM region. */
+    if ( acpi_memory_banned(acpi_gbl_root_table_list.tables[i].address,
+                            acpi_gbl_root_table_list.tables[i].length) )
+    {
+        printk("Skipping table %.4s because resides in a non-ACPI, non-reserved region\n",
+               sig);
+        return false;
+    }
+
+    return true;
+}
+
+static int __init pvh_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
+                                      paddr_t *addr)
+{
+    struct acpi_table_xsdt *xsdt;
+    struct acpi_table_header *table;
+    struct acpi_table_rsdp *rsdp;
+    unsigned long size = sizeof(*xsdt);
+    unsigned int i, j, num_tables = 0;
+    paddr_t xsdt_paddr;
+    int rc;
+
+    /*
+     * Restore the original DMAR table signature; we are going to filter it
+     * from the new XSDT that is presented to the guest, so it is no longer
+     * necessary to have its signature zapped.
+     */
+    acpi_dmar_reinstate();
+
+    /* Count the number of tables that will be added to the XSDT. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( pvh_acpi_table_allowed(sig) )
+            num_tables++;
+    }
+
+    /*
+     * No need to add or subtract anything because struct acpi_table_xsdt
+     * includes one array slot already, and we have filtered out the original
+     * MADT and we are going to add a custom built MADT.
+     */
+    size += num_tables * sizeof(xsdt->table_offset_entry[0]);
+
+    xsdt = xzalloc_bytes(size);
+    if ( !xsdt )
+    {
+        printk("Unable to allocate memory for XSDT table\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Copy the native XSDT table header. */
+    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
+    if ( !rsdp )
+    {
+        printk("Unable to map RSDP\n");
+        rc = -EINVAL;
+        goto out;
+    }
+    xsdt_paddr = rsdp->xsdt_physical_address;
+    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
+    table = acpi_os_map_memory(xsdt_paddr, sizeof(*table));
+    if ( !table )
+    {
+        printk("Unable to map XSDT\n");
+        rc = -EINVAL;
+        goto out;
+    }
+    xsdt->header = *table;
+    acpi_os_unmap_memory(table, sizeof(*table));
+
+    /* Add the custom MADT. */
+    xsdt->table_offset_entry[0] = madt_addr;
+
+    /* Copy the addresses of the rest of the allowed tables. */
+    for( i = 0, j = 1; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( pvh_acpi_table_allowed(sig) )
+            xsdt->table_offset_entry[j++] =
+                                acpi_gbl_root_table_list.tables[i].address;
+    }
+
+    xsdt->header.revision = 1;
+    xsdt->header.length = size;
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
+
+    /* Place the new XSDT in guest memory space. */
+    if ( pvh_steal_ram(d, size, 0, GB(4), addr) )
+    {
+        printk("Unable to find guest RAM for XSDT\n");
+        rc = -ENOMEM;
+        goto out;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add XSDT region to memory map\n");
+
+    rc = hvm_copy_to_guest_phys(*addr, xsdt, size, d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy XSDT into guest memory\n");
+        goto out;
+    }
+
+    rc = 0;
+
+ out:
+    xfree(xsdt);
+
+    return rc;
+}
+
+static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
+{
+    unsigned long pfn, nr_pages;
+    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
+    unsigned int i;
+    int rc;
+    struct acpi_table_rsdp *native_rsdp, rsdp = {
+        .signature = ACPI_SIG_RSDP,
+        .revision = 2,
+        .length = sizeof(rsdp),
+    };
+
+
+    /* Scan top-level tables and add their regions to the guest memory map. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+        unsigned long addr = acpi_gbl_root_table_list.tables[i].address;
+        unsigned long size = acpi_gbl_root_table_list.tables[i].length;
+
+        /*
+         * Make sure the original MADT is also mapped, so that Dom0 can
+         * properly access the data returned by _MAT methods in case it's
+         * re-using MADT memory.
+         */
+        if ( strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE)
+             ? pvh_acpi_table_allowed(sig)
+             : !acpi_memory_banned(addr, size) )
+             pvh_add_mem_range(d, addr, addr + size, E820_ACPI);
+    }
+
+    /* Identity map ACPI e820 regions. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_ACPI &&
+             d->arch.e820[i].type != E820_NVS )
+            continue;
+
+        pfn = PFN_DOWN(d->arch.e820[i].addr);
+        nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
+                          d->arch.e820[i].size);
+
+        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+
+    rc = pvh_setup_acpi_madt(d, &madt_paddr);
+    if ( rc )
+        return rc;
+
+    rc = pvh_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
+    if ( rc )
+        return rc;
+
+    /* Craft a custom RSDP. */
+    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));
+    if ( !native_rsdp )
+    {
+        printk("Failed to map native RSDP\n");
+        return -ENOMEM;
+    }
+    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
+    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
+    rsdp.xsdt_physical_address = xsdt_paddr;
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                      ACPI_RSDP_REV0_SIZE);
+    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                               sizeof(rsdp));
+
+    /*
+     * Place the new RSDP in guest memory space.
+     *
+     * NB: this RSDP is not going to replace the original RSDP, which should
+     * still be accessible to the guest. However that RSDP is going to point to
+     * the native RSDT, and should not be used for the Dom0 kernel's boot
+     * purposes (we keep it visible for post boot access).
+     */
+    if ( pvh_steal_ram(d, sizeof(rsdp), 0, GB(4), &rsdp_paddr) )
+    {
+        printk("Unable to allocate guest RAM for RSDP\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, rsdp_paddr, rsdp_paddr + sizeof(rsdp),
+                           E820_ACPI) )
+        printk("Unable to add RSDP region to memory map\n");
+
+    /* Copy RSDP into guest memory. */
+    rc = hvm_copy_to_guest_phys(rsdp_paddr, &rsdp, sizeof(rsdp), d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    /* Copy RSDP address to start_info. */
+    rc = hvm_copy_to_guest_phys(start_info +
+                                offsetof(struct hvm_start_info, rsdp_paddr),
+                                &rsdp_paddr,
+                                sizeof(((struct hvm_start_info *)
+                                        0)->rsdp_paddr),
+                                d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    return 0;
+}
+
+int __init dom0_construct_pvh(struct domain *d, const module_t *image,
+                              unsigned long image_headroom,
+                              module_t *initrd,
+                              void *(*bootstrap_map)(const module_t *),
+                              char *cmdline)
+{
+    paddr_t entry, start_info;
+    int rc;
+
+    printk("** Building a PVH Dom0 **\n");
+
+    iommu_hwdom_init(d);
+
+    rc = pvh_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
+    rc = pvh_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
+                         cmdline, &entry, &start_info);
+    if ( rc )
+    {
+        printk("Failed to load Dom0 kernel\n");
+        return rc;
+    }
+
+    rc = pvh_setup_cpus(d, entry, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 CPUs: %d\n", rc);
+        return rc;
+    }
+
+    rc = pvh_setup_acpi(d, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
+        return rc;
+    }
+
+    panic("Building a PVHv2 Dom0 is not yet supported.");
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/dom0_build.h b/xen/include/asm-x86/dom0_build.h
index 2ba349e445..09cd8cb240 100644
--- a/xen/include/asm-x86/dom0_build.h
+++ b/xen/include/asm-x86/dom0_build.h
@@ -19,6 +19,15 @@ int dom0_construct_pv(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline);
 
+int dom0_construct_pvh(struct domain *d, const module_t *image,
+                       unsigned long image_headroom,
+                       module_t *initrd,
+                       void *(*bootstrap_map)(const module_t *),
+                       char *cmdline);
+
+unsigned long dom0_paging_pages(const struct domain *d,
+                                unsigned long nr_pages);
+
 #endif	/* _DOM0_BUILD_H_ */
 
 /*
-- 
2.11.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 18+ messages in thread
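
For orientation (an illustrative sketch, not part of the posted patch): after
this change the common xen/arch/x86/dom0_build.c simply dispatches to one of
the two builders, along the lines of the call whose argument tail is visible at
the top of this hunk. The predicate used to pick the builder is an assumption
here and may differ in the series:

    /* Dispatch in xen/arch/x86/dom0_build.c (sketch; predicate assumed). */
    return (is_hvm_domain(d) ? dom0_construct_pvh : dom0_construct_pv)
           (d, image, image_headroom, initrd, bootstrap_map, cmdline);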

* [PATCH v2 4/4] x86: clean up header files in dom0_build.c
  2017-03-16 17:54 [PATCH v2 0/4] Refactor x86 dom0 builder Wei Liu
                   ` (2 preceding siblings ...)
  2017-03-16 17:54 ` [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c Wei Liu
@ 2017-03-16 17:54 ` Wei Liu
  3 siblings, 0 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-16 17:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Remove the ones that are no longer needed and sort them.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/dom0_build.c | 40 ++++++----------------------------------
 1 file changed, 6 insertions(+), 34 deletions(-)

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 772f077d5a..7dc86999fd 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -5,46 +5,18 @@
  */
 
 #include <xen/init.h>
-#include <xen/lib.h>
-#include <xen/ctype.h>
-#include <xen/sched.h>
-#include <xen/sched-if.h>
-#include <xen/smp.h>
-#include <xen/delay.h>
-#include <xen/event.h>
-#include <xen/console.h>
-#include <xen/kernel.h>
-#include <xen/domain.h>
-#include <xen/version.h>
 #include <xen/iocap.h>
-#include <xen/bitops.h>
-#include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
-#include <xen/guest_access.h>
-#include <xen/acpi.h>
-#include <asm/regs.h>
-#include <asm/system.h>
-#include <asm/io.h>
-#include <asm/processor.h>
-#include <asm/desc.h>
+#include <xen/sched.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+
 #include <asm/dom0_build.h>
-#include <asm/i387.h>
-#include <asm/paging.h>
+#include <asm/hpet.h>
+#include <asm/io_apic.h>
 #include <asm/p2m.h>
-#include <asm/e820.h>
-#include <asm/acpi.h>
 #include <asm/setup.h>
-#include <asm/bzimage.h> /* for bzimage_parse */
-#include <asm/io_apic.h>
-#include <asm/hpet.h>
-
-#include <acpi/actables.h>
-
-#include <public/version.h>
-#include <public/hvm/hvm_info_table.h>
-#include <public/arch-x86/hvm/start_info.h>
-#include <public/hvm/hvm_vcpu.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
-- 
2.11.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c
  2017-03-16 17:54 ` [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c Wei Liu
@ 2017-03-16 18:00   ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2017-03-16 18:00 UTC (permalink / raw)
  To: Wei Liu, Xen-devel; +Cc: Jan Beulich, Roger Pau Monné

On 16/03/17 17:54, Wei Liu wrote:
> To reflect the true nature of this file. No functional change.
>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c Wei Liu
@ 2017-03-16 18:12   ` Roger Pau Monné
  2017-03-16 18:14     ` Wei Liu
  2017-03-16 18:15     ` Andrew Cooper
  2017-03-17 11:15   ` Andrew Cooper
  2017-03-20 10:40   ` Jan Beulich
  2 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2017-03-16 18:12 UTC (permalink / raw)
  To: Wei Liu; +Cc: Xen-devel, Jan Beulich, Andrew Cooper

On Thu, Mar 16, 2017 at 05:54:56PM +0000, Wei Liu wrote:
[...]
> +    /*
> +     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
> +     * different meaning.

Crap, I've forgot to remove this PVH fixme!

[...]
> diff --git a/xen/include/asm-x86/dom0_build.h b/xen/include/asm-x86/dom0_build.h
> new file mode 100644
> index 0000000000..2ba349e445
> --- /dev/null
> +++ b/xen/include/asm-x86/dom0_build.h
> @@ -0,0 +1,32 @@
> +#ifndef _DOM0_BUILD_H_
> +#define _DOM0_BUILD_H_
> +
> +#include <asm/setup.h>
> +
> +extern unsigned int dom0_memflags;
> +extern cpumask_t dom0_cpus;
> +
> +unsigned long dom0_compute_nr_pages(
> +    struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len);
> +struct vcpu *dom0_setup_vcpu(struct domain *d, unsigned int vcpu_id,
> +                                    unsigned int cpu);
> +int dom0_setup_permissions(struct domain *d);
> +
> +int dom0_construct_pv(
> +    struct domain *d,
> +    const module_t *image, unsigned long image_headroom,
> +    module_t *initrd,
> +    void *(*bootstrap_map)(const module_t *),
> +    char *cmdline);

Since this is kind of new code, could you fix the indentation of the
prototypes?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread
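
A possible shape for the reflowed declarations (illustrative only, assuming the
names stay as posted; the actual v3 hunk may differ):

    struct vcpu *dom0_setup_vcpu(struct domain *d, unsigned int vcpu_id,
                                 unsigned int cpu);

    int dom0_construct_pv(struct domain *d, const module_t *image,
                          unsigned long image_headroom, module_t *initrd,
                          void *(*bootstrap_map)(const module_t *),
                          char *cmdline);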

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 18:12   ` Roger Pau Monné
@ 2017-03-16 18:14     ` Wei Liu
  2017-03-16 18:15     ` Andrew Cooper
  1 sibling, 0 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-16 18:14 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Xen-devel, Wei Liu, Jan Beulich, Andrew Cooper

On Thu, Mar 16, 2017 at 06:12:08PM +0000, Roger Pau Monné wrote:
> On Thu, Mar 16, 2017 at 05:54:56PM +0000, Wei Liu wrote:
> [...]
> > +    /*
> > +     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
> > +     * different meaning.
> 
> Crap, I've forgot to remove this PVH fixme!

Don't worry, I will take care of it.

> 
> [...]
> > diff --git a/xen/include/asm-x86/dom0_build.h b/xen/include/asm-x86/dom0_build.h
> > new file mode 100644
> > index 0000000000..2ba349e445
> > --- /dev/null
> > +++ b/xen/include/asm-x86/dom0_build.h
> > @@ -0,0 +1,32 @@
> > +#ifndef _DOM0_BUILD_H_
> > +#define _DOM0_BUILD_H_
> > +
> > +#include <asm/setup.h>
> > +
> > +extern unsigned int dom0_memflags;
> > +extern cpumask_t dom0_cpus;
> > +
> > +unsigned long dom0_compute_nr_pages(
> > +    struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len);
> > +struct vcpu *dom0_setup_vcpu(struct domain *d, unsigned int vcpu_id,
> > +                                    unsigned int cpu);
> > +int dom0_setup_permissions(struct domain *d);
> > +
> > +int dom0_construct_pv(
> > +    struct domain *d,
> > +    const module_t *image, unsigned long image_headroom,
> > +    module_t *initrd,
> > +    void *(*bootstrap_map)(const module_t *),
> > +    char *cmdline);
> 
> Since this is kind of new code, could you fix the indentation of the
> prototypes?

Sure. I also notice there is something wrong with the indentation of
dom0_setup_vcpu. I will fix them.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 18:12   ` Roger Pau Monné
  2017-03-16 18:14     ` Wei Liu
@ 2017-03-16 18:15     ` Andrew Cooper
  2017-03-16 18:48       ` Wei Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2017-03-16 18:15 UTC (permalink / raw)
  To: Roger Pau Monné, Wei Liu; +Cc: Xen-devel, Jan Beulich

On 16/03/17 18:12, Roger Pau Monné wrote:
> On Thu, Mar 16, 2017 at 05:54:56PM +0000, Wei Liu wrote:
> [...]
>> +    /*
>> +     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
>> +     * different meaning.
> Crap, I've forgot to remove this PVH fixme!

And

andrewcoop@andrewcoop:/local/xen.git/xen$ git grep -l has_hvm_container
-- :/
../docs/designs/dmop.markdown
include/xen/sched.h

if you are feeling like fixing up other stray bits :)

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c
  2017-03-16 17:54 ` [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c Wei Liu
@ 2017-03-16 18:16   ` Roger Pau Monné
  2017-03-16 18:21     ` Wei Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2017-03-16 18:16 UTC (permalink / raw)
  To: Wei Liu; +Cc: Xen-devel, Jan Beulich, Andrew Cooper

On Thu, Mar 16, 2017 at 05:54:57PM +0000, Wei Liu wrote:
[...]
> diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
> index c74f9b52e0..772f077d5a 100644
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
[...]
> @@ -261,8 +240,8 @@ boolean_param("ro-hpet", ro_hpet);
>  
>  unsigned int __initdata dom0_memflags = MEMF_no_dma|MEMF_exact_node;
>  
> -static unsigned long __init dom0_paging_pages(const struct domain *d,
> -                                              unsigned long nr_pages)
> +unsigned long __init dom0_paging_pages(const struct domain *d,
> +                                       unsigned long nr_pages)
>  {
>      /* Copied from: libxl_get_required_shadow_memory() */
>      unsigned long memkb = nr_pages * (PAGE_SIZE / 1024);

Shouldn't this be moved to the PVH Dom0 file? AFAICT this is not used by the PV
Dom0, or else you would have made that global in the previous patch.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c
  2017-03-16 18:16   ` Roger Pau Monné
@ 2017-03-16 18:21     ` Wei Liu
  2017-03-17 10:00       ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Liu @ 2017-03-16 18:21 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Xen-devel, Wei Liu, Jan Beulich, Andrew Cooper

On Thu, Mar 16, 2017 at 06:16:26PM +0000, Roger Pau Monné wrote:
> On Thu, Mar 16, 2017 at 05:54:57PM +0000, Wei Liu wrote:
> [...]
> > diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
> > index c74f9b52e0..772f077d5a 100644
> > --- a/xen/arch/x86/dom0_build.c
> > +++ b/xen/arch/x86/dom0_build.c
> [...]
> > @@ -261,8 +240,8 @@ boolean_param("ro-hpet", ro_hpet);
> >  
> >  unsigned int __initdata dom0_memflags = MEMF_no_dma|MEMF_exact_node;
> >  
> > -static unsigned long __init dom0_paging_pages(const struct domain *d,
> > -                                              unsigned long nr_pages)
> > +unsigned long __init dom0_paging_pages(const struct domain *d,
> > +                                       unsigned long nr_pages)
> >  {
> >      /* Copied from: libxl_get_required_shadow_memory() */
> >      unsigned long memkb = nr_pages * (PAGE_SIZE / 1024);
> 
> Shouldn't this be moved to the PVH Dom0 file? AFAICT this is not used by the PV
> Dom0, or else you would have made that global in the previous patch.
> 

It is needed by dom0_compute_nr_pages which is used by PV dom0 builder
and PVH dom0 builder, plus it is directly called by PVH dom0 builder.
It's best to leave it here.

> Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 18:15     ` Andrew Cooper
@ 2017-03-16 18:48       ` Wei Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-16 18:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Jan Beulich, Roger Pau Monné

On Thu, Mar 16, 2017 at 06:15:01PM +0000, Andrew Cooper wrote:
> On 16/03/17 18:12, Roger Pau Monné wrote:
> > On Thu, Mar 16, 2017 at 05:54:56PM +0000, Wei Liu wrote:
> > [...]
> >> +    /*
> >> +     * PVH Fixme: XENFEAT_supervisor_mode_kernel has been reused in PVH with a
> >> +     * different meaning.
> > Crap, I've forgot to remove this PVH fixme!
> 
> And
> 
> andrewcoop@andrewcoop:/local/xen.git/xen$ git grep -l has_hvm_container
> -- :/
> ../docs/designs/dmop.markdown
> include/xen/sched.h
> 
> if you are feeling like fixing up other stray bits :)
> 

There are two more "PVH fixme" in code. I think I will write another
series to remove all the stale bits.

Wei.

> ~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c
  2017-03-16 18:21     ` Wei Liu
@ 2017-03-17 10:00       ` Roger Pau Monné
  0 siblings, 0 replies; 18+ messages in thread
From: Roger Pau Monné @ 2017-03-17 10:00 UTC (permalink / raw)
  To: Wei Liu; +Cc: Xen-devel, Jan Beulich, Andrew Cooper

On Thu, Mar 16, 2017 at 06:21:16PM +0000, Wei Liu wrote:
> On Thu, Mar 16, 2017 at 06:16:26PM +0000, Roger Pau Monné wrote:
> > On Thu, Mar 16, 2017 at 05:54:57PM +0000, Wei Liu wrote:
> > [...]
> > > diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
> > > index c74f9b52e0..772f077d5a 100644
> > > --- a/xen/arch/x86/dom0_build.c
> > > +++ b/xen/arch/x86/dom0_build.c
> > [...]
> > > @@ -261,8 +240,8 @@ boolean_param("ro-hpet", ro_hpet);
> > >  
> > >  unsigned int __initdata dom0_memflags = MEMF_no_dma|MEMF_exact_node;
> > >  
> > > -static unsigned long __init dom0_paging_pages(const struct domain *d,
> > > -                                              unsigned long nr_pages)
> > > +unsigned long __init dom0_paging_pages(const struct domain *d,
> > > +                                       unsigned long nr_pages)
> > >  {
> > >      /* Copied from: libxl_get_required_shadow_memory() */
> > >      unsigned long memkb = nr_pages * (PAGE_SIZE / 1024);
> > 
> > Shouldn't this be moved to the PVH Dom0 file? AFAICT this is not used by the PV
> > Dom0, or else you would have made that global in the previous patch.
> > 
> 
> It is needed by dom0_compute_nr_pages which is used by PV dom0 builder
> and PVH dom0 builder, plus it is directly called by PVH dom0 builder.
> It's best to leave it here.

Right, Xen needs that in order to account for the memory used by the paging
code, plus for setting the paging buffer itself.

I'm quite sure that creating a PV Dom0 with shadow is completely broken, and
that this could be moved provided that compute_dom0_nr_pages is also split
between PV and PVH, but let's leave this for another time.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c Wei Liu
  2017-03-16 18:12   ` Roger Pau Monné
@ 2017-03-17 11:15   ` Andrew Cooper
  2017-03-17 11:18     ` Wei Liu
  2017-03-20 10:40   ` Jan Beulich
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2017-03-17 11:15 UTC (permalink / raw)
  To: Wei Liu, Xen-devel; +Cc: Jan Beulich, Roger Pau Monné

On 16/03/17 17:54, Wei Liu wrote:
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> new file mode 100644
> index 0000000000..bb82c33a4c
> --- /dev/null
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -0,0 +1,910 @@
> +/******************************************************************************
> + * pv/dom0_build.c
> + *
> + * Copyright (c) 2002-2005, K A Fraser
> + */
> +
> +#include <xen/console.h>
> +#include <xen/domain.h>
> +#include <xen/domain_page.h>
> +#include <xen/init.h>
> +#include <xen/libelf.h>
> +#include <xen/multiboot.h>
> +#include <xen/paging.h>
> +#include <xen/pfn.h>
> +#include <xen/sched.h>
> +#include <xen/softirq.h>
> +
> +#include <asm/bzimage.h>
> +#include <asm/dom0_build.h>
> +#include <asm/page.h>
> +#include <asm/setup.h>
> +
> +/* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
> +#define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
> +#define L1_PROT (BASE_PROT|_PAGE_GUEST_KERNEL)
> +/* ... except for compatibility mode guests. */
> +#define COMPAT_L1_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED)
> +#define L2_PROT (BASE_PROT|_PAGE_DIRTY)
> +#define L3_PROT (BASE_PROT|_PAGE_DIRTY)
> +#define L4_PROT (BASE_PROT|_PAGE_DIRTY)
> +
> +static __init void dom0_update_physmap(struct domain *d, unsigned long pfn,
> +                                   unsigned long mfn, unsigned long vphysmap_s)

Sorry to be awkward, but can you fix up all the style issues as part of
moving this code?  There are a number of mis-indenting and missing newlines.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-17 11:15   ` Andrew Cooper
@ 2017-03-17 11:18     ` Wei Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Liu @ 2017-03-17 11:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Jan Beulich, Roger Pau Monné

On Fri, Mar 17, 2017 at 11:15:45AM +0000, Andrew Cooper wrote:
> On 16/03/17 17:54, Wei Liu wrote:
> > diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> > new file mode 100644
> > index 0000000000..bb82c33a4c
> > --- /dev/null
> > +++ b/xen/arch/x86/pv/dom0_build.c
> > @@ -0,0 +1,910 @@
> > +/******************************************************************************
> > + * pv/dom0_build.c
> > + *
> > + * Copyright (c) 2002-2005, K A Fraser
> > + */
> > +
> > +#include <xen/console.h>
> > +#include <xen/domain.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/init.h>
> > +#include <xen/libelf.h>
> > +#include <xen/multiboot.h>
> > +#include <xen/paging.h>
> > +#include <xen/pfn.h>
> > +#include <xen/sched.h>
> > +#include <xen/softirq.h>
> > +
> > +#include <asm/bzimage.h>
> > +#include <asm/dom0_build.h>
> > +#include <asm/page.h>
> > +#include <asm/setup.h>
> > +
> > +/* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
> > +#define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
> > +#define L1_PROT (BASE_PROT|_PAGE_GUEST_KERNEL)
> > +/* ... except for compatibility mode guests. */
> > +#define COMPAT_L1_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED)
> > +#define L2_PROT (BASE_PROT|_PAGE_DIRTY)
> > +#define L3_PROT (BASE_PROT|_PAGE_DIRTY)
> > +#define L4_PROT (BASE_PROT|_PAGE_DIRTY)
> > +
> > +static __init void dom0_update_physmap(struct domain *d, unsigned long pfn,
> > +                                   unsigned long mfn, unsigned long vphysmap_s)
> 
> Sorry to be awkward, but can you fix up all the style issues as part of
> moving this code?  There are a number of mis-indenting and missing newlines.
> 

No problem. Emacs is very good at this so it would be easy.

> ~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c Wei Liu
  2017-03-16 18:12   ` Roger Pau Monné
  2017-03-17 11:15   ` Andrew Cooper
@ 2017-03-20 10:40   ` Jan Beulich
  2017-03-20 13:10     ` Wei Liu
  2 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2017-03-20 10:40 UTC (permalink / raw)
  To: Wei Liu; +Cc: Andrew Cooper, Xen-devel, Roger Pau Monné

>>> On 16.03.17 at 18:54, <wei.liu2@citrix.com> wrote:
> @@ -154,11 +155,11 @@ static void __init parse_dom0_nodes(const char *s)
>  }
>  custom_param("dom0_nodes", parse_dom0_nodes);
>  
> -static cpumask_t __initdata dom0_cpus;
> +cpumask_t __initdata dom0_cpus;

I'd prefer if this variable remained static, and I think this is doable:

> -static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
> -                                           unsigned int vcpu_id,
> -                                           unsigned int cpu)
> +struct vcpu *__init dom0_setup_vcpu(struct domain *d,
> +                                    unsigned int vcpu_id,
> +                                    unsigned int cpu)
>  {

It's needed only by the callers of this function afaics, and the
cpumask_first() / cpumask_cycle() invocations could easily move
into the function, with the callers updating their "cpu" variables
from v->processor (with v assumed to be a variable to store the
return value of the function, and checked to be non-NULL).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-20 10:40   ` Jan Beulich
@ 2017-03-20 13:10     ` Wei Liu
  2017-03-20 13:35       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Liu @ 2017-03-20 13:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Wei Liu, Xen-devel, Roger Pau Monné

On Mon, Mar 20, 2017 at 04:40:49AM -0600, Jan Beulich wrote:
> >>> On 16.03.17 at 18:54, <wei.liu2@citrix.com> wrote:
> > @@ -154,11 +155,11 @@ static void __init parse_dom0_nodes(const char *s)
> >  }
> >  custom_param("dom0_nodes", parse_dom0_nodes);
> >  
> > -static cpumask_t __initdata dom0_cpus;
> > +cpumask_t __initdata dom0_cpus;
> 
> I'd prefer if this variable remained static, and I think this is doable:
> 
> > -static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
> > -                                           unsigned int vcpu_id,
> > -                                           unsigned int cpu)
> > +struct vcpu *__init dom0_setup_vcpu(struct domain *d,
> > +                                    unsigned int vcpu_id,
> > +                                    unsigned int cpu)
> >  {
> 
> It's needed only by the callers of this function afaics, and the
> cpumask_first() / cpumask_cycle() invocations could easily move
> into the function, with the callers updating their "cpu" variables
> from v->processor (with v assumed to be a variable to store the
> return value of the function, and checked to be non-NULL).
> 

Like this?

From 6b066814a424fdaf9ee0a1d2afc6b0765961e932 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Mon, 20 Mar 2017 13:05:08 +0000
Subject: [PATCH] x86: modify setup_dom0_vcpu to keep dom0_cpus static

We will later move the dom0 builders to separate directories. To avoid
the need to make dom0_cpus visible outside its original file, modify
setup_dom0_vcpu to cycle through dom0_cpus internally instead of
relying on the callers to do that.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
 xen/arch/x86/dom0_build.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 1c723c9ef1..102d3daea1 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -158,8 +158,9 @@ static cpumask_t __initdata dom0_cpus;
 
 static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
                                            unsigned int vcpu_id,
-                                           unsigned int cpu)
+                                           unsigned int prev_cpu)
 {
+    unsigned int cpu = cpumask_cycle(prev_cpu, &dom0_cpus);
     struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
 
     if ( v )
@@ -215,7 +216,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
+    return setup_dom0_vcpu(dom0, 0,
+                           cpumask_last(&dom0_cpus) /* so it wraps around to first pcpu */);
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -1155,8 +1157,11 @@ static int __init construct_dom0_pv(
     cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, &dom0_cpus);
-        setup_dom0_vcpu(d, i, cpu);
+        struct vcpu *p = setup_dom0_vcpu(d, i, cpu);
+        if ( !p )
+            panic("Cannot allocate vcpu%u for Dom0", i);
+
+        cpu = p->processor;
     }
 
     d->arch.paging.mode = 0;
@@ -1902,8 +1907,11 @@ static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
     cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, &dom0_cpus);
-        setup_dom0_vcpu(d, i, cpu);
+        struct vcpu *p = setup_dom0_vcpu(d, i, cpu);
+        if ( !p )
+            panic("Cannot allocate vcpu%u for Dom0", i);
+
+        cpu = p->processor;
     }
 
     rc = arch_set_info_hvm_guest(v, &cpu_ctx);
-- 
2.11.0



* Re: [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c
  2017-03-20 13:10     ` Wei Liu
@ 2017-03-20 13:35       ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2017-03-20 13:35 UTC (permalink / raw)
  To: Wei Liu; +Cc: Andrew Cooper, Xen-devel, Roger Pau Monné

>>> On 20.03.17 at 14:10, <wei.liu2@citrix.com> wrote:
> On Mon, Mar 20, 2017 at 04:40:49AM -0600, Jan Beulich wrote:
>> >>> On 16.03.17 at 18:54, <wei.liu2@citrix.com> wrote:
>> > @@ -154,11 +155,11 @@ static void __init parse_dom0_nodes(const char *s)
>> >  }
>> >  custom_param("dom0_nodes", parse_dom0_nodes);
>> >  
>> > -static cpumask_t __initdata dom0_cpus;
>> > +cpumask_t __initdata dom0_cpus;
>> 
>> I'd prefer if this variable remained static, and I think this is doable:
>> 
>> > -static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
>> > -                                           unsigned int vcpu_id,
>> > -                                           unsigned int cpu)
>> > +struct vcpu *__init dom0_setup_vcpu(struct domain *d,
>> > +                                    unsigned int vcpu_id,
>> > +                                    unsigned int cpu)
>> >  {
>> 
>> It's needed only by the callers of this function afaics, and the
>> cpumask_first() / cpumask_cycle() invocations could easily move
>> into the function, with the callers updating their "cpu" variables
>> from v->processor (with v assumed to be a variable to store the
>> return value of the function, and checked to be non-NULL).
>> 
> 
> Like this?

Almost (I'd have avoided using cpumask_last(), but that's a matter
of taste):

> @@ -1155,8 +1157,11 @@ static int __init construct_dom0_pv(
>      cpu = v->processor;
>      for ( i = 1; i < d->max_vcpus; i++ )
>      {
> -        cpu = cpumask_cycle(cpu, &dom0_cpus);
> -        setup_dom0_vcpu(d, i, cpu);
> +        struct vcpu *p = setup_dom0_vcpu(d, i, cpu);
> +        if ( !p )
> +            panic("Cannot allocate vcpu%u for Dom0", i);
> +
> +        cpu = p->processor;
>      }

Let's avoid panic()ing here, even if there's not much chance of making
it much further. Either simply break out of the loop (on the assumption
that further setup_dom0_vcpu() calls won't succeed either) or make the
assignment of cpu conditional upon p. Also, p wants to be
pointer-to-const here, and a blank line is needed between the
declaration and the statements.
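
For illustration, an untested sketch of the break-out variant described
above, keeping the surrounding hunk from the patch as is:

    cpu = v->processor;
    for ( i = 1; i < d->max_vcpus; i++ )
    {
        const struct vcpu *p = setup_dom0_vcpu(d, i, cpu);

        if ( !p )
            break; /* assume further allocations won't succeed either */

        cpu = p->processor;
    }

The alternative of making the assignment of cpu conditional upon p would
instead keep attempting the remaining allocations after a failure.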

Jan



end of thread

Thread overview: 18+ messages
2017-03-16 17:54 [PATCH v2 0/4] Refactor x86 dom0 builder Wei Liu
2017-03-16 17:54 ` [PATCH v2 1/4] x86: rename domain_build.c to dom0_build.c Wei Liu
2017-03-16 18:00   ` Andrew Cooper
2017-03-16 17:54 ` [PATCH v2 2/4] x86: split PV dom0 builder to pv/dom0_builder.c Wei Liu
2017-03-16 18:12   ` Roger Pau Monné
2017-03-16 18:14     ` Wei Liu
2017-03-16 18:15     ` Andrew Cooper
2017-03-16 18:48       ` Wei Liu
2017-03-17 11:15   ` Andrew Cooper
2017-03-17 11:18     ` Wei Liu
2017-03-20 10:40   ` Jan Beulich
2017-03-20 13:10     ` Wei Liu
2017-03-20 13:35       ` Jan Beulich
2017-03-16 17:54 ` [PATCH v2 3/4] x86: split PVH dom0 builder to hvm/dom0_build.c Wei Liu
2017-03-16 18:16   ` Roger Pau Monné
2017-03-16 18:21     ` Wei Liu
2017-03-17 10:00       ` Roger Pau Monné
2017-03-16 17:54 ` [PATCH v2 4/4] x86: clean up header files in dom0_build.c Wei Liu
