* [PATCH 00/17][V4]: PVH xen: version 4 patches...
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

Here is version 4 of my patches for 64-bit PVH guests for Xen.  This is Phase I.
These patches are built on top of git
c/s: 26c35e5cb93a7b4dcde940620eb7ac1845ed6e5a

Phase I:
   - Establish a baseline of something working. These patches allow
     dom0 to be booted in PVH mode, with PV, PVH, and HVM guests started
     after that. I also tested booting dom0 in PV mode and starting PV,
     PVH, and HVM guests.

     Also, the disk must be specified as phy: in the vm.cfg file:
         > losetup /dev/loop1 guest.img
         > vm.cfg file: disk = ['phy:/dev/loop1,xvda,w']

     I've not tested anything else.
     Note that HAP and an IOMMU are required for PVH.

As a result of V3, there are two new action items on the Linux side before
it will boot as PVH: 1) MSI-X fixup, and 2) load KERNEL_CS right after the
GDT switch.

V4: changes are listed in the individual patches that were changed.

The following fixmes exist in the code:
  - Add support for more memory types in arch/x86/hvm/mtrr.c.
  - arch/x86/time.c: support more tsc modes.
  - check_guest_io_breakpoint(): check/add support for IO breakpoints.
  - Implement arch_get_info_guest() for PVH.
  - vmxit_msr_read(): during the AMD port, go through
    hvm_msr_read_intercept() again.
  - Verify that breakpoint matching on emulated instructions works the
    same for PVH guests as for HVM. See instruction_done() and
    check_guest_io_breakpoint().

The following remain to be done for PVH:
   - AMD port.
   - Make posted interrupts available to PVH dom0. (This will be a big win.)
   - 32-bit support in both Linux and Xen. Xen changes are tagged
     "32bitfixme".
   - Add support for monitoring guest behavior. See the hvm_memory_event*
     functions in hvm.c.
   - Change xl to support modes other than "phy:".
   - Hotplug support.
   - Migration of PVH guests.

Thanks for all the help,
Mukesh


* [PATCH 01/17] PVH xen: turn gdt_frames/gdt_ents into union
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

Changes in V2:
  - Add __XEN_INTERFACE_VERSION__.

Changes in V3:
  - Rename union to 'gdt' and rename field names.
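
For illustration, a minimal sketch of how a caller built against
__XEN_INTERFACE_VERSION__ >= 0x00040400 might fill the two arms of the
new union (field names are from the patch below; the surrounding
variables and the is_pvh flag are hypothetical):

    struct vcpu_guest_context ctxt = { };

    if ( is_pvh )
    {
        /* PVH: pass the native GDTR base and limit. */
        ctxt.gdt.pvh.addr  = gdtr_base;   /* linear address of the GDT */
        ctxt.gdt.pvh.limit = gdtr_limit;
    }
    else
    {
        /* Classic PV: machine frames backing the GDT. */
        ctxt.gdt.pv.frames[0] = gdt_mfn;
        ctxt.gdt.pv.num_ents  = nr_gdt_ents;
    }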

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 tools/libxc/xc_domain_restore.c   |    8 ++++----
 tools/libxc/xc_domain_save.c      |    6 +++---
 xen/arch/x86/domain.c             |   12 ++++++------
 xen/arch/x86/domctl.c             |   12 ++++++------
 xen/include/public/arch-x86/xen.h |   14 ++++++++++++++
 5 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index a15f86a..5530631 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -2020,15 +2020,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
             munmap(start_info, PAGE_SIZE);
         }
         /* Uncanonicalise each GDT frame number. */
-        if ( GET_FIELD(ctxt, gdt_ents) > 8192 )
+        if ( GET_FIELD(ctxt, gdt.pv.num_ents) > 8192 )
         {
             ERROR("GDT entry count out of range");
             goto out;
         }
 
-        for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt_ents); j++ )
+        for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt.pv.num_ents); j++ )
         {
-            pfn = GET_FIELD(ctxt, gdt_frames[j]);
+            pfn = GET_FIELD(ctxt, gdt.pv.frames[j]);
             if ( (pfn >= dinfo->p2m_size) ||
                  (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) )
             {
@@ -2036,7 +2036,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       j, (unsigned long)pfn);
                 goto out;
             }
-            SET_FIELD(ctxt, gdt_frames[j], ctx->p2m[pfn]);
+            SET_FIELD(ctxt, gdt.pv.frames[j], ctx->p2m[pfn]);
         }
         /* Uncanonicalise the page table base pointer. */
         pfn = UNFOLD_CR3(GET_FIELD(ctxt, ctrlreg[3]));
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index ff76626..97cf64a 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1900,15 +1900,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
         }
 
         /* Canonicalise each GDT frame number. */
-        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ )
+        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt.pv.num_ents); j++ )
         {
-            mfn = GET_FIELD(&ctxt, gdt_frames[j]);
+            mfn = GET_FIELD(&ctxt, gdt.pv.frames[j]);
             if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn) )
             {
                 ERROR("GDT frame is not in range of pseudophys map");
                 goto out;
             }
-            SET_FIELD(&ctxt, gdt_frames[j], mfn_to_pfn(mfn));
+            SET_FIELD(&ctxt, gdt.pv.frames[j], mfn_to_pfn(mfn));
         }
 
         /* Canonicalise the page table base pointer. */
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 14b6d13..e4da965 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -781,8 +781,8 @@ int arch_set_info_guest(
         }
 
         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
-            fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt_frames[i]);
-        fail |= v->arch.pv_vcpu.gdt_ents != c(gdt_ents);
+            fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt.pv.frames[i]);
+        fail |= v->arch.pv_vcpu.gdt_ents != c(gdt.pv.num_ents);
 
         fail |= v->arch.pv_vcpu.ldt_base != c(ldt_base);
         fail |= v->arch.pv_vcpu.ldt_ents != c(ldt_ents);
@@ -831,17 +831,17 @@ int arch_set_info_guest(
         d->vm_assist = c(vm_assist);
 
     if ( !compat )
-        rc = (int)set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents);
+        rc = (int)set_gdt(v, c.nat->gdt.pv.frames, c.nat->gdt.pv.num_ents);
     else
     {
         unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames)];
-        unsigned int n = (c.cmp->gdt_ents + 511) / 512;
+        unsigned int n = (c.cmp->gdt.pv.num_ents + 511) / 512;
 
         if ( n > ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames) )
             return -EINVAL;
         for ( i = 0; i < n; ++i )
-            gdt_frames[i] = c.cmp->gdt_frames[i];
-        rc = (int)set_gdt(v, gdt_frames, c.cmp->gdt_ents);
+            gdt_frames[i] = c.cmp->gdt.pv.frames[i];
+        rc = (int)set_gdt(v, gdt_frames, c.cmp->gdt.pv.num_ents);
     }
     if ( rc != 0 )
         return rc;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index a196e2a..a59e418 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1305,12 +1305,12 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
         c(ldt_base = v->arch.pv_vcpu.ldt_base);
         c(ldt_ents = v->arch.pv_vcpu.ldt_ents);
         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
-            c(gdt_frames[i] = v->arch.pv_vcpu.gdt_frames[i]);
-        BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt_frames) !=
-                     ARRAY_SIZE(c.cmp->gdt_frames));
-        for ( ; i < ARRAY_SIZE(c.nat->gdt_frames); ++i )
-            c(gdt_frames[i] = 0);
-        c(gdt_ents = v->arch.pv_vcpu.gdt_ents);
+            c(gdt.pv.frames[i] = v->arch.pv_vcpu.gdt_frames[i]);
+        BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt.pv.frames) !=
+                     ARRAY_SIZE(c.cmp->gdt.pv.frames));
+        for ( ; i < ARRAY_SIZE(c.nat->gdt.pv.frames); ++i )
+            c(gdt.pv.frames[i] = 0);
+        c(gdt.pv.num_ents = v->arch.pv_vcpu.gdt_ents);
         c(kernel_ss = v->arch.pv_vcpu.kernel_ss);
         c(kernel_sp = v->arch.pv_vcpu.kernel_sp);
         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.ctrlreg); ++i )
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index b7f6a51..25c8519 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -170,7 +170,21 @@ struct vcpu_guest_context {
     struct cpu_user_regs user_regs;         /* User-level CPU registers     */
     struct trap_info trap_ctxt[256];        /* Virtual IDT                  */
     unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */
+#if __XEN_INTERFACE_VERSION__ < 0x00040400
     unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */
+#else
+    union {
+        struct {
+            /* GDT (machine frames, # ents) */
+            unsigned long frames[16], num_ents;
+        } pv;
+        struct {
+            /* PVH: GDTR addr and size */
+            uint64_t addr;
+            uint16_t limit;
+        } pvh;
+    } gdt;
+#endif
     unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1)   */
     /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */
     unsigned long ctrlreg[8];               /* CR0-CR7 (control registers)  */
-- 
1.7.2.3


* [PATCH 02/17] PVH xen: add XENMEM_add_to_physmap_range
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

In this patch we add a new function, xenmem_add_to_physmap_range(), and
change the xenmem_add_to_physmap_once() parameters so it can be called
from xenmem_add_to_physmap_range(). There is no PVH-specific change here.
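
For context, a rough sketch of how a guest might drive the new subop
(the struct fields match those consumed by the code below; the
HYPERVISOR_memory_op wrapper and the frame numbers are assumptions from
the caller's environment):

    xen_ulong_t idxs[2]  = { fgfn0, fgfn1 };  /* frames in foreign domain */
    xen_pfn_t   gpfns[2] = { gpfn0, gpfn1 };  /* where to map them locally */
    int         errs[2];
    struct xen_add_to_physmap_range xatpr = {
        .domid         = DOMID_SELF,
        .space         = XENMAPSPACE_gmfn_foreign,
        .size          = 2,
        .foreign_domid = fdom,
    };

    set_xen_guest_handle(xatpr.idxs, idxs);
    set_xen_guest_handle(xatpr.gpfns, gpfns);
    set_xen_guest_handle(xatpr.errs, errs);
    int rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatpr);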

Changes in V2:
  - Do not break up the parameters to xenmem_add_to_physmap_once(); pass
    in struct xen_add_to_physmap instead.

Changes in V3:
  - add xsm hook
  - redo xenmem_add_to_physmap_range() a bit as the struct
    xen_add_to_physmap_range got enhanced.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/mm.c |   80 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 58e1402..1c8442f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4269,7 +4269,8 @@ static int handle_iomem_range(unsigned long s, unsigned long e, void *p)
 
 static int xenmem_add_to_physmap_once(
     struct domain *d,
-    const struct xen_add_to_physmap *xatp)
+    const struct xen_add_to_physmap *xatp,
+    domid_t foreign_domid)
 {
     struct page_info *page = NULL;
     unsigned long gfn = 0; /* gcc ... */
@@ -4396,7 +4397,7 @@ static int xenmem_add_to_physmap(struct domain *d,
         start_xatp = *xatp;
         while ( xatp->size > 0 )
         {
-            rc = xenmem_add_to_physmap_once(d, xatp);
+            rc = xenmem_add_to_physmap_once(d, xatp, DOMID_INVALID);
             if ( rc < 0 )
                 return rc;
 
@@ -4422,7 +4423,43 @@ static int xenmem_add_to_physmap(struct domain *d,
         return rc;
     }
 
-    return xenmem_add_to_physmap_once(d, xatp);
+    return xenmem_add_to_physmap_once(d, xatp, DOMID_INVALID);
+}
+
+static int xenmem_add_to_physmap_range(struct domain *d,
+                                       struct xen_add_to_physmap_range *xatpr)
+{
+    int rc;
+
+    /* Process entries in reverse order to allow continuations */
+    while ( xatpr->size > 0 )
+    {
+        xen_ulong_t idx;
+        xen_pfn_t gpfn;
+        struct xen_add_to_physmap xatp;
+
+        if ( copy_from_guest_offset(&idx, xatpr->idxs, xatpr->size-1, 1)  ||
+             copy_from_guest_offset(&gpfn, xatpr->gpfns, xatpr->size-1, 1) )
+        {
+            return -EFAULT;
+        }
+
+        xatp.space = xatpr->space;
+        xatp.idx = idx;
+        xatp.gpfn = gpfn;
+        rc = xenmem_add_to_physmap_once(d, &xatp, xatpr->foreign_domid);
+
+        if ( copy_to_guest_offset(xatpr->errs, xatpr->size-1, &rc, 1) )
+            return -EFAULT;
+
+        xatpr->size--;
+
+        /* Check for continuation if it's not the last iteration */
+        if ( xatpr->size > 0 && hypercall_preempt_check() )
+            return -EAGAIN;
+    }
+
+    return 0;
 }
 
 long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
@@ -4439,6 +4476,10 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&xatp, arg, 1) )
             return -EFAULT;
 
+        /* This one is only supported for add_to_physmap_range */
+        if ( xatp.space == XENMAPSPACE_gmfn_foreign )
+            return -EINVAL;
+
         d = rcu_lock_domain_by_any_id(xatp.domid);
         if ( d == NULL )
             return -ESRCH;
@@ -4466,6 +4507,39 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
         return rc;
     }
 
+    case XENMEM_add_to_physmap_range:
+    {
+        struct xen_add_to_physmap_range xatpr;
+        struct domain *d;
+
+        if ( copy_from_guest(&xatpr, arg, 1) )
+            return -EFAULT;
+
+        /* This mapspace is redundant for this hypercall */
+        if ( xatpr.space == XENMAPSPACE_gmfn_range )
+            return -EINVAL;
+
+        rc = rcu_lock_target_domain_by_id(xatpr.domid, &d);
+        if ( rc != 0 )
+            return rc;
+
+        if ( xsm_add_to_physmap(XSM_TARGET, current->domain, d) )
+        {
+            rcu_unlock_domain(d);
+            return -EPERM;
+        }
+
+        rc = xenmem_add_to_physmap_range(d, &xatpr);
+
+        rcu_unlock_domain(d);
+
+        if ( rc == -EAGAIN )
+            rc = hypercall_create_continuation(
+                __HYPERVISOR_memory_op, "ih", op, arg);
+
+        return rc;
+    }
+
     case XENMEM_set_memory_map:
     {
         struct xen_foreign_memory_map fmap;
-- 
1.7.2.3


* [PATCH 03/17] PVH xen: create domctl_memory_mapping() function
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

In this patch, the XEN_DOMCTL_memory_mapping code is moved into a
function so it can be shared later for PVH. There is no change in its
functionality.
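
For context, a sketch of how the newly exported helper might be called
when constructing a PVH dom0 (the call site is hypothetical; the
signature is from the patch below):

    /* Identity-map the MMIO range [mfn, mfn + nr - 1] into the domain. */
    long rc = domctl_memory_mapping(d, mfn, mfn, nr, 1 /* add_map */);
    if ( rc )
        printk(XENLOG_ERR "Failed to map MMIO range: %ld\n", rc);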

Changes in V2:
  - Remove the PHYSDEVOP_map_iomem sub-hypercall, and the code supporting
    it, as the IO region is mapped transparently now.

Changes in V3:
  - change loop control variable to ulong from int.
  - move priv checks to the function.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domctl.c    |  128 ++++++++++++++++++++++++----------------------
 xen/include/xen/domain.h |    2 +
 2 files changed, 69 insertions(+), 61 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index a59e418..88fe868 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -46,6 +46,72 @@ static int gdbsx_guest_mem_io(
     return (iop->remain ? -EFAULT : 0);
 }
 
+long domctl_memory_mapping(struct domain *d, unsigned long gfn,
+                           unsigned long mfn, unsigned long nr_mfns,
+                           int add_map)
+{
+    unsigned long i;
+    long ret;
+
+    if ( !IS_PRIV(current->domain)  &&
+         !iomem_access_permitted(current->domain, mfn, mfn + nr_mfns - 1) )
+        return -EPERM;
+
+    if ( (mfn + nr_mfns - 1) < mfn || /* wrap? */
+         ((mfn | (mfn + nr_mfns - 1)) >> (paddr_bits - PAGE_SHIFT)) ||
+         (gfn + nr_mfns - 1) < gfn ) /* wrap? */
+        return -EINVAL;
+
+    ret = xsm_iomem_permission(XSM_HOOK, d, mfn, mfn + nr_mfns - 1, add_map);
+    if ( ret )
+        return ret;
+
+    if ( add_map )
+    {
+        printk(XENLOG_G_INFO
+               "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
+               d->domain_id, gfn, mfn, nr_mfns);
+
+        ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
+        if ( !ret && paging_mode_translate(d) )
+        {
+            for ( i = 0; !ret && i < nr_mfns; i++ )
+                if ( !set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i)) )
+                    ret = -EIO;
+            if ( ret )
+            {
+                printk(XENLOG_G_WARNING
+                       "memory_map:fail: dom%d gfn=%lx mfn=%lx\n",
+                       d->domain_id, gfn + i, mfn + i);
+                while ( i-- )
+                    clear_mmio_p2m_entry(d, gfn + i);
+                if ( iomem_deny_access(d, mfn, mfn + nr_mfns - 1) &&
+                     IS_PRIV(current->domain) )
+                    printk(XENLOG_ERR
+                           "memory_map: failed to deny dom%d access to [%lx,%lx]\n",
+                           d->domain_id, mfn, mfn + nr_mfns - 1);
+            }
+        }
+    } else {
+        printk(XENLOG_G_INFO
+               "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
+               d->domain_id, gfn, mfn, nr_mfns);
+
+        if ( paging_mode_translate(d) )
+            for ( i = 0; i < nr_mfns; i++ )
+                add_map |= !clear_mmio_p2m_entry(d, gfn + i);
+        ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
+        if ( !ret && add_map )
+            ret = -EIO;
+        if ( ret && IS_PRIV(current->domain) )
+            printk(XENLOG_ERR
+                   "memory_map: error %ld %s dom%d access to [%lx,%lx]\n",
+                   ret, add_map ? "removing" : "denying", d->domain_id,
+                   mfn, mfn + nr_mfns - 1);
+    }
+    return ret;
+}
+
 long arch_do_domctl(
     struct xen_domctl *domctl, struct domain *d,
     XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
@@ -628,68 +694,8 @@ long arch_do_domctl(
         unsigned long mfn = domctl->u.memory_mapping.first_mfn;
         unsigned long nr_mfns = domctl->u.memory_mapping.nr_mfns;
         int add = domctl->u.memory_mapping.add_mapping;
-        unsigned long i;
-
-        ret = -EINVAL;
-        if ( (mfn + nr_mfns - 1) < mfn || /* wrap? */
-             ((mfn | (mfn + nr_mfns - 1)) >> (paddr_bits - PAGE_SHIFT)) ||
-             (gfn + nr_mfns - 1) < gfn ) /* wrap? */
-            break;
-
-        ret = -EPERM;
-        if ( !IS_PRIV(current->domain) &&
-             !iomem_access_permitted(current->domain, mfn, mfn + nr_mfns - 1) )
-            break;
-
-        ret = xsm_iomem_mapping(XSM_HOOK, d, mfn, mfn + nr_mfns - 1, add);
-        if ( ret )
-            break;
 
-        if ( add )
-        {
-            printk(XENLOG_G_INFO
-                   "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
-                   d->domain_id, gfn, mfn, nr_mfns);
-
-            ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
-            if ( !ret && paging_mode_translate(d) )
-            {
-                for ( i = 0; !ret && i < nr_mfns; i++ )
-                    if ( !set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i)) )
-                        ret = -EIO;
-                if ( ret )
-                {
-                    printk(XENLOG_G_WARNING
-                           "memory_map:fail: dom%d gfn=%lx mfn=%lx\n",
-                           d->domain_id, gfn + i, mfn + i);
-                    while ( i-- )
-                        clear_mmio_p2m_entry(d, gfn + i);
-                    if ( iomem_deny_access(d, mfn, mfn + nr_mfns - 1) &&
-                         IS_PRIV(current->domain) )
-                        printk(XENLOG_ERR
-                               "memory_map: failed to deny dom%d access to [%lx,%lx]\n",
-                               d->domain_id, mfn, mfn + nr_mfns - 1);
-                }
-            }
-        }
-        else
-        {
-            printk(XENLOG_G_INFO
-                   "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
-                   d->domain_id, gfn, mfn, nr_mfns);
-
-            if ( paging_mode_translate(d) )
-                for ( i = 0; i < nr_mfns; i++ )
-                    add |= !clear_mmio_p2m_entry(d, gfn + i);
-            ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
-            if ( !ret && add )
-                ret = -EIO;
-            if ( ret && IS_PRIV(current->domain) )
-                printk(XENLOG_ERR
-                       "memory_map: error %ld %s dom%d access to [%lx,%lx]\n",
-                       ret, add ? "removing" : "denying", d->domain_id,
-                       mfn, mfn + nr_mfns - 1);
-        }
+        ret = domctl_memory_mapping(d, gfn, mfn, nr_mfns, add);
     }
     break;
 
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index d4ac50f..a7b4c34 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -86,4 +86,6 @@ extern unsigned int xen_processor_pmbits;
 
 extern bool_t opt_dom0_vcpus_pin;
 
+extern long domctl_memory_mapping(struct domain *d, unsigned long gfn,
+                    unsigned long mfn, unsigned long nr_mfns, int add_map);
 #endif /* __XEN_DOMAIN_H__ */
-- 
1.7.2.3


* [PATCH 04/17] PVH xen: add params to read_segment_register
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

In this patch, the read_segment_register macro is changed to take vcpu
and regs parameters so it can check whether it's a PVH guest (the check
itself arrives in upcoming patches). There is no functionality change.
Also, emulate_privileged_op() is made public for later use while
changing this file.
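
To show where this is headed, a hypothetical sketch of the PVH-aware
form the macro could take in a later patch (the is_pvh_vcpu() branch is
an assumption, not part of this patch):

    #define read_segment_register(vcpu, regs, name)                   \
    ({  u16 __sel;                                                    \
        /* PVH: selectors are saved in regs on VMEXIT (assumption) */ \
        if ( is_pvh_vcpu(vcpu) )                                      \
            __sel = (regs)->name;                                     \
        else                                                          \
            asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );\
        __sel;                                                        \
    })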

Changes in V2:  None
Changes in V3:
   - Replace read_sreg with read_segment_register

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c        |    8 ++++----
 xen/arch/x86/traps.c         |   28 +++++++++++++---------------
 xen/arch/x86/x86_64/traps.c  |   16 ++++++++--------
 xen/include/asm-x86/system.h |    2 +-
 xen/include/asm-x86/traps.h  |    1 +
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index e4da965..219d96c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1341,10 +1341,10 @@ static void save_segments(struct vcpu *v)
     struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned int dirty_segment_mask = 0;
 
-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(v, regs, ds);
+    regs->es = read_segment_register(v, regs, es);
+    regs->fs = read_segment_register(v, regs, fs);
+    regs->gs = read_segment_register(v, regs, gs);
 
     if ( regs->ds )
         dirty_segment_mask |= DIRTY_DS;
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d36eddd..8b6350e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1823,8 +1823,6 @@ static inline uint64_t guest_misc_enable(uint64_t val)
     }                                                                       \
     (eip) += sizeof(_x); _x; })
 
-#define read_sreg(regs, sr) read_segment_register(sr)
-
 static int is_cpufreq_controller(struct domain *d)
 {
     return ((cpufreq_controller == FREQCTL_dom0_kernel) &&
@@ -1833,7 +1831,7 @@ static int is_cpufreq_controller(struct domain *d)
 
 #include "x86_64/mmconfig.h"
 
-static int emulate_privileged_op(struct cpu_user_regs *regs)
+int emulate_privileged_op(struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;
     unsigned long *reg, eip = regs->eip;
@@ -1869,7 +1867,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         goto fail;
 
     /* emulating only opcodes not allowing SS to be default */
-    data_sel = read_sreg(regs, ds);
+    data_sel = read_segment_register(v, regs, ds);
 
     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1887,17 +1885,17 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
             data_sel = regs->cs;
             continue;
         case 0x3e: /* DS override */
-            data_sel = read_sreg(regs, ds);
+            data_sel = read_segment_register(v, regs, ds);
             continue;
         case 0x26: /* ES override */
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             continue;
         case 0x64: /* FS override */
-            data_sel = read_sreg(regs, fs);
+            data_sel = read_segment_register(v, regs, fs);
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
-            data_sel = read_sreg(regs, gs);
+            data_sel = read_segment_register(v, regs, gs);
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
@@ -1944,7 +1942,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
 
         if ( !(opcode & 2) )
         {
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             lm_ovr = lm_seg_none;
         }
 
@@ -2677,22 +2675,22 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
             ASSERT(opnd_sel);
             continue;
         case 0x3e: /* DS override */
-            opnd_sel = read_sreg(regs, ds);
+            opnd_sel = read_segment_register(v, regs, ds);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x26: /* ES override */
-            opnd_sel = read_sreg(regs, es);
+            opnd_sel = read_segment_register(v, regs, es);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x64: /* FS override */
-            opnd_sel = read_sreg(regs, fs);
+            opnd_sel = read_segment_register(v, regs, fs);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
         case 0x65: /* GS override */
-            opnd_sel = read_sreg(regs, gs);
+            opnd_sel = read_segment_register(v, regs, gs);
             if ( !opnd_sel )
                 opnd_sel = dpl;
             continue;
@@ -2745,7 +2743,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                             switch ( modrm & 7 )
                             {
                             default:
-                                opnd_sel = read_sreg(regs, ds);
+                                opnd_sel = read_segment_register(v, regs, ds);
                                 break;
                             case 4: case 5:
                                 opnd_sel = regs->ss;
@@ -2773,7 +2771,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                             break;
                         }
                         if ( !opnd_sel )
-                            opnd_sel = read_sreg(regs, ds);
+                            opnd_sel = read_segment_register(v, regs, ds);
                         switch ( modrm & 7 )
                         {
                         case 0: case 2: case 4:
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index eec919a..d2f7209 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -122,10 +122,10 @@ void show_registers(struct cpu_user_regs *regs)
         fault_crs[0] = read_cr0();
         fault_crs[3] = read_cr3();
         fault_crs[4] = read_cr4();
-        fault_regs.ds = read_segment_register(ds);
-        fault_regs.es = read_segment_register(es);
-        fault_regs.fs = read_segment_register(fs);
-        fault_regs.gs = read_segment_register(gs);
+        fault_regs.ds = read_segment_register(v, regs, ds);
+        fault_regs.es = read_segment_register(v, regs, es);
+        fault_regs.fs = read_segment_register(v, regs, fs);
+        fault_regs.gs = read_segment_register(v, regs, gs);
     }
 
     print_xen_info();
@@ -240,10 +240,10 @@ void do_double_fault(struct cpu_user_regs *regs)
     crs[2] = read_cr2();
     crs[3] = read_cr3();
     crs[4] = read_cr4();
-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(current, regs, ds);
+    regs->es = read_segment_register(current, regs, es);
+    regs->fs = read_segment_register(current, regs, fs);
+    regs->gs = read_segment_register(current, regs, gs);
 
     printk("CPU:    %d\n", cpu);
     _show_registers(regs, crs, CTXT_hypervisor, NULL);
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index b0876d6..d8dc6f2 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -4,7 +4,7 @@
 #include <xen/lib.h>
 #include <asm/bitops.h>
 
-#define read_segment_register(name)                             \
+#define read_segment_register(vcpu, regs, name)                 \
 ({  u16 __sel;                                                  \
     asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
     __sel;                                                      \
diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h
index 82cbcee..202e3be 100644
--- a/xen/include/asm-x86/traps.h
+++ b/xen/include/asm-x86/traps.h
@@ -49,4 +49,5 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid,
 extern int send_guest_trap(struct domain *d, uint16_t vcpuid,
 				unsigned int trap_nr);
 
+int emulate_privileged_op(struct cpu_user_regs *regs);
 #endif /* ASM_TRAP_H */
-- 
1.7.2.3


* [PATCH 05/17] PVH xen: vmx related preparatory changes for PVH
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

This is another preparatory patch for PVH. In this patch, the following
functions are made non-static:
    vmx_fpu_enter(), get_instruction_length(), update_guest_eip(),
    vmx_dr_access(), and pv_cpuid().

There is no functionality change.
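
For a sense of why these become visible outside their files, a
hypothetical later caller (the function and its context are assumptions,
not part of this patch):

    /* e.g. a PVH VMEXIT handler emulating CPUID for a PV-style guest */
    static void pvh_cpuid_vmexit(struct cpu_user_regs *regs)
    {
        pv_cpuid(regs);           /* now callable from outside traps.c  */
        vmx_update_guest_eip();   /* safe: CPUID exits set instr length */
    }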

Changes in V2:
  - prepend vmx_ to get_instruction_length and update_guest_eip.
  - Do not export/use vmr().

Changes in V3:
  - Do not change emulate_forced_invalid_op() in this patch.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/vmx/vmx.c         |   72 +++++++++++++++---------------------
 xen/arch/x86/hvm/vmx/vvmx.c        |    2 +-
 xen/arch/x86/traps.c               |    2 +-
 xen/include/asm-x86/hvm/vmx/vmcs.h |    1 +
 xen/include/asm-x86/hvm/vmx/vmx.h  |   15 +++++++-
 xen/include/asm-x86/processor.h    |    1 +
 6 files changed, 48 insertions(+), 45 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e36dbcb..59336b9 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -574,7 +574,7 @@ static int vmx_load_vmcs_ctxt(struct vcpu *v, struct hvm_hw_cpu *ctxt)
     return 0;
 }
 
-static void vmx_fpu_enter(struct vcpu *v)
+void vmx_fpu_enter(struct vcpu *v)
 {
     vcpu_restore_fpu_lazy(v);
     v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
@@ -1527,24 +1527,12 @@ struct hvm_function_table * __init start_vmx(void)
     return &vmx_function_table;
 }
 
-/*
- * Not all cases receive valid value in the VM-exit instruction length field.
- * Callers must know what they're doing!
- */
-static int get_instruction_length(void)
-{
-    int len;
-    len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */
-    BUG_ON((len < 1) || (len > 15));
-    return len;
-}
-
-void update_guest_eip(void)
+void vmx_update_guest_eip(void)
 {
     struct cpu_user_regs *regs = guest_cpu_user_regs();
     unsigned long x;
 
-    regs->eip += get_instruction_length(); /* Safe: callers audited */
+    regs->eip += vmx_get_instruction_length(); /* Safe: callers audited */
     regs->eflags &= ~X86_EFLAGS_RF;
 
     x = __vmread(GUEST_INTERRUPTIBILITY_INFO);
@@ -1617,8 +1605,8 @@ static void vmx_do_cpuid(struct cpu_user_regs *regs)
     regs->edx = edx;
 }
 
-static void vmx_dr_access(unsigned long exit_qualification,
-                          struct cpu_user_regs *regs)
+void vmx_dr_access(unsigned long exit_qualification,
+                   struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;
 
@@ -2222,7 +2210,7 @@ static int vmx_handle_eoi_write(void)
     if ( (((exit_qualification >> 12) & 0xf) == 1) &&
          ((exit_qualification & 0xfff) == APIC_EOI) )
     {
-        update_guest_eip(); /* Safe: APIC data write */
+        vmx_update_guest_eip(); /* Safe: APIC data write */
         vlapic_EOI_set(vcpu_vlapic(current));
         HVMTRACE_0D(VLAPIC);
         return 1;
@@ -2435,7 +2423,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
             HVMTRACE_1D(TRAP, vector);
             if ( v->domain->debugger_attached )
             {
-                update_guest_eip(); /* Safe: INT3 */            
+                vmx_update_guest_eip(); /* Safe: INT3 */
                 current->arch.gdbsx_vcpu_event = TRAP_int3;
                 domain_pause_for_debugger();
                 break;
@@ -2543,7 +2531,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
          */
         inst_len = ((source != 3) ||        /* CALL, IRET, or JMP? */
                     (idtv_info & (1u<<10))) /* IntrType > 3? */
-            ? get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0;
+            ? vmx_get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0;
         if ( (source == 3) && (idtv_info & INTR_INFO_DELIVER_CODE_MASK) )
             ecode = __vmread(IDT_VECTORING_ERROR_CODE);
         regs->eip += inst_len;
@@ -2551,15 +2539,15 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         break;
     }
     case EXIT_REASON_CPUID:
-        update_guest_eip(); /* Safe: CPUID */
+        vmx_update_guest_eip(); /* Safe: CPUID */
         vmx_do_cpuid(regs);
         break;
     case EXIT_REASON_HLT:
-        update_guest_eip(); /* Safe: HLT */
+        vmx_update_guest_eip(); /* Safe: HLT */
         hvm_hlt(regs->eflags);
         break;
     case EXIT_REASON_INVLPG:
-        update_guest_eip(); /* Safe: INVLPG */
+        vmx_update_guest_eip(); /* Safe: INVLPG */
         exit_qualification = __vmread(EXIT_QUALIFICATION);
         vmx_invlpg_intercept(exit_qualification);
         break;
@@ -2567,7 +2555,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         regs->ecx = hvm_msr_tsc_aux(v);
         /* fall through */
     case EXIT_REASON_RDTSC:
-        update_guest_eip(); /* Safe: RDTSC, RDTSCP */
+        vmx_update_guest_eip(); /* Safe: RDTSC, RDTSCP */
         hvm_rdtsc_intercept(regs);
         break;
     case EXIT_REASON_VMCALL:
@@ -2577,7 +2565,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         rc = hvm_do_hypercall(regs);
         if ( rc != HVM_HCALL_preempted )
         {
-            update_guest_eip(); /* Safe: VMCALL */
+            vmx_update_guest_eip(); /* Safe: VMCALL */
             if ( rc == HVM_HCALL_invalidate )
                 send_invalidate_req();
         }
@@ -2587,7 +2575,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     {
         exit_qualification = __vmread(EXIT_QUALIFICATION);
         if ( vmx_cr_access(exit_qualification) == X86EMUL_OKAY )
-            update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */
+            vmx_update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */
         break;
     }
     case EXIT_REASON_DR_ACCESS:
@@ -2601,7 +2589,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         {
             regs->eax = (uint32_t)msr_content;
             regs->edx = (uint32_t)(msr_content >> 32);
-            update_guest_eip(); /* Safe: RDMSR */
+            vmx_update_guest_eip(); /* Safe: RDMSR */
         }
         break;
     }
@@ -2610,63 +2598,63 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         uint64_t msr_content;
         msr_content = ((uint64_t)regs->edx << 32) | (uint32_t)regs->eax;
         if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY )
-            update_guest_eip(); /* Safe: WRMSR */
+            vmx_update_guest_eip(); /* Safe: WRMSR */
         break;
     }
 
     case EXIT_REASON_VMXOFF:
         if ( nvmx_handle_vmxoff(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMXON:
         if ( nvmx_handle_vmxon(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMCLEAR:
         if ( nvmx_handle_vmclear(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
  
     case EXIT_REASON_VMPTRLD:
         if ( nvmx_handle_vmptrld(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMPTRST:
         if ( nvmx_handle_vmptrst(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMREAD:
         if ( nvmx_handle_vmread(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
  
     case EXIT_REASON_VMWRITE:
         if ( nvmx_handle_vmwrite(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMLAUNCH:
         if ( nvmx_handle_vmlaunch(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_VMRESUME:
         if ( nvmx_handle_vmresume(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_INVEPT:
         if ( nvmx_handle_invept(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_INVVPID:
         if ( nvmx_handle_invvpid(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;
 
     case EXIT_REASON_MWAIT_INSTRUCTION:
@@ -2714,14 +2702,14 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
             int bytes = (exit_qualification & 0x07) + 1;
             int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE;
             if ( handle_pio(port, bytes, dir) )
-                update_guest_eip(); /* Safe: IN, OUT */
+                vmx_update_guest_eip(); /* Safe: IN, OUT */
         }
         break;
 
     case EXIT_REASON_INVD:
     case EXIT_REASON_WBINVD:
     {
-        update_guest_eip(); /* Safe: INVD, WBINVD */
+        vmx_update_guest_eip(); /* Safe: INVD, WBINVD */
         vmx_wbinvd_intercept();
         break;
     }
@@ -2754,7 +2742,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     {
         u64 new_bv = (((u64)regs->edx) << 32) | regs->eax;
         if ( hvm_handle_xsetbv(new_bv) == 0 )
-            update_guest_eip(); /* Safe: XSETBV */
+            vmx_update_guest_eip(); /* Safe: XSETBV */
         break;
     }
 
diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index bb7688f..225de9f 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -2136,7 +2136,7 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
             tsc += __get_vvmcs(nvcpu->nv_vvmcx, TSC_OFFSET);
             regs->eax = (uint32_t)tsc;
             regs->edx = (uint32_t)(tsc >> 32);
-            update_guest_eip();
+            vmx_update_guest_eip();
 
             return 1;
         }
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 8b6350e..dbea755 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -728,7 +728,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx,
     return 1;
 }
 
-static void pv_cpuid(struct cpu_user_regs *regs)
+void pv_cpuid(struct cpu_user_regs *regs)
 {
     uint32_t a, b, c, d;
 
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 37e6734..11b09ef 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -461,6 +461,7 @@ void vmx_vmcs_switch(struct vmcs_struct *from, struct vmcs_struct *to);
 void vmx_set_eoi_exit_bitmap(struct vcpu *v, u8 vector);
 void vmx_clear_eoi_exit_bitmap(struct vcpu *v, u8 vector);
 int vmx_check_msr_bitmap(unsigned long *msr_bitmap, u32 msr, int access_type);
+void vmx_fpu_enter(struct vcpu *v);
 void virtual_vmcs_enter(void *vvmcs);
 void virtual_vmcs_exit(void *vvmcs);
 u64 virtual_vmcs_vmread(void *vvmcs, u32 vmcs_encoding);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index d4d6feb..4c97d50 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -420,6 +420,18 @@ static inline int __vmxon(u64 addr)
     return rc;
 }
 
+/*
+ * Not all cases receive valid value in the VM-exit instruction length field.
+ * Callers must know what they're doing!
+ */
+static inline int vmx_get_instruction_length(void)
+{
+    int len;
+    len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */
+    BUG_ON((len < 1) || (len > 15));
+    return len;
+}
+
 void vmx_get_segment_register(struct vcpu *, enum x86_segment,
                               struct segment_register *);
 void vmx_inject_extint(int trap);
@@ -431,7 +443,8 @@ void ept_p2m_uninit(struct p2m_domain *p2m);
 void ept_walk_table(struct domain *d, unsigned long gfn);
 void setup_ept_dump(void);
 
-void update_guest_eip(void);
+void vmx_update_guest_eip(void);
+void vmx_dr_access(unsigned long exit_qualification,struct cpu_user_regs *regs);
 
 int alloc_p2m_hap_data(struct p2m_domain *p2m);
 void free_p2m_hap_data(struct p2m_domain *p2m);
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 5cdacc7..8c70324 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -566,6 +566,7 @@ void microcode_set_module(unsigned int);
 int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
 int microcode_resume_cpu(int cpu);
 
+void pv_cpuid(struct cpu_user_regs *regs);
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_X86_PROCESSOR_H */
-- 
1.7.2.3


* [PATCH 06/17] PVH xen: Introduce PVH guest type
From: Mukesh Rathor @ 2013-04-23 21:25 UTC
  To: Xen-devel

This patch introduces the concept of a PVH guest. It also makes other
basic changes: macros to check for a PVH vcpu/domain, new macros to test
whether a domain/vcpu is PV/PVH/HVM, and the guest copy macros are
extended to cover PVH. Lastly, PVH is made to use HVM-style event
delivery.
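
As a sketch of how the new type plumbs through domain creation (this
patch only defines the flag and the enum; the DOMCRF_pvh wiring shown
here is an assumption about a follow-on change):

    /* In domain_create(), using the new enum guest_type: */
    if ( domcr_flags & DOMCRF_hvm )
        d->guest_type = is_hvm;
    else if ( domcr_flags & DOMCRF_pvh )
        d->guest_type = is_pvh;
    /* otherwise the field stays is_pv, the zero-valued enum entry */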

Changes in V2:
  - make is_pvh/is_hvm an enum instead of adding is_pvh as a new flag.
  - fix indentation and spacing in the guest_kernel_mode macro.
  - add a debug-only BUG() in the GUEST_KERNEL_RPL macro as it should no
    longer be called in any PVH paths.

Changes in V3:
  - Rename enum fields, and add is_pv to it.
  - Get rid of the is_hvm_or_pvh_* macros.

Changes in V4:
  - Move e820 fields out of the pv_domain struct.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/debug.c               |    2 +-
 xen/arch/x86/domain.c              |   11 +++++++++--
 xen/arch/x86/mm.c                  |   26 +++++++++++++-------------
 xen/common/domain.c                |    2 +-
 xen/include/asm-x86/desc.h         |    5 +++++
 xen/include/asm-x86/domain.h       |   12 ++++++------
 xen/include/asm-x86/event.h        |    2 +-
 xen/include/asm-x86/guest_access.h |   12 ++++++------
 xen/include/asm-x86/x86_64/regs.h  |    9 +++++----
 xen/include/xen/sched.h            |   21 ++++++++++++++++++---
 10 files changed, 65 insertions(+), 37 deletions(-)

diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index e67473e..167421d 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp,
 
         pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len);
 
-        mfn = (dp->is_hvm
+        mfn = (!is_pv_domain(dp)
                ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn)
                : dbg_pv_va2mfn(addr, dp, pgd3));
         if ( mfn == INVALID_MFN ) 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 219d96c..b0fa339 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -571,7 +571,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
 
-        spin_lock_init(&d->arch.pv_domain.e820_lock);
+        spin_lock_init(&d->arch.e820_lock);
     }
 
     /* initialize default tsc behavior in case tools don't */
@@ -597,7 +597,7 @@ void arch_domain_destroy(struct domain *d)
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
     else
-        xfree(d->arch.pv_domain.e820);
+        xfree(d->arch.e820);
 
     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
@@ -650,6 +650,13 @@ int arch_set_info_guest(
     unsigned int i;
     int rc = 0, compat;
 
+    /* This will be removed when all patches are checked in and PVH is done. */
+    if ( is_pvh_vcpu(v) )
+    {
+        printk("PVH: You don't have the correct xen version for PVH\n");
+        return -EINVAL;
+    }
+
     /* The context is a compat-mode one if the target domain is compat-mode;
      * we expect the tools to DTRT even in compat-mode callers. */
     compat = is_pv_32on64_domain(d);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1c8442f..6a3d50a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4583,11 +4583,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -EFAULT;
         }
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
-        xfree(d->arch.pv_domain.e820);
-        d->arch.pv_domain.e820 = e820;
-        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
+        xfree(d->arch.e820);
+        d->arch.e820 = e820;
+        d->arch.nr_e820 = fmap.map.nr_entries;
+        spin_unlock(&d->arch.e820_lock);
 
         rcu_unlock_domain(d);
         return rc;
@@ -4601,26 +4601,26 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&map, arg, 1) )
             return -EFAULT;
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
 
         /* Backwards compatibility. */
-        if ( (d->arch.pv_domain.nr_e820 == 0) ||
-             (d->arch.pv_domain.e820 == NULL) )
+        if ( (d->arch.nr_e820 == 0) ||
+             (d->arch.e820 == NULL) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -ENOSYS;
         }
 
-        map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820);
-        if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820,
+        map.nr_entries = min(map.nr_entries, d->arch.nr_e820);
+        if ( copy_to_guest(map.buffer, d->arch.e820,
                            map.nr_entries) ||
              __copy_to_guest(arg, &map, 1) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -EFAULT;
         }
 
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_unlock(&d->arch.e820_lock);
         return 0;
     }
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index ce45d66..9b8368c 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -234,7 +234,7 @@ struct domain *domain_create(
         goto fail;
 
     if ( domcr_flags & DOMCRF_hvm )
-        d->is_hvm = 1;
+        d->guest_type = is_hvm;
 
     if ( domid == 0 )
     {
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 354b889..4dca0a3 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -38,7 +38,12 @@
 
 #ifndef __ASSEMBLY__
 
+#ifndef NDEBUG
+#define GUEST_KERNEL_RPL(d) (is_pvh_domain(d) ? ({ BUG(); 0; }) :    \
+                                                is_pv_32bit_domain(d) ? 1 : 3)
+#else
 #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
+#endif
 
 /* Fix up the RPL of a guest segment selector. */
 #define __fixup_guest_selector(d, sel)                             \
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index bdaf714..5ace2bb 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -16,7 +16,7 @@
 #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d))
 #define is_pv_32on64_vcpu(v)   (is_pv_32on64_domain((v)->domain))
 
-#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \
+#define is_hvm_pv_evtchn_domain(d) (!is_pv_domain(d) && \
         d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector)
 #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain))
 
@@ -234,11 +234,6 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
-
-    /* Pseudophysical e820 map (XENMEM_memory_map).  */
-    spinlock_t e820_lock;
-    struct e820entry *e820;
-    unsigned int nr_e820;
 };
 
 struct arch_domain
@@ -313,6 +308,11 @@ struct arch_domain
                                 (possibly other cases in the future */
     uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */
     uint64_t vtsc_usercount; /* not used for hvm */
+
+    /* Pseudophysical e820 map (XENMEM_memory_map).  PV and PVH. */
+    spinlock_t e820_lock;
+    struct e820entry *e820;
+    unsigned int nr_e820;
 } __cacheline_aligned;
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index 06057c7..7ed5812 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -18,7 +18,7 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
-    return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) :
+    return (!is_pv_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
 }
diff --git a/xen/include/asm-x86/guest_access.h b/xen/include/asm-x86/guest_access.h
index ca700c9..675dda1 100644
--- a/xen/include/asm-x86/guest_access.h
+++ b/xen/include/asm-x86/guest_access.h
@@ -14,27 +14,27 @@
 
 /* Raw access functions: no type checking. */
 #define raw_copy_to_guest(dst, src, len)        \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      copy_to_user((dst), (src), (len)))
 #define raw_copy_from_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      copy_from_user((dst), (src), (len)))
 #define raw_clear_guest(dst,  len)              \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
 #define __raw_copy_to_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      __copy_to_user((dst), (src), (len)))
 #define __raw_copy_from_guest(dst, src, len)    \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      __copy_from_user((dst), (src), (len)))
 #define __raw_clear_guest(dst,  len)            \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
 
diff --git a/xen/include/asm-x86/x86_64/regs.h b/xen/include/asm-x86/x86_64/regs.h
index 3cdc702..bb475cf 100644
--- a/xen/include/asm-x86/x86_64/regs.h
+++ b/xen/include/asm-x86/x86_64/regs.h
@@ -10,10 +10,11 @@
 #define ring_2(r)    (((r)->cs & 3) == 2)
 #define ring_3(r)    (((r)->cs & 3) == 3)
 
-#define guest_kernel_mode(v, r)                                 \
-    (!is_pv_32bit_vcpu(v) ?                                     \
-     (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :        \
-     (ring_1(r)))
+#define guest_kernel_mode(v, r)                                   \
+    (is_pvh_vcpu(v) ? ({ ASSERT(v == current); ring_0(r); }) :    \
+     (!is_pv_32bit_vcpu(v) ?                                      \
+      (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :         \
+      (ring_1(r))))
 
 #define permit_softint(dpl, v, r) \
     ((dpl) >= (guest_kernel_mode(v, r) ? 1 : 3))
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ad971d2..5194c94 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -235,6 +235,14 @@ struct mem_event_per_domain
     struct mem_event_domain access;
 };
 
+/*
+ * PVH is a PV guest running in an HVM container. While is_hvm is false
+ * for it, it uses many of the HVM data structs.
+ */
+enum guest_type {
+    is_pv, is_pvh, is_hvm
+};
+
 struct domain
 {
     domid_t          domain_id;
@@ -282,8 +290,8 @@ struct domain
     struct rangeset *iomem_caps;
     struct rangeset *irq_caps;
 
-    /* Is this an HVM guest? */
-    bool_t           is_hvm;
+    enum guest_type guest_type;
+
 #ifdef HAS_PASSTHROUGH
     /* Does this guest need iommu mappings? */
     bool_t           need_iommu;
@@ -461,6 +469,9 @@ struct domain *domain_create(
  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
 #define _DOMCRF_oos_off         4
 #define DOMCRF_oos_off          (1U<<_DOMCRF_oos_off)
+ /* DOMCRF_pvh: Create PV domain in HVM container */
+#define _DOMCRF_pvh            5
+#define DOMCRF_pvh             (1U<<_DOMCRF_pvh)
 
 /*
  * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().
@@ -731,8 +742,12 @@ void watchdog_domain_destroy(struct domain *d);
 
 #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist))
 
-#define is_hvm_domain(d) ((d)->is_hvm)
+#define is_pv_domain(d) ((d)->guest_type == is_pv)
+#define is_pv_vcpu(v)   (is_pv_domain(v->domain))
+#define is_hvm_domain(d) ((d)->guest_type == is_hvm)
 #define is_hvm_vcpu(v)   (is_hvm_domain(v->domain))
+#define is_pvh_domain(d) ((d)->guest_type == is_pvh)
+#define is_pvh_vcpu(v)   (is_pvh_domain(v->domain))
 #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
                            cpumask_weight((v)->cpu_affinity) == 1)
 #ifdef HAS_PASSTHROUGH
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 07/17] PVH xen: tools changes to create PVH domain
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (5 preceding siblings ...)
  2013-04-23 21:25 ` [PATCH 06/17] PVH xen: Introduce PVH guest type Mukesh Rathor
@ 2013-04-23 21:25 ` Mukesh Rathor
  2013-04-24  7:10   ` Jan Beulich
  2013-04-23 21:25 ` [PATCH 08/17] PVH xen: domain creation code changes Mukesh Rathor
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:25 UTC (permalink / raw)
  To: Xen-devel

This patch contains tools changes for PVH. For now, only one mode is
supported/tested:
    dom0> losetup /dev/loop1 guest.img
    dom0> In vm.cfg file: disk = ['phy:/dev/loop1,xvda,w']
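
For illustration, a complete minimal vm.cfg might look like the sketch
below; the kernel path and sizing are placeholders, and note that the xl
change in this patch requires hap=1 whenever pvh=1:

    kernel = "/boot/vmlinuz-pvh"     # placeholder kernel path
    memory = 1024
    vcpus  = 2
    pvh    = 1
    hap    = 1     # required: xl rejects pvh=1 without hap
    disk   = ['phy:/dev/loop1,xvda,w']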

Changes in V2: None
Changes in V3:
  - Document pvh boolean flag in xl.cfg.pod.5
  - Rename ci_pvh and bi_pvh to pvh, and domcr_is_pvh to pvh_enabled.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 docs/man/xl.cfg.pod.5             |    3 +++
 tools/debugger/gdbsx/xg/xg_main.c |    4 +++-
 tools/libxc/xc_dom.h              |    1 +
 tools/libxc/xc_dom_x86.c          |    7 ++++---
 tools/libxl/libxl_create.c        |    2 ++
 tools/libxl/libxl_dom.c           |   18 +++++++++++++++++-
 tools/libxl/libxl_types.idl       |    2 ++
 tools/libxl/libxl_x86.c           |    4 +++-
 tools/libxl/xl_cmdimpl.c          |   11 +++++++++++
 tools/xenstore/xenstored_domain.c |   12 +++++++-----
 xen/include/public/domctl.h       |    3 +++
 11 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index f8b4576..17c5679 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -620,6 +620,9 @@ if your particular guest kernel does not require this behaviour then
 it is safe to allow this to be enabled but you may wish to disable it
 anyway.
 
+=item B<pvh=BOOLEAN>
+Selects whether to run this PV guest in an HVM container (PVH mode). Default is 0.
+
 =back
 
 =head2 Fully-virtualised (HVM) Guest Specific Options
diff --git a/tools/debugger/gdbsx/xg/xg_main.c b/tools/debugger/gdbsx/xg/xg_main.c
index 64c7484..5736b86 100644
--- a/tools/debugger/gdbsx/xg/xg_main.c
+++ b/tools/debugger/gdbsx/xg/xg_main.c
@@ -81,6 +81,7 @@ int xgtrc_on = 0;
 struct xen_domctl domctl;         /* just use a global domctl */
 
 static int     _hvm_guest;        /* hvm guest? 32bit HVMs have 64bit context */
+static int     _pvh_guest;        /* PV guest in HVM container */
 static domid_t _dom_id;           /* guest domid */
 static int     _max_vcpu_id;      /* thus max_vcpu_id+1 VCPUs */
 static int     _dom0_fd;          /* fd of /dev/privcmd */
@@ -309,6 +310,7 @@ xg_attach(int domid, int guest_bitness)
 
     _max_vcpu_id = domctl.u.getdomaininfo.max_vcpu_id;
     _hvm_guest = (domctl.u.getdomaininfo.flags & XEN_DOMINF_hvm_guest);
+    _pvh_guest = (domctl.u.getdomaininfo.flags & XEN_DOMINF_pvh_guest);
     return _max_vcpu_id;
 }
 
@@ -369,7 +371,7 @@ _change_TF(vcpuid_t which_vcpu, int guest_bitness, int setit)
     int sz = sizeof(anyc);
 
     /* first try the MTF for hvm guest. otherwise do manually */
-    if (_hvm_guest) {
+    if (_hvm_guest || _pvh_guest) {
         domctl.u.debug_op.vcpu = which_vcpu;
         domctl.u.debug_op.op = setit ? XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_ON :
                                        XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_OFF;
diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index ac36600..8b43d2b 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -130,6 +130,7 @@ struct xc_dom_image {
     domid_t console_domid;
     domid_t xenstore_domid;
     xen_pfn_t shared_info_mfn;
+    int pvh_enabled;
 
     xc_interface *xch;
     domid_t guest_domid;
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index d89526d..3145da4 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -355,7 +355,8 @@ static int setup_pgtables_x86_64(struct xc_dom_image *dom)
         pgpfn = (addr - dom->parms.virt_base) >> PAGE_SHIFT_X86;
         l1tab[l1off] =
             pfn_to_paddr(xc_dom_p2m_guest(dom, pgpfn)) | L1_PROT;
-        if ( (addr >= dom->pgtables_seg.vstart) && 
+        if ( (!dom->pvh_enabled)                &&
+             (addr >= dom->pgtables_seg.vstart) &&
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
         if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
@@ -672,7 +673,7 @@ int arch_setup_meminit(struct xc_dom_image *dom)
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
         return rc;
-    if ( xc_dom_feature_translated(dom) )
+    if ( xc_dom_feature_translated(dom) && !dom->pvh_enabled )
     {
         dom->shadow_enabled = 1;
         rc = x86_shadow(dom->xch, dom->guest_domid);
@@ -798,7 +799,7 @@ int arch_setup_bootlate(struct xc_dom_image *dom)
         }
 
         /* Map grant table frames into guest physmap. */
-        for ( i = 0; ; i++ )
+        for ( i = 0; !dom->pvh_enabled; i++ )
         {
             rc = xc_domain_add_to_physmap(dom->xch, dom->guest_domid,
                                           XENMAPSPACE_grant_table,
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 19a56c0..7c8733c 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -421,6 +421,8 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_create_info *info,
         flags |= XEN_DOMCTL_CDF_hvm_guest;
         flags |= libxl_defbool_val(info->hap) ? XEN_DOMCTL_CDF_hap : 0;
         flags |= libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off;
+    } else if ( libxl_defbool_val(info->pvh) ) {
+        flags |= XEN_DOMCTL_CDF_hap;
     }
     *domid = -1;
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index b38d0a7..cefbf76 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -329,9 +329,23 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     struct xc_dom_image *dom;
     int ret;
     int flags = 0;
+    int is_pvh = libxl_defbool_val(info->pvh);
 
     xc_dom_loginit(ctx->xch);
 
+    if (is_pvh) {
+        char *pv_feats = "writable_descriptor_tables|auto_translated_physmap"
+                         "|supervisor_mode_kernel|hvm_callback_vector";
+
+        if (info->u.pv.features && info->u.pv.features[0] != '\0')
+        {
+            LOG(ERROR, "Didn't expect info->u.pv.features to contain string\n");
+            LOG(ERROR, "String: %s\n", info->u.pv.features);
+            return ERROR_FAIL;
+        }
+        info->u.pv.features = strdup(pv_feats);
+    }
+
     dom = xc_dom_allocate(ctx->xch, state->pv_cmdline, info->u.pv.features);
     if (!dom) {
         LOGE(ERROR, "xc_dom_allocate failed");
@@ -370,6 +384,7 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     }
 
     dom->flags = flags;
+    dom->pvh_enabled = is_pvh;
     dom->console_evtchn = state->console_port;
     dom->console_domid = state->console_domid;
     dom->xenstore_evtchn = state->store_port;
@@ -400,7 +415,8 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
         LOGE(ERROR, "xc_dom_boot_image failed");
         goto out;
     }
-    if ( (ret = xc_dom_gnttab_init(dom)) != 0 ) {
+    /* PVH sets up its own grant tables during boot via HVM mechanisms. */
+    if ( !is_pvh && (ret = xc_dom_gnttab_init(dom)) != 0 ) {
         LOGE(ERROR, "xc_dom_gnttab_init failed");
         goto out;
     }
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index ecf1f0b..2599e01 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -245,6 +245,7 @@ libxl_domain_create_info = Struct("domain_create_info",[
     ("platformdata", libxl_key_value_list),
     ("poolid",       uint32),
     ("run_hotplug_scripts",libxl_defbool),
+    ("pvh",          libxl_defbool),
     ], dir=DIR_IN)
 
 MemKB = UInt(64, init_val = "LIBXL_MEMKB_DEFAULT")
@@ -346,6 +347,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
                                       ])),
                  ("invalid", Struct(None, [])),
                  ], keyvar_init_val = "LIBXL_DOMAIN_TYPE_INVALID")),
+    ("pvh",       libxl_defbool),
     ], dir=DIR_IN
 )
 
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index a17f6ae..424bc68 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -290,7 +290,9 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
     if (rtc_timeoffset)
         xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);
 
-    if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM) {
+    if (d_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM ||
+        libxl_defbool_val(d_config->b_info.pvh)) {
+
         unsigned long shadow;
         shadow = (d_config->b_info.shadow_memkb + 1023) / 1024;
         xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index c1a969b..3ee7593 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -610,8 +610,18 @@ static void parse_config_data(const char *config_source,
         !strncmp(buf, "hvm", strlen(buf)))
         c_info->type = LIBXL_DOMAIN_TYPE_HVM;
 
+    libxl_defbool_setdefault(&c_info->pvh, false);
+    libxl_defbool_setdefault(&c_info->hap, false);
+    xlu_cfg_get_defbool(config, "pvh", &c_info->pvh, 0);
     xlu_cfg_get_defbool(config, "hap", &c_info->hap, 0);
 
+    if (libxl_defbool_val(c_info->pvh) &&
+        !libxl_defbool_val(c_info->hap)) {
+
+        fprintf(stderr, "hap is required for PVH domain\n");
+        exit(1);
+    }
+
     if (xlu_cfg_replace_string (config, "name", &c_info->name, 0)) {
         fprintf(stderr, "Domain name must be specified.\n");
         exit(1);
@@ -918,6 +928,7 @@ static void parse_config_data(const char *config_source,
 
         b_info->u.pv.cmdline = cmdline;
         xlu_cfg_replace_string (config, "ramdisk", &b_info->u.pv.ramdisk, 0);
+        libxl_defbool_set(&b_info->pvh, libxl_defbool_val(c_info->pvh));
         break;
     }
     default:
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index bf83d58..10c23a1 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -168,13 +168,15 @@ static int readchn(struct connection *conn, void *data, unsigned int len)
 static void *map_interface(domid_t domid, unsigned long mfn)
 {
 	if (*xcg_handle != NULL) {
-		/* this is the preferred method */
-		return xc_gnttab_map_grant_ref(*xcg_handle, domid,
+                void *addr;
+                /* this is the preferred method */
+                addr = xc_gnttab_map_grant_ref(*xcg_handle, domid,
 			GNTTAB_RESERVED_XENSTORE, PROT_READ|PROT_WRITE);
-	} else {
-		return xc_map_foreign_range(*xc_handle, domid,
-			getpagesize(), PROT_READ|PROT_WRITE, mfn);
+                if (addr)
+                        return addr;
 	}
+	return xc_map_foreign_range(*xc_handle, domid,
+		        getpagesize(), PROT_READ|PROT_WRITE, mfn);
 }
 
 static void unmap_interface(void *interface)
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 4c5b2bb..efd8907 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -89,6 +89,9 @@ struct xen_domctl_getdomaininfo {
  /* Being debugged.  */
 #define _XEN_DOMINF_debugged  6
 #define XEN_DOMINF_debugged   (1U<<_XEN_DOMINF_debugged)
+ /* domain is PVH */
+#define _XEN_DOMINF_pvh_guest 7
+#define XEN_DOMINF_pvh_guest   (1U<<_XEN_DOMINF_pvh_guest)
  /* XEN_DOMINF_shutdown guest-supplied code.  */
 #define XEN_DOMINF_shutdownmask 255
 #define XEN_DOMINF_shutdownshift 16
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 08/17]  PVH xen: domain creation code changes
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (6 preceding siblings ...)
  2013-04-23 21:25 ` [PATCH 07/17] PVH xen: tools changes to create PVH domain Mukesh Rathor
@ 2013-04-23 21:25 ` Mukesh Rathor
  2013-04-23 21:25 ` [PATCH 09/17] PVH xen: create PVH vmcs, and also initialization Mukesh Rathor
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:25 UTC (permalink / raw)
  To: Xen-devel

This patch contains changes to arch/x86/domain.c to allow for a PVH
domain.

Changes in V2:
  - changes to read_segment_register() moved to this patch.
  - The other review comment was to create NULL implementations of
    pvh_set_vcpu_info and pvh_read_descriptor, which are implemented in a
    later patch. Since PVH creation is disabled until the whole series is
    in, this is not strictly needed, but it helps break down the patches.

Changes in V3:
  - Fix read_segment_register() macro to make sure args are evaluated once,
    and use # instead of STR for name in the macro.

Changes in V4:
  - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been
    moved out of pv_vcpu struct.
  - rename hvm_pvh_* functions to hvm_*.
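
As a sketch of the new read_segment_register() contract (the surrounding
code is hypothetical, not part of the patch), callers now pass the vcpu
explicitly and the macro works for both PV and PVH:

    struct vcpu *v = current;
    struct cpu_user_regs *regs = guest_cpu_user_regs();
    u16 sel;

    /* PVH: taken from regs->ds, which is refreshed on each vmexit.
     * PV:  read live from the hardware segment register. */
    sel = read_segment_register(v, regs, ds);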

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c         |   66 ++++++++++++++++++++++++++--------------
 xen/arch/x86/mm.c             |    3 ++
 xen/arch/x86/mm/hap/hap.c     |    4 ++-
 xen/include/asm-x86/hvm/hvm.h |   18 +++++++++++
 xen/include/asm-x86/system.h  |   18 +++++++++--
 5 files changed, 81 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b0fa339..b1fd758 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -389,7 +389,7 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.vcpu_info_mfn = INVALID_MFN;
 
-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
         goto done;
@@ -456,7 +456,7 @@ void vcpu_destroy(struct vcpu *v)
 
     vcpu_destroy_fpu(v);
 
-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
         hvm_vcpu_destroy(v);
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
@@ -468,7 +468,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     int rc = -ENOMEM;
 
     d->arch.hvm_domain.hap_enabled =
-        is_hvm_domain(d) &&
+        !is_pv_domain(d) &&
         hvm_funcs.hap_supported &&
         (domcr_flags & DOMCRF_hap);
     d->arch.hvm_domain.mem_sharing_enabled = 0;
@@ -516,7 +516,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     mapcache_domain_init(d);
 
     HYPERVISOR_COMPAT_VIRT_START(d) =
-        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
+        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
 
     if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
         goto fail;
@@ -558,7 +558,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
             goto fail;
     }
 
-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
     {
         if ( (rc = hvm_domain_initialise(d)) != 0 )
         {
@@ -567,12 +567,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         }
     }
     else
-    {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
 
+    if ( !is_hvm_domain(d) )
         spin_lock_init(&d->arch.e820_lock);
-    }
 
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
@@ -594,9 +593,10 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
 
 void arch_domain_destroy(struct domain *d)
 {
-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
         hvm_domain_destroy(d);
-    else
+
+    if ( !is_hvm_domain(d) )
         xfree(d->arch.e820);
 
     free_domain_pirqs(d);
@@ -664,7 +664,7 @@ int arch_set_info_guest(
 #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
     flags = c(flags);
 
-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         if ( !compat )
         {
@@ -717,7 +717,7 @@ int arch_set_info_guest(
     v->fpu_initialised = !!(flags & VGCF_I387_VALID);
 
     v->arch.flags &= ~TF_kernel_mode;
-    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
+    if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ )
         v->arch.flags |= TF_kernel_mode;
 
     v->arch.vgc_flags = flags;
@@ -728,7 +728,7 @@ int arch_set_info_guest(
     if ( !compat )
     {
         memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -744,10 +744,13 @@ int arch_set_info_guest(
 
     v->arch.user_regs.eflags |= 2;
 
-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
     {
         hvm_set_info_guest(v);
-        goto out;
+        if ( is_hvm_vcpu(v) || v->is_initialised )
+            goto out;
+        else 
+            goto pvh_skip_pv_stuff;
     }
 
     init_int80_direct_trap(v);
@@ -756,7 +759,10 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.iopl = (v->arch.user_regs.eflags >> 12) & 3;
     v->arch.user_regs.eflags &= ~X86_EFLAGS_IOPL;
 
-    /* Ensure real hardware interrupts are enabled. */
+    /*
+     * Ensure real hardware interrupts are enabled. Note: PVH may not have
+     * IDT set on all vcpus so we don't enable IF for it yet.
+     */
     v->arch.user_regs.eflags |= X86_EFLAGS_IF;
 
     if ( !v->is_initialised )
@@ -853,6 +859,7 @@ int arch_set_info_guest(
     if ( rc != 0 )
         return rc;
 
+pvh_skip_pv_stuff:
     if ( !compat )
     {
         cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
@@ -871,8 +878,14 @@ int arch_set_info_guest(
             return -EINVAL;
         }
 
+        if ( is_pvh_vcpu(v) )
+        {
+            v->arch.cr3 = page_to_mfn(cr3_page);
+            v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
+        }
+
         v->arch.guest_table = pagetable_from_page(cr3_page);
-        if ( c.nat->ctrlreg[1] )
+        if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
         {
             cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
             cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
@@ -939,6 +952,13 @@ int arch_set_info_guest(
 
     update_cr3(v);
 
+    if ( is_pvh_vcpu(v) )
+    {
+        /* guest is bringing up non-boot SMP vcpu */
+        if ( (rc = hvm_set_vcpu_info(v, c.nat)) != 0 )
+            return rc;
+    }
+
  out:
     if ( flags & VGCF_online )
         clear_bit(_VPF_down, &v->pause_flags);
@@ -1444,7 +1464,7 @@ static void update_runstate_area(struct vcpu *v)
 
 static inline int need_full_gdt(struct vcpu *v)
 {
-    return (!is_hvm_vcpu(v) && !is_idle_vcpu(v));
+    return (is_pv_vcpu(v) && !is_idle_vcpu(v));
 }
 
 static void __context_switch(void)
@@ -1578,7 +1598,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         /* Re-enable interrupts before restoring state which may fault. */
         local_irq_enable();
 
-        if ( !is_hvm_vcpu(next) )
+        if ( is_pv_vcpu(next) )
         {
             load_LDT(next);
             load_segments(next);
@@ -1701,12 +1721,12 @@ unsigned long hypercall_create_continuation(
         regs->eax  = op;
 
         /* Ensure the hypercall trap instruction is re-executed. */
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip -= 2;  /* re-execute 'syscall' / 'int $xx' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 1;
 
-        if ( !is_hvm_vcpu(current) ?
+        if ( is_pv_vcpu(current) ?
              !is_pv_32on64_vcpu(current) :
              (hvm_guest_x86_mode(current) == 8) )
         {
@@ -2022,7 +2042,7 @@ int domain_relinquish_resources(struct domain *d)
         for_each_vcpu ( d, v )
             vcpu_destroy_pagetables(v);
 
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
             {
@@ -2097,7 +2117,7 @@ int domain_relinquish_resources(struct domain *d)
         BUG();
     }
 
-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
         hvm_domain_relinquish_resources(d);
 
     return 0;
@@ -2181,7 +2201,7 @@ void vcpu_mark_events_pending(struct vcpu *v)
     if ( already_pending )
         return;
 
-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
         hvm_assert_evtchn_irq(v);
     else
         vcpu_kick(v);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6a3d50a..d9bdded 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4080,6 +4080,9 @@ void destroy_gdt(struct vcpu *v)
     int i;
     unsigned long pfn;
 
+    if ( is_pvh_vcpu(v) )
+        return;
+
     v->arch.pv_vcpu.gdt_ents = 0;
     pl1e = gdt_ldt_ptes(v->domain, v);
     for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index bff05d9..5aa0852 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
 const struct paging_mode *
 hap_paging_get_mode(struct vcpu *v)
 {
-    return !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
+    /* PVH 32bitfixme */
+    return is_pvh_vcpu(v) ? &hap_paging_long_mode :
+        !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
         hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
         hvm_pae_enabled(v)       ? &hap_paging_pae_mode  :
                                    &hap_paging_protected_mode;
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 2fa2ea5..a790954 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -190,6 +190,11 @@ struct hvm_function_table {
                                 paddr_t *L1_gpa, unsigned int *page_order,
                                 uint8_t *p2m_acc, bool_t access_r,
                                 bool_t access_w, bool_t access_x);
+    /* PVH functions */
+    int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp);
+    int (*pvh_read_descriptor)(unsigned int sel, const struct vcpu *v,
+                         const struct cpu_user_regs *regs, unsigned long *base,
+                         unsigned long *limit, unsigned int *ar);
 };
 
 extern struct hvm_function_table hvm_funcs;
@@ -323,6 +328,19 @@ static inline unsigned long hvm_get_shadow_gs_base(struct vcpu *v)
     return hvm_funcs.get_shadow_gs_base(v);
 }
 
+static inline int hvm_set_vcpu_info(struct vcpu *v, 
+                                        struct vcpu_guest_context *ctxtp)
+{
+    return hvm_funcs.pvh_set_vcpu_info(v, ctxtp);
+}
+
+static inline int hvm_read_descriptor(unsigned int sel, 
+               const struct vcpu *v, const struct cpu_user_regs *regs, 
+               unsigned long *base, unsigned long *limit, unsigned int *ar)
+{
+    return hvm_funcs.pvh_read_descriptor(sel, v, regs, base, limit, ar);
+}
+
 #define is_viridian_domain(_d)                                             \
  (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN]))
 
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index d8dc6f2..7780c16 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -4,10 +4,20 @@
 #include <xen/lib.h>
 #include <asm/bitops.h>
 
-#define read_segment_register(vcpu, regs, name)                 \
-({  u16 __sel;                                                  \
-    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
-    __sel;                                                      \
+/*
+ * We need the vcpu arg: during a context switch from pure PV to PVH,
+ * save_segments() runs after current has been updated to next, which no
+ * longer points to the pure PV vcpu. PVH regs->selectors refresh each vmexit.
+ */
+#define read_segment_register(vcpu, regs, name)                   \
+({  u16 __sel;                                                    \
+    struct cpu_user_regs *_regs = (regs);                         \
+                                                                  \
+    if ( is_pvh_vcpu(vcpu) )                                      \
+        __sel = _regs->name;                                      \
+    else                                                          \
+        asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) );    \
+    __sel;                                                        \
 })
 
 #define wbinvd() \
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 09/17]  PVH xen: create PVH vmcs, and also initialization
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (7 preceding siblings ...)
  2013-04-23 21:25 ` [PATCH 08/17] PVH xen: domain creation code changes Mukesh Rathor
@ 2013-04-23 21:25 ` Mukesh Rathor
  2013-04-24  7:42   ` Jan Beulich
  2013-04-23 21:25 ` [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c Mukesh Rathor
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:25 UTC (permalink / raw)
  To: Xen-devel

This patch mainly contains the code to create a VMCS for a PVH guest, plus
the HVM-specific vcpu/domain creation code.

Changes in V2:
  - Avoid the call to hvm_do_resume() at the call site rather than returning
    early inside it.
  - Return early from vmx_do_resume() for PVH, before the Intel debugger
    handling.

Changes in V3:
  - Cleanup pvh_construct_vmcs().
  - Fix formatting in few places, adding XENLOG_G_ERR to printing.
  - Do not load the CS selector for PVH here, but try to do that in Linux.

Changes in V4:
  - Remove VM_ENTRY_LOAD_DEBUG_CTLS clearing.
  - Add 32bit kernel changes mark.
  - Verify pit_init call for PVH.
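
As an aid when reading pvh_construct_vmcs() below, the segment
access-rights values it writes decode as follows per the Intel SDM (this
note is an annotation, not part of the patch):

    /* VMCS AR format: bits 3:0 type, bit 4 S, bits 6:5 DPL, bit 7 P,
     * bit 12 AVL, bit 13 L, bit 14 D/B, bit 15 G.
     * 0xa09b: present, DPL0, accessed exec/read code, L=1, G=1 (64-bit CS)
     * 0xc093: present, DPL0, accessed read/write data, D/B=1, G=1
     * 0x008b: present, busy 32-bit TSS;  0x0082: LDT
     */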

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/hvm.c      |   88 ++++++++++++-
 xen/arch/x86/hvm/vmx/vmcs.c |  309 ++++++++++++++++++++++++++++++++++++++----
 xen/arch/x86/hvm/vmx/vmx.c  |   39 ++++++
 3 files changed, 403 insertions(+), 33 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 38e87ce..27dbe3d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -510,6 +510,31 @@ static int hvm_print_line(
     return X86EMUL_OKAY;
 }
 
+static int pvh_dom_initialise(struct domain *d)
+{
+    int rc;
+
+    if ( !d->arch.hvm_domain.hap_enabled )
+        return -EINVAL;
+
+    spin_lock_init(&d->arch.hvm_domain.irq_lock);
+    hvm_init_guest_time(d);
+
+    hvm_init_cacheattr_region_list(d);
+
+    if ( (rc = paging_enable(d, PG_refcounts|PG_translate|PG_external)) != 0 )
+        goto fail1;
+
+    if ( (rc = hvm_funcs.domain_initialise(d)) != 0 )
+        goto fail1;
+
+    return 0;
+
+fail1:
+    hvm_destroy_cacheattr_region_list(d);
+    return rc;
+}
+
 int hvm_domain_initialise(struct domain *d)
 {
     int rc;
@@ -520,6 +545,8 @@ int hvm_domain_initialise(struct domain *d)
                  "on a non-VT/AMDV platform.\n");
         return -EINVAL;
     }
+    if ( is_pvh_domain(d) )
+        return pvh_dom_initialise(d);
 
     spin_lock_init(&d->arch.hvm_domain.pbuf_lock);
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
@@ -584,6 +611,11 @@ int hvm_domain_initialise(struct domain *d)
 
 void hvm_domain_relinquish_resources(struct domain *d)
 {
+    if ( is_pvh_domain(d) )
+    {
+        pit_deinit(d);
+        return;
+    }
     if ( hvm_funcs.nhvm_domain_relinquish_resources )
         hvm_funcs.nhvm_domain_relinquish_resources(d);
 
@@ -609,10 +641,14 @@ void hvm_domain_relinquish_resources(struct domain *d)
 void hvm_domain_destroy(struct domain *d)
 {
     hvm_funcs.domain_destroy(d);
+    hvm_destroy_cacheattr_region_list(d);
+
+    if ( is_pvh_domain(d) )
+        return;
+
     rtc_deinit(d);
     stdvga_deinit(d);
     vioapic_deinit(d);
-    hvm_destroy_cacheattr_region_list(d);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
@@ -1066,14 +1102,43 @@ static int __init __hvm_register_CPU_XSAVE_save_and_restore(void)
 }
 __initcall(__hvm_register_CPU_XSAVE_save_and_restore);
 
+static int pvh_vcpu_initialise(struct vcpu *v)
+{
+    int rc;
+
+    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
+        return rc;
+
+    softirq_tasklet_init(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
+                         (void(*)(unsigned long))hvm_assert_evtchn_irq,
+                         (unsigned long)v);
+
+    v->arch.hvm_vcpu.hcall_64bit = 1;    /* PVH 32bitfixme */
+    v->arch.user_regs.eflags = 2;
+    v->arch.hvm_vcpu.inject_trap.vector = -1;
+
+    if ( (rc = hvm_vcpu_cacheattr_init(v)) != 0 )
+    {
+        hvm_funcs.vcpu_destroy(v);
+        return rc;
+    }
+    if ( v->vcpu_id == 0 )
+        pit_init(v, cpu_khz);
+
+    return 0;
+}
+
 int hvm_vcpu_initialise(struct vcpu *v)
 {
     int rc;
     struct domain *d = v->domain;
-    domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+    domid_t dm_domid;
 
     hvm_asid_flush_vcpu(v);
 
+    if ( is_pvh_vcpu(v) )
+        return pvh_vcpu_initialise(v);
+
     if ( (rc = vlapic_init(v)) != 0 )
         goto fail1;
 
@@ -1084,6 +1149,8 @@ int hvm_vcpu_initialise(struct vcpu *v)
          && (rc = nestedhvm_vcpu_initialise(v)) < 0 ) 
         goto fail3;
 
+    dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+
     /* Create ioreq event channel. */
     rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
     if ( rc < 0 )
@@ -1163,7 +1230,10 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
     hvm_vcpu_cacheattr_destroy(v);
-    vlapic_destroy(v);
+
+    if ( !is_pvh_vcpu(v) )
+        vlapic_destroy(v);
+
     hvm_funcs.vcpu_destroy(v);
 
     /* Event channel is already freed by evtchn_destroy(). */
@@ -4512,6 +4582,8 @@ static int hvm_memory_event_traps(long p, uint32_t reason,
 
 void hvm_memory_event_cr0(unsigned long value, unsigned long old) 
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR0],
                            MEM_EVENT_REASON_CR0,
@@ -4520,6 +4592,8 @@ void hvm_memory_event_cr0(unsigned long value, unsigned long old)
 
 void hvm_memory_event_cr3(unsigned long value, unsigned long old) 
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR3],
                            MEM_EVENT_REASON_CR3,
@@ -4528,6 +4602,8 @@ void hvm_memory_event_cr3(unsigned long value, unsigned long old)
 
 void hvm_memory_event_cr4(unsigned long value, unsigned long old) 
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR4],
                            MEM_EVENT_REASON_CR4,
@@ -4536,6 +4612,8 @@ void hvm_memory_event_cr4(unsigned long value, unsigned long old)
 
 void hvm_memory_event_msr(unsigned long msr, unsigned long value)
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_MSR],
                            MEM_EVENT_REASON_MSR,
@@ -4548,6 +4626,8 @@ int hvm_memory_event_int3(unsigned long gla)
     unsigned long gfn;
     gfn = paging_gva_to_gfn(current, gla, &pfec);
 
+    if ( is_pvh_vcpu(current) )
+        return 0;
     return hvm_memory_event_traps(current->domain->arch.hvm_domain
                                     .params[HVM_PARAM_MEMORY_EVENT_INT3],
                                   MEM_EVENT_REASON_INT3,
@@ -4560,6 +4640,8 @@ int hvm_memory_event_single_step(unsigned long gla)
     unsigned long gfn;
     gfn = paging_gva_to_gfn(current, gla, &pfec);
 
+    if ( is_pvh_vcpu(current) )
+        return 0;
     return hvm_memory_event_traps(current->domain->arch.hvm_domain
             .params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP],
             MEM_EVENT_REASON_SINGLESTEP,
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 9926ffb..e7b0c4b 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -624,7 +624,7 @@ void vmx_vmcs_exit(struct vcpu *v)
     {
         /* Don't confuse vmx_do_resume (for @v or @current!) */
         vmx_clear_vmcs(v);
-        if ( is_hvm_vcpu(current) )
+        if ( !is_pv_vcpu(current) )
             vmx_load_vmcs(current);
 
         spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
@@ -815,16 +815,283 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
     virtual_vmcs_exit(vvmcs);
 }
 
-static int construct_vmcs(struct vcpu *v)
+static void vmx_set_common_host_vmcs_fields(struct vcpu *v)
 {
-    struct domain *d = v->domain;
     uint16_t sysenter_cs;
     unsigned long sysenter_eip;
+
+    /* Host data selectors. */
+    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_FS_SELECTOR, 0);
+    __vmwrite(HOST_GS_SELECTOR, 0);
+    __vmwrite(HOST_FS_BASE, 0);
+    __vmwrite(HOST_GS_BASE, 0);
+
+    /* Host control registers. */
+    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+    __vmwrite(HOST_CR4,
+              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
+
+    /* Host CS:RIP. */
+    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
+    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
+
+    /* Host SYSENTER CS:RIP. */
+    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
+    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
+    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
+    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
+}
+
+static int pvh_check_requirements(struct vcpu *v)
+{
+    u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+
+    if ( !paging_mode_hap(v->domain) )
+    {
+        dprintk(XENLOG_G_ERR, "HAP is required for PVH guest.\n");
+        return -EINVAL;
+    }
+    if ( !cpu_has_vmx_pat )
+    {
+        dprintk(XENLOG_G_ERR, "PVH: CPU does not have PAT support\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_msr_bitmap )
+    {
+        dprintk(XENLOG_G_ERR, "PVH: CPU does not have msr bitmap\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_vpid )
+    {
+        dprintk(XENLOG_G_ERR, "PVH: CPU doesn't have VPID support\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_secondary_exec_control )
+    {
+        dprintk(XENLOG_G_ERR, "CPU Secondary exec is required to run PVH\n");
+        return -ENOSYS;
+    }
+
+    if ( v->domain->arch.vtsc )
+    {
+        dprintk(XENLOG_G_ERR,
+                "At present PVH only supports the default timer mode\n");
+        return -ENOSYS;
+    }
+
+    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
+    if ( (tmpval & required) != required )
+    {
+        dprintk(XENLOG_G_ERR, "PVH: required CR4 features not available:%lx\n",
+                required);
+        return -ENOSYS;
+    }
+
+    return 0;
+}
+
+static int pvh_construct_vmcs(struct vcpu *v)
+{
+    int rc, msr_type;
+    unsigned long *msr_bitmap;
+    struct domain *d = v->domain;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    struct ept_data *ept = &p2m->ept;
+    u32 vmexit_ctl = vmx_vmexit_control;
+    u32 vmentry_ctl = vmx_vmentry_control;
+    u64 host_pat, guest_pat, tmpval = -1;
+
+    if ( (rc = pvh_check_requirements(v)) )
+        return rc;
+
+    msr_bitmap = alloc_xenheap_page();
+    if ( msr_bitmap == NULL )
+        return -ENOMEM;
+
+    /* 1. Pin-Based Controls */
+    __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control);
+
+    v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
+
+    /* 2. Primary Processor-based controls */
+    /*
+     * If rdtsc exiting is turned on and it goes through emulate_privileged_op,
+     * then pv_vcpu.ctrlreg must be added to the pvh struct.
+     */
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
+
+    v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
+                                      CPU_BASED_CR3_LOAD_EXITING |
+                                      CPU_BASED_CR3_STORE_EXITING);
+    v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
+    v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_MSR_BITMAP;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+
+    /* 3. Secondary Processor-based controls. Intel SDM: all resvd bits are 0 */
+    v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT;
+    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID;
+    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+
+    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+              v->arch.hvm_vmx.secondary_exec_control);
+
+    __vmwrite(IO_BITMAP_A, virt_to_maddr((char *)hvm_io_bitmap + 0));
+    __vmwrite(IO_BITMAP_B, virt_to_maddr((char *)hvm_io_bitmap + PAGE_SIZE));
+
+    /* MSR bitmap for intercepts */
+    memset(msr_bitmap, ~0, PAGE_SIZE);
+    v->arch.hvm_vmx.msr_bitmap = msr_bitmap;
+    __vmwrite(MSR_BITMAP, virt_to_maddr(msr_bitmap));
+
+    msr_type = MSR_TYPE_R | MSR_TYPE_W;
+    vmx_disable_intercept_for_msr(v, MSR_FS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_GS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_STAR, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_LSTAR, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_CSTAR, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, msr_type);
+
+    __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
+
+    /*
+     * Note: we run with the default VM_ENTRY_LOAD_DEBUG_CTLS of 1: on vmentry
+     * the CPU loads DR7 and DEBUGCTL from the VMCS rather than keeping the
+     * host values; clearing the bit would make it ignore the VMCS values.
+     */
+    vmentry_ctl &= ~VM_ENTRY_LOAD_GUEST_EFER;
+    vmentry_ctl &= ~VM_ENTRY_SMM;
+    vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
+    /* PVH 32bitfixme */
+    vmentry_ctl |= VM_ENTRY_IA32E_MODE;       /* GUEST_EFER.LME/LMA ignored */
+
+    __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
+
+    __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
+    __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
+    __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
+
+    vmx_set_common_host_vmcs_fields(v);
+    vmx_set_host_env(v);
+
+    __vmwrite(VM_ENTRY_INTR_INFO, 0);
+    __vmwrite(CR3_TARGET_COUNT, 0);
+    __vmwrite(GUEST_ACTIVITY_STATE, 0);
+
+    /* These are largely irrelevant, as we load the descriptors directly. */
+    __vmwrite(GUEST_CS_SELECTOR, 0);
+    __vmwrite(GUEST_DS_SELECTOR, 0);
+    __vmwrite(GUEST_SS_SELECTOR, 0);
+    __vmwrite(GUEST_ES_SELECTOR, 0);
+    __vmwrite(GUEST_FS_SELECTOR, 0);
+    __vmwrite(GUEST_GS_SELECTOR, 0);
+
+    __vmwrite(GUEST_CS_BASE, 0);
+    __vmwrite(GUEST_CS_LIMIT, ~0u);
+    /* CS.L == 1: 64-bit, exec/read, accessed. PVH 32bitfixme */
+    __vmwrite(GUEST_CS_AR_BYTES, 0xa09b);
+
+    __vmwrite(GUEST_DS_BASE, 0);
+    __vmwrite(GUEST_DS_LIMIT, ~0u);
+    __vmwrite(GUEST_DS_AR_BYTES, 0xc093); /* read/write, accessed */
+
+    __vmwrite(GUEST_SS_BASE, 0);
+    __vmwrite(GUEST_SS_LIMIT, ~0u);
+    __vmwrite(GUEST_SS_AR_BYTES, 0xc093); /* read/write, accessed */
+
+    __vmwrite(GUEST_ES_BASE, 0);
+    __vmwrite(GUEST_ES_LIMIT, ~0u);
+    __vmwrite(GUEST_ES_AR_BYTES, 0xc093); /* read/write, accessed */
+
+    __vmwrite(GUEST_FS_BASE, 0);
+    __vmwrite(GUEST_FS_LIMIT, ~0u);
+    __vmwrite(GUEST_FS_AR_BYTES, 0xc093); /* read/write, accessed */
+
+    __vmwrite(GUEST_GS_BASE, 0);
+    __vmwrite(GUEST_GS_LIMIT, ~0u);
+    __vmwrite(GUEST_GS_AR_BYTES, 0xc093); /* read/write, accessed */
+
+    __vmwrite(GUEST_GDTR_BASE, 0);
+    __vmwrite(GUEST_GDTR_LIMIT, 0);
+
+    __vmwrite(GUEST_LDTR_BASE, 0);
+    __vmwrite(GUEST_LDTR_LIMIT, 0);
+    __vmwrite(GUEST_LDTR_AR_BYTES, 0x82); /* LDT */
+    __vmwrite(GUEST_LDTR_SELECTOR, 0);
+
+    /* Guest TSS. */
+    __vmwrite(GUEST_TR_BASE, 0);
+    __vmwrite(GUEST_TR_LIMIT, 0xff);
+    __vmwrite(GUEST_TR_AR_BYTES, 0x8b); /* 32-bit TSS (busy) */
+
+    __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
+    __vmwrite(GUEST_DR7, 0);
+    __vmwrite(VMCS_LINK_POINTER, ~0UL);
+
+    __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
+    __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
+
+    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK | (1U << TRAP_debug) |
+                                   (1U << TRAP_int3) | (1U << TRAP_no_device);
+    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
+
+    /* Set WP bit so rdonly pages are not written from CPL 0 */
+    tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
+    __vmwrite(GUEST_CR0, tmpval);
+    __vmwrite(CR0_READ_SHADOW, tmpval);
+    v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = tmpval;
+
+    tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    __vmwrite(GUEST_CR4, tmpval);
+    __vmwrite(CR4_READ_SHADOW, tmpval);
+    v->arch.hvm_vcpu.guest_cr[4] = tmpval;
+
+    __vmwrite(CR0_GUEST_HOST_MASK, ~0UL);
+    __vmwrite(CR4_GUEST_HOST_MASK, ~0UL);
+
+    v->arch.hvm_vmx.vmx_realmode = 0;
+
+    ept->asr  = pagetable_get_pfn(p2m_get_pagetable(p2m));
+    __vmwrite(EPT_POINTER, ept_get_eptp(ept));
+
+    rdmsrl(MSR_IA32_CR_PAT, host_pat);
+    __vmwrite(HOST_PAT, host_pat);
+    guest_pat = MSR_IA32_CR_PAT_RESET;
+    __vmwrite(GUEST_PAT, guest_pat);
+
+    /* the paging mode is updated for PVH by arch_set_info_guest() */
+
+    return 0;
+}
+
+static int construct_vmcs(struct vcpu *v)
+{
+    struct domain *d = v->domain;
     u32 vmexit_ctl = vmx_vmexit_control;
     u32 vmentry_ctl = vmx_vmentry_control;
 
     vmx_vmcs_enter(v);
 
+    if ( is_pvh_vcpu(v) )
+    {
+        int rc = pvh_construct_vmcs(v);
+        vmx_vmcs_exit(v);
+        return rc;
+    }
+
     /* VMCS controls. */
     __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control);
 
@@ -916,30 +1183,7 @@ static int construct_vmcs(struct vcpu *v)
         __vmwrite(GUEST_INTR_STATUS, 0);
     }
 
-    /* Host data selectors. */
-    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_FS_SELECTOR, 0);
-    __vmwrite(HOST_GS_SELECTOR, 0);
-    __vmwrite(HOST_FS_BASE, 0);
-    __vmwrite(HOST_GS_BASE, 0);
-
-    /* Host control registers. */
-    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
-    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
-    __vmwrite(HOST_CR4,
-              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
-
-    /* Host CS:RIP. */
-    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
-    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
-
-    /* Host SYSENTER CS:RIP. */
-    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
-    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
-    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
-    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
+    vmx_set_common_host_vmcs_fields(v);
 
     /* MSR intercepts. */
     __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
@@ -1259,8 +1503,10 @@ void vmx_do_resume(struct vcpu *v)
 
         vmx_clear_vmcs(v);
         vmx_load_vmcs(v);
-        hvm_migrate_timers(v);
-        hvm_migrate_pirqs(v);
+        if ( !is_pvh_vcpu(v) ) {
+            hvm_migrate_timers(v);
+            hvm_migrate_pirqs(v);
+        }
         vmx_set_host_env(v);
         /*
          * Both n1 VMCS and n2 VMCS need to update the host environment after 
@@ -1272,6 +1518,9 @@ void vmx_do_resume(struct vcpu *v)
         hvm_asid_flush_vcpu(v);
     }
 
+    if ( is_pvh_vcpu(v) )
+        reset_stack_and_jump(vmx_asm_do_vmentry);
+
     debug_state = v->domain->debugger_attached
                   || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_INT3]
                   || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP];
@@ -1455,7 +1704,7 @@ static void vmcs_dump(unsigned char ch)
 
     for_each_domain ( d )
     {
-        if ( !is_hvm_domain(d) )
+        if ( is_pv_domain(d) )
             continue;
         printk("\n>>> Domain %d <<<\n", d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 59336b9..70d0286 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -79,6 +79,9 @@ static int vmx_domain_initialise(struct domain *d)
 {
     int rc;
 
+    if ( is_pvh_domain(d) )
+        return 0;
+
     if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 )
         return rc;
 
@@ -87,6 +90,9 @@ static int vmx_domain_initialise(struct domain *d)
 
 static void vmx_domain_destroy(struct domain *d)
 {
+    if ( is_pvh_domain(d) )
+        return;
+
     vmx_free_vlapic_mapping(d);
 }
 
@@ -110,6 +116,12 @@ static int vmx_vcpu_initialise(struct vcpu *v)
 
     vpmu_initialise(v);
 
+    if ( is_pvh_vcpu(v) )
+    {
+        /* This is for hvm_long_mode_enabled(v). */
+        v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME;
+        return 0;
+    }
     vmx_install_vlapic_mapping(v);
 
     /* %eax == 1 signals full real-mode support to the guest loader. */
@@ -1031,6 +1043,27 @@ static void vmx_update_host_cr3(struct vcpu *v)
     vmx_vmcs_exit(v);
 }
 
+/*
+ * A PVH guest never causes a CR3-write vmexit; this runs only at guest setup.
+ */
+static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
+{
+    vmx_vmcs_enter(v);
+    switch ( cr )
+    {
+        case 3:
+            __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
+            hvm_asid_flush_vcpu(v);
+            break;
+
+        default:
+            dprintk(XENLOG_ERR,
+                   "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
+                   v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));
+    }
+    vmx_vmcs_exit(v);
+}
+
 void vmx_update_debug_state(struct vcpu *v)
 {
     unsigned long mask;
@@ -1050,6 +1083,12 @@ void vmx_update_debug_state(struct vcpu *v)
 
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
 {
+    if ( is_pvh_vcpu(v) )
+    {
+        vmx_update_pvh_cr(v, cr);
+        return;
+    }
+
     vmx_vmcs_enter(v);
 
     switch ( cr )
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (8 preceding siblings ...)
  2013-04-23 21:25 ` [PATCH 09/17] PVH xen: create PVH vmcs, and also initialization Mukesh Rathor
@ 2013-04-23 21:25 ` Mukesh Rathor
  2013-04-24  8:47   ` Jan Beulich
  2013-04-25 11:19   ` Tim Deegan
  2013-04-23 21:26 ` [PATCH 11/17] PVH xen: some misc changes like mtrr, intr, msi Mukesh Rathor
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:25 UTC (permalink / raw)
  To: Xen-devel

The heart of this patch is the VMX exit handler for PVH guests,
vmx_pvh_vmexit_handler(). It is isolated in a separate module, as most
reviewers preferred, and a call to it is added to vmx_vmexit_handler().
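
Besides the exit handler, the new pvh.c restricts PVH guests to a
whitelist of verified hypercalls. As a guest-side sketch, here is one of
the permitted grant ops (the surrounding code is hypothetical):

    /* GNTTABOP_query_size is on the whitelist in pvh_grant_table_op(). */
    struct gnttab_query_size q = { .dom = DOMID_SELF };

    if ( HYPERVISOR_grant_table_op(GNTTABOP_query_size, &q, 1) == 0 &&
         q.status == GNTST_okay )
        printk("using %u of %u max grant frames\n",
               q.nr_frames, q.max_nr_frames);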

Changes in V2:
  - Move non-VMX generic code to arch/x86/hvm/pvh.c.
  - Remove get_gpr_ptr() and use existing decode_register() instead.
  - Defer call to pvh vmx exit handler until interrupts are enabled. So the
    caller vmx_pvh_vmexit_handler() handles the NMI/EXT-INT/TRIPLE_FAULT now.
  - Fix the CPUID handling that wrongly cleared bit 24; this is no longer
    needed now that the correct feature bits are set in CR4 at VMCS creation.
  - Fix few hard tabs.

Changes in V3:
  - Lots of cleanup and rework in the PVH VM exit handler.
  - Add a parameter to emulate_forced_invalid_op().
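
As pvh_hvm_op() below shows, for a PVH domain itself the only
HVMOP_set_param accepted is registering a callback vector for event
delivery. A guest-side sketch (the vector 0xf3 and surrounding code are
hypothetical):

    /* Route event-channel upcalls through IDT vector 0xf3. */
    struct xen_hvm_param hp = {
        .domid = DOMID_SELF,
        .index = HVM_PARAM_CALLBACK_IRQ,
        /* Bits 63:56 == 2 selects HVMIRQ_callback_vector; low byte = vector. */
        .value = (2ULL << 56) | 0xf3,
    };
    long rc = HYPERVISOR_hvm_op(HVMOP_set_param, &hp);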

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/Makefile         |    3 +-
 xen/arch/x86/hvm/hvm.c            |    4 -
 xen/arch/x86/hvm/pvh.c            |  203 +++++++++++++
 xen/arch/x86/hvm/vmx/Makefile     |    1 +
 xen/arch/x86/hvm/vmx/vmcs.c       |    3 +-
 xen/arch/x86/hvm/vmx/vmx.c        |    8 +
 xen/arch/x86/hvm/vmx/vmx_pvh.c    |  597 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/traps.c              |   23 +-
 xen/include/asm-x86/hvm/hvm.h     |    6 +
 xen/include/asm-x86/hvm/vmx/vmx.h |    5 +
 xen/include/asm-x86/processor.h   |    1 +
 xen/include/asm-x86/pvh.h         |    6 +
 12 files changed, 847 insertions(+), 13 deletions(-)
 create mode 100644 xen/arch/x86/hvm/pvh.c
 create mode 100644 xen/arch/x86/hvm/vmx/vmx_pvh.c
 create mode 100644 xen/include/asm-x86/pvh.h

diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index eea5555..65ff9f3 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -22,4 +22,5 @@ obj-y += vlapic.o
 obj-y += vmsi.o
 obj-y += vpic.o
 obj-y += vpt.o
-obj-y += vpmu.o
\ No newline at end of file
+obj-y += vpmu.o
+obj-y += pvh.o
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 27dbe3d..0d84ec7 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3254,10 +3254,6 @@ static long hvm_vcpu_op(
     return rc;
 }
 
-typedef unsigned long hvm_hypercall_t(
-    unsigned long, unsigned long, unsigned long, unsigned long, unsigned long,
-    unsigned long);
-
 #define HYPERCALL(x)                                        \
     [ __HYPERVISOR_ ## x ] = (hvm_hypercall_t *) do_ ## x
 
diff --git a/xen/arch/x86/hvm/pvh.c b/xen/arch/x86/hvm/pvh.c
new file mode 100644
index 0000000..fe8b89c
--- /dev/null
+++ b/xen/arch/x86/hvm/pvh.c
@@ -0,0 +1,203 @@
+/*
+ * Copyright (C) 2013, Mukesh Rathor, Oracle Corp.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ */
+#include <xen/hypercall.h>
+#include <xen/guest_access.h>
+#include <asm/p2m.h>
+#include <asm/traps.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <public/sched.h>
+
+
+static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
+                              unsigned int count)
+{
+    switch ( cmd )
+    {
+        /*
+         * Only the following grant ops have been verified for PVH guests,
+         * hence only they are permitted here.
+         */
+        case GNTTABOP_map_grant_ref:
+        case GNTTABOP_unmap_grant_ref:
+        case GNTTABOP_setup_table:
+        case GNTTABOP_copy:
+        case GNTTABOP_query_size:
+        case GNTTABOP_set_version:
+            return do_grant_table_op(cmd, uop, count);
+    }
+    return -ENOSYS;
+}
+
+static long pvh_vcpu_op(int cmd, int vcpuid, XEN_GUEST_HANDLE(void) arg)
+{
+    long rc = -ENOSYS;
+
+    switch ( cmd )
+    {
+        case VCPUOP_register_runstate_memory_area:
+        case VCPUOP_get_runstate_info:
+        case VCPUOP_set_periodic_timer:
+        case VCPUOP_stop_periodic_timer:
+        case VCPUOP_set_singleshot_timer:
+        case VCPUOP_stop_singleshot_timer:
+        case VCPUOP_is_up:
+        case VCPUOP_up:
+        case VCPUOP_initialise:
+            rc = do_vcpu_op(cmd, vcpuid, arg);
+
+            /* PVH boot vcpu is setting context to bring up a secondary vcpu. */
+            if ( cmd == VCPUOP_initialise )
+                vmx_vmcs_enter(current);
+    }
+    return rc;
+}
+
+static long pvh_physdev_op(int cmd, XEN_GUEST_HANDLE(void) arg)
+{
+    switch ( cmd )
+    {
+        case PHYSDEVOP_map_pirq:
+        case PHYSDEVOP_unmap_pirq:
+        case PHYSDEVOP_eoi:
+        case PHYSDEVOP_irq_status_query:
+        case PHYSDEVOP_get_free_pirq:
+            return do_physdev_op(cmd, arg);
+
+        default:
+            if ( IS_PRIV(current->domain) )
+                return do_physdev_op(cmd, arg);
+    }
+    return -ENOSYS;
+}
+
+static long pvh_hvm_op(unsigned long op, XEN_GUEST_HANDLE(void) arg)
+{
+    long rc = -EINVAL;
+    struct xen_hvm_param harg;
+    struct domain *d;
+
+    if ( copy_from_guest(&harg, arg, 1) )
+        return -EFAULT;
+
+    rc = rcu_lock_target_domain_by_id(harg.domid, &d);
+    if ( rc != 0 )
+        return rc;
+
+    if ( is_hvm_domain(d) )
+    {
+        /* pvh dom0 is building an hvm guest */
+        rcu_unlock_domain(d);
+        return do_hvm_op(op, arg);
+    }
+
+    rc = -ENOSYS;
+    if ( op == HVMOP_set_param )
+    {
+        if ( harg.index == HVM_PARAM_CALLBACK_IRQ )
+        {
+            struct hvm_irq *hvm_irq = &d->arch.hvm_domain.irq;
+            uint64_t via = harg.value;
+            uint8_t via_type = (uint8_t)(via >> 56) + 1;
+
+            if ( via_type == HVMIRQ_callback_vector )
+            {
+                hvm_irq->callback_via_type = HVMIRQ_callback_vector;
+                hvm_irq->callback_via.vector = (uint8_t)via;
+                rc = 0;
+            }
+        }
+    }
+    rcu_unlock_domain(d);
+    if ( rc )
+        gdprintk(XENLOG_DEBUG, "op:%ld -ENOSYS\n", op);
+
+    return rc;
+}
+
+static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
+    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
+    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
+    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
+    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
+    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
+    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
+    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
+    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
+    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
+    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
+    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
+    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
+    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
+    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
+};
+
+/*
+ * Check whether the hypercall is valid.
+ * Returns: 0 if not, with regs->eax set to the errno to return to the guest.
+ */
+static bool_t hcall_valid(struct cpu_user_regs *regs)
+{
+    struct segment_register sreg;
+
+    hvm_get_segment_register(current, x86_seg_ss, &sreg);
+    if ( unlikely(sreg.attr.fields.dpl == 3) )
+    {
+        regs->eax = -EPERM;
+        return 0;
+    }
+
+    /* Following HCALLs have not been verified for PVH domUs */
+    if ( !IS_PRIV(current->domain) &&
+         (regs->eax == __HYPERVISOR_xsm_op ||
+          regs->eax == __HYPERVISOR_platform_op ||
+          regs->eax == __HYPERVISOR_domctl) )       /* for privcmd mmap */
+    {
+        regs->eax = -ENOSYS;
+        return 0;
+    }
+    return 1;
+}
+
+int pvh_do_hypercall(struct cpu_user_regs *regs)
+{
+    uint32_t hnum = regs->eax;
+
+    if ( hnum >= NR_hypercalls || pvh_hypercall64_table[hnum] == NULL )
+    {
+        gdprintk(XENLOG_WARNING, "PVH: Unimplemented HCALL:%d. Returning "
+                 "-ENOSYS. domid:%d IP:%lx SP:%lx\n",
+                 hnum, current->domain->domain_id, regs->rip, regs->rsp);
+        regs->eax = -ENOSYS;
+        vmx_update_guest_eip();
+        return HVM_HCALL_completed;
+    }
+
+    if ( regs->eax == __HYPERVISOR_sched_op && regs->rdi == SCHEDOP_shutdown )
+    {
+        domain_crash_synchronous();
+        return HVM_HCALL_completed;
+    }
+
+    if ( !hcall_valid(regs) )
+        return HVM_HCALL_completed;
+
+    current->arch.hvm_vcpu.hcall_preempted = 0;
+    regs->rax = pvh_hypercall64_table[hnum](regs->rdi, regs->rsi, regs->rdx,
+                                            regs->r10, regs->r8, regs->r9);
+
+    if ( current->arch.hvm_vcpu.hcall_preempted )
+        return HVM_HCALL_preempted;
+
+    return HVM_HCALL_completed;
+}
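+
+/*
+ * Illustrative only: a 64-bit PVH guest reaches pvh_do_hypercall() via
+ * VMCALL, with the hypercall number in rax and up to six arguments in
+ * rdi/rsi/rdx/r10/r8/r9, matching the dispatch above, e.g.:
+ *
+ *     static inline long pvh_hypercall2(unsigned long nr, unsigned long a1,
+ *                                       unsigned long a2)
+ *     {
+ *         long ret;
+ *
+ *         asm volatile ( "vmcall"
+ *                        : "=a" (ret)
+ *                        : "a" (nr), "D" (a1), "S" (a2)
+ *                        : "memory" );
+ *         return ret;
+ *     }
+ */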
diff --git a/xen/arch/x86/hvm/vmx/Makefile b/xen/arch/x86/hvm/vmx/Makefile
index 373b3d9..8b71dae 100644
--- a/xen/arch/x86/hvm/vmx/Makefile
+++ b/xen/arch/x86/hvm/vmx/Makefile
@@ -5,3 +5,4 @@ obj-y += vmcs.o
 obj-y += vmx.o
 obj-y += vpmu_core2.o
 obj-y += vvmx.o
+obj-y += vmx_pvh.o
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index e7b0c4b..45e2d84 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
 
         vmx_clear_vmcs(v);
         vmx_load_vmcs(v);
-        if ( !is_pvh_vcpu(v) ) {
+        if ( !is_pvh_vcpu(v) )
+        {
             hvm_migrate_timers(v);
             hvm_migrate_pirqs(v);
         }
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 70d0286..ad9344c 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1535,6 +1535,8 @@ static struct hvm_function_table __read_mostly vmx_function_table = {
     .virtual_intr_delivery_enabled = vmx_virtual_intr_delivery_enabled,
     .process_isr          = vmx_process_isr,
     .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m,
+    .pvh_set_vcpu_info    = vmx_pvh_set_vcpu_info,
+    .pvh_read_descriptor  = vmx_pvh_read_descriptor,
 };
 
 struct hvm_function_table * __init start_vmx(void)
@@ -2370,6 +2372,12 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
 
+    if ( is_pvh_vcpu(v) )
+    {
+        vmx_pvh_vmexit_handler(regs);
+        return;
+    }
+
     if ( v->arch.hvm_vmx.vmx_realmode )
     {
         /* Put RFLAGS back the way the guest wants it */
diff --git a/xen/arch/x86/hvm/vmx/vmx_pvh.c b/xen/arch/x86/hvm/vmx/vmx_pvh.c
new file mode 100644
index 0000000..3ee5556
--- /dev/null
+++ b/xen/arch/x86/hvm/vmx/vmx_pvh.c
@@ -0,0 +1,597 @@
+/*
+ * Copyright (C) 2013, Mukesh Rathor, Oracle Corp.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include <xen/hypercall.h>
+#include <xen/guest_access.h>
+#include <asm/p2m.h>
+#include <asm/traps.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <public/sched.h>
+#include <asm/pvh.h>
+
+#ifndef NDEBUG
+int pvhdbg = 0;
+#define dbgp1(...) do { if ( pvhdbg == 1 ) printk(__VA_ARGS__); } while ( 0 )
+#else
+#define dbgp1(...) ((void)0)
+#endif
+
+
+static void read_vmcs_selectors(struct cpu_user_regs *regs)
+{
+    regs->cs = __vmread(GUEST_CS_SELECTOR);
+    regs->ss = __vmread(GUEST_SS_SELECTOR);
+    regs->ds = __vmread(GUEST_DS_SELECTOR);
+    regs->es = __vmread(GUEST_ES_SELECTOR);
+    regs->gs = __vmread(GUEST_GS_SELECTOR);
+    regs->fs = __vmread(GUEST_FS_SELECTOR);
+}
+
+/* Returns 0 if the MSR was read successfully. */
+static int vmxit_msr_read(struct cpu_user_regs *regs)
+{
+    u64 msr_content = 0;
+
+    switch ( regs->ecx )
+    {
+        case MSR_IA32_MISC_ENABLE:
+        {
+            rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
+            msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
+                           MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
+            break;
+        }
+        default:
+        {
+            /* pvh fixme: see hvm_msr_read_intercept() */
+            rdmsrl(regs->ecx, msr_content);
+            break;
+        }
+    }
+    regs->eax = (uint32_t)msr_content;
+    regs->edx = (uint32_t)(msr_content >> 32);
+    vmx_update_guest_eip();
+
+    dbgp1("msr read c:%lx a:%lx d:%lx RIP:%lx RSP:%lx\n", regs->ecx, regs->eax,
+          regs->edx, regs->rip, regs->rsp);
+
+    return 0;
+}
+
+/* Returns 0 if the MSR was written successfully. */
+static int vmxit_msr_write(struct cpu_user_regs *regs)
+{
+    uint64_t msr_content = (uint32_t)regs->eax | ((uint64_t)regs->edx << 32);
+
+    dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx,
+          regs->eax, regs->edx);
+
+    if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY )
+    {
+        vmx_update_guest_eip();
+        return 0;
+    }
+    return 1;
+}
+
+static int vmxit_debug(struct cpu_user_regs *regs)
+{
+    struct vcpu *vp = current;
+    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
+
+    write_debugreg(6, exit_qualification | 0xffff0ff0);
+
+    /* gdbsx or another debugger */
+    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
+         guest_kernel_mode(vp, regs) && vp->domain->debugger_attached )
+    {
+        domain_pause_for_debugger();
+    }
+    else
+        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
+    return 0;
+}
+
+/* Returns: rc == 0: handled the MTF vmexit */
+static int vmxit_mtf(struct cpu_user_regs *regs)
+{
+    struct vcpu *vp = current;
+    int rc = -EINVAL, ss = vp->arch.hvm_vcpu.single_step;
+
+    vp->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, vp->arch.hvm_vmx.exec_control);
+    vp->arch.hvm_vcpu.single_step = 0;
+
+    if ( vp->domain->debugger_attached && ss )
+    {
+        domain_pause_for_debugger();
+        rc = 0;
+    }
+    return rc;
+}
+
+static int vmxit_int3(struct cpu_user_regs *regs)
+{
+    int ilen = vmx_get_instruction_length();
+    struct vcpu *vp = current;
+    struct hvm_trap trap_info = {
+                        .vector = TRAP_int3,
+                        .type = X86_EVENTTYPE_SW_EXCEPTION,
+                        .error_code = HVM_DELIVER_NO_ERROR_CODE,
+                        .insn_len = ilen
+    };
+
+    regs->eip += ilen;
+
+    /* gdbsx or another debugger. Never pause dom0 */
+    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
+    {
+        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
+        current->arch.gdbsx_vcpu_event = TRAP_int3;
+        domain_pause_for_debugger();
+        return 0;
+    }
+
+    regs->eip -= ilen;
+    hvm_inject_trap(&trap_info);
+
+    return 0;
+}
+
+static int vmxit_invalid_op(struct cpu_user_regs *regs)
+{
+    unsigned long addr = 0;
+
+    if ( guest_kernel_mode(current, regs) ||
+         emulate_forced_invalid_op(regs, &addr) == 0 )
+    {
+        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
+        return 0;
+    }
+    if ( addr )
+        hvm_inject_page_fault(0, addr);
+
+    return 0;
+}
+
+/* Returns: rc == 0: handled the exception/NMI */
+static int vmxit_exception(struct cpu_user_regs *regs)
+{
+    unsigned int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK;
+    int rc = -ENOSYS;
+
+    dbgp1(" EXCPT: vec:%d cs:%lx r.IP:%lx\n", vector,
+          __vmread(GUEST_CS_SELECTOR), regs->eip);
+
+    switch ( vector )
+    {
+        case TRAP_debug:
+            rc = vmxit_debug(regs);
+            break;
+
+        case TRAP_int3:
+            rc = vmxit_int3(regs);
+            break;
+
+        case TRAP_invalid_op:
+            rc = vmxit_invalid_op(regs);
+            break;
+
+        case TRAP_no_device:
+            hvm_funcs.fpu_dirty_intercept();  /* vmx_fpu_dirty_intercept */
+            rc = 0;
+            break;
+
+        default:
+            gdprintk(XENLOG_WARNING,
+                     "PVH: Unhandled trap:%d. IP:%lx\n", vector, regs->eip);
+    }
+    return rc;
+}
+
+static int vmxit_vmcall(struct cpu_user_regs *regs)
+{
+    if ( pvh_do_hypercall(regs) != HVM_HCALL_preempted )
+        vmx_update_guest_eip();
+    return 0;
+}
+
+/* Returns: rc == 0: success */
+static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
+{
+    struct vcpu *vp = current;
+
+    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
+    {
+        unsigned long new_cr0 = *regp;
+        unsigned long old_cr0 = __vmread(GUEST_CR0);
+
+        dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);
+        if ( (u32)new_cr0 != new_cr0 )
+        {
+            gdprintk(XENLOG_ERR,
+                     "Guest setting upper 32 bits in CR0: %lx", new_cr0);
+            return -EPERM;
+        }
+
+        new_cr0 &= ~HVM_CR0_GUEST_RESERVED_BITS;
+        /* ET is reserved and should always be 1. */
+        new_cr0 |= X86_CR0_ET;
+
+        /* A PVH guest is not expected to leave paged protected mode. */
+        if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) !=
+             (X86_CR0_PG | X86_CR0_PE) )
+        {
+            gdprintk(XENLOG_ERR,
+                     "PVH attempting to turn off PE/PG. CR0:%lx\n", new_cr0);
+            return -EPERM;
+        }
+        /* TS going from 1 to 0 */
+        if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) )
+            vmx_fpu_enter(vp);
+
+        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0;
+        __vmwrite(GUEST_CR0, new_cr0);
+        __vmwrite(CR0_READ_SHADOW, new_cr0);
+    }
+    else
+    {
+        *regp = __vmread(GUEST_CR0);
+    }
+    return 0;
+}
+
+/* Returns: rc == 0: success */
+static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
+{
+    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
+    {
+        u64 old_cr4 = __vmread(GUEST_CR4);
+
+        if ( (old_cr4 ^ (*regp)) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
+            vpid_sync_all();
+
+        __vmwrite(GUEST_CR4, *regp);
+    }
+    else
+        *regp = __vmread(GUEST_CR4);
+
+    return 0;
+}
+
+/* Returns: rc == 0: success, else -errno */
+static int vmxit_cr_access(struct cpu_user_regs *regs)
+{
+    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
+    uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification);
+    int cr, rc = -EINVAL;
+
+    switch ( acc_typ )
+    {
+        case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR:
+        case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR:
+        {
+            uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification);
+            uint64_t *regp = decode_register(gpr, regs, 0);
+            cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification);
+
+            if ( regp == NULL )
+                break;
+
+            switch ( cr )
+            {
+                case 0:
+                    rc = access_cr0(regs, acc_typ, regp);
+                    break;
+
+                case 3:
+                    gdprintk(XENLOG_ERR,
+                             "PVH: unexpected cr3 vmexit. rip:%lx\n",
+                             regs->rip);
+                    domain_crash_synchronous();
+                    break;
+
+                case 4:
+                    rc = access_cr4(regs, acc_typ, regp);
+                    break;
+            }
+            if ( rc == 0 )
+                vmx_update_guest_eip();
+            break;
+        }
+
+        case VMX_CONTROL_REG_ACCESS_TYPE_CLTS:
+        {
+            struct vcpu *vp = current;
+            unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS;
+            vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0;
+
+            vmx_fpu_enter(vp);
+            __vmwrite(GUEST_CR0, cr0);
+            __vmwrite(CR0_READ_SHADOW, cr0);
+            vmx_update_guest_eip();
+            rc = 0;
+        }
+    }
+    return rc;
+}
+
+/*
+ * NOTE: a PVH guest sets IOPL natively via the EFLAGS IOPL bits, not via
+ * the hypercalls a PV guest uses.
+ */
+static int vmxit_io_instr(struct cpu_user_regs *regs)
+{
+    int curr_lvl;
+    int requested = (regs->rflags >> 12) & 3;
+
+    read_vmcs_selectors(regs);
+    curr_lvl = regs->cs & 3;
+
+    if ( requested >= curr_lvl && emulate_privileged_op(regs) )
+        return 0;
+
+    hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code);
+    return 0;
+}
+
+static int pvh_ept_handle_violation(unsigned long qualification,
+                                    paddr_t gpa, struct cpu_user_regs *regs)
+{
+    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
+    p2m_type_t p2mt;
+    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
+
+    gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "
+             "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n",
+             qualification,
+             (qualification & EPT_READ_VIOLATION) ? 'r' : '-',
+             (qualification & EPT_WRITE_VIOLATION) ? 'w' : '-',
+             (qualification & EPT_EXEC_VIOLATION) ? 'x' : '-',
+             (qualification & EPT_EFFECTIVE_READ) ? 'r' : '-',
+             (qualification & EPT_EFFECTIVE_WRITE) ? 'w' : '-',
+             (qualification & EPT_EFFECTIVE_EXEC) ? 'x' : '-',
+             gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp);
+
+    ept_walk_table(current->domain, gfn);
+
+    if ( qualification & EPT_GLA_VALID )
+    {
+        gla = __vmread(GUEST_LINEAR_ADDRESS);
+        gdprintk(XENLOG_ERR, " --- GLA %#lx\n", gla);
+    }
+
+    hvm_inject_hw_exception(TRAP_gp_fault, 0);
+    return 0;
+}
+
+/*
+ * Xen's cpuid() helper clobbers rcx, so execute the CPUID instruction
+ * directly here, with the guest's rcx, exactly as a user process would on a
+ * PV guest.
+ */
+static void pvh_user_cpuid(struct cpu_user_regs *regs)
+{
+    unsigned int eax, ebx, ecx, edx;
+
+    asm volatile ( "cpuid"
+              : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
+              : "0" (regs->eax), "2" (regs->rcx) );
+
+    regs->rax = eax; regs->rbx = ebx; regs->rcx = ecx; regs->rdx = edx;
+}
+
+/*
+ * Main VM-exit handler for PVH guests. Called from vmx_vmexit_handler().
+ * Note: vmx_asm_vmexit_handler updates rip/rsp/eflags in regs{} struct.
+ */
+void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs)
+{
+    unsigned long exit_qualification;
+    unsigned int exit_reason = __vmread(VM_EXIT_REASON);
+    int rc = 0, ccpu = smp_processor_id();
+    struct vcpu *vp = current;
+
+    dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx EFLAGS:%lx CR0:%lx\n",
+          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
+          __vmread(GUEST_CR0));
+
+    /* for guest_kernel_mode() */
+    regs->cs = __vmread(GUEST_CS_SELECTOR);
+
+    switch ( (uint16_t)exit_reason )
+    {
+        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
+            rc = vmxit_exception(regs);
+            break;
+
+        case EXIT_REASON_EXTERNAL_INTERRUPT: /* 1 */
+            break;              /* handled in vmx_vmexit_handler() */
+
+        case EXIT_REASON_PENDING_VIRT_INTR:  /* 7 */
+        {
+            struct vcpu *v = current;
+
+            /* Disable the interrupt window. */
+            v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+            __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+            break;
+        }
+
+        case EXIT_REASON_CPUID:              /* 10 */
+        {
+            if ( guest_kernel_mode(vp, regs) )
+                pv_cpuid(regs);
+            else
+                pvh_user_cpuid(regs);
+
+            vmx_update_guest_eip();
+            break;
+        }
+
+        case EXIT_REASON_HLT:                /* 12 */
+        {
+            vmx_update_guest_eip();
+            hvm_hlt(regs->eflags);
+            break;
+        }
+
+        case EXIT_REASON_VMCALL:             /* 18 */
+            rc = vmxit_vmcall(regs);
+            break;
+
+        case EXIT_REASON_CR_ACCESS:          /* 28 */
+            rc = vmxit_cr_access(regs);
+            break;
+
+        case EXIT_REASON_DR_ACCESS:          /* 29 */
+        {
+            exit_qualification = __vmread(EXIT_QUALIFICATION);
+            vmx_dr_access(exit_qualification, regs);
+            break;
+        }
+
+        case EXIT_REASON_IO_INSTRUCTION:     /* 30 */
+            vmxit_io_instr(regs);
+            break;
+
+        case EXIT_REASON_MSR_READ:           /* 31 */
+            rc = vmxit_msr_read(regs);
+            break;
+
+        case EXIT_REASON_MSR_WRITE:          /* 32 */
+            rc = vmxit_msr_write(regs);
+            break;
+
+        case EXIT_REASON_MONITOR_TRAP_FLAG:  /* 37 */
+            rc = vmxit_mtf(regs);
+            break;
+
+        case EXIT_REASON_MCE_DURING_VMENTRY: /* 41 */
+            break;              /* handled in vmx_vmexit_handler() */
+
+        case EXIT_REASON_EPT_VIOLATION:      /* 48 */
+        {
+            paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS);
+            exit_qualification = __vmread(EXIT_QUALIFICATION);
+            rc = pvh_ept_handle_violation(exit_qualification, gpa, regs);
+            break;
+        }
+
+        default:
+            rc = 1;
+            gdprintk(XENLOG_ERR,
+                     "PVH: Unexpected exit reason:0x%x\n", exit_reason);
+    }
+    if ( rc )
+    {
+        exit_qualification = __vmread(EXIT_QUALIFICATION);
+        gdprintk(XENLOG_ERR,
+                 "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n",
+                 ccpu, exit_reason, exit_reason, exit_qualification,
+                 exit_qualification, __vmread(GUEST_CR0));
+        gdprintk(XENLOG_ERR, "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n",
+                 regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3));
+        domain_crash_synchronous();
+    }
+}
+
+/*
+ * Set the context of a non-boot SMP vcpu; the vcpu 0 context is set by the
+ * library. For Linux, the call comes from cpu_initialize_context().
+ */
+int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
+{
+    if ( v->vcpu_id == 0 )
+        return 0;
+
+    vmx_vmcs_enter(v);
+    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
+    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
+    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
+
+    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
+    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
+    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
+    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
+    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
+
+    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
+    {
+        vmx_vmcs_exit(v);
+        return -EINVAL;
+    }
+
+    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel);
+
+    vmx_vmcs_exit(v);
+    return 0;
+}
+
+int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
+                            const struct cpu_user_regs *regs,
+                            unsigned long *base, unsigned long *limit,
+                            unsigned int *ar)
+{
+    unsigned int tmp_ar = 0;
+
+    ASSERT(v == current);
+    ASSERT(is_pvh_vcpu(v));
+
+    if ( sel == (unsigned int)regs->cs )
+    {
+        *base = __vmread(GUEST_CS_BASE);
+        *limit = __vmread(GUEST_CS_LIMIT);
+        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
+    }
+    else if ( sel == (unsigned int)regs->ds )
+    {
+        *base = __vmread(GUEST_DS_BASE);
+        *limit = __vmread(GUEST_DS_LIMIT);
+        tmp_ar = __vmread(GUEST_DS_AR_BYTES);
+    }
+    else if ( sel == (unsigned int)regs->ss )
+    {
+        *base = __vmread(GUEST_SS_BASE);
+        *limit = __vmread(GUEST_SS_LIMIT);
+        tmp_ar = __vmread(GUEST_SS_AR_BYTES);
+    }
+    else if ( sel == (unsigned int)regs->gs )
+    {
+        *base = __vmread(GUEST_GS_BASE);
+        *limit = __vmread(GUEST_GS_LIMIT);
+        tmp_ar = __vmread(GUEST_GS_AR_BYTES);
+    }
+    else if ( sel == (unsigned int)regs->fs )
+    {
+        *base = __vmread(GUEST_FS_BASE);
+        *limit = __vmread(GUEST_FS_LIMIT);
+        tmp_ar = __vmread(GUEST_FS_AR_BYTES);
+    }
+    else if ( sel == (unsigned int)regs->es )
+    {
+        *base = __vmread(GUEST_ES_BASE);
+        *limit = __vmread(GUEST_ES_LIMIT);
+        tmp_ar = __vmread(GUEST_ES_AR_BYTES);
+    }
+    else
+    {
+        gdprintk(XENLOG_WARNING, "Unmatched segment selector:%d\n", sel);
+        return 0;
+    }
+
+    if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
+    {
+        *base = 0UL;
+        *limit = ~0UL;
+    }
+    /* Fix up ar to match a native descriptor's attribute layout. */
+    *ar = (tmp_ar << 8);
+
+    return 1;
+}
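+
+/*
+ * Example of the fixup above, for illustration: a flat 64-bit code segment
+ * reads back from GUEST_CS_AR_BYTES as 0xa09b (present, type 0xb, L and G
+ * set); shifted left by 8 this becomes 0xa09b00, i.e. the attribute bits
+ * land in bits 8-23, just as read_descriptor() extracts them from a native
+ * GDT entry.
+ */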
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index dbea755..b95bed5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -905,17 +905,22 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
     return EXCRET_fault_fixed;
 }
 
-static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
+int emulate_forced_invalid_op(struct cpu_user_regs *regs,
+                              unsigned long *addrp)
 {
     char sig[5], instr[2];
-    unsigned long eip, rc;
+    unsigned long eip, rc, addr;
 
     eip = regs->eip;
 
     /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
     {
-        propagate_page_fault(eip + sizeof(sig) - rc, 0);
+        addr = eip + sizeof(sig) - rc;
+        if ( addrp )
+            *addrp = addr;
+        else
+            propagate_page_fault(addr, 0);
         return EXCRET_fault_fixed;
     }
     if ( memcmp(sig, "\xf\xbxen", sizeof(sig)) )
@@ -923,9 +928,13 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
     eip += sizeof(sig);
 
     /* We only emulate CPUID. */
-    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
     {
-        propagate_page_fault(eip + sizeof(instr) - rc, 0);
+        addr = eip + sizeof(instr) - rc;
+        if ( addrp )
+            *addrp = addr;
+        else
+            propagate_page_fault(addr, 0);
         return EXCRET_fault_fixed;
     }
     if ( memcmp(instr, "\xf\xa2", sizeof(instr)) )
@@ -954,7 +963,7 @@ void do_invalid_op(struct cpu_user_regs *regs)
     if ( likely(guest_mode(regs)) )
     {
         if ( !emulate_invalid_rdtscp(regs) &&
-             !emulate_forced_invalid_op(regs) )
+             !emulate_forced_invalid_op(regs, NULL) )
             do_guest_trap(TRAP_invalid_op, regs, 0);
         return;
     }
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index a790954..e2f99f3 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -514,4 +514,10 @@ bool_t nhvm_vmcx_hap_enabled(struct vcpu *v);
 /* interrupt */
 enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v);
 
+
+/* hypercall table typedef for HVM */
+typedef unsigned long hvm_hypercall_t(
+    unsigned long, unsigned long, unsigned long, unsigned long, unsigned long,
+    unsigned long);
+
 #endif /* __ASM_X86_HVM_HVM_H__ */
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 4c97d50..a9bca14 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -445,6 +445,11 @@ void setup_ept_dump(void);
 
 void vmx_update_guest_eip(void);
 void vmx_dr_access(unsigned long exit_qualification,struct cpu_user_regs *regs);
+void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs);
+int  vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp);
+int  vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
+                         const struct cpu_user_regs *regs, unsigned long *base,
+                         unsigned long *limit, unsigned int *ar);
 
 int alloc_p2m_hap_data(struct p2m_domain *p2m);
 void free_p2m_hap_data(struct p2m_domain *p2m);
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 8c70324..6d0794c 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -567,6 +567,7 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
 int microcode_resume_cpu(int cpu);
 
 void pv_cpuid(struct cpu_user_regs *regs);
+int emulate_forced_invalid_op(struct cpu_user_regs *regs, unsigned long *);
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_X86_PROCESSOR_H */
diff --git a/xen/include/asm-x86/pvh.h b/xen/include/asm-x86/pvh.h
new file mode 100644
index 0000000..73e59d3
--- /dev/null
+++ b/xen/include/asm-x86/pvh.h
@@ -0,0 +1,6 @@
+#ifndef __ASM_X86_PVH_H__
+#define __ASM_X86_PVH_H__
+
+int pvh_do_hypercall(struct cpu_user_regs *regs);
+
+#endif  /* __ASM_X86_PVH_H__ */
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 11/17] PVH xen: some misc changes like mtrr, intr, msi...
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (9 preceding siblings ...)
  2013-04-23 21:25 ` [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-23 21:26 ` [PATCH 12/17] PVH xen: support invalid op, return PVH features etc Mukesh Rathor
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

Change irq.c, since PVH doesn't use vlapic emulation. In mtrr.c, add an
assert and set the MTRR types for PVH.
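
Note also that with the time.c change below, tsc emulation is simply
turned off for PVH, so a PVH guest should be configured to run with a
native TSC, e.g. in the vm.cfg (the xl/xm spelling of
TSC_MODE_NEVER_EMULATE, if I have the name right):

    tsc_mode = "native"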

Changes in V2:
   - Some cleanup of redundant code.
   - time.c: honor no-rdtsc-exiting for PVH by setting vtsc to 0.

Changes in V3:
   - Don't check for PVH when making MMIO rangesets read-only.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/irq.c      |    3 +++
 xen/arch/x86/hvm/mtrr.c     |   11 +++++++++++
 xen/arch/x86/hvm/vmx/intr.c |    7 ++++---
 xen/arch/x86/time.c         |    9 +++++++++
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 9eae5de..92fb245 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
          && vcpu_info(v, evtchn_upcall_pending) )
         return hvm_intack_vector(plat->irq.callback_via.vector);
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_intack_none;
+
     if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
         return hvm_intack_pic(0);
 
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index ef51a8d..f088ce0 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -578,6 +578,9 @@ int32_t hvm_set_mem_pinned_cacheattr(
 {
     struct hvm_mem_pinned_cacheattr_range *range;
 
+    /* PVH note: The guest writes to MSR_IA32_CR_PAT natively */
+    ASSERT(!is_pvh_domain(d));
+
     if ( !((type == PAT_TYPE_UNCACHABLE) ||
            (type == PAT_TYPE_WRCOMB) ||
            (type == PAT_TYPE_WRTHROUGH) ||
@@ -693,6 +696,14 @@ uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
          ((d->vcpu == NULL) || ((v = d->vcpu[0]) == NULL)) )
         return MTRR_TYPE_WRBACK;
 
+    /* PVH fixme: Add support for more memory types */
+    if ( is_pvh_domain(d) )
+    {
+        if ( direct_mmio )
+            return MTRR_TYPE_UNCACHABLE;
+        return MTRR_TYPE_WRBACK;
+    }
+
     if ( !v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] )
         return MTRR_TYPE_WRBACK;
 
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index e376f3c..b94f9d5 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -219,15 +219,16 @@ void vmx_intr_assist(void)
         return;
     }
 
-    /* Crank the handle on interrupt state. */
-    pt_vector = pt_update_irq(v);
+    if ( !is_pvh_vcpu(v) )
+        /* Crank the handle on interrupt state. */
+        pt_vector = pt_update_irq(v);
 
     do {
         intack = hvm_vcpu_has_pending_irq(v);
         if ( likely(intack.source == hvm_intsrc_none) )
             goto out;
 
-        if ( unlikely(nvmx_intr_intercept(v, intack)) )
+        if ( !is_pvh_vcpu(v) && unlikely(nvmx_intr_intercept(v, intack)) )
             goto out;
 
         intblk = hvm_interrupt_blocked(v, intack);
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 6e94847..484eb07 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1933,6 +1933,15 @@ void tsc_set_info(struct domain *d,
         d->arch.vtsc = 0;
         return;
     }
+    if ( is_pvh_domain(d) && tsc_mode != TSC_MODE_NEVER_EMULATE )
+    {
+        /* PVH fixme: support more tsc modes */
+        dprintk(XENLOG_WARNING,
+                "PVH currently does not support tsc emulation. Setting it "
+                "to no emulation\n");
+        d->arch.vtsc = 0;
+        return;
+    }
 
     switch ( d->arch.tsc_mode = tsc_mode )
     {
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 12/17] PVH xen: support invalid op, return PVH features etc...
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (10 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 11/17] PVH xen: some misc changes like mtrr, intr, msi Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-24  9:01   ` Jan Beulich
  2013-04-23 21:26 ` [PATCH 13/17] PVH xen: p2m related changes Mukesh Rathor
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

The biggest change in this patch is in traps.c, to allow forced invalid-op
emulation for PVH guests. Also, enable hypercall page initialisation for
PVH guests. Finally, set the guest type to PVH if a PV guest is created
with HAP enabled.
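
For reference, the sequence being recognised is ud2 followed by the ascii
bytes "xen" and then the cpuid opcode; a guest forces an emulated CPUID
with something like the below (macro spelling as in the Linux port, shown
purely for illustration):

    #define XEN_EMULATE_PREFIX ".byte 0x0f,0x0b,0x78,0x65,0x6e ; "

    static inline void xen_cpuid(unsigned int *ax, unsigned int *bx,
                                 unsigned int *cx, unsigned int *dx)
    {
        asm volatile ( XEN_EMULATE_PREFIX "cpuid"
                       : "=a" (*ax), "=b" (*bx), "=c" (*cx), "=d" (*dx)
                       : "0" (*ax), "2" (*cx) );
    }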

Changes in V2:
  - Fix emulate_forced_invalid_op() to use proper copy function, and inject PF
    in case it fails.
  - remove extraneous PVH check in STI/CLI ops in emulate_privileged_op().
  - Make assert a debug ASSERT in show_registers().
  - debug.c: keep get_gfn() locked and move put_gfn closer to it.

Changes in V3:
  - Mostly formatting.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/debug.c        |    7 +++----
 xen/arch/x86/traps.c        |   25 ++++++++++++++++++++++++-
 xen/arch/x86/x86_64/traps.c |    7 ++++---
 xen/common/domain.c         |    9 +++++++++
 xen/common/domctl.c         |    5 +++++
 xen/common/kernel.c         |    6 +++++-
 6 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 167421d..235f0bb 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -59,7 +59,9 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct domain *dp, int toaddr,
         return INVALID_MFN;
     }
 
-    mfn = mfn_x(get_gfn(dp, *gfn, &gfntype)); 
+    mfn = mfn_x(get_gfn_query(dp, *gfn, &gfntype));
+    put_gfn(dp, *gfn);
+
     if ( p2m_is_readonly(gfntype) && toaddr )
     {
         DBGP2("kdb:p2m_is_readonly: gfntype:%x\n", gfntype);
@@ -178,9 +180,6 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp,
         }
 
         unmap_domain_page(va);
-        if ( gfn != INVALID_GFN )
-            put_gfn(dp, gfn);
-
         addr += pagecnt;
         buf += pagecnt;
         len -= pagecnt;
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index b95bed5..9e819b5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -459,6 +459,10 @@ static void instruction_done(
     struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
 {
     regs->eip = eip;
+
+    if ( is_pvh_vcpu(current) )
+        return;
+
     regs->eflags &= ~X86_EFLAGS_RF;
     if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) )
     {
@@ -475,6 +479,9 @@ static unsigned int check_guest_io_breakpoint(struct vcpu *v,
     unsigned int width, i, match = 0;
     unsigned long start;
 
+    if ( is_pvh_vcpu(v) )
+        return 0;          /* PVH fixme: support io breakpoint */
+
     if ( !(v->arch.debugreg[5]) ||
          !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) )
         return 0;
@@ -1077,6 +1084,9 @@ void propagate_page_fault(unsigned long addr, u16 error_code)
     struct vcpu *v = current;
     struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce;
 
+    /* PVH should not get here. (ctrlreg is not implemented) */
+    ASSERT(!is_pvh_vcpu(v));
+
     v->arch.pv_vcpu.ctrlreg[2] = addr;
     arch_set_cr2(v, addr);
 
@@ -1462,6 +1472,9 @@ static int read_descriptor(unsigned int sel,
 {
     struct desc_struct desc;
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_read_descriptor(sel, v, regs, base, limit, ar);
+
     if ( !vm86_mode(regs) )
     {
         if ( sel < 4)
@@ -1580,6 +1593,13 @@ static int guest_io_okay(
     int user_mode = !(v->arch.flags & TF_kernel_mode);
 #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v)
 
+    /*
+     * For PVH we check this in vmexit for EXIT_REASON_IO_INSTRUCTION
+     * and so don't need to check again here.
+     */
+    if ( is_pvh_vcpu(v) )
+        return 1;
+
     if ( !vm86_mode(regs) &&
          (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ? 1 : 3)) )
         return 1;
@@ -1825,7 +1845,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
         _ptr = (unsigned int)_ptr;                                          \
     if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) )   \
         goto fail;                                                          \
-    if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 )       \
+    if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 )  \
     {                                                                       \
         propagate_page_fault(_ptr + sizeof(_x) - _rc, 0);                   \
         goto skip;                                                          \
@@ -3252,6 +3272,9 @@ void do_device_not_available(struct cpu_user_regs *regs)
 
     BUG_ON(!guest_mode(regs));
 
+    /* PVH should not get here. (ctrlreg is not implemented) */
+    ASSERT(!is_pvh_vcpu(curr));
+
     vcpu_restore_fpu_lazy(curr);
 
     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index d2f7209..a47b8d4 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -146,8 +146,8 @@ void vcpu_show_registers(const struct vcpu *v)
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
 
-    /* No need to handle HVM for now. */
-    if ( is_hvm_vcpu(v) )
+    /* No need to handle HVM and PVH for now. */
+    if ( !is_pv_vcpu(v) )
         return;
 
     crs[0] = v->arch.pv_vcpu.ctrlreg[0];
@@ -440,6 +440,7 @@ static long register_guest_callback(struct callback_register *reg)
     long ret = 0;
     struct vcpu *v = current;
 
+    ASSERT(!is_pvh_vcpu(v));
     if ( !is_canonical_address(reg->address) )
         return -EINVAL;
 
@@ -620,7 +621,7 @@ static void hypercall_page_initialise_ring3_kernel(void *hypercall_page)
 void hypercall_page_initialise(struct domain *d, void *hypercall_page)
 {
     memset(hypercall_page, 0xCC, PAGE_SIZE);
-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
         hvm_hypercall_page_initialise(d, hypercall_page);
     else if ( !is_pv_32bit_domain(d) )
         hypercall_page_initialise_ring3_kernel(hypercall_page);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 9b8368c..e1a2397 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -235,6 +235,15 @@ struct domain *domain_create(
 
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = is_hvm;
+    else if ( domcr_flags & DOMCRF_pvh )
+    {
+        if ( !(domcr_flags & DOMCRF_hap) )
+        {
+            dprintk(XENLOG_ERR, "PVH guest must have HAP on\n");
+            goto fail;
+        }
+        d->guest_type = is_pvh;
+    }
 
     if ( domid == 0 )
     {
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 6bd8efd..7dec348 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -186,6 +186,8 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
 
     if ( is_hvm_domain(d) )
         info->flags |= XEN_DOMINF_hvm_guest;
+    else if ( is_pvh_domain(d) )
+        info->flags |= XEN_DOMINF_pvh_guest;
 
     xsm_security_domaininfo(d, info);
 
@@ -437,6 +439,9 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         domcr_flags = 0;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
             domcr_flags |= DOMCRF_hvm;
+        else if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
+            domcr_flags |= DOMCRF_pvh;     /* PV with HAP is a PVH guest */
+
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
             domcr_flags |= DOMCRF_hap;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 72fb905..3bba758 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -289,7 +289,11 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             if ( current->domain == dom0 )
                 fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-            if ( !is_hvm_vcpu(current) )
+            if ( is_pvh_vcpu(current) )
+                fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+                             (1U << XENFEAT_supervisor_mode_kernel) |
+                             (1U << XENFEAT_hvm_callback_vector);
+            else if ( !is_hvm_vcpu(current) )
                 fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
                              (1U << XENFEAT_highmem_assist) |
                              (1U << XENFEAT_gnttab_map_avail_bits);
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 13/17] PVH xen: p2m related changes.
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (11 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 12/17] PVH xen: support invalid op, return PVH features etc Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-25 11:28   ` Tim Deegan
  2013-04-23 21:26 ` [PATCH 14/17] PVH xen: Add and remove foreign pages Mukesh Rathor
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

In this patch, I introduce a new type, p2m_map_foreign, for pages that a
dom0 maps from the foreign domains it is creating. Also, add
set_foreign_p2m_entry() to map p2m_map_foreign type pages. Other misc
changes related to p2m.
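
The intended usage, with the caller arriving in a later patch of this
series, is roughly (illustrative sketch only, error handling elided):

    p2m_type_t t;
    struct page_info *pg = get_page_from_gfn(fdom, fgfn, &t, P2M_ALLOC);

    if ( pg && !set_foreign_p2m_entry(currd, gpfn, _mfn(page_to_mfn(pg))) )
        put_page(pg);     /* mapping failed, so drop the refcount again */
    /* On success, the refcount taken above is only dropped when the page is
     * removed again via XENMEM_remove_from_physmap. */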

Changes in V2:
   - Make guest_physmap_add_entry() same for PVH in terms of overwriting old
     entry.
   - In set_foreign_p2m_entry() do locked get_gfn and not unlocked.
   - Replace ASSERT with return -EINVAL in do_physdev_op.
   - Remove unnecessary check for PVH in do_physdev_op().

Changes in V3:
   - remove changes unrelated to this patch.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domctl.c     |   14 ++++++++------
 xen/arch/x86/mm/p2m-ept.c |    3 ++-
 xen/arch/x86/mm/p2m-pt.c  |    3 ++-
 xen/arch/x86/mm/p2m.c     |   31 ++++++++++++++++++++++++++++++-
 xen/include/asm-x86/p2m.h |    4 ++++
 5 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 88fe868..dc161c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -68,9 +68,10 @@ long domctl_memory_mapping(struct domain *d, unsigned long gfn,
 
     if ( add_map )
     {
-        printk(XENLOG_G_INFO
-               "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
-               d->domain_id, gfn, mfn, nr_mfns);
+        if ( !is_pvh_domain(d) )   /* PVH maps many ranges; avoid log spam */
+            printk(XENLOG_G_INFO
+                   "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
+                   d->domain_id, gfn, mfn, nr_mfns);
 
         ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
         if ( !ret && paging_mode_translate(d) )
@@ -93,9 +94,10 @@ long domctl_memory_mapping(struct domain *d, unsigned long gfn,
             }
         }
     } else {
-        printk(XENLOG_G_INFO
-               "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
-               d->domain_id, gfn, mfn, nr_mfns);
+        if ( !is_pvh_domain(d) )  /* PVH unmaps many ranges; avoid log spam */
+            printk(XENLOG_G_INFO
+                   "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
+                   d->domain_id, gfn, mfn, nr_mfns);
 
         if ( paging_mode_translate(d) )
             for ( i = 0; i < nr_mfns; i++ )
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 595c6e7..cb8c2df 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -75,6 +75,7 @@ static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_acces
             entry->w = 0;
             break;
         case p2m_grant_map_rw:
+        case p2m_map_foreign:
             entry->r = entry->w = 1;
             entry->x = 0;
             break;
@@ -431,7 +432,7 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
     }
 
     /* Track the highest gfn for which we have ever had a valid mapping */
-    if ( p2mt != p2m_invalid &&
+    if ( p2mt != p2m_invalid && p2mt != p2m_mmio_dm &&
          (gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
         p2m->max_mapped_pfn = gfn + (1UL << order) - 1;
 
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 302b621..3f46418 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -89,6 +89,7 @@ static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
     case p2m_ram_rw:
         return flags | P2M_BASE_FLAGS | _PAGE_RW;
     case p2m_grant_map_rw:
+    case p2m_map_foreign:
         return flags | P2M_BASE_FLAGS | _PAGE_RW | _PAGE_NX_BIT;
     case p2m_mmio_direct:
         if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
@@ -429,7 +430,7 @@ p2m_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
     }
 
     /* Track the highest gfn for which we have ever had a valid mapping */
-    if ( p2mt != p2m_invalid
+    if ( p2mt != p2m_invalid && p2mt != p2m_mmio_dm
          && (gfn + (1UL << page_order) - 1 > p2m->max_mapped_pfn) )
         p2m->max_mapped_pfn = gfn + (1UL << page_order) - 1;
 
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index f5ddd20..17cb78f 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -523,7 +523,7 @@ p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn, unsigned long mfn,
         for ( i = 0; i < (1UL << page_order); i++ )
         {
             mfn_return = p2m->get_entry(p2m, gfn + i, &t, &a, 0, NULL);
-            if ( !p2m_is_grant(t) && !p2m_is_shared(t) )
+            if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
                 set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY);
             ASSERT( !p2m_is_valid(t) || mfn + i == mfn_x(mfn_return) );
         }
@@ -754,7 +754,36 @@ void p2m_change_type_range(struct domain *d,
     p2m_unlock(p2m);
 }
 
+/* Returns: 1 on success, 0 on failure. */
+int set_foreign_p2m_entry(struct domain *dp, unsigned long gfn, mfn_t mfn)
+{
+    int rc = 0;
+    p2m_type_t ot;
+    mfn_t omfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(dp);
 
+    if ( !paging_mode_translate(dp) )
+        return 0;
+
+    omfn = get_gfn_query(dp, gfn, &ot);
+    if ( mfn_valid(omfn) )
+    {
+        gdprintk(XENLOG_ERR, "Already mapped mfn %lx at gfn:%lx\n",
+                 mfn_x(omfn), gfn);
+        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+    }
+    put_gfn(dp, gfn);
+
+    P2M_DEBUG("set foreign %lx %lx\n", gfn, mfn_x(mfn));
+    p2m_lock(p2m);
+    rc = set_p2m_entry(p2m, gfn, mfn, 0, p2m_map_foreign, p2m->default_access);
+    p2m_unlock(p2m);
+    if ( rc == 0 )
+        gdprintk(XENLOG_ERR,
+                 "set_foreign_p2m_entry: set_p2m_entry failed! gfn:%lx "
+                 "mfn:%lx\n", gfn, mfn_x(mfn));
+    return rc;
+}
 
 int
 set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 43583b2..b76dc33 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -70,6 +70,7 @@ typedef enum {
     p2m_ram_paging_in = 11,       /* Memory that is being paged in */
     p2m_ram_shared = 12,          /* Shared or sharable memory */
     p2m_ram_broken = 13,          /* Broken page, access cause domain crash */
+    p2m_map_foreign = 14,         /* ram pages from foreign domain */
 } p2m_type_t;
 
 /*
@@ -180,6 +181,7 @@ typedef unsigned int p2m_query_t;
 #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES)
 #define p2m_is_shared(_t)   (p2m_to_mask(_t) & P2M_SHARED_TYPES)
 #define p2m_is_broken(_t)   (p2m_to_mask(_t) & P2M_BROKEN_TYPES)
+#define p2m_is_foreign(_t)  (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign))
 
 /* Per-p2m-table state */
 struct p2m_domain {
@@ -510,6 +512,8 @@ p2m_type_t p2m_change_type(struct domain *d, unsigned long gfn,
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn);
 
+/* Set foreign mfn in the current guest's p2m table (for pvh dom0) */
+int set_foreign_p2m_entry(struct domain *domp, unsigned long gfn, mfn_t mfn);
 
 /* 
  * Populate-on-demand
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 14/17] PVH xen: Add and remove foreign pages
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (12 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 13/17] PVH xen: p2m related changes Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-25 11:38   ` Tim Deegan
  2013-04-23 21:26 ` [PATCH 15/17] PVH xen: Miscellaneous changes Mukesh Rathor
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

In this patch, a new function, xenmem_add_foreign_to_pmap(), is added
to map pages from a foreign guest into the current dom0, for domU creation.
Also, allow XENMEM_remove_from_physmap to remove p2m_map_foreign
pages. Note that in this path we must release the refcount that was taken
during the map phase.
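
For completeness, the toolstack-side request that lands here comes in via
the interface added in patch 02; a minimal sketch of such a caller, with
field names per xen_add_to_physmap_range and purely illustrative otherwise:

    xen_ulong_t idx = fgfn;        /* gfn in the domU being built */
    xen_pfn_t gpfn = free_slot;    /* slot in dom0's own physmap */
    int err = 0;
    struct xen_add_to_physmap_range xatpr = {
        .domid = DOMID_SELF,
        .space = XENMAPSPACE_gmfn_foreign,
        .size = 1,
        .foreign_domid = domu_domid,
    };

    set_xen_guest_handle(xatpr.idxs, &idx);
    set_xen_guest_handle(xatpr.gpfns, &gpfn);
    set_xen_guest_handle(xatpr.errs, &err);
    rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatpr);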

Changes in V2:
  - Move the XENMEM_remove_from_physmap changes here instead of prev patch
  - Move grant changes from this to one of the next patches.
  - In xenmem_add_foreign_to_pmap(), do locked get_gfn
  - Fail the mappings for qemu mapping pages for memory not there.

Changes in V3:
  - remove mmio pages.
  - remove unrelated changes.
  - cleanup both add and remove.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/mm.c   |   80 +++++++++++++++++++++++++++++++++++++++++++++++++++
 xen/common/memory.c |   38 +++++++++++++++++++++---
 2 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d9bdded..2316981 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4270,6 +4270,78 @@ static int handle_iomem_range(unsigned long s, unsigned long e, void *p)
     return 0;
 }
 
+/*
+ * Add frames from foreign domain to current domain's physmap. Similar to
+ * XENMAPSPACE_gmfn but the frame is foreign being mapped into current,
+ * and is not removed from foreign domain.
+ * Usage: libxl on pvh dom0 creating a guest and doing privcmd_ioctl_mmap.
+ * Side Effect: the mfn for fgfn will be refcounted so it is not lost
+ *              while mapped here. The refcnt is released in do_memory_op()
+ *              via XENMEM_remove_from_physmap.
+ * Returns: 0 ==> success
+ */
+static int xenmem_add_foreign_to_pmap(domid_t foreign_domid,
+                                      unsigned long fgfn, unsigned long gpfn)
+{
+    p2m_type_t p2mt, p2mt_prev;
+    int rc = 0;
+    unsigned long prev_mfn, mfn = 0;
+    struct domain *fdom, *currd = current->domain;
+    struct page_info *page = NULL;
+
+    if ( currd->domain_id == foreign_domid || foreign_domid == DOMID_SELF ||
+         !is_pvh_domain(currd) )
+        return -EINVAL;
+
+    if ( !IS_PRIV(currd) || (fdom = get_pg_owner(foreign_domid)) == NULL )
+        return -EPERM;
+
+    /* following will take a refcnt on the mfn */
+    page = get_page_from_gfn(fdom, fgfn, &p2mt, P2M_ALLOC);
+    if ( !page || !p2m_is_valid(p2mt) )
+    {
+        if ( page )
+            put_page(page);
+        put_pg_owner(fdom);
+        return -EINVAL;
+    }
+    mfn = page_to_mfn(page);
+
+    /* Remove previously mapped page if it is present. */
+    prev_mfn = mfn_x(get_gfn(currd, gpfn, &p2mt_prev));
+    if ( mfn_valid(prev_mfn) )
+    {
+        if ( is_xen_heap_mfn(prev_mfn) )
+            /* Xen heap frames are simply unhooked from this phys slot */
+            guest_physmap_remove_page(currd, gpfn, prev_mfn, 0);
+        else
+            /* Normal domain memory is freed, to avoid leaking memory. */
+            guest_remove_page(currd, gpfn);
+    }
+    /*
+     * Create the new mapping. Can't use guest_physmap_add_page() because it
+     * would update the m2p table, making the mfn map to dom0's gpfn rather
+     * than the domU's fgfn.
+     */
+    if ( set_foreign_p2m_entry(currd, gpfn, _mfn(mfn)) == 0 )
+    {
+        dprintk(XENLOG_WARNING,
+                "guest_physmap_add_page failed. gpfn:%lx mfn:%lx fgfn:%lx\n",
+                gpfn, mfn, fgfn);
+        put_page(page);
+        rc = -EINVAL;
+    }
+
+    /*
+     * We must do this put_gfn after set_foreign_p2m_entry so another cpu
+     * doesn't populate the gpfn before us.
+     */
+    put_gfn(currd, gpfn);
+
+    put_pg_owner(fdom);
+    return rc;
+}
+
 static int xenmem_add_to_physmap_once(
     struct domain *d,
     const struct xen_add_to_physmap *xatp,
@@ -4332,6 +4404,14 @@ static int xenmem_add_to_physmap_once(
             page = mfn_to_page(mfn);
             break;
         }
+
+        case XENMAPSPACE_gmfn_foreign:
+        {
+            rc = xenmem_add_foreign_to_pmap(foreign_domid, xatp->idx,
+                                            xatp->gpfn);
+            return rc;
+        }
+
         default:
             break;
     }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 68501d1..a321d33 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -675,9 +675,11 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case XENMEM_remove_from_physmap:
     {
+        unsigned long mfn;
         struct xen_remove_from_physmap xrfp;
         struct page_info *page;
-        struct domain *d;
+        struct domain *d, *foreign_dom = NULL;
+        p2m_type_t p2mt, tp;
 
         if ( copy_from_guest(&xrfp, arg, 1) )
             return -EFAULT;
@@ -695,11 +697,37 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         domain_lock(d);
 
-        page = get_page_from_gfn(d, xrfp.gpfn, NULL, P2M_ALLOC);
-        if ( page )
+        /*
+         * If PVH, the gfn could be mapped to an mfn from a foreign domain
+         * by the user-space tool during domain creation. We need to check
+         * for that, free it up from the p2m, and release the refcnt on it.
+         * In such a case, page would be NULL and the following call would
+         * not have refcnt'd the page. See also xenmem_add_foreign_to_pmap().
+         */
+        page = get_page_from_gfn(d, xrfp.gpfn, &p2mt, P2M_ALLOC);
+
+        if ( page || p2m_is_foreign(p2mt) )
         {
-            guest_physmap_remove_page(d, xrfp.gpfn, page_to_mfn(page), 0);
-            put_page(page);
+            if ( page )
+                mfn = page_to_mfn(page);
+            else
+            {
+                mfn = mfn_x(get_gfn_query(d, xrfp.gpfn, &tp));
+                foreign_dom = page_get_owner(mfn_to_page(mfn));
+                ASSERT(is_pvh_domain(d));
+                ASSERT(d != foreign_dom);
+                ASSERT(p2m_is_foreign(tp));
+            }
+
+            guest_physmap_remove_page(d, xrfp.gpfn, mfn, 0);
+            if ( page )
+                put_page(page);
+
+            if ( p2m_is_foreign(p2mt) )
+            {
+                put_page(mfn_to_page(mfn));
+                put_gfn(d, xrfp.gpfn);
+            }
         }
         else
             rc = -ENOENT;
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 15/17]  PVH xen: Miscellaneous changes
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (13 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 14/17] PVH xen: Add and remove foreign pages Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-24  9:06   ` Jan Beulich
  2013-04-23 21:26 ` [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH Mukesh Rathor
  2013-04-23 21:26 ` [PATCH 17/17] PVH xen: PVH dom0 creation Mukesh Rathor
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

This patch contains miscellaneous changes, like rejecting the iopl and
iobitmap physdev calls for PVH, refusing 32-bit PVH guests, etc.
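
Since PHYSDEVOP_set_iopl is now rejected, a PVH kernel that wants to give
ring 3 access to I/O ports changes IOPL directly in EFLAGS (it runs at
CPL 0 in VMX non-root mode; cf. the NOTE in vmx_pvh.c). A minimal sketch,
illustrative only:

    static inline void pvh_set_iopl_3(void)
    {
        unsigned long flags;

        asm volatile ( "pushfq ; popq %0" : "=r" (flags) );
        flags |= 0x3000;                  /* EFLAGS.IOPL = 3 */
        asm volatile ( "pushq %0 ; popfq" : : "r" (flags) : "cc" );
    }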

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c      |    7 +++++++
 xen/arch/x86/domain_page.c |   10 +++++-----
 xen/arch/x86/domctl.c      |    5 +++++
 xen/arch/x86/mm.c          |    2 +-
 xen/arch/x86/physdev.c     |   13 +++++++++++++
 xen/common/grant_table.c   |    4 ++--
 xen/include/public/xen.h   |    2 ++
 7 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b1fd758..c895cec 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -341,6 +341,13 @@ int switch_compat(struct domain *d)
 
     if ( d == NULL )
         return -EINVAL;
+
+    if ( is_pvh_domain(d) )
+    {
+        dprintk(XENLOG_G_ERR,
+                "Xen does not currently support 32bit PVH guests\n");
+        return -EINVAL;
+    }
     if ( !may_switch_mode(d) )
         return -EACCES;
     if ( is_pv_32on64_domain(d) )
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index efda6af..7685416 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -34,7 +34,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
      * then it means we are running on the idle domain's page table and must
      * therefore use its mapcache.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
     {
         /* If we really are idling, perform lazy context switch now. */
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
@@ -71,7 +71,7 @@ void *map_domain_page(unsigned long mfn)
 #endif
 
     v = mapcache_current_vcpu();
-    if ( !v || is_hvm_vcpu(v) )
+    if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn);
 
     dcache = &v->domain->arch.pv_domain.mapcache;
@@ -175,7 +175,7 @@ void unmap_domain_page(const void *ptr)
     ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
     v = mapcache_current_vcpu();
-    ASSERT(v && !is_hvm_vcpu(v));
+    ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv_domain.mapcache;
     ASSERT(dcache->inuse);
@@ -242,7 +242,7 @@ int mapcache_domain_init(struct domain *d)
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
     unsigned int bitmap_pages;
 
-    if ( is_hvm_domain(d) || is_idle_domain(d) )
+    if ( !is_pv_domain(d) || is_idle_domain(d) )
         return 0;
 
 #ifdef NDEBUG
@@ -273,7 +273,7 @@ int mapcache_vcpu_init(struct vcpu *v)
     unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( is_hvm_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) || !dcache->inuse )
         return 0;
 
     if ( ents > dcache->entries )
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index dc161c7..8f63a0b 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1308,6 +1308,11 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
             c.nat->gs_base_kernel = hvm_get_shadow_gs_base(v);
         }
     }
+    else if ( is_pvh_vcpu(v) )
+    {
+        /* pvh fixme: punt it to phase II */
+        dprintk(XENLOG_ERR, "PVH: fixme: arch_get_info_guest()\n");
+    }
     else
     {
         c(ldt_base = v->arch.pv_vcpu.ldt_base);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2316981..6266876 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2657,7 +2657,7 @@ static struct domain *get_pg_owner(domid_t domid)
         goto out;
     }
 
-    if ( unlikely(paging_mode_translate(curr)) )
+    if ( !is_pvh_domain(curr) && unlikely(paging_mode_translate(curr)) )
     {
         MEM_LOG("Cannot mix foreign mappings with translated domains");
         goto out;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 876ac9d..78d9492 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -475,6 +475,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case PHYSDEVOP_set_iopl: {
         struct physdev_set_iopl set_iopl;
+
+        if ( is_pvh_vcpu(current) )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
         ret = -EFAULT;
         if ( copy_from_guest(&set_iopl, arg, 1) != 0 )
             break;
@@ -488,6 +495,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case PHYSDEVOP_set_iobitmap: {
         struct physdev_set_iobitmap set_iobitmap;
+
+        if ( is_pvh_vcpu(current) )
+        {
+            ret = -EINVAL;
+            break;
+        }
         ret = -EFAULT;
         if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 )
             break;
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 3f97328..a2073d2 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -721,7 +721,7 @@ __gnttab_map_grant_ref(
 
     double_gt_lock(lgt, rgt);
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
@@ -932,7 +932,7 @@ __gnttab_unmap_common(
             act->pin -= GNTPIN_hstw_inc;
     }
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 3cab74f..0d433a7 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -693,6 +693,8 @@ typedef struct shared_info shared_info_t;
  *      c. list of allocated page frames [mfn_list, nr_pages]
  *         (unless relocated due to XEN_ELFNOTE_INIT_P2M)
  *      d. start_info_t structure        [register ESI (x86)]
+ *      d1. struct shared_info_t                [shared_info]
+ *                   (present only for auto-translated guests)
  *      e. bootstrap page tables         [pt_base and CR3 (x86)]
  *      f. bootstrap stack               [register ESP (x86)]
  *  4. Bootstrap elements are packed together, but each is 4kB-aligned.
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (14 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 15/17] PVH xen: Miscellaneous changes Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-24  9:15   ` Jan Beulich
  2013-04-23 21:26 ` [PATCH 17/17] PVH xen: PVH dom0 creation Mukesh Rathor
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

This patch prepares for dom0 PVH by making some changes in the elf
code: add a new parameter to indicate a PVH dom0 and use a different
copy function for PVH. Also, add a check in iommu.c to verify that the
iommu is enabled for a PVH dom0.
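
A condensed view of the resulting call sites (per the libelf.h hunk
below, the extra argument exists only in the __XEN__ build; 0 means
"copy directly", a non-zero v_start selects the PVH path):

    /* Hypervisor build: */
    rc = elf_load_binary(&elf, is_pvh_domain(d) ? v_start : 0);

    /* Tools build (signature unchanged): */
    rc = elf_load_binary(&elf);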

Changes in V2: None

Changes in V3:
   - introduce early_pvh_copy_or_zero() to replace dbg_rw_mem().

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain_build.c       |   39 +++++++++++++++++++++++++++++++++++-
 xen/common/libelf/libelf-loader.c |   40 +++++++++++++++++++++++++++++++++---
 xen/drivers/passthrough/iommu.c   |   22 ++++++++++++++++++-
 xen/include/xen/libelf.h          |    3 +-
 4 files changed, 96 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index c8f435d..f8cae52 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -307,6 +307,43 @@ static void __init process_dom0_ioports_disable(void)
     }
 }
 
+ /*
+  * Copy or zero function for dom0 only during boot. This is because
+  * raw_copy_to_guest -> copy_to_user_hvm -> __hvm_copy needs curr to
+  * point to the hvm/pvh vcpu, which is not fully set up yet.
+  *
+  * If src is NULL, then len bytes are zeroed.
+  */
+void __init early_pvh_copy_or_zero(unsigned long dest, char *src, int len,
+                                   unsigned long v_start)
+{
+    while ( len > 0 )
+    {
+        char *va;
+        p2m_type_t gfntype;
+        unsigned long mfn, gfn, pagecnt;
+        struct domain *d = get_domain_by_id(0);
+
+        pagecnt = min_t(unsigned long, PAGE_SIZE - (dest & ~PAGE_MASK), len);
+
+        gfn = (dest - v_start) >> PAGE_SHIFT;
+        if ( (mfn = mfn_x(get_gfn_query(d, gfn, &gfntype))) == INVALID_MFN )
+            panic("Unable to get mfn for gfn:%lx\n", gfn);
+        put_gfn(d, gfn);
+
+        va = map_domain_page(mfn) + (dest & (PAGE_SIZE-1));
+        if ( src )
+            memcpy(va, src, pagecnt);
+        else
+            memset(va, 0, pagecnt);
+        unmap_domain_page(va);
+
+        dest += pagecnt;
+        src = src ? src + pagecnt : 0;
+        len -= pagecnt;
+    }
+}
+
 int __init construct_dom0(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
@@ -766,7 +803,7 @@ int __init construct_dom0(
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest = (void*)vkern_start;
-    rc = elf_load_binary(&elf);
+    rc = elf_load_binary(&elf, (is_pvh_domain(d) ? v_start : 0));
     if ( rc < 0 )
     {
         printk("Failed to load the kernel binary\n");
diff --git a/xen/common/libelf/libelf-loader.c b/xen/common/libelf/libelf-loader.c
index 3cf9c59..077f3dd 100644
--- a/xen/common/libelf/libelf-loader.c
+++ b/xen/common/libelf/libelf-loader.c
@@ -108,7 +108,8 @@ void elf_set_log(struct elf_binary *elf, elf_log_callback *log_callback,
     elf->verbose = verbose;
 }
 
-static int elf_load_image(void *dst, const void *src, uint64_t filesz, uint64_t memsz)
+static int elf_load_image(void *dst, const void *src, uint64_t filesz,
+                          uint64_t memsz, int not_used)
 {
     memcpy(dst, src, filesz);
     memset(dst + filesz, 0, memsz - filesz);
@@ -122,11 +123,25 @@ void elf_set_verbose(struct elf_binary *elf)
     elf->verbose = 1;
 }
 
-static int elf_load_image(void *dst, const void *src, uint64_t filesz, uint64_t memsz)
+extern void __init early_pvh_copy_or_zero(unsigned long dest, char *src,
+                                          int len, unsigned long v_start);
+
+static int elf_load_image(void *dst, const void *src, uint64_t filesz,
+                          uint64_t memsz, unsigned long v_start)
 {
     int rc;
     if ( filesz > ULONG_MAX || memsz > ULONG_MAX )
         return -1;
+
+    if ( v_start )
+    {
+        unsigned long addr = (unsigned long)dst;
+        early_pvh_copy_or_zero(addr, (char *)src, filesz, v_start);
+        early_pvh_copy_or_zero(addr + filesz, NULL, memsz - filesz, v_start);
+
+        return 0;
+    }
+
     rc = raw_copy_to_guest(dst, src, filesz);
     if ( rc != 0 )
         return -1;
@@ -260,7 +275,11 @@ void elf_parse_binary(struct elf_binary *elf)
             __FUNCTION__, elf->pstart, elf->pend);
 }
 
-int elf_load_binary(struct elf_binary *elf)
+/*
+ * This function called from the libraries when building guests, and also for
+ * dom0 from construct_dom0().
+ */
+static int _elf_load_binary(struct elf_binary *elf, unsigned long v_start)
 {
     const elf_phdr *phdr;
     uint64_t i, count, paddr, offset, filesz, memsz;
@@ -279,7 +298,8 @@ int elf_load_binary(struct elf_binary *elf)
         dest = elf_get_ptr(elf, paddr);
         elf_msg(elf, "%s: phdr %" PRIu64 " at 0x%p -> 0x%p\n",
                 __func__, i, dest, dest + filesz);
-        if ( elf_load_image(dest, elf->image + offset, filesz, memsz) != 0 )
+        if ( elf_load_image(dest, elf->image + offset, filesz, memsz,
+                            v_start) != 0 )
             return -1;
     }
 
@@ -287,6 +307,18 @@ int elf_load_binary(struct elf_binary *elf)
     return 0;
 }
 
+#ifdef __XEN__
+int elf_load_binary(struct elf_binary *elf, unsigned long v_start)
+{
+    return _elf_load_binary(elf, v_start);
+}
+#else
+int elf_load_binary(struct elf_binary *elf)
+{
+    return _elf_load_binary(elf, 0);
+}
+#endif
+
 void *elf_get_ptr(struct elf_binary *elf, unsigned long addr)
 {
     return elf->dest + addr - elf->pstart;
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 93ad122..64ba44e 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -125,15 +125,25 @@ int iommu_domain_init(struct domain *d)
     return hd->platform_ops->init(d);
 }
 
+static inline void check_dom0_pvh_reqs(struct domain *d)
+{
+    if ( !iommu_enabled || iommu_passthrough )
+        panic("For PVH dom0, iommu must be enabled and dom0-passthrough must "
+              "not be enabled\n");
+}
+
 void __init iommu_dom0_init(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
+    if ( is_pvh_domain(d) )
+        check_dom0_pvh_reqs(d);
+
     if ( !iommu_enabled )
         return;
 
     register_keyhandler('o', &iommu_p2m_table);
-    d->need_iommu = !!iommu_dom0_strict;
+    d->need_iommu = is_pvh_domain(d) || !!iommu_dom0_strict;
     if ( need_iommu(d) )
     {
         struct page_info *page;
@@ -146,7 +156,15 @@ void __init iommu_dom0_init(struct domain *d)
                  ((page->u.inuse.type_info & PGT_type_mask)
                   == PGT_writable_page) )
                 mapping |= IOMMUF_writable;
-            hd->platform_ops->map_page(d, mfn, mfn, mapping);
+
+            if ( is_pvh_domain(d) )
+            {
+                unsigned long gfn = mfn_to_gfn(d, _mfn(mfn));
+                hd->platform_ops->map_page(d, gfn, mfn, mapping);
+            }
+            else
+                hd->platform_ops->map_page(d, mfn, mfn, mapping);
+
             if ( !(i++ & 0xfffff) )
                 process_pending_softirqs();
         }
diff --git a/xen/include/xen/libelf.h b/xen/include/xen/libelf.h
index 218bb18..9d695b7 100644
--- a/xen/include/xen/libelf.h
+++ b/xen/include/xen/libelf.h
@@ -192,13 +192,14 @@ int elf_phdr_is_loadable(struct elf_binary *elf, const elf_phdr * phdr);
 int elf_init(struct elf_binary *elf, const char *image, size_t size);
 #ifdef __XEN__
 void elf_set_verbose(struct elf_binary *elf);
+int elf_load_binary(struct elf_binary *elf, unsigned long v_start);
 #else
 void elf_set_log(struct elf_binary *elf, elf_log_callback*,
                  void *log_caller_pointer, int verbose);
+int elf_load_binary(struct elf_binary *elf);
 #endif
 
 void elf_parse_binary(struct elf_binary *elf);
-int elf_load_binary(struct elf_binary *elf);
 
 void *elf_get_ptr(struct elf_binary *elf, unsigned long addr);
 uint64_t elf_lookup_addr(struct elf_binary *elf, const char *symbol);
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
                   ` (15 preceding siblings ...)
  2013-04-23 21:26 ` [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH Mukesh Rathor
@ 2013-04-23 21:26 ` Mukesh Rathor
  2013-04-24  9:28   ` Jan Beulich
  16 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-23 21:26 UTC (permalink / raw)
  To: Xen-devel

Finally, the hardest part: modify construct_dom0() so that dom0 can be
booted in PVH mode. Introduce opt_dom0pvh which, when specified on the
command line, causes dom0 to boot in PVH mode.
Note, the call to elf_load_binary() is moved down until after the
required PVH setup so that the same code path can be used for both PV
and PVH.
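
For illustration, enabling it would look roughly like this on the Xen
command line in the bootloader config (exact syntax depends on the
bootloader; dom0pvh is a boolean_param, so "dom0pvh" alone works too):

    multiboot /boot/xen.gz dom0pvh=1 console=com1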

Change in V2:
  - Map the entire IO region upfront in the P2M for PVH dom0.

Change in V3:
   - Fix pvh_map_all_iomem() to make sure we map up to 4GB of IO space.
  - remove use of dbg_* functions.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain_build.c |  268 ++++++++++++++++++++++++++++++++++---------
 xen/arch/x86/mm/hap/hap.c   |   14 +++
 xen/arch/x86/setup.c        |   10 ++-
 xen/include/asm-x86/hap.h   |    1 +
 4 files changed, 236 insertions(+), 57 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index f8cae52..2558017 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -35,6 +35,7 @@
 #include <asm/setup.h>
 #include <asm/bzimage.h> /* for bzimage_parse */
 #include <asm/io_apic.h>
+#include <asm/hap.h>
 
 #include <public/version.h>
 
@@ -307,6 +308,69 @@ static void __init process_dom0_ioports_disable(void)
     }
 }
 
+/*
+ * Set the 1:1 map for all non-RAM regions for dom0. Thus, dom0 will have
+ * the entire IO region mapped in the EPT/NPT.
+ */
+static __init void pvh_map_all_iomem(struct domain *d)
+{
+    unsigned long start_pfn, end_pfn, end, start = 0;
+    const struct e820entry *entry;
+    unsigned int i, nump;
+    int rc;
+
+    for ( i = 0, entry = e820.map; i < e820.nr_map; i++, entry++ )
+    {
+        end = entry->addr + entry->size;
+
+        if ( entry->type == E820_RAM || entry->type == E820_UNUSABLE ||
+             i == e820.nr_map - 1 )
+        {
+            start_pfn = PFN_DOWN(start);
+            end_pfn = PFN_UP(end);
+
+            if ( entry->type == E820_RAM || entry->type == E820_UNUSABLE )
+                end_pfn = PFN_UP(entry->addr);
+
+            if ( start_pfn < end_pfn )
+            {
+                nump = end_pfn - start_pfn;
+                /* Add pages to the mapping */
+                rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);
+                BUG_ON(rc);
+            }
+            start = end;
+        }
+    }
+
+    /* If the e820 ended under 4GB, we must map the remaining space up to 4GB */
+    if ( end < GB(4) )
+    {
+        start_pfn = PFN_UP(end);
+        end_pfn = (GB(4)) >> PAGE_SHIFT;
+        nump = end_pfn - start_pfn;
+        rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);
+        BUG_ON(rc);
+    }
+}
+
+static __init void dom0_update_physmap(struct domain *d, unsigned long pfn,
+                                   unsigned long mfn, unsigned long vphysmap_s)
+{
+    if ( is_pvh_domain(d) )
+    {
+        int rc = guest_physmap_add_page(d, pfn, mfn, 0);
+        BUG_ON(rc);
+        return;
+    }
+    if ( !is_pv_32on64_domain(d) )
+        ((unsigned long *)vphysmap_s)[pfn] = mfn;
+    else
+        ((unsigned int *)vphysmap_s)[pfn] = mfn;
+
+    set_gpfn_from_mfn(mfn, pfn);
+}
+
  /*
  * Copy or zero function for dom0 only during boot. This is because
   * raw_copy_to_guest -> copy_to_user_hvm -> __hvm_copy needs curr to
@@ -351,6 +415,7 @@ int __init construct_dom0(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline)
 {
+    char *si_buf = NULL;
     int i, cpu, rc, compatible, compat32, order, machine;
     struct cpu_user_regs *regs;
     unsigned long pfn, mfn;
@@ -359,7 +424,7 @@ int __init construct_dom0(
     unsigned long alloc_spfn;
     unsigned long alloc_epfn;
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
-    unsigned long count;
+    unsigned long count, shared_info_paddr = 0;
     struct page_info *page = NULL;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
@@ -448,11 +513,19 @@ int __init construct_dom0(
         return -EINVAL;
     }
 
-    if ( parms.elf_notes[XEN_ELFNOTE_SUPPORTED_FEATURES].type != XEN_ENT_NONE &&
-         !test_bit(XENFEAT_dom0, parms.f_supported) )
+    if ( parms.elf_notes[XEN_ELFNOTE_SUPPORTED_FEATURES].type != XEN_ENT_NONE )
     {
-        printk("Kernel does not support Dom0 operation\n");
-        return -EINVAL;
+        if ( !test_bit(XENFEAT_dom0, parms.f_supported) )
+        {
+            printk("Kernel does not support Dom0 operation\n");
+            return -EINVAL;
+        }
+        if ( is_pvh_domain(d) &&
+             !test_bit(XENFEAT_hvm_callback_vector, parms.f_supported) )
+        {
+            printk("Kernel does not support PVH mode\n");
+            return -EINVAL;
+        }
     }
 
     if ( compat32 )
@@ -517,6 +590,14 @@ int __init construct_dom0(
     vstartinfo_end   = (vstartinfo_start +
                         sizeof(struct start_info) +
                         sizeof(struct dom0_vga_console_info));
+
+    if ( is_pvh_domain(d) )
+    {
+        /* note: the following is a paddr, as opposed to a maddr */
+        shared_info_paddr = round_pgup(vstartinfo_end) - v_start;
+        vstartinfo_end   += PAGE_SIZE;
+    }
+
     vpt_start        = round_pgup(vstartinfo_end);
     for ( nr_pt_pages = 2; ; nr_pt_pages++ )
     {
@@ -658,16 +739,34 @@ int __init construct_dom0(
         maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
     }
-    clear_page(l4tab);
-    init_guest_l4_table(l4tab, d);
-    v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
-    if ( is_pv_32on64_domain(d) )
-        v->arch.guest_table_user = v->arch.guest_table;
+    if ( is_pvh_domain(d) )
+    {
+        v->arch.cr3 = v->arch.hvm_vcpu.guest_cr[3] = (vpt_start - v_start);
+
+        /* HAP is required for PVH and pfns are serially mapped there */
+        pfn = 0;
+    }
+    else
+    {
+        clear_page(l4tab);
+        init_guest_l4_table(l4tab, d);
+        v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+        if ( is_pv_32on64_domain(d) )
+            v->arch.guest_table_user = v->arch.guest_table;
+        pfn = alloc_spfn;
+    }
 
     l4tab += l4_table_offset(v_start);
-    pfn = alloc_spfn;
     for ( count = 0; count < ((v_end-v_start)>>PAGE_SHIFT); count++ )
     {
+        /*
+         * initrd chunk's mfns are allocated from a separate mfn chunk. Hence
+         * we need to adjust for them.
+         */
+        signed long pvh_adj = is_pvh_domain(d) ?
+                              (PFN_UP(initrd_len) - alloc_spfn) << PAGE_SHIFT
+                              : 0;
+
         if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
         {
             maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
@@ -694,16 +793,17 @@ int __init construct_dom0(
                     clear_page(l3tab);
                     if ( count == 0 )
                         l3tab += l3_table_offset(v_start);
-                    *l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
+                    *l4tab = l4e_from_paddr(__pa(l3start) + pvh_adj, L4_PROT);
                     l4tab++;
                 }
-                *l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
+                *l3tab = l3e_from_paddr(__pa(l2start) + pvh_adj, L3_PROT);
                 l3tab++;
             }
-            *l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
+            *l2tab = l2e_from_paddr(__pa(l1start) + pvh_adj, L2_PROT);
             l2tab++;
         }
-        if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
+        if ( is_pvh_domain(d) ||
+             count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
             mfn = pfn++;
         else
             mfn = initrd_mfn++;
@@ -711,6 +811,9 @@ int __init construct_dom0(
                                     L1_PROT : COMPAT_L1_PROT));
         l1tab++;
 
+        if ( is_pvh_domain(d) )
+            continue;
+
         page = mfn_to_page(mfn);
         if ( (page->u.inuse.type_info == 0) &&
              !get_page_and_type(page, d, PGT_writable_page) )
@@ -739,6 +842,9 @@ int __init construct_dom0(
                COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
     }
 
+    if  ( is_pvh_domain(d) )
+        goto pvh_skip_pt_rdonly;
+
     /* Pages that are part of page tables must be read only. */
     l4tab = l4start + l4_table_offset(vpt_start);
     l3start = l3tab = l4e_to_l3e(*l4tab);
@@ -778,6 +884,8 @@ int __init construct_dom0(
         }
     }
 
+pvh_skip_pt_rdonly:
+
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
         shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
@@ -801,35 +909,20 @@ int __init construct_dom0(
     write_ptbase(v);
     mapcache_override_current(v);
 
-    /* Copy the OS image and free temporary buffer. */
-    elf.dest = (void*)vkern_start;
-    rc = elf_load_binary(&elf, (is_pvh_domain(d) ? v_start : 0));
-    if ( rc < 0 )
-    {
-        printk("Failed to load the kernel binary\n");
-        return rc;
-    }
-    bootstrap_map(NULL);
-
-    if ( UNSET_ADDR != parms.virt_hypercall )
+    /* Set up start info area. */
+    if ( is_pvh_domain(d) )
     {
-        if ( (parms.virt_hypercall < v_start) ||
-             (parms.virt_hypercall >= v_end) )
+        /* avoid calling the copy function for every write to vstartinfo_start */
+        if ( (si_buf = xmalloc_bytes(PAGE_SIZE)) == NULL )
         {
-            mapcache_override_current(NULL);
-            write_ptbase(current);
-            printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
-            return -1;
+            printk("PVH: xmalloc failed to alloc %ld bytes.\n", PAGE_SIZE);
+            return -ENOMEM;
         }
-        hypercall_page_initialise(
-            d, (void *)(unsigned long)parms.virt_hypercall);
+        si = (start_info_t *)si_buf;
     }
+    else
+        si = (start_info_t *)vstartinfo_start;
 
-    /* Free temporary buffers. */
-    discard_initial_images();
-
-    /* Set up start info area. */
-    si = (start_info_t *)vstartinfo_start;
     clear_page(si);
     si->nr_pages = nr_pages;
 
@@ -846,12 +939,16 @@ int __init construct_dom0(
              elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
 
     count = d->tot_pages;
+
+    if ( is_pvh_domain(d) )
+        goto pvh_skip_guest_p2m_table;
+
     l4start = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
     l3tab = NULL;
     l2tab = NULL;
     l1tab = NULL;
     /* Set up the phys->machine table if not part of the initial mapping. */
-    if ( parms.p2m_base != UNSET_ADDR )
+    if ( parms.p2m_base != UNSET_ADDR && !is_pvh_domain(d) )
     {
         unsigned long va = vphysmap_start;
 
@@ -972,6 +1069,11 @@ int __init construct_dom0(
         unmap_domain_page(l3tab);
     unmap_domain_page(l4start);
 
+pvh_skip_guest_p2m_table:
+
+    if ( is_pvh_domain(d) )
+        hap_set_pvh_alloc_for_dom0(d, nr_pages);
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
@@ -988,11 +1090,8 @@ int __init construct_dom0(
         if ( pfn > REVERSE_START && (vinitrd_start || pfn < initrd_pfn) )
             mfn = alloc_epfn - (pfn - REVERSE_START);
 #endif
-        if ( !is_pv_32on64_domain(d) )
-            ((unsigned long *)vphysmap_start)[pfn] = mfn;
-        else
-            ((unsigned int *)vphysmap_start)[pfn] = mfn;
-        set_gpfn_from_mfn(mfn, pfn);
+        dom0_update_physmap(d, pfn, mfn, vphysmap_start);
+
         if (!(pfn & 0xfffff))
             process_pending_softirqs();
     }
@@ -1008,8 +1107,8 @@ int __init construct_dom0(
             if ( !page->u.inuse.type_info &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
-            ((unsigned long *)vphysmap_start)[pfn] = mfn;
-            set_gpfn_from_mfn(mfn, pfn);
+
+            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
             ++pfn;
             if (!(pfn & 0xfffff))
                 process_pending_softirqs();
@@ -1029,11 +1128,7 @@ int __init construct_dom0(
 #ifndef NDEBUG
 #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
 #endif
-            if ( !is_pv_32on64_domain(d) )
-                ((unsigned long *)vphysmap_start)[pfn] = mfn;
-            else
-                ((unsigned int *)vphysmap_start)[pfn] = mfn;
-            set_gpfn_from_mfn(mfn, pfn);
+            dom0_update_physmap(d, pfn, mfn, vphysmap_start);
 #undef pfn
             page++; pfn++;
             if (!(pfn & 0xfffff))
@@ -1041,6 +1136,50 @@ int __init construct_dom0(
         }
     }
 
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest = (void*)vkern_start;
+    rc = elf_load_binary(&elf, (is_pvh_domain(d) ? v_start : 0));
+    if ( rc < 0 )
+    {
+        printk("Failed to load the kernel binary\n");
+        return rc;
+    }
+    bootstrap_map(NULL);
+
+    if ( UNSET_ADDR != parms.virt_hypercall )
+    {
+        void *addr;
+        if ( is_pvh_domain(d) )
+        {
+            if ( (addr = xzalloc_bytes(PAGE_SIZE)) == NULL )
+            {
+                printk("pvh: xzalloc failed for %ld bytes.\n", PAGE_SIZE);
+                return -ENOMEM;
+            }
+        } else
+            addr = (void *)parms.virt_hypercall;
+
+        if ( (parms.virt_hypercall < v_start) ||
+             (parms.virt_hypercall >= v_end) )
+        {
+            mapcache_override_current(NULL);
+            write_ptbase(current);
+            printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
+            return -1;
+        }
+        hypercall_page_initialise(d, addr);
+
+        if ( is_pvh_domain(d) )
+        {
+            early_pvh_copy_or_zero(parms.virt_hypercall, addr, PAGE_SIZE,
+                                   v_start);
+            xfree(addr);
+        }
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
@@ -1056,6 +1195,16 @@ int __init construct_dom0(
         si->console.dom0.info_off  = sizeof(struct start_info);
         si->console.dom0.info_size = sizeof(struct dom0_vga_console_info);
     }
+    if ( is_pvh_domain(d) )
+    {
+        unsigned long mfn = virt_to_mfn(d->shared_info);
+        unsigned long pfn = shared_info_paddr >> PAGE_SHIFT;
+        si->shared_info = shared_info_paddr;
+        dom0_update_physmap(d, pfn, mfn, 0);
+
+        early_pvh_copy_or_zero(vstartinfo_start, si_buf, PAGE_SIZE, v_start);
+        xfree(si_buf);
+    }
 
     if ( is_pv_32on64_domain(d) )
         xlat_start_info(si, XLAT_start_info_console_dom0);
@@ -1087,12 +1236,18 @@ int __init construct_dom0(
     regs->eip = parms.virt_entry;
     regs->esp = vstack_end;
     regs->esi = vstartinfo_start;
-    regs->eflags = X86_EFLAGS_IF;
+    regs->eflags = X86_EFLAGS_IF | 0x2;
 
     if ( opt_dom0_shadow )
-        if ( paging_enable(d, PG_SH_enable) == 0 ) 
+    {
+        if ( is_pvh_domain(d) )
+        {
+            printk("Invalid option dom0_shadow for PVH\n");
+            return -EINVAL;
+        }
+        if ( paging_enable(d, PG_SH_enable) == 0 )
             paging_update_paging_modes(v);
-
+    }
     if ( supervisor_mode_kernel )
     {
         v->arch.pv_vcpu.kernel_ss &= ~3;
@@ -1169,6 +1324,9 @@ int __init construct_dom0(
 
     BUG_ON(rc != 0);
 
+    if ( is_pvh_domain(d) )
+        pvh_map_all_iomem(d);
+
     iommu_dom0_init(dom0);
 
     return 0;
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 5aa0852..674c324 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -580,6 +580,20 @@ int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
     }
 }
 
+/* Resize hap table. Copied from: libxl_get_required_shadow_memory() */
+void hap_set_pvh_alloc_for_dom0(struct domain *d, unsigned long num_pages)
+{
+    int rc;
+    unsigned long memkb = num_pages * (PAGE_SIZE / 1024);
+
+    memkb = 4 * (256 * d->max_vcpus + 2 * (memkb / 1024));
+    num_pages = ((memkb+1023)/1024) << (20 - PAGE_SHIFT);
+    paging_lock(d);
+    rc = hap_set_allocation(d, num_pages, NULL);
+    paging_unlock(d);
+    BUG_ON(rc);
+}
+
 static const struct paging_mode hap_paging_real_mode;
 static const struct paging_mode hap_paging_protected_mode;
 static const struct paging_mode hap_paging_pae_mode;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 43301a5..f307f24 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -60,6 +60,10 @@ integer_param("maxcpus", max_cpus);
 static bool_t __initdata disable_smep;
 invbool_param("smep", disable_smep);
 
+/* Boot dom0 in PVH mode */
+static bool_t __initdata opt_dom0pvh;
+boolean_param("dom0pvh", opt_dom0pvh);
+
 /* **** Linux config option: propagated to domain0. */
 /* "acpi=off":    Sisables both ACPI table parsing and interpreter. */
 /* "acpi=force":  Override the disable blacklist.                   */
@@ -545,7 +549,7 @@ void __init __start_xen(unsigned long mbi_p)
 {
     char *memmap_type = NULL;
     char *cmdline, *kextra, *loader;
-    unsigned int initrdidx;
+    unsigned int initrdidx, domcr_flags = 0;
     multiboot_info_t *mbi = __va(mbi_p);
     module_t *mod = (module_t *)__va(mbi->mods_addr);
     unsigned long nr_pages, modules_headroom, *module_map;
@@ -1314,7 +1318,9 @@ void __init __start_xen(unsigned long mbi_p)
         panic("Could not protect TXT memory regions\n");
 
     /* Create initial domain 0. */
-    dom0 = domain_create(0, DOMCRF_s3_integrity, 0);
+    domcr_flags = (opt_dom0pvh ? DOMCRF_pvh | DOMCRF_hap : 0);
+    domcr_flags |= DOMCRF_s3_integrity;
+    dom0 = domain_create(0, domcr_flags, 0);
     if ( IS_ERR(dom0) || (alloc_dom0_vcpu0() == NULL) )
         panic("Error creating domain 0\n");
 
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index e03f983..aab8558 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -63,6 +63,7 @@ int   hap_track_dirty_vram(struct domain *d,
                            XEN_GUEST_HANDLE_64(uint8) dirty_bitmap);
 
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
+void hap_set_pvh_alloc_for_dom0(struct domain *d, unsigned long num_pages);
 
 #endif /* XEN_HAP_H */
 
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 03/17] PVH xen: create domctl_memory_mapping() function
  2013-04-23 21:25 ` [PATCH 03/17] PVH xen: create domctl_memory_mapping() function Mukesh Rathor
@ 2013-04-24  7:01   ` Jan Beulich
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  7:01 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -46,6 +46,72 @@ static int gdbsx_guest_mem_io(
>      return (iop->remain ? -EFAULT : 0);
>  }
>  
> +long domctl_memory_mapping(struct domain *d, unsigned long gfn,
> +                           unsigned long mfn, unsigned long nr_mfns,
> +                           int add_map)

bool_t.

> +{
> +    unsigned long i;
> +    long ret;
> +
> +    if ( !IS_PRIV(current->domain)  &&
> +         !iomem_access_permitted(current->domain, mfn, mfn + nr_mfns - 1) )
> +        return -EPERM;

This construct is stale as of 76401237 ("x86: remove IS_PRIV access
check bypasses"). Oh, I just saw that you say this series is based on
an almost week-old tree...

> +
> +    if ( (mfn + nr_mfns - 1) < mfn || /* wrap? */
> +         ((mfn | (mfn + nr_mfns - 1)) >> (paddr_bits - PAGE_SHIFT)) ||
> +         (gfn + nr_mfns - 1) < gfn ) /* wrap? */
> +        return -EINVAL;
> +
> +    ret = xsm_iomem_permission(XSM_HOOK, d, mfn, mfn + nr_mfns - 1, add_map);
> +    if ( ret )
> +        return ret;
> +
> +    if ( add_map )
> +    {
> +        printk(XENLOG_G_INFO
> +               "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> +               d->domain_id, gfn, mfn, nr_mfns);
> +
> +        ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
> +        if ( !ret && paging_mode_translate(d) )
> +        {
> +            for ( i = 0; !ret && i < nr_mfns; i++ )
> +                if ( !set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i)) )
> +                    ret = -EIO;
> +            if ( ret )
> +            {
> +                printk(XENLOG_G_WARNING
> +                       "memory_map:fail: dom%d gfn=%lx mfn=%lx\n",
> +                       d->domain_id, gfn + i, mfn + i);
> +                while ( i-- )
> +                    clear_mmio_p2m_entry(d, gfn + i);
> +                if ( iomem_deny_access(d, mfn, mfn + nr_mfns - 1) &&
> +                     IS_PRIV(current->domain) )
> +                    printk(XENLOG_ERR
> +                           "memory_map: failed to deny dom%d access to [%lx,%lx]\n",
> +                           d->domain_id, mfn, mfn + nr_mfns - 1);
> +            }
> +        }
> +    } else {

How shall we trust this is pure code movement if even formatting
got broken?

> +        printk(XENLOG_G_INFO
> +               "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> +               d->domain_id, gfn, mfn, nr_mfns);
> +
> +        if ( paging_mode_translate(d) )
> +            for ( i = 0; i < nr_mfns; i++ )
> +                add_map |= !clear_mmio_p2m_entry(d, gfn + i);
> +        ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
> +        if ( !ret && add_map )
> +            ret = -EIO;
> +        if ( ret && IS_PRIV(current->domain) )
> +            printk(XENLOG_ERR
> +                   "memory_map: error %ld %s dom%d access to [%lx,%lx]\n",
> +                   ret, add_map ? "removing" : "denying", d->domain_id,
> +                   mfn, mfn + nr_mfns - 1);
> +    }
> +    return ret;
> +}
> +
>  long arch_do_domctl(
>      struct xen_domctl *domctl, struct domain *d,
>      XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> @@ -628,68 +694,8 @@ long arch_do_domctl(
>          unsigned long mfn = domctl->u.memory_mapping.first_mfn;
>          unsigned long nr_mfns = domctl->u.memory_mapping.nr_mfns;
>          int add = domctl->u.memory_mapping.add_mapping;
> -        unsigned long i;
> -
> -        ret = -EINVAL;
> -        if ( (mfn + nr_mfns - 1) < mfn || /* wrap? */
> -             ((mfn | (mfn + nr_mfns - 1)) >> (paddr_bits - PAGE_SHIFT)) ||
> -             (gfn + nr_mfns - 1) < gfn ) /* wrap? */
> -            break;
> -
> -        ret = -EPERM;
> -        if ( !IS_PRIV(current->domain) &&
> -             !iomem_access_permitted(current->domain, mfn, mfn + nr_mfns - 1) )
> -            break;
> -
> -        ret = xsm_iomem_mapping(XSM_HOOK, d, mfn, mfn + nr_mfns - 1, add);
> -        if ( ret )
> -            break;
>  
> -        if ( add )
> -        {
> -            printk(XENLOG_G_INFO
> -                   "memory_map:add: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> -                   d->domain_id, gfn, mfn, nr_mfns);
> -
> -            ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
> -            if ( !ret && paging_mode_translate(d) )
> -            {
> -                for ( i = 0; !ret && i < nr_mfns; i++ )
> -                    if ( !set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i)) )
> -                        ret = -EIO;
> -                if ( ret )
> -                {
> -                    printk(XENLOG_G_WARNING
> -                           "memory_map:fail: dom%d gfn=%lx mfn=%lx\n",
> -                           d->domain_id, gfn + i, mfn + i);
> -                    while ( i-- )
> -                        clear_mmio_p2m_entry(d, gfn + i);
> -                    if ( iomem_deny_access(d, mfn, mfn + nr_mfns - 1) &&
> -                         IS_PRIV(current->domain) )
> -                        printk(XENLOG_ERR
> -                               "memory_map: failed to deny dom%d access to [%lx,%lx]\n",
> -                               d->domain_id, mfn, mfn + nr_mfns - 1);
> -                }
> -            }
> -        }
> -        else
> -        {

See the proper original code here.

Jan

> -            printk(XENLOG_G_INFO
> -                   "memory_map:remove: dom%d gfn=%lx mfn=%lx nr=%lx\n",
> -                   d->domain_id, gfn, mfn, nr_mfns);
> -
> -            if ( paging_mode_translate(d) )
> -                for ( i = 0; i < nr_mfns; i++ )
> -                    add |= !clear_mmio_p2m_entry(d, gfn + i);
> -            ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
> -            if ( !ret && add )
> -                ret = -EIO;
> -            if ( ret && IS_PRIV(current->domain) )
> -                printk(XENLOG_ERR
> -                       "memory_map: error %ld %s dom%d access to [%lx,%lx]\n",
> -                       ret, add ? "removing" : "denying", d->domain_id,
> -                       mfn, mfn + nr_mfns - 1);
> -        }
> +        ret = domctl_memory_mapping(d, gfn, mfn, nr_mfns, add);
>      }
>      break;
>  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 06/17]  PVH xen: Introduce PVH guest type
  2013-04-23 21:25 ` [PATCH 06/17] PVH xen: Introduce PVH guest type Mukesh Rathor
@ 2013-04-24  7:07   ` Jan Beulich
  2013-04-24 23:01     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  7:07 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch introduces the concept of a pvh guest. There are also other basic
> changes like creating macros to check for pvh vcpu/domain, and creating
> new macros to see if it's pv/pvh/hvm domain/vcpu. Also, modify copy macros
> to include pvh. Lastly, we introduce that PVH uses HVM style event delivery.
> 
> Changes in V2:
>   - make is_pvh/is_hvm enum instead of adding is_pvh as a new flag.
>   - fix indentation and spacing in guest_kernel_mode macro.
>   - add debug only BUG() in GUEST_KERNEL_RPL macro as it should no longer
>     be called in any PVH paths.
> 
> Changes in V3:
>   - Rename enum fields, and add is_pv to it.
>   - Get rid of is_hvm_or_pvh_* macros.
>
> Changes in V4:
>   - Move e820 fields out of pv_domain struct.

Is there any reason why this can't be a standalone patch?

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 07/17] PVH xen: tools changes to create PVH domain
  2013-04-23 21:25 ` [PATCH 07/17] PVH xen: tools changes to create PVH domain Mukesh Rathor
@ 2013-04-24  7:10   ` Jan Beulich
  2013-04-24 23:02     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  7:10 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -89,6 +89,9 @@ struct xen_domctl_getdomaininfo {
>   /* Being debugged.  */
>  #define _XEN_DOMINF_debugged  6
>  #define XEN_DOMINF_debugged   (1U<<_XEN_DOMINF_debugged)
> + /* domain is PVH */
> +#define _XEN_DOMINF_pvh_guest 7
> +#define XEN_DOMINF_pvh_guest   (1U<<_XEN_DOMINF_pvh_guest)
>   /* XEN_DOMINF_shutdown guest-supplied code.  */
>  #define XEN_DOMINF_shutdownmask 255
>  #define XEN_DOMINF_shutdownshift 16

This change cannot logically belong here, but ought to live in the
hypervisor side one. That's both for easing applying the patch
eventually (thus needing only a tools side ack) and from a logical
point of view (the producer of the interface change should exist
_before_ the consumer).

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/17]  PVH xen: create PVH vmcs, and also initialization
  2013-04-23 21:25 ` [PATCH 09/17] PVH xen: create PVH vmcs, and also initialization Mukesh Rathor
@ 2013-04-24  7:42   ` Jan Beulich
  2013-04-30 21:01     ` Mukesh Rathor
  2013-04-30 21:04     ` Mukesh Rathor
  0 siblings, 2 replies; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  7:42 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel, dexuan.cui

>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch mainly contains code to create a VMCS for PVH guest, and HVM
> specific vcpu/domain creation code.
> 
> Changes in V2:
>   - Avoid the call to hvm_do_resume() at the call site rather than returning from within it.
>   - Return for PVH in vmx_do_resume() prior to the Intel debugger stuff.
> 
> Changes in V3:
>   - Cleanup pvh_construct_vmcs().
>   - Fix formatting in a few places, adding XENLOG_G_ERR to printing.
>   - Do not load the CS selector for PVH here, but try to do that in Linux.
> 
> Changes in V4:
>   - Remove VM_ENTRY_LOAD_DEBUG_CTLS clearing.
>   - Add 32bit kernel changes mark.
>   - Verify pit_init call for PVH.

Verify in what way?

> +static int pvh_vcpu_initialise(struct vcpu *v)
> +{
> +    int rc;
> +
> +    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
> +        return rc;
> +
> +    softirq_tasklet_init(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
> +                         (void(*)(unsigned long))hvm_assert_evtchn_irq,
> +                         (unsigned long)v);
> +
> +    v->arch.hvm_vcpu.hcall_64bit = 1;    /* PVH 32bitfixme */
> +    v->arch.user_regs.eflags = 2;
> +    v->arch.hvm_vcpu.inject_trap.vector = -1;
> +
> +    if ( (rc = hvm_vcpu_cacheattr_init(v)) != 0 )
> +    {
> +        hvm_funcs.vcpu_destroy(v);
> +        return rc;
> +    }
> +    if ( v->vcpu_id == 0 )
> +        pit_init(v, cpu_khz);

I'm asking in particular because my understanding of "verify" would
be checking of an eventual return value...

> @@ -4512,6 +4582,8 @@ static int hvm_memory_event_traps(long p, uint32_t reason,
>  
>  void hvm_memory_event_cr0(unsigned long value, unsigned long old) 
>  {
> +    if ( is_pvh_vcpu(current) )
> +        return;
>      hvm_memory_event_traps(current->domain->arch.hvm_domain
>                               .params[HVM_PARAM_MEMORY_EVENT_CR0],
>                             MEM_EVENT_REASON_CR0,

So these checks are still there, with no mark of being temporary,
despite having pointed out that they ought to be unnecessary once
full PVH support is there. As with the 32-bit specific changes that
the code currently lacks, such temporary adjustments should be
marked clearly and completely, so subsequently one can locate
them _all_. Just consider what happens if after phase I you get
taken off this project, and someone else would have to complete
it.

> +    /* Host control registers. */
> +    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> +    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> +    __vmwrite(HOST_CR4,
> +              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));

Didn't pay attention to this before: Why is this conditional needed?
mmu_cr4_features ought to have X86_CR4_OSXSAVE set/cleared
correctly - after all, set_in_cr4(X86_CR4_OSXSAVE) is supposed to
ensure exactly that.
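
(i.e. presumably the write could simply be the following - sketch only:

    __vmwrite(HOST_CR4, mmu_cr4_features);

with the xsave_enabled() term dropped.)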

> +static int pvh_check_requirements(struct vcpu *v)
> +{
> +    u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> +
> +    if ( !paging_mode_hap(v->domain) )
> +    {
> +        dprintk(XENLOG_G_ERR, "HAP is required for PVH guest.\n");
> +        return -EINVAL;

The use of dprintk() appears to be misguided here (and in most
pre-existing places, so please don't use those as reference). My
take on this is that it should be used when an otherwise
ambiguous message is being printed solely for debugging
purposes.

Furthermore the log level of these messages surely shouldn't be
"error" - the tools are supposed to handle the returned error with
due verbosity, and hence the messages here are really just for
debugging purposes, i.e. not higher than XENLOG_G_INFO (i.e.
hidden entirely by default).

> +    guest_pat = MSR_IA32_CR_PAT_RESET;
> +    __vmwrite(GUEST_PAT, guest_pat);

What's the point of having the local variable "guest_pat" here?

> @@ -916,30 +1183,7 @@ static int construct_vmcs(struct vcpu *v)
>          __vmwrite(GUEST_INTR_STATUS, 0);
>      }
>  
> -    /* Host data selectors. */
> -    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_FS_SELECTOR, 0);
> -    __vmwrite(HOST_GS_SELECTOR, 0);
> -    __vmwrite(HOST_FS_BASE, 0);
> -    __vmwrite(HOST_GS_BASE, 0);
> -
> -    /* Host control registers. */
> -    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> -    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> -    __vmwrite(HOST_CR4,
> -              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));

Oh, I see this had been this way already, dating back to c/s
20260:ad35f39e5fdc - Dexuan - what was the point here? The host
CR4 value depending on guest properties (even if only apparently,
since xsave_enabled() simply returns cpu_has_xsave, along with
doing a few assertions) seems conceptually wrong.

> @@ -1259,8 +1503,10 @@ void vmx_do_resume(struct vcpu *v)
>  
>          vmx_clear_vmcs(v);
>          vmx_load_vmcs(v);
> -        hvm_migrate_timers(v);
> -        hvm_migrate_pirqs(v);
> +        if ( !is_pvh_vcpu(v) ) {

Formatting.


> @@ -110,6 +116,12 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>  
>      vpmu_initialise(v);
>  
> +    if (is_pvh_vcpu(v) ) 

Ditto.

> @@ -1031,6 +1043,27 @@ static void vmx_update_host_cr3(struct vcpu *v)
>      vmx_vmcs_exit(v);
>  }
>  
> +/*
> + * PVH guest never causes CR3 write vmexit. This is called during the guest setup.
> + */
> +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
> +{
> +    vmx_vmcs_enter(v);
> +    switch ( cr )
> +    {
> +        case 3:
> +            __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
> +            hvm_asid_flush_vcpu(v);
> +            break;
> +
> +        default:
> +            dprintk(XENLOG_ERR,
> +                   "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
> +                   v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));
> +    }

Again. Even if not written down formally, the fundamental rule is:
"In general you should copy the style of the surrounding code. If
you are unsure please ask." And surrounding code here is making
clear that the case statements ought to have the same indentation
as the switch one.

Please go over your patches before submitting to eliminate all
formatting issues.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-23 21:25 ` [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c Mukesh Rathor
@ 2013-04-24  8:47   ` Jan Beulich
  2013-04-25  0:57     ` Mukesh Rathor
                       ` (3 more replies)
  2013-04-25 11:19   ` Tim Deegan
  1 sibling, 4 replies; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  8:47 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> The heart of this patch is vmx exit handler for PVH guest. It is nicely
> isolated in a separate module as preferred by most of us. A call to it
> is added to vmx_pvh_vmexit_handler().

vmx_vmexit_handler()?

> +static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
> +                              unsigned int count)
> +{
> +    switch ( cmd )
> +    {
> +        /*
> +         * Only the following Grant Ops have been verified for PVH guest, hence
> +         * we check for them here.
> +         */
> +        case GNTTABOP_map_grant_ref:
> +        case GNTTABOP_unmap_grant_ref:
> +        case GNTTABOP_setup_table:
> +        case GNTTABOP_copy:
> +        case GNTTABOP_query_size:
> +        case GNTTABOP_set_version:
> +            return do_grant_table_op(cmd, uop, count);
> +    }
> +    return -ENOSYS;
> +}

As said before - I object to this sort of white listing. A PVH guest
ought to be permitted to issue any hypercall, with the sole
exception of MMU and very few other ones. So if anything,
specific hypercall functions should be black listed.

And of course, even if this is a new file, the indentation of case
statements needs adjustment (to match that in the file it got
cloned from, despite there being a small number of bad examples
in hvm.c).

> +static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
> +    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
> +    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
> +    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
> +    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
> +    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
> +    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
> +    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
> +    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
> +    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
> +    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
> +    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
> +    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
> +    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
> +    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
> +};

Is this table complete? What about multicalls, timer_op, kexec_op,
tmem_op, mca? I again think that copying hypercall_table and
adjusting the entries you explicitly need to override might be
better than creating yet another table that needs attention when
a new hypercall gets added.
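
A rough sketch of that alternative (the name of the base table being
copied is illustrative; the pvh_* overrides are the ones from the patch):

    static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls];

    static void __init pvh_hypercall_table_init(void)
    {
        /* Start from the common table... */
        memcpy(pvh_hypercall64_table, base_hypercall64_table,
               sizeof(pvh_hypercall64_table));
        /* ...and override only the entries PVH must intercept. */
        pvh_hypercall64_table[__HYPERVISOR_grant_table_op] =
            (hvm_hypercall_t *)pvh_grant_table_op;
        pvh_hypercall64_table[__HYPERVISOR_vcpu_op] =
            (hvm_hypercall_t *)pvh_vcpu_op;
        pvh_hypercall64_table[__HYPERVISOR_physdev_op] =
            (hvm_hypercall_t *)pvh_physdev_op;
    }

That way a newly added hypercall shows up in the PVH table automatically.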

> +/*
> + * Check if hypercall is valid
> + * Returns: 0 if the hcall is not valid, with eax set to the errno to return to the guest
> + */
> +static bool_t hcall_valid(struct cpu_user_regs *regs)
> +{
> +    struct segment_register sreg;
> +
> +    hvm_get_segment_register(current, x86_seg_ss, &sreg);
> +    if ( unlikely(sreg.attr.fields.dpl == 3) )

    if ( unlikely(sreg.attr.fields.dpl != 0) )

> +    {
> +        regs->eax = -EPERM;
> +        return 0;
> +    }
> +
> +    /* Following HCALLs have not been verified for PVH domUs */
> +    if ( !IS_PRIV(current->domain) &&
> +         (regs->eax == __HYPERVISOR_xsm_op ||
> +          regs->eax == __HYPERVISOR_platform_op ||
> +          regs->eax == __HYPERVISOR_domctl) )       /* for privcmd mmap */
> +    {
> +        regs->eax = -ENOSYS;
> +        return 0;
> +    }

This looks bogus - it shouldn't be the job of the generic handler
to verify permission to use certain hypercalls - the individual
handlers should be doing this quite well. And I suppose you saw
Daniel De Graaf's effort to eliminate IS_PRIV() from the tree?

> +    if ( regs->eax == __HYPERVISOR_sched_op && regs->rdi == SCHEDOP_shutdown )
> +    {
> +        domain_crash_synchronous();
> +        return HVM_HCALL_completed;
> +    }

???

> +    regs->rax = pvh_hypercall64_table[hnum](regs->rdi, regs->rsi, regs->rdx,
> +                                            regs->r10, regs->r8, regs->r9);

This again lacks an annotation to clarify that 32-bit support is missing.

> @@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
>  
>          vmx_clear_vmcs(v);
>          vmx_load_vmcs(v);
> -        if ( !is_pvh_vcpu(v) ) {
> +        if ( !is_pvh_vcpu(v) )
> +        {

Surely an unnecessary adjustment, if an earlier patch got it right
from the beginning?

> +    switch ( regs->ecx )
> +    {
> +        case MSR_IA32_MISC_ENABLE:
> +        {
> +            rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> +            msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> +                           MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> +            break;
> +        }
> +        default:
> +        {
> +            /* pvh fixme: see hvm_msr_read_intercept() */
> +            rdmsrl(regs->ecx, msr_content);
> +            break;
> +        }
> +    }

Apart from the recurring indentation problem, please also don't
add braces when you don't really need them (i.e. for declaring
case specific local variables).

> +    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
> +         guest_kernel_mode(vp, regs) &&  vp->domain->debugger_attached )
> +    {
> +        domain_pause_for_debugger();
> +    } else {
> +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
> +    }

In order to avoid two patch iterations here: There's no need for
braces at all in cases like this.

> +static int vmxit_int3(struct cpu_user_regs *regs)
> +{
> +    int ilen = vmx_get_instruction_length();
> +    struct vcpu *vp = current;
> +    struct hvm_trap trap_info = {
> +                        .vector = TRAP_int3,
> +                        .type = X86_EVENTTYPE_SW_EXCEPTION,
> +                        .error_code = HVM_DELIVER_NO_ERROR_CODE,
> +                        .insn_len = ilen

Indentation.

> +    };
> +
> +    regs->eip += ilen;
> +
> +    /* gdbsx or another debugger. Never pause dom0 */
> +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
> +    {
> +        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
> +        current->arch.gdbsx_vcpu_event = TRAP_int3;
> +        domain_pause_for_debugger();
> +        return 0;
> +    }
> +
> +    regs->eip -= ilen;

Please move the first adjustment into the if() body, making the
second adjustment here unnecessary.
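
A sketch of the suggested flow:

    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
    {
        regs->eip += ilen;
        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
        current->arch.gdbsx_vcpu_event = TRAP_int3;
        domain_pause_for_debugger();
        return 0;
    }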

> +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> +{
> +    ulong addr = 0;
> +
> +    if ( guest_kernel_mode(current, regs) ||
> +         emulate_forced_invalid_op(regs, &addr) == 0 )
> +    {
> +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> +        return 0;
> +    }
> +    if ( addr )
> +        hvm_inject_page_fault(0, addr);

This cannot be conditional upon addr being non-zero.

> +        case TRAP_no_device:
> +            hvm_funcs.fpu_dirty_intercept();  /* vmx_fpu_dirty_intercept */

It ought to be perfectly valid to avoid the indirect call here.

> +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        u64 old_cr4 = __vmread(GUEST_CR4);
> +
> +        if ( (old_cr4 ^ (*regp)) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> +            vpid_sync_all();
> +
> +        __vmwrite(GUEST_CR4, *regp);

No modification of CR4_READ_SHADOW here?

> +static int vmxit_io_instr(struct cpu_user_regs *regs)
> +{
> +    int curr_lvl;
> +    int requested = (regs->rflags >> 12) & 3;
> +
> +    read_vmcs_selectors(regs);
> +    curr_lvl = regs->cs & 3;

Shouldn't you look at SS'es DPL instead?
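
i.e., something like this (sketch; DPL is bits 6:5 of the VMX
access-rights word):

    curr_lvl = (__vmread(GUEST_SS_AR_BYTES) >> 5) & 3;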

> +static int pvh_ept_handle_violation(unsigned long qualification,
> +                                    paddr_t gpa, struct cpu_user_regs *regs)
> +{
> +    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
> +    p2m_type_t p2mt;
> +    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
> +
> +    gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "

This lacks the _G_ infix, or else you're risking a security issue here
by allowing a DomU to spam the log.

> +    if ( qualification & EPT_GLA_VALID )
> +    {
> +        gla = __vmread(GUEST_LINEAR_ADDRESS);
> +        gdprintk(XENLOG_ERR, " --- GLA %#lx\n", gla);

Same here, obviously.

> +/*
> + * The cpuid macro clears rcx, so execute cpuid here exactly as the user
> + * process would on a PV guest.
> + */
> +static void pvh_user_cpuid(struct cpu_user_regs *regs)
> +{
> +    unsigned int eax, ebx, ecx, edx;
> +
> +    asm volatile ( "cpuid"
> +              : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
> +              : "0" (regs->eax), "2" (regs->rcx) );
> +
> +    regs->rax = eax; regs->rbx = ebx; regs->rcx = ecx; regs->rdx = edx;

One assignment per line please.

> +    switch ( (uint16_t)exit_reason )
> +    {
> +        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
> +            rc = vmxit_exception(regs);
> +            break;

Why would an NMI be blindly reflected to the guest?

> +        case EXIT_REASON_CPUID:              /* 10 */
> +        {
> +            if ( guest_kernel_mode(vp, regs) )
> +                pv_cpuid(regs);
> +            else
> +                pvh_user_cpuid(regs);

What's the reason for this distinction? I would think it's actually a
benefit of PVH to allow also hiding unwanted features from guest
user mode (like HVM, but unlike PV without CPUID faulting).

> +    if ( rc )
> +    {
> +        exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        gdprintk(XENLOG_ERR,
> +                 "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n",
> +                 ccpu, exit_reason, exit_reason, exit_qualification,
> +                 exit_qualification, __vmread(GUEST_CR0));
> +        gdprintk(XENLOG_ERR, "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n",
> +                 regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3));
> +        domain_crash_synchronous();
> +    }

The log levels are too high again here, and lacking the _G_ infix.

> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> +
> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +        return -EINVAL;

Lacking vmx_vmcs_exit().
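
I.e. the error path needs to leave the VMCS too, roughly:

    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
    {
        vmx_vmcs_exit(v);
        return -EINVAL;
    }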

> +int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
> +                            const struct cpu_user_regs *regs,
> +                            unsigned long *base, unsigned long *limit,
> +                            unsigned int *ar)
> +{
> +    unsigned int tmp_ar = 0;
> +    ASSERT(v == current);
> +    ASSERT(is_pvh_vcpu(v));
> +
> +    if ( sel == (unsigned int)regs->cs )
> +    {
> +        *base = __vmread(GUEST_CS_BASE);
> +        *limit = __vmread(GUEST_CS_LIMIT);
> +        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
> +    }
> +    else if ( sel == (unsigned int)regs->ds )

This if/else-if sequence can't be right - a selector can be in more
than one selector register (and one of them may have got reloaded
after a GDT/LDT adjustment, while another may not), so you can't
base the descriptor read upon the selector value. The caller will
have to tell you which register it wants the descriptor for, not which
selector.
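
I.e. a (hypothetical) signature along the lines of

    int vmx_pvh_read_descriptor(enum x86_segment seg, const struct vcpu *v,
                                const struct cpu_user_regs *regs,
                                unsigned long *base, unsigned long *limit,
                                unsigned int *ar);

with a switch over x86_seg_cs, x86_seg_ds, etc. in place of the selector
comparisons.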

> +    if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
> +    {
> +        *base = 0UL;
> +        *limit = ~0UL;
> +    }

Doing that adjustment here rather than only in the CS case
suggests that this can happen for other than CS, in which case
this is wrong specifically for FS and GS.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 12/17] PVH xen: support invalid op, return PVH features etc...
  2013-04-23 21:26 ` [PATCH 12/17] PVH xen: support invalid op, return PVH features etc Mukesh Rathor
@ 2013-04-24  9:01   ` Jan Beulich
  2013-04-25  1:01     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  9:01 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/arch/x86/debug.c
> +++ b/xen/arch/x86/debug.c
> @@ -59,7 +59,9 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct domain *dp, int toaddr,
>          return INVALID_MFN;
>      }
>  
> -    mfn = mfn_x(get_gfn(dp, *gfn, &gfntype)); 
> +    mfn = mfn_x(get_gfn_query(dp, *gfn, &gfntype));
> +    put_gfn(dp, *gfn);
> +
>      if ( p2m_is_readonly(gfntype) && toaddr )
>      {
>          DBGP2("kdb:p2m_is_readonly: gfntype:%x\n", gfntype);
> @@ -178,9 +180,6 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp,
>          }
>  
>          unmap_domain_page(va);
> -        if ( gfn != INVALID_GFN )
> -            put_gfn(dp, gfn);
> -
>          addr += pagecnt;
>          buf += pagecnt;
>          len -= pagecnt;

How is this change related to PVH? I.e. isn't this a standalone fix/
change?

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 15/17]  PVH xen: Miscellaneous changes
  2013-04-23 21:26 ` [PATCH 15/17] PVH xen: Miscellaneous changes Mukesh Rathor
@ 2013-04-24  9:06   ` Jan Beulich
  2013-05-10  1:54     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  9:06 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -693,6 +693,8 @@ typedef struct shared_info shared_info_t;
>   *      c. list of allocated page frames [mfn_list, nr_pages]
>   *         (unless relocated due to XEN_ELFNOTE_INIT_P2M)
>   *      d. start_info_t structure        [register ESI (x86)]
> + *      d1. struct shared_info_t                [shared_info]
> + *                   (above if auto translated guest only)
>   *      e. bootstrap page tables         [pt_base and CR3 (x86)]
>   *      f. bootstrap stack               [register ESP (x86)]
>   *  4. Bootstrap elements are packed together, but each is 4kB-aligned.

This adjustment should be done in the patch implementing this.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH
  2013-04-23 21:26 ` [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH Mukesh Rathor
@ 2013-04-24  9:15   ` Jan Beulich
  2013-05-14  1:16     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  9:15 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +void __init early_pvh_copy_or_zero(unsigned long dest, char *src, int len,
> +                                   unsigned long v_start)

..., const void *src, ...

> @@ -122,11 +123,25 @@ void elf_set_verbose(struct elf_binary *elf)
>      elf->verbose = 1;
>  }
>  
> -static int elf_load_image(void *dst, const void *src, uint64_t filesz, uint64_t memsz)
> +extern void __init early_pvh_copy_or_zero(unsigned long dest, char *src,
> +                                          int len, unsigned long v_start);

This needs to be put in a header included both here and at the
producer side.

Also, if you need to pass v_start around just to pass it back to
this function, you could as well store it in a static variable in
domain_build.c, and leave all of these functions untouched.
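
I.e. (sketch):

    /* in domain_build.c */
    static unsigned long __initdata pvh_v_start;

and have early_pvh_copy_or_zero() consult that directly, instead of
threading v_start through elf_load_image().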

> +
> +static int elf_load_image(void *dst, const void *src, uint64_t filesz,
> +                          uint64_t memsz, unsigned long v_start)
>  {
>      int rc;
>      if ( filesz > ULONG_MAX || memsz > ULONG_MAX )
>          return -1;
> +
> +    if ( v_start )
> +    {
> +        unsigned long addr = (unsigned long)dst;
> +        early_pvh_copy_or_zero(addr, (char *)src, filesz, v_start);

With the adjustment above, ugly casts like this can be dropped.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-04-23 21:26 ` [PATCH 17/17] PVH xen: PVH dom0 creation Mukesh Rathor
@ 2013-04-24  9:28   ` Jan Beulich
  2013-04-26  1:18     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-24  9:28 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +    /* If the e820 ended under 4GB, we must map the remaining space upto 4GB */
> +    if ( end < GB(4) )
> +    {
> +        start_pfn = PFN_UP(end);
> +        end_pfn = (GB(4)) >> PAGE_SHIFT;
> +        nump = end_pfn - start_pfn;
> +        rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);
> +        BUG_ON(rc);
> +    }

That's necessary, but not sufficient. Or did I overlook MMIO ranges
getting added somewhere else for Dom0, when they sit above the
highest E820 covered address?

> +    if ( is_pvh_domain(d) )
> +    {
> +        v->arch.cr3 = v->arch.hvm_vcpu.guest_cr[3] = (vpt_start - v_start);
> +
> +        /* HAP is required for PVH and pfns are serially mapped there */

sequentially?

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 06/17]  PVH xen: Introduce PVH guest type
  2013-04-24  7:07   ` Jan Beulich
@ 2013-04-24 23:01     ` Mukesh Rathor
  2013-04-25  8:28       ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-24 23:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 08:07:08 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > This patch introduces the concept of a pvh guest. There are also
> > other basic changes like creating macros to check for pvh
> > vcpu/domain, and creating new macros to see if it's pv/pvh/hvm
> > domain/vcpu. Also, modify copy macros to include pvh. Lastly, we
> > introduce that PVH uses HVM style event delivery.
> > 
> > Changes in V2:
> >   - make is_pvh/is_hvm enum instead of adding is_pvh as a new flag.
> >   - fix indentation and spacing in guest_kernel_mode macro.
> >   - add debug only BUG() in GUEST_KERNEL_RPL macro as it should no
> > longer be called in any PVH paths.
> > 
> > Changes in V3:
> >   - Rename enum fields, and add is_pv to it.
> >   - Get rid if is_hvm_or_pvh_* macros.
> > 
> > Changes in V4:
> >   - Move e820 fields out of pv_domain struct.
> 
> Is there any reason why this can't be a standalone patch?
> 
> Jan
> 

I could move the e820 field changes to an earlier, smaller prep patch,
or create a separate one if they are at all big.


M-

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 07/17] PVH xen: tools changes to create PVH domain
  2013-04-24  7:10   ` Jan Beulich
@ 2013-04-24 23:02     ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-24 23:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 08:10:05 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -89,6 +89,9 @@ struct xen_domctl_getdomaininfo {
> >   /* Being debugged.  */
> >  #define _XEN_DOMINF_debugged  6
> >  #define XEN_DOMINF_debugged   (1U<<_XEN_DOMINF_debugged)
> > + /* domain is PVH */
> > +#define _XEN_DOMINF_pvh_guest 7
> > +#define XEN_DOMINF_pvh_guest   (1U<<_XEN_DOMINF_pvh_guest)
> >   /* XEN_DOMINF_shutdown guest-supplied code.  */
> >  #define XEN_DOMINF_shutdownmask 255
> >  #define XEN_DOMINF_shutdownshift 16
> 
> This change cannot logically belong here, but ought to live in the
> hypervisor side one. That's both for easing applying the patch
> eventually (thus needing only a tools side ack) and from a logical
> point of view (the producer of the interface change should exist
> _before_ the consumer).
> 
> Jan
> 

Ok, I'll move this change to patch 6.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-24  8:47   ` Jan Beulich
@ 2013-04-25  0:57     ` Mukesh Rathor
  2013-04-25  8:36       ` Jan Beulich
  2013-05-01  0:51     ` Mukesh Rathor
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-25  0:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 09:47:55 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > +static int pvh_grant_table_op(unsigned int cmd,
> > XEN_GUEST_HANDLE(void) uop,
> > +                              unsigned int count)
> > +{
> > +    switch ( cmd )
> > +    {
> > +        /*
> > +         * Only the following Grant Ops have been verified for PVH guest, hence
> > +         * we check for them here.
> > +         */
> > +        case GNTTABOP_map_grant_ref:
> > +        case GNTTABOP_unmap_grant_ref:
> > +        case GNTTABOP_setup_table:
> > +        case GNTTABOP_copy:
> > +        case GNTTABOP_query_size:
> > +        case GNTTABOP_set_version:
> > +            return do_grant_table_op(cmd, uop, count);
> > +    }
> > +    return -ENOSYS;
> > +}
> 
> As said before - I object to this sort of white listing. A PVH guest
> ought to be permitted to issue any hypercall, with the sole
> exception of MMU and very few other ones. So if anything,
> specific hypercall functions should be black listed.

Well, like I said before, these are verified/tested with PVH currently, 
and during the early stages we need to do whatever to catch things as 
bugs come in. I can make it DEBUG only if that makes it easier for you?
I'd rather see a post here saying they got ENOSYS than saying they got
weird crash/hang/etc...

BTW, similar checks exist in hvm_physdev_op(),hvm_vcpu_op()......

> > +static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
> > +    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
> > +    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
> > +    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
> > +    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
> > +    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
> > +    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
> > +    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
> > +    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
> > +    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
> > +    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
> > +    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
> > +    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
> > +    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
> > +    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
> > +};
> 
> Is this table complete? What about multicalls, timer_op, kexec_op,
> tmem_op, mca? I again think that copying hypercall_table and
> adjusting the entries you explicitly need to override might be
> better than creating yet another table that needs attention when
> a new hypercall gets added.

Nope.  Like I've said a few times before, we are not fully done with PVH with
this patch set. This patch set establishes a minimum baseline to get linux to
boot with PVH enabled. Next would be to go through timer_op, look at
do_timer_op, verify/test it out for PVH, and add it here. Then mca, tmem, ....
When we are fully satisfied with PVH completeness, we can
remove this table if you wish, or make it DEBUG so one knows right away
what PVH supports at that time.

I am not sure I understand what you mean by copying hypercall_table. You
mean copy all the calls in this table above from entry.S?

> > +/*
> > + * Check if hypercall is valid
> > + * Returns: 0 if hcall is not valid with eax set to the errno to
> > ret to guest
> > + */
> > +static bool_t hcall_valid(struct cpu_user_regs *regs)
> > +{
> > +    struct segment_register sreg;
> > +
> > +    hvm_get_segment_register(current, x86_seg_ss, &sreg);
> > +    if ( unlikely(sreg.attr.fields.dpl == 3) )
> 
>     if ( unlikely(sreg.attr.fields.dpl != 0) )

Ok, thanks.

> > +    {
> > +        regs->eax = -EPERM;
> > +        return 0;
> > +    }
> > +
> > +    /* Following HCALLs have not been verified for PVH domUs */
> > +    if ( !IS_PRIV(current->domain) &&
> > +         (regs->eax == __HYPERVISOR_xsm_op ||
> > +          regs->eax == __HYPERVISOR_platform_op ||
> > +          regs->eax == __HYPERVISOR_domctl) )       /* for privcmd mmap */
> > +    {
> > +        regs->eax = -ENOSYS;
> > +        return 0;
> > +    }
> 
> This looks bogus - it shouldn't be the job of the generic handler
> to verify permission to use certain hypercalls - the individual
> handlers should be doing this quite well. And I suppose you saw
> Daniel De Graaf's effort to eliminate IS_PRIV() from the tree?

Ok, will remove it. Yup, my next refresh will take care of IS_PRIV.

> > +    if ( regs->eax == __HYPERVISOR_sched_op && regs->rdi == SCHEDOP_shutdown )
> > +    {
> > +        domain_crash_synchronous();
> > +        return HVM_HCALL_completed;
> > +    }
> 
> ???

Right. my bad, don't need this anymore.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 12/17] PVH xen: support invalid op, return PVH features etc...
  2013-04-24  9:01   ` Jan Beulich
@ 2013-04-25  1:01     ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-25  1:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 10:01:26 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > --- a/xen/arch/x86/debug.c
> > +++ b/xen/arch/x86/debug.c
> > @@ -59,7 +59,9 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct domain *dp,
> > int toaddr, return INVALID_MFN;
> >      }
> >  
> > -    mfn = mfn_x(get_gfn(dp, *gfn, &gfntype)); 
> > +    mfn = mfn_x(get_gfn_query(dp, *gfn, &gfntype));
> > +    put_gfn(dp, *gfn);
> > +
> >      if ( p2m_is_readonly(gfntype) && toaddr )
> >      {
> >          DBGP2("kdb:p2m_is_readonly: gfntype:%x\n", gfntype);
> > @@ -178,9 +180,6 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf,
> > int len, struct domain *dp, }
> >  
> >          unmap_domain_page(va);
> > -        if ( gfn != INVALID_GFN )
> > -            put_gfn(dp, gfn);
> > -
> >          addr += pagecnt;
> >          buf += pagecnt;
> >          len -= pagecnt;
> 
> How is this change related to PVH? I.e. isn't this a standalone fix/
> change?
> 
> Jan

fine. I'll move it to a separate patch.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 06/17]  PVH xen: Introduce PVH guest type
  2013-04-24 23:01     ` Mukesh Rathor
@ 2013-04-25  8:28       ` Jan Beulich
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Beulich @ 2013-04-25  8:28 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 25.04.13 at 01:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Apr 2013 08:07:08 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > This patch introduces the concept of a pvh guest. There are also
>> > other basic changes like creating macros to check for pvh
>> > vcpu/domain, and creating new macros to see if it's pv/pvh/hvm
>> > domain/vcpu. Also, modify copy macros to include pvh. Lastly, we
>> > introduce that PVH uses HVM style event delivery.
>> > 
>> > Changes in V2:
>> >   - make is_pvh/is_hvm enum instead of adding is_pvh as a new flag.
>> >   - fix indentation and spacing in guest_kernel_mode macro.
>> >   - add debug only BUG() in GUEST_KERNEL_RPL macro as it should no
>> > longer be called in any PVH paths.
>> > 
>> > Changes in V3:
>> >   - Rename enum fields, and add is_pv to it.
>> >   - Get rid if is_hvm_or_pvh_* macros.
>> > 
>> > Changes in V4:
>> >   - Move e820 fields out of pv_domain struct.
>> 
>> Is there any reason why this can't be a standalone patch?
> 
> I could move it, the e820 field changes, to an earlier smaller prep
> patch or create a separate one if they are all big.

A separate one would be much preferred.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-25  0:57     ` Mukesh Rathor
@ 2013-04-25  8:36       ` Jan Beulich
  2013-04-26  1:16         ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-25  8:36 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Apr 2013 09:47:55 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > +static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
>> > +                              unsigned int count)
>> > +{
>> > +    switch ( cmd )
>> > +    {
>> > +        /*
>> > +         * Only the following Grant Ops have been verified for PVH guest, hence
>> > +         * we check for them here.
>> > +         */
>> > +        case GNTTABOP_map_grant_ref:
>> > +        case GNTTABOP_unmap_grant_ref:
>> > +        case GNTTABOP_setup_table:
>> > +        case GNTTABOP_copy:
>> > +        case GNTTABOP_query_size:
>> > +        case GNTTABOP_set_version:
>> > +            return do_grant_table_op(cmd, uop, count);
>> > +    }
>> > +    return -ENOSYS;
>> > +}
>> 
>> As said before - I object to this sort of white listing. A PVH guest
>> ought to be permitted to issue any hypercall, with the sole
>> exception of MMU and very few other ones. So if anything,
>> specific hypercall functions should be black listed.
> 
> Well, like I said before, these are verified/tested with PVH currently, 
> and during the early stages we need to do whatever to catch things as 
> bugs come in. I can make it DEBUG only if that makes it easier for you?
> I'd rather see a post here saying they got ENOSYS than saying they got
> weird crash/hang/etc...

Then this patch series really ought to continue to be RFC, and
I start questioning why I'm spending hours reviewing it. The
number of hacks you need clearly should be limited - to me it is
unacceptable to scatter half done code all over the tree. I had
the same problem when I did the 32-on-64 support, and iirc I
got things into largely hack free state before even posting the
first full, non-RFC series.

> BTW, similar checks exist in hvm_physdev_op(),hvm_vcpu_op()......

And of course the comment applies to all of them.

>> > +static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
>> > +    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
>> > +    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
>> > +    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
>> > +    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
>> > +    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
>> > +    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
>> > +    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
>> > +    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
>> > +    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
>> > +    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
>> > +    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
>> > +    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
>> > +    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
>> > +    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
>> > +};
>> 
>> Is this table complete? What about multicalls, timer_op, kexec_op,
>> tmem_op, mca? I again think that copying hypercall_table and
>> adjusting the entries you explicitly need to override might be
>> better than creating yet another table that needs attention when
>> a new hypercall gets added.
> 
> Nope.  Like I've said a few times before, we are not fully done with PVH with
> this patch set. This patch set establishes a minimum baseline to get linux to
> boot with PVH enabled. Next would be to go through timer_op, look at
> do_timer_op, verify/test it out for PVH, and add it here. Then mca, tmem, ....
> When we are fully satisfied with PVH completeness, we can
> remove this table if you wish, or make it DEBUG so one knows right away
> what PVH supports at that time.

Having that extra stuff only in debug builds would at least look
slightly more acceptable.

> I am not sure I understand what you mean by copying hypercall_table. You
> mean copy all the calls in this table above from entry.S?

Yes - memcpy() the whole table, then overwrite the (few) entries
you need to overwrite. After all, in the long run adding a new
hypercall ought to "just work" for PVH (and in most cases even for
HVM).
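
Roughly (untested sketch; hypercall_table being the PV table defined in
entry.S, suitably declared):

    memcpy(pvh_hypercall64_table, hypercall_table,
           sizeof(pvh_hypercall64_table));
    /* then override just the PVH-specific entries: */
    pvh_hypercall64_table[__HYPERVISOR_grant_table_op] =
        (hvm_hypercall_t *)pvh_grant_table_op;
    pvh_hypercall64_table[__HYPERVISOR_physdev_op] =
        (hvm_hypercall_t *)pvh_physdev_op;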

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-23 21:25 ` [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c Mukesh Rathor
  2013-04-24  8:47   ` Jan Beulich
@ 2013-04-25 11:19   ` Tim Deegan
  1 sibling, 0 replies; 72+ messages in thread
From: Tim Deegan @ 2013-04-25 11:19 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: Xen-devel

Hi, 

At 14:25 -0700 on 23 Apr (1366727159), Mukesh Rathor wrote:
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -567,6 +567,7 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
>  int microcode_resume_cpu(int cpu);
>  
>  void pv_cpuid(struct cpu_user_regs *regs);
> +int emulate_forced_invalid_op(struct cpu_user_regs *regs, unsigned long *);

This is much better than the previous changes, thanks.  Can you please
add a comment here describing what the function does, and in particular
how the new argument affects whether page faults are injected?
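
Something along these lines, if I've understood the rc/addr semantics
correctly:

    /*
     * Emulate the forced invalid op instruction. Returns 0 if it is not
     * a valid emulation signature (the caller then injects #UD), 1
     * otherwise. In the latter case *addr is set non-zero if copying
     * the instruction from the guest failed, and the caller should
     * inject a page fault for that address instead.
     */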

Tim.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/17] PVH xen: p2m related changes.
  2013-04-23 21:26 ` [PATCH 13/17] PVH xen: p2m related changes Mukesh Rathor
@ 2013-04-25 11:28   ` Tim Deegan
  2013-04-25 21:59     ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Tim Deegan @ 2013-04-25 11:28 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: Xen-devel

At 14:26 -0700 on 23 Apr (1366727162), Mukesh Rathor wrote:
> In this patch, I introduce  a new type p2m_map_foreign for pages that a
> dom0 maps from foreign domains its creating. Also, add
> set_foreign_p2m_entry() to map p2m_map_foreign type pages. Other misc changes
> related to p2m.

You haven't addressed my comments from v2:
http://lists.xen.org/archives/html/xen-devel/2013-03/msg01895.html

In particular this:

>      /* Track the highest gfn for which we have ever had a valid mapping */
> -    if ( p2mt != p2m_invalid &&
> +    if ( p2mt != p2m_invalid && p2mt != p2m_mmio_dm &&
>           (gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
>          p2m->max_mapped_pfn = gfn + (1UL << order) - 1;

seems like a big change to be hiding away in a 'p2m related changes'
patch without a good explanation of what you're doing and why.

Tim.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 14/17] PVH xen: Add and remove foreign pages
  2013-04-23 21:26 ` [PATCH 14/17] PVH xen: Add and remove foreign pages Mukesh Rathor
@ 2013-04-25 11:38   ` Tim Deegan
  0 siblings, 0 replies; 72+ messages in thread
From: Tim Deegan @ 2013-04-25 11:38 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: Xen-devel

At 14:26 -0700 on 23 Apr (1366727163), Mukesh Rathor wrote:
> In this patch, a new function, xenmem_add_foreign_to_pmap(), is added
> to map pages from foreign guest into current dom0 for domU creation.
> Also, allow XENMEM_remove_from_physmap to remove p2m_map_foreign
> pages. Note, in this path, we must release the refcount that was taken
> during the map phase.

Much better, thanks!

One comment: 

> +    if ( currd->domain_id == foreign_domid || foreign_domid == DOMID_SELF ||
> +         !is_pvh_domain(currd) )
> +        return -EINVAL;

If you're not going to implement XENMAPSPACE_gmfn_foreign for normal HVM
domains, can you please add a comment to public/memory.h to say that
it's PVH-only on x86. 

Thanks,

Tim.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/17] PVH xen: p2m related changes.
  2013-04-25 11:28   ` Tim Deegan
@ 2013-04-25 21:59     ` Mukesh Rathor
  2013-04-26  8:53       ` Tim Deegan
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-25 21:59 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Xen-devel

On Thu, 25 Apr 2013 12:28:13 +0100
Tim Deegan <tim@xen.org> wrote:

> At 14:26 -0700 on 23 Apr (1366727162), Mukesh Rathor wrote:
> > In this patch, I introduce  a new type p2m_map_foreign for pages
> > that a dom0 maps from foreign domains its creating. Also, add
> > set_foreign_p2m_entry() to map p2m_map_foreign type pages. Other
> > misc changes related to p2m.
> 
> You haven't addressed my comments from v2:
> http://lists.xen.org/archives/html/xen-devel/2013-03/msg01895.html
> 
> In particular this:
> 
> >      /* Track the highest gfn for which we have ever had a valid
> > mapping */
> > -    if ( p2mt != p2m_invalid &&
> > +    if ( p2mt != p2m_invalid && p2mt != p2m_mmio_dm &&
> >           (gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
> >          p2m->max_mapped_pfn = gfn + (1UL << order) - 1;
> 
> seems like a big change to be hiding away in a 'p2m related changes'
> patch without a good explanation of what you're doing and why.
> 
> Tim.

Remember:
http://lists.xen.org/archives/html/xen-devel/2012-03/msg02456.html

Ok, I'll comment saying "because Tim says so", just kidding..... 

Ok, how about a comment like this:

/*
 * PVH maps the entire mmio 1-1. Hence we must adjust the max_mapped_pfn
 * to account for that. Such mfns are p2m_mmio_direct and !mfn_valid.
 */

I can't think of a better way of saying it, so please feel free to send
a better version if you can.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-25  8:36       ` Jan Beulich
@ 2013-04-26  1:16         ` Mukesh Rathor
  2013-04-26  1:58           ` Mukesh Rathor
  2013-04-26  7:20           ` Jan Beulich
  0 siblings, 2 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-26  1:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, 25 Apr 2013 09:36:56 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 24 Apr 2013 09:47:55 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > +static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
> >> > +                              unsigned int count)
> >> > +{
> >> > +    switch ( cmd )
> >> > +    {
> >> > +        /*
> >> > +         * Only the following Grant Ops have been verified for PVH guest, hence
> >> > +         * we check for them here.
> >> > +         */
> >> > +        case GNTTABOP_map_grant_ref:
> >> > +        case GNTTABOP_unmap_grant_ref:
> >> > +        case GNTTABOP_setup_table:
> >> > +        case GNTTABOP_copy:
> >> > +        case GNTTABOP_query_size:
> >> > +        case GNTTABOP_set_version:
> >> > +            return do_grant_table_op(cmd, uop, count);
> >> > +    }
> >> > +    return -ENOSYS;
> >> > +}
> >> 
> >> As said before - I object to this sort of white listing. A PVH
> >> guest ought to be permitted to issue any hypercall, with the sole
> >> exception of MMU and very few other ones. So if anything,
> >> specific hypercall functions should be black listed.
> > 
> > Well, like I said before, these are verified/tested with PVH
> > currently, and during the early stages we need to do whatever to
> > catch things as bugs come in. I can make it DEBUG only if that
> > makes it easier for you? I'd rather see a post here saying they got
> > ENOSYS than saying they got weird crash/hang/etc...
> 
> Then this patch series really ought to continue to be RFC, and
> I start questioning why I'm spending hours reviewing it. The
> number of hacks you need clearly should be limited - to me it is
> unacceptable to scatter half done code all over the tree. I had
> the same problem when I did the 32-on-64 support, and iirc I
> got things into largely hack free state before even posting the
> first full, non-RFC series.

I really appreciate your time reviewing it. Given the size of the
feature and that I'm the only one working on it, the only way I know
is to do it in steps, and that sometimes requires temporary code.

I'll ifdef DEBUG the above code.

>> I am not sure I understand what you mean by copying hypercall_table. You
>> mean copy all the calls in this table above from entry.S?  

>Yes - memcpy() the whole table, then overwrite the (few) entries
>you need to overwrite. After all, in the long run adding a new
>hypercall ought to "just work" for PVH (and in most cases even for

How would a poor soul who is trying to find all callers of do_xxx()
find it then? And is it really that often that hypercalls are added
that it is such a big deal?

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-04-24  9:28   ` Jan Beulich
@ 2013-04-26  1:18     ` Mukesh Rathor
  2013-04-26  7:22       ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-26  1:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 10:28:35 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > +    /* If the e820 ended under 4GB, we must map the remaining space upto 4GB */
> > +    if ( end < GB(4) )
> > +    {
> > +        start_pfn = PFN_UP(end);
> > +        end_pfn = (GB(4)) >> PAGE_SHIFT;
> > +        nump = end_pfn - start_pfn;
> > +        rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);
> > +        BUG_ON(rc);
> > +    }
> 
> That's necessary, but not sufficient. Or did I overlook MMIO ranges
> getting added somewhere else for Dom0, when they sit above the
> highest E820 covered address?

construct_dom0() adds the entire range:

    /* DOM0 is permitted full I/O capabilities. */
    rc |= ioports_permit_access(dom0, 0, 0xFFFF);
    rc |= iomem_permit_access(dom0, 0UL, ~0UL);

Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-26  1:16         ` Mukesh Rathor
@ 2013-04-26  1:58           ` Mukesh Rathor
  2013-04-26  7:29             ` Jan Beulich
  2013-04-26  7:20           ` Jan Beulich
  1 sibling, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-26  1:58 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: Jan Beulich, xen-devel

On Thu, 25 Apr 2013 18:16:28 -0700
Mukesh Rathor <mukesh.rathor@oracle.com> wrote:

> On Thu, 25 Apr 2013 09:36:56 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
> > >>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com>
> > >>> wrote:
> > >> > +         */
> > >> > +        case GNTTABOP_map_grant_ref:
> > >> > +        case GNTTABOP_unmap_grant_ref:
> > >> > +        case GNTTABOP_setup_table:
> > >> > +        case GNTTABOP_copy:
> > >> > +        case GNTTABOP_query_size:
> > >> > +        case GNTTABOP_set_version:
> > >> > +            return do_grant_table_op(cmd, uop, count);
> > >> > +    }
> > >> > +    return -ENOSYS;
> > >> > +}
> > >> 
> > >> As said before - I object to this sort of white listing. A PVH
> > >> guest ought to be permitted to issue any hypercall, with the sole
> > >> exception of MMU and very few other ones. So if anything,
> > >> specific hypercall functions should be black listed.
> > > 
> > > Well, like I said before, these are verified/tested with PVH
> > > currently, and during the early stages we need to do whatever to
> > > catch things as bugs come in. I can make it DEBUG only if that
> > > makes it easier for you? I'd rather see a post here saying they
> > > got ENOSYS than saying they got weird crash/hang/etc...
> > 
> > Then this patch series really ought to continue to be RFC, and
> > I start questioning why I'm spending hours reviewing it. The
> > number of hacks you need clearly should be limited - to me it is
> > unacceptable to scatter half done code all over the tree. I had
> > the same problem when I did the 32-on-64 support, and iirc I
> > got things into largely hack free state before even posting the
> > first full, non-RFC series.
> 
> I really appreciate your time reviewing it. Given the size of the
> feature and that I'm the only one working on it, the only way I know
> is to do it in steps, and that sometimes requires temporary code.
> 
> I'll ifdef DEBUG the above code.

Actually, on second thought, would you be OK if I just added
return -ENOSYS to the do_grant_table_op() for calls that are not in
above list?

thx,
m

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-26  1:16         ` Mukesh Rathor
  2013-04-26  1:58           ` Mukesh Rathor
@ 2013-04-26  7:20           ` Jan Beulich
  2013-04-27  2:06             ` Mukesh Rathor
  1 sibling, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-26  7:20 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 26.04.13 at 03:16, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 25 Apr 2013 09:36:56 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>>> I am not sure I understand what you mean by copying hypercall_table. You
>>> mean copy all the calls in this table above from entry.S?  
> 
>>Yes - memcpy() the whole table, then overwrite the (few) entries
>>you need to overwrite. After all, in the long run adding a new
>>hypercall ought to "just work" for PVH (and in most cases even for
> 
> How would a poor soul who is trying to find all callers of do_xxx()
> find it then? And is it really that often that hypercalls are added
> that it is such a big deal?

It's not that often, but I nevertheless dislike that redundancy.
Ideally there would be just one hypercall table (not considering
the compat case, which has to have a different one because of
the different calling convention), and the hypercall handlers
would take care of the details...

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-04-26  1:18     ` Mukesh Rathor
@ 2013-04-26  7:22       ` Jan Beulich
  2013-05-10  1:53         ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-04-26  7:22 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 26.04.13 at 03:18, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Apr 2013 10:28:35 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > +    /* If the e820 ended under 4GB, we must map the remaining space upto 4GB */
>> > +    if ( end < GB(4) )
>> > +    {
>> > +        start_pfn = PFN_UP(end);
>> > +        end_pfn = (GB(4)) >> PAGE_SHIFT;
>> > +        nump = end_pfn - start_pfn;
>> > +        rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);
>> > +        BUG_ON(rc);
>> > +    }
>> 
>> That's necessary, but not sufficient. Or did I overlook MMIO ranges
>> getting added somewhere else for Dom0, when they sit above the
>> highest E820 covered address?
> 
> construct_dom0() adds the entire range:
> 
>     /* DOM0 is permitted full I/O capabilities. */
>     rc |= ioports_permit_access(dom0, 0, 0xFFFF);
>     rc |= iomem_permit_access(dom0, 0UL, ~0UL);

Which does not create any mappings at all - these are just
permissions being granted.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-26  1:58           ` Mukesh Rathor
@ 2013-04-26  7:29             ` Jan Beulich
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Beulich @ 2013-04-26  7:29 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 26.04.13 at 03:58, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 25 Apr 2013 18:16:28 -0700
> Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> 
>> On Thu, 25 Apr 2013 09:36:56 +0100
>> "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>> > >>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com>
>> > >>> wrote:
>> > >> > +         */
>> > >> > +        case GNTTABOP_map_grant_ref:
>> > >> > +        case GNTTABOP_unmap_grant_ref:
>> > >> > +        case GNTTABOP_setup_table:
>> > >> > +        case GNTTABOP_copy:
>> > >> > +        case GNTTABOP_query_size:
>> > >> > +        case GNTTABOP_set_version:
>> > >> > +            return do_grant_table_op(cmd, uop, count);
>> > >> > +    }
>> > >> > +    return -ENOSYS;
>> > >> > +}
>> > >> 
>> > >> As said before - I object to this sort of white listing. A PVH
>> > >> guest ought to be permitted to issue any hypercall, with the sole
>> > >> exception of MMU and very few other ones. So if anything,
>> > >> specific hypercall functions should be black listed.
>> > > 
>> > > Well, like I said before, these are verified/tested with PVH
>> > > currently, and during the early stages we need to do whatever to
>> > > catch things as bugs come in. I can make it DEBUG only if that
>> > > makes it easier for you? I'd rather see a post here saying they
>> > > got ENOSYS than saying they got weird crash/hang/etc...
>> > 
>> > Then this patch series really ought to continue to be RFC, and
>> > I start questioning why I'm spending hours reviewing it. The
>> > number of hacks you need clearly should be limited - to me it is
>> > unacceptable to scatter half done code all over the tree. I had
>> > the same problem when I did the 32-on-64 support, and iirc I
>> > got things into largely hack free state before even posting the
>> > first full, non-RFC series.
>> 
>> I really appreciate your time reviewing it. Given the size of the
>> feature and that I'm the only one working on it, the only way I know
>> is to do it in steps, and that sometimes requires temporary code.
>> 
>> I'll ifdef DEBUG the above code.
> 
> Actually, on second thought, would you be OK if I just added
> return -ENOSYS to the do_grant_table_op() for calls that are not in
> above list?

On a first glance this would be at least as bogus as adding a
frontend stub.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/17] PVH xen: p2m related changes.
  2013-04-25 21:59     ` Mukesh Rathor
@ 2013-04-26  8:53       ` Tim Deegan
  0 siblings, 0 replies; 72+ messages in thread
From: Tim Deegan @ 2013-04-26  8:53 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: Xen-devel

At 14:59 -0700 on 25 Apr (1366901973), Mukesh Rathor wrote:
> On Thu, 25 Apr 2013 12:28:13 +0100
> Tim Deegan <tim@xen.org> wrote:
> 
> > At 14:26 -0700 on 23 Apr (1366727162), Mukesh Rathor wrote:
> > > In this patch, I introduce  a new type p2m_map_foreign for pages
> > > that a dom0 maps from foreign domains its creating. Also, add
> > > set_foreign_p2m_entry() to map p2m_map_foreign type pages. Other
> > > misc changes related to p2m.
> > 
> > You haven't addressed my comments from v2:
> > http://lists.xen.org/archives/html/xen-devel/2013-03/msg01895.html
> > 
> > In particular this:
> > 
> > >      /* Track the highest gfn for which we have ever had a valid
> > > mapping */
> > > -    if ( p2mt != p2m_invalid &&
> > > +    if ( p2mt != p2m_invalid && p2mt != p2m_mmio_dm &&
> > >           (gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
> > >          p2m->max_mapped_pfn = gfn + (1UL << order) - 1;
> > 
> > seems like a big change to be hiding away in a 'p2m related changes'
> > patch without a good explanation of what you're doing and why.
> > 
> > Tim.
> 
> Remember:
> http://lists.xen.org/archives/html/xen-devel/2012-03/msg02456.html

Clearly not. :)  But in fact, since then we've already made this change:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b6f3a3cbf014b7adb992ffa697aca568ff7a7fcb

so I think you can just drop these changes from your patch.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-26  7:20           ` Jan Beulich
@ 2013-04-27  2:06             ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-27  2:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, 26 Apr 2013 08:20:11 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 26.04.13 at 03:16, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 25 Apr 2013 09:36:56 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> >>> On 25.04.13 at 02:57, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >>> I am not sure I understand what you mean by copying
> >>> hypercall_table. You mean copy all the calls in this table above
> >>> from entry.S?  
> > 
> >>Yes - memcpy() the whole table, then overwrite the (few) entries
> >>you need to overwrite. After all, in the long run adding a new
> >>hypercall ought to "just work" for PVH (and in most cases even for
> > 
> > How would a poor soul who is trying to find all callers of do_xxx()
> > find it then? And is it really that often that hypercalls are added
> > that it is such a big deal?
> 
> It's no that often, but I nevertheless dislike that redundancy.
> Ideally there would be just one hypercall table (not considering
> the compat case, which has to have a different one because of
> the different calling convention), and the hypercall handlers
> would take care of the details...

I beg to differ.  I'll make my code DEBUG in the next patch.  I'll let you
do the memcpy of the table - I wouldn't want my name on something I
consider to be bad software engineering practice.

thanks
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/17]  PVH xen: create PVH vmcs, and also initialization
  2013-04-24  7:42   ` Jan Beulich
@ 2013-04-30 21:01     ` Mukesh Rathor
  2013-04-30 21:04     ` Mukesh Rathor
  1 sibling, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-30 21:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, dexuan.cui

On Wed, 24 Apr 2013 08:42:49 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > Changes in V4:
> >   - Remove VM_ENTRY_LOAD_DEBUG_CTLS clearing.
> >   - Add 32bit kernel changes mark.
> >   - Verify pit_init call for PVH.
> 
> Verify in what way?
> 
..
> > +
> > +    if ( (rc = hvm_vcpu_cacheattr_init(v)) != 0 )
> > +    {
> > +        hvm_funcs.vcpu_destroy(v);
> > +        return rc;
> > +    }
> > +    if ( v->vcpu_id == 0 )
> > +        pit_init(v, cpu_khz);
> 
> I'm asking in particular because my understanding of "verify" would
> be checking of an eventual return value...

Function returns void. Verified that the speaker and pit IO would
be properly handled for PVH.

> > @@ -4512,6 +4582,8 @@ static int hvm_memory_event_traps(long p, uint32_t reason,
> >  void hvm_memory_event_cr0(unsigned long value, unsigned long old) 
> >  {
> > +    if ( is_pvh_vcpu(current) )
> > +        return;
> >      hvm_memory_event_traps(current->domain->arch.hvm_domain
> >                               .params[HVM_PARAM_MEMORY_EVENT_CR0],
> >                             MEM_EVENT_REASON_CR0,
> 
> So these checks are still there, with no mark of being temporary,
> despite having pointed out that they ought to be unnecessary once
> full PVH support is there. As with the 32-bit specific changes that
> the code currently lacks, such temporary adjustments should be
> marked clearly and completely, so subsequently one can locate
> them _all_. Just consider what happens if after phase I you get
> taken off this project, and someone else would have to complete
> it.

I put an action item in the cover letter:

"- Add support for monitoring guest behavior. See hvm_memory_event* functions
     in hvm.c"

I can add "PVH: fixme" comment tags too.

thanks
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/17]  PVH xen: create PVH vmcs, and also initialization
  2013-04-24  7:42   ` Jan Beulich
  2013-04-30 21:01     ` Mukesh Rathor
@ 2013-04-30 21:04     ` Mukesh Rathor
  1 sibling, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-04-30 21:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, dexuan.cui

On Wed, 24 Apr 2013 08:42:49 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> 
> > +    guest_pat = MSR_IA32_CR_PAT_RESET;
> > +    __vmwrite(GUEST_PAT, guest_pat);
> 
> What's the point of having the local variable "guest_pat" here?
> 

copy/paste from existing code in construct_vmcs():

        u64 host_pat, guest_pat;

        rdmsrl(MSR_IA32_CR_PAT, host_pat);
        guest_pat = MSR_IA32_CR_PAT_RESET;

        __vmwrite(HOST_PAT, host_pat);
        __vmwrite(GUEST_PAT, guest_pat);


I'll remove it in mine.

thanks
mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-24  8:47   ` Jan Beulich
  2013-04-25  0:57     ` Mukesh Rathor
@ 2013-05-01  0:51     ` Mukesh Rathor
  2013-05-01 13:52       ` Jan Beulich
  2013-05-02  1:17     ` Mukesh Rathor
  2013-05-11  0:30     ` Mukesh Rathor
  3 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-01  0:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 09:47:55 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > @@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
> >  
> >          vmx_clear_vmcs(v);
> >          vmx_load_vmcs(v);
> > -        if ( !is_pvh_vcpu(v) ) {
> > +        if ( !is_pvh_vcpu(v) )
> > +        {
> 
> Surely an unnecessary adjustment, if an earlier patch got it right
> from the beginning?

Hmm... I don't understand a lot of the time code, but PVH uses PV time
ops right now, so I don't need to worry about it. But the time thing needs
a revisit anyways with more vtsc modes added in phase II.

> > +    };
> > +
> > +    regs->eip += ilen;
> > +
> > +    /* gdbsx or another debugger. Never pause dom0 */
> > +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
> > +    {
> > +        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
> > +        current->arch.gdbsx_vcpu_event = TRAP_int3;
> > +        domain_pause_for_debugger();
> > +        return 0;
> > +    }
> > +
> > +    regs->eip -= ilen;
> 
> Please move the first adjustment into the if() body, making the
> second adjustment here unnecessary.

Actually, there could be more debuggers being used also, so if you don't
mind I'd like to leave it as is:

    regs->eip += ilen;

#if defined(XEN_KDB_CONFIG)
    if ( kdb_handle_trap_entry(TRAP_int3, regs) )
        return 0;
#endif
    /* gdbsx or another debugger. Never pause dom0 */
    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
......


> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > +{
> > +    ulong addr = 0;
> > +
> > +    if ( guest_kernel_mode(current, regs) ||
> > +         emulate_forced_invalid_op(regs, &addr) == 0 )
> > +    {
> > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> > +        return 0;
> > +    }
> > +    if ( addr )
> > +        hvm_inject_page_fault(0, addr);
> 
> This cannot be conditional upon addr being non-zero.

Why not? rc = emulate_forced_invalid_op():

   rc == 0 =>  not a valid emul signature. inject #UD.
   rc == 1 && addr != 0 => copy failed, need to inject PF
   rc == 1 && addr == 0 => emul done successfully

 
> > +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> > +{
> > +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> > +    {
> > +        u64 old_cr4 = __vmread(GUEST_CR4);
> > +
> > +        if ( (old_cr4 ^ (*regp)) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> > +            vpid_sync_all();
> > +
> > +        __vmwrite(GUEST_CR4, *regp);
> 
> No modification of CR4_READ_SHADOW here?

Added. BTW, I think I need to also set following unconditionally: 

     *regp |= X86_CR4_VMXE | X86_CR4_MCE;
     __vmwrite(GUEST_CR4, *regp);

in case the guest is turning them off.
 
> > +static int vmxit_io_instr(struct cpu_user_regs *regs)
> > +{
> > +    int curr_lvl;
> > +    int requested = (regs->rflags >> 12) & 3;
> > +
> > +    read_vmcs_selectors(regs);
> > +    curr_lvl = regs->cs & 3;
> 
> Shouldn't you look at SS'es DPL instead?

Ok. It looks like CPL is stored in both CS and SS, so either
should be ok. But I changed it to ss. 

> > +    switch ( (uint16_t)exit_reason )
> > +    {
> > +        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
> > +            rc = vmxit_exception(regs);
> > +            break;
> 
> Why would an NMI be blindly reflected to the guest?

I wish it was named EXIT_REASON_EXCEPTION_OR_NMI.
Anyways, TRAP_machine_check is handled in the caller. We handle other
exceptions here.
 
> > +        case EXIT_REASON_CPUID:              /* 10 */
> > +        {
> > +            if ( guest_kernel_mode(vp, regs) )
> > +                pv_cpuid(regs);
> > +            else
> > +                pvh_user_cpuid(regs);
> 
> What's the reason for this distinction? I would think it's actually a
> benefit of PVH to allow also hiding unwanted features from guest
> user mode (like HVM, but unlike PV without CPUID faulting).

I was trying to keep it exactly as in PV, where user mode would not
be trapped. I will just call pv_cpuid() for both then.
 
> > +int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
> > +                            const struct cpu_user_regs *regs,
> > +                            unsigned long *base, unsigned long *limit,
> > +                            unsigned int *ar)
> > +{
> > +    unsigned int tmp_ar = 0;
> > +    ASSERT(v == current);
> > +    ASSERT(is_pvh_vcpu(v));
> > +
> > +    if ( sel == (unsigned int)regs->cs )
> > +    {
> > +        *base = __vmread(GUEST_CS_BASE);
> > +        *limit = __vmread(GUEST_CS_LIMIT);
> > +        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
> > +    }
> > +    else if ( sel == (unsigned int)regs->ds )
> 
> This if/else-if sequence can't be right - a selector can be in more
> than one selector register (and one of them may have got reloaded
> after a GDT/LDT adjustment, while another may not), so you can't
> base the descriptor read upon the selector value. The caller will
> have to tell you which register it wants the descriptor for, not which
> selector.

Ah, right! Duh. I must have made the change at the same time as the
read_sreg macro.
 
thanks a lot for your time.
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-01  0:51     ` Mukesh Rathor
@ 2013-05-01 13:52       ` Jan Beulich
  2013-05-02  1:10         ` Mukesh Rathor
  2013-05-10  1:51         ` Mukesh Rathor
  0 siblings, 2 replies; 72+ messages in thread
From: Jan Beulich @ 2013-05-01 13:52 UTC (permalink / raw)
  To: mukesh.rathor; +Cc: xen-devel

>>> Mukesh Rathor <mukesh.rathor@oracle.com> 05/01/13 2:51 AM >>>
>On Wed, 24 Apr 2013 09:47:55 +0100
>"Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > @@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
>> >  
>> >          vmx_clear_vmcs(v);
>> >          vmx_load_vmcs(v);
>> > -        if ( !is_pvh_vcpu(v) ) {
>> > +        if ( !is_pvh_vcpu(v) )
>> > +        {
>> 
>> Surely an unnecessary adjustment, if an earlier patch got it right
>> from the beginning?
>
>Hmm... I don't understand a lot of the time code, but PVH uses PV time
>ops right now, so I don't need to worry about it. But the time thing needs
>a revisit anyways with more vtsc modes added in phase II.

The point was that this is a formatting only change, which you shouldn't
do here, but get things right where you insert the conditional.

>> > +    };
>> > +
>> > +    regs->eip += ilen;
>> > +
>> > +    /* gdbsx or another debugger. Never pause dom0 */
>> > +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
>> > +    {
>> > +        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
>> > +        current->arch.gdbsx_vcpu_event = TRAP_int3;
>> > +        domain_pause_for_debugger();
>> > +        return 0;
>> > +    }
>> > +
>> > +    regs->eip -= ilen;
>> 
>> Please move the first adjustment into the if() body, making the
>> second adjustment here unnecessary.
>
>Actually, there could be more debuggers being used also, so if you don't
>mind I'd like to leave it as is:

I do mind actually, not the least because I even consider it wrong to do the
adjustment before calling out to the debugger code. That code should be
handed the original state, not something already modified.
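
I.e. something like (sketch):

    #if defined(XEN_KDB_CONFIG)
        if ( kdb_handle_trap_entry(TRAP_int3, regs) )
            return 0;
    #endif
        /* gdbsx or another debugger. Never pause dom0 */
        if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
        {
            regs->eip += ilen;
            dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
            current->arch.gdbsx_vcpu_event = TRAP_int3;
            domain_pause_for_debugger();
            return 0;
        }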

>> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
>> > +{
>> > +    ulong addr = 0;
>> > +
>> > +    if ( guest_kernel_mode(current, regs) ||
>> > +         emulate_forced_invalid_op(regs, &addr) == 0 )
>> > +    {
>> > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
>> > +        return 0;
>> > +    }
>> > +    if ( addr )
>> > +        hvm_inject_page_fault(0, addr);
>> 
>> This cannot be conditional upon addr being non-zero.
>
>Why not? rc = emulate_forced_invalid_op():

Because zero can be a valid address that a fault occurred on.

>> > +static int vmxit_io_instr(struct cpu_user_regs *regs)
>> > +{
>> > +    int curr_lvl;
>> > +    int requested = (regs->rflags >> 12) & 3;
>> > +
>> > +    read_vmcs_selectors(regs);
>> > +    curr_lvl = regs->cs & 3;
>> 
>> Shouldn't you look at SS'es DPL instead?
>
>Ok. It looks like CPL is stored in both CS and SS, so either
>should be ok. But I changed it to ss. 

Your response reads as if you're still looking at the low two bits of the selector,
whereas me using DPL was intended to hint at you needing to look at the "hidden"
portion of the register.

>> > +    switch ( (uint16_t)exit_reason )
>> > +    {
>> > +        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
>> > +            rc = vmxit_exception(regs);
>> > +            break;
>> 
>> Why would an NMI be blindly reflected to the guest?
>
>I wish it was named EXIT_REASON_EXCEPTION_OR_NMI.
>Anyways, TRAP_machine_check is handled in the caller. We handle other
>exceptions here.

Okay, but then please add a comment to that effect.
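
Something like this would do (just a sketch):

        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
            /*
             * TRAP_machine_check is handled by the caller; all other
             * exceptions (and NMI) are dealt with here.
             */
            rc = vmxit_exception(regs);
            break;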
 
Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-01 13:52       ` Jan Beulich
@ 2013-05-02  1:10         ` Mukesh Rathor
  2013-05-02  6:42           ` Jan Beulich
  2013-05-10  1:51         ` Mukesh Rathor
  1 sibling, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-02  1:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 01 May 2013 14:52:27 +0100
"Jan Beulich" <jbeulich@suse.com> wrote:

> >>> Mukesh Rathor <mukesh.rathor@oracle.com> 05/01/13 2:51 AM >>>
> >On Wed, 24 Apr 2013 09:47:55 +0100
> >> > +    regs->eip += ilen;
> >> > +
> >> > +    /* gdbsx or another debugger. Never pause dom0 */
> >> > +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp,
> >> > regs) )
> >> > +    {
> >> > +        dbgp1("[%d]PVH: domain pause for debugger\n",
> >> > smp_processor_id());
> >> > +        current->arch.gdbsx_vcpu_event = TRAP_int3;
> >> > +        domain_pause_for_debugger();
> >> > +        return 0;
> >> > +    }
> >> > +
> >> > +    regs->eip -= ilen;
> >> 
> >> Please move the first adjustment into the if() body, making the
> >> second adjustment here unnecessary.
> >
> >Actually, there could be more debuggers being used also, so if you don't
> >mind I'd like to leave it as is:
> 
> I do mind actually, not the least because I even consider it wrong to
> do the adjustment before calling out to the debugger code. That code
> should be handed the original state, not something already modified.

In case of non-vmx, upon int3, the eip is advanced to eip+1, where in
case of vmx, eip is not. (I mean in vmx eip is pointing to CC where
in non-vmx it's pointing to eip+1 upon exception). Most debuggers will
check if (eip-1 == breakpoint addr) and take ownership if it is. 

> >> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> >> > +{
> >> > +    ulong addr = 0;
> >> > +
> >> > +    if ( guest_kernel_mode(current, regs) ||
> >> > +         emulate_forced_invalid_op(regs, &addr) == 0 )
> >> > +    {
> >> > +        hvm_inject_hw_exception(TRAP_invalid_op,
> >> > HVM_DELIVER_NO_ERROR_CODE);
> >> > +        return 0;
> >> > +    }
> >> > +    if ( addr )
> >> > +        hvm_inject_page_fault(0, addr);
> >> 
> >> This cannot be conditional upon addr being non-zero.
> >
> >Why not? rc = emulate_forced_invalid_op():
> 
> Because zero can be a valid address that a fault occurred on.

Hmm... for that to happen, the guest would have to cause a vmexit
with an invalid op at address 0. I didn't think that was possible.
An alternative would be to add a new return code: EXCRET_inject_pf.

thanks
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-24  8:47   ` Jan Beulich
  2013-04-25  0:57     ` Mukesh Rathor
  2013-05-01  0:51     ` Mukesh Rathor
@ 2013-05-02  1:17     ` Mukesh Rathor
  2013-05-02  6:53       ` Jan Beulich
  2013-05-11  0:30     ` Mukesh Rathor
  3 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-02  1:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 09:47:55 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> 
> > +int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
> > +                            const struct cpu_user_regs *regs,
> > +                            unsigned long *base, unsigned long
> > *limit,
> > +                            unsigned int *ar)
> > +{
> > +    unsigned int tmp_ar = 0;
> > +    ASSERT(v == current);
> > +    ASSERT(is_pvh_vcpu(v));
> > +
> > +    if ( sel == (unsigned int)regs->cs )
> > +    {
> > +        *base = __vmread(GUEST_CS_BASE);
> > +        *limit = __vmread(GUEST_CS_LIMIT);
> > +        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
> > +    }
> > +    else if ( sel == (unsigned int)regs->ds )
> 
> This if/else-if sequence can't be right - a selector can be in more
> than one selector register (and one of them may have got reloaded
> after a GDT/LDT adjustment, while another may not), so you can't
> base the descriptor read upon the selector value. The caller will
> have to tell you which register it wants the descriptor for, not which
> selector.

Ok, I redid it. Created a new function read_descriptor_sel() and rewrote
vmx_pvh_read_descriptor(). Please lmk if it looks ok to you. Thanks a lot:


static int read_descriptor_sel(unsigned int sel,
                               enum sel_type which_sel,
                               const struct vcpu *v,
                               const struct cpu_user_regs *regs,
                               unsigned long *base,
                               unsigned long *limit,
                               unsigned int *ar,
                               unsigned int vm86attr)
{
    if ( is_pvh_vcpu(v) )
        return hvm_read_descriptor(which_sel, v, regs, base, limit, ar);

    return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);

}

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d003ae2..776522e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1862,6 +1875,7 @@ static int is_cpufreq_controller(struct domain *d)
 
 int emulate_privileged_op(struct cpu_user_regs *regs)
 {
+    enum sel_type which_sel;
     struct vcpu *v = current;
     unsigned long *reg, eip = regs->eip;
     u8 opcode, modrm_reg = 0, modrm_rm = 0, rep_prefix = 0, lock = 0, rex = 0;
@@ -1884,9 +1898,10 @@ int emulate_privileged_op(struct cpu_user_regs *regs)
     void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1)));
     uint64_t val, msr_content;
 
-    if ( !read_descriptor(regs->cs, v, regs,
-                          &code_base, &code_limit, &ar,
-                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
+    if ( !read_descriptor_sel(regs->cs, SEL_CS, v, regs,
+                              &code_base, &code_limit, &ar,
+                              _SEGMENT_CODE|_SEGMENT_S|
+                              _SEGMENT_DPL|_SEGMENT_P) )
         goto fail;
     op_default = op_bytes = (ar & (_SEGMENT_L|_SEGMENT_DB)) ? 4 : 2;
     ad_default = ad_bytes = (ar & _SEGMENT_L) ? 8 : op_default;
@@ -1897,6 +1912,7 @@ int emulate_privileged_op(struct cpu_user_regs *regs)
 
     /* emulating only opcodes not allowing SS to be default */
     data_sel = read_segment_register(v, regs, ds);
+    which_sel = SEL_DS;
 
     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1912,23 +1928,29 @@ int emulate_privileged_op(struct cpu_user_regs *regs)
             continue;
         case 0x2e: /* CS override */
             data_sel = regs->cs;
+            which_sel = SEL_CS;
             continue;
         case 0x3e: /* DS override */
             data_sel = read_segment_register(v, regs, ds);
+            which_sel = SEL_DS;
             continue;
         case 0x26: /* ES override */
             data_sel = read_segment_register(v, regs, es);
+            which_sel = SEL_ES;
             continue;
         case 0x64: /* FS override */
             data_sel = read_segment_register(v, regs, fs);
+            which_sel = SEL_FS;
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
             data_sel = read_segment_register(v, regs, gs);
+            which_sel = SEL_GS;
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
             data_sel = regs->ss;
+            which_sel = SEL_SS;
             continue;
         case 0xf0: /* LOCK */
             lock = 1;
@@ -1972,15 +1994,16 @@ int emulate_privileged_op(struct cpu_user_regs *regs)
         if ( !(opcode & 2) )
         {
             data_sel = read_segment_register(v, regs, es);
+            which_sel = SEL_ES;
             lm_ovr = lm_seg_none;
         }
 
         if ( !(ar & _SEGMENT_L) )
         {
-            if ( !read_descriptor(data_sel, v, regs,
-                                  &data_base, &data_limit, &ar,
-                                  _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
-                                  _SEGMENT_P) )
+            if ( !read_descriptor_sel(data_sel, which_sel, v, regs,
+                                      &data_base, &data_limit, &ar,
+                                      _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
+                                      _SEGMENT_P) )
                 goto fail;
             if ( !(ar & _SEGMENT_S) ||
                  !(ar & _SEGMENT_P) ||
@@ -2010,9 +2033,9 @@ int emulate_privileged_op(struct cpu_user_regs *regs)
                 }
             }
             else
-                read_descriptor(data_sel, v, regs,
-                                &data_base, &data_limit, &ar,
-                                0);
+                read_descriptor_sel(data_sel, which_sel, v, regs,
+                                    &data_base, &data_limit, &ar,
+                                    0);
             data_limit = ~0UL;
             ar = _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P;
         }
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 4dca0a3..deecef4 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -199,6 +199,8 @@ DECLARE_PER_CPU(struct desc_struct *, compat_gdt_table);
 extern void set_intr_gate(unsigned int irq, void * addr);
 extern void load_TR(void);
 
+enum sel_type { SEL_NONE, SEL_CS, SEL_SS, SEL_DS, SEL_ES, SEL_GS, SEL_FS };
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ARCH_DESC_H */

=============================================================================

New version of int vmx_pvh_read_descriptor:

int vmx_pvh_read_descriptor(enum sel_type which_sel, const struct vcpu *v,
                            const struct cpu_user_regs *regs,
                            unsigned long *base, unsigned long *limit,
                            unsigned int *ar)
{
    unsigned int tmp_ar = 0;
    ASSERT(v == current);
    ASSERT(is_pvh_vcpu(v));

    switch ( which_sel )
    {
    case SEL_CS:
        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
        if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
        {
            *base = 0UL;
            *limit = ~0UL;
        }
        else
        {
            *base = __vmread(GUEST_CS_BASE);
            *limit = __vmread(GUEST_CS_LIMIT);
        }
        break;

    case SEL_DS:
        *base = __vmread(GUEST_DS_BASE);
        *limit = __vmread(GUEST_DS_LIMIT);
        tmp_ar = __vmread(GUEST_DS_AR_BYTES);
        break;

    case SEL_SS:
        *base = __vmread(GUEST_SS_BASE);
        *limit = __vmread(GUEST_SS_LIMIT);
        tmp_ar = __vmread(GUEST_SS_AR_BYTES);
        break;

    case SEL_GS:
        *base = __vmread(GUEST_GS_BASE);
        *limit = __vmread(GUEST_GS_LIMIT);
        tmp_ar = __vmread(GUEST_GS_AR_BYTES);
        break;

    case SEL_FS:
        *base = __vmread(GUEST_FS_BASE);
        *limit = __vmread(GUEST_FS_LIMIT);
        tmp_ar = __vmread(GUEST_FS_AR_BYTES);
        break;

    case SEL_ES:
        *base = __vmread(GUEST_ES_BASE);
        *limit = __vmread(GUEST_ES_LIMIT);
        tmp_ar = __vmread(GUEST_ES_AR_BYTES);
        break;

    default:
        gdprintk(XENLOG_WARNING, "Unmatched segment selector:%d\n", which_sel);
        return 0;
    }

    /* Fixup ar so that it looks the same as in native mode */
    *ar = (tmp_ar << 8);

    return 1;
}

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-02  1:10         ` Mukesh Rathor
@ 2013-05-02  6:42           ` Jan Beulich
  2013-05-03  1:03             ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-02  6:42 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 02.05.13 at 03:10, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 01 May 2013 14:52:27 +0100
> "Jan Beulich" <jbeulich@suse.com> wrote:
> 
>> >>> Mukesh Rathor <mukesh.rathor@oracle.com> 05/01/13 2:51 AM >>>
>> >On Wed, 24 Apr 2013 09:47:55 +0100
>> >> > +    regs->eip += ilen;
>> >> > +
>> >> > +    /* gdbsx or another debugger. Never pause dom0 */
>> >> > +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp,
>> >> > regs) )
>> >> > +    {
>> >> > +        dbgp1("[%d]PVH: domain pause for debugger\n",
>> >> > smp_processor_id());
>> >> > +        current->arch.gdbsx_vcpu_event = TRAP_int3;
>> >> > +        domain_pause_for_debugger();
>> >> > +        return 0;
>> >> > +    }
>> >> > +
>> >> > +    regs->eip -= ilen;
>> >> 
>> >> Please move the first adjustment into the if() body, making the
>> >> second adjustment here unnecessary.
>> >
>> >Actually, there could be more debuggers being used also, so if you don't
>> >mind I'd like to leave it as is:
>> 
>> I do mind actually, not the least because I even consider it wrong to
>> do the adjustment before calling out to the debugger code. That code
>> should be handed the original state, not something already modified.
> 
> In case of non-vmx, upon int3, the eip is advanced to eip+1, where in
> case of vmx, eip is not. (I mean in vmx eip is pointing to CC where
> in non-vmx it's pointing to eip+1 upon exception). Most debuggers will
> check if (eip-1 == breakpoint addr) and take ownership if it is. 

INT3 being a trap rather than a fault has always looked
odd to me, and in all debugger code I ever wrote I always
adjusted for that first thing in the handler (accepting or working
around the issue with the exception potentially having been
caused by CD 03 instead of CC).
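
For illustration, the first thing such a handler would do (hypothetical
handler, not code from the patch):

    static void bp_handler(struct cpu_user_regs *regs)
    {
        /*
         * INT3 is a trap, so eip already points past the 0xCC byte.
         * Undo that first, so eip names the breakpoint address itself
         * (modulo the CD 03 caveat above).
         */
        regs->eip -= 1;

        /* ... breakpoint table lookup, ownership check, etc. ... */
    }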

>> >> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
>> >> > +{
>> >> > +    ulong addr = 0;
>> >> > +
>> >> > +    if ( guest_kernel_mode(current, regs) ||
>> >> > +         emulate_forced_invalid_op(regs, &addr) == 0 )
>> >> > +    {
>> >> > +        hvm_inject_hw_exception(TRAP_invalid_op,
>> >> > HVM_DELIVER_NO_ERROR_CODE);
>> >> > +        return 0;
>> >> > +    }
>> >> > +    if ( addr )
>> >> > +        hvm_inject_page_fault(0, addr);
>> >> 
>> >> This cannot be conditional upon addr being non-zero.
>> >
>> >Why not? rc = emulate_forced_invalid_op():
>> 
>> Because zero can be a valid address that a fault occurred on.
> 
> Hmm... for that to happen, the guest would have to cause a vmexit
> with an invalid op at address 0. I didn't think that was possible.

Why would it not. You have to cover all guest kernels, and not
misbehave on malicious ones (i.e. those ought to get an
exception injected if so needed, no matter what address it
occurred on).

> An alternative would be to add a new return code: EXCRET_inject_pf.

Something along those lines, yes.
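
I.e., roughly the following, with EXCRET_inject_pf being the new,
yet-to-be-defined return code:

    rc = emulate_forced_invalid_op(regs, &addr);
    if ( rc == EXCRET_inject_pf )
    {
        /* Fault during emulation; addr may legitimately be zero. */
        hvm_inject_page_fault(0, addr);
        return 0;
    }
    if ( rc == 0 )
    {
        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
        return 0;
    }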

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-02  1:17     ` Mukesh Rathor
@ 2013-05-02  6:53       ` Jan Beulich
  2013-05-03  0:40         ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-02  6:53 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 02.05.13 at 03:17, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Ok, I redid it. Created a new function read_descriptor_sel() and rewrote
> vmx_pvh_read_descriptor(). Please lmk if it looks ok to you. Thanks a lot:
> 
> 
> static int read_descriptor_sel(unsigned int sel,
>                                enum sel_type which_sel,
>                                const struct vcpu *v,
>                                const struct cpu_user_regs *regs,
>                                unsigned long *base,
>                                unsigned long *limit,
>                                unsigned int *ar,
>                                unsigned int vm86attr)
> {
>     if ( is_pvh_vcpu(v) )
>         return hvm_read_descriptor(which_sel, v, regs, base, limit, ar);

Why not insert this into read_descriptor(), rather than creating a
new wrapper?

> --- a/xen/include/asm-x86/desc.h
> +++ b/xen/include/asm-x86/desc.h
> @@ -199,6 +199,8 @@ DECLARE_PER_CPU(struct desc_struct *, compat_gdt_table);
>  extern void set_intr_gate(unsigned int irq, void * addr);
>  extern void load_TR(void);
>  
> +enum sel_type { SEL_NONE, SEL_CS, SEL_SS, SEL_DS, SEL_ES, SEL_GS, SEL_FS };

I'd prefer if you re-used enum x86_segment instead of introducing
another enumeration.

> New version of int vmx_pvh_read_descriptor:
> 
> int vmx_pvh_read_descriptor(enum sel_type which_sel, const struct vcpu *v,
>                             const struct cpu_user_regs *regs,
>                             unsigned long *base, unsigned long *limit,
>                             unsigned int *ar)
> {
>     unsigned int tmp_ar = 0;
>     ASSERT(v == current);
>     ASSERT(is_pvh_vcpu(v));
> 
>     switch ( which_sel )
>     {
>     case SEL_CS:
>         tmp_ar = __vmread(GUEST_CS_AR_BYTES);
>         if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
>         {
>             *base = 0UL;
>             *limit = ~0UL;
>         }
>         else
>         {
>             *base = __vmread(GUEST_CS_BASE);
>             *limit = __vmread(GUEST_CS_LIMIT);
>         }
>         break;
> 
>     case SEL_DS:
>         *base = __vmread(GUEST_DS_BASE);
>         *limit = __vmread(GUEST_DS_LIMIT);

This (as well as SS and ES handling) needs to be consistent with
CS handling - either you rely on the VMCS fields to be correct
even for long mode, or you override base and limit based upon
the _CS_ access rights having the L bit set.

>         tmp_ar = __vmread(GUEST_DS_AR_BYTES);
>         break;
> 
>     case SEL_SS:
>         *base = __vmread(GUEST_SS_BASE);
>         *limit = __vmread(GUEST_SS_LIMIT);
>         tmp_ar = __vmread(GUEST_SS_AR_BYTES);
>         break;
> 
>     case SEL_GS:
>         *base = __vmread(GUEST_GS_BASE);
>         *limit = __vmread(GUEST_GS_LIMIT);
>         tmp_ar = __vmread(GUEST_GS_AR_BYTES);
>         break;
> 
>     case SEL_FS:
>         *base = __vmread(GUEST_FS_BASE);
>         *limit = __vmread(GUEST_FS_LIMIT);
>         tmp_ar = __vmread(GUEST_FS_AR_BYTES);
>         break;
> 
>     case SEL_ES:
>         *base = __vmread(GUEST_ES_BASE);
>         *limit = __vmread(GUEST_ES_LIMIT);
>         tmp_ar = __vmread(GUEST_ES_AR_BYTES);
>         break;

While secondary, I'm also a bit puzzled about the non-natural and
non-logical ordering (CS, DS, SS, GS, FS, ES)...

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-02  6:53       ` Jan Beulich
@ 2013-05-03  0:40         ` Mukesh Rathor
  2013-05-03  6:33           ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-03  0:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, 02 May 2013 07:53:18 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 02.05.13 at 03:17, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > Ok, I redid it. Created a new function read_descriptor_sel() and
> > rewrote vmx_pvh_read_descriptor(). Please lmk if it looks ok to you.
> > Thanks a lot:
> > 
> > 
> > static int read_descriptor_sel(unsigned int sel,
> >                                enum sel_type which_sel,
> >                                const struct vcpu *v,
> >                                const struct cpu_user_regs *regs,
> >                                unsigned long *base,
> >                                unsigned long *limit,
> >                                unsigned int *ar,
> >                                unsigned int vm86attr)
> > {
> >     if ( is_pvh_vcpu(v) )
> >         return hvm_read_descriptor(which_sel, v, regs, base, limit,
> > ar);
> 
> Why not insert this into read_descriptor(), rather than creating a
> new wrapper?

There are other callers of read_descriptor() which would need to be
unnecessarily changed; we need PVH support for only one caller, so this
seemed the least intrusive.

> > --- a/xen/include/asm-x86/desc.h
> > +++ b/xen/include/asm-x86/desc.h
> > @@ -199,6 +199,8 @@ DECLARE_PER_CPU(struct desc_struct *,
> > compat_gdt_table); extern void set_intr_gate(unsigned int irq, void
> > * addr); extern void load_TR(void);
> >  
> > +enum sel_type { SEL_NONE, SEL_CS, SEL_SS, SEL_DS, SEL_ES, SEL_GS,
> > SEL_FS };
> 
> I'd prefer if you re-used enum x86_segment instead of introducing
> another enumeration.

Of course. I looked for an existing one, but didn't look hard enough :).
 
> This (as well as SS and ES handling) needs to be consistent with
> CS handling - either you rely on the VMCS fields to be correct
> even for long mode, or you override base and limit based upon
> the _CS_ access rights having the L bit set.

Right.

> While secondary, I'm also a bit puzzled about the non-natural and
> non-logical ordering (CS, DS, SS, GS, FS, ES)...

Not sure what the natural ordering is, so I sorted according to
the enum x86_segment:

int vmx_pvh_read_descriptor(enum x86_segment selector, const struct vcpu *v,
                            const struct cpu_user_regs *regs,
                            unsigned long *base, unsigned long *limit,
                            unsigned int *ar)
{
    unsigned int tmp_ar = 0;
    ASSERT(v == current);
    ASSERT(is_pvh_vcpu(v));

    switch ( selector )
    {
    case x86_seg_cs:
        *base = __vmread(GUEST_CS_BASE);
        *limit = __vmread(GUEST_CS_LIMIT);
        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
        break;

    case x86_seg_ss:
        *base = __vmread(GUEST_SS_BASE);
        *limit = __vmread(GUEST_SS_LIMIT);
        tmp_ar = __vmread(GUEST_SS_AR_BYTES);
        break;

    case x86_seg_ds:
        *base = __vmread(GUEST_DS_BASE);
        *limit = __vmread(GUEST_DS_LIMIT);
        tmp_ar = __vmread(GUEST_DS_AR_BYTES);
        break;

    case x86_seg_es:
        *base = __vmread(GUEST_ES_BASE);
        *limit = __vmread(GUEST_ES_LIMIT);
        tmp_ar = __vmread(GUEST_ES_AR_BYTES);
        break;

    case x86_seg_fs:
        *base = __vmread(GUEST_FS_BASE);
        *limit = __vmread(GUEST_FS_LIMIT);
        tmp_ar = __vmread(GUEST_FS_AR_BYTES);
        break;

    case x86_seg_gs:
        *base = __vmread(GUEST_GS_BASE);
        *limit = __vmread(GUEST_GS_LIMIT);
        tmp_ar = __vmread(GUEST_GS_AR_BYTES);
        break;

    default:
        gdprintk(XENLOG_WARNING, "Unmatched segment selector:%d\n", selector);
        return 0;
    }

    if ( (tmp_ar & X86_SEG_AR_CS_LM_ACTIVE) && selector < x86_seg_fs  )
    {
        *base = 0UL;
        *limit = ~0UL;
    }

    /* Fix ar so that it looks the same as in native mode */
    *ar = (tmp_ar << 8);

    return 1;
}

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-02  6:42           ` Jan Beulich
@ 2013-05-03  1:03             ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-03  1:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, 02 May 2013 07:42:16 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 02.05.13 at 03:10, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 01 May 2013 14:52:27 +0100
> > "Jan Beulich" <jbeulich@suse.com> wrote:
> >> >> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> >> >> > +{
> >> >> > +    ulong addr = 0;
> >> >> > +
> >> >> > +    if ( guest_kernel_mode(current, regs) ||
> >> >> > +         emulate_forced_invalid_op(regs, &addr) == 0 )
> >> >> > +    {
> >> >> > +        hvm_inject_hw_exception(TRAP_invalid_op,
> >> >> > HVM_DELIVER_NO_ERROR_CODE);
> >> >> > +        return 0;
> >> >> > +    }
> >> >> > +    if ( addr )
> >> >> > +        hvm_inject_page_fault(0, addr);
> >> >> 
> >> >> This cannot be conditional upon addr being non-zero.
> >> >
> >> >Why not? rc = emulate_forced_invalid_op():
> >> 
> >> Because zero can be a valid address that a fault occurred on.
> > 
> > Hmm... for that to happen, the guest would have to cause a vmexit
> > with an invalid op at address 0. I didn't think that was possible.
> 
> Why would it not. You have to cover all guest kernels, and not
> misbehave on malicious ones (i.e. those ought to get an
> exception injected if so needed, no matter what address it
> occurred on).
> 
> > An alternative would be to add a new return code: EXCRET_inject_pf.
> 
> Something along those lines, yes.

Actually, sigh, I realized I missed emulate_privileged_op() and the
insn_fetch macro, which calls propagate_page_fault for PVH also. So I
am thinking of just giving in and writing up a
pvh_propagate_page_fault() function that propagate_page_fault() can
just call. Then emulate_forced_invalid_op() can remain as is.

Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-03  0:40         ` Mukesh Rathor
@ 2013-05-03  6:33           ` Jan Beulich
  2013-05-04  1:40             ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-03  6:33 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 03.05.13 at 02:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 02 May 2013 07:53:18 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 02.05.13 at 03:17, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > Ok, I redid it. Created a new function read_descriptor_sel() and
>> > rewrote vmx_pvh_read_descriptor(). Please lmk if it looks ok to you.
>> > Thanks a lot:
>> > 
>> > 
>> > static int read_descriptor_sel(unsigned int sel,
>> >                                enum sel_type which_sel,
>> >                                const struct vcpu *v,
>> >                                const struct cpu_user_regs *regs,
>> >                                unsigned long *base,
>> >                                unsigned long *limit,
>> >                                unsigned int *ar,
>> >                                unsigned int vm86attr)
>> > {
>> >     if ( is_pvh_vcpu(v) )
>> >         return hvm_read_descriptor(which_sel, v, regs, base, limit,
>> > ar);
>> 
>> Why not insert this into read_descriptor(), rather than creating a
>> new wrapper?
> 
> There are other callers of read_descriptor() which would need to be
> unnecessarily changed; we need PVH support for only one caller, so this
> seemed the least intrusive.

Ah, okay - that's fine then.

>> While secondary, I'm also a bit puzzled about the non-natural and
>> non-logical ordering (CS, DS, SS, GS, FS, ES)...
> 
> Not sure what the natural ordering is, so I sorted according to
> the enum x86_segment:

Yes, that's one of the three reasonable orderings now. The others
would be by register number or alphabetically.

> int vmx_pvh_read_descriptor(enum x86_segment selector, const struct vcpu *v,
>                             const struct cpu_user_regs *regs,
>                             unsigned long *base, unsigned long *limit,
>                             unsigned int *ar)
> {
>     unsigned int tmp_ar = 0;
>     ASSERT(v == current);
>     ASSERT(is_pvh_vcpu(v));
> 
>     switch ( selector )
>     {
>     case x86_seg_cs:
>         *base = __vmread(GUEST_CS_BASE);
>         *limit = __vmread(GUEST_CS_LIMIT);
>         tmp_ar = __vmread(GUEST_CS_AR_BYTES);
>         break;
> 
>     case x86_seg_ss:
>         *base = __vmread(GUEST_SS_BASE);
>         *limit = __vmread(GUEST_SS_LIMIT);
>         tmp_ar = __vmread(GUEST_SS_AR_BYTES);
>         break;
> 
>     case x86_seg_ds:
>         *base = __vmread(GUEST_DS_BASE);
>         *limit = __vmread(GUEST_DS_LIMIT);
>         tmp_ar = __vmread(GUEST_DS_AR_BYTES);
>         break;
> 
>     case x86_seg_es:
>         *base = __vmread(GUEST_ES_BASE);
>         *limit = __vmread(GUEST_ES_LIMIT);
>         tmp_ar = __vmread(GUEST_ES_AR_BYTES);
>         break;
> 
>     case x86_seg_fs:
>         *base = __vmread(GUEST_FS_BASE);
>         *limit = __vmread(GUEST_FS_LIMIT);
>         tmp_ar = __vmread(GUEST_FS_AR_BYTES);
>         break;
> 
>     case x86_seg_gs:
>         *base = __vmread(GUEST_GS_BASE);
>         *limit = __vmread(GUEST_GS_LIMIT);
>         tmp_ar = __vmread(GUEST_GS_AR_BYTES);
>         break;
> 
>     default:
>         gdprintk(XENLOG_WARNING, "Unmatched segment selector:%d\n", selector);

This message is now stale and hence confusing.

And with the way the function is now I don't see why at least the
whole switch can't be dropped, and the function instead call
vmx_get_segment_register(); perhaps that could even be done
in vendor independent code, calling hvm_get_segment_register().

>         return 0;
>     }
> 
>     if ( (tmp_ar & X86_SEG_AR_CS_LM_ACTIVE) && selector < x86_seg_fs  )

This is still wrong. As said before you need to look as the _CS_
access rights, not the ones of the selector register you read.

But as also hinted at - do you really need the override at all?

Jan

>     {
>         *base = 0UL;
>         *limit = ~0UL;
>     }
> 
>     /* Fix ar so that it looks the same as in native mode */
>     *ar = (tmp_ar << 8);
> 
>     return 1;
> }

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-03  6:33           ` Jan Beulich
@ 2013-05-04  1:40             ` Mukesh Rathor
  2013-05-06  6:44               ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-04  1:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, 03 May 2013 07:33:50 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 03.05.13 at 02:40, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 02 May 2013 07:53:18 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> > 
> >     if ( (tmp_ar & X86_SEG_AR_CS_LM_ACTIVE) && selector <
> > x86_seg_fs  )
> 
This is still wrong. As said before you need to look at the _CS_
access rights, not the ones of the selector register you read.

Hmm... unless I'm reading the SDM wrong, it says "for non-code segments
bit 21 is reserved and should always be set to 0". But it's probably
clearer to check for _CS_ only.

> But as also hinted at - do you really need the override at all?

Yes, because of the following check in insn_fetch macro:

     "(eip) > (limit) - (sizeof(_x) - 1)" in the if statment:

   if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) )   \
       goto fail;                                                          \

Reading the VMCS would return the 32-bit limit of 0xffffffff, so a
64-bit rip above 4GB would spuriously fail the check. BTW, the same
override exists in read_descriptor() (it seems to do the override for FS and
GS also, which I don't understand).


Anyways, thanks to hvm_get_segment_register(), I got rid of the function 
vmx_pvh_read_descriptor():

static int read_descriptor_sel(unsigned int sel,
                               enum x86_segment which_sel,
                               struct vcpu *v,
                               const struct cpu_user_regs *regs,
                               unsigned long *base,
                               unsigned long *limit,
                               unsigned int *ar,
                               unsigned int vm86attr)
{
    if ( is_pvh_vcpu(v) )
    {
        struct segment_register seg;

        hvm_get_segment_register(v, which_sel, &seg);
        *ar = (unsigned int)seg.attr.bytes;

        /* ar is returned packed as in segment_attributes_t. fix it up */
        *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
        *ar = *ar << 8;

        if ( (vm86attr & _SEGMENT_CODE) && (*ar & _SEGMENT_L) &&
             (which_sel < x86_seg_fs) )
        {
            *base = 0UL;
            *limit = ~0UL;
        }
        else
        {
            *base = (unsigned long)seg.base;
            *limit = (unsigned long)seg.limit;
        }

        return 1;
    }

    return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);

}

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-04  1:40             ` Mukesh Rathor
@ 2013-05-06  6:44               ` Jan Beulich
  2013-05-07  1:25                 ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-06  6:44 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 04.05.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 03 May 2013 07:33:50 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 03.05.13 at 02:40, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Thu, 02 May 2013 07:53:18 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> > 
>> >     if ( (tmp_ar & X86_SEG_AR_CS_LM_ACTIVE) && selector <
>> > x86_seg_fs  )
>> 
>> This is still wrong. As said before you need to look at the _CS_
>> access rights, not the ones of the selector register you read.
> 
> Hmm... unless I'm reading the SDM wrong, it says "for non-code segments
> bit 21 is reserved and should always be set to 0". But it's probably
> clearer to check for _CS_ only.

I'm afraid you're still not understanding what I'm trying to explain:
Whether base and limit are ignored (and default to 0/~0) depends
on whether the guest executes in 64-bit mode, and this you can
know only by looking at CS.L, no matter what selector register
you're reading.

Maybe part of the confusion stems from you mixing two things
here - reading of a descriptor from a descriptor table (which is
what read_descriptor() does, as that's all you can do for PV
guests) vs reading of the hidden portions of a selector register
(which is what hvm_get_segment_register() does, thanks to
VMX/SVM providing access).

>> But as also hinted at - do you really need the override at all?
> 
> Yes, because of the following check in insn_fetch macro:
> 
>      "(eip) > (limit) - (sizeof(_x) - 1)" in the if statment:
> 
>    if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) )   \
>        goto fail;                                                          \
> 
> Reading the VMCS would return the 32-bit limit of 0xffffffff.

That's unfortunate of course - then the override indeed is
unavoidable.

> BTW, same override
> exists in read_descriptor() (it seems to do the override for FS and
> GS also, which I don't understand).

See above - this function just can't access the hidden portion of
the selector registers, and hence doesn't even care what register
a particular selector might be in. The caller has to take care to
ignore base and limit when the guest is in 64-bit mode.

> Anyways, thanks to hvm_get_segment_register(), I got rid of the function 
> vmx_pvh_read_descriptor():
> 
> static int read_descriptor_sel(unsigned int sel,
>                                enum x86_segment which_sel,
>                                struct vcpu *v,
>                                const struct cpu_user_regs *regs,
>                                unsigned long *base,
>                                unsigned long *limit,
>                                unsigned int *ar,
>                                unsigned int vm86attr)
> {
>     if ( is_pvh_vcpu(v) )
>     {
>         struct segment_register seg;
> 
>         hvm_get_segment_register(v, which_sel, &seg);
>         *ar = (unsigned int)seg.attr.bytes;
> 
>         /* ar is returned packed as in segment_attributes_t. fix it up */
>         *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
>         *ar = *ar << 8;
> 
>         if ( (vm86attr & _SEGMENT_CODE) && (*ar & _SEGMENT_L) &&

So as per above this is still wrong.

>              (which_sel < x86_seg_fs) )
>         {
>             *base = 0UL;
>             *limit = ~0UL;
>         }
>         else
>         {
>             *base = (unsigned long)seg.base;
>             *limit = (unsigned long)seg.limit;
>         }

One thing I misguided you slightly is that you will need to
override the limit regardless of selector register; only the base
must not be forced to zero for FS and GS.

Jan

> 
>         return 1;
>     }
> 
>     return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
> 
> }

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-06  6:44               ` Jan Beulich
@ 2013-05-07  1:25                 ` Mukesh Rathor
  2013-05-07  8:07                   ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-07  1:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Mon, 06 May 2013 07:44:33 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 04.05.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 03 May 2013 07:33:50 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >> This is still wrong. As said before you need to look at the _CS_
> >> access rights, not the ones of the selector register you read.
> > 
> > Hmm... unless I'm reading the SDM wrong, it says "for non-code
> > segments bit 21 is reserved and should always be set to 0". But it's
> > probably clearer to check for _CS_ only.
> 
> I'm afraid you're still not understanding what I'm trying to explain:
> Whether base and limit are ignored (and default to 0/~0) depends
> on whether the guest executes in 64-bit mode, and this you can
> know only by looking at CS.L, no matter what selector register
> you're reading.
> 
> Maybe part of the confusion stems from you mixing two things
> here - reading of a descriptor from a descriptor table (which is
> what read_descriptor() does, as that's all you can do for PV
> guests) vs reading of the hidden portions of a selector register
> (which is what hvm_get_segment_register() does, thanks to
> VMX/SVM providing access).

read_descriptor() confuses me a lot. The way I understood it: it reads
the full desc from the LDT/GDT indexed by the selector, which is where the
hidden portions get loaded from, so it has the full info like we get from
the VMCS. It can look at the "Type" bits 8-11 in the upper half and figure
out if it's a code segment also. The following check in it adds to the
confusion:

        if ( !(vm86attr & _SEGMENT_CODE) )
            desc.b &= ~_SEGMENT_L;


Anyways, I think (and hope), I finally have it:

{
    struct segment_register seg;
    unsigned int long_mode = 0;

    if ( !is_pvh_vcpu(v) )
        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);

    hvm_get_segment_register(v, x86_seg_cs, &seg);
    long_mode = seg.attr.fields.l;

    if ( which_sel != x86_seg_cs )
        hvm_get_segment_register(v, which_sel, &seg);

    /* ar is returned packed as in segment_attributes_t. Fix it up */
    *ar = (unsigned int)seg.attr.bytes;
    *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
    *ar = *ar << 8;

    if ( long_mode )
    {
        *limit = ~0UL;

        if ( which_sel < x86_seg_fs )
        {
            *base = 0UL;
            return 1;
        }
    }
    else
        *limit = (unsigned long)seg.limit;

    *base = (unsigned long)seg.base;
    return 1;
}


Thanks for your time and help.
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-07  1:25                 ` Mukesh Rathor
@ 2013-05-07  8:07                   ` Jan Beulich
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Beulich @ 2013-05-07  8:07 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 07.05.13 at 03:25, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 06 May 2013 07:44:33 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 04.05.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Fri, 03 May 2013 07:33:50 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >> This is still wrong. As said before you need to look at the _CS_
>> >> access rights, not the ones of the selector register you read.
>> > 
>> > Hmm... unless I'm reading the SDM wrong, it says "for non-code
>> > segments bit 21 is reserved and should always be set to 0". But it's
>> > probably clearer to check for _CS_ only.
>> 
>> I'm afraid you're still not understanding what I'm trying to explain:
>> Whether base and limit are ignored (and default to 0/~0) depends
>> on whether the guest executes in 64-bit mode, and this you can
>> know only by looking at CS.L, no matter what selector register
>> you're reading.
>> 
>> Maybe part of the confusion stems from you mixing two things
>> here - reading of a descriptor from a descriptor table (which is
>> what read_descriptor() does, as that's all you can do for PV
>> guests) vs reading of the hidden portions of a selector register
>> (which is what hvm_get_segment_register() does, thanks to
>> VMX/SVM providing access).
> 
> read_descriptor() confuses me a lot. The way I understood it: it reads
> the full desc from the LDT/GDT indexed by the selector, which is where
> the hidden portions get loaded from, so it has the full info like we get
> from the VMCS. It can look at the "Type" bits 8-11 in the upper half and
> figure out if it's a code segment also. The following check in it adds
> to the confusion:
> 
>         if ( !(vm86attr & _SEGMENT_CODE) )
>             desc.b &= ~_SEGMENT_L;

Not sure what's confusing about this - the L bit is meaningful only
in code segments.

> Anyways, I think (and hope), I finally have it:

Yes, looks like so (leaving aside a couple of casts I don't know
why you think you need).

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-01 13:52       ` Jan Beulich
  2013-05-02  1:10         ` Mukesh Rathor
@ 2013-05-10  1:51         ` Mukesh Rathor
  2013-05-10  7:07           ` Jan Beulich
  1 sibling, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-10  1:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 01 May 2013 14:52:27 +0100
"Jan Beulich" <jbeulich@suse.com> wrote:

> >> > +static int vmxit_io_instr(struct cpu_user_regs *regs)
> >> > +{
> >> > +    int curr_lvl;
> >> > +    int requested = (regs->rflags >> 12) & 3;
> >> > +
> >> > +    read_vmcs_selectors(regs);
> >> > +    curr_lvl = regs->cs & 3;
> >> 
> >> Shouldn't you look at SS'es DPL instead?
> >
> >Ok. It looks like CPL is stored in both CS and SS, so either
> >should be ok. But I changed it to ss. 
> 
> Your response reads as if you're still looking at the low two bits of
> the selector, whereas me using DPL was intended to hint at you
> needing to look at the "hidden" portion of the register.

Hmm... sorry, I still don't understand why I need to use DPL here. Ref'ing
the SDM again: Vol 1, Basic Architecture, on IO says:

The following instructions can be executed only if the current privilege 
level (CPL) of the program or task currently executing is less than or 
equal to the IOPL: IN, INS, OUT, OUTS, CLI ..........

It says in Vol 3A, in the chapter on Protection, that CPL comes
from bits 0 and 1 of the CS seg register. Since the RPL reflects the CPL
when the program is executing, it seems the above code is correct. Moreover,
I don't understand how the desc priv level of the stack segment relates
to the IO instructions.

Here's how the PV check looks btw, in guest_io_okay():

    if ( !vm86_mode(regs) &&
        (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ? 1 : 3)) )

What am I missing?

thanks
mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-04-26  7:22       ` Jan Beulich
@ 2013-05-10  1:53         ` Mukesh Rathor
  2013-05-10  7:14           ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-10  1:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, 26 Apr 2013 08:22:08 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 26.04.13 at 03:18, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 24 Apr 2013 10:28:35 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> > 
> >> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > +    /* If the e820 ended under 4GB, we must map the remaining
> >> > space upto 4GB */
> >> > +    if ( end < GB(4) )
> >> > +    {
> >> > +        start_pfn = PFN_UP(end);
> >> > +        end_pfn = (GB(4)) >> PAGE_SHIFT;
> >> > +        nump = end_pfn - start_pfn;
> >> > +        rc = domctl_memory_mapping(d, start_pfn, start_pfn,
> >> > nump, 1);
> >> > +        BUG_ON(rc);
> >> > +    }
> >> 
> >> That's necessary, but not sufficient. Or did I overlook MMIO ranges
> >> getting added somewhere else for Dom0, when they sit above the
> >> highest E820 covered address?
> > 
> > construct_dom0() adds the entire range:
> > 
> >     /* DOM0 is permitted full I/O capabilities. */
> >     rc |= ioports_permit_access(dom0, 0, 0xFFFF);
> >     rc |= iomem_permit_access(dom0, 0UL, ~0UL);
> 
> Which does not create any mappings at all - these are just
> permissions being granted.

Right. I'm not sure where it's happening for dom0.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 15/17]  PVH xen: Miscellaneous changes
  2013-04-24  9:06   ` Jan Beulich
@ 2013-05-10  1:54     ` Mukesh Rathor
  2013-05-10  7:10       ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-10  1:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 10:06:22 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > --- a/xen/include/public/xen.h
> > +++ b/xen/include/public/xen.h
> > @@ -693,6 +693,8 @@ typedef struct shared_info shared_info_t;
> >   *      c. list of allocated page frames [mfn_list, nr_pages]
> >   *         (unless relocated due to XEN_ELFNOTE_INIT_P2M)
> >   *      d. start_info_t structure        [register ESI (x86)]
> > + *      d1. struct shared_info_t                [shared_info]
> > + *                   (above if auto translated guest only)
> >   *      e. bootstrap page tables         [pt_base and CR3 (x86)]
> >   *      f. bootstrap stack               [register ESP (x86)]
> >   *  4. Bootstrap elements are packed together, but each is
> > 4kB-aligned.
> 
> This adjustment should be done in the patch implementing this.

It happens in the tool stack for domU and in construct_dom0 for dom0. So
probably best to leave it in this patch? Or I could move it to the last
patch, which changes construct_dom0.

thanks
M

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-10  1:51         ` Mukesh Rathor
@ 2013-05-10  7:07           ` Jan Beulich
  2013-05-10 23:44             ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-10  7:07 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 10.05.13 at 03:51, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 01 May 2013 14:52:27 +0100
> "Jan Beulich" <jbeulich@suse.com> wrote:
> 
>> >> > +static int vmxit_io_instr(struct cpu_user_regs *regs)
>> >> > +{
>> >> > +    int curr_lvl;
>> >> > +    int requested = (regs->rflags >> 12) & 3;
>> >> > +
>> >> > +    read_vmcs_selectors(regs);
>> >> > +    curr_lvl = regs->cs & 3;
>> >> 
>> >> Shouldn't you look at SS'es DPL instead?
>> >
>> >Ok. It looks like CPL is stored in both CS and SS, so either
>> >should be ok. But I changed it to ss. 
>> 
>> Your response reads as if you're still looking at the low two bits of
>> the selector, whereas me using DPL was intended to hint at you
>> needing to look at the "hidden" portion of the register.
> 
> Hmm... sorry, I still don't understand why I need to use DPL here. Ref'ing
> the SDM again: Vol 1, Basic Architecture, on IO says:
> 
> The following instructions can be executed only if the current privilege 
> level (CPL) of the program or task currently executing is less than or 
> equal to the IOPL: IN, INS, OUT, OUTS, CLI ..........
> 
> It says in Vol 3A, in the chapter on Protection, that CPL comes
> from bits 0 and 1 of the CS seg register. Since the RPL reflects the CPL
> when the program is executing, it seems the above code is correct. Moreover,
> I don't understand how the desc priv level of the stack segment relates
> to the IO instructions.

This is of specific relevance when including real and VM86 modes in
the picture: The section "Guest Register State" says "The value of
the DPL field for SS is always equal to the logical processor’s current
privilege level (CPL)", with the respective footnote "In protected mode,
CPL is also associated with the RPL field in the CS selector. However,
the RPL fields are not meaningful in real-address mode or in virtual-
8086 mode".

While I didn't want to spend even more time finding the respective
sections in the documentation, I'm certain this is documented this
way also in areas not concerned with VMX (because I've known of this
rule for far longer than VMX has existed).

Also, if you look through the code, I'm sure you will find other places
where SS is being used in favor of CS (albeit in the PV cases obviously
we have to [and can safely] use RPL, as we can't see the hidden parts
of the registers, but there's also no real mode involved). get_cpl() in
the instruction emulator is a good example.
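
A minimal sketch of what I mean, relying on the VMX access-rights
layout (DPL in bits 5 and 6):

    int requested = (regs->rflags >> 12) & 3;               /* IOPL */
    int curr_lvl = (__vmread(GUEST_SS_AR_BYTES) >> 5) & 3;  /* CPL == SS.DPL */

    if ( curr_lvl > requested )
        /* CPL > IOPL: the access has to be denied or emulated. */
        hvm_inject_hw_exception(TRAP_gp_fault, 0);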

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 15/17]  PVH xen: Miscellaneous changes
  2013-05-10  1:54     ` Mukesh Rathor
@ 2013-05-10  7:10       ` Jan Beulich
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Beulich @ 2013-05-10  7:10 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 10.05.13 at 03:54, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Apr 2013 10:06:22 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > --- a/xen/include/public/xen.h
>> > +++ b/xen/include/public/xen.h
>> > @@ -693,6 +693,8 @@ typedef struct shared_info shared_info_t;
>> >   *      c. list of allocated page frames [mfn_list, nr_pages]
>> >   *         (unless relocated due to XEN_ELFNOTE_INIT_P2M)
>> >   *      d. start_info_t structure        [register ESI (x86)]
>> > + *      d1. struct shared_info_t                [shared_info]
>> > + *                   (above if auto translated guest only)
>> >   *      e. bootstrap page tables         [pt_base and CR3 (x86)]
>> >   *      f. bootstrap stack               [register ESP (x86)]
>> >   *  4. Bootstrap elements are packed together, but each is
>> > 4kB-aligned.
>> 
>> This adjustment should be done in the patch implementing this.
> 
> It happens in the tool stack for domU and in construct_dom0 for dom0. So
> probably best to leave it in this patch? Or I could move it to the last
> patch, which changes construct_dom0.

If the adjustment is done in two separate patches, then the
change above logically belongs into the first one (albeit for
ease of committing putting it in the Dom0 one would seem
preferable, even if that's the later one in the series).

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-05-10  1:53         ` Mukesh Rathor
@ 2013-05-10  7:14           ` Jan Beulich
  2013-05-15  1:18             ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-10  7:14 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 10.05.13 at 03:53, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 26 Apr 2013 08:22:08 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 26.04.13 at 03:18, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Wed, 24 Apr 2013 10:28:35 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> > 
>> >> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> >>> wrote:
>> >> > +    /* If the e820 ended under 4GB, we must map the remaining
>> >> > space upto 4GB */
>> >> > +    if ( end < GB(4) )
>> >> > +    {
>> >> > +        start_pfn = PFN_UP(end);
>> >> > +        end_pfn = (GB(4)) >> PAGE_SHIFT;
>> >> > +        nump = end_pfn - start_pfn;
>> >> > +        rc = domctl_memory_mapping(d, start_pfn, start_pfn,
>> >> > nump, 1);
>> >> > +        BUG_ON(rc);
>> >> > +    }
>> >> 
>> >> That's necessary, but not sufficient. Or did I overlook MMIO ranges
>> >> getting added somewhere else for Dom0, when they sit above the
>> >> highest E820 covered address?
>> > 
>> > construct_dom0() adds the entire range:
>> > 
>> >     /* DOM0 is permitted full I/O capabilities. */
>> >     rc |= ioports_permit_access(dom0, 0, 0xFFFF);
>> >     rc |= iomem_permit_access(dom0, 0UL, ~0UL);
>> 
>> Which does not create any mappings at all - these are just
>> permissions being granted.
> 
> Right. I'm not sure where it's happening for dom0.

So if you don't know where you do this, I have to guess you don't
do this at all. But you obviously need to. Your main problem is that
you likely don't want to waste memory on page tables to cover the
whole (up to 52 bit wide) address space, so I assume you will need
to add these tables on demand. Yet then again iirc IOMMU faults
aren't recoverable up to now, so perhaps you have no way around
setting them up in their entirety during boot, unless you want to
get into the business of interacting with the MMIO assignment being
done for PCI devices in the BIOS and Dom0.
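
Along the lines of your existing sub-4GB code, i.e. something like
(purely a sketch - the upper bound is a placeholder you'd have to pick):

    /* Map the MMIO space above the highest E820 covered address too. */
    start_pfn = PFN_UP(end);
    end_pfn = PFN_DOWN(mmio_ceiling);    /* hypothetical upper bound */
    nump = end_pfn - start_pfn;
    rc = domctl_memory_mapping(d, start_pfn, start_pfn, nump, 1);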

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-05-10  7:07           ` Jan Beulich
@ 2013-05-10 23:44             ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-10 23:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, 10 May 2013 08:07:59 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 10.05.13 at 03:51, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 01 May 2013 14:52:27 +0100
> > "Jan Beulich" <jbeulich@suse.com> wrote:
> > 
> >> >> > +static int vmxit_io_instr(struct cpu_user_regs *regs)
> >> >> > +{
> >> >> > +    int curr_lvl;
> >> >> > +    int requested = (regs->rflags >> 12) & 3;
> >> >> > +
> >> >> > +    read_vmcs_selectors(regs);
> >> >> > +    curr_lvl = regs->cs & 3;
> >> >> 
> >> >> Shouldn't you look at SS'es DPL instead?
> >> >
> >> >Ok. It looks like CPL is stored in both CS and SS, so either
> >> >should be ok. But I changed it to ss. 
> >> 
> >> Your response reads as if you're still looking at the low two bits
> >> of the selector, whereas me using DPL was intended to hint at you
> >> needing to look at the "hidden" portion of the register.
> > 
> > Hmm... sorry, I still don't understand why I need to use DPL here.
> > Ref'ing the SDM again: Vol 1, Basic Architecture, on IO says:
> > 
> > The following instructions can be executed only if the current
> > privilege level (CPL) of the program or task currently executing is
> > less than or equal to the IOPL: IN, INS, OUT, OUTS, CLI ..........
> > 
> > It says in Vol 3A, in the chapter on Protection, that CPL comes
> > from bits 0 and 1 of the CS seg register. Since the RPL reflects the
> > CPL when the program is executing, it seems the above code is
> > correct. Moreover, I don't understand how the desc priv level of the
> > stack segment relates to the IO instructions.
> 
> This is of specific relevance when including real and VM86 modes in
> the picture: The section "Guest Register State" says "The value of
> the DPL field for SS is always equal to the logical processor’s
> current privilege level (CPL)", with the respective footnote "In
> protected mode, CPL is also associated with the RPL field in the CS
> selector. However, the RPL fields are not meaningful in real-address
> mode or in virtual- 8086 mode".

A PVH guest is not expected to be in real/v86 mode, but I guess it's not
enforced. I'll change it to look at the SS DPL instead.

Mukesh

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c
  2013-04-24  8:47   ` Jan Beulich
                       ` (2 preceding siblings ...)
  2013-05-02  1:17     ` Mukesh Rathor
@ 2013-05-11  0:30     ` Mukesh Rathor
  3 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-11  0:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 09:47:55 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> 
> > +        case TRAP_no_device:
> > +            hvm_funcs.fpu_dirty_intercept();  /*
> > vmx_fpu_dirty_intercept */
> 
> It ought to be perfectly valid to avoid the indirect call here.

Well, that would entail making the function public and adding it to a
header, so I followed the example of other code doing the indirect call.
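
For comparison, the direct call would look roughly like this, assuming
the (currently file-scope) handler gains a declaration in a shared VMX
header; the placement is hypothetical:

    /* In a shared VMX header (hypothetical placement): */
    void vmx_fpu_dirty_intercept(void);

    /* In the PVH exit handler, replacing the hvm_funcs hop: */
    case TRAP_no_device:
        vmx_fpu_dirty_intercept();
        break;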

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH
  2013-04-24  9:15   ` Jan Beulich
@ 2013-05-14  1:16     ` Mukesh Rathor
  2013-05-14  6:56       ` Jan Beulich
  0 siblings, 1 reply; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-14  1:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, 24 Apr 2013 10:15:25 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> >  
> > -static int elf_load_image(void *dst, const void *src, uint64_t
> > filesz, uint64_t memsz) +extern void __init
> > early_pvh_copy_or_zero(unsigned long dest, char *src,
> > +                                          int len, unsigned long
> > v_start);
> 
> This needs to be put in a header included both here and at the
> producer side.
> 
> Also, if you need to pass v_start around just to pass it back to
> this function, you could as well store it in a static variable in
> domain_build.c, and leave all of these functions untouched.

Actually, elf_load_image() <-- elf_load_binary() needs to know if it's a
PVH domain, so I'd need to change it to pass is_pvh_domain anyway.

I could check for the idle domain in elf_load_binary() and assume PVH dom0
construction, since for other callers current should never be the idle
domain, but that seems pretty hacky...

So, I could either leave it as is with v_start being passed, or make
v_start static and pass is_pvh_domain flag. Please LMK.

thanks,
mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH
  2013-05-14  1:16     ` Mukesh Rathor
@ 2013-05-14  6:56       ` Jan Beulich
  2013-05-14 19:14         ` Mukesh Rathor
  0 siblings, 1 reply; 72+ messages in thread
From: Jan Beulich @ 2013-05-14  6:56 UTC (permalink / raw)
  To: Mukesh Rathor; +Cc: xen-devel

>>> On 14.05.13 at 03:16, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Apr 2013 10:15:25 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> >  
>> > -static int elf_load_image(void *dst, const void *src, uint64_t
>> > filesz, uint64_t memsz) +extern void __init
>> > early_pvh_copy_or_zero(unsigned long dest, char *src,
>> > +                                          int len, unsigned long
>> > v_start);
>> 
>> This needs to be put in a header included both here and at the
>> producer side.
>> 
>> Also, if you need to pass v_start around just to pass it back to
>> this function, you could as well store it in a static variable in
>> domain_build.c, and leave all of these functions untouched.
> 
> Actually, elf_load_image() <-- elf_load_binary() needs to know if it's a
> PVH domain, so I'd need to change it to pass is_pvh_domain anyway.

But the single place where it's being looked at for purposes other
than forwarding to the next function is a pretty odd hack anyway.

> I could check for the idle domain in elf_load_binary() and assume PVH dom0
> construction, since for other callers current should never be the idle
> domain, but that seems pretty hacky...

Using v_start being (non-)zero as a flag to tell pv from pvh isn't
much less of a hack.

> So, I could either leave it as is with v_start being passed, or make
> v_start static and pass is_pvh_domain flag. Please LMK.

In the end I wonder whether for all this special casing a cleaner
implementation can't be found.

Jan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH
  2013-05-14  6:56       ` Jan Beulich
@ 2013-05-14 19:14         ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-14 19:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Tue, 14 May 2013 07:56:47 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 14.05.13 at 03:16, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 24 Apr 2013 10:15:25 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> > 
> >> >>> On 23.04.13 at 23:26, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> >  
> >> > -static int elf_load_image(void *dst, const void *src, uint64_t
> >> > filesz, uint64_t memsz) +extern void __init
> >> > early_pvh_copy_or_zero(unsigned long dest, char *src,
> >> > +                                          int len, unsigned long
> >> > v_start);
> >> 
> >> This needs to be put in a header included both here and at the
> >> producer side.
> >> 
> >> Also, if you need to pass v_start around just to pass it back to
> >> this function, you could as well store it in a static variable in
> >> domain_build.c, and leave all of these functions untouched.
> > 
> > Actually, elf_load_image() <-- elf_load_binary() needs to know if
> > it's a PVH domain, so I'd need to change it to pass is_pvh_domain
> > anyway.
> 
> But the single place where it's being looked at for purposes other
> than forwarding to the next function is a pretty odd hack anyway.
> 
> > I could check for the idle domain in elf_load_binary() and assume PVH
> > dom0 construction, since for other callers current should never be
> > the idle domain, but that seems pretty hacky...
> 
> Using v_start being (non-)zero as a flag to tell pv from pvh isn't
> much less of a hack.
> 
> > So, I could either leave it as is with v_start being passed, or make
> > v_start static and pass is_pvh_domain flag. Please LMK.
> 
> In the end I wonder whether for all this special casing a cleaner
> implementation can't be found.

Ah, got it. I can just check for opt_dom0pvh in elf_load_image().
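
Roughly, assuming opt_dom0pvh stays set only while dom0 is being built,
that v_start becomes a static (pvh_v_start below) in domain_build.c as
suggested, and that a NULL src makes early_pvh_copy_or_zero() zero the
range - all assumptions here, not code from the series:

    static int elf_load_image(void *dst, const void *src,
                              uint64_t filesz, uint64_t memsz)
    {
        if ( opt_dom0pvh )
        {
            /* PVH dom0: the guest mappings aren't usable yet, so go
             * through the early copy/zero helper instead. */
            early_pvh_copy_or_zero((unsigned long)dst, (char *)src,
                                   filesz, pvh_v_start);
            early_pvh_copy_or_zero((unsigned long)dst + filesz, NULL,
                                   memsz - filesz, pvh_v_start);
            return 0;
        }
        memcpy(dst, src, filesz);
        memset(dst + filesz, 0, memsz - filesz);
        return 0;
    }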

Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/17] PVH xen: PVH dom0 creation....
  2013-05-10  7:14           ` Jan Beulich
@ 2013-05-15  1:18             ` Mukesh Rathor
  0 siblings, 0 replies; 72+ messages in thread
From: Mukesh Rathor @ 2013-05-15  1:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, 10 May 2013 08:14:55 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 10.05.13 at 03:53, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 26 Apr 2013 08:22:08 +0100
> >> >> > +    /* If the e820 ended under 4GB, we must map the remaining
> >> >> > space up to 4GB */
> >> >> > +    if ( end < GB(4) )
> >> >> > +    {
> >> >> > +        start_pfn = PFN_UP(end);
> >> >> > +        end_pfn = (GB(4)) >> PAGE_SHIFT;
> >> >> > +        nump = end_pfn - start_pfn;
> >> >> > +        rc = domctl_memory_mapping(d, start_pfn, start_pfn,
> >> >> > nump, 1);
> >> >> > +        BUG_ON(rc);
> >> >> > +    }
> >> >> 
> >> >> That's necessary, but not sufficient. Or did I overlook MMIO
> >> >> ranges getting added somewhere else for Dom0, when they sit
> >> >> above the highest E820 covered address?
> >> > 
> >> > construct_dom0() adds the entire range:
> >> > 
> >> >     /* DOM0 is permitted full I/O capabilities. */
> >> >     rc |= ioports_permit_access(dom0, 0, 0xFFFF);
> >> >     rc |= iomem_permit_access(dom0, 0UL, ~0UL);
> >> 
> >> Which does not create any mappings at all - these are just
> >> permissions being granted.
> > 
> > Right. I'm not sure where it's happening for dom0.
> 
> So if you don't know where you do this, I have to guess you don't
> do this at all. But you obviously need to. Your main problem is that
> you likely don't want to waste memory on page tables to cover the
> whole (up to 52 bit wide) address space, so I assume you will need
> to add these tables on demand. Yet then again iirc IOMMU faults

Hmm... well, I originally had it so that the tables were updated "on
demand", initiated by the guest, but then the suggestion was to make
that transparent to the guest. I don't really know what the best
solution is; let me investigate/think some more.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread

Thread overview: 72+ messages
2013-04-23 21:25 [PATCH 00/17][V4]: PVH xen: version 4 patches Mukesh Rathor
2013-04-23 21:25 ` [PATCH 01/17] PVH xen: turn gdb_frames/gdt_ents into union Mukesh Rathor
2013-04-23 21:25 ` [PATCH 02/17] PVH xen: add XENMEM_add_to_physmap_range Mukesh Rathor
2013-04-23 21:25 ` [PATCH 03/17] PVH xen: create domctl_memory_mapping() function Mukesh Rathor
2013-04-24  7:01   ` Jan Beulich
2013-04-23 21:25 ` [PATCH 04/17] PVH xen: add params to read_segment_register Mukesh Rathor
2013-04-23 21:25 ` [PATCH 05/17] PVH xen: vmx related preparatory changes for PVH Mukesh Rathor
2013-04-23 21:25 ` [PATCH 06/17] PVH xen: Introduce PVH guest type Mukesh Rathor
2013-04-24  7:07   ` Jan Beulich
2013-04-24 23:01     ` Mukesh Rathor
2013-04-25  8:28       ` Jan Beulich
2013-04-23 21:25 ` [PATCH 07/17] PVH xen: tools changes to create PVH domain Mukesh Rathor
2013-04-24  7:10   ` Jan Beulich
2013-04-24 23:02     ` Mukesh Rathor
2013-04-23 21:25 ` [PATCH 08/17] PVH xen: domain creation code changes Mukesh Rathor
2013-04-23 21:25 ` [PATCH 09/17] PVH xen: create PVH vmcs, and also initialization Mukesh Rathor
2013-04-24  7:42   ` Jan Beulich
2013-04-30 21:01     ` Mukesh Rathor
2013-04-30 21:04     ` Mukesh Rathor
2013-04-23 21:25 ` [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c Mukesh Rathor
2013-04-24  8:47   ` Jan Beulich
2013-04-25  0:57     ` Mukesh Rathor
2013-04-25  8:36       ` Jan Beulich
2013-04-26  1:16         ` Mukesh Rathor
2013-04-26  1:58           ` Mukesh Rathor
2013-04-26  7:29             ` Jan Beulich
2013-04-26  7:20           ` Jan Beulich
2013-04-27  2:06             ` Mukesh Rathor
2013-05-01  0:51     ` Mukesh Rathor
2013-05-01 13:52       ` Jan Beulich
2013-05-02  1:10         ` Mukesh Rathor
2013-05-02  6:42           ` Jan Beulich
2013-05-03  1:03             ` Mukesh Rathor
2013-05-10  1:51         ` Mukesh Rathor
2013-05-10  7:07           ` Jan Beulich
2013-05-10 23:44             ` Mukesh Rathor
2013-05-02  1:17     ` Mukesh Rathor
2013-05-02  6:53       ` Jan Beulich
2013-05-03  0:40         ` Mukesh Rathor
2013-05-03  6:33           ` Jan Beulich
2013-05-04  1:40             ` Mukesh Rathor
2013-05-06  6:44               ` Jan Beulich
2013-05-07  1:25                 ` Mukesh Rathor
2013-05-07  8:07                   ` Jan Beulich
2013-05-11  0:30     ` Mukesh Rathor
2013-04-25 11:19   ` Tim Deegan
2013-04-23 21:26 ` [PATCH 11/17] PVH xen: some misc changes like mtrr, intr, msi Mukesh Rathor
2013-04-23 21:26 ` [PATCH 12/17] PVH xen: support invalid op, return PVH features etc Mukesh Rathor
2013-04-24  9:01   ` Jan Beulich
2013-04-25  1:01     ` Mukesh Rathor
2013-04-23 21:26 ` [PATCH 13/17] PVH xen: p2m related changes Mukesh Rathor
2013-04-25 11:28   ` Tim Deegan
2013-04-25 21:59     ` Mukesh Rathor
2013-04-26  8:53       ` Tim Deegan
2013-04-23 21:26 ` [PATCH 14/17] PVH xen: Add and remove foreign pages Mukesh Rathor
2013-04-25 11:38   ` Tim Deegan
2013-04-23 21:26 ` [PATCH 15/17] PVH xen: Miscellaneous changes Mukesh Rathor
2013-04-24  9:06   ` Jan Beulich
2013-05-10  1:54     ` Mukesh Rathor
2013-05-10  7:10       ` Jan Beulich
2013-04-23 21:26 ` [PATCH 16/17] PVH xen: elf and iommu related changes to prep for dom0 PVH Mukesh Rathor
2013-04-24  9:15   ` Jan Beulich
2013-05-14  1:16     ` Mukesh Rathor
2013-05-14  6:56       ` Jan Beulich
2013-05-14 19:14         ` Mukesh Rathor
2013-04-23 21:26 ` [PATCH 17/17] PVH xen: PVH dom0 creation Mukesh Rathor
2013-04-24  9:28   ` Jan Beulich
2013-04-26  1:18     ` Mukesh Rathor
2013-04-26  7:22       ` Jan Beulich
2013-05-10  1:53         ` Mukesh Rathor
2013-05-10  7:14           ` Jan Beulich
2013-05-15  1:18             ` Mukesh Rathor
